diff --git "a/Data/Submissions.csv" "b/Data/Submissions.csv" new file mode 100644--- /dev/null +++ "b/Data/Submissions.csv" @@ -0,0 +1,6902 @@ +id,original,tcdate,tmdate,number,title,authorids,authors,keywords,abstract,pdf,conf_name,conf_year,tldr +GJwMHetHc73,e6n7a82zaoJ,1601310000000.0,1615220000000.0,1953,Unsupervised Object Keypoint Learning using Local Spatial Predictability,"[""~Anand_Gopalakrishnan1"", ""~Sjoerd_van_Steenkiste1"", ""~J\u00fcrgen_Schmidhuber1""]","[""Anand Gopalakrishnan"", ""Sjoerd van Steenkiste"", ""J\u00fcrgen Schmidhuber""]","[""unsupervised representation learning"", ""object-keypoint representations"", ""visual saliency""]","We propose PermaKey, a novel approach to representation learning based on object keypoints. It leverages the predictability of local image regions from spatial neighborhoods to identify salient regions that correspond to object parts, which are then converted to keypoints. Unlike prior approaches, it utilizes predictability to discover object keypoints, an intrinsic property of objects. This ensures that it does not overly bias keypoints to focus on characteristics that are not unique to objects, such as movement, shape, colour etc. We demonstrate the efficacy of PermaKey on Atari where it learns keypoints corresponding to the most salient object parts and is robust to certain visual distractors. Further, on downstream RL tasks in the Atari domain we demonstrate how agents equipped with our keypoints outperform those using competing alternatives, even on challenging environments with moving backgrounds or distractor objects. +",/pdf/9c46c3c76f9646f35cf6acdbca761d23e88bde59.pdf,ICLR,2021,"We propose PermaKey, a novel method for learning object keypoint representations that leverages local predictability as a measure of objectness." 
+xF5r3dVeaEl,Q91ONcJmOG,1601310000000.0,1614990000000.0,1121,Local Information Opponent Modelling Using Variational Autoencoders,"[""~Georgios_Papoudakis1"", ""f.christianos@ed.ac.uk"", ""~Stefano_V_Albrecht1""]","[""Georgios Papoudakis"", ""Filippos Christianos"", ""Stefano V Albrecht""]","[""multi-agent systems"", ""opponent modelling"", ""reinforcement learning""]","Modelling the behaviours of other agents (opponents) is essential for understanding how agents interact and making effective decisions. Existing methods for opponent modelling commonly assume knowledge of the local observations and chosen actions of the modelled opponents, which can significantly limit their applicability. We propose a new modelling technique based on variational autoencoders, which are trained to reconstruct the local actions and observations of the opponent based on embeddings which depend only on the local observations of the modelling agent (its observed world state, chosen actions, and received rewards). The embeddings are used to augment the modelling agent's decision policy which is trained via deep reinforcement learning; thus the policy does not require access to opponent observations. 
We provide a comprehensive evaluation and ablation study in diverse multi-agent tasks, showing that our method achieves comparable performance to an ideal baseline which has full access to opponent's information, and significantly higher returns than a baseline method which does not use the learned embeddings.",/pdf/77fcc564fd0bcb6d2eb5500a9f3e79a531e70c2f.pdf,ICLR,2021, +HJe7unNFDH,SyghbQieUB,1569440000000.0,1577170000000.0,35,Scaling Up Neural Architecture Search with Big Single-Stage Models,"[""jyu79@illinois.edu"", ""pengchong@google.com"", ""hanxiaol@google.com"", ""gbender@google.com"", ""pikinder@google.com"", ""tanmingxing@google.com"", ""t-huang1@illinois.edu"", ""xiaodansong@google.com"", ""qvl@google.com""]","[""Jiahui Yu"", ""Pengchong Jin"", ""Hanxiao Liu"", ""Gabriel Bender"", ""Pieter-Jan Kindermans"", ""Mingxing Tan"", ""Thomas Huang"", ""Xiaodan Song"", ""Quoc Le""]","[""Single-Stage Neural Architecture Search""]","Neural architecture search (NAS) methods have shown promising results discovering models that are both accurate and fast. For NAS, training a one-shot model has become a popular strategy to approximate the quality of multiple architectures (child models) using a single set of shared weights. To avoid performance degradation due to parameter sharing, most existing methods have a two-stage workflow where the best child model induced from the one-shot model has to be retrained or finetuned. In this work, we propose BigNAS, an approach that simplifies this workflow and scales up neural architecture search to target a wide range of model sizes simultaneously. We propose several techniques to bridge the gap between the distinct initialization and learning dynamics across small and big models with shared parameters, which enable us to train a single-stage model: a single model from which we can directly slice high-quality child models without retraining or finetuning. 
With BigNAS we are able to train a single set of shared weights on ImageNet and use these weights to obtain child models whose sizes range from 200 to 1000 MFLOPs. Our discovered model family, BigNASModels, achieve top-1 accuracies ranging from 76.5% to 80.9%, surpassing all state-of-the-art models in this range including EfficientNets.",/pdf/8c491aba97c87b4ed398baf786e99637483d3bef.pdf,ICLR,2020,"We scale up neural architecture search with big single-stage models, surpassing all state-of-the-art models from 200 to 1000 MFLOPs including EfficientNets." +Byl3HxBFwH,HylalFgYPB,1569440000000.0,1577170000000.0,2300,Efficient Deep Representation Learning by Adaptive Latent Space Sampling,"[""y.mo16@imperial.ac.uk"", ""shuo.wang@imperial.ac.uk"", ""c.dai@imperial.ac.uk"", ""rui.zhou18@imperial.ac.uk"", ""zt215@cam.ac.uk"", ""w.bai@imperial.ac.uk"", ""y.guo@imperial.ac.uk""]","[""Yuanhan Mo"", ""Shuo Wang"", ""Chengliang Dai"", ""Rui Zhou"", ""Zhongzhao Teng"", ""Wenjia Bai"", ""Yike Guo""]","[""Deep learning"", ""Data efficiency""]","Supervised deep learning requires a large amount of training samples with annotations (e.g. label class for classification task, pixel- or voxel-wised label map for segmentation tasks), which are expensive and time-consuming to obtain. During the training of a deep neural network, the annotated samples are fed into the network in a mini-batch way, where they are often regarded of equal importance. However, some of the samples may become less informative during training, as the magnitude of the gradient start to vanish for these samples. In the meantime, other samples of higher utility or hardness may be more demanded for the training process to proceed and require more exploitation. To address the challenges of expensive annotations and loss of sample informativeness, here we propose a novel training framework which adaptively selects informative samples that are fed to the training process. 
The adaptive selection or sampling is performed based on a hardness-aware strategy in the latent space constructed by a generative model. To evaluate the proposed training framework, we perform experiments on three different datasets, including MNIST and CIFAR-10 for image classification task and a medical image dataset IVUS for biophysical simulation task. On all three datasets, the proposed framework outperforms a random sampling method, which demonstrates the effectiveness of our framework.",/pdf/4d5a71b5795167c32d8ef27e810bfa7c4c5c2f2f.pdf,ICLR,2020,This paper introduces a framework for data-efficient representation learning by adaptive sampling in latent space. +BklxI0VtDB,SkeLJCLOPB,1569440000000.0,1577170000000.0,1126,ROS-HPL: Robotic Object Search with Hierarchical Policy Learning and Intrinsic-Extrinsic Modeling,"[""xinye1@asu.edu"", ""szheng31@asu.edu"", ""yz.yang@asu.edu""]","[""Xin Ye"", ""Shibin Zheng"", ""Yezhou Yang""]","[""Robotic Object Search"", ""Hierarchical Reinforcement Learning""]","Despite significant progress in Robotic Object Search (ROS) over the recent years with deep reinforcement learning based approaches, the sparsity issue in reward setting as well as the lack of interpretability of the previous ROS approaches leave much to be desired. We present a novel policy learning approach for ROS, based on a hierarchical and interpretable modeling with intrinsic/extrinsic reward setting, to tackle these two challenges. More specifically, we train the low-level policy by deliberating between an action that achieves an immediate sub-goal and the one that is better suited for achieving the final goal. We also introduce a new evaluation metric, namely the extrinsic reward, as a harmonic measure of the object search success rate and the average steps taken. 
Experiments conducted with multiple settings on the House3D environment validate and show that the intelligent agent, trained with our model, can achieve a better object search performance (higher success rate with lower average steps, measured by SPL: Success weighted by inverse Path Length). In addition, we conduct studies w.r.t. the parameter that controls the weighted overall reward from intrinsic and extrinsic components. The results suggest it is critical to devise a proper trade-off strategy to perform the object search well.",/pdf/bd8d4596a4a2f38785f5fc06ec736e56d713c26f.pdf,ICLR,2020,"In this paper, we present a novel two-layer hierarchical policy learning framework that builds on intrinsic and extrinsic rewards for the task of robotic object search." +B1edvs05Y7,rkeb1Av5F7,1538090000000.0,1545360000000.0,275,Sparse Binary Compression: Towards Distributed Deep Learning with minimal Communication,"[""felix.sattler@hhi.fraunhofer.de"", ""simon.wiedemann@hhi.fraunhofer.de"", ""klaus-robert.mueller@tu-berlin.de"", ""wojciech.samek@hhi.fraunhofer.de""]","[""Felix Sattler"", ""Simon Wiedemann"", ""Klaus-Robert M\u00fcller"", ""Wojciech Samek""]",[],"Currently, progressively larger deep neural networks are trained on ever growing data corpora. In result, distributed training schemes are becoming increasingly relevant. A major issue in distributed training is the limited communication bandwidth between contributing nodes or prohibitive communication cost in general. +%These challenges become even more pressing, as the number of computation nodes increases. +To mitigate this problem we propose Sparse Binary Compression (SBC), a compression framework that allows for a drastic reduction of communication cost for distributed training. SBC combines existing techniques of communication delay and gradient sparsification with a novel binarization method and optimal weight update encoding to push compression gains to new limits. 
By doing so, our method also allows us to smoothly trade-off gradient sparsity and temporal sparsity to adapt to the requirements of the learning task. +%We use tools from information theory to reason why SBC can achieve the striking compression rates observed in the experiments. +Our experiments show, that SBC can reduce the upstream communication on a variety of convolutional and recurrent neural network architectures by more than four orders of magnitude without significantly harming the convergence speed in terms of forward-backward passes. For instance, we can train ResNet50 on ImageNet in the same number of iterations to the baseline accuracy, using $\times 3531$ less bits or train it to a $1\%$ lower accuracy using $\times 37208$ less bits. In the latter case, the total upstream communication required is cut from 125 terabytes to 3.35 gigabytes for every participating client. Our method also achieves state-of-the-art compression rates in a Federated Learning setting with 400 clients.",/pdf/79a831ab8097889e3fd0194e2ca435da6c069550.pdf,ICLR,2019, +rkgMNnC9YQ,r1gOq7Rctm,1538090000000.0,1545360000000.0,1429,ATTENTIVE EXPLAINABILITY FOR PATIENT TEMPORAL EMBEDDING,"[""sowdaby@us.ibm.com"", ""mohamed.ghalwash@ibm.com"", ""zach.shahn@ibm.com"", ""deysa@us.ibm.com"", ""mzdraidia@berkeley.edu"", ""lilehman@mit.edu""]","[""Daby Sow"", ""Mohamed Ghalwash"", ""Zach Shahn"", ""Sanjoy Dey"", ""Moulay Draidia"", ""Li-wei Lehmann""]",[],"Learning explainable patient temporal embeddings from observational data has mostly ignored the use of RNN architecture that excel in capturing temporal data dependencies but at the expense of explainability. This paper addresses this problem by introducing and applying an information theoretic approach to estimate the degree of explainability of such architectures. 
Using a communication paradigm, we formalize metrics of explainability by estimating the amount of information that an AI model needs to convey to a human end user to explain and rationalize its outputs. A key aspect of this work is to model human prior knowledge at the receiving end and measure the lack of explainability as a deviation from human prior knowledge. We apply this paradigm to medical concept representation problems by regularizing loss functions of temporal autoencoders according to the derived explainability metrics to guide the learning process towards models producing explainable outputs. We illustrate the approach with convincing experimental results for the generation of explainable temporal embeddings for critical care patient data.",/pdf/658553e714b122ee7d40ab8047bea6f4423e8347.pdf,ICLR,2019, +Hygp1nR9FQ,H1gZrMa5tQ,1538090000000.0,1545360000000.0,1025,Unifying Bilateral Filtering and Adversarial Training for Robust Neural Networks,"[""ratzlafn@oregonstate.edu"", ""lif@oregonstate.edu""]","[""Neale Ratzlaff"", ""Li Fuxin""]","[""Adversarial examples"", ""Image denoising""]","Recent analysis of deep neural networks has revealed their vulnerability to carefully structured adversarial examples. Many effective algorithms exist to craft these adversarial examples, but performant defenses seem to be far away. In this work, we explore the use of edge-aware bilateral filtering as a projection back to the space of natural images. We show that bilateral filtering is an effective defense in multiple attack settings, where the strength of the adversary gradually increases. In the case of adversary who has no knowledge of the defense, bilateral filtering can remove more than 90% of adversarial examples from a variety of different attacks. 
To evaluate against an adversary with complete knowledge of our defense, we adapt the bilateral filter as a trainable layer in a neural network and show that adding this layer makes ImageNet images significantly more robust to attacks. When trained under a framework of adversarial training, we show that the resulting model is hard to fool with even the best attack methods. ",/pdf/4712397ffd054f9749841b0e4cfb86cb0507935f.pdf,ICLR,2019,We adapt bilateral filtering as a layer in a neural network which improves robustness to adversarial examples using nonlocal filtering. +rJEjjoR9K7,r1xAGY55Km,1538090000000.0,1556740000000.0,654,Learning Robust Representations by Projecting Superficial Statistics Out,"[""haohanw@cs.cmu.edu"", ""zexueh@mail.bnu.edu.cn"", ""zlipton@cmu.edu"", ""epxing@cs.cmu.edu""]","[""Haohan Wang"", ""Zexue He"", ""Zachary C. Lipton"", ""Eric P. Xing""]","[""domain generalization"", ""robustness""]","Despite impressive performance as evaluated on i.i.d. holdout data, deep neural networks depend heavily on superficial statistics of the training data and are liable to break under distribution shift. For example, subtle changes to the background or texture of an image can break a seemingly powerful classifier. Building on previous work on domain generalization, we hope to produce a classifier that will generalize to previously unseen domains, even when domain identifiers are not available during training. This setting is challenging because the model may extract many distribution-specific (superficial) signals together with distribution-agnostic (semantic) signals. To overcome this challenge, we incorporate the gray-level co-occurrence matrix (GLCM) to extract patterns that our prior knowledge suggests are superficial: they are sensitive to the texture but unable to capture the gestalt of an image. Then we introduce two techniques for improving our networks' out-of-sample performance. 
The first method is built on the reverse gradient method that pushes our model to learn representations from which the GLCM representation is not predictable. The second method is built on the independence introduced by projecting the model's representation onto the subspace orthogonal to GLCM representation's. +We test our method on the battery of standard domain generalization data sets and, interestingly, achieve comparable or better performance as compared to other domain generalization methods that explicitly require samples from the target distribution for training.",/pdf/cb0dba8c98d2a19b1b0d69c89c49c3e91b2233d4.pdf,ICLR,2019,"Building on previous work on domain generalization, we hope to produce a classifier that will generalize to previously unseen domains, even when domain identifiers are not available during training." +rJxBa1HFvS,B1l74VkFvH,1569440000000.0,1577170000000.0,1988,Value-Driven Hindsight Modelling,"[""aguez@google.com"", ""fviola@google.com"", ""theophane@google.com"", ""lbuesing@google.com"", ""skapturowski@google.com"", ""doinap@google.com"", ""davidsilver@google.com"", ""heess@google.com""]","[""Arthur Guez"", ""Fabio Viola"", ""Theophane Weber"", ""Lars Buesing"", ""Steven Kapturowski"", ""Doina Precup"", ""David Silver"", ""Nicolas Heess""]",[],"Value estimation is a critical component of the reinforcement learning (RL) paradigm. The question of how to effectively learn predictors for value from data is one of the major problems studied by the RL community, and different approaches exploit structure in the problem domain in different ways. Model learning can make use of the rich transition structure present in sequences of observations, but this approach is usually not sensitive to the reward function. In contrast, model-free methods directly leverage the quantity of interest from the future but have to compose with a potentially weak scalar signal (an estimate of the return). 
In this paper we develop an approach for representation learning in RL that sits in between these two extremes: we propose to learn what to model in a way that can directly help value prediction. To this end we determine which features of the future trajectory provide useful information to predict the associated return. This provides us with tractable prediction targets that are directly relevant for a task, and can thus accelerate learning of the value function. The idea can be understood as reasoning, in hindsight, about which aspects of the future observations could help past value prediction. We show how this can help dramatically even in simple policy evaluation settings. We then test our approach at scale in challenging domains, including on 57 Atari 2600 games.",/pdf/611d672a10c6d8763e3fac576cc22ae8f838f1cf.pdf,ICLR,2020, +SyYe6k-CW,SkKe6yW0W,1509130000000.0,1519450000000.0,524,Deep Bayesian Bandits Showdown: An Empirical Comparison of Bayesian Deep Networks for Thompson Sampling,"[""rikel@google.com"", ""gjt@google.com"", ""jsnoek@google.com""]","[""Carlos Riquelme"", ""George Tucker"", ""Jasper Snoek""]","[""exploration"", ""Thompson Sampling"", ""Bayesian neural networks"", ""bandits"", ""reinforcement learning"", ""variational inference"", ""Monte Carlo""]","Recent advances in deep reinforcement learning have made significant strides in performance on applications such as Go and Atari games. However, developing practical methods to balance exploration and exploitation in complex domains remains largely unsolved. Thompson Sampling and its extension to reinforcement learning provide an elegant approach to exploration that only requires access to posterior samples of the model. At the same time, advances in approximate Bayesian methods have made posterior approximation for flexible neural network models practical. Thus, it is attractive to consider approximate Bayesian neural networks in a Thompson Sampling framework. 
To understand the impact of using an approximate posterior on Thompson Sampling, we benchmark well-established and recently developed methods for approximate posterior sampling combined with Thompson Sampling over a series of contextual bandit problems. We found that many approaches that have been successful in the supervised learning setting underperformed in the sequential decision-making scenario. In particular, we highlight the challenge of adapting slowly converging uncertainty estimates to the online setting.",/pdf/51b030c71e6d3d9875304d2643bc76565c8f5f7c.pdf,ICLR,2018,An Empirical Comparison of Bayesian Deep Networks for Thompson Sampling +BJgYl205tQ,Skl34l7qYQ,1538090000000.0,1545360000000.0,1095,Quality Evaluation of GANs Using Cross Local Intrinsic Dimensionality,"[""sukarnab@student.unimelb.edu.au"", ""xingjun.ma@unimelb.edu.au"", ""sarah.erfani@unimelb.edu.au"", ""meh@nii.ac.jp"", ""baileyj@unimelb.edu.au""]","[""Sukarna Barua"", ""Xingjun Ma"", ""Sarah Monazam Erfani"", ""Michael Houle"", ""James Bailey""]","[""Generative Adversarial Networks"", ""Evaluation Metric"", ""Local Intrinsic Dimensionality""]","Generative Adversarial Networks (GANs) are an elegant mechanism for data generation. However, a key challenge when using GANs is how to best measure their ability to generate realistic data. In this paper, we demonstrate that an intrinsic dimensional characterization of the data space learned by a GAN model leads to an effective evaluation metric for GAN quality. In particular, we propose a new evaluation measure, CrossLID, that assesses the local intrinsic dimensionality (LID) of input data with respect to neighborhoods within GAN-generated samples. In experiments on 3 benchmark image datasets, we compare our proposed measure to several state-of-the-art evaluation metrics. 
Our experiments show that CrossLID is strongly correlated with sample quality, is sensitive to mode collapse, is robust to small-scale noise and image transformations, and can be applied in a model-free manner. Furthermore, we show how CrossLID can be used within the GAN training process to improve generation quality.",/pdf/53373bebac0add180413be1effa3f039968f8f0b.pdf,ICLR,2019,We propose a new metric for evaluating GAN models. +mEdwVCRJuX4,oNxzCsApfMm,1601310000000.0,1616020000000.0,926,Heteroskedastic and Imbalanced Deep Learning with Adaptive Regularization,"[""~Kaidi_Cao1"", ""~Yining_Chen1"", ""~Junwei_Lu1"", ""nikos.arechiga@tri.global"", ""~Adrien_Gaidon1"", ""~Tengyu_Ma1""]","[""Kaidi Cao"", ""Yining Chen"", ""Junwei Lu"", ""Nikos Arechiga"", ""Adrien Gaidon"", ""Tengyu Ma""]","[""deep learning"", ""noise robust learning"", ""imbalanced learning""]","Real-world large-scale datasets are heteroskedastic and imbalanced --- labels have varying levels of uncertainty and label distributions are long-tailed. Heteroskedasticity and imbalance challenge deep learning algorithms due to the difficulty of distinguishing among mislabeled, ambiguous, and rare examples. Addressing heteroskedasticity and imbalance simultaneously is under-explored. We propose a data-dependent regularization technique for heteroskedastic datasets that regularizes different regions of the input space differently. Inspired by the theoretical derivation of the optimal regularization strength in a one-dimensional nonparametric classification setting, our approach adaptively regularizes the data points in higher-uncertainty, lower-density regions more heavily. We test our method on several benchmark tasks, including a real-world heteroskedastic and imbalanced dataset, WebVision. Our experiments corroborate our theory and demonstrate a significant improvement over other methods in noise-robust deep learning. 
",/pdf/99d4ce0622fd824cd067efb8dda5611146eb4070.pdf,ICLR,2021,We propose a data-dependent regularization technique for learning heteroskedastic and imbalanced datasets. +BJexP6VKwH,S1lBj4qDwS,1569440000000.0,1577170000000.0,584,Generalized Domain Adaptation with Covariate and Label Shift CO-ALignment,"[""tanshh@mail2.sysu.edu.cn"", ""xpeng@bu.edu"", ""saenko@bu.edu""]","[""Shuhan Tan"", ""Xingchao Peng"", ""Kate Saenko""]","[""Domain Adaptation"", ""Label Shift"", ""Covariate Shift""]","Unsupervised knowledge transfer has a great potential to improve the generalizability of deep models to novel domains. Yet the current literature assumes that the label distribution is domain-invariant and only aligns the covariate or vice versa. In this paper, we explore the task of Generalized Domain Adaptation (GDA): How to transfer knowledge across different domains in the presence of both covariate and label shift? We propose a covariate and label distribution CO-ALignment (COAL) model to tackle this problem. Our model leverages prototype-based conditional alignment and label distribution estimation to diminish the covariate and label shifts, respectively. We demonstrate experimentally that when both types of shift exist in the data, COAL leads to state-of-the-art performance on several cross-domain benchmarks.",/pdf/49359924a9c87634a3286caa7daa593654180c0a.pdf,ICLR,2020,We propose a covariate and label distribution CO-ALignment (COAL) model to tackle Generalized Domain Adaptation (GDA) with covariant shift and label shift. 
+BkesGnCcFX,HkxAheGEYX,1538090000000.0,1545360000000.0,1298,Learning Goal-Conditioned Value Functions with one-step Path rewards rather than Goal-Rewards,"[""dhiman@umich.edu"", ""shurjo@umich.edu"", ""qobi@purdue.edu"", ""jjcorso@umich.edu""]","[""Vikas Dhiman"", ""Shurjo Banerjee"", ""Jeffrey M Siskind"", ""Jason J Corso""]","[""Floyd-Warshall"", ""Reinforcement learning"", ""goal conditioned value functions"", ""multi-goal""]","Multi-goal reinforcement learning (MGRL) addresses tasks where the desired goal state can change for every trial. State-of-the-art algorithms model these problems such that the reward formulation depends on the goals, to associate them with high reward. This dependence introduces additional goal reward resampling steps in algorithms like Hindsight Experience Replay (HER) that reuse trials in which the agent fails to reach the goal by recomputing rewards as if reached states were pseudo-desired goals. We propose a reformulation of goal-conditioned value functions for MGRL that yields a similar algorithm, while removing the dependence of reward functions on the goal. Our formulation thus obviates the requirement of reward-recomputation that is needed by HER and its extensions. We also extend a closely related algorithm, Floyd-Warshall Reinforcement Learning, from tabular domains to deep neural networks for use as a baseline. Our results are competitive with HER while substantially improving sampling efficiency in terms of reward computation. +",/pdf/d1d75927837e92ab56e67019bb8f4a00e1d3097e.pdf,ICLR,2019,Do Goal-Conditioned Value Functions need Goal-Rewards to Learn? 
+ee6W5UgQLa,UB4tvC88s-z,1601310000000.0,1615980000000.0,1545,"MultiModalQA: complex question answering over text, tables and images","[""~Alon_Talmor1"", ""oriy@mail.tau.ac.il"", ""amnoncatav@mail.tau.ac.il"", ""lahav@mail.tau.ac.il"", ""~Yizhong_Wang2"", ""~Akari_Asai2"", ""~Gabriel_Ilharco1"", ""~Hannaneh_Hajishirzi1"", ""~Jonathan_Berant1""]","[""Alon Talmor"", ""Ori Yoran"", ""Amnon Catav"", ""Dan Lahav"", ""Yizhong Wang"", ""Akari Asai"", ""Gabriel Ilharco"", ""Hannaneh Hajishirzi"", ""Jonathan Berant""]","[""NLP"", ""Question Answering"", ""Dataset"", ""Multi-Modal"", ""Multi-Hop""]","When answering complex questions, people can seamlessly combine information from visual, textual and tabular sources. +While interest in models that reason over multiple pieces of evidence has surged in recent years, there has been relatively little work on question answering models that reason across multiple modalities. +In this paper, we present MultiModalQA (MMQA): a challenging question answering dataset that requires joint reasoning over text, tables and images. +We create MMQA using a new framework for generating complex multi-modal questions at scale, harvesting tables from Wikipedia, and attaching images and text paragraphs using entities that appear in each table. We then define a formal language that allows us to take questions that can be answered from a single modality, and combine them to generate cross-modal questions. Last, crowdsourcing workers take these automatically generated questions and rephrase them into more fluent language. 
+We create 29,918 questions through this procedure, and empirically demonstrate the necessity of a multi-modal multi-hop approach to solve our task: our multi-hop model, ImplicitDecomp, achieves an average F1 of 51.7 over cross-modal questions, substantially outperforming a strong baseline that achieves 38.2 F1, but still lags significantly behind human performance, which is at 90.1 F1.",/pdf/f3dad930cb55abce99a229e35cc131a2db791b66.pdf,ICLR,2021,"MultiModalQA: A question answering dataset that requires multi-modal multi-hop reasoning over wikipedia text, tables and images, accompanied by a new multi-hop model for tackling the task." +uBHs6zpY4in,QXsEMTGs3lQ,1601310000000.0,1614990000000.0,2918,H-divergence: A Decision-Theoretic Probability Discrepancy Measure ,"[""~Shengjia_Zhao1"", ""~Abhishek_Sinha1"", ""~Yutong_He1"", ""~Aidan_Perreault1"", ""~Jiaming_Song1"", ""~Stefano_Ermon1""]","[""Shengjia Zhao"", ""Abhishek Sinha"", ""Yutong He"", ""Aidan Perreault"", ""Jiaming Song"", ""Stefano Ermon""]","[""probability divergence"", ""two sample test"", ""maximum mean discrepancy""]","Measuring the discrepancy between two probability distributions is a fundamental problem in machine learning and statistics. Based on ideas from decision theory, we investigate a new class of discrepancies that are based on the optimal decision loss. Two probability distributions are different if the optimal decision loss is higher on the mixture distribution than on each individual distribution. We show that this generalizes popular notions of discrepancy measurements such as the Jensen Shannon divergence and the maximum mean discrepancy. We apply our approach to two-sample tests, which evaluates whether two sets of samples come from the same distribution. On various benchmark and real datasets, we demonstrate that tests based on our generalized notion of discrepancy is able to achieve superior test power. 
We also apply our approach to sample quality evaluation as an alternative to the FID score, and to understanding the effects of climate change on different social and economic activities.",/pdf/29ba5a99028c5547427fa70d85282376463a9d0c.pdf,ICLR,2021, +rkg98yBFDr,HJx9ADpODr,1569440000000.0,1577170000000.0,1740,Reject Illegal Inputs: Scaling Generative Classifiers with Supervised Deep Infomax,"[""xwang@cs.hku.hk"", ""smyiu@cs.hku.hk""]","[""Xin WANG"", ""SiuMing Yiu""]","[""generative classifiers"", ""selective classification"", ""classification with rejection""]","Deep Infomax~(DIM) is an unsupervised representation learning framework by maximizing the mutual information between the inputs and the outputs of an encoder, while probabilistic constraints are imposed on the outputs. In this paper, we propose Supervised Deep InfoMax~(SDIM), which introduces supervised probabilistic constraints to the encoder outputs. The supervised probabilistic constraints are equivalent to a generative classifier on high-level data representations, where class conditional log-likelihoods of samples can be evaluated. Unlike other works building generative classifiers with conditional generative models, SDIMs scale on complex datasets, and can achieve comparable performance with discriminative counterparts. With SDIM, we could perform \emph{classification with rejection}. +Instead of always reporting a class label, SDIM only makes predictions when test samples' largest logits surpass some pre-chosen thresholds, otherwise they will be deemed as out of the data distributions, and be rejected. Our experiments show that SDIM with rejection policy can effectively reject illegal inputs including out-of-distribution samples and adversarial examples.",/pdf/331627e4f586c605ee4430c30e4fb7ee2a59ae66.pdf,ICLR,2020,"scale generative classifiers on complex datasets, and evaluate their effectiveness to reject illegal inputs including out-of-distribution samples and adversarial examples." 
+fycxGdpCCmW,mEogQpjAM0l,1601310000000.0,1614990000000.0,3002,Hybrid Discriminative-Generative Training via Contrastive Learning,"[""~Hao_Liu1"", ""~Pieter_Abbeel2""]","[""Hao Liu"", ""Pieter Abbeel""]","[""Hybrid Models"", ""Contrastive Learning"", ""Energy-Based Models"", ""Discriminative-Generative Models""]","Contrastive learning and supervised learning have both seen significant progress and success. However, thus far they have largely been treated as two separate objectives, brought together only by having a shared neural network. In this paper we show that through the perspective of hybrid discriminative-generative training of energy-based models we can make a direct connection between contrastive learning and supervised learning. Beyond presenting this unified view, we show our specific choice of approximation of the energy-based loss significantly improves energy-based models and contrastive learning based methods in confidence-calibration, out-of-distribution detection, adversarial robustness, generative modeling, and image classification tasks. In addition to significantly improved performance, our method also gets rid of SGLD training and does not suffer from training instability. Our evaluations also demonstrate that our method performs better than or on par with state-of-the-art hand-tailored methods in each task. ",/pdf/ea6c2a7627dcdc88d202cb109383fa4ae48ebf13.pdf,ICLR,2021,"We propose a hybrid discriminative-generative model based on contrastive loss and energy-based models, which significantly improves state-of-the-art energy-based models and contrastive learning methods in multiple tasks." 
+HJ4IhxZAb,S1QI2ebAW,1509130000000.0,1518730000000.0,627,Meta-Learning Transferable Active Learning Policies by Deep Reinforcement Learning,"[""k.pang@ed.ac.uk"", ""mingzhi.dong.13@ucl.ac.uk"", ""t.hospedales@ed.ac.uk""]","[""Kunkun Pang"", ""Mingzhi Dong"", ""Timothy Hospedales""]","[""Active Learning"", ""Deep Reinforcement Learning""]","Active learning (AL) aims to enable training high performance classifiers with low annotation cost by predicting which subset of unlabelled instances would be most beneficial to label. The importance of AL has motivated extensive research, proposing a wide variety of manually designed AL algorithms with diverse theoretical and intuitive motivations. In contrast to this body of research, we propose to treat active learning algorithm design as a meta-learning problem and learn the best criterion from data. We model an active learning algorithm as a deep neural network that inputs the base learner state and the unlabelled point set and predicts the best point to annotate next. Training this active query policy network with reinforcement learning, produces the best non-myopic policy for a given dataset. The key challenge in achieving a general solution to AL then becomes that of learner generalisation, particularly across heterogeneous datasets. We propose a multi-task dataset-embedding approach that allows dataset-agnostic active learners to be trained. 
Our evaluation shows that AL algorithms trained in this way can directly generalize across diverse problems.",/pdf/667b2dc6585b9e6f09bfc409b5558469556484ba.pdf,ICLR,2018, +oeHTRAehiFF,jiQjzn6npKV,1601310000000.0,1614990000000.0,3350,ChemistryQA: A Complex Question Answering Dataset from Chemistry,"[""~Zhuoyu_Wei1"", ""jiwe@microsoft.com"", ""xiubo.geng@microsoft.com"", ""yining.chen@microsoft.com"", ""baihua.chen@microsoft.com"", ""~Tao_Qin1"", ""djiang@microsoft.com""]","[""Zhuoyu Wei"", ""Wei Ji"", ""Xiubo Geng"", ""Yining Chen"", ""Baihua Chen"", ""Tao Qin"", ""Daxin Jiang""]",[],"Many Question Answering (QA) tasks have been studied in NLP and employed to evaluate the progress of machine intelligence. One kind of QA tasks, such as Machine Reading Comprehension QA, is well solved by end-to-end neural networks; another kind of QA tasks, such as Knowledge Base QA, needs to be translated to a formatted representations and then solved by a well-designed solver. We notice that some real-world QA tasks are more complex, which cannot be solved by end-to-end neural networks or translated to any kind of formal representations. To further stimulate the research of QA and development of QA techniques, in this work, we create a new and complex QA dataset, ChemistryQA, based on real-world chemical calculation questions. To answer chemical questions, machines need to understand questions, apply chemistry and Math knowledge, and do calculation and reasoning. To help researchers ramp up, we build two baselines: the first one is BERT-based sequence to sequence model, and the second one is an extraction system plus a graph search based solver. These two methods achieved 0.164 and 0.169 accuracy on the development set, respectively, which clearly demonstrate that new techniques are needed for complex QA tasks. 
ChemistryQA dataset will be available for public download once the paper is published.",/pdf/532ee6316c613095dea67afd41077305d28539bc.pdf,ICLR,2021, +hLElJeJKxzY,ETKe2h0KGEa,1601310000000.0,1614990000000.0,2859,Deep Q Learning from Dynamic Demonstration with Behavioral Cloning,"[""~Xiaoshuang_Li1"", ""junchen@kth.se"", ""x.wang@ia.ac.cn"", ""~Fei-Yue_Wang2""]","[""Xiaoshuang Li"", ""Junchen Jin"", ""Xiao Wang"", ""Fei-Yue Wang""]",[]," Although Deep Reinforcement Learning (DRL) has proven its capability to learn optimal policies by directly interacting with simulation environments, how to combine DRL with supervised learning and leverage additional knowledge to assist the DRL agent effectively still remains difficult. This study proposes a novel approach integrating deep Q learning from dynamic demonstrations with a behavioral cloning model (DQfDD-BC), which includes a supervised learning technique of instructing a DRL model to enhance its performance. Specifically, the DQfDD-BC model leverages historical demonstrations to pre-train a supervised BC model and consistently update it by learning the dynamically updated demonstrations. Then the DQfDD-BC model manages the sample complexity by exploiting both the historical and generated demonstrations. An expert loss function is designed to compare actions generated by the DRL model with those obtained from the BC model to provide advantageous guidance for policy improvements. Experimental results in several OpenAI Gym environments show that the proposed approach adapts to different performance levels of demonstrations, and meanwhile, accelerates the learning processes. As illustrated in an ablation study, the dynamic demonstration and expert loss mechanisms with the utilization of a BC model contribute to improving the learning convergence performance compared with the origin DQfD model. 
",/pdf/1f61489d65f2e690f6362b0e31d5c6b732b5225d.pdf,ICLR,2021, +rke4HiAcY7,rJxVUUUJtQ,1538090000000.0,1549650000000.0,77,Caveats for information bottleneck in deterministic scenarios,"[""artemyk@gmail.com"", ""tracey.brendan@gmail.com"", ""steven.jvk@gmail.com""]","[""Artemy Kolchinsky"", ""Brendan D. Tracey"", ""Steven Van Kuyk""]","[""information bottleneck"", ""supervised learning"", ""deep learning"", ""information theory""]","Information bottleneck (IB) is a method for extracting information from one random variable X that is relevant for predicting another random variable Y. To do so, IB identifies an intermediate ""bottleneck"" variable T that has low mutual information I(X;T) and high mutual information I(Y;T). The ""IB curve"" characterizes the set of bottleneck variables that achieve maximal I(Y;T) for a given I(X;T), and is typically explored by maximizing the ""IB Lagrangian"", I(Y;T) - βI(X;T). In some cases, Y is a deterministic function of X, including many classification problems in supervised learning where the output class Y is a deterministic function of the input X. We demonstrate three caveats when using IB in any situation where Y is a deterministic function of X: (1) the IB curve cannot be recovered by maximizing the IB Lagrangian for different values of β; (2) there are ""uninteresting"" trivial solutions at all points of the IB curve; and (3) for multi-layer classifiers that achieve low prediction error, different layers cannot exhibit a strict trade-off between compression and prediction, contrary to a recent proposal. We also show that when Y is a small perturbation away from being a deterministic function of X, these three caveats arise in an approximate way. To address problem (1), we propose a functional that, unlike the IB Lagrangian, can recover the IB curve in all cases. 
We demonstrate the three caveats on the MNIST dataset.",/pdf/47bd1858166dca7085d7f7c4021a1d3b5bead7f1.pdf,ICLR,2019,Information bottleneck behaves in surprising ways whenever the output is a deterministic function of the input. +OPyWRrcjVQw,lRK1bblNAuO,1601310000000.0,1614010000000.0,2261,Shapley explainability on the data manifold,"[""~Christopher_Frye1"", ""damiendemijolla@gmail.com"", ""~Tom_Begley1"", ""laurence.c@faculty.ai"", ""t-mestan@microsoft.com"", ""~Ilya_Feige1""]","[""Christopher Frye"", ""Damien de Mijolla"", ""Tom Begley"", ""Laurence Cowton"", ""Megan Stanley"", ""Ilya Feige""]",[],"Explainability in AI is crucial for model development, compliance with regulation, and providing operational nuance to predictions. The Shapley framework for explainability attributes a model’s predictions to its input features in a mathematically principled and model-agnostic way. However, general implementations of Shapley explainability make an untenable assumption: that the model’s features are uncorrelated. In this work, we demonstrate unambiguous drawbacks of this assumption and develop two solutions to Shapley explainability that respect the data manifold. One solution, based on generative modelling, provides flexible access to data imputations; the other directly learns the Shapley value-function, providing performance and stability at the cost of flexibility. While “off-manifold” Shapley values can (i) give rise to incorrect explanations, (ii) hide implicit model dependence on sensitive attributes, and (iii) lead to unintelligible explanations in higher-dimensional data, on-manifold explainability overcomes these problems. +",/pdf/ed871c78bdc2768918e12775dd57dff6b36e4c24.pdf,ICLR,2021,"We present drawbacks of model explanations that do not respect the data manifold, and introduce two methods for on-manifold explainability." 
+HyxjNyrtPr,rJxnN03uvB,1569440000000.0,1583910000000.0,1666,RGBD-GAN: Unsupervised 3D Representation Learning From Natural Image Datasets via RGBD Image Synthesis,"[""noguchi@mi.t.u-tokyo.ac.jp"", ""harada@mi.t.u-tokyo.ac.jp""]","[""Atsuhiro Noguchi"", ""Tatsuya Harada""]","[""image generation"", ""3D vision"", ""unsupervised representation learning""]","Understanding three-dimensional (3D) geometries from two-dimensional (2D) images without any labeled information is promising for understanding the real world without incurring annotation cost. We herein propose a novel generative model, RGBD-GAN, which achieves unsupervised 3D representation learning from 2D images. The proposed method enables camera parameter--conditional image generation and depth image generation without any 3D annotations, such as camera poses or depth. We use an explicit 3D consistency loss for two RGBD images generated from different camera parameters, in addition to the ordinal GAN objective. The loss is simple yet effective for any type of image generator such as DCGAN and StyleGAN to be conditioned on camera parameters. Through experiments, we demonstrated that the proposed method could learn 3D representations from 2D images with various generator architectures.",/pdf/4b6c7ceb3bbdbd74afd4be3030689fa92e35bd9e.pdf,ICLR,2020,RGBD image generation for unsupervised camera parameter conditioning +rkl3m1BFDB,BylHTu3dwr,1569440000000.0,1583910000000.0,1632,Exploratory Not Explanatory: Counterfactual Analysis of Saliency Maps for Deep Reinforcement Learning,"[""aatrey@cs.umass.edu"", ""kclary@cs.umass.edu"", ""jensen@cs.umass.edu""]","[""Akanksha Atrey"", ""Kaleigh Clary"", ""David Jensen""]","[""explainability"", ""saliency maps"", ""representations"", ""deep reinforcement learning""]","Saliency maps are frequently used to support explanations of the behavior of deep reinforcement learning (RL) agents. 
However, a review of how saliency maps are used in practice indicates that the derived explanations are often unfalsifiable and can be highly subjective. We introduce an empirical approach grounded in counterfactual reasoning to test the hypotheses generated from saliency maps and assess the degree to which they correspond to the semantics of RL environments. We use Atari games, a common benchmark for deep RL, to evaluate three types of saliency maps. Our results show the extent to which existing claims about Atari games can be evaluated and suggest that saliency maps are best viewed as an exploratory tool rather than an explanatory tool.",/pdf/d507903358256c3f4f6f7aab41b3e6e68b6e2578.pdf,ICLR,2020,Proposing a new counterfactual-based methodology to evaluate the hypotheses generated from saliency maps about deep RL agent behavior. +s9788-pPB2,C2-Oe0mX9Gn,1601310000000.0,1614990000000.0,2102,LLBoost: Last Layer Perturbation to Boost Pre-trained Neural Networks,"[""~Adityanarayanan_Radhakrishnan1"", ""nehap@mit.edu"", ""~Caroline_Uhler1""]","[""Adityanarayanan Radhakrishnan"", ""Neha Prasad"", ""Caroline Uhler""]","[""Pre-trained Neural Networks"", ""Over-parameterization"", ""Perturbations""]"," While deep networks have produced state-of-the-art results in several domains from image classification to machine translation, hyper-parameter selection remains a significant computational bottleneck. In order to produce the best possible model, practitioners often search across random seeds or use ensemble methods. As models get larger, any method to improve neural network performance that involves re-training becomes intractable. For example, computing the training accuracy of FixResNext-101 (829 million parameters) on ImageNet takes roughly 1~day when using 1~GPU. 
+ In this work, we present LLBoost, a theoretically-grounded, computationally-efficient method to boost the validation accuracy of pre-trained over-parameterized models without impacting the original training accuracy. LLBoost adjusts the last layer of a neural network by adding a term that is orthogonal to the training feature matrix, which is constructed by applying all layers but the last to the training data. We provide an efficient implementation of LLBoost on the GPU and demonstrate that LLBoost, run using only 1 GPU, improves the test/validation accuracy of pre-trained models on CIFAR10, ImageNet32, and ImageNet. In the over-parameterized linear regression setting, we prove that LLBoost reduces the generalization error of any interpolating solution with high probability without affecting training error. ",/pdf/23506b6fe3bd4cf0f89943cd362234f5c23aae27.pdf,ICLR,2021,"We present LLBoost, a theoretically-grounded, computationally-efficient method to boost the test accuracy of pre-trained neural networks without impacting the original training accuracy." +LFjnKhTNNQD,yB0CPVjKJ3-,1601310000000.0,1614990000000.0,2087,Prepare for the Worst: Generalizing across Domain Shifts with Adversarial Batch Normalization,"[""~Manli_Shu1"", ""~Zuxuan_Wu1"", ""~Micah_Goldblum1"", ""~Tom_Goldstein1""]","[""Manli Shu"", ""Zuxuan Wu"", ""Micah Goldblum"", ""Tom Goldstein""]","[""adversarial training"", ""distributional shifts""]","Adversarial training is the industry standard for producing models that are robust to small adversarial perturbations. However, machine learning practitioners need models that are robust to other kinds of changes that occur naturally, such as changes in the style or illumination of input images. Such changes in input distribution have been effectively modeled as shifts in the mean and variance of deep image features. 
We adapt adversarial training by adversarially perturbing these feature statistics, rather than image pixels, to produce models that are robust to distributional shifts. We also visualize images from adversarially crafted distributions. Our method, Adversarial Batch Normalization (AdvBN), significantly improves the performance of ResNet-50 on ImageNet-C (+8.1%), Stylized-ImageNet (+6.7%), and ImageNet-Instagram (+3.9%) over standard training practices. In addition, we demonstrate that AdvBN can also improve generalization on semantic segmentation.",/pdf/b8131b0f600444e3e9c4a2df7b2c00abdc97c641.pdf,ICLR,2021,"This work proposes a feature space adversarial training method based on Batchnorm statistics, to attain generalization to distributional shifted data." +HkgxheBFDS,rJxvleWYPB,1569440000000.0,1577170000000.0,2524,Undersensitivity in Neural Reading Comprehension,"[""johannes.welbl.14@ucl.ac.uk"", ""p.minervini@gmail.com"", ""maxbartolo@gmail.com"", ""pontus.stenetorp@gmail.com"", ""s.riedel@ucl.ac.uk""]","[""Johannes Welbl"", ""Pasquale Minervini"", ""Max Bartolo"", ""Pontus Stenetorp"", ""Sebastian Riedel""]","[""reading comprehension"", ""undersensitivity"", ""adversarial questions"", ""adversarial training"", ""robustness"", ""biased data setting""]","Neural reading comprehension models have recently achieved impressive generalisation results, yet still perform poorly when given adversarially selected input. Most prior work has studied semantically invariant text perturbations which cause a model’s prediction to change when it should not. In this work we focus on the complementary problem: excessive prediction undersensitivity where input text is meaningfully changed, and the model’s prediction does not change when it should. We formulate a noisy adversarial attack which searches among semantic variations of comprehension questions for which a model still erroneously produces the same answer as the original question – and with an even higher probability. 
We show that – despite comprising unanswerable questions – SQuAD2.0 and NewsQA models are vulnerable to this attack and commit a substantial fraction of errors on adversarially generated questions. This indicates that current models – even where they can correctly predict the answer – rely on spurious surface patterns and are not necessarily aware of all information provided in a given comprehension question. Developing this further, we experiment with both data augmentation and adversarial training as defence strategies: both are able to substantially decrease a model’s vulnerability to undersensitivity attacks on held out evaluation data. Finally, we demonstrate that adversarially robust models generalise better in a biased data setting with a train/evaluation distribution mismatch; they are less prone to overly rely on predictive cues only present in the training set and outperform a conventional model in the biased data setting by up to 11% F1.",/pdf/c75f2876e3666f7dde1c19feb80900f94998277b.pdf,ICLR,2020,"We demonstrate vulnerability to undersensitivity attacks in SQuAD2.0 and NewsQA neural reading comprehension models, where the model predicts the same answer with increased confidence to adversarially chosen questions, and compare defence strategies." +bVzUDC_4ls,dduX2ilsqUU,1601310000000.0,1614990000000.0,2625,Exploiting Verified Neural Networks via Floating Point Numerical Error,"[""~Kai_Jia2"", ""~Martin_Rinard1""]","[""Kai Jia"", ""Martin Rinard""]",[],"Motivated by the need to reliably characterize the robustness of deep neural networks, researchers have developed verification algorithms for deep neural networks. Given a neural network, the verifiers aim to answer whether certain properties are guaranteed with respect to all inputs in a space. However, little attention has been paid to floating point numerical error in neural network verification. 
+ +We exploit floating point errors in the inference and verification implementations to construct adversarial examples for neural networks that a verifier claims to be robust with respect to certain inputs. We argue that, to produce sound verification results, any verification system must accurately (or conservatively) model the effects of any floating point computations in the network inference or verification system.",/pdf/c893b11e98f08d9a0aef4bae6ca4070709ea60e6.pdf,ICLR,2021,We show that floating point error in neural network verifiers and neural network inference implementations can be systematically exploited to invalidate robustness claims. +BylXi3NKvS,S1lLgkTgvS,1569440000000.0,1577170000000.0,145,FALCON: Fast and Lightweight Convolution for Compressing and Accelerating CNN,"[""chunquan_cs@outlook.com"", ""elnino4@snu.ac.kr"", ""hyundonglee1015@gmail.com"", ""ukang@snu.ac.kr""]","[""Chun Quan"", ""Jun-Gi Jang"", ""Hyun Dong Lee"", ""U Kang""]","[""CNN compression"", ""CNN acceleration"", ""model compression""]","How can we efficiently compress Convolutional Neural Networks (CNN) while retaining their accuracy on classification tasks? A promising direction is based on depthwise separable convolution which replaces a standard convolution with a depthwise convolution and a pointwise convolution. However, previous works based on depthwise separable convolution are limited since 1) they are mostly heuristic approaches without a precise understanding of their relations to standard convolution, and 2) their accuracies do not match that of the standard convolution. + +In this paper, we propose FALCON, an accurate and lightweight method for compressing CNN. FALCON is derived by interpreting existing convolution methods based on depthwise separable convolution using EHP, our proposed mathematical formulation to approximate the standard convolution kernel. 
Such interpretation leads to developing a generalized version rank-k FALCON which further improves the accuracy while sacrificing a bit of compression and computation reduction rates. In addition, we propose FALCON-branch by fitting FALCON into the previous state-of-the-art convolution unit ShuffleUnitV2 which gives even better accuracy. Experiments show that FALCON and FALCON-branch outperform 1) existing methods based on depthwise separable convolution and 2) standard CNN models by up to 8x compression and 8x computation reduction while ensuring similar accuracy. We also demonstrate that rank-k FALCON provides even better accuracy than standard convolution in many cases, while using a smaller number of parameters and floating-point operations.",/pdf/b4bb50b246b917eaf962ca0bbbc351639641bce0.pdf,ICLR,2020,FALCON is an accurate and lightweight method for compressing CNNs based on depthwise separable convolution. +8QAXsAOSBjE,hP2Gz55erLaK,1601310000000.0,1614990000000.0,466,Reusing Preprocessing Data as Auxiliary Supervision in Conversational Analysis,"[""~Joshua_Yee_Kim1"", ""kalina.yacef@sydney.edu.au""]","[""Joshua Yee Kim"", ""Kalina Yacef""]","[""Multitask Learning"", ""Multimodal Conversational Analysis""]","Conversational analysis systems are trained using noisy human labels and often require heavy preprocessing during multi-modal feature extraction. Using noisy labels in single-task learning increases the risk of over-fitting. However, auxiliary tasks could improve the performance of the primary task learning. This approach is known as Primary Multi-Task Learning (MTL). A challenge of MTL is the selection of beneficial auxiliary tasks that avoid negative transfer. In this paper, we explore how the preprocessed data used for feature engineering can be re-used as auxiliary tasks in Primary MTL, thereby promoting the productive use of data in the form of auxiliary supervision learning. 
Our main contributions are: (1) the identification of sixteen beneficial auxiliary tasks, (2) the method of distributing learning capacity between the primary and auxiliary tasks, and (3) the relative supervision hierarchy between the primary and auxiliary tasks. Extensive experiments on IEMOCAP and SEMAINE data validate the improvements over single-task approaches, and suggest that it may generalize across multiple primary tasks.",/pdf/da2a1adc3badad1b7601fdfc2d1972c6b1843a75.pdf,ICLR,2021,"For multimodal conversational analysis, we have identified what are the beneficial auxiliary tasks, how to construct them through reusing preprocessing data, and the model architecture design to improve the primary tasks performances." +HK_B2K0026,g8U641FxlwW,1601310000000.0,1614990000000.0,3474,Attention Based Joint Learning for Supervised Electrocardiogram Arrhythmia Differentiation with Unsupervised Abnormal Beat Segmentation,"[""~Xinrong_Hu1"", ""wl960201@163.com"", ""635386607@qq.com"", ""41421891@qq.com"", ""~Jian_Zhuang1"", ""~Yiyu_Shi1""]","[""Xinrong Hu"", ""long wen"", ""shushui wang"", ""Dongpo Liang"", ""Jian Zhuang"", ""Yiyu Shi""]","[""interpretability"", ""multitask learning"", ""attention mechanism"", ""electrocardiography""]","Deep learning has shown great promise in arrhythmia classification in electrocardiogram (ECG). Existing works, when classifying an ECG segment with multiple beats, do not identify the locations of the anomalies, which reduces clinical interpretability. On the other hand, segmenting abnormal beats by deep learning usually requires annotation for a large number of regular and irregular beats, which can be laborious, sometimes even challenging, with strong inter-observer variability between experts. In this work, we propose a method capable of not only differentiating arrhythmia but also segmenting the associated abnormal beats in the ECG segment. 
The only annotation used in the training is the type of abnormal beats and no segmentation labels are needed. Imitating human’s perception of an ECG signal, the framework consists of a segmenter and classifier. The segmenter outputs an attention map, which aims to highlight the abnormal sections in the ECG by element-wise modulation. Afterwards, the signals are sent to a classifier for arrhythmia differentiation. Though the training data is only labeled to supervise the classifier, the segmenter and the classifier are trained in an end-to-end manner so that optimizing classification performance also adjusts how the abnormal beats are segmented. Validation of our method is conducted on two datasets. We observe that involving the unsupervised segmentation in fact boosts the classification performance. Meanwhile, a grade study performed by experts suggests that the segmenter also achieves satisfactory quality in identifying abnormal beats, which significantly enhances the interpretability of the classification results.",/pdf/609e06c71d4acfd8f207a05915c3578cd58404b2.pdf,ICLR,2021,"This paper presents a joint learning framework for supervised arrhythmia differentiation with unsupervised abnormal heart beat segmentation on ECG, where the two tasks can benefit from each other. " +Hkx-ii05FQ,Sklwp0LYFQ,1538090000000.0,1545360000000.0,596,The Cakewalk Method,"[""uri.patish@gmail.com"", ""shimon.ullman@gmail.com""]","[""Uri Patish"", ""Shimon Ullman""]","[""policy gradient"", ""combinatorial optimization"", ""blackbox optimization"", ""stochastic optimization"", ""reinforcement learning""]","Combinatorial optimization is a common theme in computer science. While in general such problems are NP-Hard, from a practical point of view, locally optimal solutions can be useful. 
In some combinatorial problems however, it can be hard to define meaningful solution neighborhoods that connect large portions of the search space, thus hindering methods that search this space directly. We suggest to circumvent such cases by utilizing a policy gradient algorithm that transforms the problem to the continuous domain, and to optimize a new surrogate objective that renders the former as generic stochastic optimizer. This is achieved by producing a surrogate objective whose distribution is fixed and predetermined, thus removing the need to fine-tune various hyper-parameters in a case by case manner. Since we are interested in methods which can successfully recover locally optimal solutions, we use the problem of finding locally maximal cliques as a challenging experimental benchmark, and we report results on a large dataset of graphs that is designed to test clique finding algorithms. Notably, we show in this benchmark that fixing the distribution of the surrogate is key to consistently recovering locally optimal solutions, and that our surrogate objective leads to an algorithm that outperforms other methods we have tested in a number of measures.",/pdf/b69cf6fa58cb659cdd3766e8943ee2c3b7208965.pdf,ICLR,2019,"A new policy gradient algorithm designed to approach black-box combinatorial optimization problems. The algorithm relies only on function evaluations, and returns locally optimal solutions with high probability." +Hygv3xrtDr,HklyPgbFPr,1569440000000.0,1577170000000.0,2539,Sparse Skill Coding: Learning Behavioral Hierarchies with Sparse Codes,"[""sanborn@berkeley.edu"", ""mbchang@berkeley.edu"", ""svlevine@eecs.berkeley.edu"", ""tomg@princeton.edu""]","[""Sophia Sanborn"", ""Michael Chang"", ""Sergey Levine"", ""Thomas Griffiths""]","[""hierarchical reinforcement learning"", ""unsupervised learning"", ""compression""]","Many approaches to hierarchical reinforcement learning aim to identify sub-goal structure in tasks. 
We consider an alternative perspective based on identifying behavioral `motifs'---repeated action sequences that can be compressed to yield a compact code of action trajectories. We present a method for iteratively compressing action trajectories to learn nested behavioral hierarchies of arbitrary depth, with actions of arbitrary length. The learned temporally extended actions provide new action primitives that can participate in deeper hierarchies as the agent learns. We demonstrate the relevance of this approach for tasks with non-trivial hierarchical structure and show that the approach can be used to accelerate learning in recursively more complex tasks through transfer.",/pdf/a9ec663ac61dfbf133bccbf5dc1d6be70e5404b4.pdf,ICLR,2020, +Syg6jTNtDH,S1gpcYkOPr,1569440000000.0,1577170000000.0,762,Learning Numeral Embedding,"[""jiangchy@shanghaitech.edu.cn"", ""nianzhl@shanghaitech.edu.cn"", ""guokh@shanghaitech.edu.cn"", ""chushb@leyantech.com"", ""ygzhao@leyantech.com"", ""libin@leyantech.com"", ""tukw@shanghaitech.edu.cn""]","[""Chengyue Jiang"", ""Zhonglin Nian"", ""Kaihao Guo"", ""Shanbo Chu"", ""Yinggong Zhao"", ""Libin Shen"", ""Kewei Tu""]","[""Natural Language Processing"", ""Numeral Embedding"", ""Word Embedding"", ""Out-of-vocabulary Problem""]","Word embedding is an essential building block for deep learning methods for natural language processing. Although word embedding has been extensively studied over the years, the problem of how to effectively embed numerals, a special subset of words, is still underexplored. Existing word embedding methods do not learn numeral embeddings well because there are an infinite number of numerals and their individual appearances in training corpora are highly scarce. +In this paper, we propose two novel numeral embedding methods that can handle the out-of-vocabulary (OOV) problem for numerals. We first induce a finite set of prototype numerals using either a self-organizing map or a Gaussian mixture model. 
We then represent the embedding of a numeral as a weighted average of the prototype number embeddings. Numeral embeddings represented in this manner can be plugged into existing word embedding learning approaches such as skip-gram for training. +We evaluated our methods and showed their effectiveness on four intrinsic and extrinsic tasks: word similarity, embedding numeracy, numeral prediction, and sequence labeling. ",/pdf/4a076dd3baa83d984e8174e2e86ba83231e6705e.pdf,ICLR,2020,We propose two methods for learning better numeral embeddings that solve the numeral out-of-vocabulary (OOV) problem and can be integrated into traditional word embedding training methods. +HklE01BYDB,rklE-vJFwr,1569440000000.0,1577170000000.0,2020,Improving Sample Efficiency in Model-Free Reinforcement Learning from Images,"[""denisyarats@cs.nyu.edu"", ""amyzhang@fb.com"", ""ik1078@nyu.edu"", ""brandon.amos.cs@gmail.com"", ""jpineau@fb.com"", ""robfergus@fb.com""]","[""Denis Yarats"", ""Amy Zhang"", ""Ilya Kostrikov"", ""Brandon Amos"", ""Joelle Pineau"", ""Rob Fergus""]","[""reinforcement learning"", ""model-free"", ""off-policy"", ""image-based reinforcement learning"", ""continuous control""]","Training an agent to solve control tasks directly from high-dimensional images with model-free reinforcement learning (RL) has proven difficult. The agent needs to learn a latent representation together with a control policy to perform the task. Fitting a high-capacity encoder using a scarce reward signal is not only extremely sample inefficient, but also prone to suboptimal convergence. Two ways to improve sample efficiency are to learn a good feature representation and use off-policy algorithms. We dissect various approaches of learning good latent features, and conclude that the image reconstruction loss is the essential ingredient that enables efficient and stable representation learning in image-based RL. 
Following these findings, we devise an off-policy actor-critic algorithm with an auxiliary decoder that trains end-to-end and matches state-of-the-art performance across both model-free and model-based algorithms on many challenging control tasks. We release our code to encourage future research on image-based RL.",/pdf/26bfa35b5ec3aeacf128b36cd8d5347f4cabcd26.pdf,ICLR,2020,We design a simple and efficient model-free off-policy method for image-based reinforcement learning that matches the state-of-the-art model-based methods in sample efficiency +Hkexw1BtDr,S1xIIqadDB,1569440000000.0,1577170000000.0,1754,Deep Auto-Deferring Policy for Combinatorial Optimization,"[""sungsoo.ahn@kaist.ac.kr"", ""younggyo.seo@kaist.ac.kr"", ""jinwoos@kaist.ac.kr""]","[""Sungsoo Ahn"", ""Younggyo Seo"", ""Jinwoo Shin""]","[""deep reinforcement learning"", ""combinatorial optimization""]","Designing efficient algorithms for combinatorial optimization appears ubiquitously in various scientific fields. Recently, deep reinforcement learning (DRL) frameworks have gained considerable attention as a new approach: they can automatically learn the design of a good solver without using any sophisticated knowledge or hand-crafted heuristic specialized for the target problem. However, the number of stages (until reaching the final solution) required by existing DRL solvers is proportional to the size of the input graph, which hurts their scalability to large-scale instances. In this paper, we seek to resolve this issue by proposing a novel design of DRL's policy, coined auto-deferring policy (ADP), automatically stretching or shrinking its decision process. Specifically, it decides whether to finalize the value of each vertex at the current stage or defer to determine it at later stages. We apply the proposed ADP framework to the maximum independent set (MIS) problem, a prototype of NP-complete problems, under various scenarios. 
Our experimental results demonstrate significant improvement of ADP over the current state-of-the-art DRL scheme in terms of computational efficiency and approximation quality. The reported performance of our generic DRL scheme is also comparable with that of the state-of-the-art solvers specialized for MIS, e.g., ADP outperforms them for some graphs with millions of vertices. ",/pdf/c26f6c159fb6afb21162504b46cbdb449543e4fb.pdf,ICLR,2020,We propose a new scalable framework based on deep reinforcement learning for solving combinatorial optimization on large graphs. +HkXWCMbRW,BJucTf-0b,1509140000000.0,1519410000000.0,1036,Towards Image Understanding from Deep Compression Without Decoding,"[""robertto@student.ethz.ch"", ""mentzerf@vision.ee.ethz.ch"", ""aeirikur@vision.ee.ethz.ch"", ""michaelt@nari.ee.ethz.ch"", ""radu.timofte@vision.ee.ethz.ch"", ""vangool@vision.ee.ethz.ch""]","[""Robert Torfason"", ""Fabian Mentzer"", ""Eirikur Agustsson"", ""Michael Tschannen"", ""Radu Timofte"", ""Luc Van Gool""]",[],"Motivated by recent work on deep neural network (DNN)-based image compression methods showing potential improvements in image quality, savings in storage, and bandwidth reduction, we propose to perform image understanding tasks such as classification and segmentation directly on the compressed representations produced by these compression methods. Since the encoders and decoders in DNN-based compression methods are neural networks with feature-maps as internal representations of the images, we directly integrate these with architectures for image understanding. This bypasses decoding of the compressed representation into RGB space and reduces computational cost. Our study shows that accuracies comparable to networks that operate on compressed RGB images can be achieved while reducing the computational complexity up to $2\times$. 
Furthermore, we show that synergies are obtained by jointly training compression networks with classification networks on the compressed representations, improving image quality, classification accuracy, and segmentation performance. We find that inference from compressed representations is particularly advantageous compared to inference from compressed RGB images for aggressive compression rates.",/pdf/ff5ed2e1f146461e5146d0a370700ddbfc4c4483.pdf,ICLR,2018, +YPm0fzy_z6R,Oaheo_Rtbly,1601310000000.0,1614990000000.0,1131,Signed Graph Diffusion Network,"[""~Jinhong_Jung1"", ""~Jaemin_Yoo1"", ""~U_Kang1""]","[""Jinhong Jung"", ""Jaemin Yoo"", ""U Kang""]","[""graph neural network"", ""signed graph analysis"", ""representation learning"", ""graph diffusion"", ""random walk"", ""link sign prediction""]","Given a signed social graph, how can we learn appropriate node representations to infer the signs of missing edges? +Signed social graphs have received considerable attention to model trust relationships. +Learning node representations is crucial to effectively analyze graph data, and various techniques such as network embedding and graph convolutional network (GCN) have been proposed for learning signed graphs. +However, traditional network embedding methods are not end-to-end for a specific task such as link sign prediction, and GCN-based methods suffer from a performance degradation problem when their depth increases. +In this paper, we propose Signed Graph Diffusion Network (SGDNet), a novel graph neural network that achieves end-to-end node representation learning for link sign prediction in signed social graphs. +We propose a random walk technique specially designed for signed graphs so that SGDNet effectively diffuses hidden node features. +Through extensive experiments, we demonstrate that SGDNet outperforms state-of-the-art models in terms of link sign prediction accuracy. 
",/pdf/baad2e3d020391c95b9becb6b56365f79a83826d.pdf,ICLR,2021,End-to-end graph neural network model for node representation learning using a novel random walk based feature diffusion in signed graphs. +SkkTMpjex,,1478380000000.0,1487860000000.0,593,Distributed Second-Order Optimization using Kronecker-Factored Approximations,"[""jimmy@psi.toronto.edu"", ""rgrosse@cs.toronto.edu"", ""jmartens@cs.toronto.edu""]","[""Jimmy Ba"", ""Roger Grosse"", ""James Martens""]","[""Deep learning"", ""Optimization""]","As more computational resources become available, machine learning researchers train ever larger neural networks on millions of data points using stochastic gradient descent (SGD). Although SGD scales well in terms of both the size of dataset and the number of parameters of the model, it has rapidly diminishing returns as parallel computing resources increase. Second-order optimization methods have an affinity for well-estimated gradients and large mini-batches, and can therefore benefit much more from parallel computation in principle. Unfortunately, they often employ severe approximations to the curvature matrix in order to scale to large models with millions of parameters, limiting their effectiveness in practice versus well-tuned SGD with momentum. The recently proposed K-FAC method(Martens and Grosse, 2015) uses a stronger and more sophisticated curvature approximation, and has been shown to make much more per-iteration progress than SGD, while only introducing a modest overhead. In this paper, we develop a version of K-FAC that distributes the computation of gradients and additional quantities required by K-FAC across multiple machines, thereby taking advantage of method’s superior scaling to large mini-batches and mitigating its additional overheads. We provide a Tensorflow implementation of our approach which is easy to use and can be applied to many existing codebases without modification. 
Additionally, we develop several algorithmic enhancements to K-FAC which can improve its computational performance for very large models. Finally, we show that our distributed K-FAC method speeds up training of various state-of-the-art ImageNet classification models by a factor of two compared to Batch Normalization(Ioffe and Szegedy, 2015).",https://jimmylba.github.io/papers/nsync.pdf,ICLR,2017,Fixed typos pointed out by AnonReviewer1 and AnonReviewer4 and added the experiments in Fig. 6 showing the poor scaling of batch normalized SGD using a batch size of 2048 on googlenet. +r1NYjfbR-,ryDLof-0-,1509140000000.0,1519410000000.0,938,Generative networks as inverse problems with Scattering transforms,"[""tomas.angles@ens.fr"", ""stephane.mallat@ens.fr""]","[""Tom\u00e1s Angles"", ""St\u00e9phane Mallat""]","[""Unsupervised Learning"", ""Inverse Problems"", ""Convolutional Networks"", ""Generative Models"", ""Scattering Transform""]","Generative Adversarial Nets (GANs) and Variational Auto-Encoders (VAEs) provide impressive image generations from Gaussian white noise, but the underlying mathematics are not well understood. We compute deep convolutional network generators by inverting a fixed embedding operator. Therefore, they do not require to be optimized with a discriminator or an encoder. The embedding is Lipschitz continuous to deformations so that generators transform linear interpolations between input white noise vectors into deformations between output images. This embedding is computed with a wavelet Scattering transform. Numerical experiments demonstrate that the resulting Scattering generators have similar properties as GANs or VAEs, without learning a discriminative network or an encoder.",/pdf/f9fdb919d861deeded4f7d4ca03fd16e37064d1c.pdf,ICLR,2018,We introduce generative networks that do not require to be learned with a discriminator or an encoder; they are obtained by inverting a special embedding operator defined by a wavelet Scattering transform. 
+HyxwZRNtDr,H1lY9H4_vr,1569440000000.0,1577170000000.0,968,Wasserstein Robust Reinforcement Learning,"[""mohammed.abdullah@huawei.com"", ""hang.ren1@huawei.com"", ""haitham.ammar@huawei.com"", ""vladimir.milenkovic@huawei.com"", ""ruiluo@huawei.com"", ""w.j@huawei.com""]","[""Mohammed Amin Abdullah"", ""Hang Ren"", ""Haitham Bou-Ammar"", ""Vladimir Milenkovic"", ""Rui Luo"", ""Mingtian Zhang"", ""Jun Wang""]","[""Reinforcement Learning"", ""Robustness"", ""Wasserstein distance""]","Reinforcement learning algorithms, though successful, tend to over-fit to training environments, thereby hampering their application to the real world. This paper proposes $\text{W}\text{R}^{2}\text{L}$ -- a robust reinforcement learning algorithm with significant robust performance on low and high-dimensional control tasks. Our method formalises robust reinforcement learning as a novel min-max game with a Wasserstein constraint for a correct and convergent solver. Apart from the formulation, we also propose an efficient and scalable solver following a novel zero-order optimisation method that we believe can be useful to numerical optimisation in general. +We empirically demonstrate significant gains compared to standard and robust state-of-the-art algorithms on high-dimensional MuJoCo environments",/pdf/9b3e5a53349af570215c693c1aab05a553cd71a2.pdf,ICLR,2020,An RL algorithm that learns to be robust to changes in dynamics +6FqKiVAdI3Y,8AjGUdgzfQuy,1601310000000.0,1616060000000.0,1158,DOP: Off-Policy Multi-Agent Decomposed Policy Gradients,"[""~Yihan_Wang1"", ""~Beining_Han1"", ""~Tonghan_Wang1"", ""~Heng_Dong1"", ""~Chongjie_Zhang1""]","[""Yihan Wang"", ""Beining Han"", ""Tonghan Wang"", ""Heng Dong"", ""Chongjie Zhang""]","[""Multi-Agent Reinforcement Learning"", ""Multi-Agent Policy Gradients""]","Multi-agent policy gradient (MAPG) methods have recently witnessed vigorous progress. 
However, there is a significant performance discrepancy between MAPG methods and state-of-the-art multi-agent value-based approaches. In this paper, we investigate causes that hinder the performance of MAPG algorithms and present a multi-agent decomposed policy gradient method (DOP). This method introduces the idea of value function decomposition into the multi-agent actor-critic framework. Based on this idea, DOP supports efficient off-policy learning and addresses the issue of centralized-decentralized mismatch and credit assignment in both discrete and continuous action spaces. We formally show that DOP critics have sufficient representational capability to guarantee convergence. In addition, empirical evaluations on the StarCraft II micromanagement benchmark and multi-agent particle environments demonstrate that DOP outperforms both state-of-the-art value-based and policy-based multi-agent reinforcement learning algorithms. Demonstrative videos are available at https://sites.google.com/view/dop-mapg/.",/pdf/3f81a57de106341a8eae39f9a550980090df7241.pdf,ICLR,2021,"We propose an off-policy multi-agent decomposed policy gradient method, addressing the drawbacks that prevent existing multi-agent policy gradient methods from achieving state-of-the-art performance." +SJlbvp4YvS,Syectw9wwS,1569440000000.0,1577170000000.0,586,Risk Averse Value Expansion for Sample Efficient and Robust Policy Learning,"[""zhoubo01@baidu.com"", ""wangfan04@baidu.com"", ""zenghongsheng@baidu.com"", ""tianhao@baidu.com""]","[""Bo Zhou"", ""Fan Wang"", ""Hongsheng Zeng"", ""Hao Tian""]","[""reinforcement learning"", ""model-based RL"", ""risk-sensitive"", ""sample efficiency""]","Model-based Reinforcement Learning(RL) has shown great advantage in sample-efficiency, but suffers from poor asymptotic performance and high inference cost. A promising direction is to combine model-based reinforcement learning with model-free reinforcement learning, such as model-based value expansion(MVE). 
However, the previous methods do not take into account the stochastic character of the environment, and thus still suffer from higher function approximation errors. As a result, they tend to fall behind the best model-free algorithms in some challenging scenarios. We propose a novel Hybrid-RL method, which is developed from MVE, namely the Risk Averse Value Expansion (RAVE). In the proposed method, we use an ensemble of probabilistic models for environment modeling to generate imaginative rollouts, based on which we further introduce the aversion of risks by seeking the lower confidence bound of the estimation. Experiments on different environments including MuJoCo and robo-school show that RAVE yields state-of-the-art performance. We also found that it largely prevented some catastrophic consequences such as falling down, and thus reduced the variance of the rewards.",/pdf/14b062bf6d157f7f3358ed599c51ec26d9dcaba3.pdf,ICLR,2020,We extend the model-based value expansion methods with risk-averse learning and achieve state-of-the-art results on challenging continuous control benchmarks. +uJSBC7QCfrX,rI0KBE2IynM,1601310000000.0,1614990000000.0,1700,Differential-Critic GAN: Generating What You Want by a Cue of Preferences,"[""~Yinghua_Yao1"", ""~Yuangang_Pan2"", ""~Ivor_Tsang1"", ""~Xin_Yao1""]","[""Yinghua Yao"", ""Yuangang Pan"", ""Ivor Tsang"", ""Xin Yao""]","[""GAN"", ""user-desired data distribution"", ""user preference"", ""critic""]","This paper proposes Differential-Critic Generative Adversarial Network (DiCGAN) to learn the distribution of user-desired data when only part of the dataset, rather than all of it, possesses the desired properties. Existing approaches select the desired samples first and train regular GANs on the selected samples to derive the user-desired data distribution. DiCGAN introduces a differential critic that can learn the preference direction from the pairwise preferences over the entire dataset. 
The resultant critic would guide the generation of the desired data instead of the whole data. Specifically, apart from the Wasserstein GAN loss, a ranking loss of the pairwise preferences is defined over the critic. It endows the difference of critic values between each pair of samples with the pairwise preference relation. The higher critic value indicates that the sample is preferred by the user. Thus training the generative model for higher critic values would encourage generating the user-preferred samples. Extensive experiments show that our DiCGAN can learn the user-desired data distributions.",/pdf/c6c32128e4cf22e03d8d30d9a093a93c0cfbd235.pdf,ICLR,2021,"This paper proposes DiCGAN to learn the distribution of user-desired data from the entire dataset using pairwise preferences, where a differential critic is introduced to learn the preference direction from the pairwise preferences." +Hkesr205t7,H1l2DJR5tQ,1538090000000.0,1545360000000.0,1576,Learning shared manifold representation of images and attributes for generalized zero-shot learning,"[""masa@weblab.t.u-tokyo.ac.jp"", ""iwasawa@weblab.t.u-tokyo.ac.jp"", ""matsuo@weblab.t.u-tokyo.ac.jp""]","[""Masahiro Suzuki"", ""Yusuke Iwasawa"", ""Yutaka Matsuo""]","[""zero-shot learning"", ""variational autoencoders""]","Many of the zero-shot learning methods have realized predicting labels of unseen images by learning the relations between images and pre-defined class-attributes. However, recent studies show that, under the more realistic generalized zero-shot learning (GZSL) scenarios, these approaches severely suffer from the issue of biased prediction, i.e., their classifier tends to predict all the examples from both seen and unseen classes as one of the seen classes. The cause of this problem is that they cannot properly learn a mapping to the representation space generalized to the unseen classes since the training set does not include any unseen class information. 
To solve this, we propose learning a mapping that embeds both images and attributes into a shared representation space that generalizes even to unseen classes by interpolating from the information of seen classes, which we refer to as shared manifold learning. Furthermore, we propose modality invariant variational autoencoders, which can perform shared manifold learning by training variational autoencoders with both images and attributes as inputs. Empirical validation on well-known GZSL datasets shows that our method achieves significantly superior performance to existing relation-based studies.",/pdf/9effaf2d57abf00263ec39ad05363cbd278f4600.pdf,ICLR,2019, +ryenvpEKDr,H1eTTUsPPB,1569440000000.0,1583910000000.0,611,A Constructive Prediction of the Generalization Error Across Scales,"[""jonsr@mit.edu"", ""amir@eecs.yorku.ca"", ""belinkov@mit.edu"", ""shanir@csail.mit.edu""]","[""Jonathan S. Rosenfeld"", ""Amir Rosenfeld"", ""Yonatan Belinkov"", ""Nir Shavit""]","[""neural networks"", ""deep learning"", ""generalization error"", ""scaling"", ""scalability"", ""vision"", ""language""]","The dependency of the generalization error of neural networks on model and dataset size is of critical importance both in practice and for understanding the theory of neural networks. Nevertheless, the functional form of this dependency remains elusive. In this work, we present a functional form which approximates well the generalization error in practice. Capitalizing on the successful concept of model scaling (e.g., width, depth), we are able to simultaneously construct such a form and specify the exact models which can attain it across model/data scales. Our construction follows insights obtained from observations conducted over a range of model/data scales, in various model types and datasets, in vision and language tasks. 
We show that the form both fits the observations well across scales, and provides accurate predictions from small- to large-scale models and data.",/pdf/cdf74221b374eccd5c5cfc3ac4c82bb6ce27eb46.pdf,ICLR,2020,We predict the generalization error and specify the model which attains it across model/data scales. +qiAxL3Xqx1o,jaSMD4PGHEO,1601310000000.0,1614990000000.0,3656,GG-GAN: A Geometric Graph Generative Adversarial Network,"[""~Igor_Krawczuk1"", ""pedro.abranches@epfl.ch"", ""~Andreas_Loukas1"", ""~Volkan_Cevher1""]","[""Igor Krawczuk"", ""Pedro Abranches"", ""Andreas Loukas"", ""Volkan Cevher""]","[""GAN"", ""generative adversarial network"", ""WGAN"", ""GNN"", ""graph neural network"", ""generative model"", ""graph""]","We study the fundamental problem of graph generation. Specifically, we treat graph generation from a geometric perspective by associating each node with a position in space and then connecting the edges based on a similarity function. We then provide new solutions to the key challenges that prevent the widespread application of this classical geometric interpretation: (1) modeling complex relations, (2) modeling isomorphic graphs consistently, and (3) fully exploiting the latent distribution. +Our main contribution is dubbed as the geometric graph (GG) generative adversarial network (GAN), which is a Wasserstein GAN that addresses the above challenges. GG-GAN is permutation equivariant and easily scales to generate graphs of tens of thousands of nodes. 
GG-GAN also strikes a good trade-off between novelty and modeling the distribution statistics, being competitive or surpassing the state-of-the-art methods that are either slower or that are non-equivariant, or that exploit problem-specific knowledge.",/pdf/6f37a93ce42a5827de81dd51ecbcea3f3e49ec41.pdf,ICLR,2021," We present the first fully equivariant pure GAN framework for graph generation, based on a geometric perspective on graphs and scalable for inference of thousands of nodes on commonly available hardware." +S1grRoR9tQ,Bkgzvl_tKX,1538090000000.0,1545360000000.0,894,Bayesian Deep Learning via Stochastic Gradient MCMC with a Stochastic Approximation Adaptation,"[""deng106@purdue.edu"", ""zhang923@purdue.edu"", ""fmliang@purdue.edu"", ""guanglin@purdue.edu""]","[""Wei Deng"", ""Xiao Zhang"", ""Faming Liang"", ""Guang Lin""]","[""generalized stochastic approximation"", ""stochastic gradient Markov chain Monte Carlo"", ""adaptive algorithm"", ""EM algorithm"", ""convolutional neural networks"", ""Bayesian inference"", ""sparse prior"", ""spike and slab prior"", ""local trap""]","We propose a robust Bayesian deep learning algorithm to infer complex posteriors with latent variables. Inspired by dropout, a popular tool for regularization and model ensemble, we assign sparse priors to the weights in deep neural networks (DNN) in order to achieve automatic “dropout” and avoid over-fitting. By alternatively sampling from posterior distribution through stochastic gradient Markov Chain Monte Carlo (SG-MCMC) and optimizing latent variables via stochastic approximation (SA), the trajectory of the target weights is proved to converge to the true posterior distribution conditioned on optimal latent variables. This ensures a stronger regularization on the over-fitted parameter space and more accurate uncertainty quantification on the decisive variables. 
Simulations from large-p-small-n regressions showcase the robustness of this method when applied to models with latent variables. Additionally, its application on the convolutional neural networks (CNN) leads to state-of-the-art performance on MNIST and Fashion MNIST datasets and improved resistance to adversarial attacks. ",/pdf/a833d91ac20ca1e9f766344271cd57a51fd1d187.pdf,ICLR,2019,a robust Bayesian deep learning algorithm to infer complex posteriors with latent variables +B1KBHtcel,,1478300000000.0,1481770000000.0,484,Here's My Point: Argumentation Mining with Pointer Networks,"[""ppotash@cs.uml.edu"", ""aromanov@cs.uml.edu"", ""arum@cs.uml.edu""]","[""Peter Potash"", ""Alexey Romanov"", ""Anna Rumshisky""]","[""Natural language processing""]","One of the major goals in automated argumentation mining is to uncover the argument structure present in argumentative text. In order to determine this structure, one must understand how different individual components of the overall argument are linked. General consensus in this field dictates that the argument components form a hierarchy of persuasion, which manifests itself in a tree structure. This work provides the first neural network-based approach to argumentation mining, focusing on extracting links between argument components, with a secondary focus on classifying types of argument components. In order to solve this problem, we propose to use a modification of a Pointer Network architecture. A Pointer Network is appealing for this task for the following reasons: 1) It takes into account the sequential nature of argument components; 2) By construction, it enforces certain properties of the tree structure present in argument relations; 3) The hidden representations can be applied to auxiliary tasks. 
In order to extend the contribution of the original Pointer Network model, we construct a joint model that simultaneously attempts to learn the type of argument component, as well as continuing to predict links between argument components. The proposed model achieves state-of-the-art results on two separate evaluation corpora. Furthermore, our results show that optimizing for both tasks, as well as adding a fully-connected layer prior to recurrent neural network input, is crucial for high performance.",/pdf/1546be2f1c01e692d4ff69f5e67b692d280f9270.pdf,ICLR,2017,We use a modified Pointer Network to predict 1) types of argument components; 2) links between argument components. +ryeAy3AqYm,BJxO8gCctQ,1538090000000.0,1545360000000.0,1034,Distilled Agent DQN for Provable Adversarial Robustness,"[""matthew.mirman@inf.ethz.ch"", ""marcfisc@student.ethz.ch"", ""martin.vechev@inf.ethz.ch""]","[""Matthew Mirman"", ""Marc Fischer"", ""Martin Vechev""]","[""reinforcement learning"", ""dqn"", ""adversarial examples"", ""robustness analysis"", ""adversarial defense"", ""robust learning"", ""robust rl""]","As deep neural networks have become the state of the art for solving complex reinforcement learning tasks, susceptibility to perceptual adversarial examples have become a concern. The transferability of adversarial examples is known to enable attacks capable of tricking the agent into bad states. In this work we demonstrate a simple poisoning attack able to keep deep RL from learning, and into fooling it when trained with defense methods commonly used for classification tasks. We then propose an algorithm called DadQN, based on deep Q-networks, which enables the use of stronger defenses, including defenses enabling the first ever on-line robustness certification of a deep RL agent.",/pdf/af4963ee0a35ee7693fc8eed6b383e95a857ccea.pdf,ICLR,2019,"We introduce a way of (provably) defending Deep-RL against adversarial perturbations, including a new poisoning attack." 
+wqRvVvMbJAT,uuoV6njx89J,1601310000000.0,1614990000000.0,2060,One Size Doesn't Fit All: Adaptive Label Smoothing,"[""~Ujwal_Krothapalli1"", ""~Lynn_Abbott1""]","[""Ujwal Krothapalli"", ""Lynn Abbott""]","[""Uncertainty estimation"", ""Calibration"", ""Label smoothing""]","This paper concerns the use of objectness measures to improve the calibration performance of Convolutional Neural Networks (CNNs). CNNs have proven to be very good classifiers and generally localize objects well; however, the loss functions typically used to train classification CNNs do not penalize inability to localize an object, nor do they take into account an object's relative size in the given image. During training on ImageNet-1K almost all approaches use random crops on the images and this transformation sometimes provides the CNN with background only samples. This causes the classifiers to depend on context. Context dependence is harmful for safety-critical applications. We present a novel approach to classification that combines the ideas of objectness and label smoothing during training. Unlike previous methods, we compute a smoothing factor that is \emph{adaptive} based on relative object size within an image. This causes our approach to produce confidences that are grounded in the size of the object being classified instead of relying on context to make the correct predictions. We present extensive results using ImageNet to demonstrate that CNNs trained using adaptive label smoothing are much less likely to be overconfident in their predictions. We show qualitative results using class activation maps and quantitative results using classification and transfer learning tasks. Our approach is able to produce an order of magnitude reduction in confidence when predicting on context only images when compared to baselines. 
Using transfer learning, we gain $0.021$AP on MS COCO compared to the hard label approach.",/pdf/5dc8173ec04652575d0166c2fd7581ecfa59e5c1.pdf,ICLR,2021,The main contribution of this work is that we have developed a novel way to train classification CNNs using objectness measure and adaptive label smoothing. +Byx7LjRcYm,SygH1LW9YX,1538090000000.0,1545360000000.0,160,Human Action Recognition Based on Spatial-Temporal Attention,"[""2489925838@qq.com"", ""zhiqiangtian@xjtu.edu.cn"", ""xglan@xjtu.edu.cn""]","[""Wensong Chan"", ""Zhiqiang Tian"", ""Xuguang Lan""]",[],"Many state-of-the-art methods of recognizing human action are based on attention mechanism, which shows the importance of attention mechanism in action recognition. With the rapid development of neural networks, human action recognition has been achieved great improvement by using convolutional neural networks (CNN) or recurrent neural networks (RNN). In this paper, we propose a model based on spatial-temporal attention weighted LSTM. This model pays attention to the key part in each video frame, and also focuses on the important frames in each video sequence, thus the most important theme for our model is how to find out the key point spatially and the key frames temporally. We show a feasible architecture which can solve those two problems effectively and achieve a satisfactory result. Our model is trained and tested on three datasets including UCF-11, UCF-101, and HMDB51. 
Those results demonstrate a high performance of our model in human action recognition.",/pdf/a24113535c32a25578b1449c4a53a9404dbc2978.pdf,ICLR,2019, +rJlUhhVYvS,H1xCsXNzPr,1569440000000.0,1577170000000.0,190,Understanding Isomorphism Bias in Graph Data Sets ,"[""ivanovserg990@gmail.com"", ""sergei.sviridov@gmail.com"", ""e.burnaev@skoltech.ru""]","[""Ivanov Sergey"", ""Sviridov Sergey"", ""Evgeny Burnaev""]","[""graph classification"", ""data sets"", ""graph representation learning""]","In recent years there has been a rapid increase in classification methods on graph structured data. Both in graph kernels and graph neural networks, one of the implicit assumptions of successful state-of-the-art models was that incorporating graph isomorphism features into the architecture leads to better empirical performance. However, as we discover in this work, commonly used data sets for graph classification have repeating instances which cause the problem of isomorphism bias, i.e. artificially increasing the accuracy of the models by memorizing target information from the training set. This prevents fair competition of the algorithms and raises a question of the validity of the obtained results. We analyze 54 data sets, previously extensively used for graph-related tasks, on the existence of isomorphism bias, give a set of recommendations to machine learning practitioners to properly set up their models, and open source new data sets for the future experiments. ",/pdf/58bf77f3d919145379ab8c1e203eaf5385e93d0a.pdf,ICLR,2020,"Many graph classification data sets have duplicates, thus raising questions about generalization abilities and fair comparison of the models. 
" +PS3IMnScugk,b1MJiNg94_7,1601310000000.0,1623110000000.0,1966,Learning to Recombine and Resample Data For Compositional Generalization,"[""~Ekin_Aky\u00fcrek1"", ""~Afra_Feyza_Aky\u00fcrek1"", ""~Jacob_Andreas1""]","[""Ekin Aky\u00fcrek"", ""Afra Feyza Aky\u00fcrek"", ""Jacob Andreas""]","[""compositional generalization"", ""data augmentation"", ""language processing"", ""sequence models"", ""generative modeling""]","Flexible neural sequence models outperform grammar- and automaton-based counterparts on a variety of tasks. However, neural models perform poorly in settings requiring compositional generalization beyond the training data—particularly to rare or unseen subsequences. Past work has found symbolic scaffolding (e.g. grammars or automata) essential in these settings. We describe R&R, a learned data augmentation scheme that enables a large category of compositional generalizations without appeal to latent symbolic structure. R&R has two components: recombination of original training examples via a prototype-based generative model and resampling of generated examples to encourage extrapolation. Training an ordinary neural sequence model on a dataset augmented with recombined and resampled examples significantly improves generalization in two language processing problems—instruction following (SCAN) and morphological analysis (SIGMORPHON 2018)—where R&R enables learning of new constructions and tenses from as few as eight initial examples.",/pdf/4496adf916371224de150134eb36aafb36758fc7.pdf,ICLR,2021,"This paper investigates a data augmentation procedure based on two weaker principles: recombination and resampling, and finds that it is sufficient to induce many of the compositional generalizations studied in previous work. 
" +HyljzgHtwS,SJgjUXlFDS,1569440000000.0,1577170000000.0,2186,Regularly varying representation for sentence embedding,"[""hamid.jalalzai@telecom-paris.fr"", ""pierre.colombo@telecom-paris.fr"", ""chloe.clavel@telecom-paris.fr"", ""giovanna.varni@telecom-paris.fr"", ""emmanuel.vignon@fr.ibm.com"", ""anne.sabourin@telecom-paris.fr""]","[""Hamid Jalalzai"", ""Pierre Colombo"", ""Chlo\u00e9 Clavel"", ""Eric Gaussier"", ""Giovanna Varni"", ""Emmanuel Vignon"", ""Anne Sabourin""]","[""extreme value theory"", ""classification"", ""supvervised learning"", ""data augmentation"", ""representation learning""]","The dominant approaches to sentence representation in natural language rely on learning embeddings on massive corpuses. The obtained embeddings have desirable properties such as compositionality and distance preservation (sentences with similar meanings have similar representations). In this paper, we develop a novel method for learning an embedding enjoying a dilation invariance property. We propose two algorithms: Orthrus, a classification algorithm, constrains the distribution of the embedded variable to be regularly varying, i.e. multivariate heavy-tail. and uses Extreme Value Theory (EVT) to tackle the classification task on two separate regions: the tail and the bulk. Hydra, a text generation algorithm for dataset augmentation, leverages the invariance property of the embedding learnt by Orthrus to generate coherent sentences with controllable attribute, e.g. positive or negative sentiment. Numerical experiments on synthetic and real text data demonstrate the relevance of the proposed framework. 
+",/pdf/9413c0fc775b82b57796ebd63820d3ed14109eda.pdf,ICLR,2020, +HJe_Z04Yvr,ByxSeIEdwr,1569440000000.0,1583910000000.0,970,Adjustable Real-time Style Transfer,"[""mb2@uiuc.edu"", ""golnazg@google.com""]","[""Mohammad Babaeizadeh"", ""Golnaz Ghiasi""]","[""Image Style Transfer"", ""Deep Learning""]","Artistic style transfer is the problem of synthesizing an image with content similar to a given image and style similar to another. Although recent feed-forward neural networks can generate stylized images in real-time, these models produce a single stylization given a pair of style/content images, and the user doesn't have control over the synthesized output. Moreover, the style transfer depends on the hyper-parameters of the model with varying ``optimum"" for different input images. Therefore, if the stylized output is not appealing to the user, she/he has to try multiple models or retrain one with different hyper-parameters to get a favorite stylization. In this paper, we address these issues by proposing a novel method which allows adjustment of crucial hyper-parameters, after the training and in real-time, through a set of manually adjustable parameters. These parameters enable the user to modify the synthesized outputs from the same pair of style/content images, in search of a favorite stylized image. Our quantitative and qualitative experiments indicate how adjusting these parameters is comparable to retraining the model with different hyper-parameters. We also demonstrate how these parameters can be randomized to generate results which are diverse but still very similar in style and content.",/pdf/a6a7059cf21d7328b44c35109b916904bc2828b5.pdf,ICLR,2020,Stochastic style transfer with adjustable features. +8nl0k08uMi,I0amhSySq0K,1601310000000.0,1614890000000.0,954,Selectivity considered harmful: evaluating the causal impact of class selectivity in DNNs,"[""~Matthew_L_Leavitt1"", ""~Ari_S._Morcos1""]","[""Matthew L Leavitt"", ""Ari S. 
Morcos""]","[""interpretability"", ""explainability"", ""empirical analysis"", ""deep learning"", ""selectivity""]","The properties of individual neurons are often analyzed in order to understand the biological and artificial neural networks in which they're embedded. Class selectivity—typically defined as how different a neuron's responses are across different classes of stimuli or data samples—is commonly used for this purpose. However, it remains an open question whether it is necessary and/or sufficient for deep neural networks (DNNs) to learn class selectivity in individual units. We investigated the causal impact of class selectivity on network function by directly regularizing for or against class selectivity. Using this regularizer to reduce class selectivity across units in convolutional neural networks increased test accuracy by over 2% in ResNet18 and 1% in ResNet50 trained on Tiny ImageNet. For ResNet20 trained on CIFAR10 we could reduce class selectivity by a factor of 2.5 with no impact on test accuracy, and reduce it nearly to zero with only a small (~2%) drop in test accuracy. In contrast, regularizing to increase class selectivity significantly decreased test accuracy across all models and datasets. These results indicate that class selectivity in individual units is neither sufficient nor strictly necessary, and can even impair DNN performance. 
They also encourage caution when focusing on the properties of single units as representative of the mechanisms by which DNNs function.",/pdf/17afa91e20d55de3a216e07f59e5ab74d27027a6.pdf,ICLR,2021,Class selectivity in CNNs is neither sufficient nor strictly necessary for optimal test accuracy +Skg3104FDS,SJlzeaMdPS,1569440000000.0,1577170000000.0,906,First-Order Preconditioning via Hypergradient Descent,"[""thmoskovitz@gmail.com"", ""ruiwang@uber.com"", ""janlan@uber.com"", ""sanyam@uber.com"", ""tmiconi@uber.com"", ""yosinski@uber.com"", ""aditya.rawal@uber.com""]","[""Ted Moskovitz"", ""Rui Wang"", ""Janice Lan"", ""Sanyam Kapoor"", ""Thomas Miconi"", ""Jason Yosinski"", ""Aditya Rawal""]","[""optimization"", ""deep learning"", ""hypgergradient""]","Standard gradient-descent methods are susceptible to a range of issues that can impede training, such as high correlations and different scaling in parameter space. These difficulties can be addressed by second-order approaches that apply a preconditioning matrix to the gradient to improve convergence. Unfortunately, such algorithms typically struggle to scale to high-dimensional problems, in part because the calculation of specific preconditioners such as the inverse Hessian or Fisher information matrix is highly expensive. We introduce first-order preconditioning (FOP), a fast, scalable approach that generalizes previous work on hypergradient descent (Almeida et al., 1998; Maclaurin et al., 2015; Baydin et al., 2017) to learn a preconditioning matrix that only makes use of first-order information. Experiments show that FOP is able to improve the performance of standard deep learning optimizers on several visual classification tasks with minimal computational overhead. 
We also investigate the properties of the learned preconditioning matrices and perform a preliminary theoretical analysis of the algorithm.",/pdf/696ce02ead07c34c2a1c7824201a8c814bfe58fe.pdf,ICLR,2020,We introduce a computationally-efficient method for learning a preconditioning matrix for optimization via hypergradient descent. +wpSWuz_hyqA,#NAME?,1601310000000.0,1615830000000.0,643,Grounded Language Learning Fast and Slow,"[""~Felix_Hill1"", ""~Olivier_Tieleman1"", ""tamaravg@google.com"", ""~Nathaniel_Wong1"", ""~Hamza_Merzic1"", ""~Stephen_Clark1""]","[""Felix Hill"", ""Olivier Tieleman"", ""Tamara von Glehn"", ""Nathaniel Wong"", ""Hamza Merzic"", ""Stephen Clark""]","[""language"", ""cognition"", ""fast-mapping"", ""grounding"", ""word-learning"", ""memory"", ""meta-learning""]","Recent work has shown that large text-based neural language models acquire a surprising propensity for one-shot learning. Here, we show that an agent situated in a simulated 3D world, and endowed with a novel dual-coding external memory, can exhibit similar one-shot word learning when trained with conventional RL algorithms. After a single introduction to a novel object via visual perception and language (""This is a dax""), the agent can manipulate the object as instructed (""Put the dax on the bed""), combining short-term, within-episode knowledge of the nonsense word with long-term lexical and motor knowledge. We find that, under certain training conditions and with a particular memory writing mechanism, the agent's one-shot word-object binding generalizes to novel exemplars within the same ShapeNet category, and is effective in settings with unfamiliar numbers of objects. We further show how dual-coding memory can be exploited as a signal for intrinsic motivation, stimulating the agent to seek names for objects that may be useful later. 
Together, the results demonstrate that deep neural networks can exploit meta-learning, episodic memory and an explicitly multi-modal environment to account for 'fast-mapping', a fundamental pillar of human cognitive development and a potentially transformative capacity for artificial agents. ",/pdf/e357c41d68e8a24bfdaba368a3b2baa867fa25e2.pdf,ICLR,2021,A language-learning agent with dual-coding external memory meta-learns to combine fast-mapped and semantic lexical knowledge to execute instructions in one-shot.. +zYmnBGOZtH,RItqpkaBUaRY,1601310000000.0,1614990000000.0,740,An information-theoretic framework for learning models of instance-independent label noise,"[""~Xia_Huang1"", ""~Kai_Fong_Ernest_Chong1""]","[""Xia Huang"", ""Kai Fong Ernest Chong""]","[""label noise"", ""noise transition matrix"", ""entropy"", ""information theory"", ""local intrinsic dimensionality""]","Given a dataset $\mathcal{D}$ with label noise, how do we learn its underlying noise model? If we assume that the label noise is instance-independent, then the noise model can be represented by a noise transition matrix $Q_{\mathcal{D}}$. Recent work has shown that even without further information about any instances with correct labels, or further assumptions on the distribution of the label noise, it is still possible to estimate $Q_{\mathcal{D}}$ while simultaneously learning a classifier from $\mathcal{D}$. However, this presupposes that a good estimate of $Q_{\mathcal{D}}$ requires an accurate classifier. In this paper, we show that high classification accuracy is actually not required for estimating $Q_{\mathcal{D}}$ well. We shall introduce an information-theoretic-based framework for estimating $Q_{\mathcal{D}}$ solely from $\mathcal{D}$ (without additional information or assumptions). 
At the heart of our framework is a discriminator that predicts whether an input dataset has maximum Shannon entropy, which shall be used on multiple new datasets $\hat{\mathcal{D}}$ synthesized from $\mathcal{D}$ via the insertion of additional label noise. We prove that our estimator for $Q_{\mathcal{D}}$ is statistically consistent, in terms of dataset size, and the number of intermediate datasets $\hat{\mathcal{D}}$ synthesized from $\mathcal{D}$. As a concrete realization of our framework, we shall incorporate local intrinsic dimensionality (LID) into the discriminator, and we show experimentally that with our LID-based discriminator, the estimation error for $Q_{\mathcal{D}}$ can be significantly reduced. We achieved average Kullback--Leibler loss reduction from $0.27$ to $0.17$ for $40\%$ anchor-like samples removal when evaluated on the CIFAR10 with symmetric noise. Although no clean subset of $\mathcal{D}$ is required for our framework to work, we show that our framework can also take advantage of clean data to improve upon existing estimation methods.",/pdf/8fdd635a0ef096d6235dd32122cf7cd82dcb4915.pdf,ICLR,2021,"We introduce a consistent information-theoretic-based estimator for the noise transition matrix of any dataset with instance-independent label noise, without assuming any matrix structure, and without requiring anchor points or clean data." +S14g5s09tm,rJx21z4PYm,1538090000000.0,1545360000000.0,506,Unseen Action Recognition with Unpaired Adversarial Multimodal Learning,"[""ajpiergi@indiana.edu"", ""mryoo@indiana.edu""]","[""AJ Piergiovanni"", ""Michael S. Ryoo""]",[],"In this paper, we present a method to learn a joint multimodal representation space that allows for the recognition of unseen activities in videos. We compare the effect of placing various constraints on the embedding space using paired text and video data. 
Additionally, we propose a method to improve the joint embedding space using an adversarial formulation with unpaired text and video data. In addition to testing on publicly available datasets, we introduce a new, large-scale text/video dataset. We experimentally confirm that learning such shared embedding space benefits three difficult tasks (i) zero-shot activity classification, (ii) unsupervised activity discovery, and (iii) unseen activity captioning. +",/pdf/1bc4cbb67391c9fd5d48b1bab629035b341885d3.pdf,ICLR,2019, +rklJ2CEYPH,ryg1NEF_Pr,1569440000000.0,1577170000000.0,1340,Point Process Flows,"[""nmehrasa@sfu.ca"", ""ruizhid@sfu.ca"", ""mohamed.o.ahmed@borealisai.com"", ""bchang@stat.ubc.ca"", ""jha203@sfu.ca"", ""thibaut.p.durand@borealisai.com"", ""marcus.brubaker@borealisai.com"", ""mori@cs.sfu.ca""]","[""Nazanin Mehrasa"", ""Ruizhi Deng"", ""Mohamed Osama Ahmed"", ""Bo Chang"", ""Jiawei He"", ""Thibaut Durand"", ""Marcus Brubaker"", ""Greg Mori""]","[""Temporal Point Process"", ""Intensity-free Point Process""]",Event sequences can be modeled by temporal point processes (TPPs) to capture their asynchronous and probabilistic nature. We propose an intensity-free framework that directly models the point process as a non-parametric distribution by utilizing normalizing flows. This approach is capable of capturing highly complex temporal distributions and does not rely on restrictive parametric forms. Comparisons with state-of-the-art baseline models on both synthetic and challenging real-life datasets show that the proposed framework is effective at modeling the stochasticity of discrete event sequences. 
,/pdf/fd3a52d3031ac0734727c6d57d71e7c87c773f2d.pdf,ICLR,2020,A non-parametric point process model via Normalizing Flow +BkIkkseAZ,r1NyyilRb,1509110000000.0,1518730000000.0,363,Theoretical properties of the global optimizer of two-layer Neural Network,"[""digvijaybb40@gatech.edu"", ""george.lan@isye.gatech.edu""]","[""Digvijay Boob"", ""Guanghui Lan""]","[""Non-convex optimization"", ""Two-layer Neural Network"", ""global optimality"", ""first-order optimality""]","In this paper, we study the problem of optimizing a two-layer artificial neural network that best fits a training dataset. We look at this problem in the setting where the number of parameters is greater than the number of sampled points. We show that for a wide class of differentiable activation functions (this class involves most nonlinear functions and excludes piecewise linear functions), we have that arbitrary first-order optimal solutions satisfy global optimality provided the hidden layer is non-singular. We essentially show that these non-singular hidden layer matrix satisfy a ``""good"" property for these big class of activation functions. Techniques involved in proving this result inspire us to look at a new algorithmic, where in between two gradient step of hidden layer, we add a stochastic gradient descent (SGD) step of the output layer. In this new algorithmic framework, we extend our earlier result and show that for all finite iterations the hidden layer satisfies the``good"" property mentioned earlier therefore partially explaining success of noisy gradient methods and addressing the issue of data independency of our earlier result. Both of these results are easily extended to hidden layers given by a flat matrix from that of a square matrix. 
Results are applicable even if network has more than one hidden layer provided all inner hidden layers are arbitrary, satisfy non-singularity, all activations are from the given class of differentiable functions and optimization is only with respect to the outermost hidden layer. Separately, we also study the smoothness properties of the objective function and show that it is actually Lipschitz smooth, i.e., its gradients do not change sharply. We use smoothness properties to guarantee asymptotic convergence of $O(1/\text{number of iterations})$ to a first-order optimal solution.",/pdf/10dcbb896bd547ba23360fc33575173cfe48b35f.pdf,ICLR,2018,This paper talks about theoretical properties of first-order optimal point of two layer neural network in over-parametrized case +HJWGdbbCW,ryZMdbbRZ,1509130000000.0,1518730000000.0,693,Reinforcement and Imitation Learning for Diverse Visuomotor Skills,"[""yukez@cs.stanford.edu"", ""ziyu@google.com"", ""jsmerel@google.com"", ""andreirusu@google.com"", ""etom@google.com"", ""cabi@google.com"", ""stunya@google.com"", ""janosk@google.com"", ""raia@google.com"", ""nandodefreitas@google.com"", ""heess@google.com""]","[""Yuke Zhu"", ""Ziyu Wang"", ""Josh Merel"", ""Andrei Rusu"", ""Tom Erez"", ""Serkan Cabi"", ""Saran Tunyasuvunakool"", ""J\u00e1nos Kram\u00e1r"", ""Raia Hadsell"", ""Nando de Freitas"", ""Nicolas Heess""]","[""reinforcement learning"", ""imitation learning"", ""robotics"", ""visuomotor skills""]","We propose a general deep reinforcement learning method and apply it to robot manipulation tasks. Our approach leverages demonstration data to assist a reinforcement learning agent in learning to solve a wide range of tasks, mainly previously unsolved. We train visuomotor policies end-to-end to learn a direct mapping from RGB camera inputs to joint velocities. 
Our experiments indicate that our reinforcement and imitation approach can solve contact-rich robot manipulation tasks that neither the state-of-the-art reinforcement nor imitation learning method can solve alone. We also illustrate that these policies achieved zero-shot sim2real transfer by training with large visual and dynamics variations.",/pdf/c39982c4ecfdad60fa6d9794ddff7574bb1cb21d.pdf,ICLR,2018,combine reinforcement learning and imitation learning to solve complex robot manipulation tasks from pixels +SyfXKoRqFQ,r1eVUcDcFm,1538090000000.0,1545360000000.0,430,Ada-Boundary: Accelerating the DNN Training via Adaptive Boundary Batch Selection,"[""songhwanjun@kaist.ac.kr"", ""sundong.kim@kaist.ac.kr"", ""minseokkim@kaist.ac.kr"", ""jaegil@kaist.ac.kr""]","[""Hwanjun Song"", ""Sundong Kim"", ""Minseok Kim"", ""Jae-Gil Lee""]","[""acceleration"", ""batch selection"", ""convergence"", ""decision boundary""]","Neural networks can converge faster with help from a smarter batch selection strategy. In this regard, we propose Ada-Boundary, a novel adaptive-batch selection algorithm that constructs an effective mini-batch according to the learning progress of the model. Our key idea is to present confusing samples what the true label is. Thus, the samples near the current decision boundary are considered as the most effective to expedite convergence. Taking advantage of our design, Ada-Boundary maintains its dominance in various degrees of training difficulty. We demonstrate the advantage of Ada-Boundary by extensive experiments using two convolutional neural networks for three benchmark data sets. The experiment results show that Ada-Boundary improves the training time by up to 31.7% compared with the state-of-the-art strategy and by up to 33.5% compared with the baseline strategy.",/pdf/6794259f0359a746ac73f4e424a9116acde6b905.pdf,ICLR,2019,We suggest a smart batch selection technique called Ada-Boundary. 
+ryxdEkHtPS,BygoPa3dvH,1569440000000.0,1585850000000.0,1660,A Closer Look at Deep Policy Gradients,"[""ailyas@mit.edu"", ""engstrom@mit.edu"", ""shibani@mit.edu"", ""tsipras@mit.edu"", ""firdaus.janoos@twosigma.com"", ""rudolph@csail.mit.edu"", ""madry@mit.edu""]","[""Andrew Ilyas"", ""Logan Engstrom"", ""Shibani Santurkar"", ""Dimitris Tsipras"", ""Firdaus Janoos"", ""Larry Rudolph"", ""Aleksander Madry""]","[""deep policy gradient methods"", ""deep reinforcement learning"", ""trpo"", ""ppo""]"," We study how the behavior of deep policy gradient algorithms reflects the conceptual framework motivating their development. To this end, we propose a fine-grained analysis of state-of-the-art methods based on key elements of this framework: gradient estimation, value prediction, and optimization landscapes. Our results show that the behavior of deep policy gradient algorithms often deviates from what their motivating framework would predict: surrogate rewards do not match the true reward landscape, learned value estimators fail to fit the true value function, and gradient estimates poorly correlate with the ""true"" gradient. The mismatch between predicted and empirical behavior we uncover highlights our poor understanding of current methods, and indicates the need to move beyond current benchmark-centric evaluation methods.",/pdf/ff9e9a4c93a033da68f30df6d8e1bf9e0611b130.pdf,ICLR,2020, +HklRKpEKDr,BJx4Q8pDDH,1569440000000.0,1577170000000.0,689,Deep Coordination Graphs,"[""wendelin.boehmer@cs.ox.ac.uk"", ""vitaly.kurin@cs.ox.ac.uk"", ""shimon.whiteson@cs.ox.ac.uk""]","[""Wendelin Boehmer"", ""Vitaly Kurin"", ""Shimon Whiteson""]","[""multi-agent reinforcement learning"", ""coordination graph"", ""deep Q-learning"", ""value factorization"", ""relative overgeneralization""]","This paper introduces the deep coordination graph (DCG) for collaborative multi-agent reinforcement learning. 
DCG strikes a flexible trade-off between representational capacity and generalization by factorizing the joint value function of all agents according to a coordination graph into payoffs between pairs of agents. The value can be maximized by local message passing along the graph, which allows training of the value function end-to-end with Q-learning. Payoff functions are approximated with deep neural networks and parameter sharing improves generalization over the state-action space. We show that DCG can solve challenging predator-prey tasks that are vulnerable to the relative overgeneralization pathology and in which all other known value factorization approaches fail.",/pdf/90fc6c82d7c1ff65de1eede66ba89cfac0950e38.pdf,ICLR,2020,We introduce an efficient value factorization architecture for MARL that is defined by a coordination graph. +4T489T4yav,WQCAqNB0a6n,1601310000000.0,1611610000000.0,3185,Differentiable Segmentation of Sequences,"[""~Erik_Scharw\u00e4chter1"", ""jlen@uni-bonn.de"", ""emmanuel.mueller@cs.tu-dortmund.de""]","[""Erik Scharw\u00e4chter"", ""Jonathan Lennartz"", ""Emmanuel M\u00fcller""]","[""segmented models"", ""segmentation"", ""change point detection"", ""concept drift"", ""warping functions"", ""gradient descent""]","Segmented models are widely used to describe non-stationary sequential data with discrete change points. Their estimation usually requires solving a mixed discrete-continuous optimization problem, where the segmentation is the discrete part and all other model parameters are continuous. A number of estimation algorithms have been developed that are highly specialized for their specific model assumptions. The dependence on non-standard algorithms makes it hard to integrate segmented models in state-of-the-art deep learning architectures that critically depend on gradient-based optimization techniques. 
In this work, we formulate a relaxed variant of segmented models that enables joint estimation of all model parameters, including the segmentation, with gradient descent. We build on recent advances in learning continuous warping functions and propose a novel family of warping functions based on the two-sided power (TSP) distribution. TSP-based warping functions are differentiable, have simple closed-form expressions, and can represent segmentation functions exactly. Our formulation includes the important class of segmented generalized linear models as a special case, which makes it highly versatile. We use our approach to model the spread of COVID-19 with Poisson regression, apply it on a change point detection task, and learn classification models with concept drift. The experiments show that our approach effectively learns all these tasks with standard algorithms for gradient descent.",/pdf/211648c2242f789fd76f662801f326094db7433d.pdf,ICLR,2021,We propose an architecture for effective gradient-based learning of segmented models for sequential data. +rJlpUiAcYX,HJl66zIqtm,1538090000000.0,1545360000000.0,214,Holographic and other Point Set Distances for Machine Learning,"[""lukas.balles@tuebingen.mpg.de"", ""tfish@google.com""]","[""Lukas Balles"", ""Thomas Fischbacher""]","[""point set"", ""set"", ""permutation-invariant"", ""loss function""]","We introduce an analytic distance function for moderately sized point sets of known cardinality that is shown to have very desirable properties, both as a loss function as well as a regularizer for machine learning applications. We compare our novel construction to other point set distance functions and show proof of concept experiments for training neural networks end-to-end on point set prediction tasks such as object detection.",/pdf/760c6a4f1a4af8529053ce3a0a20b0d2dd1ff191.pdf,ICLR,2019,Permutation-invariant loss function for point set prediction. 
+HJx38iC5KX,BJlSE3MFK7,1538090000000.0,1545360000000.0,209,Domain Generalization via Invariant Representation under Domain-Class Dependency,"[""akuzawa-kei@weblab.t.u-tokyo.ac.jp"", ""iwasawa@weblab.t.u-tokyo.ac.jp"", ""matsuo@weblab.t.u-tokyo.ac.jp""]","[""Kei Akuzawa"", ""Yusuke Iwasawa"", ""Yutaka Matsuo""]","[""domain generalization"", ""adversarial learning"", ""invariant feature learning""]","Learning domain-invariant representation is a dominant approach for domain generalization, where we need to build a classifier that is robust toward domain shifts induced by change of users, acoustic or lighting conditions, etc. However, prior domain-invariance-based methods overlooked the underlying dependency of classes (target variable) on source domains during optimization, which causes the trade-off between classification accuracy and domain-invariance, and often interferes with the domain generalization performance. This study first provides the notion of domain generalization under domain-class dependency and elaborates on the importance of considering the dependency by expanding the analysis of Xie et al. (2017). We then propose a method, invariant feature learning under optimal classifier constrains (IFLOC), which explicitly considers the dependency and maintains accuracy while improving domain-invariance. Specifically, the proposed method regularizes the representation so that it has as much domain information as the class labels, unlike prior methods that remove all domain information. 
Empirical validations show the superior performance of IFLOC to baseline methods, supporting the importance of the domain-class dependency in domain generalization and the efficacy of the proposed method for overcoming the issue.",/pdf/aa5133e6112e107e01246a64a0d87a19876b7b7f.pdf,ICLR,2019,Address the trade-off caused by the dependency of classes on domains in domain generalization +60j5LygnmD,qW0BfGLi396,1601310000000.0,1616710000000.0,1967,Meta-learning with negative learning rates,"[""~Alberto_Bernacchia1""]","[""Alberto Bernacchia""]","[""Meta-learning""]","Deep learning models require a large amount of data to perform well. When data is scarce for a target task, we can transfer the knowledge gained by training on similar tasks to quickly learn the target. A successful approach is meta-learning, or ""learning to learn"" a distribution of tasks, where ""learning"" is represented by an outer loop, and ""to learn"" by an inner loop of gradient descent. However, a number of recent empirical studies argue that the inner loop is unnecessary and more simple models work equally well or even better. We study the performance of MAML as a function of the learning rate of the inner loop, where zero learning rate implies that there is no inner loop. Using random matrix theory and exact solutions of linear models, we calculate an algebraic expression for the test loss of MAML applied to mixed linear regression and nonlinear regression with overparameterized models. Surprisingly, while the optimal learning rate for adaptation is positive, we find that the optimal learning rate for training is always negative, a setting that has never been considered before. Therefore, not only does the performance increase by decreasing the learning rate to zero, as suggested by recent work, but it can be increased even further by decreasing the learning rate to negative +values. 
These results help clarify under what circumstances meta-learning performs best.",/pdf/af54b731e47f23c1fd6d58dd050db1ab79d41b33.pdf,ICLR,2021,We show theoretically that the optimal inner learning rate of MAML during training is always negative in a family of models +r1ayG7WRZ,SJQ2bmZCb,1509140000000.0,1518730000000.0,1133,Don't encrypt the data; just approximate the model \ Towards Secure Transaction and Fair Pricing of Training Data,"[""xxu@hmc.edu""]","[""Xinlei Xu""]","[""Applications"", ""Security in Machine Learning"", ""Fairness and Security"", ""Model Compression""]","As machine learning becomes ubiquitous, deployed systems need to be as accurate as they can. As a result, machine learning service providers have a surging need for useful, additional training data that benefits training, without giving up all the details about the trained program. At the same time, data owners would like to trade their data for its value, without having to first give away the data itself before receiving compensation. It is difficult for data providers and model providers to agree on a fair price without first revealing the data or the trained model to the other side. Escrow systems only complicate this further, adding an additional layer of trust required of both parties. Currently, data owners and model owners don’t have a fair pricing system that eliminates the need to trust a third party and training the model on the data, which 1) takes a long time to complete, 2) does not guarantee that useful data is paid valuably and that useless data isn’t, without trusting in the third party with both the model and the data. Existing improvements to secure the transaction focus heavily on encrypting or approximating the data, such as training on encrypted data, and variants of federated learning. 
As powerful as the methods appear to be, we show them to be impractical in our use case with real world assumptions for preserving privacy for the data owners when facing black-box models. Thus, a fair pricing scheme that does not rely on secure data encryption and obfuscation is needed before the exchange of data. This paper proposes a novel method for fair pricing using data-model efficacy techniques such as influence functions, model extraction, and model compression methods, thus enabling secure data transactions. We successfully show that without running the data through the model, one can approximate the value of the data; that is, if the data turns out redundant, the pricing is minimal, and if the data leads to proper improvement, its value is properly assessed, without placing strong assumptions on the nature of the model. Future work will be focused on establishing a system with stronger transactional security against adversarial attacks that will reveal details about the model or the data to the other party.",/pdf/69170f53ffe9f431f2c54cd1a453add292d356cb.pdf,ICLR,2018,"Facing complex, black-box models, encrypting the data is not as usable as approximating the model and using it to price a potential transaction." +SJlpYJBKvH,HJlfjDRuDB,1569440000000.0,1583910000000.0,1857,Measuring the Reliability of Reinforcement Learning Algorithms,"[""scychan@google.com"", ""sfishman@google.com"", ""kbanoop@google.com"", ""canny@google.com"", ""sguada@google.com""]","[""Stephanie C.Y. Chan"", ""Samuel Fishman"", ""Anoop Korattikara"", ""John Canny"", ""Sergio Guadarrama""]","[""reinforcement learning"", ""metrics"", ""statistics"", ""reliability""]","Lack of reliability is a well-known issue for reinforcement learning (RL) algorithms. This problem has gained increasing attention in recent years, and efforts to improve it have grown substantially. 
To aid RL researchers and production users with the evaluation and improvement of reliability, we propose a set of metrics that quantitatively measure different aspects of reliability. In this work, we focus on variability and risk, both during training and after learning (on a fixed policy). We designed these metrics to be general-purpose, and we also designed complementary statistical tests to enable rigorous comparisons on these metrics. In this paper, we first describe the desired properties of the metrics and their design, the aspects of reliability that they measure, and their applicability to different scenarios. We then describe the statistical tests and make additional practical recommendations for reporting results. The metrics and accompanying statistical tools have been made available as an open-source library. We apply our metrics to a set of common RL algorithms and environments, compare them, and analyze the results.",/pdf/4b88278b4db6dff57af01390b2d4bc557699fbd7.pdf,ICLR,2020,A novel set of metrics for measuring reliability of reinforcement learning algorithms (+ accompanying statistical tests) +rklfIeSFwS,Bkec5FeYvH,1569440000000.0,1577170000000.0,2314,CNAS: Channel-Level Neural Architecture Search,"[""skyde1021@dgist.ac.kr"", ""mskim@dgist.ac.kr"", ""jinjun@us.ibm.com""]","[""Heechul Lim"", ""Min-Soo Kim"", ""Jinjun Xiong""]","[""Neural architecture search""]","There is growing interest in automating designing good neural network architectures. The NAS methods proposed recently have significantly reduced architecture search cost by sharing parameters, but there is still a challenging problem of designing search space. We consider search space is typically defined with its shape and a set of operations and propose a channel-level architecture search\,(CNAS) method using only a fixed type of operation. The resulting architecture is sparse in terms of channel and has different topology at different cell. 
The experimental results for CIFAR-10 and ImageNet show that a fine-granular and sparse model searched by CNAS achieves very competitive performance with dense models searched by the existing methods.",/pdf/0d54570760447d6f3b2ed93b2bcdcd2b464cd809.pdf,ICLR,2020, +S1xO4xHFvB,SylURUeYPH,1569440000000.0,1577170000000.0,2253,Atomic Compression Networks,"[""falkner@ismll.uni-hildesheim.de"", ""josif@ismll.uni-hildesheim.de"", ""schmidt-thieme@ismll.uni-hildesheim.de""]","[""Jonas Falkner"", ""Josif Grabocka"", ""Lars Schmidt-Thieme""]","[""Network Compression""]","Compressed forms of deep neural networks are essential in deploying large-scale +computational models on resource-constrained devices. Contrary to analogous +domains where large-scale systems are build as a hierarchical repetition of small- +scale units, the current practice in Machine Learning largely relies on models with +non-repetitive components. In the spirit of molecular composition with repeating +atoms, we advance the state-of-the-art in model compression by proposing Atomic +Compression Networks (ACNs), a novel architecture that is constructed by recursive +repetition of a small set of neurons. In other words, the same neurons with the +same weights are stochastically re-positioned in subsequent layers of the network. +Empirical evidence suggests that ACNs achieve compression rates of up to three +orders of magnitudes compared to fine-tuned fully-connected neural networks (88× +to 1116× reduction) with only a fractional deterioration of classification accuracy +(0.15% to 5.33%). 
Moreover our method can yield sub-linear model complexities +and permits learning deep ACNs with less parameters than a logistic regression +with no decline in classification accuracy.",/pdf/2796e7b556b2bc5b2b32b87d3dc4532b731988a5.pdf,ICLR,2020,"We advance the state-of-the-art in model compression by proposing Atomic Compression Networks (ACNs), a novel architecture that is constructed by recursive repetition of a small set of neurons." +BJgnXpVYwS,HJlr40xDDH,1569440000000.0,1583910000000.0,464,Why Gradient Clipping Accelerates Training: A Theoretical Justification for Adaptivity,"[""jzhzhang@mit.edu"", ""tianxing@mit.edu"", ""suvrit@mit.edu"", ""jadbabai@mit.edu""]","[""Jingzhao Zhang"", ""Tianxing He"", ""Suvrit Sra"", ""Ali Jadbabaie""]","[""Adaptive methods"", ""optimization"", ""deep learning""]","We provide a theoretical explanation for the effectiveness of gradient clipping in training deep neural networks. The key ingredient is a new smoothness condition derived from practical neural network training examples. We observe that gradient smoothness, a concept central to the analysis of first-order optimization algorithms that is often assumed to be a constant, demonstrates significant variability along the training trajectory of deep neural networks. Further, this smoothness positively correlates with the gradient norm, and contrary to standard assumptions in the literature, it can grow with the norm of the gradient. These empirical observations limit the applicability of existing theoretical analyses of algorithms that rely on a fixed bound on smoothness. These observations motivate us to introduce a novel relaxation of gradient smoothness that is weaker than the commonly used Lipschitz smoothness assumption. Under the new condition, we prove that two popular methods, namely, gradient clipping and normalized gradient, converge arbitrarily faster than gradient descent with fixed stepsize. 
We further explain why such adaptively scaled gradient methods can accelerate empirical convergence and verify our results empirically in popular neural network training settings.",/pdf/3b53be5599473dd2d2275e3acaa1e2890cad4e1e.pdf,ICLR,2020,Gradient clipping provably accelerates gradient descent for non-smooth non-convex functions. +SJ71VXZAZ,Hypi77-R-,1509140000000.0,1518730000000.0,1157,Learning To Generate Reviews and Discovering Sentiment,"[""alec@openai.com"", ""rafal@openai.com"", ""ilya@openai.com""]","[""Alec Radford"", ""Rafal Jozefowicz"", ""Ilya Sutskever""]","[""unsupervised learning"", ""representation learning"", ""deep learning""]","We explore the properties of byte-level recurrent language models. When given sufficient amounts of capacity, training data, and compute time, the representations learned by these models include disentangled features corresponding to high-level concepts. Specifically, we find a single unit which performs sentiment analysis. These representations, learned in an unsupervised manner, achieve state of the art on the binary subset of the Stanford Sentiment Treebank. They are also very data efficient. When using only a handful of labeled examples, our approach matches the performance of strong baselines trained on full datasets. We also demonstrate the sentiment unit has a direct influence on the generative process of the model. Simply fixing its value to be positive or negative generates samples with the corresponding positive or negative sentiment.",/pdf/82eaeeca82af695721cc73403066982e93ef60d2.pdf,ICLR,2018,Byte-level recurrent language models learn high-quality domain specific representations of text. 
+Hkg313AcFX,ByltmgAqFm,1538090000000.0,1545360000000.0,1019,Metropolis-Hastings view on variational inference and adversarial training,"[""k.necludov@gmail.com"", ""vetrodim@gmail.com""]","[""Kirill Neklyudov"", ""Dmitry Vetrov""]","[""MCMC"", ""GANs"", ""Variational Inference""]","In this paper we propose to view the acceptance rate of the Metropolis-Hastings algorithm as a universal objective for learning to sample from target distribution -- given either as a set of samples or in the form of unnormalized density. This point of view unifies the goals of such approaches as Markov Chain Monte Carlo (MCMC), Generative Adversarial Networks (GANs), variational inference. To reveal the connection we derive the lower bound on the acceptance rate and treat it as the objective for learning explicit and implicit samplers. The form of the lower bound allows for doubly stochastic gradient optimization in case the target distribution factorizes (i.e. over data points). We empirically validate our approach on Bayesian inference for neural networks and generative models for images.",/pdf/a8d3b96a9cd3fe5265667d660bdece339323d74d.pdf,ICLR,2019,Learning to sample via lower bounding the acceptance rate of the Metropolis-Hastings algorithm +rylXBkrYDS,H1eN8Wp_PH,1569440000000.0,1584930000000.0,1686,A Baseline for Few-Shot Image Classification,"[""guneetdhillon@utexas.edu"", ""pratikac@seas.upenn.edu"", ""avinash.a.ravichandran@gmail.com"", ""soattos@amazon.com""]","[""Guneet Singh Dhillon"", ""Pratik Chaudhari"", ""Avinash Ravichandran"", ""Stefano Soatto""]","[""few-shot learning"", ""transductive learning"", ""fine-tuning"", ""baseline"", ""meta-learning""]","Fine-tuning a deep network trained with the standard cross-entropy loss is a strong baseline for few-shot learning. When fine-tuned transductively, this outperforms the current state-of-the-art on standard datasets such as Mini-ImageNet, Tiered-ImageNet, CIFAR-FS and FC-100 with the same hyper-parameters. 
The simplicity of this approach enables us to demonstrate the first few-shot learning results on the ImageNet-21k dataset. We find that using a large number of meta-training classes results in high few-shot accuracies even for a large number of few-shot classes. We do not advocate our approach as the solution for few-shot learning, but simply use the results to highlight limitations of current benchmarks and few-shot protocols. We perform extensive studies on benchmark datasets to propose a metric that quantifies the ""hardness"" of a few-shot episode. This metric can be used to report the performance of few-shot algorithms in a more systematic way.",/pdf/2273b0b94259b117f20a6e058a6a0917d9522896.pdf,ICLR,2020,Transductive fine-tuning of a deep network is a strong baseline for few-shot image classification and outperforms the state-of-the-art on all standard benchmarks. +By-IifZRW,ryN4iMZ0W,1509140000000.0,1518730000000.0,936,Gaussian Process Neurons,"[""surban@tum.de"", ""smagt@brml.org""]","[""Sebastian Urban"", ""Patrick van der Smagt""]","[""gaussian process neuron activation function stochastic transfer function learning variational bayes probabilistic""]","We propose a method to learn stochastic activation functions for use in probabilistic neural networks. +First, we develop a framework to embed stochastic activation functions based on Gaussian processes in probabilistic neural networks. +Second, we analytically derive expressions for the propagation of means and covariances in such a network, thus allowing for an efficient implementation and training without the need for sampling. +Third, we show how to apply variational Bayesian inference to regularize and efficiently train this model. +The resulting model can deal with uncertain inputs and implicitly provides an estimate of the confidence of its predictions. 
+Like a conventional neural network it can scale to datasets of arbitrary size and be extended with convolutional and recurrent connections, if desired.",/pdf/ade247befee7fc59f5293b8e372be246ef9e3cc3.pdf,ICLR,2018,We model the activation function of each neuron as a Gaussian Process and learn it alongside the weight with Variational Inference. +HJeqWztlg,,1478200000000.0,1484380000000.0,78,Hierarchical compositional feature learning,"[""miguel@vicarious.com"", ""yi@vicarious.com"", ""scott@vicarious.com"", ""dileep@vicarious.com""]","[""Miguel Lazaro-Gredilla"", ""Yi Liu"", ""D. Scott Phoenix"", ""Dileep George""]","[""Unsupervised Learning""]","We introduce the hierarchical compositional network (HCN), a directed generative model able to discover and disentangle, without supervision, the building blocks of a set of binary images. The building blocks are binary features defined hierarchically as a composition of some of the features in the layer immediately below, arranged in a particular manner. At a high level, HCN is similar to a sigmoid belief network with pooling. Inference and learning in HCN are very challenging and existing variational approximations do not work satisfactorily. A main contribution of this work is to show that both can be addressed using max-product message passing (MPMP) with a particular schedule (no EM required). Also, using MPMP as an inference engine for HCN makes new tasks simple: adding supervision information, classifying images, or performing inpainting all correspond to clamping some variables of the model to their known values and running MPMP on the rest. When used for classification, fast inference with HCN has exactly the same functional form as a convolutional neural network (CNN) with linear activations and binary weights. 
However, HCN’s features are qualitatively very different.",/pdf/3fc9054b95c774ce389c2f546e49f904db28d83f.pdf,ICLR,2017,"We show that max-product message passing with an appropriate schedule can be used to perform inference and learning in a directed multilayer generative model, thus recovering interpretable features from binary images." +X4y_10OX-hX,eU0eGAKWjbX,1601310000000.0,1614720000000.0,868,Large Associative Memory Problem in Neurobiology and Machine Learning,"[""~Dmitry_Krotov2"", ""~John_J._Hopfield1""]","[""Dmitry Krotov"", ""John J. Hopfield""]","[""associative memory"", ""Hopfield networks"", ""modern Hopfield networks"", ""neuroscience""]","Dense Associative Memories or modern Hopfield networks permit storage and reliable retrieval of an exponentially large (in the dimension of feature space) number of memories. At the same time, their naive implementation is non-biological, since it seemingly requires the existence of many-body synaptic junctions between the neurons. We show that these models are effective descriptions of a more microscopic (written in terms of biological degrees of freedom) theory that has additional (hidden) neurons and only requires two-body interactions between them. For this reason our proposed microscopic theory is a valid model of large associative memory with a degree of biological plausibility. The dynamics of our network and its reduced dimensional equivalent both minimize energy (Lyapunov) functions. When certain dynamical variables (hidden neurons) are integrated out from our microscopic theory, one can recover many of the models that were previously discussed in the literature, e.g. the model presented in ""Hopfield Networks is All You Need"" paper. 
We also provide an alternative derivation of the energy function and the update rule proposed in the aforementioned paper and clarify the relationships between various models of this class.",/pdf/707e3c461fd1c2e94f05cad3278e6aaa3a32f2de.pdf,ICLR,2021,Our paper proposes a microscopic biologically-plausible theory of modern Hopfield networks. +BJgAf6Etwr,HyeyMraUPH,1569440000000.0,1577170000000.0,432,XLDA: Cross-Lingual Data Augmentation for Natural Language Inference and Question Answering,"[""jasdeep@cs.stanford.edu"", ""bmccann@salesforce.com"", ""nkeskar@salesforce.com"", ""cxiong@salesforce.com"", ""rsocher@salesforce.com""]","[""Jasdeep Singh"", ""Bryan McCann"", ""Nitish Shirish Keskar"", ""Caiming Xiong"", ""Richard Socher""]","[""cross-lingual"", ""transfer learning"", ""BERT""]","While natural language processing systems often focus on a single language, multilingual transfer learning has the potential to improve performance, especially for low-resource languages. +We introduce XLDA, cross-lingual data augmentation, a method that replaces a segment of the input text with its translation in another language. XLDA enhances performance of all 14 tested languages of the cross-lingual natural language inference (XNLI) benchmark. With improvements of up to 4.8, training with XLDA achieves state-of-the-art performance for Greek, Turkish, and Urdu. XLDA is in contrast to, and performs markedly better than, a more naive approach that aggregates examples in various languages in a way that each example is solely in one language. On the SQuAD question answering task, we see that XLDA provides a 1.0 performance increase on the English evaluation set. 
Comprehensive experiments suggest that most languages are effective as cross-lingual augmentors, that XLDA is robust to a wide range of translation quality, and that XLDA is even more effective for randomly initialized models than for pretrained models.",/pdf/ea36a6d49738d244781fdda6276a9281382ce1e4.pdf,ICLR,2020,Translating portions of the input during training can improve cross-lingual performance. +SJ19eUg0-,SJAFlLeA-,1509090000000.0,1518730000000.0,268,BLOCK-DIAGONAL HESSIAN-FREE OPTIMIZATION FOR TRAINING NEURAL NETWORKS,"[""hzhan23@syr.edu"", ""cxiong@salesforce.com"", ""james.bradbury@salesforce.com"", ""richard@socher.org""]","[""Huishuai Zhang"", ""Caiming Xiong"", ""James Bradbury"", ""Richard Socher""]","[""deep learning"", ""second-order optimization"", ""hessian free""]","Second-order methods for neural network optimization have several advantages over methods based on first-order gradient descent, including better scaling to large mini-batch sizes and fewer updates needed for convergence. But they are rarely applied to deep learning in practice because of high computational cost and the need for model-dependent algorithmic variations. We introduce a variant of the Hessian-free method that leverages a block-diagonal approximation of the generalized Gauss-Newton matrix. Our method computes the curvature approximation matrix only for pairs of parameters from the same layer or block of the neural network and performs conjugate gradient updates independently for each block. 
Experiments on deep autoencoders, deep convolutional networks, and multilayer LSTMs demonstrate better convergence and generalization compared to the original Hessian-free approach and the Adam method.",/pdf/24d0da99dcaf71bb338d27742f4a21fa1b547b6e.pdf,ICLR,2018, +bhCDO_cEGCz,xVpOfm78OJo,1601310000000.0,1617420000000.0,2607,Grounding Physical Concepts of Objects and Events Through Dynamic Visual Reasoning,"[""~Zhenfang_Chen1"", ""~Jiayuan_Mao1"", ""~Jiajun_Wu1"", ""~Kwan-Yee_Kenneth_Wong2"", ""~Joshua_B._Tenenbaum1"", ""~Chuang_Gan1""]","[""Zhenfang Chen"", ""Jiayuan Mao"", ""Jiajun Wu"", ""Kwan-Yee Kenneth Wong"", ""Joshua B. Tenenbaum"", ""Chuang Gan""]","[""Concept Learning"", ""Neuro-Symbolic Learning"", ""Video Reasoning"", ""Visual Reasoning""]","We study the problem of dynamic visual reasoning on raw videos. This is a challenging problem; currently, state-of-the-art models often require dense supervision on physical object properties and events from simulation, which are impractical to obtain in real life. In this paper, we present the Dynamic Concept Learner (DCL), a unified framework that grounds physical objects and events from video and language. DCL first adopts a trajectory extractor to track each object over time and to represent it as a latent, object-centric feature vector. Building upon this object-centric representation, DCL learns to approximate the dynamic interaction among objects using graph networks. DCL further incorporates a semantic parser to parse questions into semantic programs and, finally, a program executor to run the program to answer the question, leveraging the learned dynamics model. After training, DCL can detect and associate objects across the frames, ground visual properties and physical events, understand the causal relationship between events, make future and counterfactual predictions, and leverage these extracted representations for answering queries. 
DCL achieves state-of-the-art performance on CLEVRER, a challenging causal video reasoning dataset, even without using ground-truth attributes and collision labels from simulations for training. We further test DCL on a newly proposed video-retrieval and event localization dataset derived from CLEVRER, showing its strong generalization capacity.",/pdf/b0012d0f037d3416af76be33e23dacc31d14746f.pdf,ICLR,2021,We propose a neural-symbolic framework to learn physical concepts of objects and events via causal reasoning on videos. +rJg3zxBYwH,Skg0uQgtDH,1569440000000.0,1577170000000.0,2189,Learning Likelihoods with Conditional Normalizing Flows ,"[""christina.winkler.94@gmail.com"", ""d.e.worrall@uva.nl"", ""e.hoogeboom@uva.nl"", ""m.welling@uva.nl""]","[""Christina Winkler"", ""Daniel Worrall"", ""Emiel Hoogeboom"", ""Max Welling""]","[""Likelihood learning"", ""conditional normalizing flows"", ""generative modelling"", ""super-resolution"", ""vessel segmentation""]","Normalizing Flows (NFs) are able to model complicated distributions p(y) with strong inter-dimensional correlations and high multimodality by transforming a simple base density p(z) through an invertible neural network under the change of variables formula. Such behavior is desirable in multivariate structured prediction tasks, where handcrafted per-pixel loss-based methods inadequately capture strong correlations between output dimensions. We present a study of conditional normalizing flows (CNFs), a class of NFs where the base density to output space mapping is conditioned on an input x, to model conditional densities p(y|x). CNFs are efficient in sampling and inference, they can be trained with a likelihood-based objective, and CNFs, being generative flows, do not suffer from mode collapse or training instabilities. 
We provide an effective method to train continuous CNFs for binary problems and in particular, we apply these CNFs to super-resolution and vessel segmentation tasks demonstrating competitive performance on standard benchmark datasets in terms of likelihood and conventional metrics.",/pdf/c5c80e5f52349ab77e4cc93b2f766836d0664bc1.pdf,ICLR,2020, +S1fcY-Z0-,SyWqYZW0W,1509130000000.0,1518730000000.0,702,Bayesian Hypernetworks,"[""david.scott.krueger@gmail.com"", ""chin-wei.huang@umontreal.ca"", ""riashat.islam@mail.mcgill.ca"", ""turnerry@iro.umontreal.ca"", ""allac@elementai.com"", ""aaron.courville@gmail.com""]","[""David Krueger"", ""Chin-Wei Huang"", ""Riashat Islam"", ""Ryan Turner"", ""Alexandre Lacoste"", ""Aaron Courville""]","[""variational inference"", ""bayesian inference"", ""deep networks""]","We propose Bayesian hypernetworks: a framework for approximate Bayesian inference in neural networks. A Bayesian hypernetwork, h, is a neural network which learns to transform a simple noise distribution, p(e) = N(0,I), to a distribution q(t) := q(h(e)) over the parameters t of another neural network (the ``primary network). We train q with variational inference, using an invertible h to enable efficient estimation of the variational lower bound on the posterior p(t | D) via sampling. In contrast to most methods for Bayesian deep learning, Bayesian hypernets can represent a complex multimodal approximate posterior with correlations between parameters, while enabling cheap iid sampling of q(t). In practice, Bayesian hypernets provide a better defense against adversarial examples than dropout, and also exhibit competitive performance on a suite of tasks which evaluate model uncertainty, including regularization, active learning, and anomaly detection. +",/pdf/44e489beb75a4b7f95637556cbca03bb4a801b28.pdf,ICLR,2018,We propose Bayesian hypernetworks: a framework for approximate Bayesian inference in neural networks. 
+ryeG924twB,H1lmdcwCUr,1569440000000.0,1583910000000.0,108,Learning Expensive Coordination: An Event-Based Deep RL Approach,"[""shizhy6@mail2.sysu.edu.cn"", ""runsheng.yu@ntu.edu.sg"", ""xwang033@e.ntu.edu.sg"", ""rundong001@e.ntu.edu.sg"", ""yzhang137@e.ntu.edu.sg"", ""laihanj3@mail.sysu.edu.cn"", ""boan@ntu.edu.sg""]","[""Zhenyu Shi*"", ""Runsheng Yu*"", ""Xinrun Wang*"", ""Rundong Wang"", ""Youzhi Zhang"", ""Hanjiang Lai"", ""Bo An""]","[""Multi-Agent Deep Reinforcement Learning"", ""Deep Reinforcement Learning"", ""Leader\u2013Follower Markov Game"", ""Expensive Coordination""]","Existing works in deep Multi-Agent Reinforcement Learning (MARL) mainly focus on coordinating cooperative agents to complete certain tasks jointly. However, in many cases of the real world, agents are self-interested such as employees in a company and clubs in a league. Therefore, the leader, i.e., the manager of the company or the league, needs to provide bonuses to followers for efficient coordination, which we call expensive coordination. The main difficulties of expensive coordination are that i) the leader has to consider the long-term effect and predict the followers' behaviors when assigning bonuses and ii) the complex interactions between followers make the training process hard to converge, especially when the leader's policy changes with time. In this work, we address this problem through an event-based deep RL approach. Our main contributions are threefold. (1) We model the leader's decision-making process as a semi-Markov Decision Process and propose a novel multi-agent event-based policy gradient to learn the leader's long-term policy. (2) We exploit the leader-follower consistency scheme to design a follower-aware module and a follower-specific attention module to predict the followers' behaviors and make accurate response to their behaviors. 
(3) We propose an action abstraction-based policy gradient algorithm to reduce the followers' decision space and thus accelerate the training process of followers. Experiments in resource collections, navigation, and the predator-prey game reveal that our approach outperforms the state-of-the-art methods dramatically.",/pdf/a0af93829f387e544131364f643d911eb03803ae.pdf,ICLR,2020,We propose an event-based policy gradient to train the leader and an action abstraction policy gradient to train the followers in leader-follower Markov game. +S1xCuTNYDr,r1xM0FhPPH,1569440000000.0,1577170000000.0,651,Regularizing Black-box Models for Improved Interpretability,"[""gdplumb@andrew.cmu.edu"", ""alshedivat@cs.cmu.edu"", ""epxing@cs.cmu.edu"", ""talwalkar@cmu.edu""]","[""Gregory Plumb"", ""Maruan Al-Shedivat"", ""Eric Xing"", ""Ameet Talwalkar""]","[""Interpretable Machine Learning"", ""Local Explanations"", ""Regularization""]","Most of the work on interpretable machine learning has focused on designing either inherently interpretable models, which typically trade-off accuracy for interpretability, or post-hoc explanation systems, which lack guarantees about their explanation quality. We explore an alternative to these approaches by directly regularizing a black-box model for interpretability at training time. Our approach explicitly connects three key aspects of interpretable machine learning: (i) the model’s internal interpretability, (ii) the explanation system used at test time, and (iii) the metrics that measure explanation quality. Our regularization results in substantial improvement in terms of the explanation fidelity and stability metrics across a range of datasets and black-box explanation systems while slightly improving accuracy. 
Finally, we justify theoretically that the benefits of our regularization generalize to unseen points.",/pdf/548bba94a10c97d5c137c61261285f78b353a820.pdf,ICLR,2020,"If you train your model with our regularizers, black-box explanation systems will work better on the resulting model. Further, it's likely that the resulting model will be more accurate as well. " +L4n9FPoQL1,MuaIqRMishQ,1601310000000.0,1614990000000.0,1396,Classify and Generate Reciprocally: Simultaneous Positive-Unlabelled Learning and Conditional Generation with Extra Data,"[""~Bing_Yu1"", ""~Ke_Sun3"", ""~He_Wang6"", ""~Zhouchen_Lin1"", ""~Zhanxing_Zhu1""]","[""Bing Yu"", ""Ke Sun"", ""He Wang"", ""Zhouchen Lin"", ""Zhanxing Zhu""]",[],"The scarcity of class-labeled data is a ubiquitous bottleneck in a wide range of machine learning problems. While abundant unlabeled data normally exist and provide a potential solution, it is extremely challenging to exploit them. In this paper, we address this problem by leveraging Positive-Unlabeled~(PU) classification and conditional generation with extra unlabeled data \emph{simultaneously}, both of which aim to make full use of agnostic unlabeled data to improve classification and generation performances. In particular, we present a novel training framework to jointly target both PU classification and conditional generation when exposing to extra data, especially out-of-distribution unlabeled data, by exploring the interplay between them: 1) enhancing the performance of PU classifiers with the assistance of a novel Conditional Generative Adversarial Network~(CGAN) that is robust to noisy labels, 2) leveraging extra data with predicted labels from a PU classifier to help the generation. Our key contribution is a Classifier-Noise-Invariant Conditional GAN~(CNI-CGAN) that can learn the clean data distribution from noisy labels predicted by a PU classifier. 
Theoretically, we proved the optimal condition of CNI-CGAN and experimentally, we conducted extensive evaluations on diverse datasets, verifying the simultaneous improvements on both classification and generation.",/pdf/747e9d042146eed0720654b2ce71c79a492c96fb.pdf,ICLR,2021, +r1lIKlSYvH,rkeRkReKDB,1569440000000.0,1577170000000.0,2436,The Usual Suspects? Reassessing Blame for VAE Posterior Collapse,"[""daib13@mails.tsinghua.edu.cn"", ""wzy196@gmail.com"", ""davidwipf@gmail.com""]","[""Bin Dai"", ""Ziyu Wang"", ""David Wipf""]","[""variational autoencoder"", ""posterior collapse""]","In narrow asymptotic settings Gaussian VAE models of continuous data have been shown to possess global optima aligned with ground-truth distributions. Even so, it is well known that poor solutions whereby the latent posterior collapses to an uninformative prior are sometimes obtained in practice. However, contrary to conventional wisdom that largely assigns blame for this phenomena on the undue influence of KL-divergence regularization, we will argue that posterior collapse is, at least in part, a direct consequence of bad local minima inherent to the loss surface of deep autoencoder networks. In particular, we prove that even small nonlinear perturbations of affine VAE decoder models can produce such minima, and in deeper models, analogous minima can force the VAE to behave like an aggressive truncation operator, provably discarding information along all latent dimensions in certain circumstances. 
Regardless, the underlying message here is not meant to undercut valuable existing explanations of posterior collapse, but rather, to refine the discussion and elucidate alternative risk factors that may have been previously underappreciated.",/pdf/3be35420a75184e056cf185b94aafbd137eff80b.pdf,ICLR,2020, +HJIoJWZCZ,r1rj1WbAZ,1509130000000.0,1519400000000.0,642,Adversarial Dropout Regularization,"[""k-saito@mi.t.u-tokyo.ac.jp"", ""ushiku@mi.t.u-tokyo.ac.jp"", ""harada@mi.t.u-tokyo.ac.jp"", ""saenko@bu.edu""]","[""Kuniaki Saito"", ""Yoshitaka Ushiku"", ""Tatsuya Harada"", ""Kate Saenko""]","[""domain adaptation"", ""computer vision"", ""generative models""]","We present a domain adaptation method for transferring neural representations from label-rich source domains to unlabeled target domains. Recent adversarial methods proposed for this task learn to align features across domains by ``fooling'' a special domain classifier network. However, a drawback of this approach is that the domain classifier simply labels the generated features as in-domain or not, without considering the boundaries between classes. This means that ambiguous target features can be generated near class boundaries, reducing target classification accuracy. We propose a novel approach, Adversarial Dropout Regularization (ADR), which encourages the generator to output more discriminative features for the target domain. Our key idea is to replace the traditional domain critic with a critic that detects non-discriminative features by using dropout on the classifier network. The generator then learns to avoid these areas of the feature space and thus creates better features. 
We apply our ADR approach to the problem of unsupervised domain adaptation for image classification and semantic segmentation tasks, and demonstrate significant improvements over the state of the art.",/pdf/7ef508b1e2c14f443e49910c820adb60b9e98a3b.pdf,ICLR,2018,We present a new adversarial method for adapting neural representations based on a critic that detects non-discriminative features. +Sklgs0NFvr,BkxN6yFOwH,1569440000000.0,1583910000000.0,1305,Learning The Difference That Makes A Difference With Counterfactually-Augmented Data,"[""dkaushik@cs.cmu.edu"", ""hovy@cmu.edu"", ""zlipton@cmu.edu""]","[""Divyansh Kaushik"", ""Eduard Hovy"", ""Zachary Lipton""]","[""humans in the loop"", ""annotation artifacts"", ""text classification"", ""sentiment analysis"", ""natural language inference""]","Despite alarm over the reliance of machine learning systems on so-called spurious patterns, the term lacks coherent meaning in standard statistical frameworks. However, the language of causality offers clarity: spurious associations are due to confounding (e.g., a common cause), but not direct or indirect causal effects. In this paper, we focus on natural language processing, introducing methods and resources for training models less sensitive to spurious patterns. Given documents and their initial labels, we task humans with revising each document so that it (i) accords with a counterfactual target label; (ii) retains internal coherence; and (iii) avoids unnecessary changes. Interestingly, on sentiment analysis and natural language inference tasks, classifiers trained on original data fail on their counterfactually-revised counterparts and vice versa. Classifiers trained on combined datasets perform remarkably well, just shy of those specialized to either domain. While classifiers trained on either original or manipulated data alone are sensitive to spurious features (e.g., mentions of genre), models trained on the combined data are less sensitive to this signal. 
Both datasets are publicly available.",/pdf/6267805abf4a2ab7b3f971c759b8c669e6060774.pdf,ICLR,2020,"Humans in the loop revise documents to accord with counterfactual labels, resulting resource helps to reduce reliance on spurious associations." +HJe62s09tX,HyxvNptqKX,1538090000000.0,1545360000000.0,754,Unsupervised Hyper-alignment for Multilingual Word Embeddings,"[""jean.alaux--lorain@ens.fr"", ""egrave@fb.com"", ""marco.cuturi.cameto@gmail.com"", ""ajoulin@fb.com""]","[""Jean Alaux"", ""Edouard Grave"", ""Marco Cuturi"", ""Armand Joulin""]",[],"We consider the problem of aligning continuous word representations, learned in multiple languages, to a common space. It was recently shown that, in the case of two languages, it is possible to learn such a mapping without supervision. This paper extends this line of work to the problem of aligning multiple languages to a common space. A solution is to independently map all languages to a pivot language. Unfortunately, this degrades the quality of indirect word translation. We thus propose a novel formulation that ensures composable mappings, leading to better alignments. We evaluate our method by jointly aligning word vectors in eleven languages, showing consistent improvement with indirect mappings while maintaining competitive performance on direct word translation.",/pdf/1516fc8e288c7003440cd342462dad26c4825c05.pdf,ICLR,2019, +ryykVe-0W,rJT0QeZ0-,1509130000000.0,1518730000000.0,573,Learning Independent Features with Adversarial Nets for Non-linear ICA,"[""pbpop3@gmail.com"", ""yoshua.bengio@umontreal.ca""]","[""Philemon Brakel"", ""Yoshua Bengio""]","[""adversarial networks"", ""ica"", ""unsupervised"", ""independence""]","Reliable measures of statistical dependence could potentially be useful tools for learning independent features and performing tasks like source separation using Independent Component Analysis (ICA). 
Unfortunately, many of such measures, like the mutual information, are hard to estimate and optimize directly. We propose to learn independent features with adversarial objectives (Goodfellow et al. 2014, Arjovsky et al. 2017) which optimize such measures implicitly. These objectives compare samples from the joint distribution and the product of the marginals without the need to compute any probability densities. We also propose two methods for obtaining samples from the product of the marginals using either a simple resampling trick or a separate parametric distribution. Our experiments show that this strategy can easily be applied to different types of model architectures and solve both linear and non-linear ICA problems. +",/pdf/918f910202a69f08ada9f06eef31e21178166564.pdf,ICLR,2018, +O-XJwyoIF-k,mW1I2VtMdwi,1601310000000.0,1616020000000.0,2176,Minimum Width for Universal Approximation,"[""~Sejun_Park1"", ""~Chulhee_Yun1"", ""~Jaeho_Lee3"", ""~Jinwoo_Shin1""]","[""Sejun Park"", ""Chulhee Yun"", ""Jaeho Lee"", ""Jinwoo Shin""]","[""universal approximation"", ""neural networks""]","The universal approximation property of width-bounded networks has been studied as a dual of classical universal approximation results on depth-bounded networks. However, the critical width enabling the universal approximation has not been exactly characterized in terms of the input dimension $d_x$ and the output dimension $d_y$. In this work, we provide the first definitive result in this direction for networks using the ReLU activation functions: The minimum width required for the universal approximation of the $L^p$ functions is exactly $\max\{d_x+1,d_y\}$. We also prove that the same conclusion does not hold for the uniform approximation with ReLU, but does hold with an additional threshold activation function. 
Our proof technique can be also used to derive a tighter upper bound on the minimum width required for the universal approximation using networks with general activation functions.",/pdf/e18b24a6733ce47aa4779ab76eb962fe09a49a45.pdf,ICLR,2021,We establish the tight bound on width for the universal approximability of neural network. +IZQm8mMRVqW,VSLKB62YT0k,1601310000000.0,1614990000000.0,1209,Quickly Finding a Benign Region via Heavy Ball Momentum in Non-Convex Optimization,"[""~Jun-Kun_Wang1"", ""~Jacob_Abernethy1""]","[""Jun-Kun Wang"", ""Jacob Abernethy""]",[],"The Heavy Ball Method, proposed by Polyak over five decades ago, is a first-order method for optimizing continuous functions. While its stochastic counterpart has proven extremely popular in training deep networks, there are almost no known functions where deterministic Heavy Ball is provably faster than the simple and classical gradient descent algorithm in non-convex optimization. The success of Heavy Ball has thus far eluded theoretical understanding. Our goal is to address this gap, and in the present work we identify two non-convex problems where we provably show that the Heavy Ball momentum helps the iterate to enter a benign region that contains a global optimal point faster. We show that Heavy Ball exhibits simple dynamics that clearly reveal the benefit of using a larger value of momentum parameter for the problems. The first of these optimization problems is the phase retrieval problem, which has useful applications in physical science. 
The second of these optimization problems is the cubic-regularized minimization, a critical subroutine required by the Nesterov-Polyak cubic-regularized method to find second-order stationary points in general smooth non-convex problems.",/pdf/e447bbd3f112b645d274879cc98efbddf64d2c49.pdf,ICLR,2021, +B1g-X3RqKm,ryln1AP5Ym,1538090000000.0,1545360000000.0,1329,A Proposed Hierarchy of Deep Learning Tasks,"[""joel@baidu.com"", ""sharan@baidu.com"", ""ardalaninewsha@baidu.com"", ""junheewoo@baidu.com"", ""hassankianinejad@baidu.com"", ""patwarymostofa@baidu.com"", ""yangyang62@baidu.com"", ""zhouyanqi@baidu.com"", ""gregdiamos@baidu.com"", ""kennethchurch@baidu.com""]","[""Joel Hestness"", ""Sharan Narang"", ""Newsha Ardalani"", ""Heewoo Jun"", ""Hassan Kianinejad"", ""Md. Mostofa Ali Patwary"", ""Yang Yang"", ""Yanqi Zhou"", ""Gregory Diamos"", ""Kenneth Church""]","[""Deep learning"", ""scaling with data"", ""computational complexity"", ""learning curves"", ""speech recognition"", ""image recognition"", ""machine translation"", ""language modeling""]","As the pace of deep learning innovation accelerates, it becomes increasingly important to organize the space of problems by relative difficulty. Looking to other fields for inspiration, we see analogies to the Chomsky Hierarchy in computational linguistics and time and space complexity in theoretical computer science. + +As a complement to prior theoretical work on the data and computational requirements of learning, this paper presents an empirical approach. We introduce a methodology for measuring validation error scaling with data and model size and test tasks in natural language, vision, and speech domains. 
We find that power-law validation error scaling exists across a breadth of factors and that model size scales sublinearly with data size, suggesting that simple learning theoretic models offer insights into the scaling behavior of realistic deep learning settings, and providing a new perspective on how to organize the space of problems. + +We measure the power-law exponent---the ""steepness"" of the learning curve---and propose using this metric to sort problems by degree of difficulty. There is no data like more data, but some tasks are more effective at taking advantage of more data. Those that are more effective are easier on the proposed scale. + +Using this approach, we can observe that studied tasks in speech and vision domains scale faster than those in the natural language domain, offering insight into the observation that progress in these areas has proceeded more rapidly than in natural language.",/pdf/cf2d05da49f06b82d0f3d7def5639e560da0e8ab.pdf,ICLR,2019,"We use 50 GPU years of compute time to study how deep learning scales with more data, and propose a new way to organize the space of problems by difficulty." +5K8ZG9twKY,TqDpA91wDdW,1601310000000.0,1614990000000.0,2481,Efficient Estimators for Heavy-Tailed Machine Learning,"[""~Vishwak_Srinivasan1"", ""~Adarsh_Prasad1"", ""~Sivaraman_Balakrishnan1"", ""~Pradeep_Kumar_Ravikumar1""]","[""Vishwak Srinivasan"", ""Adarsh Prasad"", ""Sivaraman Balakrishnan"", ""Pradeep Kumar Ravikumar""]",[],"A dramatic improvement in data collection technologies has aided in procuring massive amounts of unstructured and heterogeneous datasets. This has consequently led to a prevalence of heavy-tailed distributions across a broad range of tasks in machine learning. In this work, we perform thorough empirical studies to show that modern machine learning models such as generative adversarial networks and invertible flow models are plagued with such ill-behaved distributions during the phase of training them. 
To alleviate this problem, we develop a computationally-efficient estimator for mean estimation with provable guarantees which can handle such ill-behaved distributions. We provide specific consequences of our theory for supervised learning tasks such as linear regression and generalized linear models. Furthermore, we study the performance of our algorithm on synthetic tasks and real-world experiments and show that our methods convincingly outperform a variety of practical baselines.",/pdf/b8752226e0bec662a6f92371fbb78db045056a1b.pdf,ICLR,2021, +ryhqQFKgl,,1478230000000.0,1486530000000.0,109,Towards Deep Interpretability (MUS-ROVER II): Learning Hierarchical Representations of Tonal Music,"[""haiziyu7@illinois.edu"", ""varshney@illinois.edu""]","[""Haizi Yu"", ""Lav R. Varshney""]",[],"Music theory studies the regularity of patterns in music to capture concepts underlying music styles and composers' decisions. This paper continues the study of building \emph{automatic theorists} (rovers) to learn and represent music concepts that lead to human interpretable knowledge and further lead to materials for educating people. Our previous work took a first step in algorithmic concept learning of tonal music, studying high-level representations (concepts) of symbolic music (scores) and extracting interpretable rules for composition. This paper further studies the representation \emph{hierarchy} through the learning process, and supports \emph{adaptive} 2D memory selection in the resulting language model. This leads to a deeper-level interpretability that expands from individual rules to a dynamic system of rules, making the entire rule learning process more cognitive. The outcome is a new rover, MUS-ROVER \RN{2}, trained on Bach's chorales, which outputs customizable syllabi for learning compositional rules. We demonstrate comparable results to our music pedagogy, while also presenting the differences and variations. 
In addition, we point out the rover's potential usages in style recognition and synthesis, as well as applications beyond music.",/pdf/1aefa0439737c9dde10a30bef4a202918311633f.pdf,ICLR,2017, +LMslR3CTzE_,Df0UrXyP78,1601310000000.0,1614990000000.0,687,Neural Subgraph Matching,"[""~Zhitao_Ying1"", ""anwang@cs.stanford.edu"", ""~Jiaxuan_You2"", ""chengtao.wen@siemens.com"", ""arquimedes.canedo@siemens.com"", ""~Jure_Leskovec1""]","[""Zhitao Ying"", ""Andrew Wang"", ""Jiaxuan You"", ""Chengtao Wen"", ""Arquimedes Canedo"", ""Jure Leskovec""]","[""Graph neural networks"", ""Subgraph matching"", ""Order Embedding""]","Subgraph matching is the problem of determining the presence and location(s) of a given query graph in a large target graph. +Despite being an NP-complete problem, the subgraph matching problem is crucial in domains ranging from network science and database systems to biochemistry and cognitive science. +However, existing techniques based on combinatorial matching and integer programming cannot handle matching problems with both large target and query graphs. +Here we propose NeuroMatch, an accurate, efficient, and robust neural approach to subgraph matching. NeuroMatch decomposes query and target graphs into small subgraphs and embeds them using graph neural networks. Trained to capture geometric constraints corresponding to subgraph relations, NeuroMatch then efficiently performs subgraph matching directly in the embedding space. 
Experiments demonstrate NeuroMatch is 100x faster than existing combinatorial approaches and 18% more accurate than existing approximate subgraph matching methods.",/pdf/c0b28df14d411fd02ad9559f01d978ccb0340d4d.pdf,ICLR,2021,Neural approach to learning the problem of subgraph isomorphism +4kWGWoFGA_H,WTGtG-dElRJ,1601310000000.0,1614990000000.0,1070,Beyond the Pixels: Exploring the Effects of Bit-Level Network and File Corruptions on Video Model Robustness,"[""~Trenton_Chang1"", ""~Daniel_Yang_Fu1"", ""~Yixuan_Li1""]","[""Trenton Chang"", ""Daniel Yang Fu"", ""Yixuan Li""]","[""robustness"", ""machine learning"", ""file corruption"", ""network corruption"", ""video""]","We investigate the robustness of video machine learning models to bit-level network and file corruptions, which can arise from network transmission failures or hardware errors, and explore defenses against such corruptions. We simulate network and file corruptions at multiple corruption levels, and find that bit-level corruptions can cause substantial performance drops on common action recognition and multi-object tracking tasks. We explore two types of defenses against bit-level corruptions: corruption-agnostic and corruption-aware defenses. We find that corruption-agnostic defenses such as adversarial training have limited effectiveness, performing up to 11.3 accuracy points worse than a no-defense baseline. In response, we propose Bit-corruption Augmented Training (BAT), a corruption-aware baseline that exploits knowledge of bit-level corruptions to enforce model invariance to such corruptions. BAT outperforms corruption-agnostic defenses, recovering up to 7.1 accuracy points over a no-defense baseline on highly-corrupted videos while maintaining competitive performance on clean/near-clean data.",/pdf/223a5edfd979193eee1766237957f1df157a1ca5.pdf,ICLR,2021,We investigate the problem of video model robustness to bit-level network and file corruptions. 
+B1fbosCcYm,rJlRapo9tQ,1538090000000.0,1545360000000.0,598,A Biologically Inspired Visual Working Memory for Deep Networks,"[""ewah1g13@ecs.soton.ac.uk"", ""mn@ecs.soton.ac.uk"", ""jsh2@ecs.soton.ac.uk""]","[""Ethan Harris"", ""Mahesan Niranjan"", ""Jonathon Hare""]","[""memory"", ""visual attention"", ""image classification"", ""image reconstruction"", ""latent representations""]","The ability to look multiple times through a series of pose-adjusted glimpses is fundamental to human vision. This critical faculty allows us to understand highly complex visual scenes. Short term memory plays an integral role in aggregating the information obtained from these glimpses and informing our interpretation of the scene. Computational models have attempted to address glimpsing and visual attention but have failed to incorporate the notion of memory. We introduce a novel, biologically inspired visual working memory architecture that we term the Hebb-Rosenblatt memory. We subsequently introduce a fully differentiable Short Term Attentive Working Memory model (STAWM) which uses transformational attention to learn a memory over each image it sees. The state of our Hebb-Rosenblatt memory is embedded in STAWM as the weights space of a layer. By projecting different queries through this layer we can obtain goal-oriented latent representations for tasks including classification and visual reconstruction. Our model obtains highly competitive classification performance on MNIST and CIFAR-10. As demonstrated through the CelebA dataset, to perform reconstruction the model learns to make a sequence of updates to a canvas which constitute a parts-based representation. Classification with the self supervised representation obtained from MNIST is shown to be in line with the state of the art models (none of which use a visual attention mechanism). 
Finally, we show that STAWM can be trained under the dual constraints of classification and reconstruction to provide an interpretable visual sketchpad which helps open the `black-box' of deep learning.",/pdf/b043e3d92c00db005f5e142fafc208f575bea82f.pdf,ICLR,2019,A biologically inspired working memory that can be integrated in recurrent visual attention models for state of the art performance +4JLiaohIk9,w09L02hIm-m,1601310000000.0,1614990000000.0,2198,Motion Forecasting with Unlikelihood Training,"[""~Deyao_Zhu1"", ""~Mohamed_Zahran1"", ""~Li_Erran_Li1"", ""~Mohamed_Elhoseiny1""]","[""Deyao Zhu"", ""Mohamed Zahran"", ""Li Erran Li"", ""Mohamed Elhoseiny""]",[],"Motion forecasting is essential for making safe and intelligent decisions in robotic applications such as autonomous driving. State-of-the-art methods formulate it as a sequence-to-sequence prediction problem, which is solved in an encoder-decoder framework with a maximum likelihood estimation objective. In this paper, we show that the likelihood objective itself results in a model assigning too much probability to trajectories that are unlikely given the contextual information such as maps and states of surrounding agents. This is despite the fact that many state-of-the-art models do take contextual information as part of their input. We propose a new objective, unlikelihood training, which forces generated trajectories that conflict with contextual information to be assigned a lower probability by our model. We demonstrate that our method can significantly improve state-of-the-art models’ performance on challenging real-world trajectory forecasting datasets (nuScenes and Argoverse) by 8% and reduce the standard deviation by up to 50%. 
The code will be made available.",/pdf/c8ea1531c15d85ac01f4afe4480b8c354e5a12d5.pdf,ICLR,2021, +r1HNP0eCW,Byf4vAeC-,1509120000000.0,1518730000000.0,451,Estimation of cross-lingual news similarities using text-mining methods,"[""wangzhouhao94@gmail.com"", ""m2015eliu@socsim.org"", ""sakaji@sys.t.u-tokyo.ac.jp"", ""m2015titoh@socsim.org"", ""izumi@sys.t.u-tokyo.ac.jp"", ""ktsubouc@yahoo-corp.jp"", ""tayamash@yahoo-corp.jp""]","[""Zhouhao Wang"", ""Enda Liu"", ""Hiroki Sakaji"", ""Tomoki Ito"", ""Kiyoshi Izumi"", ""Kota Tsubouchi"", ""Tatsuo Yamashita""]",[],"Every second, innumerable text data, including all kinds of news, reports, messages, reviews, comments, and tweets, are generated on the Internet, written not only in English but also in other languages such as Chinese, Japanese, and French. Not only SNS sites but also worldwide news agencies such as Thomson Reuters News provide news reported in more than 20 languages, reflecting the significance of multilingual information. +In this research, by taking advantage of multi-lingual text resources provided by the Thomson Reuters News, we developed a bidirectional LSTM based method to calculate cross-lingual semantic text similarity for long text and short text respectively. 
Thus, users could understand the situation comprehensively, by investigating similar and related cross-lingual articles, when an important news story comes in.",/pdf/264283bc493fe735c93cd671a0ae68a6b4ccbec9.pdf,ICLR,2018, +SkxHRySFvr,BygPGv1tDH,1569440000000.0,1577170000000.0,2024,LEARNING TO IMPUTE: A GENERAL FRAMEWORK FOR SEMI-SUPERVISED LEARNING,"[""w.h.li@ed.ac.uk"", ""foo_chuan_sheng@i2r.a-star.edu.sg"", ""hbilen@ed.ac.uk""]","[""Wei-Hong Li"", ""Chuan-Sheng Foo"", ""Hakan Bilen""]","[""Semi-supervised Learning"", ""Meta-Learning"", ""Learning to label""]","Recent semi-supervised learning methods have been shown to achieve comparable results to their supervised counterparts while using only a small portion of labels in image classification tasks thanks to their regularization strategies. In this paper, we take a more direct approach for semi-supervised learning and propose learning to impute the labels of unlabeled samples such that a network achieves better generalization when it is trained on these labels. We pose the problem in a learning-to-learn formulation which can easily be incorporated into the state-of-the-art semi-supervised techniques and boost their performance especially when the labels are limited. We demonstrate that our method is applicable to both classification and regression problems including image classification and facial landmark detection tasks.",/pdf/5fef022228a1e79b26c026e8c5a473249e245aea.pdf,ICLR,2020,"We proposed a general learning-to-learn framework for semi-supervised learning, which can be used for both classification and regression tasks." 
+Hygy01StvH,HyxU_8JYPS,1569440000000.0,1577170000000.0,2010,Impact of the latent space on the ability of GANs to fit the distribution,"[""thomas.pinetz@ait.ac.at"", ""daniel.soukup@ait.ac.at"", ""pock@icg.tugraz.at""]","[""Thomas Pinetz"", ""Daniel Soukup"", ""Thomas Pock""]","[""Deep Learning"", ""Generative Adversarial Networks"", ""Compression"", ""Perceptual Quality""]","The goal of generative models is to model the underlying data distribution of a +sample based dataset. Our intuition is that an accurate model should in principle +also include the sample based dataset as part of its induced probability distribution. +To investigate this, we look at fully trained generative models using the Generative +Adversarial Networks (GAN) framework and analyze the resulting generator +on its ability to memorize the dataset. Further, we show that the size of the initial +latent space is paramount to allow for an accurate reconstruction of the training +data. This gives us a link to compression theory, where Autoencoders (AE) are +used to lower bound the reconstruction capabilities of our generative model. Here, +we observe similar results to the perception-distortion tradeoff (Blau & Michaeli +(2018)). Given a small latent space, the AE produces low quality and the GAN +produces high quality outputs from a perceptual viewpoint. In contrast, the distortion +error is smaller for the AE. By increasing the dimensionality of the latent +space the distortion decreases for both models, but the perceptual quality only +increases for the AE.",/pdf/3c7a6b6587f4dd59b0df9598aef36c2d0f80ab97.pdf,ICLR,2020,We analyze the impact of the latent space of fully trained generators by pseudo inverting them. 
+rk07ZXZRb,BJjz-XbCZ,1509140000000.0,1519430000000.0,1123,Learning an Embedding Space for Transferable Robot Skills,"[""hausmankarol@gmail.com"", ""springenberg@google.com"", ""ziyu@google.com"", ""heess@google.com"", ""riedmiller@google.com""]","[""Karol Hausman"", ""Jost Tobias Springenberg"", ""Ziyu Wang"", ""Nicolas Heess"", ""Martin Riedmiller""]","[""Deep Reinforcement Learning"", ""Variational Inference"", ""Control"", ""Robotics""]","We present a method for reinforcement learning of closely related skills that are parameterized via a skill embedding space. We learn such skills by taking advantage of latent variables and exploiting a connection between reinforcement learning and variational inference. The main contribution of our work is an entropy-regularized policy gradient formulation for hierarchical policies, and an associated, data-efficient and robust off-policy gradient algorithm based on stochastic value gradients. We demonstrate the effectiveness of our method on several simulated robotic manipulation tasks. We find that our method allows for discovery of multiple solutions and is capable of learning the minimum number of distinct skills that are necessary to solve a given set of tasks. In addition, our results indicate that the hereby proposed technique can interpolate and/or sequence previously learned skills in order to accomplish more complex tasks, even in the presence of sparse rewards. +",/pdf/29c35690ff52463c84d9456ab511e4c944ddbea4.pdf,ICLR,2018, +HkxZVlHYvH,S1eAWLxKDH,1569440000000.0,1577170000000.0,2238,Ergodic Inference: Accelerate Convergence by Optimisation,"[""yichuan.zhang@eng.cam.ac.uk"", ""jmh233@cam.ac.uk""]","[""Yichuan Zhang"", ""Jos\u00e9 Miguel Hern\u00e1ndez-Lobato""]","[""MCMC"", ""variational inference"", ""statistical inference""]","Statistical inference methods are fundamentally important in machine learning. 
Most state-of-the-art inference algorithms are +variants of Markov chain Monte Carlo (MCMC) or variational inference (VI). However, both methods struggle with limitations in practice: MCMC methods can be computationally demanding; VI methods may have large bias. +In this work, we aim to improve upon MCMC and VI by a novel hybrid method based on the idea of reducing simulation bias of finite-length MCMC chains using gradient-based optimisation. The proposed method can generate low-biased samples by increasing the length of MCMC simulation and optimising the MCMC hyper-parameters, which offers attractive balance between approximation bias and computational efficiency. We show that our method produces promising results on popular benchmarks when compared to recent hybrid methods of MCMC and VI.",/pdf/07f4cf199dc7b56ec074e6497b02d445e080cb91.pdf,ICLR,2020,"In this work, we aim to improve upon MCMC and VI by a novel hybrid method based on the idea of reducing simulation bias of finite-length MCMC chains using gradient-based optimisation." +UFGEelJkLu5,TPdxn14n2G,1601310000000.0,1616360000000.0,2674,MixKD: Towards Efficient Distillation of Large-scale Language Models,"[""~Kevin_J_Liang1"", ""~Weituo_Hao1"", ""~Dinghan_Shen1"", ""~Yufan_Zhou1"", ""~Weizhu_Chen1"", ""~Changyou_Chen1"", ""~Lawrence_Carin2""]","[""Kevin J Liang"", ""Weituo Hao"", ""Dinghan Shen"", ""Yufan Zhou"", ""Weizhu Chen"", ""Changyou Chen"", ""Lawrence Carin""]","[""Natural Language Processing"", ""Representation Learning""]","Large-scale language models have recently demonstrated impressive empirical performance. Nevertheless, the improved results are attained at the price of bigger models, more power consumption, and slower inference, which hinder their applicability to low-resource (both memory and computation) platforms. Knowledge distillation (KD) has been demonstrated as an effective framework for compressing such big models. 
However, large-scale neural network systems are prone to memorize training instances, and thus tend to make inconsistent predictions when the data distribution is altered slightly. Moreover, the student model has few opportunities to request useful information from the teacher model when there is limited task-specific data available. To address these issues, we propose MixKD, a data-agnostic distillation framework that leverages mixup, a simple yet efficient data augmentation approach, to endow the resulting model with stronger generalization ability. Concretely, in addition to the original training examples, the student model is encouraged to mimic the teacher's behavior on the linear interpolation of example pairs as well. We prove from a theoretical perspective that under reasonable conditions MixKD gives rise to a smaller gap between the generalization error and the empirical error. To verify its effectiveness, we conduct experiments on the GLUE benchmark, where MixKD consistently leads to significant gains over the standard KD training, and outperforms several competitive baselines. Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach.",/pdf/1973dfb092fcfb9ef04acaf338a759f67dcc68b8.pdf,ICLR,2021,"We propose MixKD, a distillation framework leveraging mixup for large-scale language models." +r1e8qpVKPS,Bkx89TTDwH,1569440000000.0,1577170000000.0,708,Role of two learning rates in convergence of model-agnostic meta-learning,"[""takagi@mns.k.u-tokyo.ac.jp"", ""nagano@mns.k.u-tokyo.ac.jp"", ""yoshida@mns.k.u-tokyo.ac.jp"", ""okada@edu.k.u-tokyo.ac.jp""]","[""Shiro Takagi"", ""Yoshihiro Nagano"", ""Yuki Yoshida"", ""Masato Okada""]","[""meta-learning"", ""convergence""]","Model-agnostic meta-learning (MAML) is known as a powerful meta-learning method. However, MAML is notorious for being hard to train because of the existence of two learning rates. 
Therefore, in this paper, we derive the conditions that inner learning rate $\alpha$ and meta-learning rate $\beta$ must satisfy for MAML to converge to minima with some simplifications. We find that the upper bound of $\beta$ depends on $ \alpha$, in contrast to the case of using the normal gradient descent method. Moreover, we show that the threshold of $\beta$ increases as $\alpha$ approaches its own upper bound. This result is verified by experiments on various few-shot tasks and architectures; specifically, we perform sinusoid regression and classification of Omniglot and MiniImagenet datasets with a multilayer perceptron and a convolutional neural network. Based on this outcome, we present a guideline for determining the learning rates: first, search for the largest possible $\alpha$; next, tune $\beta$ based on the chosen value of $\alpha$.",/pdf/cdbfdee484753ccb694c6522f860f50c7c0b5838.pdf,ICLR,2020,We analyzed the role of two learning rates in model-agnostic meta-learning in convergence. +ByxkCj09Fm,rJeepspct7,1538090000000.0,1545360000000.0,860,DEEP HIERARCHICAL MODEL FOR HIERARCHICAL SELECTIVE CLASSIFICATION AND ZERO SHOT LEARNING,"[""sasonil@gmail.com"", ""koby@ee.technion.ac.il""]","[""Eliyahu Sason"", ""Koby Crammer""]","[""deep learning"", ""large-scale classificaion"", ""heirarchical classification"", ""zero-shot learning""]","Object recognition in real-world image scenes is still an open problem. With the growing number of classes, the similarity structures between them become complex and the distinction between classes blurs, which makes the classification problem particularly challenging. Standard N-way discrete classifiers treat all classes as disconnected and unrelated, and therefore unable to learn from their semantic relationships. In this work, we present a hierarchical inter-class relationship model and train it using a newly proposed probability-based loss function. 
Our hierarchical model provides significantly better semantic generalization ability compared to a regular N-way classifier. We further proposed an algorithm where given a probabilistic classification model it can return the input corresponding super-group based on classes hierarchy without any further learning. We deploy it in two scenarios in which super-group retrieval can be useful. The first one, selective classification, deals with the problem of low-confidence classification, wherein a model is unable to make a successful exact classification. +The second, zero-shot learning problem deals with making reasonable inferences on novel classes. Extensive experiments with the two scenarios show that our proposed hierarchical model yields more accurate and meaningful super-class predictions compared to a regular N-way classifier because of its significantly better semantic generalization ability.",/pdf/3e4b8bf893402744ec96f7aea7d7980934f05ed4.pdf,ICLR,2019,"We propose a new hierarchical probability based loss function which yields a significantly better semantic classifier for large scale classification scenario. Moreover, we show the importance of such a model in two applications." +rJ4vlh0qtm,S1g_FofFOX,1538090000000.0,1545360000000.0,1087,SSoC: Learning Spontaneous and Self-Organizing Communication for Multi-Agent Collaboration,"[""kong@pku.edu.cn"", ""lijingg@pku.edu.cn"", ""jimxinbo@gmail.com"", ""yizhou.wang@pku.edu.cn""]","[""Xiangyu Kong"", ""Jing Li"", ""Bo Xin"", ""Yizhou Wang""]","[""reinforcement learning"", ""multi-agent learning"", ""multi-agent communication"", ""deep learning""]","Multi-agent collaboration is required by numerous real-world problems. Although distributed setting is usually adopted by practical systems, local range communication and information aggregation still matter in fulfilling complex tasks. For multi-agent reinforcement learning, many previous studies have been dedicated to design an effective communication architecture. 
However, existing models usually suffer from an ossified communication structure, e.g., most of them predefine a particular communication mode by specifying a fixed time frequency and spatial scope for agents to communicate regardless of necessity. Such design is incapable of dealing with multi-agent scenarios that are capricious and complicated, especially when only partial information is available. Motivated by this, we argue that the solution is to build a spontaneous and self-organizing communication (SSoC) learning scheme. By treating the communication behaviour as an explicit action, SSoC learns to organize communication in an effective and efficient way. Particularly, it enables each agent to spontaneously decide when and who to send messages based on its observed states. In this way, a dynamic inter-agent communication channel is established in an online and self-organizing manner. The agents also learn how to adaptively aggregate the received messages and its own hidden states to execute actions. Various experiments have been conducted to demonstrate that SSoC really learns intelligent message passing among agents located far apart. With such agile communications, we observe that effective collaboration tactics emerge which have not been mastered by the compared baselines.",/pdf/1d30cd0a472052ed717904f5d4e01c108e5520b2.pdf,ICLR,2019,This paper proposes a spontaneous and self-organizing communication (SSoC) learning scheme for multi-agent RL tasks. 
+r1xCMyBtPS,HJxVbE3uPB,1569440000000.0,1583910000000.0,1600,Multilingual Alignment of Contextual Word Representations,"[""stevencao@berkeley.edu"", ""kitaev@berkeley.edu"", ""klein@berkeley.edu""]","[""Steven Cao"", ""Nikita Kitaev"", ""Dan Klein""]","[""multilingual"", ""natural language processing"", ""embedding alignment"", ""BERT"", ""word embeddings"", ""transfer""]","We propose procedures for evaluating and strengthening contextual embedding alignment and show that they are useful in analyzing and improving multilingual BERT. In particular, after our proposed alignment procedure, BERT exhibits significantly improved zero-shot performance on XNLI compared to the base model, remarkably matching pseudo-fully-supervised translate-train models for Bulgarian and Greek. Further, to measure the degree of alignment, we introduce a contextual version of word retrieval and show that it correlates well with downstream zero-shot transfer. Using this word retrieval task, we also analyze BERT and find that it exhibits systematic deficiencies, e.g. worse alignment for open-class parts-of-speech and word pairs written in different scripts, that are corrected by the alignment procedure. These results support contextual alignment as a useful concept for understanding large multilingual pre-trained models.",/pdf/db0d538420ea1471c16cc4c3e0251ab3e34bfb22.pdf,ICLR,2020,We propose procedures for evaluating and strengthening contextual embedding alignment and show that they both improve multilingual BERT's zero-shot XNLI transfer and provide useful insights into the model. 
+H1lXCaVKvS,Byg7iPWOPH,1569440000000.0,1577170000000.0,849,Frustratingly easy quasi-multitask learning,"[""berendg@inf.u-szeged.hu"", ""kis-szabo.norbert@stud.u-szeged.hu""]","[""G\u00e1bor Berend"", ""Norbert Kis-Szab\u00f3""]","[""multitask learning"", ""ensembling""]","We propose the technique of quasi-multitask learning (Q-MTL), a simple and easy-to-implement modification of standard multitask learning, in which the tasks to be modeled are identical. We show, through a series of sequence labeling experiments over a diverse set of languages, that applying Q-MTL consistently increases the generalization ability of the applied models. The proposed architecture can be regarded as a new regularization technique encouraging the model to develop an internal representation of the problem at hand that is beneficial to multiple output units of the classifier at the same time. This property hampers the convergence to such internal representations which are highly specific and tailored for a classifier with a particular set of parameters. Our experiments corroborate that by relying on the proposed algorithm, we can approximate the quality of an ensemble of classifiers at a fraction of the computational resources required. Additionally, our results suggest that Q-MTL handles the presence of noisy training labels better than ensembles. +",/pdf/12ef3a499c17788366fc2cf0a85a97ac75d2ea6d.pdf,ICLR,2020,We propose a computationally efficient alternative for traditional ensemble learning for neural nets. 
+BkxkH30cFm,Hkl8Ey0qFX,1538090000000.0,1545360000000.0,1505,Object-Oriented Model Learning through Multi-Level Abstraction,"[""guangxiangzhu@outlook.com"", ""jh-wang15@mails.tsinghua.edu.cn"", ""rzz16@mails.tsinghua.edu.cn"", ""chongjie@tsinghua.edu.cn""]","[""Guangxiang Zhu"", ""Jianhao Wang"", ""ZhiZhou Ren"", ""Chongjie Zhang""]","[""action-conditioned dynamics learning"", ""deep learning"", ""generalization"", ""interpretability"", ""sample efficiency""]","Object-based approaches for learning action-conditioned dynamics has demonstrated promise for generalization and interpretability. However, existing approaches suffer from structural limitations and optimization difficulties for common environments with multiple dynamic objects. In this paper, we present a novel self-supervised learning framework, called Multi-level Abstraction Object-oriented Predictor (MAOP), for learning object-based dynamics models from raw visual observations. MAOP employs a three-level learning architecture that enables efficient dynamics learning for complex environments with a dynamic background. We also design a spatial-temporal relational reasoning mechanism to support instance-level dynamics learning and handle partial observability. Empirical results show that MAOP significantly outperforms previous methods in terms of sample efficiency and generalization over novel environments that have multiple controllable and uncontrollable dynamic objects and different static object layouts. In addition, MAOP learns semantically and visually interpretable disentangled representations.",/pdf/3b0682659071c30267191b69378cc33276fdd252.pdf,ICLR,2019, +r1genAVKPB,BJlMzUtODS,1569440000000.0,1583910000000.0,1344,Is a Good Representation Sufficient for Sample Efficient Reinforcement Learning?,"[""ssdu@ias.edu"", ""sham@cs.washington.edu"", ""ruosongw@andrew.cmu.edu"", ""linyang@ee.ucla.edu""]","[""Simon S. Du"", ""Sham M. Kakade"", ""Ruosong Wang"", ""Lin F. 
Yang""]","[""reinforcement learning"", ""function approximation"", ""lower bound"", ""representation""]","Modern deep learning methods provide effective means to learn good representations. However, is a good representation itself sufficient for sample efficient reinforcement learning? This question has largely been studied only with respect to (worst-case) approximation error, in the more classical approximate dynamic programming literature. With regards to the statistical viewpoint, this question is largely unexplored, and the extant body of literature mainly focuses on conditions which \emph{permit} sample efficient reinforcement learning with little understanding of what are \emph{necessary} conditions for efficient reinforcement learning. +This work shows that, from the statistical viewpoint, the situation is far subtler than suggested by the more traditional approximation viewpoint, where the requirements on the representation that suffice for sample efficient RL are even more stringent. Our main results provide sharp thresholds for reinforcement learning methods, showing that there are hard limitations on what constitutes good function approximation (in terms of the dimensionality of the representation), where we focus on natural representational conditions relevant to value-based, model-based, and policy-based learning. These lower bounds highlight that having a good (value-based, model-based, or policy-based) representation in and of itself is insufficient for efficient reinforcement learning, unless the quality of this approximation passes certain hard thresholds. Furthermore, our lower bounds also imply exponential separations on the sample complexity between 1) value-based learning with perfect representation and value-based learning with a good-but-not-perfect representation, 2) value-based learning and policy-based learning, 3) policy-based learning and supervised learning and 4) reinforcement learning and imitation learning. 
",/pdf/158a8c9155a85383b88809776f37454a8282edc2.pdf,ICLR,2020,Exponential lower bounds for value-based and policy-based reinforcement learning with function approximation. +HJxNAnVtDS,r1x_sfcNPB,1569440000000.0,1583910000000.0,261,On the Convergence of FedAvg on Non-IID Data,"[""smslixiang@pku.edu.cn"", ""hackyhuang@pku.edu.cn"", ""yangwhsms@gmail.com"", ""shusen.wang@stevens.edu"", ""zhzhang@math.pku.edu.cn""]","[""Xiang Li"", ""Kaixuan Huang"", ""Wenhao Yang"", ""Shusen Wang"", ""Zhihua Zhang""]","[""Federated Learning"", ""stochastic optimization"", ""Federated Averaging""]","Federated learning enables a large amount of edge computing devices to jointly learn a model without data sharing. As a leading algorithm in this setting, Federated Averaging (\texttt{FedAvg}) runs Stochastic Gradient Descent (SGD) in parallel on a small subset of the total devices and averages the sequences only once in a while. Despite its simplicity, it lacks theoretical guarantees under realistic settings. In this paper, we analyze the convergence of \texttt{FedAvg} on non-iid data and establish a convergence rate of $\mathcal{O}(\frac{1}{T})$ for strongly convex and smooth problems, where $T$ is the number of SGDs. Importantly, our bound demonstrates a trade-off between communication-efficiency and convergence rate. As user devices may be disconnected from the server, we relax the assumption of full device participation to partial device participation and study different averaging schemes; low device participation rate can be achieved without severely slowing down the learning. Our results indicate that heterogeneity of data slows down the convergence, which matches empirical observations. 
Furthermore, we provide a necessary condition for \texttt{FedAvg} on non-iid data: the learning rate $\eta$ must decay, even if full-gradient is used; otherwise, the solution will be $\Omega (\eta)$ away from the optimal.",/pdf/dcd68da80bc99678b1254ec0b4d49dee872c7898.pdf,ICLR,2020, +HyWG0H5ge,,1478280000000.0,1482230000000.0,275,Neural Taylor Approximations: Convergence and Exploration in Rectifier Networks,"[""david.balduzzi@vuw.ac.nz"", ""brian@disneyresearch.com"", ""butlertony@ecs.vuw.ac.nz""]","[""David Balduzzi"", ""Brian McWilliams"", ""Tony Butler-Yeoman""]","[""Deep learning"", ""Optimization"", ""Theory"", ""Supervised Learning""]","Modern convolutional networks, incorporating rectifiers and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. Nevertheless, methods from convex optimization such as gradient descent and Adam are widely used as building blocks for deep learning algorithms. This paper provides the first convergence guarantee applicable to modern convnets. The guarantee matches a lower bound for convex nonsmooth functions. The key technical tool is the neural Taylor approximation -- a straightforward application of Taylor expansions to neural networks -- and the associated Taylor loss. Experiments on a range of optimizers, layers, and tasks provide evidence that the analysis accurately captures the dynamics of neural optimization. + +The second half of the paper applies the Taylor approximation to isolate the main difficulty in training rectifier nets: that gradients are shattered. We investigate the hypothesis that, by exploring the space of activation configurations more thoroughly, adaptive optimizers such as RMSProp and Adam are able to converge to better solutions.",/pdf/b27fb3966106d9294b0c28e06eea848b5d0fcb92.pdf,ICLR,2017,We provide the first convergence result for rectifier neural networks and investigate implications for exploration in shattered landscapes. 
+BJRZzFlRb,HJ3WMFgRZ,1509100000000.0,1519340000000.0,331,Compressing Word Embeddings via Deep Compositional Code Learning,"[""shu@nlab.ci.i.u-tokyo.ac.jp"", ""nakayama@ci.i.u-tokyo.ac.jp""]","[""Raphael Shu"", ""Hideki Nakayama""]","[""natural language processing"", ""word embedding"", ""compression"", ""deep learning""]","Natural language processing (NLP) models often require a massive number of parameters for word embeddings, resulting in a large storage or memory footprint. Deploying neural NLP models to mobile devices requires compressing the word embeddings without any significant sacrifices in performance. For this purpose, we propose to construct the embeddings with few basis vectors. For each word, the composition of basis vectors is determined by a hash code. To maximize the compression rate, we adopt the multi-codebook quantization approach instead of binary coding scheme. Each code is composed of multiple discrete numbers, such as (3, 2, 1, 8), where the value of each component is limited to a fixed range. We propose to directly learn the discrete codes in an end-to-end neural network by applying the Gumbel-softmax trick. Experiments show the compression rate achieves 98% in a sentiment analysis task and 94% ~ 99% in machine translation tasks without performance loss. In both tasks, the proposed method can improve the model performance by slightly lowering the compression rate. Compared to other approaches such as character-level segmentation, the proposed method is language-independent and does not require modifications to the network architecture.",/pdf/909d85aa5f6037c0dcd6751ac7d7d064c52fd180.pdf,ICLR,2018,Compressing the word embeddings over 94% without hurting the performance. 
+7t1FcJUWhi3,pgiYNYvugq5,1601310000000.0,1616050000000.0,2315,Neural Networks for Learning Counterfactual G-Invariances from Single Environments,"[""~S_Chandra_Mouli1"", ""~Bruno_Ribeiro1""]","[""S Chandra Mouli"", ""Bruno Ribeiro""]","[""Extrapolation"", ""G-invariance regularization"", ""Counterfactual inference"", ""Invariant subspaces""]","Despite —or maybe because of— their astonishing capacity to fit data, neural networks are believed to have difficulties extrapolating beyond training data distribution. This work shows that, for extrapolations based on finite transformation groups, a model’s inability to extrapolate is unrelated to its capacity. Rather, the shortcoming is inherited from a learning hypothesis: Examples not explicitly observed with infinitely many training examples have underspecified outcomes in the learner’s model. In order to endow neural networks with the ability to extrapolate over group transformations, we introduce a learning framework counterfactually-guided by the learning hypothesis that any group invariance to (known) transformation groups is mandatory even without evidence, unless the learner deems it inconsistent with the training data. Unlike existing invariance-driven methods for (counterfactual) extrapolations, this framework allows extrapolations from a single environment. Finally, we introduce sequence and image extrapolation tasks that validate our framework and showcase the shortcomings of traditional approaches.",/pdf/f68c8dcf4d107a320df0e519d96021379ed46828.pdf,ICLR,2021,"This work introduces a novel learning framework for single-environment extrapolations, where invariance to transformation groups is mandatory even without evidence, unless the learner deems it inconsistent with the training data." 
+KCzRX9N8BIH,loCeZuWQAXL,1601310000000.0,1614990000000.0,467,It Is Likely That Your Loss Should be a Likelihood,"[""~Mark_Hamilton1"", ""~Evan_Shelhamer2"", ""~William_T._Freeman1""]","[""Mark Hamilton"", ""Evan Shelhamer"", ""William T. Freeman""]","[""Adaptive Losses"", ""Outlier Detection"", ""Adaptive Regularization"", ""Recalibration"", ""Robust Modelling""]","Many common loss functions such as mean-squared-error, cross-entropy, and reconstruction loss are unnecessarily rigid. Under a probabilistic interpretation, these common losses correspond to distributions with fixed shapes and scales. We instead argue for optimizing full likelihoods that include parameters like the normal variance and softmax temperature. Joint optimization of these ``likelihood parameters'' with model parameters can adaptively tune the scales and shapes of losses in addition to the strength of regularization. We explore and systematically evaluate how to parameterize and apply likelihood parameters for robust modeling, outlier-detection, and re-calibration. Additionally, we propose adaptively tuning $L_2$ and $L_1$ weights by fitting the scale parameters of normal and Laplace priors and introduce more flexible element-wise regularizers.",/pdf/c9537210f4dcf4367c8a6a41d85d12b5ba7bc331.pdf,ICLR,2021,"Learning additional likelihood distribution parameters yields new approaches for robust modelling, outlier-detection, calibration, and adaptive regularization." 
+BylkG20qYm,SkgWzyAcYQ,1538090000000.0,1545360000000.0,1229,On Meaning-Preserving Adversarial Perturbations for Sequence-to-Sequence Models,"[""pmichel1@cs.cmu.edu"", ""gneubig@cs.cmu.edu"", ""xianl@fb.com"", ""juancarabina@fb.com""]","[""Paul Michel"", ""Graham Neubig"", ""Xian Li"", ""Juan Miguel Pino""]","[""Sequence-to-sequence"", ""adversarial attacks"", ""evaluation"", ""meaning preservation"", ""machine translation""]","Adversarial examples have been shown to be an effective way of assessing the robustness of neural sequence-to-sequence (seq2seq) models, by applying perturbations to the input of a model leading to large degradation in performance. However, these perturbations are only indicative of a weakness in the model if they do not change the semantics of the input in a way that would change the expected output. Using the example of machine translation (MT), we propose a new evaluation framework for adversarial attacks on seq2seq models taking meaning preservation into account and demonstrate that existing methods may not preserve meaning in general. Based on these findings, we propose new constraints for attacks on word-based MT systems and show, via human and automatic evaluation, that they produce more semantically similar adversarial inputs. Furthermore, we show that performing adversarial training with meaning-preserving attacks is beneficial to the model in terms of adversarial robustness without hurting test performance.",/pdf/16396c476c971c95c871925a5f3131693be2fcc4.pdf,ICLR,2019,How you should evaluate adversarial attacks on seq2seq +BysZhEqee,,1478280000000.0,1478280000000.0,219,Marginal Deep Architectures: Deep learning for Small and Middle Scale Applications,"[""ouczyc@outlook.com"", ""gqzhong@ouc.edu.cn"", ""dongjunyu@ouc.edu.cn""]","[""Yuchen Zheng"", ""Guoqiang Zhong"", ""Junyu Dong""]",[],"In recent years, many deep architectures have been proposed in different fields. 
However, to obtain good results, most of the previous deep models need a large number of training data. In this paper, for small and middle scale applications, we +propose a novel deep learning framework based on stacked feature learning models. Particularly, we stack marginal Fisher analysis (MFA) layer by layer for the initialization of the deep architecture and call it “Marginal Deep Architectures” (MDA). In the implementation of MDA, the weight matrices of MFA are first learned layer by layer, and then we exploit some deep learning techniques, such as back propagation, dropout and denoising to fine tune the network. To evaluate the effectiveness of MDA, we have compared it with some feature learning methods and deep learning models on 7 small and middle scale real-world applications, including handwritten digits recognition, speech recognition, historical document understanding, image classification, action recognition and so on. Extensive experiments demonstrate that MDA performs not only better than shallow feature learning models, but also state-of-the-art deep learning models in these applications.",/pdf/e505cfc12d175bb6f600b7348110b72431813073.pdf,ICLR,2017, +GbCkSfstOIA,rjzts7c0WIJ,1601310000000.0,1614990000000.0,735,Semi-Supervised Learning via Clustering Representation Space,"[""jeffpapapa@gmail.com"", ""~Yuh-Jye_Lee1"", ""~Chih-Chi_Wu1"", ""~Yi-Wei_Chiu1"", ""george851101@gmail.com"", ""chuck30621@gmail.com"", ""kphong19.iie08g@nctu.edu.tw""]","[""Yen-Chieh Huang"", ""Yuh-Jye Lee"", ""Chih-Chi Wu"", ""Yi-Wei Chiu"", ""Yong-Xiang Lin"", ""\bCHENG-YING LI"", ""Po-Hung Ko""]","[""semi-supervised learning"", ""deep learning"", ""clustering"", ""embedding latent space""]","We proposed a novel loss function that combines supervised learning with clustering in deep neural networks. Taking advantage of the data distribution and the existence of some labeled data, we construct a meaningful latent space. 
Our loss function consists of three parts: the quality of the clustering result, the margin between clusters, and the classification error of labeled instances. Our proposed model is trained to minimize our loss function by backpropagation, avoiding the need for pre-training or additional networks. This guides our network to classify labeled samples correctly while simultaneously finding good clusters. We applied our proposed method on MNIST, USPS, ETH-80, and COIL-100; the comparison results confirm our model's outstanding performance over semi-supervised learning.",/pdf/af3fd190f9ce4ef5ac1ddc184f5062187c1e829b.pdf,ICLR,2021,We proposed a novel loss function that combines supervised learning with clustering in deep neural networks. +H1gB4RVKvB,HJeY078dDH,1569440000000.0,1583910000000.0,1074,Recurrent neural circuits for contour detection,"[""drew_linsley@brown.edu"", ""junkyung_kim@brown.edu"", ""alekh_karkada_ashok@brown.edu"", ""thomas_serre@brown.edu""]","[""Drew Linsley*"", ""Junkyung Kim*"", ""Alekh Ashok"", ""Thomas Serre""]","[""Contextual illusions"", ""visual cortex"", ""recurrent feedback"", ""neural circuits""]","We introduce a deep recurrent neural network architecture that approximates visual cortical circuits (Mély et al., 2018). We show that this architecture, which we refer to as the 𝜸-net, learns to solve contour detection tasks with better sample efficiency than state-of-the-art feedforward networks, while also exhibiting a classic perceptual illusion, known as the orientation-tilt illusion. Correcting this illusion significantly reduces 𝜸-net contour detection accuracy by driving it to prefer low-level edges over high-level object boundary contours. 
Overall, our study suggests that the orientation-tilt illusion is a byproduct of neural circuits that help biological visual systems achieve robust and efficient contour detection, and that incorporating these circuits in artificial neural networks can improve computer vision.",/pdf/f382f5ab930c34fa026b1ba9687c63947d26b02f.pdf,ICLR,2020,"Contextual illusions are a feature, not a bug, of neural routines optimized for contour detection." +_L6b4Qzn5bp,NtpmrD1JjE,1601310000000.0,1614990000000.0,2921,Beyond COVID-19 Diagnosis: Prognosis with Hierarchical Graph Representation Learning,"[""~CHEN_LIU7"", ""~Jinze_Cui2"", ""~Dailin_Gan1"", ""~Guosheng_Yin1""]","[""CHEN LIU"", ""Jinze Cui"", ""Dailin Gan"", ""Guosheng Yin""]","[""COVID-19 Diagnosis"", ""COVID-19 Prognosis"", ""GCN""]","Coronavirus disease 2019 (COVID-19), the pandemic that is spreading fast globally, has caused over 34 million confirmed cases. Apart from the reverse transcription polymerase chain reaction (RT-PCR), the chest computed tomography (CT) is viewed as a standard and effective tool for disease diagnosis and progression monitoring. We propose a diagnosis and prognosis model based on graph convolutional networks (GCNs). The chest CT scan of a patient, typically involving hundreds of sectional images in sequential order, is formulated as a densely connected weighted graph. A novel distance aware pooling is proposed to abstract the node information hierarchically, which is robust and efficient for such densely connected graphs. Our method, combining GCNs and distance aware pooling, can integrate the information from all slices in the chest CT scans for optimal decision making, which leads to the state-of-the-art accuracy in the COVID-19 diagnosis and prognosis. With less than 1% number of total parameters in the baseline 3D ResNet model, our method achieves 94.8% accuracy for diagnosis. It has a 2.4% improvement compared with the baseline model on the same dataset. 
In addition, we can localize the most informative slices with disease lesions for COVID-19 within a large sequence of chest CT images. The proposed model can produce visual explanations for the diagnosis and prognosis, making the decision more transparent and explainable, while RT-PCR only leads to the test result with no prognosis information. The prognosis analysis can help hospitals or clinical centers designate medical resources more efficiently and better support clinicians to determine the proper clinical treatment. ",/pdf/60a52f13d80c4838f136b4cda0e94e4b808d1666.pdf,ICLR,2021, +ByeNra4FDB,BJesueLPvS,1569440000000.0,1583910000000.0,520,Novelty Detection Via Blurring,"[""si_choi@kaist.ac.kr"", ""schung@kaist.ac.kr""]","[""Sungik Choi"", ""Sae-Young Chung""]","[""novelty"", ""anomaly"", ""uncertainty""]"," Conventional out-of-distribution (OOD) detection schemes based on variational autoencoder or Random Network Distillation (RND) are known to assign lower uncertainty to the OOD data than the target distribution. In this work, we discover that such conventional novelty detection schemes are also vulnerable to blurred images. Based on this observation, we construct a novel RND-based OOD detector, SVD-RND, that utilizes blurred images during training. Our detector is simple, efficient in test time, and outperforms baseline OOD detectors in various domains. Further results show that SVD-RND learns a better target distribution representation than the baselines. Finally, SVD-RND combined with geometric transform achieves near-perfect detection accuracy in CelebA domain.",/pdf/411756962bea4fb9e7cca16cc98f513ef8069cec.pdf,ICLR,2020,We propose a novel OOD detector that employs blurred images as adversarial examples. Our model achieves significant OOD detection performance in various domains. 
+H1eRYxHYPB,Hyx-D0lFPr,1569440000000.0,1577170000000.0,2453,Optimal Unsupervised Domain Translation,"[""emmanuel.de-bezenac@lip6.fr"", ""ayedibrahim@gmail.com"", ""patrick.gallinari@lip6.fr""]","[""Emmanuel de B\u00e9zenac"", ""Ibrahim Ayed"", ""Patrick Gallinari""]","[""Unsupervised Domain Translation"", ""CycleGAN"", ""Optimal Transport""]","Unsupervised Domain Translation~(UDT) consists in finding meaningful correspondences between two domains, without access to explicit pairings between them. Following the seminal work of \textit{CycleGAN}, many variants and extensions of this model have been applied successfully to a wide range of applications. However, these methods remain poorly understood, and lack convincing theoretical guarantees. In this work, we define UDT in a rigorous, non-ambiguous manner, explore the implicit biases present in the approach and demonstrate the limits of these approaches. Specifically, we show that mappings produced by these methods are biased towards \textit{low energy} transformations, leading us to cast UDT into an Optimal Transport~(OT) framework by making this implicit bias explicit. This not only allows us to provide theoretical guarantees for existing methods, but also to solve UDT problems where previous methods fail. Finally, making the link between the dynamic formulation of OT and CycleGAN, we propose a simple approach to solve UDT, and illustrate its properties in two distinct settings.",/pdf/c6ebef8a42571bbd6deb8ec2d4a72539e4da3709.pdf,ICLR,2020,"We propose a novel, more rigorous framework for Unsupervised Domain Translation based on Optimal Transport." 
+HkeryxBtPB,SklXbckYDH,1569440000000.0,1583910000000.0,2061,MMA Training: Direct Input Space Margin Maximization through Adversarial Training,"[""gavin.w.ding@gmail.com"", ""yash.sharma@bethgelab.org"", ""yikchau.y.lui@borealisai.com"", ""ruitong.huang@borealisai.com""]","[""Gavin Weiguang Ding"", ""Yash Sharma"", ""Kry Yik Chau Lui"", ""Ruitong Huang""]","[""adversarial robustness"", ""perturbation"", ""margin maximization"", ""deep learning""]","We study adversarial robustness of neural networks from a margin maximization perspective, where margins are defined as the distances from inputs to a classifier's decision boundary. +Our study shows that maximizing margins can be achieved by minimizing the adversarial loss on the decision boundary at the ""shortest successful perturbation"", demonstrating a close connection between adversarial losses and the margins. We propose Max-Margin Adversarial (MMA) training to directly maximize the margins to achieve adversarial robustness. +Instead of adversarial training with a fixed $\epsilon$, MMA offers an improvement by enabling adaptive selection of the ""correct"" $\epsilon$ as the margin individually for each datapoint. In addition, we rigorously analyze adversarial training with the perspective of margin maximization, and provide an alternative interpretation for adversarial training, maximizing either a lower or an upper bound of the margins. Our experiments empirically confirm our theory and demonstrate MMA training's efficacy on the MNIST and CIFAR10 datasets w.r.t. $\ell_\infty$ and $\ell_2$ robustness.",/pdf/9f14831bcf0b20a9bf5d64248db43a0697453114.pdf,ICLR,2020,We propose MMA training to directly maximize input space margin in order to improve adversarial robustness primarily by removing the requirement of specifying a fixed distortion bound. 
+SJGCiw5gl,,1478290000000.0,1488750000000.0,427,Pruning Convolutional Neural Networks for Resource Efficient Inference,"[""pmolchanov@nvidia.com"", ""styree@nvidia.com"", ""tkarras@nvidia.com"", ""taila@nvidia.com"", ""jkautz@nvidia.com""]","[""Pavlo Molchanov"", ""Stephen Tyree"", ""Tero Karras"", ""Timo Aila"", ""Jan Kautz""]","[""Deep learning"", ""Transfer Learning""]","We propose a new formulation for pruning convolutional kernels in neural networks to enable efficient inference. We interleave greedy criteria-based pruning with fine-tuning by backpropagation, a computationally efficient procedure that maintains good generalization in the pruned network. We propose a new criterion based on Taylor expansion that approximates the change in the cost function induced by pruning network parameters. We focus on transfer learning, where large pretrained networks are adapted to specialized tasks. The proposed criterion demonstrates superior performance compared to other criteria, e.g. the norm of kernel weights or feature map activation, for pruning large CNNs after adaptation to fine-grained classification tasks (Birds-200 and Flowers-102), relying only on first-order gradient information. We also show that pruning can lead to more than 10x theoretical reduction in adapted 3D-convolutional filters with a small drop in accuracy in a recurrent gesture classifier. Finally, we show results for the large-scale ImageNet dataset to emphasize the flexibility of our approach.",/pdf/419058b627b99e2b53ec85b8a43876b1a9307254.pdf,ICLR,2017,New approach for removing unnecessary conv neurons from network. Work is focused on how to estimate importance fast and efficiently by Taylor expansion. 
+ryxtWgSKPB,H1x36elKDr,1569440000000.0,1577170000000.0,2143,Quantum Optical Experiments Modeled by Long Short-Term Memory,"[""adler@ml.jku.at"", ""manuel.erhard@univie.ac.at"", ""mario.krenn@univie.ac.at"", ""brandstetter@ml.jku.at"", ""kofler@ml.jku.at"", ""hochreit@ml.jku.at""]","[""Thomas Adler"", ""Manuel Erhard"", ""Mario Krenn"", ""Johannes Brandstetter"", ""Johannes Kofler"", ""Sepp Hochreiter""]","[""Recurrent Networks"", ""LSTM"", ""Sequence Analysis"", ""Binary Classification""]","We demonstrate how machine learning is able to model experiments in quantum physics. Quantum entanglement is a cornerstone for upcoming quantum technologies such as quantum computation and quantum cryptography. Of particular interest are complex quantum states with more than two particles and a large number of entangled quantum levels. Given such a multiparticle high-dimensional quantum state, it is usually impossible to reconstruct an experimental setup that produces it. To search for interesting experiments, one thus has to randomly create millions of setups on a computer and calculate the respective output states. In this work, we show that machine learning models can provide significant improvement over random search. We demonstrate that a long short-term memory (LSTM) neural network can successfully learn to model quantum experiments by correctly predicting output state characteristics for given setups without the necessity of computing the states themselves. This approach not only allows for faster search but is also an essential step towards automated design of multiparticle high-dimensional quantum experiments using generative machine learning models.",/pdf/e90a912a0f6b4596f6bd5b15f847f78fea98a105.pdf,ICLR,2020,We demonstrate how machine learning is able to model experiments in quantum physics. 
+BJNRFNlRW,SJmAKVxAW,1509080000000.0,1518730000000.0,242,TRAINING GENERATIVE ADVERSARIAL NETWORKS VIA PRIMAL-DUAL SUBGRADIENT METHODS: A LAGRANGIAN PERSPECTIVE ON GAN,"[""chenxugz@gmail.com"", ""wangjiangb@gmail.com"", ""haoge2013@u.northwestern.edu""]","[""Xu Chen"", ""Jiang Wang"", ""Hao Ge""]","[""GAN"", ""Primal-Dual Subgradient"", ""Mode Collapse"", ""Saddle Point""]","We relate the minimax game of generative adversarial networks (GANs) to finding the saddle points of the Lagrangian function for a convex optimization problem, where the discriminator outputs and the distribution of generator outputs play the roles of primal variables and dual variables, respectively. This formulation shows the connection between the standard GAN training process and the primal-dual subgradient methods for convex optimization. The inherent connection does not only provide a theoretical convergence proof for training GANs in the function space, but also inspires a novel objective function for training. The modified objective function forces the distribution of generator outputs to be updated along the direction according to the primal-dual subgradient methods. A toy example shows that the proposed method is able to resolve mode collapse, which in this case cannot be avoided by the standard GAN or Wasserstein GAN. Experiments on both Gaussian mixture synthetic data and real-world image datasets demonstrate the performance of the proposed method on generating diverse samples.",/pdf/aa86ff8da92c0a3b98de217198eaf5c444780120.pdf,ICLR,2018,We propose a primal-dual subgradient method for training GANs and this method effectively alleviates mode collapse. 
+HJgcw0Etwr,BygpTuw_vH,1569440000000.0,1577170000000.0,1184,Toward Understanding Generalization of Over-parameterized Deep ReLU network trained with SGD in Student-teacher Setting,"[""yuandong.tian@gmail.com""]","[""Yuandong Tian""]","[""deep ReLU network"", ""theoretical analysis"", ""generalization"", ""training dynamics"", ""student teacher setting"", ""interpolation region"", ""over-parameterization""]","To analyze deep ReLU network, we adopt a student-teacher setting in which an over-parameterized student network learns from the output of a fixed teacher network of the same depth, with Stochastic Gradient Descent (SGD). Our contributions are two-fold. First, we prove that when the gradient is zero (or bounded above by a small constant) at every data point in training, a situation called \emph{interpolation setting}, there exists many-to-one \emph{alignment} between student and teacher nodes in the lowest layer under mild conditions. This suggests that generalization in unseen dataset is achievable, even the same condition often leads to zero training error. Second, analysis of noisy recovery and training dynamics in 2-layer network shows that strong teacher nodes (with large fan-out weights) are learned first and subtle teacher nodes are left unlearned until late stage of training. As a result, it could take a long time to converge into these small-gradient critical points. Our analysis shows that over-parameterization plays two roles: (1) it is a necessary condition for alignment to happen at the critical points, and (2) in training dynamics, it helps student nodes cover more teacher nodes with fewer iterations. Both improve generalization. Experiments justify our finding.",/pdf/47c8a01f61aca9bacd032a057aba0de0ac97f717.pdf,ICLR,2020,This paper analyzes training dynamics and critical points of training deep ReLU network via SGD in the teacher-student setting. 
+SkeWc2EKPH,S1xdet7A8B,1569440000000.0,1577170000000.0,105,Model-free Learning Control of Nonlinear Stochastic Systems with Stability Guarantee,"[""mhhan@hit.edu.cn"", ""yuantian013@163.com"", ""lixianzhang@hit.edu.cn"", ""jun.wang@cs.ucl.ac.uk"", ""wei.pan@tudelft.nl""]","[""Minghao Han"", ""Yuan Tian"", ""Lixian Zhang"", ""Jun Wang"", ""Wei Pan""]","[""Reinforcement learning"", ""nonlinear stochastic system"", ""Lyapunov""]","Reinforcement learning (RL) offers a principled way to achieve the optimal cumulative performance index in discrete-time nonlinear stochastic systems, which are modeled as Markov decision processes. Its integration with deep learning techniques has promoted the field of deep RL with an impressive performance in complicated continuous control tasks. However, from a control-theoretic perspective, the first and most important property of a system to be guaranteed is stability. Unfortunately, stability is rarely assured in RL and remains an open question. In this paper, we propose a stability guaranteed RL framework which simultaneously learns a Lyapunov function along with the controller or policy, both of which are parameterized by deep neural networks, by borrowing the concept of Lyapunov function from control theory. Our framework can not only offer comparable or superior control performance over state-of-the-art RL algorithms, but also construct a Lyapunov function to validate the closed-loop stability. In the simulated experiments, our approach is evaluated on several well-known examples including classic CartPole balancing, 3-dimensional robot control and control of synthetic biology gene regulatory networks. Compared with RL algorithms without stability guarantee, our approach can enable the system to recover to the operating point when interfered by uncertainties such as unseen disturbances and system parametric variations to a certain extent. 
",/pdf/38c6a17e84be962a2994c51873793ee04ebe4499.pdf,ICLR,2020,A stability guaranteed reinforcement learning framework for the stabilization and tracking problems in discrete-time nonlinear stochastic systems +r1Zi2Mb0-,rJ_OnGbRZ,1509140000000.0,1518730000000.0,971,EXPLORING NEURAL ARCHITECTURE SEARCH FOR LANGUAGE TASKS,"[""luong.m.thang@gmail.com"", ""ddohan@google.com"", ""adamsyuwei@gmail.com"", ""qvl@google.com"", ""barretzoph@google.com"", ""vrv@google.com""]","[""Minh-Thang Luong"", ""David Dohan"", ""Adams Wei Yu"", ""Quoc V. Le"", ""Barret Zoph"", ""Vijay Vasudevan""]","[""Neural architecture search"", ""language tasks"", ""neural machine translation"", ""reading comprehension"", ""SQuAD""]","Neural architecture search (NAS), the task of finding neural architectures automatically, has recently emerged as a promising approach for unveiling better models over human-designed ones. However, most success stories are for vision tasks and have been quite limited for text, except for a small language modeling setup. In this paper, we explore NAS for text sequences at scale, by first focusing on the task of language translation and later extending to reading comprehension. From a standard sequence-to-sequence models for translation, we conduct extensive searches over the recurrent cells and attention similarity functions across two translation tasks, IWSLT English-Vietnamese and WMT German-English. We report challenges in performing cell searches as well as demonstrate initial success on attention searches with translation improvements over strong baselines. In addition, we show that results on attention searches are transferable to reading comprehension on the SQuAD dataset.",/pdf/14cabdce2b24dabb021947efc915897b476af354.pdf,ICLR,2018,"We explore neural architecture search for language tasks. Recurrent cell search is challenging for NMT, but attention mechanism search works. The result of attention search on translation is transferable to reading comprehension." 
+r1xFE3Rqt7,HJx8A_h5FX,1538090000000.0,1545360000000.0,1472,Adaptive Mixture of Low-Rank Factorizations for Compact Neural Modeling,"[""iamtingchen@gmail.com"", ""lin-j14@mails.tsinghua.edu.cn"", ""tianlin@google.com"", ""chongw@google.com"", ""dennyzhou@google.com"", ""hansong8811@gmail.com""]","[""Ting Chen"", ""Ji Lin"", ""Tian Lin"", ""Song Han"", ""Chong Wang"", ""Denny Zhou""]","[""Low-Rank Factorization"", ""Compact Neural Nets"", ""Efficient Modeling"", ""Mixture models""]","Modern deep neural networks have a large amount of weights, which make them difficult to deploy on computation constrained devices such as mobile phones. One common approach to reduce the model size and computational cost is to use low-rank factorization to approximate a weight matrix. However, performing standard low-rank factorization with a small rank can hurt the model expressiveness and significantly decrease the performance. In this work, we propose to use a mixture of multiple low-rank factorizations to model a large weight matrix, and the mixture coefficients are computed dynamically depending on its input. We demonstrate the effectiveness of the proposed approach on both language modeling and image classification tasks. Experiments show that our method not only improves the computation efficiency but also maintains (sometimes outperforms) its accuracy compared with the full-rank counterparts.",/pdf/7aa18f03b4fb999ddd1f8b87b0252025448fbe2b.pdf,ICLR,2019,We propose a simple modification to low-rank factorization that improves performances (in both image and language tasks) while still being compact. 
+6UdQLhqJyFD,8Dv9eVOsYCf,1601310000000.0,1615990000000.0,1253,Parameter Efficient Multimodal Transformers for Video Representation Learning,"[""~Sangho_Lee1"", ""~Youngjae_Yu1"", ""~Gunhee_Kim1"", ""~Thomas_Breuel1"", ""~Jan_Kautz1"", ""~Yale_Song1""]","[""Sangho Lee"", ""Youngjae Yu"", ""Gunhee Kim"", ""Thomas Breuel"", ""Jan Kautz"", ""Yale Song""]","[""Self-supervised learning"", ""audio-visual representation learning"", ""video representation learning""]","The recent success of Transformers in the language domain has motivated adapting it to a multimodal setting, where a new visual model is trained in tandem with an already pretrained language model. However, due to the excessive memory requirements from Transformers, existing work typically fixes the language model and trains only the vision module, which limits its ability to learn cross-modal information in an end-to-end manner. In this work, we focus on reducing the parameters of multimodal Transformers in the context of audio-visual video representation learning. We alleviate the high memory requirement by sharing the parameters of Transformers across layers and modalities; we decompose the Transformer into modality-specific and modality-shared parts so that the model learns the dynamics of each modality both individually and together, and propose a novel parameter sharing scheme based on low-rank approximation. We show that our approach reduces parameters of the Transformers up to 97%, allowing us to train our model end-to-end from scratch. We also propose a negative sampling approach based on an instance similarity measured on the CNN embedding space that our model learns together with the Transformers. 
To demonstrate our approach, we pretrain our model on 30-second clips (480 frames) from Kinetics-700 and transfer it to audio-visual classification tasks.",/pdf/2cccd8332d73cd6ac7e33027594d30f19af464d5.pdf,ICLR,2021,We propose a technique to reduce the number of parameters in multimodal BERT models up to 97% (from 128 million to 4 million parameters). +Qpik5XBv_1-,xxwiGC_sxAL,1601310000000.0,1614990000000.0,3429,Language Controls More Than Top-Down Attention: Modulating Bottom-Up Visual Processing with Referring Expressions,"[""~Ozan_Arkan_Can1"", ""~Ilker_Kesen1"", ""~Deniz_Yuret1""]","[""Ozan Arkan Can"", ""Ilker Kesen"", ""Deniz Yuret""]","[""Referring Expression Understanding"", ""Language-Vision Problems"", ""Grounded Language Understanding""]","How to best integrate linguistic and perceptual processing in multimodal tasks is an important open problem. In this work we argue that the common technique of using language to direct visual attention over high-level visual features may not be optimal. Using language throughout the bottom-up visual pathway, going from pixels to high-level features, may be necessary. Our experiments on several English referring expression datasets show significant improvements when language is used to control the filters for bottom-up visual processing in addition to top-down attention.",/pdf/4ab86a642f612cffb87fd273aa47be6991f83cc0.pdf,ICLR,2021,We modulate both top-down and bottom-up visual processing with referring expressions. +ByG_3s09KX,Byx_mn8cKQ,1538090000000.0,1545360000000.0,725,Dopamine: A Research Framework for Deep Reinforcement Learning,"[""psc@google.com"", ""smoitra@google.com"", ""cgel@google.com"", ""kumasaurabh@google.com"", ""bellemare@google.com""]","[""Pablo Samuel Castro"", ""Subhodeep Moitra"", ""Carles Gelada"", ""Saurabh Kumar"", ""Marc G. 
Bellemare""]","[""reinforcement learning"", ""software"", ""framework"", ""reproducibility""]","Deep reinforcement learning (deep RL) research has grown significantly in recent years. A number of software offerings now exist that provide stable, comprehensive implementations for benchmarking. At the same time, recent deep RL research +has become more diverse in its goals. In this paper we introduce Dopamine, a new research framework for deep RL that aims to support some of that diversity. Dopamine is open-source, TensorFlow-based, and provides compact yet reliable +implementations of some state-of-the-art deep RL agents. We complement this offering with a taxonomy of the different research objectives in deep RL research. While by no means exhaustive, our analysis highlights the heterogeneity of research +in the field, and the value of frameworks such as ours.",/pdf/2350e7f4367c141be57d64c6b006719ae619d06b.pdf,ICLR,2019,"In this paper we introduce Dopamine, a new research framework for deep RL that is open-source, TensorFlow-based, and provides compact yet reliable implementations of some state-of-the-art deep RL agents." +SJzYdsAqY7,S1gUENCYYX,1538090000000.0,1545360000000.0,373,Spatial-Winograd Pruning Enabling Sparse Winograd Convolution,"[""jiecaoyu@umich.edu"", ""jongsoo@fb.com"", ""mnaumov@fb.com""]","[""Jiecao Yu"", ""Jongsoo Park"", ""Maxim Naumov""]","[""deep learning"", ""convolutional neural network"", ""pruning"", ""Winograd convolution""]","Deep convolutional neural networks (CNNs) are deployed in various applications but demand immense computational requirements. Pruning techniques and Winograd convolution are two typical methods to reduce the CNN computation. However, they cannot be directly combined because Winograd transformation fills in the sparsity resulting from pruning. Li et al. 
(2017) propose sparse Winograd convolution in which weights are directly pruned in the Winograd domain, but this technique is not very practical because Winograd-domain retraining requires low learning rates and hence significantly longer training time. Besides, Liu et al. (2018) move the ReLU function into the Winograd domain, which can help increase the weight sparsity but requires changes in the network structure. To achieve a high Winograd-domain weight sparsity without changing network structures, we propose a new pruning method, spatial-Winograd pruning. As the first step, spatial-domain weights are pruned in a structured way, which efficiently transfers the spatial-domain sparsity into the Winograd domain and avoids Winograd-domain retraining. For the next step, we also perform pruning and retraining directly in the Winograd domain but propose to use an importance factor matrix to adjust weight importance and weight gradients. This adjustment makes it possible to effectively retrain the pruned Winograd-domain network without changing the network structure. For the three models on the datasets of CIFAR-10, CIFAR-100, and ImageNet, our proposed method can achieve the Winograd-domain sparsities of 63%, 50%, and 74%, respectively.",/pdf/4fc9ea87f70428355bb01bcf4ebc408de6e19c73.pdf,ICLR,2019,"To accelerate the computation of convolutional neural networks, we propose a new two-step pruning technique which achieves a higher Winograd-domain weight sparsity without changing the network structure." 
+HJPSN3gRW,BJ8SV3g0b,1509110000000.0,1518730000000.0,388,Learning to navigate by distilling visual information and natural language instructions,"[""abhsinha@adobe.com"", ""akb@adobe.com"", ""msarkar@adobe.com"", ""kbalaji@adobe.com""]","[""Abhishek Sinha"", ""Akilesh B"", ""Mausoom Sarkar"", ""Balaji Krishnamurthy""]","[""Deep reinforcement learning"", ""Computer Vision"", ""Multi-modal fusion"", ""Language Grounding""]","In this work, we focus on the problem of grounding language by training an agent +to follow a set of natural language instructions and navigate to a target object +in a 2D grid environment. The agent receives visual information through raw +pixels and a natural language instruction telling what task needs to be achieved. +Other than these two sources of information, our model does not have any prior +information of both the visual and textual modalities and is end-to-end trainable. +We develop an attention mechanism for multi-modal fusion of visual and textual +modalities that allows the agent to learn to complete the navigation tasks and also +achieve language grounding. Our experimental results show that our attention +mechanism outperforms the existing multi-modal fusion mechanisms proposed in +order to solve the above mentioned navigation task. We demonstrate through the +visualization of attention weights that our model learns to correlate attributes of +the object referred in the instruction with visual representations and also show +that the learnt textual representations are semantically meaningful as they follow +vector arithmetic and are also consistent enough to induce translation between instructions +in different natural languages. We also show that our model generalizes +effectively to unseen scenarios and exhibit zero-shot generalization capabilities. 
+In order to simulate the above described challenges, we introduce a new 2D environment +for an agent to jointly learn visual and textual modalities",/pdf/da88511a0bbaf4aaccfed40961e448fb2308ffff.pdf,ICLR,2018,Attention based architecture for language grounding via reinforcement learning in a new customizable 2D grid environment +tq5JAGsedIP,aqFTIW9U46t,1601310000000.0,1614990000000.0,1157,Time-varying Graph Representation Learning via Higher-Order Skip-Gram with Negative Sampling,"[""~Simone_Piaggesi1"", ""panisson@gmail.com""]","[""Simone Piaggesi"", ""Andr\u00e9 Panisson""]","[""representation learning"", ""node embeddings"", ""temporal graphs"", ""tensor factorization"", ""disease spreading""]","Representation learning models for graphs are a successful family of techniques that project nodes into feature spaces that can be exploited by other machine learning algorithms. Since many real-world networks are inherently dynamic, with interactions among nodes changing over time, these techniques can be defined both for static and for time-varying graphs. Here, we show how the skip-gram embedding approach can be used to perform implicit tensor factorization on different tensor representations of time-varying graphs. We show that higher-order skip-gram with negative sampling (HOSGNS) is able to disentangle the role of nodes and time, with a small fraction of the number of parameters needed by other approaches. We empirically evaluate our approach using time-resolved face-to-face proximity data, showing that the learned representations outperform state-of-the-art methods when used to solve downstream tasks such as network reconstruction. 
Good performance on predicting the outcome of dynamical processes such as disease spreading shows the potential of this new method to estimate contagion risk, providing early risk awareness based on contact tracing data.",/pdf/068ec8e7d268336972115c6be5fe8b07fe833d6c.pdf,ICLR,2021,Unsupervised representation learning algorithm for temporal graphs which disentangles structural and temporal features into different embedding matrices. +H18WqugAb,ryNb9dlRb,1509100000000.0,1518730000000.0,314,Still not systematic after all these years: On the compositional skills of sequence-to-sequence recurrent networks,"[""brenden@nyu.edu"", ""marco.baroni@unitn.it""]","[""Brenden Lake"", ""Marco Baroni""]","[""sequence-to-sequence recurrent networks"", ""compositionality"", ""systematicity"", ""generalization"", ""language-driven navigation""]","Humans can understand and produce new utterances effortlessly, thanks to their systematic compositional skills. Once a person learns the meaning of a new verb ""dax,"" he or she can immediately understand the meaning of ""dax twice"" or ""sing and dax."" In this paper, we introduce the SCAN domain, consisting of a set of simple compositional navigation commands paired with the corresponding action sequences. We then test the zero-shot generalization capabilities of a variety of recurrent neural networks (RNNs) trained on SCAN with sequence-to-sequence methods. We find that RNNs can generalize well when the differences between training and test commands are small, so that they can apply ""mix-and-match"" strategies to solve the task. However, when generalization requires systematic compositional skills (as in the ""dax"" example above), RNNs fail spectacularly. 
We conclude with a proof-of-concept experiment in neural machine translation, supporting the conjecture that lack of systematicity is an important factor explaining why neural networks need very large training sets.",/pdf/a49276b162d7a7d610f7d74be1cee0c4fddb203f.pdf,ICLR,2018,"Using a simple language-driven navigation task, we study the compositional capabilities of modern seq2seq recurrent networks." +B1GIQhCcYm,ryerFq35F7,1538090000000.0,1545360000000.0,1363,Unsupervised one-to-many image translation,"[""samuel.lavoie-marchildon@umontreal.ca"", ""sebastien.lachapelle@umontreal.ca"", ""mikbinkowski@gmail.com"", ""aaron.courville@gmail.com"", ""yoshua.umontreal@gmail.com"", ""devon.hjelm@microsoft.com""]","[""Samuel Lavoie-Marchildon"", ""Sebastien Lachapelle"", ""Miko\u0142aj Bi\u0144kowski"", ""Aaron Courville"", ""Yoshua Bengio"", ""R Devon Hjelm""]","[""Image-to-image"", ""Translation"", ""Unsupervised"", ""Generation"", ""Adversarial"", ""Learning""]","We perform completely unsupervised one-sided image to image translation between a source domain $X$ and a target domain $Y$ such that we preserve relevant underlying shared semantics (e.g., class, size, shape, etc). +In particular, we are interested in a more difficult case than those typically addressed in the literature, where the source and target are ``far"" enough that reconstruction-style or pixel-wise approaches fail. +We argue that transferring (i.e., \emph{translating}) said relevant information should involve both discarding source domain-specific information while incorporate target domain-specific information, the latter of which we model with a noisy prior distribution. +In order to avoid the degenerate case where the generated samples are only explained by the prior distribution, we propose to minimize an estimate of the mutual information between the generated sample and the sample from the prior distribution. 
We discover that the architectural choices are an important factor to consider in order to preserve the shared semantics between $X$ and $Y$. +We show state of the art results on the MNIST to SVHN task for unsupervised image to image translation.",/pdf/91fe2252aae23d28681a05d8039bf213989a7fd1.pdf,ICLR,2019,We train an image to image translation network that takes as input the source image and a sample from a prior distribution to generate a sample from the target distribution +Byx1VnR9K7,rJe5tfC5KQ,1538090000000.0,1545360000000.0,1412,Trajectory VAE for multi-modal imitation,"[""xiaoyu.lu@stats.ox.ac.uk"", ""t-jastuh@microsoft.com"", ""katja.hofmann@microsoft.com""]","[""Xiaoyu Lu"", ""Jan Stuehmer"", ""Katja Hofmann""]","[""imitation learning"", ""latent variable model"", ""variational autoencoder"", ""diverse behaviour""]","We address the problem of imitating multi-modal expert demonstrations in sequential decision making problems. In many practical applications, for example video games, behavioural demonstrations are readily available that contain multi-modal structure not captured by typical existing imitation learning approaches. For example, differences in the observed players' behaviours may be representative of different underlying playstyles. + + In this paper, we use a generative model to capture different emergent playstyles in an unsupervised manner, enabling the imitation of a diverse range of distinct behaviours. We utilise a variational autoencoder to learn an embedding of the different types of expert demonstrations on the trajectory level, and jointly learn a latent representation with a policy. In experiments on a range of 2D continuous control problems representative of Minecraft environments, we empirically demonstrate that our model can capture a multi-modal structured latent space from the demonstrated behavioural trajectories. 
",/pdf/52c9267181e391cc67d809146013480f00272fa9.pdf,ICLR,2019,A trajectory-VAE method for imitating multi-modal expert demonstrations in sequential decision making problems. +rkl8dlHYvB,S1lj2nxFvH,1569440000000.0,1631940000000.0,2398,Learning to Group: A Bottom-Up Framework for 3D Part Discovery in Unseen Categories,"[""luotg@pku.edu.cn"", ""kaichun@cs.stanford.edu"", ""z2huang@eng.ucsd.edu"", ""jxuat@connect.ust.hk"", ""sy89128@mail.ustc.edu.cn"", ""wanglw@cis.pku.edu.cn"", ""haosu@eng.ucsd.edu""]","[""Tiange Luo"", ""Kaichun Mo"", ""Zhiao Huang"", ""Jiarui Xu"", ""Siyu Hu"", ""Liwei Wang"", ""Hao Su""]","[""Shape Segmentation"", ""Zero-Shot Learning"", ""Learning Representations""]","We address the problem of learning to discover 3D parts for objects in unseen categories. Being able to learn the geometry prior of parts and transfer this prior to unseen categories pose fundamental challenges on data-driven shape segmentation approaches. Formulated as a contextual bandit problem, we propose a learning-based iterative grouping framework which learns a grouping policy to progressively merge small part proposals into bigger ones in a bottom-up fashion. At the core of our approach is to restrict the local context for extracting part-level features, which encourages the generalizability to novel categories. On a recently proposed large-scale fine-grained 3D part dataset, PartNet, we demonstrate that our method can transfer knowledge of parts learned from 3 training categories to 21 unseen testing categories without seeing any annotated samples. Quantitative comparisons against four strong shape segmentation baselines show that we achieve the state-of-the-art performance.",/pdf/36cca59d178d0867d4d25d1c700b6fb44e718d86.pdf,ICLR,2020,"A zero-shot segmentation framework for 3D shapes. Model the segmentation as a decision-making process, we propose an iterative method to dynamically extend the receptive field for achieving universal shape segmentation." 
+ZS-9XoX20AV,hVg508o0Oan,1601310000000.0,1614990000000.0,179,GraphSAD: Learning Graph Representations with Structure-Attribute Disentanglement,"[""~Minghao_Xu1"", ""~Hang_Wang1"", ""~Bingbing_Ni3"", ""~Wenjun_Zhang3"", ""~Jian_Tang1""]","[""Minghao Xu"", ""Hang Wang"", ""Bingbing Ni"", ""Wenjun Zhang"", ""Jian Tang""]","[""Graph Representation Learning"", ""Disentangled Representation Learning""]","Graph Neural Networks (GNNs) learn effective node/graph representations by aggregating the attributes of neighboring nodes, which commonly derives a single representation mixing the information of graph structure and node attributes. However, these two kinds of information might be semantically inconsistent and could be useful for different tasks. In this paper, we aim at learning node/graph representations with Structure-Attribute Disentanglement (GraphSAD). We propose to disentangle graph structure and node attributes into two distinct sets of representations, and such disentanglement can be done in either the input or the embedding space. We further design a metric to quantify the extent of such a disentanglement. Extensive experiments on multiple datasets show that our approach can indeed disentangle the semantics of graph structure and node attributes, and it achieves superior performance on both node and graph classification tasks.",/pdf/f64b9eac62c200cedf849de32667dcb9f050ff1d.pdf,ICLR,2021,This work seeks to learn structure-attribute disentangled node/graph representations and measure such disentanglement quantitatively. 
+NSBrFgJAHg,RugqeQwPnk,1601310000000.0,1615820000000.0,2013,Degree-Quant: Quantization-Aware Training for Graph Neural Networks,"[""~Shyam_Anil_Tailor1"", ""~Javier_Fernandez-Marques1"", ""~Nicholas_Donald_Lane1""]","[""Shyam Anil Tailor"", ""Javier Fernandez-Marques"", ""Nicholas Donald Lane""]","[""Graph neural networks"", ""quantization"", ""benchmark""]","Graph neural networks (GNNs) have demonstrated strong performance on a wide variety of tasks due to their ability to model non-uniform structured data. Despite their promise, there exists little research exploring methods to make them more efficient at inference time. In this work, we explore the viability of training quantized GNNs, enabling the usage of low precision integer arithmetic during inference. For GNNs seemingly unimportant choices in quantization implementation cause dramatic changes in performance. We identify the sources of error that uniquely arise when attempting to quantize GNNs, and propose an architecturally-agnostic and stable method, Degree-Quant, to improve performance over existing quantization-aware training baselines commonly used on other architectures, such as CNNs. We validate our method on six datasets and show, unlike previous quantization attempts, that models generalize to unseen graphs. Models trained with Degree-Quant for INT8 quantization perform as well as FP32 models in most cases; for INT4 models, we obtain up to 26% gains over the baselines. 
Our work enables up to 4.7x speedups on CPU when using INT8 arithmetic.",/pdf/af4deaf6557487a509442a82fcff188259e4183a.pdf,ICLR,2021,"We provide a training technique that enables graph neural networks to use low precision integer arithmetic at inference time, yielding up to 4.7x latency improvements on CPU" +SJeYe0NtvH,BkgGGFXODS,1569440000000.0,1583910000000.0,936,Neural Text Generation With Unlikelihood Training,"[""wellecks@nyu.edu"", ""kulikov@cs.nyu.edu"", ""roller@fb.com"", ""edinan@fb.com"", ""kyunghyun.cho@nyu.edu"", ""jase@fb.com""]","[""Sean Welleck"", ""Ilia Kulikov"", ""Stephen Roller"", ""Emily Dinan"", ""Kyunghyun Cho"", ""Jason Weston""]","[""language modeling"", ""machine learning""]","Neural text generation is a key tool in natural language applications, but it is well known there are major problems at its core. In particular, standard likelihood training and decoding leads to dull and repetitive outputs. While some post-hoc fixes have been proposed, in particular top-k and nucleus sampling, they do not address the fact that the token-level probabilities predicted by the model are poor. In this paper we show that the likelihood objective itself is at fault, resulting in a model that assigns too much probability to sequences containing repeats and frequent words, unlike those from the human training distribution. We propose a new objective, unlikelihood training, which forces unlikely generations to be assigned lower probability by the model. We show that both token and sequence level unlikelihood training give less repetitive, less dull text while maintaining perplexity, giving superior generations using standard greedy or beam search. 
According to human evaluations, our approach with standard beam search also outperforms the currently popular decoding methods of nucleus sampling or beam blocking, thus providing a strong alternative to existing techniques.",/pdf/b1605077dc551ef983c69ef382a249fb5048060d.pdf,ICLR,2020, +Byk4My-RZ,S1JEzyZ0W,1509120000000.0,1518730000000.0,478,Flexible Prior Distributions for Deep Generative Models,"[""yannic.kilcher@inf.ethz.ch"", ""aurelien.lucchi@inf.ethz.ch"", ""thomas.hofmann@inf.ethz.ch""]","[""Yannic Kilcher"", ""Aurelien Lucchi"", ""Thomas Hofmann""]","[""Deep Generative Models"", ""GANs""]","We consider the problem of training generative models with deep neural networks as generators, i.e. to map latent codes to data points. Whereas the dominant paradigm combines simple priors over codes with complex deterministic models, +we argue that it might be advantageous to use more flexible code distributions. We demonstrate how these distributions can be induced directly from the data. The benefits include: more powerful generative models, better modeling of latent +structure and explicit control of the degree of generalization.",/pdf/4863e65cee49f58bba161df4102c34902b115f58.pdf,ICLR,2018, +rkmtTJZCb,ryGt6JZCb,1509130000000.0,1518730000000.0,529,Unsupervised Hierarchical Video Prediction,"[""wichersn@google.com"", ""dumitru@google.com"", ""honglak@google.com""]","[""Nevan Wichers"", ""Dumitru Erhan"", ""Honglak Lee""]","[""video prediction"", ""visual analogy network"", ""unsupervised"", ""hierarchical""]","Much recent research has been devoted to video prediction and generation, but mostly for short-scale time horizons. The hierarchical video prediction method by Villegas et al. (2017) is an example of a state of the art method for long term video prediction. However, their method has limited applicability in practical settings as it requires a ground truth pose (e.g., poses of joints of a human) at training time. 
This paper presents a long term hierarchical video prediction model that does not have such a restriction. We show that the network learns its own higher-level structure (e.g., pose equivalent hidden variables) that works better in cases where the ground truth pose does not fully capture all of the information needed to predict the next frame. This method gives sharper results than other video prediction methods which do not require a ground truth pose, and its efficiency is shown on the Humans 3.6M and Robot Pushing datasets.",/pdf/bd96a063e018ce7c3e313c464aea295d72490386.pdf,ICLR,2018,We show ways to train a hierarchical video prediction model without needing pose labels. +rkxn7nR5KX,S1liVh3cF7,1538090000000.0,1545360000000.0,1396,Incremental Few-Shot Learning with Attention Attractor Networks,"[""mren@cs.toronto.edu"", ""rjliao@cs.toronto.edu"", ""ethanf@cs.toronto.edu"", ""zemel@cs.toronto.edu""]","[""Mengye Ren"", ""Renjie Liao"", ""Ethan Fetaya"", ""Richard S. Zemel""]","[""meta-learning"", ""few-shot learning"", ""incremental learning""]","Machine learning classifiers are often trained to recognize a set of pre-defined classes. However, +in many real applications, it is often desirable to have the flexibility of learning additional +concepts, without re-training on the full training set. This paper addresses this problem, +incremental few-shot learning, where a regular classification network has already been trained to +recognize a set of base classes; and several extra novel classes are being considered, each with +only a few labeled examples. After learning the novel classes, the model is then evaluated on the +overall performance of both base and novel classes. To this end, we propose a meta-learning model, +the Attention Attractor Network, which regularizes the learning of novel classes. 
In each episode, +we train a set of new weights to recognize novel classes until they converge, and we show that the +technique of recurrent back-propagation can back-propagate through the optimization process and +facilitate the learning of the attractor network regularizer. We demonstrate that the learned +attractor network can recognize novel classes while remembering old classes without the need to +review the original training set, outperforming baselines that do not rely on an iterative +optimization process.",/pdf/489c9b46fc9d688b01abbded8c238e22f89d2f17.pdf,ICLR,2019, +rye9LT8cee,,1478290000000.0,1484510000000.0,341,Alternating Direction Method of Multipliers for Sparse Convolutional Neural Networks,"[""farkhondeh.kiaee.1@ulaval.ca"", ""christian.gagne@gel.ulaval.ca"", ""mahdieh.abbasi.1@ulaval.ca""]","[""Farkhondeh Kiaee"", ""Christian Gagn\u00e9"", ""and Mahdieh Abbasi""]","[""Deep learning"", ""Computer vision"", ""Optimization""]","The storage and computation requirements of Convolutional Neural Networks (CNNs) can be prohibitive for exploiting these models over low-power or embedded devices. This paper reduces the computational complexity of the CNNs by minimizing an objective function, including the recognition loss that is augmented with a sparsity-promoting penalty term. The sparsity structure of the network is identified using the Alternating Direction Method of Multipliers (ADMM), which is widely used in large optimization problems. This method alternates between promoting the sparsity of the network and optimizing the recognition performance, which allows us to exploit the two-part structure of the corresponding objective functions. In particular, we take advantage of the separability of the sparsity-inducing penalty functions to decompose the minimization problem into sub-problems that can be solved sequentially. 
Applying our method to a variety of state-of-the-art CNN models, our proposed method is able to simplify the original model, generating models with less computation and fewer parameters, while maintaining and often improving generalization performance. Accomplishments on a variety of models strongly verify that our proposed ADMM-based method can be a very useful tool for simplifying and improving deep CNNs. ",/pdf/2a7a79259293dbf09747e4ce766ae78301a3c2f9.pdf,ICLR,2017,A method to sparsify (prune) pre-trained deep neural networks. +rkxoh24FPH,HklEb39GvH,1569440000000.0,1583910000000.0,202,On Mutual Information Maximization for Representation Learning,"[""mi.tschannen@gmail.com"", ""josip@djolonga.com"", ""paruby@gmail.com"", ""sylvaingelly@google.com"", ""lucic@google.com""]","[""Michael Tschannen"", ""Josip Djolonga"", ""Paul K. Rubenstein"", ""Sylvain Gelly"", ""Mario Lucic""]","[""mutual information"", ""representation learning"", ""unsupervised learning"", ""self-supervised learning""]","Many recent methods for unsupervised or self-supervised representation learning train feature extractors by maximizing an estimate of the mutual information (MI) between different views of the data. This comes with several immediate problems: For example, MI is notoriously hard to estimate, and using it as an objective for representation learning may lead to highly entangled representations due to its invariance under arbitrary invertible transformations. Nevertheless, these methods have been repeatedly shown to excel in practice. In this paper we argue, and provide empirical evidence, that the success of these methods cannot be attributed to the properties of MI alone, and that they strongly depend on the inductive bias in both the choice of feature extractor architectures and the parametrization of the employed MI estimators. 
Finally, we establish a connection to deep metric learning and argue that this interpretation may be a plausible explanation for the success of the recently introduced methods.",/pdf/5f764e457d8f1897157708a27a04cc50a3ded94c.pdf,ICLR,2020,The success of recent mutual information (MI)-based representation learning approaches strongly depends on the inductive bias in both the choice of network architectures and the parametrization of the employed MI estimators. +VbCVU10R7K,JSWTVFztiXG,1601310000000.0,1614990000000.0,2249,Offline policy selection under Uncertainty,"[""~Mengjiao_Yang1"", ""~Bo_Dai1"", ""~Ofir_Nachum1"", ""~George_Tucker1"", ""~Dale_Schuurmans1""]","[""Mengjiao Yang"", ""Bo Dai"", ""Ofir Nachum"", ""George Tucker"", ""Dale Schuurmans""]","[""Off-policy selection"", ""reinforcement learning"", ""Bayesian inference""]","The presence of uncertainty in policy evaluation significantly complicates the process of policy ranking and selection in real-world settings. We formally consider offline policy selection as learning preferences over a set of policy prospects given a fixed experience dataset. While one can select or rank policies based on point estimates of their policy values or high-confidence intervals, access to the full distribution over one's belief of the policy value enables more flexible selection algorithms under a wider range of downstream evaluation metrics. We propose BayesDICE for estimating this belief distribution in terms of posteriors of distribution correction ratios derived from stochastic constraints (as opposed to explicit likelihood, which is not available). Empirically, BayesDICE is highly competitive to existing state-of-the-art approaches in confidence interval estimation. 
More importantly, we show how the belief distribution estimated by BayesDICE may be used to rank policies with respect to any arbitrary downstream policy selection metric, and we empirically demonstrate that this selection procedure significantly outperforms existing approaches, such as ranking policies according to mean or high-confidence lower bound value estimates.",/pdf/25ccb7bfb2f1c21b98f77b27918228df09ef9c12.pdf,ICLR,2021,"Formally defines offline policy selection in RL, and proposes Bayesian dual policy value posterior inference based on stochastic constraints, which enables a diverse set of policy selection algorithms under a wide range of evaluation metrics." +xiwHM0l55c3,n41K68Br2_E,1601310000000.0,1614990000000.0,3214,Monotonic neural network: combining deep learning with domain knowledge for chiller plants energy optimization,"[""~Fanhe_Ma1"", ""zhangfaen@ainnovation.com"", ""~Shenglan_Ben2"", ""~Shuxin_Qin1"", ""zhoupengcheng@ainnovation.com"", ""zhouchangsheng@ainnovation.com"", ""xufengyi@ainnovation.com""]","[""Fanhe Ma"", ""Faen Zhang"", ""Shenglan Ben"", ""Shuxin Qin"", ""pengcheng Zhou"", ""Changsheng Zhou"", ""Fengyi Xu""]",[],"In this paper, we are interested in building a domain knowledge based deep learning framework to solve the chiller plants energy optimization problems. Compared to the hotspot applications of deep learning (e.g. image classification and NLP), it is difficult to collect enormous data for deep network training in real-world physical systems. Most existing methods reduce the complex systems into linear model to facilitate the training on small samples. To tackle the small sample size problem, this paper considers domain knowledge in the structure and loss design of deep network to build a nonlinear model with lower redundancy function space. Specifically, the energy consumption estimation of most chillers can be physically viewed as an input-output monotonic problem. 
Thus, we can design a Neural Network with monotonic constraints to mimic the physical behavior of the system. We verify the proposed method in a cooling system of a data center, experimental results show the superiority of our framework in energy optimization compared to the existing ones.",/pdf/60c30292979fd7bcdd0c1806dab97f6f34839a28.pdf,ICLR,2021, +Hkgnii09Ym,H1lOZVN5KX,1538090000000.0,1545360000000.0,656,Set Transformer,"[""juho.lee@stats.ox.ac.uk"", ""einet89@gmail.com"", ""jtkim@postech.ac.kr"", ""adamk@robots.ox.ac.uk"", ""seungjin@postech.ac.kr"", ""y.w.teh@stats.ox.ac.uk""]","[""Juho Lee"", ""Yoonho Lee"", ""Jungtaek Kim"", ""Adam R. Kosiorek"", ""Seungjin Choi"", ""Yee Whye Teh""]","[""attention"", ""meta-learning"", ""set-input neural networks"", ""permutation invariant modeling""]","Many machine learning tasks such as multiple instance learning, 3D shape recognition and few-shot image classification are defined on sets of instances. Since solutions to such problems do not depend on the permutation of elements of the set, models used to address them should be permutation invariant. We present an attention-based neural network module, the Set Transformer, specifically designed to model interactions among elements in the input set. The model consists of an encoder and a decoder, both of which rely on attention mechanisms. In an effort to reduce computational complexity, we introduce an attention scheme inspired by inducing point methods from sparse Gaussian process literature. It reduces computation time of self-attention from quadratic to linear in the number of elements in the set. 
We show that our model is theoretically attractive and we evaluate it on a range of tasks, demonstrating increased performance compared to recent methods for set-structured data.",/pdf/87de8c8193bf9a20d1a2183e4d5c46b0d14feb77.pdf,ICLR,2019,Attention-based neural network to process set-structured data +BJgd7m0xRZ,BJ_QXReRW,1509120000000.0,1518730000000.0,449,Unsupervised Adversarial Anomaly Detection using One-Class Support Vector Machines,"[""pweerasinghe@student.unimelb.edu.au"", ""tansu.alpcan@unimelb.edu.au"", ""sarah.erfani@unimelb.edu.au"", ""caleckie@unimelb.edu.au""]","[""Prameesha Sandamal Weerasinghe"", ""Tansu Alpcan"", ""Sarah Monazam Erfani"", ""Christopher Leckie""]","[""anomaly detection"", ""one class support vector machine"", ""adversarial learning""]","Anomaly detection discovers regular patterns in unlabeled data and identifies the non-conforming data points, which in some cases are the result of malicious attacks by adversaries. Learners such as One-Class Support Vector Machines (OCSVMs) have been successfully applied in anomaly detection, yet their performance may degrade significantly in the presence of sophisticated adversaries, who target the algorithm itself by compromising the integrity of the training data. With the rise in the use of machine learning in mission critical day-to-day activities where errors may have significant consequences, it is imperative that machine learning systems are made secure. To address this, we propose a defense mechanism that is based on a contraction of the data, and we test its effectiveness using OCSVMs. The proposed approach introduces a layer of uncertainty on top of the OCSVM learner, making it infeasible for the adversary to guess the specific configuration of the learner. 
We theoretically analyze the effects of adversarial perturbations on the separating margin of OCSVMs and provide empirical evidence on several benchmark datasets, which show that by carefully contracting the data in low dimensional spaces, we can successfully identify adversarial samples that would not have been identifiable in the original dimensional space. The numerical results show that the proposed method improves OCSVMs performance significantly (2-7%)",/pdf/e300aaa0ef5246eda4d05d03b9748fe910c3c460.pdf,ICLR,2018,"A novel method to increase the resistance of OCSVMs against targeted, integrity attacks by selective nonlinear transformations of data to lower dimensions." +fw-BHZ1KjxJ,bqe8oEIkgWq,1601310000000.0,1616650000000.0,2128,SOLAR: Sparse Orthogonal Learned and Random Embeddings,"[""~Tharun_Medini1"", ""~Beidi_Chen1"", ""~Anshumali_Shrivastava1""]","[""Tharun Medini"", ""Beidi Chen"", ""Anshumali Shrivastava""]","[""Sparse Embedding"", ""Inverted Index"", ""Learning to Hash"", ""Embedding Models""]","Dense embedding models are commonly deployed in commercial search engines, wherein all the document vectors are pre-computed, and near-neighbor search (NNS) is performed with the query vector to find relevant documents. However, the bottleneck of indexing a large number of dense vectors and performing an NNS hurts the query time and accuracy of these models. In this paper, we argue that high-dimensional and ultra-sparse embedding is a significantly superior alternative to dense low-dimensional embedding for both query efficiency and accuracy. Extreme sparsity eliminates the need for NNS by replacing them with simple lookups, while its high dimensionality ensures that the embeddings are informative even when sparse. However, learning extremely high dimensional embeddings leads to blow up in the model size. 
To make the training feasible, we propose a partitioning algorithm that learns such high dimensional embeddings across multiple GPUs without any communication. This is facilitated by our novel asymmetric mixture of Sparse, Orthogonal, Learned and Random (SOLAR) Embeddings. The label vectors are random, sparse, and near-orthogonal by design, while the query vectors are learned and sparse. We theoretically prove that our way of one-sided learning is equivalent to learning both query and label embeddings. With these unique properties, we can successfully train 500K dimensional SOLAR embeddings for the tasks of searching through 1.6M books and multi-label classification on the three largest public datasets. We achieve superior precision and recall compared to the respective state-of-the-art baselines for each task with up to 10 times faster speed.",/pdf/b5e7f77ff9194315708e327c8be7e9ff929d902a.pdf,ICLR,2021,We propose a distributed training scheme to learn high dimensional sparse embeddings that are much better than dense embeddings on both precision and speed. +rkeIIkHKvS,SkgU1P6dwH,1569440000000.0,1583910000000.0,1730,Measuring and Improving the Use of Graph Information in Graph Neural Networks,"[""yfhou@cse.cuhk.edu.hk"", ""jzhang@cse.cuhk.edu.hk"", ""jcheng@cse.cuhk.edu.hk"", ""klma@cse.cuhk.edu.hk"", ""tbma@comp.nus.edu.sg"", ""hzchen@cse.cuhk.edu.hk"", ""mcyang@cse.cuhk.edu.hk""]","[""Yifan Hou"", ""Jian Zhang"", ""James Cheng"", ""Kaili Ma"", ""Richard T. B. Ma"", ""Hongzhi Chen"", ""Ming-Chang Yang""]",[],"Graph neural networks (GNNs) have been widely used for representation learning on graph data. However, there is limited understanding on how much performance GNNs actually gain from graph data. This paper introduces a context-surrounding GNN framework and proposes two smoothness metrics to measure the quantity and quality of information obtained from graph data. 
A new, improved GNN model, called CS-GNN, is then devised to improve the use of graph information based on the smoothness values of a graph. CS-GNN is shown to achieve better performance than existing methods in different types of real graphs. ",/pdf/3ff628aed23920c95386567ad7acc7885d49b122.pdf,ICLR,2020, +rylkma4twr,rJxpaIT8vr,1569440000000.0,1577170000000.0,433,Min-Max Optimization without Gradients: Convergence and Applications to Adversarial ML,"[""sijia.liu@ibm.com"", ""songtao@ibm.com"", ""chen5719@umn.edu"", ""feng-y16@mails.tsinghua.edu.cn"", ""xu.kaid@husky.neu.edu"", ""aldujail@mit.edu"", ""mhong@umn.edu"", ""unamay@csail.mit.edu""]","[""Sijia Liu"", ""Songtao Lu"", ""Xiangyi Chen"", ""Yao Feng"", ""Kaidi Xu"", ""Abdullah Al-Dujaili"", ""Minyi Hong"", ""Una-May Obelilly""]","[""nonconvex optimization"", ""min-max optimization"", ""robust optimization"", ""adversarial attack""]","In this paper, we study the problem of constrained robust (min-max) optimization in a black-box setting, where the desired optimizer cannot access the gradients of the objective function but may query its values. We present a principled optimization framework, integrating a zeroth-order (ZO) gradient estimator with an alternating projected stochastic gradient descent-ascent method, where the former only requires a small number of function queries and the latter needs just one-step descent/ascent update. We show that the proposed framework, referred to as ZO-Min-Max, has a sub-linear convergence rate under mild conditions and scales gracefully with problem size. From an application side, we explore a promising connection between black-box min-max optimization and black-box evasion and poisoning attacks in adversarial machine learning (ML). 
Our empirical evaluations on these use cases demonstrate the effectiveness of our approach and its scalability to dimensions that prohibit using recent black-box solvers.",/pdf/e39bfd34f9b940361ee6d635a6873bdfe01abedc.pdf,ICLR,2020,Towards principled and efficient black-box min-max optimization with applications to design of evasion and poisoning adversarial attacks +49V11oUejQ,Mxc9H_r2fwR,1601310000000.0,1614990000000.0,2457,Efficient Robust Training via Backward Smoothing,"[""~Jinghui_Chen1"", ""~Yu_Cheng1"", ""~Zhe_Gan1"", ""~Quanquan_Gu1"", ""~Jingjing_Liu2""]","[""Jinghui Chen"", ""Yu Cheng"", ""Zhe Gan"", ""Quanquan Gu"", ""Jingjing Liu""]","[""Efficient Robust Training"", ""Backward Smoothing"", ""Robustness""]","Adversarial training is so far the most effective strategy in defending against adversarial examples. However, it suffers from high computational cost due to the iterative adversarial attacks in each training step. Recent studies show that it is possible to achieve Fast Adversarial Training by performing a single-step attack with random initialization. Yet, it remains a mystery why random initialization helps. Besides, such an approach still lags behind state-of-the-art adversarial training algorithms on both stability and model robustness. In this work, we develop a new understanding towards Fast Adversarial Training, by viewing random initialization as performing randomized smoothing for better optimization of the inner maximization problem. From this perspective, we show that the smoothing effect by random initialization is not sufficient under the adversarial perturbation constraint. A new initialization strategy, \emph{backward smoothing}, is proposed to address this issue and significantly improves both stability and model robustness over single-step robust training methods. 
Experiments on multiple benchmarks demonstrate that our method achieves similar model robustness as the original TRADES method, while using much less training time (~3x improvement with the same training schedule). ",/pdf/ee969cde7ffd2ade652f069571e71705a0b27adb.pdf,ICLR,2021,"We propose a new principle towards understanding Fast Adversarial Training, and a new initialization strategy that significantly improves both stability and model robustness over the single-step robust training methods." +cef_G2hkiGc,96j81aW5rwk,1601310000000.0,1614990000000.0,3387,"More Side Information, Better Pruning: Shared-Label Classification as a Case Study","[""~Omer_Leibovitch1"", ""~Nir_Ailon1""]","[""Omer Leibovitch"", ""Nir Ailon""]","[""Pruning"", ""Compression"", ""CNN"", ""LSTM"", ""Image classification""]","Pruning of neural networks, also known as compression or sparsification, is the task of converting a given network, which may be too expensive to use (in prediction) on low resource platforms, with another 'lean' network which performs almost as well as the original one, while using considerably fewer resources. By turning the compression ratio knob, the practitioner can trade off the information gain versus the necessary computational resources, where information gain is a measure of reduction of uncertainty in the prediction. + +In certain cases, however, the practitioner may readily possess some information on the prediction from other sources. The main question we study here is, whether it is possible to take advantage of the additional side information, in order to further reduce the computational resources, in tandem with the pruning process? + +Motivated by a real-world application, we distill the following elegantly stated problem. 
We are given a multi-class prediction problem, combined with a (possibly pre-trained) network architecture for solving it on a given instance distribution, and also a method for pruning the network to allow trading off prediction speed with accuracy. We assume the network and the pruning methods are state-of-the-art, and it is not our goal here to improve them. However, instead of being asked to predict a single drawn instance $x$, we are being asked to predict the label of an $n$-tuple of instances $(x_1,\dots x_n)$, with the additional side information that all tuple instances share the same label. The shared label distribution is identical to the distribution on which the network was trained. + +One trivial way to do this is by obtaining individual raw predictions for each of the $n$ instances (separately), using our given network, pruned for a desired accuracy, then taking the average to obtain a single more accurate prediction. This is simple to implement but intuitively sub-optimal, because the $n$ independent instantiations of the network do not share any information, and would probably waste resources on overlapping computation. + +We propose various methods for performing this task, and compare them using extensive experiments on public benchmark data sets for image classification. Our comparison is based on measures of relative information (RI) and $n$-accuracy, which we define. 
Interestingly, we empirically find that i) sharing information between the $n$ independently computed hidden representations of $x_1,..,x_n$, using an LSTM based gadget, performs best, among all methods we experiment with, ii) for all methods studied, we exhibit a sweet spot phenomenon, which sheds light on the compression-information trade-off and may assist a practitioner to choose the desired compression ratio.",/pdf/a58affe465befe478ca15b157c80e5554a6fcef2.pdf,ICLR,2021,"The paper presents the practical problem of combining deep network pruning algorithms in scenarios involving additional side information on the data, suggests various solutions and reports empirical findings on their merits." +Bpw_O132lWT,_DcrDHj6X9,1601310000000.0,1614990000000.0,1418,Dynamic of Stochastic Gradient Descent with State-dependent Noise,"[""~Qi_Meng1"", ""~Shiqi_Gong1"", ""~Wei_Chen1"", ""~Zhi-Ming_Ma1"", ""~Tie-Yan_Liu1""]","[""Qi Meng"", ""Shiqi Gong"", ""Wei Chen"", ""Zhi-Ming Ma"", ""Tie-Yan Liu""]","[""state-dependent noise"", ""power-law dynamic"", ""stochastic gradient descent"", ""generalization"", ""deep neural network"", ""heavy-tailed"", ""escape time""]","Stochastic gradient descent (SGD) and its variants are mainstream methods to train deep neural networks. Since neural networks are non-convex, more and more works study the dynamic behavior of SGD and its impact to generalization, especially the escaping efficiency from local minima. However, these works make the over-simplified assumption that the distribution of gradient noise is state-independent, although it is state-dependent. In this work, we propose a novel power-law dynamic with state-dependent diffusion to approximate the dynamic of SGD. Then, we prove that the stationary distribution of power-law dynamic is heavy-tailed, which matches the existing empirical observations. 
Next, we study the escaping efficiency from local minimum of power-law dynamic and prove that the mean escaping time is in polynomial order of the barrier height of the basin, much faster than exponential order of previous dynamics. It indicates that SGD can escape deep sharp minima efficiently and tends to stop at flat minima that have lower generalization error. Finally, we conduct experiments to compare SGD and power-law dynamic, and the results verify our theoretical findings.",/pdf/178eaecb04429fe8847d55f837b1b008857148df.pdf,ICLR,2021,"We propose a novel power-law dynamic with state-dependent diffusion to approximate the dynamic of SGD, and analyze escaping efficiency and PAC-Bayesian generalization bound for it." +rydeCEhs-,rJvl0EhoZ,1506720000000.0,1519460000000.0,1,SMASH: One-Shot Model Architecture Search through HyperNetworks,"[""ajb5@hw.ac.uk"", ""t.lim@hw.ac.uk"", ""j.m.ritchie@hw.ac.uk"", ""nick.weston@renishaw.com""]","[""Andrew Brock"", ""Theo Lim"", ""J.M. Ritchie"", ""Nick Weston""]","[""meta-learning"", ""architecture search"", ""deep learning"", ""computer vision""]","Designing architectures for deep neural networks requires expert knowledge and substantial computation time. We propose a technique to accelerate architecture selection by learning an auxiliary HyperNet that generates the weights of a main model conditioned on that model's architecture. By comparing the relative validation performance of networks with HyperNet-generated weights, we can effectively search over a wide range of architectures at the cost of a single training run. To facilitate this search, we develop a flexible mechanism based on memory read-writes that allows us to define a wide range of network connectivity patterns, with ResNet, DenseNet, and FractalNet blocks as special cases. 
We validate our method (SMASH) on CIFAR-10 and CIFAR-100, STL-10, ModelNet10, and Imagenet32x32, achieving competitive performance with similarly-sized hand-designed networks.",/pdf/d419771c2f7a5d1440eae8409b507610b0463a9e.pdf,ICLR,2018,A technique for accelerating neural architecture selection by approximating the weights of each candidate architecture instead of training them individually. +BkzeUiRcY7,HJl-LU6FYX,1538090000000.0,1551940000000.0,146,M^3RL: Mind-aware Multi-agent Management Reinforcement Learning,"[""tianmin.shu@ucla.edu"", ""yuandong@fb.com""]","[""Tianmin Shu"", ""Yuandong Tian""]","[""Multi-agent Reinforcement Learning"", ""Deep Reinforcement Learning""]","Most of the prior work on multi-agent reinforcement learning (MARL) achieves optimal collaboration by directly learning a policy for each agent to maximize a common reward. In this paper, we aim to address this from a different angle. In particular, we consider scenarios where there are self-interested agents (i.e., worker agents) which have their own minds (preferences, intentions, skills, etc.) and can not be dictated to perform tasks they do not want to do. For achieving optimal coordination among these agents, we train a super agent (i.e., the manager) to manage them by first inferring their minds based on both current and past observations and then initiating contracts to assign suitable tasks to workers and promise to reward them with corresponding bonuses so that they will agree to work together. The objective of the manager is to maximize the overall productivity as well as minimize payments made to the workers for ad-hoc worker teaming. To train the manager, we propose Mind-aware Multi-agent Management Reinforcement Learning (M^3RL), which consists of agent modeling and policy learning. We have evaluated our approach in two environments, Resource Collection and Crafting, to simulate multi-agent management problems with various task settings and multiple designs for the worker agents. 
The experimental results have validated the effectiveness of our approach in modeling worker agents' minds online, and in achieving optimal ad-hoc teaming with good generalization and fast adaptation.",/pdf/0798a743917823f4ae3a889a67ee0bd602d973de.pdf,ICLR,2019,We propose Mind-aware Multi-agent Management Reinforcement Learning (M^3RL) for training a manager to motivate self-interested workers to achieve optimal collaboration by assigning suitable contracts to them. +r1lYRjC9F7,r1l2yRT9KX,1538090000000.0,1556650000000.0,917,Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset,"[""fjord@google.com"", ""astas@google.com"", ""adarob@google.com"", ""iansimon@google.com"", ""annahuang@google.com"", ""sedielem@google.com"", ""eriche@google.com"", ""jesseengel@google.com"", ""deck@google.com""]","[""Curtis Hawthorne"", ""Andriy Stasyuk"", ""Adam Roberts"", ""Ian Simon"", ""Cheng-Zhi Anna Huang"", ""Sander Dieleman"", ""Erich Elsen"", ""Jesse Engel"", ""Douglas Eck""]","[""music"", ""piano transcription"", ""transformer"", ""wavnet"", ""audio synthesis"", ""dataset"", ""midi""]","Generating musical audio directly with neural networks is notoriously difficult because it requires coherently modeling structure at many different timescales. Fortunately, most music is also highly structured and can be represented as discrete note events played on musical instruments. Herein, we show that by using notes as an intermediate representation, we can train a suite of models capable of transcribing, composing, and synthesizing audio waveforms with coherent musical structure on timescales spanning six orders of magnitude (~0.1 ms to ~100 s), a process we call Wave2Midi2Wave. 
This large advance in the state of the art is enabled by our release of the new MAESTRO (MIDI and Audio Edited for Synchronous TRacks and Organization) dataset, composed of over 172 hours of virtuosic piano performances captured with fine alignment (~3 ms) between note labels and audio waveforms. The networks and the dataset together present a promising approach toward creating new expressive and interpretable neural models of music.",/pdf/ced843930e16b6829293532407cbae254ab1a5af.pdf,ICLR,2019,"We train a suite of models capable of transcribing, composing, and synthesizing audio waveforms with coherent musical structure, enabled by the new MAESTRO dataset." +B1lxH20qtX,HJgn9GA9YX,1538090000000.0,1545360000000.0,1509,Learning to control self-assembling morphologies: a study of generalization via modularity,"[""pathak@berkeley.edu"", ""chris.lu@berkeley.edu"", ""trevor@eecs.berkeley.edu"", ""phillip.isola@gmail.com"", ""efros@eecs.berkeley.edu""]","[""Deepak Pathak"", ""Chris Lu"", ""Trevor Darrell"", ""Philip Isola"", ""Alexei A. Efros""]","[""modularity"", ""compostionality"", ""graphs"", ""dynamics"", ""network""]","Much of contemporary sensorimotor learning assumes that one is already given a complex agent (e.g., a robotic arm) and the goal is to learn to control it. In contrast, this paper investigates a modular co-evolution strategy: a collection of primitive agents learns to self-assemble into increasingly complex collectives in order to solve control tasks. Each primitive agent consists of a limb and a neural controller. Limbs may choose to link up to form collectives, with linking being treated as a dynamic action. When two limbs link, a joint is added between them, actuated by the 'parent' limb's controller. This forms a new 'single' agent, which may further link with other agents. In this way, complex morphologies can emerge, controlled by a policy whose architecture is in explicit correspondence with the morphology. 
In experiments, we demonstrate that agents with these modular and dynamic topologies generalize better to test-time environments compared to static and monolithic baselines. Project videos are available at https://doubleblindICLR19.github.io/self-assembly/",/pdf/83370ed91d0ed645c285290bd8ea05fc219125af.pdf,ICLR,2019,Learning to control self-assembling agents via dynamic graph networks +SygKyeHKDH,Byejo5yYwS,1569440000000.0,1583910000000.0,2069,Making Efficient Use of Demonstrations to Solve Hard Exploration Problems,"[""caglarg@google.com"", ""tpaine@google.com"", ""bshahr@google.com"", ""mdenil@google.com"", ""mwhoffman@google.com"", ""soyer@google.com"", ""tanburn@google.com"", ""skapturowski@google.com"", ""ncr@google.com"", ""duncanwilliams@google.com"", ""gabrielbm@google.com"", ""ziyu@google.com"", ""nandodefreitas@google.com"", ""deepmind-worlds-team@google.com""]","[""Caglar Gulcehre"", ""Tom Le Paine"", ""Bobak Shahriari"", ""Misha Denil"", ""Matt Hoffman"", ""Hubert Soyer"", ""Richard Tanburn"", ""Steven Kapturowski"", ""Neil Rabinowitz"", ""Duncan Williams"", ""Gabriel Barth-Maron"", ""Ziyu Wang"", ""Nando de Freitas"", ""Worlds Team""]","[""imitation learning"", ""deep learning"", ""reinforcement learning""]","This paper introduces R2D3, an agent that makes efficient use of demonstrations to solve hard exploration problems in partially observable environments with highly variable initial conditions. We also introduce a suite of eight tasks that combine these three properties, and show that R2D3 can solve several of the tasks where other state of the art methods (both with and without demonstrations) fail to see even a single successful trajectory after tens of billions of steps of exploration.",/pdf/b445e01d0997cf69140ca5d5fc5a7990a5837f29.pdf,ICLR,2020,"We introduce R2D3, an agent that makes efficient use of demonstrations to solve hard exploration problems in partially observable environments with highly variable initial conditions." 
+BkgGmh09FQ,BkgDNZC5YQ,1538090000000.0,1545360000000.0,1338,Understanding Opportunities for Efficiency in Single-image Super Resolution Networks,"[""rs@roysonlee.com"", ""nicholas.d.lane@gmail.com"", ""marko.stankovic996@gmail.com"", ""bsourav@gmail.com""]","[""Royson Lee"", ""Nic Lane"", ""Marko Stankovic"", ""Sourav Bhattacharya""]","[""Super-Resolution"", ""Resource-Efficiency""]","A successful application of convolutional architectures is to increase the resolution of single low-resolution images -- an image restoration task called super-resolution (SR). Naturally, SR is of value to resource constrained devices like mobile phones, electronic photograph frames and televisions to enhance image quality. However, SR demands perhaps the most extreme amounts of memory and compute operations of any mainstream vision task known today, preventing SR from being deployed to devices that require it. In this paper, we perform an early systematic study of system resource efficiency for SR, within the context of a variety of architectural and low-precision approaches originally developed for discriminative neural networks. We present a rich set of insights, representative SR architectures, and efficiency trade-offs; for example, the prioritization of ways to compress models to reach a specific memory and computation target and techniques to compact SR models so that they are suitable for DSPs and FPGAs. As a result of doing so, we manage to achieve performance better than or comparable to previous models in the existing literature, highlighting the practicality of using existing efficiency techniques in SR tasks. Collectively, we believe these results provide the foundation for further research into the little explored area of resource efficiency for SR. 
",/pdf/21f463083d76b7439df2380734c1262cb75f6947.pdf,ICLR,2019,We build an understanding of resource-efficient techniques on Super-Resolution +Hkgpnn4YvH,HyeP3n3GvH,1569440000000.0,1577170000000.0,208,Graph Neural Networks For Multi-Image Matching,"[""stephi@seas.upenn.edu"", ""kostas@seas.upenn.edu""]","[""Stephen Phillips"", ""Kostas Daniilidis""]","[""Graph Neural Networks"", ""Multi-image Matching""]","In geometric computer vision applications, multi-image feature matching gives more accurate and robust solutions compared to simple two-image matching. In this work, we formulate multi-image matching as a graph embedding problem, then use a Graph Neural Network to learn an appropriate embedding function for aligning image features. We use cycle consistency to train our network in an unsupervised fashion, since ground truth correspondence can be difficult or expensive to acquire. Geometric consistency losses are added to aid training, though unlike optimization based methods no geometric information is necessary at inference time. To the best of our knowledge, no other works have used graph neural networks for multi-image feature matching. Our experiments show that our method is competitive with other optimization based approaches.",/pdf/224abc04952b2a78f54dadbdf88e2096e16ccca2.pdf,ICLR,2020,We use Graph Neural Networks to learning multi-image feature matching with Geometric side losses. 
+HyebplHYwB,B1gPkZbKPr,1569440000000.0,1583910000000.0,2564,The Shape of Data: Intrinsic Distance for Data Distributions,"[""tsitsulin@bit.uni-bonn.de"", ""marina.munkhoeva@skolkovotech.ru"", ""davide@cs.au.dk"", ""piekarras@gmail.com"", ""bron@cs.technion.ac.il"", ""i.oseledets@skoltech.ru"", ""mueller@bit.uni-bonn.de""]","[""Anton Tsitsulin"", ""Marina Munkhoeva"", ""Davide Mottin"", ""Panagiotis Karras"", ""Alex Bronstein"", ""Ivan Oseledets"", ""Emmanuel Mueller""]","[""Deep Learning"", ""Generative Models"", ""Nonlinear Dimensionality Reduction"", ""Manifold Learning"", ""Similarity and Distance Learning"", ""Spectral Methods""]","The ability to represent and compare machine learning models is crucial in order to quantify subtle model changes, evaluate generative models, and gather insights on neural network architectures. Existing techniques for comparing data distributions focus on global data properties such as mean and covariance; in that sense, they are extrinsic and uni-scale. We develop a first-of-its-kind intrinsic and multi-scale method for characterizing and comparing data manifolds, using a lower-bound of the spectral variant of the Gromov-Wasserstein inter-manifold distance, which compares all data moments. In a thorough experimental study, we demonstrate that our method effectively discerns the structure of data manifolds even on unaligned data of different dimensionalities; moreover, we showcase its efficacy in evaluating the quality of generative models.",/pdf/363c65692ab623e314b23da974417f909c856b06.pdf,ICLR,2020,We propose a metric for comparing data distributions based on their geometry while not relying on any positional information. 
+01olnfLIbD,Xzvod_oeWVz,1601310000000.0,1614590000000.0,3659,Witches' Brew: Industrial Scale Data Poisoning via Gradient Matching,"[""~Jonas_Geiping1"", ""~Liam_H_Fowl1"", ""~W._Ronny_Huang1"", ""~Wojciech_Czaja1"", ""~Gavin_Taylor1"", ""~Michael_Moeller1"", ""~Tom_Goldstein1""]","[""Jonas Geiping"", ""Liam H Fowl"", ""W. Ronny Huang"", ""Wojciech Czaja"", ""Gavin Taylor"", ""Michael Moeller"", ""Tom Goldstein""]","[""Data Poisoning"", ""ImageNet"", ""Large-scale"", ""Gradient Alignment"", ""Security"", ""Backdoor Attacks"", ""from-scratch"", ""clean-label""]","Data Poisoning attacks modify training data to maliciously control a model trained on such data. +In this work, we focus on targeted poisoning attacks which cause a reclassification of an unmodified test image and as such breach model integrity. We consider a +particularly malicious poisoning attack that is both ``from scratch"" and ``clean label"", meaning we analyze an attack that successfully works against new, randomly initialized models, and is nearly imperceptible to humans, all while perturbing only a small fraction of the training data. +Previous poisoning attacks against deep neural networks in this setting have been limited in scope and success, working only in simplified settings or being prohibitively expensive for large datasets. +The central mechanism of the new attack is matching the gradient direction of malicious examples. We analyze why this works, supplement with practical considerations, and show its threat to real-world practitioners, finding that it is the first poisoning method to cause targeted misclassification in modern deep networks trained from scratch on a full-sized, poisoned ImageNet dataset. 
+Finally we demonstrate the limitations of existing defensive strategies against such an attack, concluding that data poisoning is a credible threat, even for large-scale deep learning systems.",/pdf/3a3c570da85848de52605f6669aae395d063027b.pdf,ICLR,2021,"Data poisoning attacks that successfully poison neural networks trained from scratch, even on large-scale datasets like ImageNet." +rJlk6iRqKX,H1gG02Jctm,1538090000000.0,1553420000000.0,763,Query-Efficient Hard-label Black-box Attack: An Optimization-based Approach,"[""mhcheng@ucla.edu"", ""thmle@ucdavis.edu"", ""pin-yu.chen@ibm.com"", ""huan@huan-zhang.com"", ""yijinfeng@jd.com"", ""chohsieh@cs.ucla.edu""]","[""Minhao Cheng"", ""Thong Le"", ""Pin-Yu Chen"", ""Huan Zhang"", ""JinFeng Yi"", ""Cho-Jui Hsieh""]","[""Adversarial example"", ""Hard-label"", ""Black-box attack"", ""Query-efficient""]","We study the problem of attacking machine learning models in the hard-label black-box setting, where no model information is revealed except that the attacker can make queries to probe the corresponding hard-label decisions. This is a very challenging problem since the direct extension of state-of-the-art white-box attacks (e.g., C&W or PGD) to the hard-label black-box setting will require minimizing a non-continuous step function, which is combinatorial and cannot be solved by a gradient-based optimizer. The only two current approaches are based on random walk on the boundary (Brendel et al., 2017) and random trials to evaluate the loss function (Ilyas et al., 2018), which require lots of queries and lack convergence guarantees. +We propose a novel way to formulate the hard-label black-box attack as a real-valued optimization problem which is usually continuous and can be solved by any zeroth order optimization algorithm. 
For example, using the Randomized Gradient-Free method (Nesterov & Spokoiny, 2017), we are able to bound the number of iterations needed for our algorithm to achieve stationary points under mild assumptions. We demonstrate that our proposed method outperforms the previous stochastic approaches to attacking convolutional neural networks on MNIST, CIFAR, and ImageNet datasets. More interestingly, we show that the proposed algorithm can also be used to attack other discrete and non-continuous machine learning models, such as Gradient Boosting Decision Trees (GBDT).",/pdf/182080868c2b5739e95727e69ac2b0c58401dad8.pdf,ICLR,2019, +HklKui0ct7,SklWQCxcKQ,1538090000000.0,1550820000000.0,375,Off-Policy Evaluation and Learning from Logged Bandit Feedback: Error Reduction via Surrogate Policy,"[""xieyuan@umail.iu.edu"", ""boyiliu2018@u.northwestern.edu"", ""lqiang@cs.utexas.edu"", ""zhaoranwang@gmail.com"", ""yzhoucs@iu.edu"", ""jianpeng@illinois.edu""]","[""Yuan Xie"", ""Boyi Liu"", ""Qiang Liu"", ""Zhaoran Wang"", ""Yuan Zhou"", ""Jian Peng""]","[""Causal inference"", ""Policy Optimization"", ""Non-asymptotic analysis""]"," When learning from a batch of logged bandit feedback, the discrepancy between the policy to be learned and the off-policy training data imposes statistical and computational challenges. Unlike classical supervised learning and online learning settings, in batch contextual bandit learning, one only has access to a collection of logged feedback from the actions taken by a historical policy, and expect to learn a policy that takes good actions in possibly unseen contexts. Such a batch learning setting is ubiquitous in online and interactive systems, such as ad platforms and recommendation systems. Existing approaches based on inverse propensity weights, such as Inverse Propensity Scoring (IPS) and Policy Optimizer for Exponential Models (POEM), enjoy unbiasedness but often suffer from large mean squared error. 
In this work, we introduce a new approach named Maximum Likelihood Inverse Propensity Scoring (MLIPS) for batch learning from logged bandit feedback. Instead of using the given historical policy as the proposal in inverse propensity weights, we estimate a maximum likelihood surrogate policy based on the logged action-context pairs, and then use this surrogate policy as the proposal. We prove that MLIPS is asymptotically unbiased, and moreover, has a smaller nonasymptotic mean squared error than IPS. Such an error reduction phenomenon is somewhat surprising as the estimated surrogate policy is less accurate than the given historical policy. Results on multi-label classification problems and a large-scale ad placement dataset demonstrate the empirical effectiveness of MLIPS. Furthermore, the proposed surrogate policy technique is complementary to existing error reduction techniques, and when combined, is able to consistently boost the performance of several widely used approaches.",/pdf/79a3a948cc952ef893d37f8869760f151b7a36e4.pdf,ICLR,2019, +Ew0zR07CYRd,tzzvMgTn2VZ,1601310000000.0,1614990000000.0,3010,Bounded Myopic Adversaries for Deep Reinforcement Learning Agents,"[""~Ezgi_Korkmaz1"", ""hsan@kth.se"", ""gyuri@kth.se""]","[""Ezgi Korkmaz"", ""Henrik Sandberg"", ""Gyorgy Dan""]","[""deep reinforcement learning"", ""adversarial""]","Adversarial attacks against deep neural networks have been widely studied. Adversarial examples for deep reinforcement learning (DeepRL) have significant security implications, due to the deployment of these algorithms in many application domains. In this work we formalize an optimal myopic adversary for deep reinforcement learning agents. Our adversary attempts to find a bounded perturbation of the state which minimizes the value of the action taken by the agent. We show with experiments in various games in the Atari environment that our attack formulation achieves significantly larger impact as compared to the current state-of-the-art. 
Furthermore, this enables us to lower the bounds by several orders of magnitude on the perturbation needed to efficiently achieve significant impacts on DeepRL agents.",/pdf/c1151275b352de2e8494046c8db5c488a5d8ba8c.pdf,ICLR,2021, +r1eyceSYPr,H1ldPAgKPS,1569440000000.0,1583910000000.0,2455,Unbiased Contrastive Divergence Algorithm for Training Energy-Based Latent Variable Models,"[""yixuanq@andrew.cmu.edu"", ""lingsong@purdue.edu"", ""wangxiao@purdue.edu""]","[""Yixuan Qiu"", ""Lingsong Zhang"", ""Xiao Wang""]","[""energy model"", ""restricted Boltzmann machine"", ""contrastive divergence"", ""unbiased Markov chain Monte Carlo"", ""distribution coupling""]","The contrastive divergence algorithm is a popular approach to training energy-based latent variable models, which has been widely used in many machine learning models such as the restricted Boltzmann machines and deep belief nets. Despite its empirical success, the contrastive divergence algorithm is also known to have biases that severely affect its convergence. In this article we propose an unbiased version of the contrastive divergence algorithm that completely removes its bias in stochastic gradient methods, based on recent advances on unbiased Markov chain Monte Carlo methods. Rigorous theoretical analysis is developed to justify the proposed algorithm, and numerical experiments show that it significantly improves the existing method. Our findings suggest that the unbiased contrastive divergence algorithm is a promising approach to training general energy-based latent variable models.",/pdf/657b4d2249efb2827fbf6ee8082a4a104e9f0c8c.pdf,ICLR,2020,We have developed a new training algorithm for energy-based latent variable models that completely removes the bias of contrastive divergence. +zleOqnAUZzl,DOkZylos8d5,1601310000000.0,1614990000000.0,2358,Are all outliers alike? 
On Understanding the Diversity of Outliers for Detecting OODs,"[""ramneetk@seas.upenn.edu"", ""~Susmit_Jha1"", ""~Anirban_Roy3""]","[""Ramneet Kaur"", ""Susmit Jha"", ""Anirban Roy""]","[""OOD"", ""out of distribution"", ""trust"", ""model confidence"", ""DNN"", ""deep learning""]","Deep neural networks (DNNs) are known to produce incorrect predictions with very high confidence on out-of-distribution (OOD) inputs. This limitation is one of the key challenges in the adoption of deep learning models in high-assurance systems such as autonomous driving, air traffic management, and medical diagnosis. This challenge has received significant attention recently, and several techniques have been developed to detect inputs where the model's prediction cannot be trusted. These techniques use different statistical, geometric, or topological signatures. This paper presents a taxonomy of OOD outlier inputs based on their source and nature of uncertainty. We demonstrate how different existing detection approaches fail to detect certain types of outliers. We utilize these insights to develop a novel integrated detection approach that uses multiple attributes corresponding to different types of outliers. Our results include experiments on CIFAR10, SVHN and MNIST as in-distribution data and Imagenet, LSUN, SVHN (for CIFAR10), CIFAR10 (for SVHN), KMNIST, and F-MNIST as OOD data across different DNN architectures such as ResNet34, WideResNet, DenseNet, and LeNet5.",/pdf/c8ccdbdc0f5815237772396974912d03924165e9.pdf,ICLR,2021,We study the diversity of out of distribution (OOD) inputs and develop a new OOD detection approach that outperforms existing state-of-the-art methods. 
+YtMG5ex0ou,oyB2o1hr9_MZ,1601310000000.0,1617890000000.0,871,Tomographic Auto-Encoder: Unsupervised Bayesian Recovery of Corrupted Data,"[""~Francesco_Tonolini1"", ""~Pablo_Garcia_Moreno1"", ""~Andreas_Damianou1"", ""~Roderick_Murray-Smith1""]","[""Francesco Tonolini"", ""Pablo Garcia Moreno"", ""Andreas Damianou"", ""Roderick Murray-Smith""]","[""Missing value imputation"", ""variational inference"", ""variational auto-encoders""]","We propose a new probabilistic method for unsupervised recovery of corrupted data. Given a large ensemble of degraded samples, our method recovers accurate posteriors of clean values, allowing the exploration of the manifold of possible reconstructed data and hence characterising the underlying uncertainty. In this setting, direct application of classical variational methods often gives rise to collapsed densities that do not adequately explore the solution space. Instead, we derive our novel reduced entropy condition approximate inference method that results in rich posteriors. We test our model in a data recovery task under the common setting of missing values and noise, demonstrating superior performance to existing variational methods for imputation and de-noising with different real data sets. 
We further show higher classification accuracy after imputation, proving the advantage of propagating uncertainty to downstream tasks with our model.",/pdf/1a034aa157281f3316e3bb5608ae9c2bb7e13e30.pdf,ICLR,2021,Recovering accurate posterior distributions for unsupervised data recovery +7wCBOfJ8hJM,WoTkSAa39lg,1601310000000.0,1615850000000.0,971,Nearest Neighbor Machine Translation,"[""~Urvashi_Khandelwal1"", ""~Angela_Fan2"", ""~Dan_Jurafsky1"", ""~Luke_Zettlemoyer1"", ""~Mike_Lewis1""]","[""Urvashi Khandelwal"", ""Angela Fan"", ""Dan Jurafsky"", ""Luke Zettlemoyer"", ""Mike Lewis""]","[""nearest neighbors"", ""machine translation""]","We introduce $k$-nearest-neighbor machine translation ($k$NN-MT), which predicts tokens with a nearest-neighbor classifier over a large datastore of cached examples, using representations from a neural translation model for similarity search. This approach requires no additional training and scales to give the decoder direct access to billions of examples at test time, resulting in a highly expressive model that consistently improves performance across many settings. Simply adding nearest-neighbor search improves a state-of-the-art German-English translation model by 1.5 BLEU. $k$NN-MT allows a single model to be adapted to diverse domains by using a domain-specific datastore, improving results by an average of 9.2 BLEU over zero-shot transfer, and achieving new state-of-the-art results---without training on these domains. A massively multilingual model can also be specialized for particular language pairs, with improvements of 3 BLEU for translating from English into German and Chinese. 
Qualitatively, $k$NN-MT is easily interpretable; it combines source and target context to retrieve highly relevant examples.",/pdf/8c353a40fa6aa514c510ad3cc89f331af070681b.pdf,ICLR,2021,"We augment the decoder of a pre-trained machine translation model with a nearest neighbor classifier, substantially improving performance in the single language-pair, multilingual and domain adaptation settings, without any additional training." +XOuAOv_-5Fx,LmmA9tqsYXj,1601310000000.0,1614990000000.0,3535,Uncertainty Calibration Error: A New Metric for Multi-Class Classification,"[""~Max-Heinrich_Laves1"", ""~Sontje_Ihler1"", ""kortmann@imes.uni-hannover.de"", ""ortmaier@imes.uni-hannover.de""]","[""Max-Heinrich Laves"", ""Sontje Ihler"", ""Karl-Philipp Kortmann"", ""Tobias Ortmaier""]","[""variational inference"", ""uncertainty"", ""calibration"", ""classification""]","Various metrics have recently been proposed to measure uncertainty calibration of deep models for classification. However, these metrics either fail to capture miscalibration correctly or lack interpretability. We propose to use the normalized entropy as a measure of uncertainty and derive the Uncertainty Calibration Error (UCE), a comprehensible calibration metric for multi-class classification. In our experiments, we focus on uncertainty from variational Bayesian inference methods and compare UCE to established calibration errors on the task of multi-class image classification. UCE avoids several pathologies of other metrics, but does not sacrifice interpretability. It can be used for regularization to improve calibration during training without penalizing predictions with justified high confidence.",/pdf/29192b96d2d3f83f14cbb305bf70309a512a4852.pdf,ICLR,2021,We present an uncertainty calibration error metric based on normalized entropy. 
+E_U8Zvx7zrf,OY3o2K6dPFn,1601310000000.0,1614990000000.0,1285,Delay-Tolerant Local SGD for Efficient Distributed Training,"[""~An_Xu1"", ""~Xiao_Yan2"", ""~Hongchang_Gao1"", ""~Heng_Huang1""]","[""An Xu"", ""Xiao Yan"", ""Hongchang Gao"", ""Heng Huang""]","[""Delay-tolerant"", ""communication-efficient"", ""distributed learning""]","The heavy communication for model synchronization is a major bottleneck for scaling up the distributed deep neural network training to many workers. Moreover, model synchronization can suffer from long delays in scenarios such as federated learning and geo-distributed training. Thus, it is crucial that the distributed training methods are both \textit{delay-tolerant} AND \textit{communication-efficient}. However, existing works cannot simultaneously address the communication delay and bandwidth constraint. To address this important and challenging problem, we propose a novel training framework OLCO\textsubscript{3} to achieve delay tolerance with a low communication budget by using stale information. OLCO\textsubscript{3} introduces novel staleness compensation and compression compensation to combat the influence of staleness and compression error. Theoretical analysis shows that OLCO\textsubscript{3} achieves the same sub-linear convergence rate as the vanilla synchronous stochastic gradient descent (SGD) method. Extensive experiments on deep learning tasks verify the effectiveness of OLCO\textsubscript{3} and its advantages over existing works.",/pdf/669f7d42004ef8a1c1400d67c30c5d7f3cf9efb6.pdf,ICLR,2021,We propose a delay-tolerant AND communication-efficient training method for distributed learning. 
+rJl31TNYPr,HJlJgAqBvB,1569440000000.0,1583910000000.0,316,Fooling Detection Alone is Not Enough: Adversarial Attack against Multiple Object Tracking,"[""jack0082010@gmail.com"", ""ylu25@syr.edu"", ""junjies1@uci.edu"", ""alfchen@uci.edu"", ""chen@ucdavis.edu"", ""edwardzhong@baidu.com"", ""lenx.wei@gmail.com""]","[""Yunhan Jia"", ""Yantao Lu"", ""Junjie Shen"", ""Qi Alfred Chen"", ""Hao Chen"", ""Zhenyu Zhong"", ""Tao Wei""]","[""Adversarial examples"", ""object detection"", ""object tracking"", ""security"", ""autonomous vehicle"", ""deep learning""]","Recent work in adversarial machine learning started to focus on the visual perception in autonomous driving and studied Adversarial Examples (AEs) for object detection models. However, in such a visual perception pipeline the detected objects must also be tracked, in a process called Multiple Object Tracking (MOT), to build the moving trajectories of surrounding obstacles. Since MOT is designed to be robust against errors in object detection, it poses a general challenge to existing attack techniques that blindly target object detection: we find that a success rate of over 98% is needed for them to actually affect the tracking results, a requirement that no existing attack technique can satisfy. In this paper, we are the first to study adversarial machine learning attacks against the complete visual perception pipeline in autonomous driving, and discover a novel attack technique, tracker hijacking, that can effectively fool MOT using AEs on object detection. Using our technique, successful AEs on as few as one single frame can move an existing object into or out of the headway of an autonomous vehicle to cause potential safety hazards. 
We perform evaluation using the Berkeley Deep Drive dataset and find that on average when 3 frames are attacked, our attack can have a nearly 100% success rate while attacks that blindly target object detection only have up to 25%.",/pdf/adba6eb29fc668f2d88d48fdacfdd1ccb918045c.pdf,ICLR,2020,We study the adversarial machine learning attacks against the Multiple Object Tracking mechanisms for the first time. +BbNIbVPJ-42,qv5sa9jVFjE,1601310000000.0,1615870000000.0,1273,The Risks of Invariant Risk Minimization,"[""~Elan_Rosenfeld1"", ""~Pradeep_Kumar_Ravikumar1"", ""~Andrej_Risteski2""]","[""Elan Rosenfeld"", ""Pradeep Kumar Ravikumar"", ""Andrej Risteski""]","[""out-of-distribution generalization"", ""causality"", ""representation learning"", ""deep learning""]","Invariant Causal Prediction (Peters et al., 2016) is a technique for out-of-distribution generalization which assumes that some aspects of the data distribution vary across the training set but that the underlying causal mechanisms remain constant. Recently, Arjovsky et al. (2019) proposed Invariant Risk Minimization (IRM), an objective based on this idea for learning deep, invariant features of data which are a complex function of latent variables; many alternatives have subsequently been suggested. However, formal guarantees for all of these works are severely lacking. In this paper, we present the first analysis of classification under the IRM objective—as well as these recently proposed alternatives—under a fairly natural and general model. In the linear case, we show simple conditions under which the optimal solution succeeds or, more often, fails to recover the optimal invariant predictor. We furthermore present the very first results in the non-linear regime: we demonstrate that IRM can fail catastrophically unless the test data is sufficiently similar to the training distribution—this is precisely the issue that it was intended to solve. 
Thus, in this setting we find that IRM and its alternatives fundamentally do not improve over standard Empirical Risk Minimization.",/pdf/28dd76f69ec8c36b1e5610943745438b4d6ca498.pdf,ICLR,2021,We formally demonstrate that Invariant Risk Minimization and related alternative objectives often perform no better than standard ERM. +rJxMM2C5K7,r1guQco9Fm,1538090000000.0,1545360000000.0,1245,Nested Dithered Quantization for Communication Reduction in Distributed Training,"[""abdi@ece.gatech.edu"", ""fekri@ece.gatech.edu""]","[""Afshin Abdi"", ""Faramarz Fekri""]","[""machine learning"", ""distributed training"", ""dithered quantization"", ""nested quantization"", ""distributed compression""]","In distributed training, the communication cost due to the transmission of gradients +or the parameters of the deep model is a major bottleneck in scaling up the number +of processing nodes. To address this issue, we propose dithered quantization for +the transmission of the stochastic gradients and show that training with Dithered +Quantized Stochastic Gradients (DQSG) is similar to the training with unquantized +SGs perturbed by an independent bounded uniform noise, in contrast to the other +quantization methods where the perturbation depends on the gradients and hence, +complicating the convergence analysis. We study the convergence of training +algorithms using DQSG and the trade off between the number of quantization +levels and the training time. Next, we observe that there is a correlation among the +SGs computed by workers that can be utilized to further reduce the communication +overhead without any performance loss. Hence, we develop a simple yet effective +quantization scheme, nested dithered quantized SG (NDQSG), that can reduce the +communication significantly without requiring the workers communicating extra +information to each other. We prove that although NDQSG requires significantly +less bits, it can achieve the same quantization variance bound as DQSG. 
Our +simulation results confirm the effectiveness of training using DQSG and NDQSG +in reducing the communication bits or the convergence time compared to the +existing methods without sacrificing the accuracy of the trained model.",/pdf/95f2aebc33c07de020be728a725c30834620ef7f.pdf,ICLR,2019,The paper proposes and analyzes two quantization schemes for communicating Stochastic Gradients in distributed learning which would reduce communication costs compared to the state of the art while maintaining the same accuracy. +Bke13pVKPS,SJxCfj1dDr,1569440000000.0,1577170000000.0,766,"Improved Training Speed, Accuracy, and Data Utilization via Loss Function Optimization","[""slgonzalez@utexas.edu"", ""risto@cs.utexas.edu""]","[""Santiago Gonzalez"", ""Risto Miikkulainen""]","[""metalearning"", ""evolutionary computation"", ""loss functions"", ""optimization"", ""genetic programming""]","As the complexity of neural network models has grown, it has become increasingly important to optimize their design automatically through metalearning. Methods for discovering hyperparameters, topologies, and learning rate schedules have led to significant increases in performance. This paper shows that loss functions can be optimized with metalearning as well, and result in similar improvements. The method, Genetic Loss-function Optimization (GLO), discovers loss functions de novo, and optimizes them for a target task. Leveraging techniques from genetic programming, GLO builds loss functions hierarchically from a set of operators and leaf nodes. These functions are repeatedly recombined and mutated to find an optimal structure, and then a covariance-matrix adaptation evolutionary strategy (CMA-ES) is used to find optimal coefficients. Networks trained with GLO loss functions are found to outperform the standard cross-entropy loss on standard image classification tasks. 
Training with these new loss functions requires fewer steps, results in lower test error, and allows for smaller datasets to be used. Loss function optimization thus provides a new dimension of metalearning, and constitutes an important step towards AutoML.",/pdf/38d0f6f04d00eea14bc4bacd508182673bc2233d.pdf,ICLR,2020,"Using evolutionary computation, a system for loss function metalearning was built (GLO) that discovered a new loss function for classification that can train more accurate models in less time." +n5yBuzpqqw,ixNo4hhv7MD,1601310000000.0,1614990000000.0,599,Error Controlled Actor-Critic Method to Reinforcement Learning,"[""31520160154529@stu.xmu.edu.cn"", ""fchao@xmu.edu.cn"", ""dozero@xmu.edu.cn"", ""13290073@student.uts.edu.au"", ""cml@saturn.yzu.edu.tw"", ""longzhi.yang@northumbria.ac.uk"", ""xic9@aber.ac.uk"", ""cns@aber.ac.uk""]","[""Xingen Gao"", ""Fei Chao"", ""Changle Zhou"", ""Zhen Ge"", ""Chih-Min Lin"", ""Longzhi Yang"", ""Xiang Chang"", ""Changjing Shang""]","[""reinforcement learning"", ""actor-critic"", ""function approximation"", ""approximation error"", ""KL divergence""]","In the reinforcement learning (RL) algorithms which incorporate function approximation methods, the approximation error of value function inevitably cause overestimation phenomenon and have a negative impact on the convergence of the algorithms. To mitigate the negative effects of approximation error, we propose a new actor-critic algorithm called Error Controlled Actor-critic which ensures confining the approximation error in value function. In this paper, we firstly present an analysis of how the approximation error can hinder the optimization process of actor-critic methods. 
Then, we *derive an upper boundary of the approximation error of Q function approximator, and found that the error can be lowered by placing restrictions on the KL-divergence between every two consecutive policies during the training phase of the policy.* The results of experiments on a range of continuous control tasks from OpenAI gym suite demonstrate that the proposed actor-critic algorithm apparently reduces the approximation error and significantly outperforms other model-free RL algorithms.",/pdf/c4be5250d095d9adc06e881bd14a009f3964d378.pdf,ICLR,2021,We propose a new actor-critic algorithm called Error Controlled Actor-critic which ensures confining the approximation error in value function. +HkuVu3ige,,1478380000000.0,1484850000000.0,579,On orthogonality and learning recurrent networks with long term dependencies,"[""eugene.vorontsov@gmail.com"", ""chiheb.trabelsi@polymtl.ca"", ""samuel.kadoury@polymtl.ca"", ""christopher.pal@polymtl.ca""]","[""Eugene Vorontsov"", ""Chiheb Trabelsi"", ""Samuel Kadoury"", ""Chris Pal""]","[""Deep learning""]","It is well known that it is challenging to train deep neural networks and recurrent neural networks for tasks that exhibit long term dependencies. The vanishing or exploding gradient problem is a well known issue associated with these challenges. One approach to addressing vanishing and exploding gradients is to use either soft or hard constraints on weight matrices so as to encourage or enforce orthogonality. Orthogonal matrices preserve gradient norm during backpropagation and can therefore be a desirable property; however, we find that hard constraints on orthogonality can negatively affect the speed of convergence and model performance. This paper explores the issues of optimization convergence, speed and gradient stability using a variety of different methods for encouraging or enforcing orthogonality. 
In particular we propose a weight matrix factorization and parameterization strategy through which we can bound matrix norms and therein control the degree of expansivity induced during backpropagation.",/pdf/6a2315bc79e028ba384c12b2fc00b3e3c060d768.pdf,ICLR,2017,"While orthogonal matrices improve neural network stability during training, deviating from orthogonality may improve model convergence speed and performance." +S1lPShAqFm,SklnT10ct7,1538090000000.0,1545360000000.0,1550,Empirically Characterizing Overparameterization Impact on Convergence,"[""newsha@baidu.com"", ""joel@baidu.com"", ""gregdiamos@baidu.com""]","[""Newsha Ardalani"", ""Joel Hestness"", ""Gregory Diamos""]","[""gradient descent"", ""optimization"", ""convergence time"", ""halting time"", ""characterization""]","A long-held conventional wisdom states that larger models train more slowly when using gradient descent. This work challenges this widely-held belief, showing that larger models can potentially train faster despite the increasing computational requirements of each training step. In particular, we study the effect of network structure (depth and width) on halting time and show that larger models---wider models in particular---take fewer training steps to converge. + +We design simple experiments to quantitatively characterize the effect of overparametrization on weight space traversal. Results show that halting time improves when growing the model's width for three different applications, and the improvement comes from each factor: The distance from initialized weights to converged weights shrinks with a power-law-like relationship, the average step size grows with a power-law-like relationship, and gradient vectors become more aligned with each other during traversal. +",/pdf/e74b6e80a2c948d26e7630664b62caffbec89029.pdf,ICLR,2019,"Empirically shows that larger models train in fewer training steps, because all factors in weight space traversal improve." 
+b905-XVjbDO,k1rqBPPmJZm,1601310000000.0,1614990000000.0,1686,Globally Injective ReLU networks,"[""map19@rice.edu"", ""~Konik_Kothari1"", ""~Matti_Lassas1"", ""~Ivan_Dokmani\u01071"", ""~Maarten_V._de_Hoop2""]","[""Michael Puthawala"", ""Konik Kothari"", ""Matti Lassas"", ""Ivan Dokmani\u0107"", ""Maarten V. de Hoop""]","[""Generative models"", ""injectivity of neural networks"", ""universal approximation"", ""inference"", ""compressed sensing with generative priors"", ""well-posedness"", ""random projections""]","Injectivity plays an important role in generative models where it enables inference; in inverse problems and compressed sensing with generative priors it is a precursor to well posedness. We establish sharp characterizations of injectivity of fully-connected and convolutional ReLU layers and networks. First, through a layerwise analysis, we show that an expansivity factor of two is necessary and sufficient for injectivity by constructing appropriate weight matrices. We show that global injectivity with iid Gaussian matrices, a commonly used tractable model, requires larger expansivity between 3.4 and 10.5. We also characterize the stability of inverting an injective network via worst-case Lipschitz constants of the inverse. We then use arguments from differential topology to study injectivity of deep networks and prove that any Lipschitz map can be approximated by an injective ReLU network. Finally, using an argument based on random projections, we show that an end-to-end---rather than layerwise---doubling of the dimension suffices for injectivity. Our results establish a theoretical basis for the study of nonlinear inverse and inference problems using neural networks.",/pdf/a6621d2ce49556d9b4cbed5b11293fb7dba85929.pdf,ICLR,2021,"We provide a complete characterization of injective deep ReLU networks with implications for compressed sensing, inverse problems, and inference with generative models." 
+rke8ZhCcFQ,HkxFt_5qKX,1538090000000.0,1545360000000.0,1176,ATTACK GRAPH CONVOLUTIONAL NETWORKS BY ADDING FAKE NODES,"[""xiywang@ucdavis.edu"", ""featon@nvidia.com"", ""chohsieh@ucdavis.edu"", ""sfwu@ucdavis.edu""]","[""Xiaoyun Wang"", ""Joe Eaton"", ""Cho-Jui Hsieh"", ""Felix Wu""]","[""Graph Convolutional Network"", ""adversarial attack"", ""node classification""]","Graph convolutional networks (GCNs) have been widely used for classifying graph nodes in the semi-supervised setting. +Previous works have shown that GCNs are vulnerable to the perturbation on adjacency and feature matrices of existing nodes. However, it is unrealistic to change the connections of existing nodes in many applications, such as existing users in social networks. In this paper, we investigate methods attacking GCNs by adding fake nodes. A greedy algorithm is proposed to generate adjacency and feature matrices of fake nodes, aiming to minimize the classification accuracy on the existing ones. In additional, we introduce a discriminator to classify fake nodes from real nodes, and propose a Greedy-GAN algorithm to simultaneously update the discriminator and the attacker, to make fake nodes indistinguishable to the real ones. 
Our non-targeted attack decreases the accuracy of GCN down to 0.10, and our targeted attack reaches a success rate of 0.99 for attacking the whole datasets, and 0.94 on average for attacking a single node.",/pdf/599c2e723f3adbb416da281cc68db4a7b0841082.pdf,ICLR,2019,non-targeted and targeted attack on GCN by adding fake nodes +SJeoE0VKDS,rJx-_ULuDS,1569440000000.0,1577170000000.0,1088,Novelty Search in representational space for sample efficient exploration,"[""ruo.tao@mail.mcgill.ca"", ""vincent.francois-lavet@mail.mcgill.ca"", ""jpineau@cs.mcgill.ca""]","[""Ruo Yu Tao"", ""Vincent Fran\u00e7ois-Lavet"", ""Joelle Pineau""]","[""Reinforcement Learning"", ""Exploration""]","We present a new approach for efficient exploration which leverages a low-dimensional encoding of the environment learned with a combination of model-based and model-free objectives. Our approach uses intrinsic rewards that are based on a weighted distance of nearest neighbors in the low dimensional representational space to gauge novelty. +We then leverage these intrinsic rewards for sample-efficient exploration with planning routines in representational space. +One key element of our approach is that we perform more gradient steps in-between every environment step in order to ensure the model accuracy. We test our approach on a number of maze tasks, as well as a control problem and show that our exploration approach is more sample-efficient compared to strong baselines. ",/pdf/bee6959ed53763f5aa60a7908c8e7a2063ec6db3.pdf,ICLR,2020,We conduct exploration using intrinsic rewards that are based on a weighted distance of nearest neighbors in representational space. 
+ByloJ20qtm,rJebmyAqt7,1538090000000.0,1550890000000.0,1014,Neural Program Repair by Jointly Learning to Localize and Repair,"[""vasic@utexas.edu"", ""akanade@google.com"", ""maniatis@google.com"", ""dbieber@google.com"", ""rising@google.com""]","[""Marko Vasic"", ""Aditya Kanade"", ""Petros Maniatis"", ""David Bieber"", ""Rishabh Singh""]","[""neural program repair"", ""neural program embeddings"", ""pointer networks""]","Due to its potential to improve programmer productivity and software quality, automated program repair has been an active topic of research. Newer techniques harness neural networks to learn directly from examples of buggy programs and their fixes. In this work, we consider a recently identified class of bugs called variable-misuse bugs. The state-of-the-art solution for variable misuse enumerates potential fixes for all possible bug locations in a program, before selecting the best prediction. We show that it is beneficial to train a model that jointly and directly localizes and repairs variable-misuse bugs. We present multi-headed pointer networks for this purpose, with one head each for localization and repair. The experimental results show that the joint model significantly outperforms an enumerative solution that uses a pointer based model for repair alone.",/pdf/4ecedd21b029dd6e88227fcaa87a4338beed3c7a.pdf,ICLR,2019,Multi-headed Pointer Networks for jointly learning to localize and repair Variable Misuse bugs +Hygi7xStvS,BJgDbSxFwS,1569440000000.0,1577170000000.0,2224,Lossless Data Compression with Transformer,"[""gizacard@gmail.com"", ""ajoulin@fb.com"", ""egrave@fb.com""]","[""Gautier Izacard"", ""Armand Joulin"", ""Edouard Grave""]","[""data compression"", ""transformer""]","Transformers have replaced long-short term memory and other recurrent neural networks variants in sequence modeling. 
They achieve state-of-the-art performance on a wide range of tasks related to natural language processing, including language modeling, machine translation, and sentence representation. Lossless compression is another problem that can benefit from better sequence models. It is closely related to the problem of online learning of language models. But, despite this resemblance, it is an area where purely neural network based methods have not yet reached the compression ratio of state-of-the-art algorithms. In this paper, we propose a Transformer based lossless compression method that matches the best compression ratio for text. Our approach is purely based on neural networks and does not rely on hand-crafted features as other lossless compression algorithms do. We also provide a thorough study of the impact of the different components of the Transformer and its training on the compression ratio.",/pdf/83555d5592e7501d04ab23681385b72a81bad943.pdf,ICLR,2020,Application of transformer networks to lossless data compression +rJe4_xSFDB,BklRjhxYDH,1569440000000.0,1583910000000.0,2394,Lipschitz constant estimation of Neural Networks via sparse polynomial optimization,"[""fabian.latorre@epfl.ch"", ""paul.rolland@epfl.ch"", ""volkan.cevher@epfl.ch""]","[""Fabian Latorre"", ""Paul Rolland"", ""Volkan Cevher""]","[""robust networks"", ""Lipschitz constant"", ""polynomial optimization""]","We introduce LiPopt, a polynomial optimization framework for computing increasingly tighter upper bounds on the Lipschitz constant of neural networks. The underlying optimization problems boil down to either linear (LP) or semidefinite (SDP) programming. We show how to use the sparse connectivity of a network to significantly reduce the complexity of computation. This is especially useful for convolutional as well as pruned neural networks. 
We conduct experiments on networks with random weights as well as networks trained on MNIST, showing that in the particular case of the $\ell_\infty$-Lipschitz constant, our approach yields superior estimates as compared to other baselines available in the literature. +",/pdf/1955c3585c78c18fe3ce133f94802e9b47f80307.pdf,ICLR,2020,LP-based upper bounds on the Lipschitz constant of Neural Networks +BkJ3ibb0-,rJAjibWC-,1509130000000.0,1519370000000.0,714,Defense-GAN: Protecting Classifiers Against Adversarial Attacks Using Generative Models,"[""pouya@umiacs.umd.edu"", ""mayak@umiacs.umd.edu"", ""rama@umiacs.umd.edu""]","[""Pouya Samangouei"", ""Maya Kabkab"", ""Rama Chellappa""]",[],"In recent years, deep neural network approaches have been widely adopted for machine learning tasks, including classification. However, they were shown to be vulnerable to adversarial perturbations: carefully crafted small perturbations can cause misclassification of legitimate images. We propose Defense-GAN, a new framework leveraging the expressive capability of generative models to defend deep neural networks against such attacks. Defense-GAN is trained to model the distribution of unperturbed images. At inference time, it finds a close output to a given image which does not contain the adversarial changes. This output is then fed to the classifier. Our proposed method can be used with any classification model and does not modify the classifier structure or training procedure. It can also be used as a defense against any attack as it does not assume knowledge of the process for generating the adversarial examples. We empirically show that Defense-GAN is consistently effective against different attack methods and improves on existing defense strategies.",/pdf/c64a9663a82e760275d885b4ed0f955bdf39a399.pdf,ICLR,2018,Defense-GAN uses a Generative Adversarial Network to defend against white-box and black-box attacks in classification models. 
+H1xmqiAqFm,rklknVc5KX,1538090000000.0,1545360000000.0,523,Investigating CNNs' Learning Representation under label noise,"[""hataya@nlab.ci.i.u-tokyo.ac.jp"", ""nakayama@nlab.ci.i.u-tokyo.aco.jp""]","[""Ryuichiro Hataya"", ""Hideki Nakayama""]","[""learning with noisy labels"", ""deep learning"", ""convolutional neural networks""]","Deep convolutional neural networks (CNNs) are known to be robust against label noise on extensive datasets. However, at the same time, CNNs are capable of memorizing all labels even if they are random, which means they can memorize corrupted labels. Are CNNs robust or fragile to label noise? Much of the research focusing on such memorization uses class-independent label noise to simulate label corruption, but this setting is simple and unrealistic. In this paper, we investigate the behavior of CNNs under class-dependently simulated label noise, which is generated based on the conceptual distance between classes of a large dataset (i.e., ImageNet-1k). Contrary to previous knowledge, we reveal CNNs are more robust to such class-dependent label noise than class-independent label noise. We also demonstrate that networks under class-dependent noise learn representations more similar to those of the no-noise situation than networks under class-independent noise do.",/pdf/ce1045faa6656fb4b9622079567a31f459a3b661.pdf,ICLR,2019,"Are CNNs robust or fragile to label noise? Practically, robust." +Hy6GHpkCW,rk2fr6yRW,1509050000000.0,1530860000000.0,178,A Neural Representation of Sketch Drawings,"[""hadavid@google.com"", ""deck@google.com""]","[""David Ha"", ""Douglas Eck""]","[""applications"", ""image modelling"", ""computer-assisted"", ""drawing"", ""art"", ""creativity"", ""dataset""]","We present sketch-rnn, a recurrent neural network able to construct stroke-based drawings of common objects. The model is trained on a dataset of human-drawn images representing many different classes. 
We outline a framework for conditional and unconditional sketch generation, and describe new robust training methods for generating coherent sketch drawings in a vector format.",/pdf/1e7e33ffa8fe030c42dad328e4f075c614ccce55.pdf,ICLR,2018,"We investigate alternative to traditional pixel image modelling approaches, and propose a generative model for vector images." +ryedjkSFwr,Hkg1pRCOwr,1569440000000.0,1577170000000.0,1921,Global Momentum Compression for Sparse Communication in Distributed SGD,"[""zhaosy@lamda.nju.edu.cn"", ""xieyp@lamda.nju.edu.cn"", ""gaoh@lamda.nju.edu.cn"", ""liwujun@nju.edu.cn""]","[""Shen-Yi Zhao"", ""Yin-Peng Xie"", ""Hao Gao"", ""Wu-Jun Li""]","[""Distributed momentum SGD"", ""Communication compression""]","With the rapid growth of data, distributed stochastic gradient descent~(DSGD) has been widely used for solving large-scale machine learning problems. Due to the latency and limited bandwidth of network, communication has become the bottleneck of DSGD when we need to train large scale models, like deep neural networks. Communication compression with sparsified gradient, abbreviated as \emph{sparse communication}, has been widely used for reducing communication cost in DSGD. Recently, there has appeared one method, called deep gradient compression~(DGC), to combine memory gradient and momentum SGD for sparse communication. DGC has achieved promising performance in practice. However, the theory about the convergence of DGC is lack. In this paper, we propose a novel method, called \emph{\underline{g}}lobal \emph{\underline{m}}omentum \emph{\underline{c}}ompression~(GMC), for sparse communication in DSGD. GMC also combines memory gradient and momentum SGD. But different from DGC which adopts local momentum, GMC adopts global momentum. We theoretically prove the convergence rate of GMC for both convex and non-convex problems. 
To the best of our knowledge, this is the first work that proves the convergence of distributed momentum SGD~(DMSGD) with sparse communication and memory gradient. Empirical results show that, compared with the DMSGD counterpart without sparse communication, GMC can reduce the communication cost by approximately 100 fold without loss of generalization accuracy. GMC can also achieve comparable~(sometimes better) performance compared with DGC, with an extra theoretical guarantee.",/pdf/c2b4e75d8f0d8541b00d717c1ee6be1e3100c5fe.pdf,ICLR,2020,"We propose a novel method combining global momentum and memory gradient for sparse communication, with an extra convergence guarantee." +ByxaUgrFvH,r1x5hcgFPB,1569440000000.0,1586750000000.0,2341,Mutual Information Gradient Estimation for Representation Learning,"[""wlj6816@gmail.com"", ""zhouyiji@outlook.com"", ""ronghe1217@gmail.com"", ""mingyuan.zhou@mccombs.utexas.edu"", ""zenglin@gmail.com""]","[""Liangjian Wen"", ""Yiji Zhou"", ""Lirong He"", ""Mingyuan Zhou"", ""Zenglin Xu""]","[""Mutual Information"", ""Score Estimation"", ""Representation Learning"", ""Information Bottleneck""]","Mutual Information (MI) plays an important role in representation learning. However, MI is unfortunately intractable in continuous and high-dimensional settings. Recent advances establish tractable and scalable MI estimators to discover useful representation. However, most of the existing methods are not capable of providing an accurate estimation of MI with low-variance when the MI is large. We argue that directly estimating the gradients of MI is more appealing for representation learning than estimating MI in itself. To this end, we propose the Mutual Information Gradient Estimator (MIGE) for representation learning based on the score estimation of implicit distributions. MIGE exhibits a tight and smooth gradient estimation of MI in the high-dimensional and large-MI settings. 
We expand the applications of MIGE in both unsupervised learning of deep representations based on InfoMax and the Information Bottleneck method. Experimental results have indicated significant performance improvement in learning useful representation.",/pdf/0b5d08511f43c8eea7a10da7a0ae88603c8ee064.pdf,ICLR,2020, +rJx_b3RqY7,Ske0kMA9YQ,1538090000000.0,1545360000000.0,1184,AIM: Adversarial Inference by Matching Priors and Conditionals,"[""alexanderhanboli@gmail.com"", ""yaqingwa@buffalo.edu"", ""cchangyou@gmail.com"", ""jing@buffalo.edu""]","[""Hanbo Li"", ""Yaqing Wang"", ""Changyou Chen"", ""Jing Gao""]","[""Generative adversarial network"", ""inference"", ""generative model""]","Effective inference for a generative adversarial model remains an important and challenging problem. We propose a novel approach, Adversarial Inference by Matching priors and conditionals (AIM), which explicitly matches prior and conditional distributions in both data and code spaces, and puts a direct constraint on the dependency structure of the generative model. We derive an equivalent form of the prior and conditional matching objective that can be optimized efficiently without any parametric assumption on the data. We validate the effectiveness of AIM on the MNIST, CIFAR-10, and CelebA datasets by conducting quantitative and qualitative evaluations. Results demonstrate that AIM significantly improves both reconstruction and generation as compared to other adversarial inference models.",/pdf/5125b9a3a85a504329d12296b61b7be99ca3188e.pdf,ICLR,2019, +HMEiDPTOTmY,OQ5aBMfvSwm,1601310000000.0,1614990000000.0,2186,Later Span Adaptation for Language Understanding,"[""~Rongzhou_Bao1"", ""~Zhuosheng_Zhang1"", ""~hai_zhao1""]","[""Rongzhou Bao"", ""Zhuosheng Zhang"", ""hai zhao""]",[],"Pre-trained contextualized language models (PrLMs) broadly use fine-grained tokens (words or sub-words) as minimal linguistic unit in pre-training phase. 
Introducing span-level information in pre-training has been shown capable of further enhancing PrLMs. However, such methods require enormous resources and lack adaptivity due to the huge computational requirements of pre-training. Instead of fixing the linguistic unit input too early as nearly all previous work did, we propose a novel method that combines span-level information into the representations generated by PrLMs during the fine-tuning phase for better flexibility. In this way, the modeling procedure of span-level texts can be more adaptive to different downstream tasks. In detail, we divide the sentence into several spans according to the segmentation generated by a pre-sampled dictionary. Based on the sub-token-level representation provided by PrLMs, we enhance the connection between the tokens in each span and gain a representation with enhanced span-level information. Experiments are conducted on the GLUE benchmark and show that our approach can remarkably enhance the performance of PrLMs in various natural language understanding tasks.",/pdf/f89bde3d3801f0f4a50f42589350c1e72c940710.pdf,ICLR,2021, +rkeUrjCcYQ,BkxjWXrKKQ,1538090000000.0,1545360000000.0,90,Monge-Amp\`ere Flow for Generative Modeling,"[""linfengz@princeton.edu"", ""weinan@math.princeton.edu"", ""wanglei@iphy.ac.cn""]","[""Linfeng Zhang"", ""Weinan E"", ""Lei Wang""]","[""generative modeling"", ""Monge-Amp\\`ere equation"", ""dynamical system"", ""optimal transport"", ""density estimation"", ""free energy calculation""]","We present a deep generative model, named Monge-Amp\`ere flow, which builds on continuous-time gradient flow arising from the Monge-Amp\`ere equation in optimal transport theory. The generative map from the latent space to the data space follows a dynamical system, where a learnable potential function guides a compressible fluid to flow towards the target density distribution. Training of the model amounts to solving an optimal control problem. 
The Monge-Amp\`ere flow has tractable likelihoods and supports efficient sampling and inference. One can easily impose symmetry constraints in the generative model by designing suitable scalar potential functions. We apply the approach to unsupervised density estimation of the MNIST dataset and variational calculation of the two-dimensional Ising model at the critical point. This approach brings insights and techniques from Monge-Amp\`ere equation, optimal transport, and fluid dynamics into reversible flow-based generative models. ",/pdf/ad63645db0c0c05fbf18648ca40e0523384ad4f4.pdf,ICLR,2019,A gradient flow based dynamical system for invertible generative modeling +dYeAHXnpWJ4,I4bsT6RJfJ,1601310000000.0,1614760000000.0,1633,Rethinking the Role of Gradient-based Attribution Methods for Model Interpretability,"[""~Suraj_Srinivas1"", ""francois.fleuret@unige.ch""]","[""Suraj Srinivas"", ""Francois Fleuret""]","[""Interpretability"", ""saliency maps"", ""score-matching""]","Current methods for the interpretability of discriminative deep neural networks commonly rely on the model's input-gradients, i.e., the gradients of the output logits w.r.t. the inputs. The common assumption is that these input-gradients contain information regarding $p_{\theta} ( y\mid \mathbf{x} )$, the model's discriminative capabilities, thus justifying their use for interpretability. However, in this work, we show that these input-gradients can be arbitrarily manipulated as a consequence of the shift-invariance of softmax without changing the discriminative function. This leaves an open question: given that input-gradients can be arbitrary, why are they highly structured and explanatory in standard models? 
+ +In this work, we re-interpret the logits of standard softmax-based classifiers as unnormalized log-densities of the data distribution and show that input-gradients can be viewed as gradients of a class-conditional generative model $p_{\theta}(\mathbf{x} \mid y)$ implicit in the discriminative model. This leads us to hypothesize that the highly structured and explanatory nature of input-gradients may be due to the alignment of this class-conditional model $p_{\theta}(\mathbf{x} \mid y)$ with that of the ground truth data distribution $p_{\text{data}} (\mathbf{x} \mid y)$. We test this hypothesis by studying the effect of density alignment on gradient explanations. To achieve this density alignment, we use an algorithm called score-matching, and propose novel approximations to this algorithm to enable training large-scale models. + +Our experiments show that improving the alignment of the implicit density model with the data distribution enhances gradient structure and explanatory power while reducing this alignment has the opposite effect. This also leads us to conjecture that unintended density alignment in standard neural network training may explain the highly structured nature of input-gradients observed in practice. Overall, our finding that input-gradients capture information regarding an implicit generative model implies that we need to re-think their use for interpreting discriminative models.",/pdf/b29e31cf78e011a59f9e49950670211d4c516b00.pdf,ICLR,2021,"Input-gradients in discriminative neural net models capture information regarding an implicit density model, rather than that of the underlying discriminative model which it is intended to explain." +SJi9WOeRb,By5qZ_lCW,1509090000000.0,1519400000000.0,303,Gradient Estimators for Implicit Models,"[""yl494@cam.ac.uk"", ""ret26@cam.ac.uk""]","[""Yingzhen Li"", ""Richard E. 
Turner""]","[""Implicit Models"", ""Approximate Inference"", ""Deep Learning""]","Implicit models, which allow for the generation of samples but not for point-wise evaluation of probabilities, are omnipresent in real-world problems tackled by machine learning and a hot topic of current research. Some examples include data simulators that are widely used in engineering and scientific research, generative adversarial networks (GANs) for image synthesis, and hot-off-the-press approximate inference techniques relying on implicit distributions. The majority of existing approaches to learning implicit models rely on approximating the intractable distribution or optimisation objective for gradient-based optimisation, which is liable to produce inaccurate updates and thus poor models. This paper alleviates the need for such approximations by proposing the \emph{Stein gradient estimator}, which directly estimates the score function of the implicitly defined distribution. The efficacy of the proposed estimator is empirically demonstrated by examples that include meta-learning for approximate inference and entropy regularised GANs that provide improved sample diversity.",/pdf/4a47cfaa30d979549f6b42cef5a1371babb02e08.pdf,ICLR,2018,"We introduced a novel gradient estimator using Stein's method, and compared with other methods on learning implicit models for approximate inference and image generation." +npOuXc85I5k,FHh0ZV_cG,1601310000000.0,1614990000000.0,1406,Pareto Adversarial Robustness: Balancing Spatial Robustness and Sensitivity-based Robustness,"[""~Ke_Sun3"", ""~Mingjie_Li1"", ""~Zhouchen_Lin1""]","[""Ke Sun"", ""Mingjie Li"", ""Zhouchen Lin""]",[],"Adversarial robustness, mainly including sensitivity-based robustness and spatial robustness, plays an integral part in the robust generalization. In this paper, we endeavor to design strategies to achieve comprehensive adversarial robustness. 
To hit this target, firstly we investigate the less-studied spatial robustness and then integrate existing spatial robustness methods by incorporating both local and global spatial vulnerability into one spatial attack design. Based on this exploration, we further present a comprehensive relationship between natural accuracy, sensitivity-based and different spatial robustness, supported by the strong evidence from the perspective of representation. More importantly, in order to balance these mutual impact within different robustness into one unified framework, we incorporate the Pareto criterion into the adversarial robustness analysis, yielding a novel strategy towards comprehensive robustness called \textit{Pareto Adversarial Training}. The resulting Pareto front, the set of optimal solutions, provides the set of optimal balance among natural accuracy and different adversarial robustness, shedding light on solutions towards comprehensive robustness in the future. To the best of our knowledge, we are the first to consider comprehensive robustness via the multi-objective optimization.",/pdf/19dad14b16e7d53143022400d01cf129ba4f9c25.pdf,ICLR,2021, +rJlWOj0qF7,H1ldE4tqYm,1538090000000.0,1550580000000.0,331,Imposing Category Trees Onto Word-Embeddings Using A Geometric Construction,"[""tian1shi2@gmail.com"", ""christian.bauckhage@iais.fraunhofer.de"", ""jinhl15@mails.tsinghua.edu.cn"", ""lijuanzi2008@gmail.com"", ""cremerso@iai.uni-bonn.de"", ""dsp@bit.uni-bonn.de"", ""abc@iai.uni-bonn.de"", ""jz@bit.uni-bonn.de""]","[""Tiansi Dong"", ""Chrisitan Bauckhage"", ""Hailong Jin"", ""Juanzi Li"", ""Olaf Cremers"", ""Daniel Speicher"", ""Armin B. Cremers"", ""Joerg Zimmermann""]","[""category tree"", ""word-embeddings"", ""geometry""]","We present a novel method to precisely impose tree-structured category information onto word-embeddings, resulting in ball embeddings in higher dimensional spaces (N-balls for short). 
Inclusion relations among N-balls implicitly encode subordinate relations among categories. The similarity measurement in terms of the cosine function is enriched by category information. Using a geometric construction method instead of back-propagation, we create large N-ball embeddings that satisfy two conditions: (1) category trees are precisely imposed onto word embeddings at zero energy cost; (2) pre-trained word embeddings are well preserved. A new benchmark data set is created for validating the category of unknown words. Experiments show that N-ball embeddings, carrying category information, significantly outperform word embeddings in the test of nearest neighborhoods, and demonstrate surprisingly good performance in validating categories of unknown words. Source code and data sets are freely available at \url{https://github.com/gnodisnait/nball4tree.git} and \url{https://github.com/gnodisnait/bp94nball.git}. ",/pdf/fe8206fc94e754b9fe88b13f93f1e03dedc3eeed.pdf,ICLR,2019,We show a geometric method to perfectly encode category tree information into pre-trained word-embeddings. +rkGabzZgl,,1477680000000.0,1487190000000.0,9,Dropout with Expectation-linear Regularization,"[""xuezhem@cs.cmu.edu"", ""yingkaig@cs.cmu.edu"", ""zhitinghu@cs.cmu.edu"", ""yaoliang@cs.cmu.edu"", ""dengyuntian@gmail.com"", ""hovy@cmu.edu""]","[""Xuezhe Ma"", ""Yingkai Gao"", ""Zhiting Hu"", ""Yaoliang Yu"", ""Yuntian Deng"", ""Eduard Hovy""]","[""Theory"", ""Deep learning"", ""Supervised Learning""]","Dropout, a simple and effective way to train deep neural networks, has led to a number of impressive empirical successes and spawned many recent theoretical investigations. However, the gap between dropout’s training and inference phases, introduced due to tractability considerations, has largely remained under-appreciated. 
In this work, we first formulate dropout as a tractable approximation of some latent variable model, leading to a clean view of parameter sharing and enabling further theoretical analysis. Then, we introduce (approximate) expectation-linear dropout neural networks, whose inference gap we are able to formally characterize. Algorithmically, we show that our proposed measure of the inference gap can be used to regularize the standard dropout training objective, resulting in an explicit control of the gap. Our method is as simple and efficient as standard dropout. We further prove the upper bounds on the loss in accuracy due to expectation-linearization, describe classes of input distributions that expectation-linearize easily. Experiments on three image classification benchmark datasets demonstrate that reducing the inference gap can indeed improve the performance consistently.",/pdf/436cd523473505454012d97b09f3321aa071a2cc.pdf,ICLR,2017, +BywyFQlAW,HkLyKXlCW,1509080000000.0,1528440000000.0,232,Minimax Curriculum Learning: Machine Teaching with Desirable Difficulties and Scheduled Diversity,"[""tianyi.david.zhou@gmail.com"", ""bilmes@uw.edu""]","[""Tianyi Zhou"", ""Jeff Bilmes""]","[""machine teaching"", ""deep learning"", ""minimax"", ""curriculum learning"", ""submodular"", ""diversity""]","We introduce and study minimax curriculum learning (MCL), a new method for adaptively selecting a sequence of training subsets for a succession of stages in machine learning. The subsets are encouraged to be small and diverse early on, and then larger, harder, and allowably more homogeneous in later stages. At each stage, model weights and training sets are chosen by solving a joint continuous-discrete minimax optimization, whose objective is composed of a continuous loss (reflecting training set hardness) and a discrete submodular promoter of diversity for the chosen subset. 
MCL repeatedly solves a sequence of such optimizations with a schedule of increasing training set size and decreasing pressure on diversity encouragement. We reduce MCL to the minimization of a surrogate function handled by submodular maximization and continuous gradient methods. We show that MCL achieves better performance and, with a clustering trick, uses fewer labeled samples for both shallow and deep models while achieving the same performance. Our method involves repeatedly solving constrained submodular maximization of an only slowly varying function on the same ground set. Therefore, we develop a heuristic method that utilizes the previous submodular maximization solution as a warm start for the current submodular maximization process to reduce computation while still yielding a guarantee.",/pdf/e50100f756a361dfbea712ff382c7b3fc1f2dcea.pdf,ICLR,2018,Minimax Curriculum Learning is a machine teaching method involving increasing desirable hardness and scheduled reducing diversity. +JbAqsfbYsJy,pvpaUIrPs2Zc,1601310000000.0,1614990000000.0,1861,Action and Perception as Divergence Minimization,"[""~Danijar_Hafner1"", ""~Pedro_A_Ortega1"", ""~Jimmy_Ba1"", ""thomas.parr.12@ucl.ac.uk"", ""~Karl_Friston1"", ""~Nicolas_Heess1""]","[""Danijar Hafner"", ""Pedro A Ortega"", ""Jimmy Ba"", ""Thomas Parr"", ""Karl Friston"", ""Nicolas Heess""]","[""objective functions"", ""reinforcement learning"", ""information theory"", ""probabilistic modeling"", ""control as inference"", ""exploration"", ""intrinsic motivation"", ""world models""]","We introduce a unified objective for action and perception of intelligent agents. Extending representation learning and control, we minimize the joint divergence between the combined system of agent and environment and a target distribution. Intuitively, such agents use perception to align their beliefs with the world, and use actions to align the world with their beliefs. 
Minimizing the joint divergence to an expressive target maximizes the mutual information between the agent's representations and inputs, thus inferring representations that are informative of past inputs and exploring future inputs that are informative of the representations. This lets us explain intrinsic objectives, such as representation learning, information gain, empowerment, and skill discovery from minimal assumptions. Moreover, interpreting the target distribution as a latent variable model suggests powerful world models as a path toward highly adaptive agents that seek large niches in their environments, rendering task rewards optional. The framework provides a common language for comparing a wide range of objectives, advances the understanding of latent variables for decision making, and offers a recipe for designing novel objectives. We recommend deriving future agent objectives from the joint divergence to facilitate comparison, to point out the agent's target distribution, and to identify the intrinsic objective terms needed to reach that distribution.",/pdf/93c5d37716cb74040572055765d14d464f9b24b7.pdf,ICLR,2021, +BJxSI1SKDH,HJgITL6_PS,1569440000000.0,1583910000000.0,1727,A Latent Morphology Model for Open-Vocabulary Neural Machine Translation,"[""duyguataman@gmail.com"", ""will.aziz@gmail.com"", ""a.birch@ed.ac.uk""]","[""Duygu Ataman"", ""Wilker Aziz"", ""Alexandra Birch""]","[""neural machine translation"", ""low-resource languages"", ""latent-variable models""]","Translation into morphologically-rich languages challenges neural machine translation (NMT) models with extremely sparse vocabularies where atomic treatment of surface forms is unrealistic. This problem is typically addressed by either pre-processing words into subword units or performing translation directly at the level of characters. The former is based on word segmentation algorithms optimized using corpus-level statistics with no regard to the translation task. 
The latter learns directly from translation data but requires rather deep architectures. In this paper, we propose to translate words by modeling word formation through a hierarchical latent variable model which mimics the process of morphological inflection. Our model generates words one character at a time by composing two latent representations: a continuous one, aimed at capturing the lexical semantics, and a set of (approximately) discrete features, aimed at capturing the morphosyntactic function, which are shared among different surface forms. Our model achieves better accuracy in translation into three morphologically-rich languages than conventional open-vocabulary NMT methods, while also demonstrating a better generalization capacity under low to mid-resource settings.",/pdf/baa0eb7b9ff7a35095265aaf73866521790ad753.pdf,ICLR,2020, +XjYgR6gbCEc,pqBcUWSDBnX,1601310000000.0,1613390000000.0,1427,MODALS: Modality-agnostic Automated Data Augmentation in the Latent Space,"[""~Tsz-Him_Cheung1"", ""~Dit-Yan_Yeung2""]","[""Tsz-Him Cheung"", ""Dit-Yan Yeung""]","[""deep learning"", ""data augmentation"", ""automated data augmentation"", ""latent space""]","Data augmentation is an efficient way to expand a training dataset by creating additional artificial data. While data augmentation is found to be effective in improving the generalization capabilities of models for various machine learning tasks, the underlying augmentation methods are usually manually designed and carefully evaluated for each data modality separately, like image processing functions for image data and word-replacing rules for text data. In this work, we propose an automated data augmentation approach called MODALS (Modality-agnostic Automated Data Augmentation in the Latent Space) to augment data for any modality in a generic way. 
MODALS exploits automated data augmentation to fine-tune four universal data transformation operations in the latent space to adapt the transform to data of different modalities. Through comprehensive experiments, we demonstrate the effectiveness of MODALS on multiple datasets for text, tabular, time-series and image modalities.",/pdf/83d8754b7f6d39240200978f99334ce755678fb1.pdf,ICLR,2021,MODALS is an automated data augmentation framework that fine-tunes four universal data transformation operations in the latent space to augment data of different modalities. +HJgC60EtwB,HyxmRNqOwr,1569440000000.0,1583910000000.0,1413,Robust Reinforcement Learning for Continuous Control with Model Misspecification,"[""dmankowitz@google.com"", ""nirlevine@google.com"", ""raejeong@google.com"", ""aabdolmaleki@google.com"", ""springenberg@google.com"", ""yyshi@google.com"", ""kayj@google.com"", ""toddhester@google.com"", ""timothymann@google.com"", ""riedmiller@google.com""]","[""Daniel J. Mankowitz"", ""Nir Levine"", ""Rae Jeong"", ""Abbas Abdolmaleki"", ""Jost Tobias Springenberg"", ""Yuanyuan Shi"", ""Jackie Kay"", ""Todd Hester"", ""Timothy Mann"", ""Martin Riedmiller""]","[""reinforcement learning"", ""robustness""]","We provide a framework for incorporating robustness -- to perturbations in the transition dynamics which we refer to as model misspecification -- into continuous control Reinforcement Learning (RL) algorithms. We specifically focus on incorporating robustness into a state-of-the-art continuous control RL algorithm called Maximum a-posteriori Policy Optimization (MPO). We achieve this by learning a policy that optimizes for a worst case, entropy-regularized, expected return objective and derive a corresponding robust entropy-regularized Bellman contraction operator. In addition, we introduce a less conservative, soft-robust, entropy-regularized objective with a corresponding Bellman operator. 
We show that both robust and soft-robust policies outperform their non-robust counterparts in nine MuJoCo domains with environment perturbations. In addition, we show improved robust performance on a challenging, simulated, dexterous robotic hand. Finally, we present multiple investigative experiments that provide a deeper insight into the robustness framework, including an adaptation to another continuous control RL algorithm. Performance videos can be found online at https://sites.google.com/view/robust-rl.",/pdf/25983a067172bdfe6b7e570573e4101d12a2d1a3.pdf,ICLR,2020,A framework for incorporating robustness to model misspecification into continuous control Reinforcement Learning algorithms. +SJeeL04KvH,H1ew1A8_wr,1569440000000.0,1577170000000.0,1127,Robust Federated Learning Through Representation Matching and Adaptive Hyper-parameters,"[""hesham.mostafa@intel.com""]","[""Hesham Mostafa""]","[""federated learning"", ""hyper-parameter tuning"", ""regularization""]"," Federated learning is a distributed, privacy-aware learning scenario which trains a single model on data belonging to several clients. Each client trains a local model on its data and the local models are then aggregated by a central party. Current federated learning methods struggle in cases with heterogeneous client-side data distributions which can quickly lead to divergent local models and a collapse in performance. Careful hyper-parameter tuning is particularly important in these cases but traditional automated hyper-parameter tuning methods would require several training trials which is often impractical in a federated learning setting. We describe a two-pronged solution to the issues of robustness and hyper-parameter tuning in federated learning settings. We propose a novel representation matching scheme that reduces the divergence of local models by ensuring the feature representations in the global (aggregate) model can be derived from the locally learned representations. 
We also propose an online hyper-parameter tuning scheme which uses an online version of the REINFORCE algorithm to find a hyper-parameter distribution that maximizes the expected improvements in training loss. We show on several benchmarks that our two-part scheme of local representation matching and global adaptive hyper-parameters significantly improves performance and training robustness.",/pdf/2dde9a2735aeca1beffa99a478f3391a1134e026.pdf,ICLR,2020,"We describe a cheap, online, and automated hyper-parameter tuning scheme for Federated learning settings and a novel mechanism for mitigating model divergence in the presence of non-iid client data." +B1fPYj0qt7,BJlZ5SdcY7,1538090000000.0,1545360000000.0,454,Riemannian Stochastic Gradient Descent for Tensor-Train Recurrent Neural Networks,"[""jqi41@gatech.edu"", ""qij13@uw.edu"", ""javiertejedornoguerales@gmail.com""]","[""Jun Qi"", ""Chin-Hui Lee"", ""Javier Tejedor""]","[""Riemannian Stochastic Gradient Descent"", ""Tensor-Train"", ""Recurrent Neural Networks""]","The Tensor-Train factorization (TTF) is an efficient way to compress large weight matrices of fully-connected layers and recurrent layers in recurrent neural networks (RNNs). However, high Tensor-Train ranks for all the core tensors of parameters need to be element-wise fixed, which results in an unnecessary redundancy of model parameters. This work applies Riemannian stochastic gradient descent (RSGD) to train core tensors of parameters in the Riemannian Manifold before finding vectors of lower Tensor-Train ranks for parameters. The paper first presents the RSGD algorithm with a convergence analysis and then tests it on more advanced Tensor-Train RNNs such as bi-directional GRU/LSTM and Encoder-Decoder RNNs with a Tensor-Train attention model. The experiments on digit recognition and machine translation tasks suggest the effectiveness of the RSGD algorithm for Tensor-Train RNNs. 
",/pdf/adcd9143448fd902ce58ddabf1283443b08ff865.pdf,ICLR,2019,Applying the Riemannian SGD (RSGD) algorithm for training Tensor-Train RNNs to further reduce model parameters. +BkUp6GZRW,rysH6Gb0-,1509140000000.0,1519440000000.0,1017,Boosting the Actor with Dual Critic,"[""bohr.dai@gmail.com"", ""ashaw596@gatech.edu"", ""niaohe@illinois.edu"", ""lihongli.cs@gmail.com"", ""lsong@cc.gatech.edu""]","[""Bo Dai"", ""Albert Shaw"", ""Niao He"", ""Lihong Li"", ""Le Song""]","[""reinforcement learning"", ""actor-critic algorithm"", ""Lagrangian duality""]","This paper proposes a new actor-critic-style algorithm called Dual Actor-Critic or Dual-AC. It is derived in a principled way from the Lagrangian dual form of the Bellman optimality equation, which can be viewed as a two-player game between the actor and a critic-like function, which is named as dual critic. Compared to its actor-critic relatives, Dual-AC has the desired property that the actor and dual critic are updated cooperatively to optimize the same objective function, providing a more transparent way for learning the critic that is directly related to the objective function of the actor. We then provide a concrete algorithm that can effectively solve the minimax optimization problem, using techniques of multi-step bootstrapping, path regularization, and stochastic dual ascent algorithm. We demonstrate that the proposed algorithm achieves the state-of-the-art performances across several benchmarks.",/pdf/9ba722ec0ff57d5f6ee2469e14b6d3e2f3f99365.pdf,ICLR,2018,"We propose Dual Actor-Critic algorithm, which is derived in a principled way from the Lagrangian dual form of the Bellman optimality equation. The algorithm achieves the state-of-the-art performances across several benchmarks." 
+BylldnNFwS,Skl3GzjCrH,1569440000000.0,1577170000000.0,28,On the Decision Boundaries of Deep Neural Networks: A Tropical Geometry Perspective,"[""motasem.alfarra@kaust.edu.sa"", ""adel.bibi@kaust.edu.sa"", ""hasan.hammoud@kaust.edu.sa"", ""muhamed.gaafar@gmail.com"", ""bernard.ghanem@kaust.edu.sa""]","[""Motasem Alfarra"", ""Adel Bibi"", ""Hasan Hammoud"", ""Mohamed Gaafar"", ""Bernard Ghanem""]","[""Decision boundaries"", ""Neural Network"", ""Tropical Geometry"", ""Network Pruning"", ""Adversarial Attacks"", ""Lottery Ticket Hypothesis""]","This work tackles the problem of characterizing and understanding the decision boundaries of neural networks with piece-wise linear non-linearity activations. We use tropical geometry, a new development in the area of algebraic geometry, to provide a characterization of the decision boundaries of a simple neural network of the form (Affine, ReLU, Affine). Specifically, we show that the decision boundaries are a subset of a tropical hypersurface, which is intimately related to a polytope formed by the convex hull of two zonotopes. The generators of the zonotopes are precise functions of the neural network parameters. We utilize this geometric characterization to shed light and new perspective on three tasks. In doing so, we propose a new tropical perspective for the lottery ticket hypothesis, where we see the effect of different initializations on the tropical geometric representation of the decision boundaries. Also, we leverage this characterization as a new set of tropical regularizers, which deal directly with the decision boundaries of a network. We investigate the use of these regularizers in neural network pruning (removing network parameters that do not contribute to the tropical geometric representation of the decision boundaries) and in generating adversarial input attacks (with input perturbations explicitly perturbing the decision boundaries geometry to change the network prediction of the input). 
",/pdf/34a5dd2db7c3d59680cac9599213046e68e8f7c4.pdf,ICLR,2020,Tropical geometry can be leveraged to represent the decision boundaries of neural networks and bring to light interesting insights. +r1kQkVFgl,,1478210000000.0,1483880000000.0,89,Learning Python Code Suggestion with a Sparse Pointer Network,"[""avishkar.bhoopchand.15@ucl.ac.uk"", ""t.rocktaschel@cs.ucl.ac.uk"", ""e.barr@cs.ucl.ac.uk"", ""s.riedel@cs.ucl.ac.uk""]","[""Avishkar Bhoopchand"", ""Tim Rockt\u00e4schel"", ""Earl Barr"", ""Sebastian Riedel""]",[],"To enhance developer productivity, all modern integrated development environments (IDEs) include code suggestion functionality that proposes likely next tokens at the cursor. While current IDEs work well for statically-typed languages, their reliance on type annotations means that they do not provide the same level of support for dynamic programming languages as for statically-typed languages. Moreover, suggestion engines in modern IDEs do not propose expressions or multi-statement idiomatic code. Recent work has shown that language models can improve code suggestion systems by learning from software repositories. This paper introduces a neural language model with a sparse pointer network aimed at capturing very long range dependencies. We release a large-scale code suggestion corpus of 41M lines of Python code crawled from GitHub. On this corpus, we found standard neural language models to perform well at suggesting local phenomena, but struggle to refer to identifiers that are introduced many tokens in the past. By augmenting a neural language model with a pointer network specialized in referring to predefined classes of identifiers, we obtain a much lower perplexity and a 5 percentage points increase in accuracy for code suggestion compared to an LSTM baseline. In fact, this increase in code suggestion accuracy is due to a 13 times more accurate prediction of identifiers. 
Furthermore, a qualitative analysis shows this model indeed captures interesting long-range dependencies, like referring to a class member defined over 60 tokens in the past.",/pdf/c87dce17b558f6c58df97ba9f664e375407d4633.pdf,ICLR,2017,We augment a neural language model with a pointer network for code suggestion that is specialized to referring to predefined groups of identifiers +CYO5T-YjWZV,iFcKpd_7Sd,1601310000000.0,1616860000000.0,2311,Simple Spectral Graph Convolution,"[""~Hao_Zhu2"", ""~Piotr_Koniusz1""]","[""Hao Zhu"", ""Piotr Koniusz""]","[""Graph Convolutional Network"", ""Oversmoothing""]","Graph Convolutional Networks (GCNs) are leading methods for learning graph representations. However, without specially designed architectures, the performance of GCNs degrades quickly with increased depth. As the aggregated neighborhood size and neural network depth are two completely orthogonal aspects of graph representation, several methods focus on summarizing the neighborhood by aggregating K-hop neighborhoods of nodes while using shallow neural networks. However, these methods still encounter oversmoothing, and suffer from high computation and storage costs. In this paper, we use a modified Markov Diffusion Kernel to derive a variant of GCN called Simple Spectral Graph Convolution (SSGC). Our spectral analysis shows that our simple spectral graph convolution used in SSGC is a trade-off of low- and high-pass filter bands which capture the global and local contexts of each node. We provide two theoretical claims which demonstrate that we can aggregate over a sequence of increasingly larger neighborhoods compared to competitors while limiting severe oversmoothing. Our experimental evaluations show that SSGC with a linear learner is competitive in text and node classification tasks. 
Moreover, SSGC is comparable to other state-of-the-art methods for node clustering and community prediction tasks.",/pdf/9015cbfb15f31fdf7835279414de3b27ef3b0c01.pdf,ICLR,2021,"A simple and efficient method for graph convolution based on the Markov Diffusion Kernel, which works well on different tasks under unsupervised, semi-supervised and supervised settings." +SkgbmyHFDS,SyljmH2_vH,1569440000000.0,1577170000000.0,1605,What Can Learned Intrinsic Rewards Capture?,"[""zeyu@umich.edu"", ""junhyuk@google.com"", ""mtthss@google.com"", ""zhongwen@google.com"", ""makro@google.com"", ""hado@google.com"", ""davidsilver@google.com"", ""baveja@google.com""]","[""Zeyu Zheng"", ""Junhyuk Oh"", ""Matteo Hessel"", ""Zhongwen Xu"", ""Manuel Kroiss"", ""Hado van Hasselt"", ""David Silver"", ""Satinder Singh""]","[""reinforcement learning"", ""deep reinforcement learning"", ""intrinsic motivation""]","Reinforcement learning agents can include different components, such as policies, value functions, state representations, and environment models. Any or all of these can be the loci of knowledge, i.e., structures where knowledge, whether given or learned, can be deposited and reused. Regardless of its composition, the objective of an agent is to behave so as to maximise the sum of suitable scalar functions of state: the rewards. As far as the learning algorithm is concerned, these rewards are typically given and immutable. In this paper we instead consider the proposition that the reward function itself may be a good locus of knowledge. This is consistent with a common use, in the literature, of hand-designed intrinsic rewards to improve the learning dynamics of an agent. We adopt a multi-lifetime setting of the Optimal Rewards Framework, and investigate how meta-learning can be used to find good reward functions in a data-driven way. 
To this end, we propose to meta-learn an intrinsic reward function that allows agents to maximise their extrinsic rewards accumulated until the end of their lifetimes. This long-term lifetime objective allows our learned intrinsic reward to generate systematic multi-episode exploratory behaviour. Through proof-of-concept experiments, we elucidate interesting forms of knowledge that may be captured by a suitably trained intrinsic reward such as the usefulness of exploring uncertain states and rewards.",/pdf/f8670795090b0c2a16931ab16fed50e81f5b29e1.pdf,ICLR,2020, +HkxJHlrFvr,Syg9RDeFwr,1569440000000.0,1577170000000.0,2270,Angular Visual Hardness,"[""beidi.chen@rice.edu"", ""wyliu@gatech.edu"", ""garg@cs.stanford.edu"", ""zhidingy@nvidia.com"", ""anshumali@rice.edu"", ""jkautz@nvidia.com"", ""anima@caltech.edu""]","[""Beidi Chen"", ""Weiyang Liu"", ""Animesh Garg"", ""Zhiding Yu"", ""Anshumali Shrivastava"", ""Jan Kautz"", ""Anima Anandkumar""]","[""angular similarity"", ""self-training"", ""hard samples mining""]","The mechanisms behind human visual systems and convolutional neural networks (CNNs) are vastly different. Hence, it is expected that they have different notions of ambiguity or hardness. In this paper, we make a surprising discovery: there exists a (nearly) universal score function for CNNs whose correlation with human visual hardness is statistically significant. We term this function as angular visual hardness (AVH) and in a CNN, it is given by the normalized angular distance between a feature embedding and the classifier weights of the corresponding target category. We conduct an in-depth scientific study. We observe that CNN models with the highest accuracy also have the best AVH scores. This agrees with an earlier finding that state-of-art models tend to improve on classification of harder training examples. We find that AVH displays interesting dynamics during training: it quickly reaches a plateau even though the training loss keeps improving. 
This suggests the need for designing better loss functions that can target harder examples more effectively. Finally, we empirically show significant improvement in performance by using AVH as a measure of hardness in self-training tasks. + ",/pdf/2c27b165b76cb86aa00accf3d09d2510e6be5a5d.pdf,ICLR,2020,A novel measure in CNN based on angular similarity that is shown to correlate strongly with human visual hardness with gains in applications such as self-training. +x2ywTOFM4xt,pcUIt-_p5Ip,1601310000000.0,1614990000000.0,2632,Variational saliency maps for explaining model's behavior,"[""~Jae_Myung_Kim1"", ""~Eunji_Kim2"", ""~Seokhyeon_Ha1"", ""~Sungroh_Yoon1"", ""~Jungwoo_Lee1""]","[""Jae Myung Kim"", ""Eunji Kim"", ""Seokhyeon Ha"", ""Sungroh Yoon"", ""Jungwoo Lee""]","[""Interpretability"", ""XAI"", ""Variational Inference""]","Saliency maps have been widely used to explain the behavior of an image classifier. We introduce a new interpretability method which considers a saliency map as a random variable and aims to calculate the posterior distribution over the saliency map. The likelihood function is designed to measure the distance between the classifier's predictive probability of an image and that of locally perturbed image. For the prior distribution, we make attributions of adjacent pixels have a positive correlation. We use a variational approximation, and show that the approximate posterior is effective in explaining the classifier's behavior. It also has benefits of providing uncertainty over the explanation, giving auxiliary information to experts on how much the explanation is trustworthy.",/pdf/b7e3002d2e2bb73d25420a5b273dc5d2e766383b.pdf,ICLR,2021,Posterior distribution of a saliency map that explains model's behavior and provides uncertainty over the explanation. 
+Hyl7ygStwB,BkxntY1tPH,1569440000000.0,1583910000000.0,2055,Incorporating BERT into Neural Machine Translation,"[""teslazhu@mail.ustc.edu.cn"", ""yingce.xia@gmail.com"", ""wulijun3@mail2.sysu.edu.cn"", ""di_he@pku.edu.cn"", ""taoqin@microsoft.com"", ""zhwg@ustc.edu.cn"", ""lihq@ustc.edu.cn"", ""tyliu@microsoft.com""]","[""Jinhua Zhu"", ""Yingce Xia"", ""Lijun Wu"", ""Di He"", ""Tao Qin"", ""Wengang Zhou"", ""Houqiang Li"", ""Tieyan Liu""]","[""BERT"", ""Neural Machine Translation""]","The recently proposed BERT (Devlin et al., 2019) has shown great power on a variety of natural language understanding tasks, such as text classification, reading comprehension, etc. However, how to effectively apply BERT to neural machine translation (NMT) lacks enough exploration. While BERT is more commonly used as fine-tuning instead of contextual embedding for downstream language understanding tasks, in NMT, our preliminary exploration of using BERT as contextual embedding is better than using for fine-tuning. This motivates us to think how to better leverage BERT for NMT along this direction. We propose a new algorithm named BERT-fused model, in which we first use BERT to extract representations for an input sequence, and then the representations are fused with each layer of the encoder and decoder of the NMT model through attention mechanisms. We conduct experiments on supervised (including sentence-level and document-level translations), semi-supervised and unsupervised machine translation, and achieve state-of-the-art results on seven benchmark datasets. 
Our code is available at https://github.com/bert-nmt/bert-nmt",/pdf/d131542710841297d9e981e433c86120b31486bd.pdf,ICLR,2020, +HkGTwjCctm,rygoMj_5Km,1538090000000.0,1545360000000.0,306,Pyramid Recurrent Neural Networks for Multi-Scale Change-Point Detection,"[""shina.ebiz@gmail.com"", ""mzheng3@stevens.edu"", ""fkarakas@stevens.edu"", ""samantha.kleinberg@stevens.edu""]","[""Zahra Ebrahimzadeh"", ""Min Zheng"", ""Selcuk Karakas"", ""Samantha Kleinberg""]","[""changepoint detection"", ""multivariate time series data"", ""multiscale RNN""]","Many real-world time series, such as in activity recognition, finance, or climate science, have changepoints where the system's structure or parameters change. Detecting changes is important as they may indicate critical events. However, existing methods for changepoint detection face challenges when (1) the patterns of change cannot be modeled using simple and predefined metrics, and (2) changes can occur gradually, at multiple time-scales. To address this, we show how changepoint detection can be treated as a supervised learning problem, and propose a new deep neural network architecture that can efficiently identify both abrupt and gradual changes at multiple scales. Our proposed method, pyramid recurrent neural network (PRNN), is designed to be scale-invariant, by incorporating wavelets and pyramid analysis techniques from multi-scale signal processing. Through experiments on synthetic and real-world datasets, we show that PRNN can detect abrupt and gradual changes with higher accuracy than the state of the art and can extrapolate to detect changepoints at novel timescales that have not been seen in training.",/pdf/a03796b78f6014d788a1b34269bbb7cf6196e243.pdf,ICLR,2019,We introduce a scale-invariant neural network architecture for changepoint detection in multivariate time series. 
+SyepHTNFDS,BJxEiVPwDH,1569440000000.0,1577170000000.0,541,Graph Residual Flow for Molecular Graph Generation,"[""26x.orc.ed5.1hs@gmail.com"", ""akita714@preferred.jp"", ""k.ishiguro.jp@ieee.org"", ""nakanishi@preferred.jp"", ""oono@preferred.jp""]","[""Shion Honda"", ""Hirotaka Akita"", ""Katsuhiko Ishiguro"", ""Toshiki Nakanishi"", ""Kenta Oono""]","[""deep generative model"", ""normalizing flow"", ""graph generation"", ""cheminformatics""]","Statistical generative models for molecular graphs attract attention from many researchers from the fields of bio- and chemo-informatics. Among these models, invertible flow-based approaches are not fully explored yet. In this paper, we propose a powerful invertible flow for molecular graphs, called Graph Residual Flow (GRF). The GRF is based on residual flows, which are known for more flexible and complex non-linear mappings than traditional coupling flows. We theoretically derive non-trivial conditions such that GRF is invertible, and present a way of keeping the entire flows invertible throughout the training and sampling. Experimental results show that a generative model based on the proposed GRF achieves comparable generation performance, with a much smaller number of trainable parameters compared to the existing flow-based model. ",/pdf/2b4c83b9ad032d681365303fa6e8cb8a1ae8c95a.pdf,ICLR,2020,"We propose a residual flow model for molecular graphs, derive the conditions so that the flow is invertible, and show its efficacy in experiments." +ry4Vrt5gl,,1478300000000.0,1488610000000.0,483,Learning to Optimize,"[""ke.li@eecs.berkeley.edu"", ""malik@eecs.berkeley.edu""]","[""Ke Li"", ""Jitendra Malik""]","[""Reinforcement Learning"", ""Optimization""]","Algorithm design is a laborious process and often requires many iterations of ideation and validation. In this paper, we explore automating algorithm design and present a method to learn an optimization algorithm. 
We approach this problem from a reinforcement learning perspective and represent any particular optimization algorithm as a policy. We learn an optimization algorithm using guided policy search and demonstrate that the resulting algorithm outperforms existing hand-engineered algorithms in terms of convergence speed and/or the final objective value. ",/pdf/90fed8c631fb9b66174870c7af1ccc8a49cf80c9.pdf,ICLR,2017,We explore learning an optimization algorithm automatically. +MtEE0CktZht,MRzuiuLF8e1,1601310000000.0,1613530000000.0,990,Rank the Episodes: A Simple Approach for Exploration in Procedurally-Generated Environments,"[""~Daochen_Zha1"", ""mawenye@gmail.com"", ""~Lei_Yuan1"", ""~Xia_Hu4"", ""~Ji_Liu1""]","[""Daochen Zha"", ""Wenye Ma"", ""Lei Yuan"", ""Xia Hu"", ""Ji Liu""]","[""Reinforcement Learning"", ""Exploration"", ""Generalization of Reinforcement Learning"", ""Self-Imitation""]","Exploration under sparse reward is a long-standing challenge of model-free reinforcement learning. The state-of-the-art methods address this challenge by introducing intrinsic rewards to encourage exploration in novel states or uncertain environment dynamics. Unfortunately, methods based on intrinsic rewards often fall short in procedurally-generated environments, where a different environment is generated in each episode so that the agent is not likely to visit the same state more than once. Motivated by how humans distinguish good exploration behaviors by looking into the entire episode, we introduce RAPID, a simple yet effective episode-level exploration method for procedurally-generated environments. RAPID regards each episode as a whole and gives an episodic exploration score from both per-episode and long-term views. Those highly scored episodes are treated as good exploration behaviors and are stored in a small ranking buffer. The agent then imitates the episodes in the buffer to reproduce the past good exploration behaviors. 
We demonstrate our method on several procedurally-generated MiniGrid environments, a first-person-view 3D Maze navigation task from MiniWorld, and several sparse MuJoCo tasks. The results show that RAPID significantly outperforms the state-of-the-art intrinsic reward strategies in terms of sample efficiency and final performance. The code is available at https://github.com/daochenzha/rapid",/pdf/1a3486946250b679d39b93e6b12ff912023a3955.pdf,ICLR,2021,Encouraging exploration via ranking the past episodes and reproducing past good exploration behaviors with imitation learning. +vC8hNRk9dOR,P9apC5r2og4,1601310000000.0,1614990000000.0,856,Evaluating Online Continual Learning with CALM,"[""~Germ\u00e1n_Kruszewski1"", ""~Ionut_Teodor_Sorodoc1"", ""~Tomas_Mikolov1""]","[""Germ\u00e1n Kruszewski"", ""Ionut Teodor Sorodoc"", ""Tomas Mikolov""]","[""online continual learning"", ""catastrophic forgetting"", ""benchmark"", ""language modelling""]","Online Continual Learning (OCL) studies learning over a continuous data stream without observing any single example more than once, a setting that is closer to the experience of humans and systems that must learn “on-the-wild”. Yet, commonly available benchmarks are far from these real world conditions, because they explicitly signal different tasks, lack latent similarity structure or assume temporal independence between different examples. Here, we propose a new benchmark for OCL based on language modelling in which input alternates between different languages and domains without any explicit delimitation. Additionally, we propose new metrics to study catastrophic forgetting in this setting and evaluate multiple baseline models based on compositions of experts. 
Finally, we introduce a simple gating technique that learns the latent similarities between different inputs, improving the performance of a Products of Experts model.",/pdf/306a24d78b5bb9ff28485312af1080724879c7a5.pdf,ICLR,2021,"We introduce a benchmark for Online Continual Learning based on language modelling, evaluating multiple baselines and improving one of them." +a-_HfiIow3m,IhD_5FjshRQ,1601310000000.0,1614990000000.0,3162,Multimodal Variational Autoencoders for Semi-Supervised Learning: In Defense of Product-of-Experts,"[""svegal@biosustain.dtu.dk"", ""~Oswin_Krause1"", ""domccl@biosustain.dtu.dk"", ""~Mads_Nielsen2"", ""~Christian_Igel1""]","[""Svetlana Kutuzova"", ""Oswin Krause"", ""Douglas McCloskey"", ""Mads Nielsen"", ""Christian Igel""]","[""variational autoencoder"", ""multimodal data"", ""product-of-experts"", ""semi-supervised learning""]","Multimodal generative models should be able to learn a meaningful latent representation that enables a coherent joint generation of all modalities (e.g., images and text). Many applications also require the ability to accurately sample modalities conditioned on observations of a subset of the modalities. Often not all modalities may be observed for all training data points, so semi-supervised learning should be possible. +In this study, we evaluate a family of product-of-experts (PoE) based variational autoencoders that have these desired properties. We include a novel PoE based architecture and training procedure. An empirical evaluation shows that the PoE based models can outperform an additive mixture-of-experts (MoE) approach. +Our experiments support the intuition that PoE models are more suited for a conjunctive combination of modalities while MoEs are more suited for a disjunctive fusion. 
",/pdf/f5df283b136768f85d39a9bc755da5c9798f05f3.pdf,ICLR,2021,Product-of-experts based variational autoencoders work well for generative modelling of multiple high-dimensional modalities +76M3pxkqRl,RD2GrQ4tukw,1601310000000.0,1614990000000.0,1834,Status-Quo Policy Gradient in Multi-agent Reinforcement Learning,"[""~Pinkesh_Badjatiya1"", ""~Mausoom_Sarkar1"", ""~Abhishek_Sinha1"", ""~Nikaash_Puri1"", ""~Jayakumar_Subramanian1"", ""siddharth9820@gmail.com"", ""~Balaji_Krishnamurthy1""]","[""Pinkesh Badjatiya"", ""Mausoom Sarkar"", ""Abhishek Sinha"", ""Nikaash Puri"", ""Jayakumar Subramanian"", ""Siddharth Singh"", ""Balaji Krishnamurthy""]","[""multi-agent rl"", ""reinforcement learning"", ""social dilemma"", ""policy gradient"", ""game theory""]","Individual rationality, which involves maximizing expected individual return, does not always lead to optimal individual or group outcomes in multi-agent problems. For instance, in social dilemma situations, Reinforcement Learning (RL) agents trained to maximize individual rewards converge to mutual defection that is individually and socially sub-optimal. In contrast, humans evolve individual and socially optimal strategies in such social dilemmas. Inspired by ideas from human psychology that attribute this behavior in humans to the status-quo bias, we present a status-quo loss (SQLoss) and the corresponding policy gradient algorithm that incorporates this bias in an RL agent. We demonstrate that agents trained with SQLoss evolve individually as well as socially optimal behavior in several social dilemma matrix games. To apply SQLoss to games where cooperation and defection are determined by a sequence of non-trivial actions, we present GameDistill, an algorithm that reduces a multi-step game with visual input to a matrix game. We empirically show how agents trained with SQLoss on a GameDistill reduced version of the Coin Game evolve optimal policies. 
",/pdf/14674eece944a800c153009debd39db92a1a5c39.pdf,ICLR,2021, +Y87Ri-GNHYu,BHJvI8w9Yv,1601310000000.0,1615060000000.0,1692,Ask Your Humans: Using Human Instructions to Improve Generalization in Reinforcement Learning,"[""~Valerie_Chen2"", ""~Abhinav_Gupta1"", ""~Kenneth_Marino1""]","[""Valerie Chen"", ""Abhinav Gupta"", ""Kenneth Marino""]",[],"Complex, multi-task problems have proven to be difficult to solve efficiently in a sparse-reward reinforcement learning setting. In order to be sample efficient, multi-task learning requires reuse and sharing of low-level policies. To facilitate the automatic decomposition of hierarchical tasks, we propose the use of step-by-step human demonstrations in the form of natural language instructions and action trajectories. We introduce a dataset of such demonstrations in a crafting-based grid world. Our model consists of a high-level language generator and low-level policy, conditioned on language. We find that human demonstrations help solve the most complex tasks. We also find that incorporating natural language allows the model to generalize to unseen tasks in a zero-shot setting and to learn quickly from a few demonstrations. Generalization is not only reflected in the actions of the agent, but also in the generated natural language instructions in unseen tasks. 
Our approach also gives our trained agent interpretable behaviors because it is able to generate a sequence of high-level descriptions of its actions.",/pdf/be28b5c6ceded7efd434a4a312e5019c4cb5480f.pdf,ICLR,2021, +O9bnihsFfXU,2ZbntTyGpWc,1601310000000.0,1616060000000.0,913,Implicit Under-Parameterization Inhibits Data-Efficient Deep Reinforcement Learning,"[""~Aviral_Kumar2"", ""~Rishabh_Agarwal2"", ""~Dibya_Ghosh1"", ""~Sergey_Levine1""]","[""Aviral Kumar"", ""Rishabh Agarwal"", ""Dibya Ghosh"", ""Sergey Levine""]","[""deep Q-learning"", ""data-efficient RL"", ""rank-collapse"", ""offline RL""]","We identify an implicit under-parameterization phenomenon in value-based deep RL methods that use bootstrapping: when value functions, approximated using deep neural networks, are trained with gradient descent using iterated regression onto target values generated by previous instances of the value network, more gradient updates decrease the expressivity of the current value network. We characterize this loss of expressivity via a drop in the rank of the learned value network features, and show that this typically corresponds to a performance drop. We demonstrate this phenomenon on Atari and Gym benchmarks, in both offline and online RL settings. We formally analyze this phenomenon and show that it results from a pathological interaction between bootstrapping and gradient-based optimization. We further show that mitigating implicit under-parameterization by controlling rank collapse can improve performance.",/pdf/766d703650bcb86c35c53681b89ad0a2aba30504.pdf,ICLR,2021,Identifies and studies feature matrix rank collapse (i.e. implicit regularization) in deep Q-learning methods. 
+BJxvH1BtDS,BklnMfa_Pr,1569440000000.0,1577170000000.0,1694,Three-Head Neural Network Architecture for AlphaZero Learning,"[""cgao3@ualberta.ca"", ""mmueller@ualberta.ca"", ""hayward@ualberta.ca"", ""hengshuai.yao@huawei.com"", ""jui.shangling@huawei.com""]","[""Chao Gao"", ""Martin Mueller"", ""Ryan Hayward"", ""Hengshuai Yao"", ""Shangling Jui""]","[""alphazero"", ""reinforcement learning"", ""two-player games"", ""heuristic search"", ""deep neural networks""]","The search-based reinforcement learning algorithm AlphaZero has been used as a general method for +mastering two-player games Go, chess and Shogi. One crucial ingredient in AlphaZero (and its predecessor AlphaGo Zero) is the two-head network architecture that outputs two estimates --- policy and value --- for one input game state. The merit of such an architecture is that letting policy and value learning share the same representation substantially improved generalization of the neural net. +A three-head network architecture has been recently proposed that can learn a third action-value head on a fixed dataset the same as for two-head net. Also, using the action-value head in Monte Carlo tree search (MCTS) improved the search efficiency. +However, effectiveness of the three-head network has not been investigated in an AlphaZero style learning paradigm. +In this paper, using the game of Hex as a test domain, we conduct an empirical study of the three-head network architecture in AlphaZero learning. We show that the architecture is also advantageous at the zero-style iterative learning. 
Specifically, we find that three-head network can induce the following benefits: (1) learning can become faster as search takes advantage of the additional action-value head; (2) better prediction results than two-head architecture can be achieved when using additional action-value learning as an auxiliary task.",/pdf/ef877234e6e9c78027a8a30a408d4fa651a31370.pdf,ICLR,2020,An empirical study of three-head architecture for AlphaZero learning +SJqaCVLxx,,1478020000000.0,1486310000000.0,30,New Learning Approach By Genetic Algorithm In A Convolutional Neural Network For Pattern Recognition,"[""Alimehrolhassani@yahoo.com"", ""Mohammadi@uk.ac.ir""]","[""Mohammad Ali Mehrolhassani"", ""Majid Mohammadi""]","[""Deep learning"", ""Supervised Learning"", ""Optimization"", ""Computer vision""]","Almost all of the presented articles in the CNN are based on the error backpropagation algorithm and calculation of derivations of error, our innovative proposal refers to engaging TICA filters and NSGA-II genetic algorithms to train the LeNet-5 CNN network. Consequently, genetic algorithm updates the weights of LeNet-5 CNN network similar to chromosome update. In our approach the weights of LeNet-5 are obtained in two stages. The first is pre-training and the second is fine-tuning. As a result, our approach impacts in learning task.",/pdf/a5e7a46286d01abf776909ee187b11ede1537df4.pdf,ICLR,2017,Implement new approach without exerting backpropagation in learning of CNN is useful for parallel processing Like GPU. 
+TJzkxFw-mGm,e88g5MxZqJ3,1601310000000.0,1614990000000.0,1971,Near-Optimal Regret Bounds for Model-Free RL in Non-Stationary Episodic MDPs,"[""~Weichao_Mao1"", ""~Kaiqing_Zhang3"", ""~Ruihao_Zhu2"", ""dslevi@mit.edu"", ""~Tamer_Basar1""]","[""Weichao Mao"", ""Kaiqing Zhang"", ""Ruihao Zhu"", ""David Simchi-Levi"", ""Tamer Basar""]","[""reinforcement learning"", ""non-stationary environment"", ""model-free approach"", ""regret analysis""]","We consider model-free reinforcement learning (RL) in non-stationary Markov decision processes (MDPs). Both the reward functions and the state transition distributions are allowed to vary over time, either gradually or abruptly, as long as their cumulative variation magnitude does not exceed certain budgets. We propose an algorithm, named Restarted Q-Learning with Upper Confidence Bounds (RestartQ-UCB), for this setting, which adopts a simple restarting strategy and an extra optimism term. Our algorithm outperforms the state-of-the-art (model-based) solution in terms of dynamic regret. Specifically, RestartQ-UCB with Freedman-type bonus terms achieves a dynamic regret of $\widetilde{O}(S^{\frac{1}{3}} A^{\frac{1}{3}} \Delta^{\frac{1}{3}} H T^{\frac{2}{3}})$, where $S$ and $A$ are the numbers of states and actions, respectively, $\Delta>0$ is the variation budget, $H$ is the number of steps per episode, and $T$ is the total number of steps. We further show that our algorithm is near-optimal by establishing an information-theoretical lower bound of $\Omega(S^{\frac{1}{3}} A^{\frac{1}{3}} \Delta^{\frac{1}{3}} H^{\frac{2}{3}} T^{\frac{2}{3}})$, which to the best of our knowledge is the first impossibility result in non-stationary RL in general.",/pdf/e39dbc7f5c83766085afe90a6149d7f47ed35e78.pdf,ICLR,2021,We present a model-free reinforcement learning algorithm that achieves near-optimal dynamic regret in non-stationary episodic MDPs. 
+ijJZbomCJIm,2Qa_Dg__3b1,1601310000000.0,1616050000000.0,2366,Adversarially-Trained Deep Nets Transfer Better: Illustration on Image Classification,"[""~Francisco_Utrera1"", ""kravitz@berkeley.edu"", ""~N._Benjamin_Erichson1"", ""~Rajiv_Khanna1"", ""~Michael_W._Mahoney1""]","[""Francisco Utrera"", ""Evan Kravitz"", ""N. Benjamin Erichson"", ""Rajiv Khanna"", ""Michael W. Mahoney""]","[""transfer learning"", ""adversarial training"", ""influence functions"", ""limited data""]","Transfer learning has emerged as a powerful methodology for adapting pre-trained deep neural networks on image recognition tasks to new domains. This process consists of taking a neural network pre-trained on a large feature-rich source dataset, freezing the early layers that encode essential generic image properties, and then fine-tuning the last few layers in order to capture specific information related to the target situation. This approach is particularly useful when only limited or weakly labeled data are available for the new task. In this work, we demonstrate that adversarially-trained models transfer better than non-adversarially-trained models, especially if only limited data are available for the new domain task. Further, we observe that adversarial training biases the learnt representations to retaining shapes, as opposed to textures, which impacts the transferability of the source models. Finally, through the lens of influence functions, we discover that transferred adversarially-trained models contain more human-identifiable semantic information, which explains -- at least partly -- why adversarially-trained models transfer better.",/pdf/566e8902b7a749d4525ff5f0933ffdae3a9bec39.pdf,ICLR,2021,"We demonstrate that adversarially-trained models transfer better to new domains than naturally-trained models, especially when only limited training data is available in the target domain. 
" +le9LIliDOG,6v5kNgUMunv,1601310000000.0,1614990000000.0,1864,Efficient Long-Range Convolutions for Point Clouds,"[""pyf04142017@sjtu.edu.cn"", ""~Lin_Lin1"", ""~Lexing_Ying1"", ""~Leonardo_Zepeda-Nunez1""]","[""Yifan Peng"", ""Lin Lin"", ""Lexing Ying"", ""Leonardo Zepeda-Nunez""]","[""global convolution"", ""point cloud"", ""graph-cnn"", ""NUFFT""]","The efficient treatment of long-range interactions for point clouds is a challenging problem in many scientific machine learning applications. To extract global information, one usually needs a large window size, a large number of layers, and/or a large number of channels. This can often significantly increase the computational cost. In this work, we present a novel neural network layer that directly incorporates long-range information for a point cloud. This layer, dubbed the long-range convolutional (LRC)-layer, leverages the convolutional theorem coupled with the non-uniform Fourier transform. In a nutshell, the LRC-layer mollifies the point cloud to an adequately sized regular grid, computes its Fourier transform, multiplies the result by a set of trainable Fourier multipliers, computes the inverse Fourier transform, and finally interpolates the result back to the point cloud. The resulting global all-to-all convolution operation can be performed in nearly-linear time asymptotically with respect to the number of input points. The LRC-layer is a particularly powerful tool when combined with local convolution as together they offer efficient and seamless treatment of both short and long range interactions. We showcase this framework by introducing a neural network architecture that combines LRC-layers with short-range convolutional layers to accurately learn the energy and force associated with a $N$-body potential. 
We also exploit the induced two-level decomposition and propose an efficient strategy to train the combined architecture with a reduced number of samples.",/pdf/1c42bec14a38188596aea6963295adbdf04cf191.pdf,ICLR,2021, +S1Y7OOlRZ,HyYQO_gCZ,1509100000000.0,1518730000000.0,312,Massively Parallel Hyperparameter Tuning,"[""lishal@cs.ucla.edu"", ""jamieson@cs.washington.edu"", ""rostami@google.com"", ""kgonina@google.com"", ""hardt@berkeley.edu"", ""brecht@berkeley.edu"", ""talwalkar@cmu.edu""]","[""Lisha Li"", ""Kevin Jamieson"", ""Afshin Rostamizadeh"", ""Katya Gonina"", ""Moritz Hardt"", ""Benjamin Recht"", ""Ameet Talwalkar""]","[""parallel hyperparameter tuning"", ""deep learning""]","Modern machine learning models are characterized by large hyperparameter search spaces and prohibitively expensive training costs. For such models, we cannot afford to train candidate models sequentially and wait months before finding a suitable hyperparameter configuration. Hence, we introduce the large-scale regime for parallel hyperparameter tuning, where we need to evaluate orders of magnitude more configurations than available parallel workers in a small multiple of the wall-clock time needed to train a single model. We propose a novel hyperparameter tuning algorithm for this setting that exploits both parallelism and aggressive early-stopping techniques, building on the insights of the Hyperband algorithm. Finally, we conduct a thorough empirical study of our algorithm on several benchmarks, including large-scale experiments with up to 500 workers. 
Our results show that our proposed algorithm finds good hyperparameter settings nearly an order of magnitude faster than random search.",/pdf/ffd02a66c70df617db5d040918f43af54397fa6d.pdf,ICLR,2018, +BJvVbCJCb,HkIEbC1CZ,1509050000000.0,1518730000000.0,186,Neural Clustering By Predicting And Copying Noise,"[""sam@digitalgenius.com"", ""andrej@digitalgenius.com"", ""yoram@digitalgenius.com""]","[""Sam Coope"", ""Andrej Zukov-Gregoric"", ""Yoram Bachrach""]","[""unsupervised learning"", ""clustering"", ""deep learning""]","We propose a neural clustering model that jointly learns both latent features and how they cluster. Unlike similar methods, our model does not require a predefined number of clusters. Using a supervised approach, we agglomerate latent features towards randomly sampled targets within the same space whilst progressively removing the targets until we are left with only targets which represent cluster centroids. To show the behavior of our model across different modalities, we apply our model to both text and image data and obtain very competitive results on MNIST. 
Finally, we also provide results against baseline models for fashion-MNIST, the 20 newsgroups dataset, and a Twitter dataset we ourselves create.",/pdf/96bf1d3ae154e36a7d0b7d06aba5ba49093a4459.pdf,ICLR,2018,Neural clustering without needing a number of clusters +Pd_oMxH8IlF,3ukXfSBr8L6,1601310000000.0,1616020000000.0,2705,Iterated learning for emergent systematicity in VQA,"[""~Ankit_Vani1"", ""~Max_Schwarzer1"", ""~Yuchen_Lu1"", ""eeshandhekane@gmail.com"", ""~Aaron_Courville3""]","[""Ankit Vani"", ""Max Schwarzer"", ""Yuchen Lu"", ""Eeshan Dhekane"", ""Aaron Courville""]","[""iterated learning"", ""cultural transmission"", ""neural module network"", ""clevr"", ""shapes"", ""vqa"", ""visual question answering"", ""systematic generalization"", ""compositionality""]","Although neural module networks have an architectural bias towards compositionality, they require gold standard layouts to generalize systematically in practice. When instead learning layouts and modules jointly, compositionality does not arise automatically and an explicit pressure is necessary for the emergence of layouts exhibiting the right structure. We propose to address this problem using iterated learning, a cognitive science theory of the emergence of compositional languages in nature that has primarily been applied to simple referential games in machine learning. Considering the layouts of module networks as samples from an emergent language, we use iterated learning to encourage the development of structure within this language. We show that the resulting layouts support systematic generalization in neural agents solving the more complex task of visual question-answering. Our regularized iterated learning method can outperform baselines without iterated learning on SHAPES-SyGeT (SHAPES Systematic Generalization Test), a new split of the SHAPES dataset we introduce to evaluate systematic generalization, and on CLOSURE, an extension of CLEVR also designed to test systematic generalization. 
We demonstrate superior performance in recovering ground-truth compositional program structure with limited supervision on both SHAPES-SyGeT and CLEVR.",/pdf/62bee9dfb73bae4271c7f80e9d64eda7effacc43.pdf,ICLR,2021,We use iterated learning to encourage the emergence of structure in the generated programs for neural module networks. +oY7La6DBTLx,nCoDbbEgVwL,1601310000000.0,1614990000000.0,1619,One-class Classification Robust to Geometric Transformation,"[""~Hyunjun_Ju1"", ""~Dongha_Lee1"", ""seongku@postech.ac.kr"", ""~Hwanjo_Yu1""]","[""Hyunjun Ju"", ""Dongha Lee"", ""SeongKu Kang"", ""Hwanjo Yu""]","[""one-class classification"", ""image classification"", ""object classification"", ""self-supervised learning"", ""geometric robustness""]","Recent studies on one-class classification have achieved a remarkable performance, by employing the self-supervised classifier that predicts the geometric transformation applied to in-class images. However, they cannot identify in-class images at all when the input images are geometrically-transformed (e.g., rotated images), because their classification-based in-class scores assume that input images always have a fixed viewpoint, as similar to the images used for training. Pointing out that humans can easily recognize such transformed images as the same class, in this work, we aim to propose a one-class classifier robust to geometrically-transformed inputs, named as GROC. To this end, we introduce a conformity score which indicates how strongly an input image agrees with one of the predefined in-class transformations, then utilize the conformity score with our proposed agreement measures for one-class classification. 
Our extensive experiments demonstrate that GROC is able to accurately distinguish in-class images from out-of-class images regardless of whether the inputs are geometrically-transformed or not, whereas the existing methods fail.",/pdf/cf6f100decac323d32df3617f67dfc5285c41830.pdf,ICLR,2021,"This paper proposes a one-class classification method robust to geometric transformations, which effectively addresses the challenge that in-class images cannot be correctly distinguished from out-of-class images when they have various viewpoints. " +1yXhko8GZEE,u_ooknnRWC,1601310000000.0,1614990000000.0,2649,Precondition Layer and Its Use for GANs,"[""~Tiantian_Fang1"", ""~Alex_Schwing1"", ""~Ruoyu_Sun1""]","[""Tiantian Fang"", ""Alex Schwing"", ""Ruoyu Sun""]","[""GAN"", ""Preconditioning"", ""Condition Number""]","One of the major challenges when training generative adversarial nets (GANs) is instability. To address this instability spectral normalization (SN) is remarkably successful. However, SN-GAN still suffers from training instabilities, especially when working with higher-dimensional data. We find that those instabilities are accompanied by large condition numbers of the discriminator weight matrices. To improve training stability we study common linear-algebra practice and employ preconditioning. Specifically, we introduce a preconditioning layer (PC-layer) that performs a low-degree polynomial preconditioning. We use this PC-layer in two ways: 1) fixed preconditioning (FPC) adds a fixed PC-layer to all layers, and 2) adaptive preconditioning (APC) adaptively controls the strength of preconditioning. Empirically, we show that FPC and APC stabilize the training of unconditional GANs using classical architectures. 
On LSUN256×256 data, APC improves FID scores by around 5 points over baselines.",/pdf/0e9755432ffcad336eaaa8e6f2618c6f952aa5c8.pdf,ICLR,2021,"We introduce a preconditioning layer (PC-layer) that performs a low-degree polynomial preconditioning, and show that it improves the performance of GANs." +Hygq3JrtwS,SJxrSzkKDH,1569440000000.0,1577170000000.0,1961,On the Reflection of Sensitivity in the Generalization Error,"[""mahsa.forouzesh@epfl.ch"", ""farnood.salehi@epfl.ch"", ""patrick.thiran@epfl.ch""]","[""Mahsa Forouzesh"", ""Farnood Salehi"", ""Patrick Thiran""]","[""Generalization Error"", ""Sensitivity Analysis"", ""Deep Neural Networks"", ""Bias-variance Decomposition""]","Even though recent works have brought some insight into the performance improvement of techniques used in state-of-the-art deep-learning models, more work is needed to understand the generalization properties of over-parameterized deep neural networks. We shed light on this matter by linking the loss function to the output’s sensitivity to its input. We find a rather strong empirical relation between the output sensitivity and the variance in the bias-variance decomposition of the loss function, which hints on using sensitivity as a metric for comparing generalization performance of networks, without requiring labeled data. We find that sensitivity is decreased by applying popular methods which improve the generalization performance of the model, such as (1) using a deep network rather than a wide one, (2) adding convolutional layers to baseline classifiers instead of adding fully connected layers, (3) using batch normalization, dropout and max-pooling, and (4) applying parameter initialization techniques.",/pdf/3756a3956de9343f94bc15fbe5467a747f53d38e.pdf,ICLR,2020,We study the relation between the generalization error and the sensitivity of the output to random input perturbations in deep neural networks. 
+M3NDrHEGyyO,xeDy6ollvtF9,1601310000000.0,1614990000000.0,1381,Accelerating Safe Reinforcement Learning with Constraint-mismatched Policies,"[""~Tsung-Yen_Yang2"", ""justinian.rosca@siemens.com"", ""~Karthik_R_Narasimhan1"", ""~Peter_Ramadge1""]","[""Tsung-Yen Yang"", ""Justinian Rosca"", ""Karthik R Narasimhan"", ""Peter Ramadge""]","[""Reinforcement learning with constraints"", ""Safe reinforcement learning""]","We consider the problem of reinforcement learning when provided with (1) a baseline control policy and (2) a set of constraints that the controlled system must satisfy. The baseline policy can arise from a teacher agent, demonstration data or even a heuristic while the constraints might encode safety, fairness or other application-specific requirements. Importantly, the baseline policy may be sub-optimal for the task at hand, and is not guaranteed to satisfy the specified constraints. The key challenge therefore lies in effectively leveraging the baseline policy for faster learning, while still ensuring that the constraints are minimally violated. To reconcile these potentially competing aspects, we propose an iterative policy optimization algorithm that alternates between maximizing expected return on the task, minimizing distance to the baseline policy, and projecting the policy onto the constraint-satisfying set. We analyze the convergence of our algorithm theoretically and provide a finite-sample guarantee. In our empirical experiments on five different control tasks, our algorithm consistently outperforms several state-of-the-art methods, achieving 10 times fewer constraint violations and 40% higher reward on average.",/pdf/dd1dfedf50b0b8fe4ca55d12f0adaae3775bcd1f.pdf,ICLR,2021,"We propose a new algorithm that learns constraint-satisfying policies with constraint-mismatched baseline policies, and provide theoretical analysis and empirical demonstration in the context of reinforcement learning with constraints." 
+BklVA2NYvH,B1eJO2KEwH,1569440000000.0,1577170000000.0,260,Adversarially Robust Neural Networks via Optimal Control: Bridging Robustness with Lyapunov Stability,"[""zy-chen17@mails.tsinghua.edu.cn"", ""suhangss@mail.tsinghua.edu.cn""]","[""Zhiyang Chen"", ""Hang Su""]","[""adversarial defense"", ""optimal control"", ""Lyapunov stability""]","Deep neural networks are known to be vulnerable to adversarial perturbations. In this paper, we bridge adversarial robustness of neural nets with Lyapunov stability of dynamical systems. From this viewpoint, training neural nets is equivalent to finding an optimal control of the discrete dynamical system, which allows one to utilize methods of successive approximations, an optimal control algorithm based on Pontryagin's maximum principle, to train neural nets. This decoupled training method allows us to add constraints to the optimization, which makes the deep model more robust. The constrained optimization problem can be formulated as a semi-definite programming problem and hence can be solved efficiently. Experiments show that our method effectively improves deep model's adversarial robustness.",/pdf/9b40938c2b940454dd36242431ecd22baa208076.pdf,ICLR,2020,An adversarial defense method bridging robustness of deep neural nets with Lyapunov stability +SJMZRsC9Y7,r1e2Bha5Km,1538090000000.0,1545360000000.0,869,A NON-LINEAR THEORY FOR SENTENCE EMBEDDING,"[""hichem@imrsv.ai"", ""isar@imrsv.ai""]","[""Hichem Mezaoui"", ""Isar Nejadgholi""]","[""sentence embedding"", ""generative models""]","This paper revisits the Random Walk model for sentence embedding in the context of non-extensive statistics. We propose a non-extensive algebra to compute the discourse vector. We argue that by doing so we are taking into account high non-linearity in the semantic space. Furthermore, we show that by considering a non-extensive algebra, the compounding effect of the vector length is mitigated. 
Overall, we show that the proposed model leads to good sentence embedding. We evaluate the embedding method on textual similarity tasks.",/pdf/d296c11ba5cab54f2a3104a4dc52d6c2150f96bd.pdf,ICLR,2019, +5Y21V0RDBV,F56leSJDK1e,1601310000000.0,1619190000000.0,3332,Generalized Multimodal ELBO,"[""~Thomas_Marco_Sutter1"", ""~Imant_Daunhawer2"", ""~Julia_E_Vogt1""]","[""Thomas M. Sutter"", ""Imant Daunhawer"", ""Julia E Vogt""]","[""Multimodal"", ""VAE"", ""ELBO"", ""self-supervised"", ""generative learning""]","Multiple data types naturally co-occur when describing real-world phenomena and learning from them is a long-standing goal in machine learning research. However, existing self-supervised generative models approximating an ELBO are not able to fulfill all desired requirements of multimodal models: their posterior approximation functions lead to a trade-off between the semantic coherence and the ability to learn the joint data distribution. We propose a new, generalized ELBO formulation for multimodal data that overcomes these limitations. The new objective encompasses two previous methods as special cases and combines their benefits without compromises. In extensive experiments, we demonstrate the advantage of the proposed method compared to state-of-the-art models in self-supervised, generative learning tasks.",/pdf/2cfd5fea6a35d4586487da796743d75dacc7118c.pdf,ICLR,2021,We propose a generalized ELBO for modeling multiple data types in a scalable and self-supervised way. +ByOfBggRZ,ryPfrgeRZ,1509060000000.0,1519810000000.0,210,Detecting Statistical Interactions from Neural Network Weights,"[""tsangm@usc.edu"", ""dehuache@usc.edu"", ""yanliu.cs@usc.edu""]","[""Michael Tsang"", ""Dehua Cheng"", ""Yan Liu""]","[""statistical interaction detection"", ""multilayer perceptron"", ""generalized additive model""]","Interpreting neural networks is a crucial and challenging task in machine learning. 
In this paper, we develop a novel framework for detecting statistical interactions captured by a feedforward multilayer neural network by directly interpreting its learned weights. Depending on the desired interactions, our method can achieve significantly better or similar interaction detection performance compared to the state-of-the-art without searching an exponential solution space of possible interactions. We obtain this accuracy and efficiency by observing that interactions between input features are created by the non-additive effect of nonlinear activation functions, and that interacting paths are encoded in weight matrices. We demonstrate the performance of our method and the importance of discovered interactions via experimental results on both synthetic datasets and real-world application datasets. ",/pdf/41d30ec9e8e57144978fb09d9631a7f736248c42.pdf,ICLR,2018,We detect statistical interactions captured by a feedforward multilayer neural network by directly interpreting its learned weights. +HJ1JBJ5gl,,1478260000000.0,1484700000000.0,161,Representing inferential uncertainty in deep neural networks through sampling,"[""Patrick.McClure@mrc-cbu.cam.ac.uk"", ""Nikolaus.Kriegeskorte@mrc-cbu.cam.ac.uk""]","[""Patrick McClure"", ""Nikolaus Kriegeskorte""]","[""Deep learning"", ""Theory"", ""Applications""]","As deep neural networks (DNNs) are applied to increasingly challenging problems, they will need to be able to represent their own uncertainty. Modelling uncertainty is one of the key features of Bayesian methods. Bayesian DNNs that use dropout-based variational distributions and scale to complex tasks have recently been proposed. We evaluate Bayesian DNNs trained with Bernoulli or Gaussian multiplicative masking of either the units (dropout) or the weights (dropconnect). We compare these Bayesian DNNs ability to represent their uncertainty about their outputs through sampling during inference. 
We tested the calibration of these Bayesian fully connected and convolutional DNNs on two visual inference tasks (MNIST and CIFAR-10). By adding different levels of Gaussian noise to the test images, we assessed how these DNNs represented their uncertainty about regions of input space not covered by the training set. These Bayesian DNNs represented their own uncertainty more accurately than traditional DNNs with a softmax output. We find that sampling of weights, whether Gaussian or Bernoulli, led to more accurate representation of uncertainty compared to sampling of units. However, sampling units using either Gaussian or Bernoulli dropout led to increased convolutional neural network (CNN) classification accuracy. Based on these findings we use both Bernoulli dropout and Gaussian dropconnect concurrently, which approximates the use of a spike-and-slab variational distribution. We find that networks with spike-and-slab sampling combine the advantages of the other methods: they classify with high accuracy and robustly represent the uncertainty of their classifications for all tested architectures.",/pdf/1cfeed8ed6564fce035669688a6580c520ca4129.pdf,ICLR,2017,Dropout- and dropconnect-based Bayesian deep neural networks with sampling at inference better represent their own inferential uncertainty than traditional deep neural networks. +r1xMH1BtvB,Hkl9ReTdvS,1569440000000.0,1583910000000.0,1682,ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators,"[""kevclark@cs.stanford.edu"", ""thangluong@google.com"", ""qvl@google.com"", ""manning@cs.stanford.edu""]","[""Kevin Clark"", ""Minh-Thang Luong"", ""Quoc V. Le"", ""Christopher D. Manning""]","[""Natural Language Processing"", ""Representation Learning""]","Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. 
While they produce good results when transferred to downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a more sample-efficient pre-training task called replaced token detection. Instead of masking the input, our approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments demonstrate this new pre-training task is more efficient than MLM because the task is defined over all input tokens rather than just the small subset that was masked out. As a result, the contextual representations learned by our approach substantially outperform the ones learned by BERT given the same model size, data, and compute. The gains are particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained using 30x more compute) on the GLUE natural language understanding benchmark. Our approach also works well at scale, where it performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute and outperforms them when using the same amount of compute. +",/pdf/81eb9548d84e4498b2dae9e4355551a1764e8cfe.pdf,ICLR,2020,A text encoder trained to distinguish real input tokens from plausible fakes efficiently learns effective language representations. 
+rYt0p0Um9r,j_sWV8B0_NJ,1601310000000.0,1614990000000.0,2109,Do Deeper Convolutional Networks Perform Better?,"[""~Eshaan_Nichani1"", ""~Adityanarayanan_Radhakrishnan1"", ""~Caroline_Uhler1""]","[""Eshaan Nichani"", ""Adityanarayanan Radhakrishnan"", ""Caroline Uhler""]","[""Depth"", ""Over-parameterization"", ""Neural Networks""]"," Over-parameterization is a recent topic of much interest in the machine learning community. While over-parameterized neural networks are capable of perfectly fitting (interpolating) training data, these networks often perform well on test data, thereby contradicting classical learning theory. Recent work provided an explanation for this phenomenon by introducing the double descent curve, showing that increasing model capacity past the interpolation threshold leads to a decrease in test error. In line with this, it was recently shown empirically and theoretically that increasing neural network capacity through width leads to double descent. In this work, we analyze the effect of increasing depth on test performance. In contrast to what is observed for increasing width, we demonstrate through a variety of classification experiments on CIFAR10 and ImageNet-32 using ResNets and fully-convolutional networks that test performance worsens beyond a critical depth. We posit an explanation for this phenomenon by drawing intuition from the principle of minimum norm solutions in linear networks. +",/pdf/b3bf12081fa51c8a2129aa099daa7fc0072797b0.pdf,ICLR,2021,"In contrast to what is observed for increasing width, we demonstrate that test performance of convolutional neural networks worsens beyond a critical depth." 
+H1lug3R5FX,Byx40kC9tX,1538090000000.0,1545360000000.0,1090,On the Geometry of Adversarial Examples,"[""khoury@eecs.berkeley.edu"", ""dhm@berkeley.edu""]","[""Marc Khoury"", ""Dylan Hadfield-Menell""]","[""adversarial examples"", ""high-dimensional geometry""]","Adversarial examples are a pervasive phenomenon of machine learning models where seemingly imperceptible perturbations to the input lead to misclassifications for otherwise statistically accurate models. We propose a geometric framework, drawing on tools from the manifold reconstruction literature, to analyze the high-dimensional geometry of adversarial examples. In particular, we highlight the importance of codimension: for low-dimensional data manifolds embedded in high-dimensional space there are many directions off the manifold in which to construct adversarial examples. Adversarial examples are a natural consequence of learning a decision boundary that classifies the low-dimensional data manifold well, but classifies points near the manifold incorrectly. Using our geometric framework we prove (1) a tradeoff between robustness under different norms, (2) that adversarial training in balls around the data is sample inefficient, and (3) sufficient sampling conditions under which nearest neighbor classifiers and ball-based adversarial training are robust.",/pdf/eaaac106d84842721cf3b4a441bd73b84efd4897.pdf,ICLR,2019,We present a geometric framework for proving robustness guarantees and highlight the importance of codimension in adversarial examples. 
+HklvMJSYPB,rkeZ7Gh_PS,1569440000000.0,1577170000000.0,1582,Adaptive Adversarial Imitation Learning,"[""luyiren92@gmail.com"", ""tompson@google.com"", ""slevine@google.com""]","[""Yiren Lu"", ""Jonathan Tompson"", ""Sergey Levine""]","[""Imitation Learning"", ""Reinforcement Learning""]","We present the ADaptive Adversarial Imitation Learning (ADAIL) algorithm for learning adaptive policies that can be transferred between environments of varying dynamics, by imitating a small number of demonstrations collected from a single source domain. This problem is important in robotic learning because in real world scenarios 1) reward functions are hard to obtain, 2) learned policies from one domain are difficult to deploy in another due to varying source to target domain statistics, 3) collecting expert demonstrations in multiple environments where the dynamics are known and controlled is often infeasible. We address these constraints by building upon recent advances in adversarial imitation learning; we condition our policy on a learned dynamics embedding and we employ a domain-adversarial loss to learn a dynamics-invariant discriminator. The effectiveness of our method is demonstrated on simulated control tasks with varying environment dynamics and the learned adaptive agent outperforms several recent baselines.",/pdf/7a36ed0c2931e2e1aec318aded4384faad0af11d.pdf,ICLR,2020, +Hkxbz1HKvr,SyeUle3uvB,1569440000000.0,1577170000000.0,1568,Learning Key Steps to Attack Deep Reinforcement Learning Agents,"[""r07922080@csie.ntu.edu.tw"", ""htlin@csie.ntu.edu.tw""]","[""Chien-Min Yu"", ""Hsuan-Tien Lin""]","[""deep reinforcement learning"", ""adversarial attacks""]","Deep reinforcement learning agents are known to be vulnerable to adversarial attacks. In particular, recent studies have shown that attacking a few key steps is effective for decreasing the agent's cumulative reward. 
However, all existing attacking methods find those key steps with human-designed heuristics, and it is not clear how more effective key steps can be identified. This paper introduces a novel reinforcement learning framework that learns more effective key steps through interacting with the agent. The proposed framework does not require any human heuristics nor knowledge, and can be flexibly coupled with any white-box or black-box adversarial attack scenarios. Experiments on benchmark Atari games across different scenarios demonstrate that the proposed framework is superior to existing methods for identifying more effective key steps.",/pdf/1a61553c45bb2e8d51ccdb700d8921e2984ea1a0.pdf,ICLR,2020,We propose a novel reinforcement learning framework where an attacker can learn more effective key steps to attack the reinforcement learning agent. +KLH36ELmwIB,jwKYOqU_jn,1601310000000.0,1615910000000.0,52,DARTS-: Robustly Stepping out of Performance Collapse Without Indicators,"[""~Xiangxiang_Chu1"", ""figure1_wxx@sjtu.edu.cn"", ""~Bo_Zhang7"", ""~Shun_Lu1"", ""weixiaolin02@meituan.com"", ""~Junchi_Yan2""]","[""Xiangxiang Chu"", ""Xiaoxing Wang"", ""Bo Zhang"", ""Shun Lu"", ""Xiaolin Wei"", ""Junchi Yan""]","[""neural architecture search"", ""DARTS stability""]","Despite the fast development of differentiable architecture search (DARTS), it suffers from a standing instability issue regarding searching performance, which extremely limits its application. Existing robustifying methods draw clues from the outcome instead of finding out the causing factor. Various indicators such as Hessian eigenvalues are proposed as a signal of performance collapse, and the searching should be stopped once an indicator reaches a preset threshold. +However, these methods tend to easily reject good architectures if thresholds are inappropriately set, let alone the searching is intrinsically noisy. In this paper, we undertake a more subtle and direct approach to resolve the collapse. 
+We first demonstrate that skip connections with a learnable architectural coefficient can easily recover from a disadvantageous state and become dominant. We conjecture that skip connections profit too much from this privilege, hence causing the collapse for the derived model. Therefore, we propose to factor out this benefit with an auxiliary skip connection, ensuring a fairer competition for all operations. Extensive experiments on various datasets verify that our approach can substantially improve the robustness of DARTS. Our code is available at https://github.com/Meituan-AutoML/DARTS-",/pdf/7a1997d523c7fda09b7c32a482730abe9736ffa1.pdf,ICLR,2021,Indicator-free approach to stabilize DARTS +SkgE8sRcK7,Bkgut0kct7,1538090000000.0,1545360000000.0,164,Sample Efficient Deep Neuroevolution in Low Dimensional Latent Space,"[""bin.zhou@u.nus.edu"", ""elefjia@u.nus.edu""]","[""Bin Zhou"", ""Jiashi Feng""]","[""Neuroevolution"", ""Reinforcement Learning""]","Current deep neuroevolution models are usually trained in a large parameter search space for complex learning tasks, e.g. playing video games, which needs billions of samples and thousands of search steps to obtain significant performance. This raises a question of whether we can make use of sequential data generated during evolution, encode input samples, and evolve in low dimensional parameter space with latent state input in a fast and efficient manner. Here we give an affirmative answer: we train a VAE to encode input samples, then an RNN to model environment dynamics and handle temporal information, and last evolve our low dimensional policy network in latent space. 
We demonstrate that this approach is surprisingly efficient: our experiments on Atari games show that within 10M frames and 30 evolution steps of training, our algorithm achieves results competitive with ES, A3C, and DQN, which need billions of frames.",/pdf/56396e3ca3b58a8f300427f463aba9f2236e97c7.pdf,ICLR,2019, +SJJinbWRZ,HyRqnbbAW,1509130000000.0,1519800000000.0,719,Model-Ensemble Trust-Region Policy Optimization,"[""thanard.kurutach@berkeley.edu"", ""iclavera@berkeley.edu"", ""rockyduan@eecs.berkeley.edu"", ""avivt@berkeley.edu"", ""pabbeel@cs.berkeley.edu""]","[""Thanard Kurutach"", ""Ignasi Clavera"", ""Yan Duan"", ""Aviv Tamar"", ""Pieter Abbeel""]","[""model-based reinforcement learning"", ""model ensemble"", ""reinforcement learning"", ""model bias""]","Model-free reinforcement learning (RL) methods are succeeding in a growing number of tasks, aided by recent advances in deep learning. However, they tend to suffer from high sample complexity, which hinders their use in real-world domains. Alternatively, model-based reinforcement learning promises to reduce sample complexity, but tends to require careful tuning and to date has succeeded mainly in restrictive domains where simple models are sufficient for learning. In this paper, we analyze the behavior of vanilla model-based reinforcement learning methods when deep neural networks are used to learn both the model and the policy, and show that the learned policy tends to exploit regions where insufficient data is available for the model to be learned, causing instability in training. To overcome this issue, we propose to use an ensemble of models to maintain the model uncertainty and regularize the learning process. We further show that the use of likelihood ratio derivatives yields much more stable learning than backpropagation through time. 
Altogether, our approach Model-Ensemble Trust-Region Policy Optimization (ME-TRPO) significantly reduces the sample complexity compared to model-free deep RL methods on challenging continuous control benchmark tasks.",/pdf/1f1ee17793ffb2307b8281400e79a37d3ad08f4e.pdf,ICLR,2018,Deep Model-Based RL that works well. +N07ebsD-lHp,qrMaXji95U,1601310000000.0,1614990000000.0,1250,Defending against black-box adversarial attacks with gradient-free trained sign activation neural networks,"[""yx277@njit.edu"", ""mx42@njit.edu"", ""zy328@njit.edu"", ""~Usman_Roshan1""]","[""Yunzhe Xue"", ""Meiyan Xie"", ""Zhibo Yang"", ""Usman Roshan""]","[""sign activation neural network"", ""gradient-free training"", ""stochastic coordinate descent"", ""black box adversarial attack"", ""hopskipjump"", ""transferability"", ""image distortion""]","While machine learning models today can achieve high accuracies on classification tasks, they can be deceived by minor imperceptible distortions to the data. These are known as adversarial attacks and can be lethal in the black-box setting which does not require knowledge of the target model type or its parameters. Binary neural networks that have sign activation and are trained with gradient descent have been shown to be harder to attack than conventional sigmoid activation networks but their improvements are marginal. We instead train sign activation networks with a novel gradient-free stochastic coordinate descent algorithm and propose an ensemble of such networks as a defense model. We evaluate the robustness of our model (a hard problem in itself) on image, text, and medical ECG data and find it to be more robust than ensembles of binary, full precision, and convolutional neural networks, and than random forests while attaining comparable clean test accuracy. 
In order to explain our model's robustness we show that an adversary targeting a single network in our ensemble fails to attack (and thus non-transferable to) other networks in the ensemble. Thus a datapoint requires a large distortion to fool the majority of networks in our ensemble and is likely to be detected in advance. This property of non-transferability arises naturally from the non-convexity of sign activation networks and randomization in our gradient-free training algorithm without any adversarial defense effort.",/pdf/8d722bb6e4eff15530a0fc8d413ea198ad088b54.pdf,ICLR,2021,"We show that an ensemble of our gradient free trained sign activation networks is much more adversarially robust than ensembles of binary, full precision, convolutional neural networks, and than random forest on image, text, and medical ECG data." +HklQxnC5tX,BketjhwYFm,1538090000000.0,1545360000000.0,1061,Overlapping Community Detection with Graph Neural Networks,"[""shchur@in.tum.de"", ""guennemann@in.tum.de""]","[""Oleksandr Shchur"", ""Stephan G\u00fcnnemann""]","[""community detection"", ""deep learning for graphs""]","Community detection in graphs is of central importance in graph mining, machine learning and network science. Detecting overlapping communities is especially challenging, and remains an open problem. Motivated by the success of graph-based deep learning in other graph-related tasks, we study the applicability of this framework for overlapping community detection. We propose a probabilistic model for overlapping community detection based on the graph neural network architecture. Despite its simplicity, our model outperforms the existing approaches in the community recovery task by a large margin. 
Moreover, due to the inductive formulation, the proposed model is able to perform out-of-sample community detection for nodes that were not present at training time",/pdf/f55876160fada78bec5fd7d9aac7846c59415775.pdf,ICLR,2019,Detecting overlapping communities in graphs using graph neural networks +rJgqalBKvH,rJxHKW-tPH,1569440000000.0,1577170000000.0,2584,Deceptive Opponent Modeling with Proactive Network Interdiction for Stochastic Goal Recognition Control,"[""luojunren17@nudt.edu.cn"", ""gaowei14@nudt.edu.cn""]","[""Junren Luo"", ""Wei Gao"", ""Zhiyong Liao"", ""Weilin Yuan"", ""Wanpeng Zhang"", ""Shaofei Chen""]",[],"Goal recognition based on the observations of the behaviors collected online has been used to model some potential applications. Newly formulated problem of goal recognition design aims at facilitating the online goal recognition process by performing offline redesign of the underlying environment with hard action removal. +In this paper, we propose the stochastic goal recognition control (S-GRC) problem with two main stages: (1) deceptive opponent modeling based on maximum entropy regularized Markov decision processes (MDPs) and (2) goal recognition control under proactively static interdiction. +For the purpose of evaluation, we propose to use the worst case distinctiveness (wcd) as a measure of the non-distinctive path without revealing the true goals, the task of S-GRC is to interdict a set of actions that improve or reduce the wcd. 
+We empirically demonstrate that our proposed approach controls the goal recognition process based on the opponent's deceptive behavior.",/pdf/6f4a2ad52c033670611d1e0e4b44154c0dd9a5ab.pdf,ICLR,2020, +B1jscMbAW,ryZsqzZ0Z,1509140000000.0,1519250000000.0,907,Divide and Conquer Networks,"[""alexnowakvila@gmail.com"", ""david.folque@gmail.com"", ""bruna@cims.nyu.edu""]","[""Alex Nowak"", ""David Folqu\u00e9"", ""Joan Bruna""]","[""Neural Networks"", ""Combinatorial Optimization"", ""Algorithms""]","We consider the learning of algorithmic tasks by mere observation of input-output +pairs. Rather than studying this as a black-box discrete regression problem with +no assumption whatsoever on the input-output mapping, we concentrate on tasks +that are amenable to the principle of divide and conquer, and study what are its +implications in terms of learning. +This principle creates a powerful inductive bias that we leverage with neural +architectures that are defined recursively and dynamically, by learning two scale- +invariant atomic operations: how to split a given input into smaller sets, and how +to merge two partially solved tasks into a larger partial solution. Our model can be +trained in weakly supervised environments, namely by just observing input-output +pairs, and in even weaker environments, using a non-differentiable reward signal. +Moreover, thanks to the dynamic aspect of our architecture, we can incorporate +the computational complexity as a regularization term that can be optimized by +backpropagation. We demonstrate the flexibility and efficiency of the Divide- +and-Conquer Network on several combinatorial and geometric tasks: convex hull, +clustering, knapsack and Euclidean TSP. 
Thanks to the dynamic programming +nature of our model, we show significant improvements in terms of generalization +error and computational complexity.",/pdf/a6d419e20feea77f7bf73a8d2fd4afc963b2dc65.pdf,ICLR,2018,Dynamic model that learns divide and conquer strategies by weak supervision. +HJxJ2h4tPr,rJg5icHZwH,1569440000000.0,1577170000000.0,174,HighRes-net: Multi-Frame Super-Resolution by Recursive Fusion,"[""michel.deudon@elementai.com"", ""freddie@element.ai"", ""rifat.arefin515@gmail.com"", ""isrugeek@gmail.com"", ""zhichao.lin@elementai.com"", ""sankaran.kris@gmail.com"", ""vincent.michalski@gmx.de"", ""samira.ebrahimi-kahou@polymtl.ca"", ""julien@elementai.com"", ""yoshua.bengio@mila.quebec""]","[""Michel Deudon"", ""Alfredo Kalaitzis"", ""Md Rifat Arefin"", ""Israel Goytom"", ""Zhichao Lin"", ""Kris Sankaran"", ""Vincent Michalski"", ""Samira E Kahou"", ""Julien Cornebise"", ""Yoshua Bengio""]","[""multi-frame super-resolution"", ""super-resolution"", ""remote sensing"", ""fusion"", ""de-aliasing"", ""deep learning"", ""registration""]","Generative deep learning has sparked a new wave of Super-Resolution (SR) algorithms that enhance single images with impressive aesthetic results, albeit with imaginary details. Multi-frame Super-Resolution (MFSR) offers a more grounded approach to the ill-posed problem, by conditioning on multiple low-resolution views. This is important for satellite monitoring of human impact on the planet -- from deforestation, to human rights violations -- that depend on reliable imagery. To this end, we present HighRes-net, the first deep learning approach to MFSR that learns its sub-tasks in an end-to-end fashion: (i) co-registration, (ii) fusion, (iii) up-sampling, and (iv) registration-at-the-loss. Co-registration of low-res views is learned implicitly through a reference-frame channel, with no explicit registration mechanism. We learn a global fusion operator that is applied recursively on an arbitrary number of low-res pairs. 
We introduce a registered loss, by learning to align the SR output to a ground-truth through ShiftNet. We show that by learning deep representations of multiple views, we can super-resolve low-resolution signals and enhance Earth observation data at scale. Our approach recently topped the European Space Agency's MFSR competition on real-world satellite imagery.",/pdf/1d6bd5153f19fca0feeb4290bf323b24c26adbfd.pdf,ICLR,2020,"The first deep learning approach to MFSR to solve registration, fusion, up-sampling in an end-to-end manner." +hb1sDDSLbV,0Sme9p3WPh,1601310000000.0,1616080000000.0,2015,Learning explanations that are hard to vary,"[""~Giambattista_Parascandolo1"", ""~Alexander_Neitz1"", ""~ANTONIO_ORVIETO2"", ""~Luigi_Gresele1"", ""~Bernhard_Sch\u00f6lkopf1""]","[""Giambattista Parascandolo"", ""Alexander Neitz"", ""ANTONIO ORVIETO"", ""Luigi Gresele"", ""Bernhard Sch\u00f6lkopf""]","[""invariances"", ""consistency"", ""gradient alignment""]","In this paper, we investigate the principle that good explanations are hard to vary in the context of deep learning. +We show that averaging gradients across examples -- akin to a logical OR of patterns -- can favor memorization and `patchwork' solutions that sew together different strategies, instead of identifying invariances. +To inspect this, we first formalize a notion of consistency for minima of the loss surface, which measures to what extent a minimum appears only when examples are pooled. +We then propose and experimentally validate a simple alternative algorithm based on a logical AND, that focuses on invariances and prevents memorization in a set of real-world tasks. 
+Finally, using a synthetic dataset with a clear distinction between invariant and spurious mechanisms, we dissect learning signals and compare this approach to well-established regularizers.",/pdf/3609f7deb3e0f3fe924d4bcc0ad5015cebf79bad.pdf,ICLR,2021, +SJDaqqveg,,1478110000000.0,1488560000000.0,48,An Actor-Critic Algorithm for Sequence Prediction,"[""dimabgv@gmail.com"", ""pbpop3@gmail.com"", ""iamkelvinxu@gmail.com"", ""anirudhgoyal9119@gmail.com"", ""lowe.ryan.t@gmail.com"", ""jpineau@cs.mcgill.ca"", ""aaron.courville@gmail.com"", ""yoshua.bengio@gmail.com""]","[""Dzmitry Bahdanau"", ""Philemon Brakel"", ""Kelvin Xu"", ""Anirudh Goyal"", ""Ryan Lowe"", ""Joelle Pineau"", ""Aaron Courville"", ""Yoshua Bengio""]","[""Natural language processing"", ""Deep learning"", ""Reinforcement Learning"", ""Structured prediction""]","We present an approach to training neural networks to generate sequences using actor-critic methods from reinforcement learning (RL). Current log-likelihood training methods are limited by the discrepancy between their training and testing modes, as models must generate tokens conditioned on their previous guesses rather than the ground-truth tokens. We address this problem by introducing a \textit{critic} network that is trained to predict the value of an output token, given the policy of an \textit{actor} network. This results in a training procedure that is much closer to the test phase, and allows us to directly optimize for a task-specific score such as BLEU. Crucially, since we leverage these techniques in the supervised learning setting rather than the traditional RL setting, we condition the critic network on the ground-truth output. We show that our method leads to improved performance on both a synthetic task, and for German-English machine translation. Our analysis paves the way for such methods to be applied in natural language generation tasks, such as machine translation, caption generation, and dialogue modelling. 
",/pdf/35163ec9e0a4092ca39ad729b83880b89f7fd33f.pdf,ICLR,2017,Adapting Actor-Critic methods from reinforcement learning to structured prediction +ByxPYjC5KQ,rJeJwltqY7,1538090000000.0,1549850000000.0,452,Improving Generalization and Stability of Generative Adversarial Networks,"[""hoangtha@deakin.edu.au"", ""truyen.tran@deakin.edu.au"", ""svetha.venkatesh@deakin.edu.au""]","[""Hoang Thanh-Tung"", ""Truyen Tran"", ""Svetha Venkatesh""]","[""GAN"", ""generalization"", ""gradient penalty"", ""zero centered"", ""convergence""]","Generative Adversarial Networks (GANs) are one of the most popular tools for learning complex high dimensional distributions. However, generalization properties of GANs have not been well understood. In this paper, we analyze the generalization of GANs in practical settings. We show that discriminators trained on discrete datasets with the original GAN loss have poor generalization capability and do not approximate the theoretically optimal discriminator. We propose a zero-centered gradient penalty for improving the generalization of the discriminator by pushing it toward the optimal discriminator. The penalty guarantees the generalization and convergence of GANs. Experiments on synthetic and large scale datasets verify our theoretical analysis. +",/pdf/9a7ecbc7be0084bc3c65d24f4dd29a391d380545.pdf,ICLR,2019,We propose a zero-centered gradient penalty for improving generalization and stability of GANs +BygpQlbA-,B1epQe-R-,1509130000000.0,1518730000000.0,570,Towards Provable Control for Unknown Linear Dynamical Systems,"[""arora@cs.princeton.edu"", ""ehazan@cs.princeton.edu"", ""holdenl@princeton.edu"", ""karans@cs.princeton.edu"", ""cyril.zhang@cs.princeton.edu"", ""y.zhang@cs.princeton.edu""]","[""Sanjeev Arora"", ""Elad Hazan"", ""Holden Lee"", ""Karan Singh"", ""Cyril Zhang"", ""Yi Zhang""]","[""optimal control"", ""reinforcement learning""]","We study the control of symmetric linear dynamical systems with unknown dynamics and a hidden state. 
Using a recent spectral filtering technique for concisely representing such systems in a linear basis, we formulate optimal control in this setting as a convex program. This approach eliminates the need to solve the non-convex problem of explicit identification of the system and its latent state, and allows for provable optimality guarantees for the control signal. We give the first efficient algorithm for finding the optimal control signal with an arbitrary time horizon T, with sample complexity (number of training rollouts) polynomial only in log(T) and other relevant parameters.",/pdf/0d269f107ed71a3614c073b34abb971895a82218.pdf,ICLR,2018,"Using a novel representation of symmetric linear dynamical systems with a latent state, we formulate optimal control as a convex program, giving the first polynomial-time algorithm that solves optimal control with sample complexity only polylogarithmic in the time horizon." +4VixXVZJkoY,k79PCEn7nEu,1601310000000.0,1614990000000.0,1689,TRIP: Refining Image-to-Image Translation via Rival Preferences,"[""~Yinghua_Yao1"", ""~Yuangang_Pan2"", ""~Ivor_Tsang1"", ""~Xin_Yao1""]","[""Yinghua Yao"", ""Yuangang Pan"", ""Ivor Tsang"", ""Xin Yao""]","[""Fine-grained image-to-image translation"", ""GAN"", ""relative attributes"", ""ranker""]","We propose a new model to refine image-to-image translation via an adversarial ranking process. In particular, we simultaneously train two modules: a generator that translates an input image to the desired image with smooth subtle changes with respect to some specific attributes; and a ranker that ranks rival preferences consisting of the input image and the desired image. Rival preferences refer to the adversarial ranking process: (1) the ranker thinks no difference between the desired image and the input image in terms of the desired attributes; (2) the generator fools the ranker to believe that the desired image changes the attributes over the input image as desired. 
Real image preferences are introduced to guide the ranker to rank image pairs regarding the interested attributes only. With an effective ranker, the generator would “win” the adversarial game by producing high-quality images that present desired changes over the attributes compared to the input image. The experiments demonstrate that our TRIP can generate high-fidelity images which exhibit smooth changes with the strength of the attributes.",/pdf/fbb22f246e8e13c9b4b73d0073efcdc1d7ae9c8c.pdf,ICLR,2021,"We propose TRIP consisting of a ranker and a generator for a high-quality fine-grained translation, where the rival preference is constructed to evoke the adversarial training between the ranker and the generator." +9r30XCjf5Dt,_NjZNaKJarVD,1601310000000.0,1615950000000.0,2512,Vulnerability-Aware Poisoning Mechanism for Online RL with Unknown Dynamics,"[""~Yanchao_Sun1"", ""~Da_Huo1"", ""~Furong_Huang1""]","[""Yanchao Sun"", ""Da Huo"", ""Furong Huang""]","[""poisoning attack"", ""policy gradient"", ""vulnerability of RL"", ""deep RL""]","Poisoning attacks on Reinforcement Learning (RL) systems could take advantage of RL algorithm’s vulnerabilities and cause failure of the learning. However, prior works on poisoning RL usually either unrealistically assume the attacker knows the underlying Markov Decision Process (MDP), or directly apply the poisoning methods in supervised learning to RL. In this work, we build a generic poisoning framework for online RL via a comprehensive investigation of heterogeneous poisoning models in RL. Without any prior knowledge of the MDP, we propose a strategic poisoning algorithm called Vulnerability-Aware Adversarial Critic Poison (VA2C-P), which works for on-policy deep RL agents, closing the gap that no poisoning method exists for policy-based RL agents. VA2C-P uses a novel metric, stability radius in RL, that measures the vulnerability of RL algorithms. 
Experiments on multiple deep RL agents and multiple environments show that our poisoning algorithm successfully prevents agents from learning a good policy or teaches the agents to converge to a target policy, with a limited attacking budget.",/pdf/fb9e902c18157059497d56cdc36770d12b05acf4.pdf,ICLR,2021,"We propose the first poisoning algorithm against deep policy-based RL methods, without any prior knowledge of the environment, covering heterogeneous poisoning models." +rJhR_pxCZ,HkoRO6g0W,1509120000000.0,1518730000000.0,421,Interpretable Classification via Supervised Variational Autoencoders and Differentiable Decision Trees,"[""pquint@cse.unl.edu"", ""gwirka@cse.unl.edu"", ""jwilliam@cse.unl.edu"", ""sscott@cse.unl.edu"", ""vinod@cse.unl.edu""]","[""Eleanor Quint"", ""Garrett Wirka"", ""Jacob Williams"", ""Stephen Scott"", ""N.V. Vinodchandran""]","[""interpretable classification"", ""decision trees"", ""deep learning"", ""variational autoencoder""]","As deep learning-based classifiers are increasingly adopted in real-world applications, the importance of understanding how a particular label is chosen grows. Single decision trees are an example of a simple, interpretable classifier, but are unsuitable for use with complex, high-dimensional data. On the other hand, the variational autoencoder (VAE) is designed to learn a factored, low-dimensional representation of data, but typically encodes high-likelihood data in an intrinsically non-separable way. We introduce the differentiable decision tree (DDT) as a modular component of deep networks and a simple, differentiable loss function that allows for end-to-end optimization of a deep network to compress high-dimensional data for classification by a single decision tree. We also explore the power of labeled data in a supervised VAE (SVAE) with a Gaussian mixture prior, which leverages label information to produce a high-quality generative model with improved bounds on log-likelihood. 
We combine the SVAE with the DDT to get our classifier+VAE (C+VAE), which is competitive in both classification error and log-likelihood, despite optimizing both simultaneously and using a very simple encoder/decoder architecture. ",/pdf/6a030c0381f495193d9f5c56d6f8ddb12b53aed4.pdf,ICLR,2018,We combine differentiable decision trees with supervised variational autoencoders to enhance interpretability of classification. +H1cKvl-Rb,Hk5KwlbCb,1509130000000.0,1518730000000.0,600,UCB EXPLORATION VIA Q-ENSEMBLES,"[""richardchen@openai.com"", ""szymon@openai.com"", ""pabbeel@cs.berkeley.edu"", ""joschu@openai.com""]","[""Richard Y. Chen"", ""Szymon Sidor"", ""Pieter Abbeel"", ""John Schulman""]","[""Reinforcement learning"", ""Q-learning"", ""ensemble method"", ""upper confidence bound""]","We show how an ensemble of $Q^*$-functions can be leveraged for more effective exploration in deep reinforcement learning. We build on well established algorithms from the bandit setting, and adapt them to the $Q$-learning setting. We propose an exploration strategy based on upper-confidence bounds (UCB). Our experiments show significant gains on the Atari benchmark. ",/pdf/ef93188320781ea7a18c0eb58633fbc18071d095.pdf,ICLR,2018,"Adapting UCB exploration to ensemble Q-learning improves over prior methods such as Double DQN, A3C+ on Atari benchmark" +rkxNh1Stvr,Skg2P-yKPH,1569440000000.0,1583910000000.0,1948,Quantifying Point-Prediction Uncertainty in Neural Networks via Residual Estimation with an I/O Kernel,"[""qiuxin.nju@gmail.com"", ""elliot.meyerson@cognizant.com"", ""risto@cognizant.com""]","[""Xin Qiu"", ""Elliot Meyerson"", ""Risto Miikkulainen""]","[""Uncertainty Estimation"", ""Neural Networks"", ""Gaussian Process""]","Neural Networks (NNs) have been extensively used for a wide spectrum of real-world regression tasks, where the goal is to predict a numerical outcome such as revenue, effectiveness, or a quantitative result. 
In many such tasks, the point prediction is not enough: the uncertainty (i.e. risk or confidence) of that prediction must also be estimated. Standard NNs, which are most often used in such tasks, do not provide uncertainty information. Existing approaches address this issue by combining Bayesian models with NNs, but these models are hard to implement, more expensive to train, and usually do not predict as accurately as standard NNs. In this paper, a new framework (RIO) is developed that makes it possible to estimate uncertainty in any pretrained standard NN. The behavior of the NN is captured by modeling its prediction residuals with a Gaussian Process, whose kernel includes both the NN's input and its output. The framework is justified theoretically and evaluated in twelve real-world datasets, where it is found to (1) provide reliable estimates of uncertainty, (2) reduce the error of the point predictions, and (3) scale well to large datasets. Given that RIO can be applied to any standard NN without modifications to model architecture or training pipeline, it provides an important ingredient for building real-world NN applications.",/pdf/e4a730fab3756ad3d31924e8a134591096e665d9.pdf,ICLR,2020,Learning to Estimate Point-Prediction Uncertainty and Correct Output in Neural Networks +SJ-uGHcee,,1478280000000.0,1483450000000.0,243,Efficient iterative policy optimization,"[""nicolas@le-roux.name""]","[""Nicolas Le Roux""]",[],"We tackle the issue of finding a good policy when the number of policy updates is limited. This is done by approximating the expected policy reward as a sequence of concave lower bounds which can be efficiently maximized, drastically reducing the number of policy updates required to achieve good performance. 
We also extend existing methods to negative rewards, enabling the use of control variates.",/pdf/b8abc0601f0eda355edbb902f24a7443aed318e9.pdf,ICLR,2017, +S1ekaT4tDB,SyeKhDx_DS,1569440000000.0,1577170000000.0,802,Why Convolutional Networks Learn Oriented Bandpass Filters: A Hypothesis,"[""wildes@cse.yorku.ca""]","[""Richard P. Wildes""]","[""convolutional networks"", ""computer vision"", ""oriented bandpass filters"", ""linear systems theory""]","It has been repeatedly observed that convolutional architectures when applied to +image understanding tasks learn oriented bandpass filters. A standard explanation +of this result is that these filters reflect the structure of the images that they have +been exposed to during training: Natural images typically are locally composed +of oriented contours at various scales and oriented bandpass filters are matched +to such structure. The present paper offers an alternative explanation based not +on the structure of images, but rather on the structure of convolutional architectures. +In particular, complex exponentials are the eigenfunctions of convolution. +These eigenfunctions are defined globally; however, convolutional architectures +operate locally. To enforce locality, one can apply a windowing function to the +eigenfunctions, which leads to oriented bandpass filters as the natural operators +to be learned with convolutional architectures. From a representational point of +view, these filters allow for a local systematic way to characterize and operate on +an image or other signal.",/pdf/0d1f21db1d5af7c02cab85b6289a4e6ec2012b3b.pdf,ICLR,2020,This paper offers a hypothesis for why convolutional networks learn oriented bandpass filters when applied to image understanding. 
+SJzwvoCqF7,rJg66nb_Fm,1538090000000.0,1545360000000.0,272,"On Tighter Generalization Bounds for Deep Neural Networks: CNNs, ResNets, and Beyond","[""xingguol@princeton.edu"", ""junweilu@hsph.harvard.edu"", ""zhaoranwang@gmail.com"", ""jdhaupt@umn.edu"", ""tourzhao@gatech.edu""]","[""Xingguo Li"", ""Junwei Lu"", ""Zhaoran Wang"", ""Jarvis Haupt"", ""Tuo Zhao""]","[""deep learning"", ""generalization error bound"", ""convolutional neural networks""]","We propose a generalization error bound for a general family of deep neural networks based on the depth and width of the networks, as well as the spectral norm of weight matrices. Through introducing a novel characterization of the Lipschitz properties of neural network family, we achieve a tighter generalization error bound. We further obtain a result that is free of linear dependence on norms for bounded losses. Besides the general deep neural networks, our results can be applied to derive new bounds for several popular architectures, including convolutional neural networks (CNNs), residual networks (ResNets), and hyperspherical networks (SphereNets). When achieving same generalization errors with previous arts, our bounds allow for the choice of much larger parameter spaces of weight matrices, inducing potentially stronger expressive ability for neural networks.",/pdf/4dda985420a734c6573c0c06de8dfbd78cc97d98.pdf,ICLR,2019, +H1BHbmWCZ,r1_z-7-CW,1509140000000.0,1518730000000.0,1126,TOWARDS ROBOT VISION MODULE DEVELOPMENT WITH EXPERIENTIAL ROBOT LEARNING,"[""aaa2cn@virginia.edu"", ""jbd@virginia.edu""]","[""Ahmed A Aly"", ""Joanne Bechta Dugan""]","[""Deep Learning"", ""Robotics"", ""Artificial Intelligence"", ""Computer Vision""]","In this paper we present a thrust in three directions of visual development using supervised and semi-supervised techniques. 
The first is an implementation of semi-supervised object detection and recognition using the principles of Soft Attention and Generative Adversarial Networks (GANs). The second and the third are supervised networks that learn basic concepts of spatial locality and quantity respectively using Convolutional Neural Networks (CNNs). The three thrusts together are based on the approach of Experiential Robot Learning, introduced in previous publication. While the results are unripe for implementation, we believe they constitute a stepping stone towards autonomous development of robotic visual modules.",/pdf/00e5c4aefc80d0396ee745c032d27e0bccb43079.pdf,ICLR,2018,3 thrusts serving as stepping stones for robot experiential learning of vision module +SkxbDsR9Ym,S1eEW-w9Km,1538090000000.0,1545360000000.0,236,RelWalk -- A Latent Variable Model Approach to Knowledge Graph Embedding,"[""danushka@liverpool.ac.uk"", ""h.a.hakami@liverpool.ac.uk"", ""yyoshida@nii.ac.jp"", ""k_keniti@nii.ac.jp""]","[""Danushka Bollegala"", ""Huda Hakami"", ""Yuichi Yoshida"", ""Ken-ichi Kawarabayashi""]","[""relation representations"", ""natural language processing"", ""theoretical analysis"", ""knowledge graphs""]","Knowledge Graph Embedding (KGE) is the task of jointly learning entity and relation embeddings for a given knowledge graph. Existing methods for learning KGEs can be seen as a two-stage process where (a) entities and relations in the knowledge graph are represented using some linear algebraic structures (embeddings), and (b) a scoring function is defined that evaluates the strength of a relation that holds between two entities using the corresponding relation and entity embeddings. Unfortunately, prior proposals for the scoring functions in the first step have been heuristically motivated, and it is unclear as to how the scoring functions in KGEs relate to the generation process of the underlying knowledge graph. 
To address this issue, we propose a generative account of the KGE learning task. Specifically, given a knowledge graph represented by a set of relational triples (h, R, t), where the semantic relation R holds between the two entities h (head) and t (tail), we extend the random walk model (Arora et al., 2016a) of word embeddings to KGE. We derive a theoretical relationship between the joint probability p(h, R, t) and the embeddings of h, R and t. Moreover, we show that marginal loss minimisation, a popular objective used by much prior work in KGE, follows naturally from the log-likelihood ratio maximisation under the probabilities estimated from the KGEs according to our theoretical relationship. We propose a learning objective motivated by the theoretical analysis to learn KGEs from a given knowledge graph. The KGEs learnt by our proposed method obtain state-of-the-art performance on FB15K237 and WN18RR benchmark datasets, providing empirical evidence in support of the theory. +",/pdf/d70821e9867c967f02ed7b2eb093b95ef3fa0e3d.pdf,ICLR,2019,We present a theoretically proven generative model of knowledge graph embedding. +B1mAkPxCZ,HyG0yPxAW,1509090000000.0,1518730000000.0,285,VOCABULARY-INFORMED VISUAL FEATURE AUGMENTATION FOR ONE-SHOT LEARNING,"[""14302010017@fudan.edu.cn"", ""16210240036@fudan.edu.cn"", ""yindaz@cs.princeton.edu"", ""y.fu@qmul.ac.uk""]","[""jianqi ma"", ""hangyu lin"", ""yinda zhang"", ""yanwei fu"", ""xiangyang xue""]","[""vocabulary-informed learning"", ""data augmentation""]","A natural solution for one-shot learning is to augment training data to handle the data deficiency problem. However, directly augmenting in the image domain may not necessarily generate training data that sufficiently explore the intra-class space for one-shot classification. Inspired by the recent vocabulary-informed learning, we propose to generate synthetic training data with the guide of the semantic word space. 
Essentially, we train an auto-encoder as a bridge to enable the transformation between the image feature space and the semantic space. Besides directly augmenting image features, we transform the image features to semantic space using the encoder and perform the data augmentation. The decoder then synthesizes the image features for the augmented instances from the semantic space. Experiments on three datasets show that our data augmentation method effectively improves the performance of one-shot classification. An extensive study shows that data augmented from semantic space are complementary with those from the image space, and thus boost the classification accuracy dramatically. Source code and dataset will be available. ",/pdf/e2adf5f6451fb60229f6bfd75e678b1a28d6e1ae.pdf,ICLR,2018, +p7OewL0RRIH,2bs8Ng5Q3vk,1601310000000.0,1614990000000.0,1493,Sself: Robust Federated Learning against Stragglers and Adversaries,"[""savertm@kaist.ac.kr"", ""~Dong-Jun_Han1"", ""ejaqmf@jejunu.ac.kr"", ""~Jaekyun_Moon2""]","[""Jungwuk Park"", ""Dong-Jun Han"", ""Minseok Choi"", ""Jaekyun Moon""]","[""Federated Learning""]","While federated learning allows efficient model training with local data at edge devices, two major issues that need to be resolved are: slow devices known as stragglers and malicious attacks launched by adversaries. While the presence of both stragglers and adversaries raises serious concerns for the deployment of practical federated learning systems, no known schemes or known combinations of schemes, to our best knowledge, effectively address these two issues at the same time. In this work, we propose Sself, a semi-synchronous entropy and loss based filtering/averaging, to tackle both stragglers and adversaries simultaneously. The stragglers are handled by exploiting different staleness (arrival delay) information when combining locally updated models during periodic global aggregation. 
Various adversarial attacks are tackled by utilizing a small amount of public data collected at the server in each aggregation step, to first filter out the model-poisoned devices using computed entropies, and then perform weighted averaging based on the estimated losses to combat data poisoning and backdoor attacks. A theoretical convergence bound is established to provide insights on the convergence of Sself. Extensive experimental results show that Sself outperforms various combinations of existing methods aiming to handle stragglers/adversaries.",/pdf/e201319c3e9b74cfed5468dd8cd7824b9c983cf2.pdf,ICLR,2021,"We propose Sself, a semi-synchronous entropy and loss based filtering for federated learning, to tackle both stragglers and adversaries simultaneously." +clyAUUnldg,aAsGZQxugJL,1601310000000.0,1614990000000.0,1972,AdaDGS: An adaptive black-box optimization method with a nonlocal directional Gaussian smoothing gradient,"[""~Hoang_A_Tran1"", ""~Guannan_Zhang1""]","[""Hoang A Tran"", ""Guannan Zhang""]",[],"The local gradient points to the direction of the steepest slope in an infinitesimal neighborhood. An optimizer guided by the local gradient is often trapped in local optima when the loss landscape is multi-modal. A directional Gaussian smoothing (DGS) approach was recently proposed in (Zhang et al., 2020) and used to define a truly nonlocal gradient, referred to as the DGS gradient, for high-dimensional black-box optimization. Promising results show that replacing the traditional local gradient with the DGS gradient can significantly improve the performance of gradient-based methods in optimizing highly multi-modal loss functions. However, the optimal performance of the DGS gradient may rely on fine tuning of two important hyper-parameters, i.e., the smoothing radius and the learning rate. 
In this paper, we present a simple, yet ingenious and efficient adaptive approach for optimization with the DGS gradient, which removes the need of hyper-parameter fine tuning. Since the DGS gradient generally points to a good search direction, we perform a line search along the DGS direction to determine the step size at each iteration. The learned step size in turn will inform us of the scale of function landscape in the surrounding area, based on which we adjust the smoothing radius accordingly for the next iteration. We present experimental results on high-dimensional benchmark functions, an airfoil design problem and a game content generation problem. The AdaDGS method has shown superior performance over several state-of-the-art black-box optimization methods.",/pdf/ee389da57b7b95da9ec8c37bee732aed581144bb.pdf,ICLR,2021,We developed an adaptive algorithm to avoid fine hyper-parameter tuning of a nonlocal black-box optimization method for high-dimensional problems. +rkg3kRNKvH,HyevQaMuDB,1569440000000.0,1577170000000.0,907,Linguistic Embeddings as a Common-Sense Knowledge Repository: Challenges and Opportunities,"[""nfulda@byu.edu""]","[""Nancy Fulda""]","[""knowledge representation"", ""word embeddings"", ""sentence embeddings"", ""common-sense knowledge""]","Many applications of linguistic embedding models rely on their value as pre-trained inputs for end-to-end tasks such as dialog modeling, machine translation, or question answering. This position paper presents an alternate paradigm: Rather than using learned embeddings as input features, we instead treat them as a common-sense knowledge repository that can be queried via simple mathematical operations within the embedding space. We show how linear offsets can be used to (a) identify an object given its description, (b) discover relations of an object given its label, and (c) map free-form text to a set of action primitives. 
Our experiments provide a valuable proof of concept that language-informed common sense reasoning, or `reasoning in the linguistic domain', lies within the grasp of the research community. In order to attain this goal, however, we must reconsider the way neural embedding models are typically trained and evaluated. To that end, we also identify three empirically-motivated evaluation metrics for use in the training of future embedding models.",/pdf/f6e80525ce7a70e3f0e9e8b2f0a82d5e8bf12936.pdf,ICLR,2020,"This paper presents a paradigm and methodology for using learned sentence representations as emergent, flexible knowledge bases that can be queried using linear algebra." +SJQO7UJCW,SkzO78yAb,1509020000000.0,1518730000000.0,125,Adversarial Learning for Semi-Supervised Semantic Segmentation,"[""whung8@ucmerced.edu"", ""ytsai@nec-labs.com"", ""lyt@csie.ntu.edu.tw"", ""yylin@citi.sinica.edu.tw"", ""mhyang@ucmerced.edu""]","[""Wei-Chih Hung"", ""Yi-Hsuan Tsai"", ""Yan-Ting Liou"", ""Yen-Yu Lin"", ""Ming-Hsuan Yang""]","[""semantic segmentation"", ""adversarial learning"", ""semi-supervised learning"", ""self-taught learning""]","We propose a method for semi-supervised semantic segmentation using the adversarial network. While most existing discriminators are trained to classify input images as real or fake on the image level, we design a discriminator in a fully convolutional manner to differentiate the predicted probability maps from the ground truth segmentation distribution with the consideration of the spatial resolution. We show that the proposed discriminator can be used to improve the performance on semantic segmentation by coupling the adversarial loss with the standard cross entropy loss on the segmentation network. In addition, the fully convolutional discriminator enables the semi-supervised learning through discovering the trustworthy regions in prediction results of unlabeled images, providing additional supervisory signals. 
In contrast to existing methods that utilize weakly-labeled images, our method leverages unlabeled images without any annotation to enhance the segmentation model. Experimental results on both the PASCAL VOC 2012 dataset and the Cityscapes dataset demonstrate the effectiveness of our algorithm.",/pdf/40e221ee03dfd2943cf2dcca323164a7d44a7afd.pdf,ICLR,2018, +rke3OxSKwr,SJe-ragtvB,1569440000000.0,1577170000000.0,2412,Improved Training Techniques for Online Neural Machine Translation,"[""maha.elbayad@inria.fr"", ""laurent.besacier@univ-grenoble-alpes.fr"", ""jakob.verbeek@inria.fr""]","[""Maha Elbayad"", ""Laurent Besacier"", ""Jakob Verbeek""]","[""Deep learning"", ""natural language processing"", ""Machine translation""]","Neural sequence-to-sequence models are at the basis of state-of-the-art solutions for sequential prediction problems such as machine translation and speech recognition. The models typically assume that the entire input is available when starting target generation. In some applications, however, it is desirable to start the decoding process before the entire input is available, e.g. to reduce the latency in automatic speech recognition. We consider state-of-the-art wait-k decoders, that first read k tokens from the source and then alternate between reading tokens from the input and writing to the output. We investigate the sensitivity of such models to the value of k that is used during training and when deploying the model, and the effect of updating the hidden states in transformer models as new source tokens are read. We experiment with German-English translation on the IWSLT14 dataset and the larger WMT15 dataset. 
Our results significantly improve over earlier state-of-the-art results for German-English translation on the WMT15 dataset across different latency levels.",/pdf/0e45cb24453e778ba06ed67c1007c33cd8375ba0.pdf,ICLR,2020,Improved training of wait-k decoders for online machine translation +BkjLkSqxg,,1478280000000.0,1481910000000.0,230,LipNet: End-to-End Sentence-level Lipreading,"[""yannis.assael@cs.ox.ac.uk"", ""brendan.shillingford@cs.ox.ac.uk"", ""shimon.whiteson@cs.ox.ac.uk"", ""nando.de.freitas@cs.ox.ac.uk""]","[""Yannis M. Assael"", ""Brendan Shillingford"", ""Shimon Whiteson"", ""Nando de Freitas""]","[""Computer vision"", ""Deep learning""]","Lipreading is the task of decoding text from the movement of a speaker's mouth. Traditional approaches separated the problem into two stages: designing or learning visual features, and prediction. More recent deep lipreading approaches are end-to-end trainable (Wand et al., 2016; Chung & Zisserman, 2016a). However, existing work on models trained end-to-end perform only word classification, rather than sentence-level sequence prediction. Studies have shown that human lipreading performance increases for longer words (Easton & Basala, 1982), indicating the importance of features capturing temporal context in an ambiguous communication channel. Motivated by this observation, we present LipNet, a model that maps a variable-length sequence of video frames to text, making use of spatiotemporal convolutions, a recurrent network, and the connectionist temporal classification loss, trained entirely end-to-end. To the best of our knowledge, LipNet is the first end-to-end sentence-level lipreading model that simultaneously learns spatiotemporal visual features and a sequence model. 
On the GRID corpus, LipNet achieves 95.2% accuracy in sentence-level, overlapped speaker split task, outperforming experienced human lipreaders and the previous 86.4% word-level state-of-the-art accuracy (Gergen et al., 2016).",/pdf/34c7079f4e5e130a4293c7e9501cc1e8b04690b1.pdf,ICLR,2017,LipNet is the first end-to-end sentence-level lipreading model to simultaneously learn spatiotemporal visual features and a sequence model. +wUUKCAmBx6q,q28d999DF2R,1601310000000.0,1614990000000.0,2586,Flow Neural Network for Traffic Flow Modelling in IP Networks,"[""~Xiangle_Cheng1"", ""y325he@uwaterloo.ca"", ""longfeifei@huawei.com"", ""xiaoshihan@huawei.com"", ""lifenglin@huawei.com""]","[""Xiangle Cheng"", ""Yuchen He"", ""Feifei Long"", ""Shihan Xiao"", ""Fenglin Li""]","[""Flow neural network"", ""contrastive induction learning"", ""representation learning"", ""spatio-temporal induction""]","This paper presents and investigates a novel and timely application domain for deep learning: sub-second traffic flow modelling in IP networks. Traffic flows are the most fundamental components in an IP based networking system. The accurate modelling of the generative patterns of these flows is crucial for many practical network applications. However, the high nonlinearity and dynamics of both the traffic and network conditions make this task challenging, particularly at the time granularity of sub-second. In this paper, we cast this problem as a representation learning task to model the intricate patterns in data traffic according to the IP network structure and working mechanism. Accordingly, we propose a customized Flow Neural Network, which works in a self-supervised way to extract the domain-specific data correlations. 
We report the state-of-the-art performances on both synthetic and realistic traffic patterns on multiple practical network applications, which provides a good testament to the strength of our approach.",/pdf/395453ab2fb1a2fde41b7e1a8b5947aaffe4823f.pdf,ICLR,2021,We propose a customised Flow Neural Network for the subsecond traffic flow modelling in IP networks by exploiting the domain-specific data properties according to the IP network structure and working mechanism. +B1twdMCab,SJdDuG0TZ,1508940000000.0,1518730000000.0,85,Dynamic Integration of Background Knowledge in Neural NLU Systems,"[""dirk.weissenborn@dfki.de"", ""tkocisky@google.com"", ""cdyer@google.com""]","[""Dirk Weissenborn"", ""Tomas Kocisky"", ""Chris Dyer""]","[""natural language processing"", ""background knowledge"", ""word embeddings"", ""question answering"", ""natural language inference""]","Common-sense or background knowledge is required to understand natural language, but in most neural natural language understanding (NLU) systems, the requisite background knowledge is indirectly acquired from static corpora. We develop a new reading architecture for the dynamic integration of explicit background knowledge in NLU models. A new task-agnostic reading module provides refined word representations to a task-specific NLU architecture by processing background knowledge in the form of free-text statements, together with the task-specific inputs. Strong performance on the tasks of document question answering (DQA) and recognizing textual entailment (RTE) demonstrate the effectiveness and flexibility of our approach. Analysis shows that our models learn to exploit knowledge selectively and in a semantically appropriate way.",/pdf/8646625d4c11be5b5c030b95799e20c35053009a.pdf,ICLR,2018,In this paper we present a task-agnostic reading architecture for the dynamic integration of explicit background knowledge in neural NLU models. 
+SJgXs1HtwH,S1gApp0uDH,1569440000000.0,1577170000000.0,1909,TreeCaps: Tree-Structured Capsule Networks for Program Source Code Processing,"[""vinojjayasundara@gmail.com"", ""dqnbui.2016@phdis.smu.edu.sg"", ""lxjiang@smu.edu.sg"", ""davidlo@smu.edu.sg""]","[""Vinoj Jayasundara"", ""Nghi Duy Quoc Bui"", ""Lingxiao Jiang"", ""David Lo""]","[""Program Classification"", ""Capsule Networks"", ""Deep Learning""]","Program comprehension is a fundamental task in software development and maintenance processes. Software developers often need to understand a large amount of existing code before they can develop new features or fix bugs in existing programs. Being able to process programming language code automatically and provide summaries of code functionality accurately can significantly help developers to reduce time spent in code navigation and understanding, and thus increase productivity. Different from natural language articles, source code in programming languages often follows rigid syntactical structures and there can exist dependencies among code elements that are located far away from each other through complex control flows and data flows. Existing studies on tree-based convolutional neural networks (TBCNN) and gated graph neural networks (GGNN) are not able to capture essential semantic dependencies among code elements accurately. In this paper, we propose novel tree-based capsule networks (TreeCaps) and relevant techniques for processing program code in an automated way that encodes code syntactical structures and captures code dependencies more accurately. 
Based on evaluation on programs written in different programming languages, we show that our TreeCaps-based approach can outperform other approaches in classifying the functionalities of many programs.",/pdf/9c6149696670dc2276800df5b5ed8c3f1812967c.pdf,ICLR,2020, +S1gSj0NKvB,BkeL6WYODS,1569440000000.0,1583910000000.0,1317,Comparing Rewinding and Fine-tuning in Neural Network Pruning,"[""renda@csail.mit.edu"", ""jfrankle@csail.mit.edu"", ""mcarbin@csail.mit.edu""]","[""Alex Renda"", ""Jonathan Frankle"", ""Michael Carbin""]","[""pruning"", ""sparsity"", ""fine-tuning"", ""lottery ticket""]","Many neural network pruning algorithms proceed in three steps: train the network to completion, remove unwanted structure to compress the network, and retrain the remaining structure to recover lost accuracy. The standard retraining technique, fine-tuning, trains the unpruned weights from their final trained values using a small fixed learning rate. In this paper, we compare fine-tuning to alternative retraining techniques. Weight rewinding (as proposed by Frankle et al., (2019)), rewinds unpruned weights to their values from earlier in training and retrains them from there using the original training schedule. Learning rate rewinding (which we propose) trains the unpruned weights from their final values using the same learning rate schedule as weight rewinding. Both rewinding techniques outperform fine-tuning, forming the basis of a network-agnostic pruning algorithm that matches the accuracy and compression ratios of several more network-specific state-of-the-art techniques. +",/pdf/4b5ae42dbdeee4a77dfc03a1b2bde2897565891f.pdf,ICLR,2020,"Instead of fine-tuning after pruning, rewind weights or learning rate schedule to their values earlier in training and retrain from there to achieve higher accuracy when pruning neural networks." 
+BJ9fZNqle,,1478280000000.0,1481450000000.0,204,Multi-modal Variational Encoder-Decoders,"[""julianserban@gmail.com"", ""ago109@psu.edu"", ""jpineau@cs.mcgill.ca"", ""aaron.courville@umontreal.ca""]","[""Iulian V. Serban"", ""Alexander G. Ororbia II"", ""Joelle Pineau"", ""Aaron Courville""]","[""Deep learning"", ""Structured prediction"", ""Natural language processing""]","Recent advances in neural variational inference have facilitated efficient training of powerful directed graphical models with continuous latent variables, such as variational autoencoders. However, these models usually assume simple, uni-modal priors — such as the multivariate Gaussian distribution — yet many real-world data distributions are highly complex and multi-modal. Examples of complex and multi-modal distributions range from topics in newswire text to conversational dialogue responses. When such latent variable models are applied to these domains, the restriction of the simple, uni-modal prior hinders the overall expressivity of the learned model as it cannot possibly capture more complex aspects of the data distribution. To overcome this critical restriction, we propose a flexible, simple prior distribution which can be learned efficiently and potentially capture an exponential number of modes of a target distribution. We develop the multi-modal variational encoder-decoder framework and investigate the effectiveness of the proposed prior in several natural language processing modeling tasks, including document modeling and dialogue modeling.",/pdf/071718de067f86d033a26cee3d6f7bd1e96215db.pdf,ICLR,2017,Learning continuous multimodal latent variables in the variational auto-encoder framework for text processing applications. 
+BygPq6VFvS,HylWoyRDPS,1569440000000.0,1577170000000.0,711,Enhancing Attention with Explicit Phrasal Alignments,"[""nxphi47@gmail.com"", ""sjoty@salesforce.com"", ""ng0155ng@e.ntu.edu.sg""]","[""Xuan-Phi Nguyen"", ""Shafiq Joty"", ""Thanh-Tung Nguyen""]","[""NMT"", ""Phrasal Attention"", ""Machine Translation"", ""Language Modeling""]","The attention mechanism is an indispensable component of any state-of-the-art neural machine translation system. However, existing attention methods are often token-based and ignore the importance of phrasal alignments, which are the backbone of phrase-based statistical machine translation. We propose a novel phrase-based attention method to model n-grams of tokens as the basic attention entities, and design multi-headed phrasal attentions within the Transformer architecture to perform token-to-token and token-to-phrase mappings. Our approach yields improvements in English-German, English-Russian and English-French translation tasks on the standard WMT'14 test set. Furthermore, our phrasal attention method shows improvements on the one-billion-word language modeling benchmark. +",/pdf/846366884ccb6624de9f8ff9db0534f56f468548.pdf,ICLR,2020, +Ske6wiAcKQ,rJePDX_qK7,1538090000000.0,1545360000000.0,303,Real-time Neural-based Input Method,"[""jiayao@microsoft.com"", ""shu@nlab.ci.i.u-tokyo.ac.jp"", ""xinjianl@andrew.cmu.edu"", ""katsutoshi.ohtsuki@microsoft.com"", ""nakayama@ci.i.u-tokyo.ac.jp""]","[""Jiali Yao"", ""Raphael Shu"", ""Xinjian Li"", ""Katsutoshi Ohtsuki"", ""Hideki Nakayama""]","[""input method"", ""language model"", ""neural network"", ""softmax""]","The input method is an essential service on every mobile and desktop devices that provides text suggestions. It converts sequential keyboard inputs to the characters in its target language, which is indispensable for Japanese and Chinese users. 
Due to critical resource constraints and limited network bandwidth of the target devices, applying neural models to input method is not well explored. In this work, we apply an LSTM-based language model to input method and evaluate its performance for both prediction and conversion tasks with Japanese BCCWJ corpus. We articulate the bottleneck to be the slow softmax computation during conversion. To solve the issue, we propose incremental softmax approximation approach, which computes softmax with a selected subset vocabulary and fixes the stale probabilities when the vocabulary is updated in future steps. We refer to this method as incremental selective softmax. The results show a two order speedup for the softmax computation when converting Japanese input sequences with a large vocabulary, reaching real-time speed on commodity CPU. We also exploit the model compressing potential to achieve a 92% model size reduction without losing accuracy.",/pdf/c288bc4ba40dd3954c66b19fd8bae80380b3c358.pdf,ICLR,2019, +BkLhzHtlg,,1478210000000.0,1488740000000.0,94,Learning Recurrent Representations for Hierarchical Behavior Modeling,"[""eeyjolfs@caltech.edu"", ""bransonk@janelia.hhmi.org"", ""yyue@caltech.edu"", ""perona@caltech.edu""]","[""Eyrun Eyjolfsdottir"", ""Kristin Branson"", ""Yisong Yue"", ""Pietro Perona""]","[""Unsupervised Learning"", ""Semi-Supervised Learning"", ""Reinforcement Learning"", ""Applications""]","We propose a framework for detecting action patterns from motion sequences and modeling the sensory-motor relationship of animals, using a generative recurrent neural network. The network has a discriminative part (classifying actions) and a generative part (predicting motion), whose recurrent cells are laterally connected, allowing higher levels of the network to represent high level behavioral phenomena. We test our framework on two types of tracking data, fruit fly behavior and online handwriting. 
Our results show that 1) taking advantage of unlabeled sequences, by predicting future motion, significantly improves action detection performance when training labels are scarce, 2) the network learns to represent high level phenomena such as writer identity and fly gender, without supervision, and 3) simulated motion trajectories, generated by treating motion prediction as input to the network, look realistic and may be used to qualitatively evaluate whether the model has learnt generative control rules. ",/pdf/b827866ca50e1bcc6f59cfd1aa6e628115bf13a3.pdf,ICLR,2017, +2d34y5bRWxB,rYoDBfHsWK,1601310000000.0,1614990000000.0,3218,Regularization Cocktails for Tabular Datasets,"[""~Arlind_Kadra1"", ""~Marius_Lindauer1"", ""~Frank_Hutter1"", ""~Josif_Grabocka1""]","[""Arlind Kadra"", ""Marius Lindauer"", ""Frank Hutter"", ""Josif Grabocka""]","[""deep learning"", ""regularization"", ""hyperparameter optimization"", ""benchmarks.""]","The regularization of prediction models is arguably the most crucial ingredient that allows Machine Learning solutions to generalize well on unseen data. Several types of regularization are popular in the Deep Learning community (e.g., weight decay, drop-out, early stopping, etc.), but so far these are selected on an ad-hoc basis, and there is no systematic study as to how different regularizers should be combined into the best “cocktail”. In this paper, we fill this gap, by considering the cocktails of 13 different regularization methods and framing the question of how to best combine them as a standard hyperparameter optimization problem. 
We perform a large-scale empirical study on 40 tabular datasets, concluding that, firstly, regularization cocktails substantially outperform individual regularization methods, even if the hyperparameters of the latter are carefully tuned; secondly, the optimal regularization cocktail depends on the dataset; and thirdly, regularization cocktails yield the state-of-the-art in classifying tabular datasets by outperforming Gradient-Boosted Decision Trees.",/pdf/7eebe22c4ab2ef082d3c49fdfdf7fc40e3e67547.pdf,ICLR,2021,An empirical study on the optimal combination of regularization methods. +HyztsoC5Y7,SygH8F55Y7,1538090000000.0,1550860000000.0,642,"Learning to Adapt in Dynamic, Real-World Environments through Meta-Reinforcement Learning","[""nagaban2@berkeley.edu"", ""iclavera@berkeley.edu"", ""simin.liu@berkeley.edu"", ""ronf@berkeley.edu"", ""pabbeel@berkeley.edu"", ""svlevine@eecs.berkeley.edu"", ""cbfinn@eecs.berkeley.edu""]","[""Anusha Nagabandi"", ""Ignasi Clavera"", ""Simin Liu"", ""Ronald S. Fearing"", ""Pieter Abbeel"", ""Sergey Levine"", ""Chelsea Finn""]","[""meta-learning"", ""reinforcement learning"", ""meta reinforcement learning"", ""online adaptation""]","Although reinforcement learning methods can achieve impressive results in simulation, the real world presents two major challenges: generating samples is exceedingly expensive, and unexpected perturbations or unseen situations cause proficient but specialized policies to fail at test time. Given that it is impractical to train separate policies to accommodate all situations the agent may see in the real world, this work proposes to learn how to quickly and effectively adapt online to new tasks. To enable sample-efficient learning, we consider learning online adaptation in the context of model-based reinforcement learning. Our approach uses meta-learning to train a dynamics model prior such that, when combined with recent data, this prior can be rapidly adapted to the local context. 
Our experiments demonstrate online adaptation for continuous control tasks on both simulated and real-world agents. We first show simulated agents adapting their behavior online to novel terrains, crippled body parts, and highly-dynamic environments. We also illustrate the importance of incorporating online adaptation into autonomous agents that operate in the real world by applying our method to a real dynamic legged millirobot: We demonstrate the agent's learned ability to quickly adapt online to a missing leg, adjust to novel terrains and slopes, account for miscalibration or errors in pose estimation, and compensate for pulling payloads.",/pdf/4ac2c9a8d05224872f7c9b5ff9b81384148492f8.pdf,ICLR,2019,A model-based meta-RL algorithm that enables a real robot to adapt online in dynamic environments +HyxgoyHtDB,HkgWv2C_Dr,1569440000000.0,1577170000000.0,1901,Policy Optimization by Local Improvement through Search,"[""jssong@caltech.edu"", ""wenjiej@google.com"", ""ayazdan@google.com"", ""esonghori@google.com"", ""agoldie@google.com"", ""ndjaitly@google.com"", ""azalia@google.com""]","[""Jialin Song"", ""Joe Wenjie Jiang"", ""Amir Yazdanbakhsh"", ""Ebrahim Songhori"", ""Anna Goldie"", ""Navdeep Jaitly"", ""Azalia Mirhoseini""]","[""policy learning"", ""imitation learning""]","Imitation learning has emerged as a powerful strategy for learning initial policies that can be refined with reinforcement learning techniques. Most strategies in imitation learning, however, rely on per-step supervision either from expert demonstrations, referred to as behavioral cloning or from interactive expert policy queries such as DAgger. These strategies differ on the state distribution at which the expert actions are collected -- the former using the state distribution of the expert, the latter using the state distribution of the policy being trained. However, the learning signal in both cases arises from the expert actions. 
On the other end of the spectrum, approaches rooted in Policy Iteration, such as Dual Policy Iteration do not choose next step actions based on an expert, but instead use planning or search over the policy to choose an action distribution to train towards. However, this can be computationally expensive, and can also end up training the policy on a state distribution that is far from the current policy's induced distribution. In this paper, we propose an algorithm that finds a middle ground by using Monte Carlo Tree Search (MCTS) to perform local trajectory improvement over rollouts from the policy. We provide theoretical justification for both the proposed local trajectory search algorithm and for our use of MCTS as a local policy improvement operator. We also show empirically that our method (Policy Optimization by Local Improvement through Search or POLISH) is much faster than methods that plan globally, speeding up training by a factor of up to 14 in wall clock time. Furthermore, the resulting policy outperforms strong baselines in both reinforcement learning and imitation learning.",/pdf/206d22f8cf59fc1b85b22418fd2e276024278393.pdf,ICLR,2020,Monte Carlo tree search can generate short time horizon demonstrations for effective imitation learning. +BkabRiQpb,H1n-RiQpW,1508260000000.0,1519670000000.0,10,Consequentialist conditional cooperation in social dilemmas with imperfect information,"[""alexpeys@gmail.com"", ""alerer@fb.com""]","[""Alexander Peysakhovich"", ""Adam Lerer""]","[""deep reinforcement learning"", ""cooperation"", ""social dilemma"", ""multi-agent systems""]","Social dilemmas, where mutual cooperation can lead to high payoffs but participants face incentives to cheat, are ubiquitous in multi-agent interaction. We wish to construct agents that cooperate with pure cooperators, avoid exploitation by pure defectors, and incentivize cooperation from the rest. 
However, often the actions taken by a partner are (partially) unobserved or the consequences of individual actions are hard to predict. We show that in a large class of games good strategies can be constructed by conditioning one's behavior solely on outcomes (ie. one's past rewards). We call this consequentialist conditional cooperation. We show how to construct such strategies using deep reinforcement learning techniques and demonstrate, both analytically and experimentally, that they are effective in social dilemmas beyond simple matrix games. We also show the limitations of relying purely on consequences and discuss the need for understanding both the consequences of and the intentions behind an action.",/pdf/251416720d46c82d6db75a812110371d037ba986.pdf,ICLR,2018,We show how to use deep RL to construct agents that can solve social dilemmas beyond matrix games. +rJwelMbR-,B1UlxzbCb,1509130000000.0,1519460000000.0,760,Divide-and-Conquer Reinforcement Learning,"[""dibya.ghosh@berkeley.edu"", ""avisingh@cs.berkeley.edu"", ""aravraj@cs.washington.edu"", ""vikash@cs.washington.edu"", ""svlevine@eecs.berkeley.edu""]","[""Dibya Ghosh"", ""Avi Singh"", ""Aravind Rajeswaran"", ""Vikash Kumar"", ""Sergey Levine""]","[""deep reinforcement learning"", ""reinforcement learning"", ""policy gradients"", ""model-free""]","Standard model-free deep reinforcement learning (RL) algorithms sample a new initial state for each trial, allowing them to optimize policies that can perform well even in highly stochastic environments. However, problems that exhibit considerable initial state variation typically produce high-variance gradient estimates for model-free RL, making direct policy or value function optimization challenging. In this paper, we develop a novel algorithm that instead partitions the initial state space into ""slices"", and optimizes an ensemble of policies, each on a different slice. 
The ensemble is gradually unified into a single policy that can succeed on the whole state space. This approach, which we term divide-and-conquer RL, is able to solve complex tasks where conventional deep RL methods are ineffective. Our results show that divide-and-conquer RL greatly outperforms conventional policy gradient methods on challenging grasping, manipulation, and locomotion tasks, and exceeds the performance of a variety of prior methods. Videos of policies learned by our algorithm can be viewed at https://sites.google.com/view/dnc-rl/ +",/pdf/c4dfab385c417dba0ad15ed33f638135ee81a9d6.pdf,ICLR,2018, +rJgjGxrFPS,ryeHL7eKPH,1569440000000.0,1577170000000.0,2185,A Simple and Scalable Shape Representation for 3D Reconstruction,"[""78lhar@gmail.com"", ""belilovsky.eugene@gmail.com"", ""m.baktashmotlagh@uq.edu.au"", ""a.eriksson@uq.edu.au""]","[""Mateusz Michalkiewicz"", ""Eugene Belilovsky"", ""Mahsa Baktashmotagh"", ""Anders Eriksson""]","[""Computer Vision"", ""3D Reconstruction""]","Deep learning applied to the reconstruction of 3D shapes has seen growing interest. A popular approach to 3D reconstruction and generation in recent years has been the CNN decoder-encoder model often applied in voxel space. However this often scales very poorly with the resolution limiting the effectiveness of these models. Several sophisticated alternatives for decoding to 3D shapes have been proposed typically relying on alternative deep learning architectures. We show however in this work that standard benchmarks in 3D reconstruction can be tackled with a surprisingly simple approach: a linear decoder obtained by principal component analysis on the signed distance transform of the surface. This approach allows easily scaling to larger resolutions. We show in multiple experiments it is competitive with state of the art methods and also allows the decoder to be fine-tuned on the target task using a loss designed for SDF transforms, obtaining further gains. 
",/pdf/d52f3c11ffbfb294ac6751c02fd617de4875cad7.pdf,ICLR,2020,We show that a shape representation based on applying PCA to the signed distance transform can be effective for shape inference tasks. +BkSqjHqxg,,1478280000000.0,1484100000000.0,268,Skip-graph: Learning graph embeddings with an encoder-decoder model,"[""jtlee@wpi.edu"", ""xkong@wpi.edu""]","[""John Boaz Lee"", ""Xiangnan Kong""]","[""Unsupervised Learning"", ""Deep learning""]","In this work, we study the problem of feature representation learning for graph-structured data. Many of the existing work in the area are task-specific and based on supervised techniques. We study a method for obtaining a generic feature representation for a graph using an unsupervised approach. The neural encoder-decoder model is a method that has been used in the natural language processing domain to learn feature representations of sentences. In our proposed approach, we train the encoder-decoder model to predict the random walk sequence of neighboring regions in a graph given a random walk along a particular region. The goal is to map subgraphs — as represented by their random walks — that are structurally and functionally similar to nearby locations in feature space. We evaluate the learned graph vectors using several real-world datasets on the graph classification task. The proposed model is able to achieve good results against state-of- the-art techniques.",/pdf/954ddff2afa7d65560f37fe0b0e1b08b0a75a739.pdf,ICLR,2017,An unsupervised method for generating graph feature representations based on the encoder-decoder model. 
+HyzdRiR9Y7,SyeafOTqFm,1538090000000.0,1551800000000.0,910,Universal Transformers,"[""dehghani@uva.nl"", ""sgouws@google.com"", ""vinyals@google.com"", ""usz@google.com"", ""lukaszkaiser@google.com""]","[""Mostafa Dehghani"", ""Stephan Gouws"", ""Oriol Vinyals"", ""Jakob Uszkoreit"", ""Lukasz Kaiser""]","[""sequence-to-sequence"", ""rnn"", ""transformer"", ""machine translation"", ""language understanding"", ""learning to execute""]","Recurrent neural networks (RNNs) sequentially process data by updating their state with each new data point, and have long been the de facto choice for sequence modeling tasks. However, their inherently sequential computation makes them slow to train. Feed-forward and convolutional architectures have recently been shown to achieve superior results on some sequence modeling tasks such as machine translation, with the added advantage that they concurrently process all inputs in the sequence, leading to easy parallelization and faster training times. Despite these successes, however, popular feed-forward sequence models like the Transformer fail to generalize in many simple tasks that recurrent models handle with ease, e.g. copying strings or even simple logical inference when the string or formula lengths exceed those observed at training time. We propose the Universal Transformer (UT), a parallel-in-time self-attentive recurrent sequence model which can be cast as a generalization of the Transformer model and which addresses these issues. UTs combine the parallelizability and global receptive field of feed-forward sequence models like the Transformer with the recurrent inductive bias of RNNs. We also add a dynamic per-position halting mechanism and find that it improves accuracy on several tasks. In contrast to the standard Transformer, under certain assumptions UTs can be shown to be Turing-complete. 
Our experiments show that UTs outperform standard Transformers on a wide range of algorithmic and language understanding tasks, including the challenging LAMBADA language modeling task where UTs achieve a new state of the art, and machine translation where UTs achieve a 0.9 BLEU improvement over Transformers on the WMT14 En-De dataset.",/pdf/6ee41939003eaa38439a2607d081864b4ba5fea4.pdf,ICLR,2019,"We introduce the Universal Transformer, a self-attentive parallel-in-time recurrent sequence model that outperforms Transformers and LSTMs on a wide range of sequence-to-sequence tasks, including machine translation." +HkgBsaVtDB,S1gHo-1uwH,1569440000000.0,1577170000000.0,742,Unified recurrent network for many feature types,"[""stec@u.northwestern.edu"", ""d-klabjan@northwestern.edu"", ""jutke@allstate.com""]","[""Alexander Stec"", ""Diego Klabjan"", ""Jean Utke""]","[""sparse"", ""recurrent"", ""asynchronous"", ""time"", ""series""]","There are time series that are amenable to recurrent neural network (RNN) solutions when treated as sequences, but some series, e.g. asynchronous time series, provide a richer variation of feature types than current RNN cells take into account. In order to address such situations, we introduce a unified RNN that handles five different feature types, each in a different manner. Our RNN framework separates sequential features into two groups dependent on their frequency, which we call sparse and dense features, and which affect cell updates differently. Further, we also incorporate time features at the sequential level that relate to the time between specified events in the sequence and are used to modify the cell's memory state. We also include two types of static (whole sequence level) features, one related to time and one not, which are combined with the encoder output. 
The experiments show that the proposed modeling framework does increase performance compared to standard cells.",/pdf/6b39d8d363460b3e76e7127ed3b73d507bbeec74.pdf,ICLR,2020,"We introduce a unified RNN that handles five different feature types, each in a different manner." +dvSExzhjG9D,zH5JUrYs4fD,1601310000000.0,1614990000000.0,188,MLR-SNet: Transferable LR Schedules for Heterogeneous Tasks,"[""~Jun_Shu1"", ""zywwyz@stu.xjtu.edu.cn"", ""~Qian_Zhao1"", ""~Deyu_Meng1"", ""~Zongben_Xu1""]","[""Jun Shu"", ""Yanwen Zhu"", ""Qian Zhao"", ""Deyu Meng"", ""Zongben Xu""]","[""Meta Learning"", ""Hyperparameters Learning"", ""Generalization on Tasks"", ""Optimization"", ""LR Schedules Learning"", ""DNNs Training""]","The learning rate (LR) is one of the most important hyper-parameters in stochastic gradient descent (SGD) for deep neural networks (DNN) training and generalization. However, current hand-designed LR schedules need to manually pre-specify a fixed form, which limits their ability to adapt to non-convex optimization problems due to the significant variation of training dynamics. Meanwhile, it always needs to search a proper LR schedule from scratch for new tasks. To address these issues, we propose to parameterize LR schedules with an explicit mapping formulation, called MLR-SNet. The learnable structure brings more flexibility for MLR-SNet to learn a proper LR schedule to comply with the training dynamics of DNN. Image and text classification benchmark experiments substantiate the capability of our method for achieving proper LR schedules. Moreover, the meta-learned MLR-SNet is tuning-free plug-and-play to generalize to new heterogeneous tasks. We transfer our meta-trained MLR-SNet to tasks like different training epochs, network architectures, datasets, especially large scale ImageNet dataset, and achieve comparable performance with hand-designed LR schedules. Finally, MLR-Net can achieve better robustness when training data is biased with corrupted noise. 
",/pdf/65e42b516021a9f16cf1041cb206307aa233838c.pdf,ICLR,2021,"We propose a transferable LR schedules, MLR-SNet, which is plug and play for adapting heterogeneous tasks." +RdhjoXl-SDG,q7r45J2HAq,1601310000000.0,1614990000000.0,2428,Multiscale Invertible Generative Networks for High-Dimensional Bayesian Inference,"[""~Shumao_Zhang1"", ""~Thomas_Hou2"", ""~Pengchuan_Zhang1""]","[""Shumao Zhang"", ""Thomas Hou"", ""Pengchuan Zhang""]",[],"High-dimensional Bayesian inference problems cast a long-standing challenge in generating samples, especially when the posterior has multiple modes. For a wide class of Bayesian inference problems equipped with the multiscale structure that low-dimensional (coarse-scale) surrogate can approximate the original high-dimensional (fine-scale) problem well, we propose to train a Multiscale Invertible Generative Network (MsIGN) for sample generation. A novel prior conditioning layer is designed to bridge networks at different resolutions, enabling coarse-to-fine multi-stage training. Jeffreys divergence is adopted as the training objective to avoid mode dropping. On two high-dimensional Bayesian inverse problems, MsIGN approximates the posterior accurately and clearly captures multiple modes, showing superior performance compared with previous deep generative network approaches. 
On the natural image synthesis task, MsIGN achieves the superior performance in bits-per-dimension compared with our baseline models and yields great interpretability of its neurons in intermediate layers.",/pdf/c1d96c7db981128a60eb30ab88e5706ed7d54e6c.pdf,ICLR,2021, +HkUfnZFt1Rw,DSGHiRclbdI,1601310000000.0,1614990000000.0,1180,Dissecting graph measures performance for node clustering in LFR parameter space,"[""~Vladimir_Ivashkin1"", ""pavel4e@gmail.com""]","[""Vladimir Ivashkin"", ""Pavel Chebotarev""]","[""graph theory"", ""graph measures"", ""kernel k-means"", ""clustering""]","Graph measures can be used for graph node clustering using metric clustering algorithms. There are multiple measures applicable to this task, and which one performs better is an open question. We study the performance of 25 graph measures on generated graphs with different parameters. While usually measure comparisons are limited to general measure ranking on a particular dataset, we aim to explore the performance of various measures depending on graph features. Using an LFR generator, we create a dataset of ~7500 graphs covering the whole LFR parameter space. For each graph, we assess the quality of clustering with k-means algorithm for every considered measure. We determine the best measure for every area of the parameter space. We find that the parameter space consists of distinct zones where one particular measure is the best. We analyze the geometry of the resulting zones and describe it with simple criteria. Given particular graph parameters, this allows us to choose the best measure to use for clustering.",/pdf/08307383bc73e00148a470266d94f1be0be4cbdb.pdf,ICLR,2021,We investigated graph features space and found zones of leadership for several graph node measures in node clustering task. 
+Hk8TGSKlg,,1478210000000.0,1488470000000.0,95,Reasoning with Memory Augmented Neural Networks for Language Comprehension,"[""tsendsuren.munkhdalai@umassmed.edu"", ""hong.yu@umassmed.edu""]","[""Tsendsuren Munkhdalai"", ""Hong Yu""]","[""Natural language processing"", ""Deep learning""]","Hypothesis testing is an important cognitive process that supports human reasoning. In this paper, we introduce a computational hypothesis testing approach based on memory augmented neural networks. Our approach involves a hypothesis testing loop that reconsiders and progressively refines a previously formed hypothesis in order to generate new hypotheses to test. We apply the proposed approach to language comprehension task by using Neural Semantic Encoders (NSE). Our NSE models achieve the state-of-the-art results showing an absolute improvement of 1.2% to 2.6% accuracy over previous results obtained by single and ensemble systems on standard machine comprehension benchmarks such as the Children's Book Test (CBT) and Who-Did-What (WDW) news article datasets.",/pdf/6cf39071d41e605ab0e3b4c6657aa5769393dfa4.pdf,ICLR,2017, +_bF8aOMNIdu,pEE1eTvCRsa,1601310000000.0,1614990000000.0,1207,Robust Temporal Ensembling,"[""~Abel_Brown1"", ""~Benedikt_Schifferer2"", ""~Robert_DiPietro1""]","[""Abel Brown"", ""Benedikt Schifferer"", ""Robert DiPietro""]","[""learning with noise"", ""robust task loss"", ""consistency regularization""]","Successful training of deep neural networks with noisy labels is an essential capability as most real-world datasets contain some amount of mislabeled data. Left unmitigated, label noise can sharply degrade typical supervised learning approaches. In this paper, we present robust temporal ensembling (RTE), a simple supervised learning approach which combines robust task loss, temporal pseudo-labeling, and a new ensemble consistency regularization term to achieve noise-robust learning. 
We demonstrate that RTE achieves state-of-the-art performance across the CIFAR-10, CIFAR-100, and ImageNet datasets, while forgoing the recent trend of label filtering/fixing. In particular, RTE achieves 93.64% accuracy on CIFAR-10 and 66.43% accuracy on CIFAR-100 under 80% label corruption, and achieves 74.79% accuracy on ImageNet under 40% corruption. These are substantial gains over previous state-of-the-art accuracies of 86.6%, 60.2%, and 71.31%, respectively, achieved using three distinct methods. Finally, we show that RTE retains competitive corruption robustness to unforeseen input noise using CIFAR-10-C, obtaining a mean corruption error (mCE) of 13.50% even in the presence of an 80% noise ratio, versus 26.9% mCE with standard methods on clean data.",/pdf/3e60b4e9e00f5713e56b9fdfa43969e2201fd08a.pdf,ICLR,2021,"We present robust temporal ensembling (RTE), a state-of-the-art method for learning with noisy labels that combines robust task loss, temporal pseudo-labeling, and a new form of consistency regularization." 
+US-TP-xnXI,gei1B_XwS67,1601310000000.0,1615920000000.0,2229,Structured Prediction as Translation between Augmented Natural Languages,"[""~Giovanni_Paolini1"", ""~Ben_Athiwaratkun1"", ""kronej@amazon.com"", ""~Jie_Ma3"", ""~Alessandro_Achille1"", ""~RISHITA_ANUBHAI2"", ""~Cicero_Nogueira_dos_Santos1"", ""~Bing_Xiang2"", ""~Stefano_Soatto1""]","[""Giovanni Paolini"", ""Ben Athiwaratkun"", ""Jason Krone"", ""Jie Ma"", ""Alessandro Achille"", ""RISHITA ANUBHAI"", ""Cicero Nogueira dos Santos"", ""Bing Xiang"", ""Stefano Soatto""]","[""language models"", ""few-shot learning"", ""transfer learning"", ""structured prediction"", ""generative modeling"", ""sequence to sequence"", ""multi-task learning""]","We propose a new framework, Translation between Augmented Natural Languages (TANL), to solve many structured prediction language tasks including joint entity and relation extraction, nested named entity recognition, relation classification, semantic role labeling, event extraction, coreference resolution, and dialogue state tracking. Instead of tackling the problem by training task-specific discriminative classifiers, we frame it as a translation task between augmented natural languages, from which the task-relevant information can be easily extracted. Our approach can match or outperform task-specific models on all tasks, and in particular achieves new state-of-the-art results on joint entity and relation extraction (CoNLL04, ADE, NYT, and ACE2005 datasets), relation classification (FewRel and TACRED), and semantic role labeling (CoNLL-2005 and CoNLL-2012). We accomplish this while using the same architecture and hyperparameters for all tasks, and even when training a single model to solve all tasks at the same time (multi-task learning). 
Finally, we show that our framework can also significantly improve the performance in a low-resource regime, thanks to better use of label semantics.",/pdf/6d920934176b3cc6eddc9d925f48b786787c9a92.pdf,ICLR,2021,"We propose a unified text-to-text approach to handle a variety of structured prediction tasks in a single model, allowing seamless multi-task training and providing extra benefits on low-resource scenarios. " +e12NDM7wkEY,BV9L1HRE-5p6,1601310000000.0,1615960000000.0,1734,Clustering-friendly Representation Learning via Instance Discrimination and Feature Decorrelation,"[""~Yaling_Tao1"", ""kentaro1.takagi@toshiba.co.jp"", ""kouta.nakata@toshiba.co.jp""]","[""Yaling Tao"", ""Kentaro Takagi"", ""Kouta Nakata""]","[""clustering"", ""representation learning"", ""deep embedding""]","Clustering is one of the most fundamental tasks in machine learning. Recently, deep clustering has become a major trend in clustering techniques. Representation learning often plays an important role in the effectiveness of deep clustering, and thus can be a principal cause of performance degradation. In this paper, we propose a clustering-friendly representation learning method using instance discrimination and feature decorrelation. Our deep-learning-based representation learning method is motivated by the properties of classical spectral clustering. Instance discrimination learns similarities among data and feature decorrelation removes redundant correlation among features. We utilize an instance discrimination method in which learning individual instance classes leads to learning similarity among instances. Through detailed experiments and examination, we show that the approach can be adapted to learning a latent space for clustering. We design novel softmax-formulated decorrelation constraints for learning. In evaluations of image clustering using CIFAR-10 and ImageNet-10, our method achieves accuracy of 81.5% and 95.4%, respectively. 
We also show that the softmax-formulated constraints are compatible with various neural networks.",/pdf/a66aec8c00bfa723a6c8e4690bd00cf755e17e8f.pdf,ICLR,2021,"We present a clustering-friendly representation learning method using instance discrimination and feature decorrelation, which achieves accuracy of 81.5% and 95.4% on CIFAR-10 and ImageNet-10, respectively, far above state-of-the-art values." +enhd0P_ERBO,nb7bGlOr9Et,1601310000000.0,1614990000000.0,3202,Learning a Transferable Scheduling Policy for Various Vehicle Routing Problems based on Graph-centric Representation Learning,"[""inwukkim@gmail.com"", ""~Jinkyoo_Park1""]","[""Inwook Kim"", ""Jinkyoo Park""]","[""Vehicle Routing Problem"", ""Multiple Traveling Salesmen Problem"", ""Capacitated Vehicle Routing Problem"", ""Reinforcement Learning"", ""Graph Neural Network""]","Reinforcement learning has been used to learn to solve various routing problems. However, most of these algorithms are restricted to finding an optimal routing strategy for only a single vehicle. In addition, the trained policy under a specific target routing problem is not able to solve different types of routing problems with different objectives and constraints. This paper proposes a reinforcement learning approach to solve the min-max capacitated multi vehicle routing problem (mCVRP), which seeks to minimize the total completion time for multiple vehicles whose one-time traveling distance is constrained by their fuel levels to serve the geographically distributed customer nodes. The method represents the relationships among vehicles, customers, and fuel stations using relationship-specific graphs to consider their topological relationships and employs a graph neural network (GNN) to extract the graph's embedding to be used to make a routing action. We train the proposed model using random mCVRP instances with different numbers of vehicles, customers, and refueling stations. 
We then validate that the trained policy solves not only new mCVRP problems with different complexity (weak transferability) but also different routing problems (CVRP, mTSP, TSP) with different objectives and constraints (strong transferability). ",/pdf/fa8f3d617736dbd46fcd9e161503259ea04b518c.pdf,ICLR,2021,"This study proposes the graph-centric, RL-based transferable scheduler for various vehicle routing problems using graph-centric state presentation (GRLTS) that can solve any types of vehicle routing problems such as mCVRP, mTSP, CVRP, and TSP." +BkfPnoActQ,SJeTlEo5KQ,1538090000000.0,1545360000000.0,719,Towards Consistent Performance on Atari using Expert Demonstrations,"[""pohlen@google.com"", ""piot@google.com"", ""toddhester@google.com"", ""mazar@google.com"", ""horgan@google.com"", ""budden@google.com"", ""gabrielbm@google.com"", ""hado@google.com"", ""johnquan@google.com"", ""vec@google.com"", ""mtthss@google.com"", ""munos@google.com"", ""pietquin@google.com""]","[""Tobias Pohlen"", ""Bilal Piot"", ""Todd Hester"", ""Mohammad Gheshlaghi Azar"", ""Dan Horgan"", ""David Budden"", ""Gabriel Barth-Maron"", ""Hado van Hasselt"", ""John Quan"", ""Mel Ve\u010der\u00edk"", ""Matteo Hessel"", ""R\u00e9mi Munos"", ""Olivier Pietquin""]","[""Reinforcement Learning"", ""Atari"", ""RL"", ""Demonstrations""]","Despite significant advances in the field of deep Reinforcement Learning (RL), today's algorithms still fail to learn human-level policies consistently over a set of diverse tasks such as Atari 2600 games. We identify three key challenges that any algorithm needs to master in order to perform well on all games: processing diverse reward distributions, reasoning over long time horizons, and exploring efficiently. In this paper, we propose an algorithm that addresses each of these challenges and is able to learn human-level policies on nearly all Atari games. 
A new transformed Bellman operator allows our algorithm to process rewards of varying densities and scales; an auxiliary temporal consistency loss allows us to train stably using a discount factor of 0.999 (instead of 0.99) extending the effective planning horizon by an order of magnitude; and we ease the exploration problem by using human demonstrations that guide the agent towards rewarding states. When tested on a set of 42 Atari games, our algorithm exceeds the performance of an average human on 40 games using a common set of hyper parameters.",/pdf/be4b43074e9c1a938b1fe4964d97009503d49418.pdf,ICLR,2019,Ape-X DQfD = Distributed (many actors + one learner + prioritized replay) DQN with demonstrations optimizing the unclipped 0.999-discounted return on Atari. +rk8R_JWRW,B1I0dJZ0-,1509120000000.0,1518730000000.0,495,Gating out sensory noise in a spike-based Long Short-Term Memory network,"[""d.zambrano@cwi.nl"", ""isabella.pozzi@cwi.nl"", ""roeland.nusselder@gmail.com"", ""s.m.bohte@cwi.nl""]","[""Davide Zambrano"", ""Isabella Pozzi"", ""Roeland Nusselder"", ""Sander Bohte""]","[""spiking neural networks"", ""LSTM"", ""recurrent neural networks""]","Spiking neural networks are being investigated both as biologically plausible models of neural computation and also as a potentially more efficient type of neural network. While convolutional spiking neural networks have been demonstrated to achieve near state-of-the-art performance, only one solution has been proposed to convert gated recurrent neural networks, so far. +Recurrent neural networks in the form of networks of gating memory cells have been central in state-of-the-art solutions in problem domains that involve sequence recognition or generation. Here, we design an analog gated LSTM cell where its neurons can be substituted for efficient stochastic spiking neurons. 
These adaptive spiking neurons implement an adaptive form of sigma-delta coding to convert internally computed analog activation values to spike-trains. For such neurons, we approximate the effective activation function, which resembles a sigmoid. We show how analog neurons with such activation functions can be used to create an analog LSTM cell; networks of these cells can then be trained with standard backpropagation. We train these LSTM networks on a noisy and noiseless version of the original sequence prediction task from Hochreiter & Schmidhuber (1997), and also on a noisy and noiseless version of a classical working memory reinforcement learning task, the T-Maze. Substituting the analog neurons for corresponding adaptive spiking neurons, we then show that almost all resulting spiking neural network equivalents correctly compute the original tasks.",/pdf/00d851a73c1f010821e0893cc3161eaa7c442fc0.pdf,ICLR,2018, We demonstrate a gated recurrent asynchronous spiking neural network that corresponds to an LSTM unit. +rye7knCqK7,SygohGe9KQ,1538090000000.0,1550860000000.0,970,Learning when to Communicate at Scale in Multiagent Cooperative and Competitive Tasks,"[""amanpreet@nyu.edu"", ""tushar@nyu.edu"", ""sainbar@cs.nyu.edu""]","[""Amanpreet Singh"", ""Tushar Jain"", ""Sainbayar Sukhbaatar""]","[""multiagent"", ""communication"", ""competitive"", ""cooperative"", ""continuous"", ""emergent"", ""reinforcement learning""]","Learning when to communicate and doing that effectively is essential in multi-agent tasks. Recent works show that continuous communication allows efficient training with back-propagation in multi-agent scenarios, but have been restricted to fully-cooperative tasks. In this paper, we present Individualized Controlled Continuous Communication Model (IC3Net) which has better training efficiency than simple continuous communication model, and can be applied to semi-cooperative and competitive settings along with the cooperative settings. 
IC3Net controls continuous communication with a gating mechanism and uses individualized rewards for each agent to gain better performance and scalability while fixing credit assignment issues. Using a variety of tasks including StarCraft BroodWars explore and combat scenarios, we show that our network yields better performance and convergence rates than the baselines as the scale increases. Our results convey that IC3Net agents learn when to communicate based on the scenario and profitability.",/pdf/8c95a9ab59624b4389b81b980133f85ff0b4fe69.pdf,ICLR,2019,"We introduce IC3Net, a single network which can be used to train agents in cooperative, competitive and mixed scenarios. We also show that agents can learn when to communicate using our model." +ryeRn3NtPH,BylcM6nGwB,1569440000000.0,1577170000000.0,209,Adversarial Inductive Transfer Learning with input and output space adaptation,"[""hsharifi@sfu.ca"", ""shumanp@sfu.ca"", ""ozolotareva@techfak.uni-bielefeld.de"", ""ccollins@prostatecentre.com"", ""ester@sfu.ca""]","[""Hossein Sharifi-Noghabi"", ""Shuman Peng"", ""Olga Zolotareva"", ""Colin C. Collins"", ""Martin Ester""]","[""Inductive transfer learning"", ""adversarial learning"", ""multi-task learning"", ""pharmacogenomics"", ""precision oncology""]","We propose Adversarial Inductive Transfer Learning (AITL), a method for addressing discrepancies in input and output spaces between source and target domains. AITL utilizes adversarial domain adaptation and multi-task learning to address these discrepancies. Our motivating application is pharmacogenomics where the goal is to predict drug response in patients using their genomic information. The challenge is that clinical data (i.e. patients) with drug response outcome is very limited, creating a need for transfer learning to bridge the gap between large pre-clinical pharmacogenomics datasets (e.g. cancer cell lines) and clinical datasets. 
Discrepancies exist between 1) the genomic data of pre-clinical and clinical datasets (the input space), and 2) the different measures of the drug response (the output space). To the best of our knowledge, AITL is the first adversarial inductive transfer learning method to address both input and output discrepancies. Experimental results indicate that AITL outperforms state-of-the-art pharmacogenomics and transfer learning baselines and may guide precision oncology more accurately.",/pdf/8501a0a10f8badb29128089fe4923bc314417d4a.pdf,ICLR,2020,A novel method of inductive transfer learning that employs adversarial learning and multi-task learning to address the discrepancy in input and output space +BkePHaVKwS,HkeN4vLPPS,1569440000000.0,1577170000000.0,527,Learning Surrogate Losses,"[""josif@ismll.uni-hildesheim.de"", ""rscholz@ismll.uni-hildesheim.de"", ""schmidt-thieme@ismll.uni-hildesheim.de""]","[""Josif Grabocka"", ""Randolf Scholz"", ""Lars Schmidt-Thieme""]","[""Surrogate losses"", ""Non-differentiable losses""]","The minimization of loss functions is the heart and soul of Machine Learning. In this paper, we propose an off-the-shelf optimization approach that can seamlessly minimize virtually any non-differentiable and non-decomposable loss function (e.g. Misclassification Rate, AUC, F1, Jaccard Index, Matthews Correlation Coefficient, etc.). Our strategy learns smooth relaxation versions of the true losses by approximating them through a surrogate neural network. The proposed loss networks are set-wise models which are invariant to the order of mini-batch instances. Ultimately, the surrogate losses are learned jointly with the prediction model via bilevel optimization. 
Empirical results on multiple datasets with diverse real-life loss functions compared with state-of-the-art baselines demonstrate the efficiency of learning surrogate losses.",/pdf/3a01fadccde84522b35e9e36372a05ba6ea8b171.pdf,ICLR,2020,Optimizing Surrogate Loss Functions +HkEI22jeg,,1478380000000.0,1488630000000.0,582,Multilayer Recurrent Network Models of Primate Retinal Ganglion Cell Responses,"[""erb2180@columbia.edu"", ""jsmerel@gmail.com"", ""nbrack@stanford.edu"", ""alexkenheitmen@gmail.com"", ""sashake3@uscs.edu"", ""Alan.Litke@cern.ch"", ""ej@stanford.edu"", ""liam@stat.columbia.edu""]","[""Eleanor Batty"", ""Josh Merel"", ""Nora Brackbill"", ""Alexander Heitman"", ""Alexander Sher"", ""Alan Litke"", ""E.J. Chichilnisky"", ""Liam Paninski""]","[""Deep learning"", ""Applications""]","Developing accurate predictive models of sensory neurons is vital to understanding sensory processing and brain computations. The current standard approach to modeling neurons is to start with simple models and to incrementally add interpretable features. An alternative approach is to start with a more complex model that captures responses accurately, and then probe the fitted model structure to understand the neural computations. Here, we show that a multitask recurrent neural network (RNN) framework provides the flexibility necessary to model complex computations of neurons that cannot be captured by previous methods. Specifically, multilayer recurrent neural networks that share features across neurons outperform generalized linear models (GLMs) in predicting the spiking responses of parasol ganglion cells in the primate retina to natural images. The networks achieve good predictive performance given surprisingly small amounts of experimental training data. 
Additionally, we present a novel GLM-RNN hybrid model with separate spatial and temporal processing components which provides insights into the aspects of retinal processing better captured by the recurrent neural networks.",/pdf/2daca8b06ab9bde6f8f7cc254b614e0a6d93e26e.pdf,ICLR,2017, +HJg3Rp4FwH,S1eh9xG_PH,1569440000000.0,1577170000000.0,871,Policy Optimization In the Face of Uncertainty,"[""longvt94@vnu.edu.vn"", ""hann1@andrew.cmu.edu"", ""htpham@cs.cmu.edu"", ""ktran@microsoft.com""]","[""Tung-Long Vuong"", ""Han Nguyen"", ""Hai Pham"", ""Kenneth Tran""]","[""Reinforcement Learning"", ""Model-based Reinforcement Learning""]","Model-based reinforcement learning has the potential to be more sample efficient than model-free approaches. However, existing model-based methods are vulnerable to model bias, which leads to poor generalization and asymptotic performance compared to model-free counterparts. In this paper, we propose a novel policy optimization framework using an uncertainty-aware objective function to handle those issues. In this framework, the agent simultaneously learns an uncertainty-aware dynamics model and optimizes the policy according to these learned models. Under this framework, the objective function can be represented end-to-end as a single computational graph, which allows seamless policy gradient computation via backpropagation through the models. 
In addition to being theoretically sound, our approach shows promising results on challenging continuous control benchmarks with competitive asymptotic performance and sample complexity compared to state-of-the-art baselines.",/pdf/6ed6f44745b6c760e5c7bb969760fcfd9ed521f1.pdf,ICLR,2020, +S1el9TEKPB,Skx2rDawwB,1569440000000.0,1577170000000.0,694,Sparsity Meets Robustness: Channel Pruning for the Feynman-Kac Formalism Principled Robust Deep Neural Nets,"[""thud2@uci.edu"", ""wangbaonj@gmail.com"", ""bertozzi@math.ucla.edu"", ""sjo@math.ucla.edu"", ""jxin@math.uci.edu""]","[""Thu Dinh*"", ""Bao Wang*"", ""Andrea L. Bertozzi"", ""Stanley J. Osher"", ""Jack Xin""]","[""Sparse Network"", ""Model Compression"", ""Adversarial Training""]","Deep neural nets (DNNs) compression is crucial for adaptation to mobile devices. Though many successful algorithms exist to compress naturally trained DNNs, developing efficient and stable compression algorithms for robustly trained DNNs remains widely open. In this paper, we focus on a co-design of efficient DNN compression algorithms and sparse neural architectures for robust and accurate deep learning. Such a co-design enables us to advance the goal of accommodating both sparsity and robustness. With this objective in mind, we leverage the relaxed augmented Lagrangian based algorithms to prune the weights of adversarially trained DNNs, at both structured and unstructured levels. Using a Feynman-Kac formalism principled robust and sparse DNNs, we can at least double the channel sparsity of the adversarially trained ResNet20 for CIFAR10 classification, meanwhile, improve the natural accuracy by 8.69\% and the robust accuracy under the benchmark 20 iterations of IFGSM attack by 5.42\%.",/pdf/89f2e45791b7387212fdc0017bcece7b43d2870c.pdf,ICLR,2020,We focus on a co-design of efficient DNN compression algorithms and sparse neural architectures for robust and accurate deep learning. 
Such a co-design enables us to advance the goal of accommodating both sparsity and robustness. +SJOl4DlCZ,Skdg4wg0Z,1509090000000.0,1518730000000.0,290,Classifier-to-Generator Attack: Estimation of Training Data Distribution from Classifier,"[""cocuh@mdl.cs.tsukuba.ac.jp"", ""jun@cs.tsukuba.ac.jp""]","[""Kosuke Kusano"", ""Jun Sakuma""]","[""Security"", ""Privacy"", ""Model Publication"", ""Generative Adversarial Networks""]","Suppose a deep classification model is trained with samples that need to be kept private for privacy or confidentiality reasons. In this setting, can an adversary obtain the private samples if the classification model is given to the adversary? We call this reverse engineering against the classification model the Classifier-to-Generator (C2G) Attack. This situation arises when the classification model is embedded into mobile devices for offline prediction (e.g., object recognition for the automatic driving car and face recognition for mobile phone authentication). +For C2G attack, we introduce a novel GAN, PreImageGAN. In PreImageGAN, the generator is designed to estimate the sample distribution conditioned by the preimage of classification model $f$, $P(X|f(X)=y)$, where $X$ is the random variable on the sample space and $y$ is the probability vector representing the target label arbitrarily specified by the adversary. In experiments, we demonstrate PreImageGAN works successfully with hand-written character recognition and face recognition. In character recognition, we show that, given a recognition model of hand-written digits, PreImageGAN allows the adversary to extract alphabet letter images without knowing that the model is built for alphabet letter images. 
In face recognition, we show that, when an adversary obtains a face recognition model for a set of individuals, PreImageGAN allows the adversary to extract face images of specific individuals contained in the set, even when the adversary has no knowledge of the face of the individuals.",/pdf/82c4367a58a5c015604624cd792443160f06d0cf.pdf,ICLR,2018,Estimation of training data distribution from trained classifier using GAN. +H1lKd6NYPS,r1gmOMhDvB,1569440000000.0,1577170000000.0,641,Online Meta-Critic Learning for Off-Policy Actor-Critic Methods,"[""zhouwei14@nudt.edu.cn"", ""liyiying10@nudt.edu.cn"", ""yongxin.yang@ed.ac.uk"", ""hmwang@nudt.edu.cn"", ""t.hospedales@ed.ac.uk""]","[""Wei Zhou"", ""Yiying Li"", ""Yongxin Yang"", ""Huaimin Wang"", ""Timothy M. Hospedales""]","[""off-policy actor-critic"", ""reinforcement learning"", ""meta-learning""]","Off-Policy Actor-Critic (Off-PAC) methods have proven successful in a variety of continuous control tasks. Normally, the critic’s action-value function is updated using temporal-difference, and the critic in turn provides a loss for the actor that trains it to take actions with higher expected return. In this paper, we introduce a novel and flexible meta-critic that observes the learning process and meta-learns an additional loss for the actor that accelerates and improves actor-critic learning. Compared to the vanilla critic, the meta-critic network is explicitly trained to accelerate the learning process; and compared to existing meta-learning algorithms, meta-critic is rapidly learned online for a single task, rather than slowly over a family of tasks. Crucially, our meta-critic framework is designed for off-policy based learners, which currently provide state-of-the-art reinforcement learning sample efficiency. We demonstrate that online meta-critic learning leads to improvements in a variety of continuous control environments when combined with contemporary Off-PAC methods DDPG, TD3 and the state-of-the-art SAC. 
",/pdf/d227d9672f01c43100cdfffd4c51fb20950c5432.pdf,ICLR,2020,"We present Meta-Critic, an auxiliary critic module for off-policy actor-critic methods that can be meta-learned online during single task learning." +ryxPbkrtvr,H1em3jjODS,1569440000000.0,1577170000000.0,1547,BOSH: An Efficient Meta Algorithm for Decision-based Attacks,"[""alanshawzju@gmail.com"", ""pydyang@ucdavis.edu"", ""jyc@zju.edu.cn"", ""kw@kwchang.net"", ""chohsieh@cs.ucla.edu""]","[""Zhenxin Xiao"", ""Puyudi Yang"", ""Yuchen Jiang"", ""Kai-Wei Chang"", ""Cho-Jui Hsieh""]",[],"Adversarial example generation becomes a viable method for evaluating the robustness of a machine learning model. In this paper, we consider hard-label black- box attacks (a.k.a. decision-based attacks), which is a challenging setting that generates adversarial examples based on only a series of black-box hard-label queries. This type of attacks can be used to attack discrete and complex models, such as Gradient Boosting Decision Tree (GBDT) and detection-based defense models. Existing decision-based attacks based on iterative local updates often get stuck in a local minimum and fail to generate the optimal adversarial example with the smallest distortion. To remedy this issue, we propose an efficient meta algorithm called BOSH-attack, which tremendously improves existing algorithms through Bayesian Optimization (BO) and Successive Halving (SH). In particular, instead of traversing a single solution path when searching an adversarial example, we maintain a pool of solution paths to explore important regions. 
We show empirically that the proposed algorithm converges to a better solution than existing approaches, while the query count is smaller than applying multiple random initializations by a factor of 10.",/pdf/d9963b33e46e848ba315947f37be89600b008e26.pdf,ICLR,2020, +B1ffQnRcKX,rJlr9aj5Y7,1538090000000.0,1557320000000.0,1340,Automatically Composing Representation Transformations as a Means for Generalization,"[""mbchang@berkeley.edu"", ""abhigupta@berkeley.edu"", ""svlevine@eecs.berkeley.edu"", ""tom_griffiths@berkeley.edu""]","[""Michael Chang"", ""Abhishek Gupta"", ""Sergey Levine"", ""Thomas L. Griffiths""]","[""compositionality"", ""deep learning"", ""metareasoning""]","A generally intelligent learner should generalize to more complex tasks than it has previously encountered, but the two common paradigms in machine learning -- either training a separate learner per task or training a single learner for all tasks -- both have difficulty with such generalization because they do not leverage the compositional structure of the task distribution. This paper introduces the compositional problem graph as a broadly applicable formalism to relate tasks of different complexity in terms of problems with shared subproblems. We propose the compositional generalization problem for measuring how readily old knowledge can be reused and hence built upon. As a first step for tackling compositional generalization, we introduce the compositional recursive learner, a domain-general framework for learning algorithmic procedures for composing representation transformations, producing a learner that reasons about what computation to execute by making analogies to previously seen problems. 
We show on a symbolic and a high-dimensional domain that our compositional approach can generalize to more complex problems than the learner has previously encountered, whereas baselines that are not explicitly compositional do not.",/pdf/086663869ea0366468eec4c42a0c6eec539e154c.pdf,ICLR,2019,We explore the problem of compositional generalization and propose a means for endowing neural network architectures with the ability to compose themselves to solve these problems. +0_ao8yS2eBw,8mf9NzRssZ9,1601310000000.0,1614990000000.0,3211,Solving NP-Hard Problems on Graphs with Extended AlphaGo Zero,"[""~Kenshin_Abe1"", ""~Zijian_Xu1"", ""~Issei_Sato1"", ""~Masashi_Sugiyama1""]","[""Kenshin Abe"", ""Zijian Xu"", ""Issei Sato"", ""Masashi Sugiyama""]","[""Graph neural network"", ""Combinatorial optimization"", ""Reinforcement learning""]","There have been increasing challenges to solve combinatorial optimization problems by machine learning. +Khalil et al. (NeurIPS 2017) proposed an end-to-end reinforcement learning framework, which automatically learns graph embeddings to construct solutions to a wide range of problems. +However, it sometimes performs poorly on graphs having different characteristics than training graphs. +To improve its generalization ability to various graphs, we propose a novel learning strategy based on AlphaGo Zero, a Go engine that achieved a superhuman level without the domain knowledge of the game. +We redesign AlphaGo Zero for combinatorial optimization problems, taking into account several differences from two-player games. +In experiments on five NP-hard problems such as {\sc MinimumVertexCover} and {\sc MaxCut}, our method, with only a policy network, shows better generalization than the previous method to various instances that are not used for training, including random graphs, synthetic graphs, and real-world graphs. 
+Furthermore, our method is significantly enhanced by a test-time Monte Carlo Tree Search which makes full use of the policy network and value network. +We also compare recently-developed graph neural network (GNN) models, with an interesting insight into a suitable choice of GNN models for each task.",/pdf/b6aca2a5580381428f2757d9ce963d07308bf5cc.pdf,ICLR,2021,We train graph representation for combinatorial optimization problems without domain knowledge. +HklKWhC5F7,r1lOxz0qYQ,1538090000000.0,1545360000000.0,1190,How Training Data Affect the Accuracy and Robustness of Neural Networks for Image Classification,"[""sulei@ucdavis.edu"", ""huan@huan-zhang.com"", ""kewang@visa.com"", ""zhendong.su@inf.ethz.ch""]","[""Suhua Lei"", ""Huan Zhang"", ""Ke Wang"", ""Zhendong Su""]","[""Adversarial attacks"", ""Robustness"", ""CW"", ""I-FGSM""]","Recent work has demonstrated the lack of robustness of well-trained deep neural networks (DNNs) to adversarial examples. For example, visually indistinguishable perturbations, when mixed with an original image, can easily lead deep learning models to misclassifications. In light of a recent study on the mutual influence between robustness and accuracy over 18 different ImageNet models, this paper investigates how training data affect the accuracy and robustness of deep neural +networks. We conduct extensive experiments on four different datasets, including CIFAR-10, MNIST, STL-10, and Tiny ImageNet, with several representative neural networks. Our results reveal previously unknown phenomena that exist between the size of training data and characteristics of the resulting models. In particular, besides confirming that the model accuracy improves as the amount of training data increases, we also observe that the model robustness improves initially, but there exists a turning point after which robustness starts to decrease. 
How and when such turning points occur vary for different neural networks and different datasets.",/pdf/5b9bd31c5bd6f380776f65d90a087f4095be3c04.pdf,ICLR,2019, +WPO0vDYLXem,Mi9hTSoEHFu,1601310000000.0,1614990000000.0,3570,Hyperparameter Transfer Across Developer Adjustments,"[""~Danny_Stoll1"", ""~J\u00f6rg_K.H._Franke1"", ""wagnerd@cs.uni-freiburg.de"", ""selgs@cs.uni-freiburg.de"", ""~Frank_Hutter1""]","[""Danny Stoll"", ""J\u00f6rg K.H. Franke"", ""Diane Wagner"", ""Simon Selg"", ""Frank Hutter""]","[""Meta Learning"", ""Hyperparameter Optimization"", ""Transfer Learning""]","After developer adjustments to a machine learning (ML) algorithm, how can the results of an old hyperparameter optimization (HPO) automatically be used to speedup a new HPO? This question poses a challenging problem, as developer adjustments can change which hyperparameter settings perform well, or even the hyperparameter search space itself. While many approaches exist that leverage knowledge obtained on previous tasks, so far, knowledge from previous development steps remains entirely untapped. In this work, we remedy this situation and propose a new research framework: hyperparameter transfer across adjustments (HT-AA). To lay a solid foundation for this research framework, we provide four simple HT-AA baseline algorithms and eight benchmarks +changing various aspects of ML algorithms, their hyperparameter search spaces, and the neural architectures used. The best baseline, on average and depending on the budgets for the old and new HPO, reaches a given performance 1.2-3.6x faster than a prominent HPO algorithm without transfer. As HPO is a crucial step in ML development but requires extensive computational resources, this speedup would lead to faster development cycles, lower costs, and reduced environmental impacts. 
To make these benefits available to ML developers off-the-shelf and to facilitate future research on HT-AA, we provide python packages for our baselines and benchmarks.",/pdf/2bd5b85352488faba6c3363fd5008e185225125a.pdf,ICLR,2021,A new research framework that introduces automated knowledge transfers across algorithm development steps to speedup hyperparameter optimization. +SkqV-XZRZ,By1zbm-RZ,1509140000000.0,1518730000000.0,1125,Variational Bi-LSTMs,"[""s.shabanian@gmail.com"", ""devansharpit@gmail.com"", ""adam.trischler@microsoft.com"", ""yoshua.umontreal@gmail.com""]","[""Samira Shabanian"", ""Devansh Arpit"", ""Adam Trischler"", ""Yoshua Bengio""]",[],"Recurrent neural networks like long short-term memory (LSTM) are important architectures for sequential prediction tasks. LSTMs (and RNNs in general) model sequences along the forward time direction. Bidirectional LSTMs (Bi-LSTMs), which model sequences along both forward and backward directions, generally perform better at such tasks because they capture a richer representation of the data. In the training of Bi-LSTMs, the forward and backward paths are learned independently. We propose a variant of the Bi-LSTM architecture, which we call Variational Bi-LSTM, that creates a dependence between the two paths (during training, but which may be omitted during inference). Our model acts as a regularizer and encourages the two networks to inform each other in making their respective predictions using distinct information. 
We perform ablation studies to better understand the different components of our model and evaluate the method on various benchmarks, showing state-of-the-art performance.",/pdf/4324fa39868648281fcca9536b21bab92f264995.pdf,ICLR,2018, +HkMlGnC9KQ,r1e1RJncY7,1538090000000.0,1545360000000.0,1233,On Regularization and Robustness of Deep Neural Networks,"[""alberto.bietti@inria.fr"", ""gregoire.mialon@inria.fr"", ""julien.mairal@inria.fr""]","[""Alberto Bietti*"", ""Gr\u00e9goire Mialon*"", ""Julien Mairal""]","[""regularization"", ""robustness"", ""deep learning"", ""convolutional networks"", ""kernel methods""]","In this work, we study the connection between regularization and robustness of deep neural networks by viewing them as elements of a reproducing kernel Hilbert space (RKHS) of functions and by regularizing them using the RKHS norm. Even though this norm cannot be computed, we consider various approximations based on upper and lower bounds. These approximations lead to new strategies for regularization, but also to existing ones such as spectral norm penalties or constraints, gradient penalties, or adversarial training. Besides, the kernel framework allows us to obtain margin-based bounds on adversarial generalization. We show that our new algorithms lead to empirical benefits for learning on small datasets and learning adversarially robust models. We also discuss implications of our regularization framework for learning implicit generative models.",/pdf/8060b3e985017452425db0182db32fdb290882e4.pdf,ICLR,2019, +B1EVwkqTW,By7Nvk5pZ,1508670000000.0,1518730000000.0,40,Make SVM great again with Siamese kernel for few-shot learning,"[""bence.tilk@gmail.com""]","[""Bence Tilk""]","[""SVM"", ""siamese network"", ""one-shot learning"", ""few-shot learning""]","While deep neural networks have shown outstanding results in a wide range of applications, +learning from a very limited number of examples is still a challenging +task. 
Despite the difficulties of the few-shot learning, metric-learning techniques +showed the potential of the neural networks for this task. While these methods +perform well, they don’t provide satisfactory results. In this work, the idea of +metric-learning is extended with Support Vector Machines (SVM) working mechanism, +which is well known for generalization capabilities on a small dataset. +Furthermore, this paper presents an end-to-end learning framework for training +adaptive kernel SVMs, which eliminates the problem of choosing a correct kernel +and good features for SVMs. Next, the one-shot learning problem is redefined +for audio signals. Then the model was tested on vision task (using Omniglot +dataset) and speech task (using TIMIT dataset) as well. Actually, the algorithm +using Omniglot dataset improved accuracy from 98.1% to 98.5% on the one-shot +classification task and from 98.9% to 99.3% on the few-shot classification task.",/pdf/016c1e08e70f12916905b698483705214552b6bc.pdf,ICLR,2018,"The proposed method is an end-to-end neural SVM, which is optimized for few-shot learning." +2nm0fGwWBMr,cjSmNWGcea,1601310000000.0,1614990000000.0,1190,PanRep: Universal node embeddings for heterogeneous graphs,"[""~Vassilis_N._Ioannidis1"", ""dzzhen@amazon.com"", ""~George_Karypis1""]","[""Vassilis N. Ioannidis"", ""Da Zheng"", ""George Karypis""]","[""Graph neural networks"", ""universal node embeddings"", ""node classification"", ""link prediction"", ""unsupervised learning""]","Learning unsupervised node embeddings facilitates several downstream tasks such as node classification and link prediction. A node embedding is universal if it is designed to be used by and benefit various downstream tasks. This work introduces PanRep, a graph neural network (GNN) model, for unsupervised learning of universal node representations for heterogenous graphs. 
PanRep consists of a GNN encoder that obtains node embeddings and four decoders, each capturing different topological and node feature properties. Abiding by these properties, the novel unsupervised framework learns universal embeddings applicable to different downstream tasks. PanRep can be further fine-tuned to account for possible limited labels. In this operational setting PanRep is considered as a pretrained model for extracting node embeddings of heterogenous graph data. PanRep outperforms all unsupervised and certain supervised methods in node classification and link prediction, especially when the labeled data for the supervised methods is small. PanRep-FT (with fine-tuning) outperforms all other supervised approaches, which corroborates the merits of pretraining models. Finally, we apply PanRep-FT for discovering novel drugs for Covid-19. We showcase the advantage of universal embeddings in drug repurposing and identify several drugs used in clinical trials as possible drug candidates.",/pdf/25d721fbacce6943b4b75a6d5621df9eb4dc2f4d.pdf,ICLR,2021, +SyGT_6yCZ,Sy-pd6yCb,1509050000000.0,1518730000000.0,180,Simple Fast Convolutional Feature Learning,"[""dlm@cin.ufpe.br"", ""cz@cin.ufpe.br"", ""tbl@cin.ufpe.br""]","[""David Mac\u00eado"", ""Cleber Zanchettin"", ""Teresa Ludermir""]","[""Feature Learning"", ""Convolutional Neural Networks"", ""Visual Recognition""]","The quality of the features used in visual recognition is of fundamental importance for the overall system. For a long time, low-level hand-designed feature algorithms such as SIFT and HOG have obtained the best results on image recognition. Visual features have recently been extracted from trained convolutional neural networks. Despite the high-quality results, one of the main drawbacks of this approach, when compared with hand-designed features, is the training time required during the learning process. 
In this paper, we propose a simple and fast way to train supervised convolutional models for feature extraction while still maintaining their high quality. This methodology is evaluated on different datasets and compared with state-of-the-art approaches.",/pdf/9136bbd285842a07b9f55d97b7b035f3e0862c65.pdf,ICLR,2018,A simple fast method for extracting visual features from convolutional neural networks +B1eZYkHYPS,HJxe0QCdwH,1569440000000.0,1577170000000.0,1830,Shifted Randomized Singular Value Decomposition,"[""ali.basirat@lingfil.uu.se""]","[""Ali Basirat""]","[""SVD"", ""PCA"", ""Randomized Algorithms""]","We extend the randomized singular value decomposition (SVD) algorithm (Halko et al., 2011) to estimate the SVD of a shifted data matrix without explicitly constructing the matrix in memory. With no loss in the accuracy of the original algorithm, the extended algorithm provides for a more efficient way of matrix factorization. The algorithm facilitates the low-rank approximation and principal component analysis (PCA) of off-center data matrices. When applied to different types of data matrices, our experimental results confirm the advantages of the extensions made to the original algorithm.",/pdf/7fcb85e97a97c39a5e374a96d9239dfa4d866d9a.pdf,ICLR,2020,A randomized algorithm to estimate the SVD of a shifted data matrix without explicitly constructing the matrix in memory. +NECTfffOvn1,wA-smZfz_U5,1601310000000.0,1613380000000.0,1239,Fidelity-based Deep Adiabatic Scheduling,"[""eliovify@gmail.com"", ""~Lior_Wolf1""]","[""Eli Ovits"", ""Lior Wolf""]",[],"Adiabatic quantum computation is a form of computation that acts by slowly interpolating a quantum system between an easy-to-prepare initial state and a final state that represents a solution to a given computational problem. 
The choice of the interpolation schedule is critical to the performance: if at a certain time point, the evolution is too rapid, the system has a high probability to transfer to a higher energy state, which does not represent a solution to the problem. On the other hand, an evolution that is too slow leads to a loss of computation time and increases the probability of failure due to decoherence. In this work, we train deep neural models to produce optimal schedules that are conditioned on the problem at hand. We consider two types of problem representation: the Hamiltonian form, and the Quadratic Unconstrained Binary Optimization (QUBO) form. A novel loss function that scores schedules according to their approximated success probability is introduced. We benchmark our approach on random QUBO problems, Grover search, 3-SAT, and MAX-CUT problems and show that our approach outperforms, by a sizable margin, the linear schedules as well as alternative approaches that were very recently proposed.",/pdf/864d26c496d060fce5f6a17f3e6edd74aaead783.pdf,ICLR,2021,A new loss for applying supervised deep learning to the problem of scheduling adiabatic quantum computations +J7bUsLCb0zf,FB6sOOHe5i,1601310000000.0,1614990000000.0,1419,Compute- and Memory-Efficient Reinforcement Learning with Latent Experience Replay,"[""~Lili_Chen1"", ""~Kimin_Lee1"", ""~Aravind_Srinivas1"", ""~Pieter_Abbeel2""]","[""Lili Chen"", ""Kimin Lee"", ""Aravind Srinivas"", ""Pieter Abbeel""]","[""reinforcement learning"", ""deep learning"", ""computational efficiency"", ""memory efficiency""]","Recent advances in off-policy deep reinforcement learning (RL) have led to impressive success in complex tasks from visual observations. Experience replay improves sample-efficiency by reusing experiences from the past, and convolutional neural networks (CNNs) process high-dimensional inputs effectively. However, such techniques demand high memory and computational bandwidth. 
In this paper, we present Latent Vector Experience Replay (LeVER), a simple modification of existing off-policy RL methods, to address these computational and memory requirements without sacrificing the performance of RL agents. To reduce the computational overhead of gradient updates in CNNs, we freeze the lower layers of CNN encoders early in training due to early convergence of their parameters. Additionally, we reduce memory requirements by storing the low-dimensional latent vectors for experience replay instead of high-dimensional images, enabling an adaptive increase in the replay buffer capacity, a useful technique in constrained-memory settings. In our experiments, we show that LeVER does not degrade the performance of RL agents while significantly saving computation and memory across a diverse set of DeepMind Control environments and Atari games. Finally, we show that LeVER is useful for computation-efficient transfer learning in RL because lower layers of CNNs extract generalizable features, which can be used for different tasks and domains.",/pdf/84859b5a5a9d4733a079a88813eb849e253a5043.pdf,ICLR,2021,We present a compute- and memory-efficient modification of off-policy RL algorithms by freezing lower layers of CNN encoders early in training. +Bk-ofQZRb,H1icfQWCW,1509140000000.0,1518730000000.0,1145,TD Learning with Constrained Gradients,"[""ishand@cs.utexas.edu"", ""pstone@cs.utexas.edu""]","[""Ishan Durugkar"", ""Peter Stone""]","[""Reinforcement Learning"", ""TD Learning"", ""DQN""]","Temporal Difference Learning with function approximation is known to be unstable. Previous work like \citet{sutton2009fast} and \citet{sutton2009convergent} has presented alternative objectives that are stable to minimize. However, in practice, TD-learning with neural networks requires various tricks like using a target network that updates slowly \citep{mnih2015human}. In this work we propose a constraint on the TD update that minimizes change to the target values. 
This constraint can be applied to the gradients of any TD objective, and can be easily applied to nonlinear function approximation. We validate this update by applying our technique to deep Q-learning, and training without a target network. We also show that adding this constraint on Baird's counterexample keeps Q-learning from diverging.",/pdf/424ef3a312b7502cf11a36f4693095fb81db7ecb.pdf,ICLR,2018,We show that adding a constraint to TD updates stabilizes learning and allows Deep Q-learning without a target network +rJxAo2VYwr,Hkx2D7HZPS,1569440000000.0,1583910000000.0,172,Transferable Perturbations of Deep Feature Distributions,"[""nathan.inkawhich@duke.edu"", ""kevin.liang@duke.edu"", ""lcarin@duke.edu"", ""yiran.chen@duke.edu""]","[""Nathan Inkawhich"", ""Kevin Liang"", ""Lawrence Carin"", ""Yiran Chen""]","[""adversarial attacks"", ""transferability"", ""interpretability""]","Almost all current adversarial attacks of CNN classifiers rely on information derived from the output layer of the network. This work presents a new adversarial attack based on the modeling and exploitation of class-wise and layer-wise deep feature distributions. We achieve state-of-the-art targeted blackbox transfer-based attack results for undefended ImageNet models. Further, we place a priority on explainability and interpretability of the attacking process. Our methodology affords an analysis of how adversarial attacks change the intermediate feature distributions of CNNs, as well as a measure of layer-wise and class-wise feature distributional separability/entanglement. We also conceptualize a transition from task/data-specific to model-specific features within a CNN architecture that directly impacts the transferability of adversarial examples. 
",/pdf/8b2116131bdfe59cc017ed24381d212331f9d0fd.pdf,ICLR,2020,We show that perturbations based-on intermediate feature distributions yield more transferable adversarial examples and allow for analysis of the affects of adversarial perturbations on intermediate representations. +BkgCv1HYvB,S1lSV0auwH,1569440000000.0,1577200000000.0,1786,Generating Multi-Sentence Abstractive Summaries of Interleaved Texts,"[""skarn@cis.lmu.de"", ""chen@fxpal.com"", ""yanying@fxpal.com"", ""ulli.waltinger@siemens.com"", ""hinrich@hotmail.com""]","[""Sanjeev Kumar Karn"", ""Francine Chen"", ""Yan-Ying Chen"", ""Ulli Waltinger"", ""Hinrich Sch\u00fctze""]",[],"In multi-participant postings, as in online chat conversations, several conversations or topic threads may take place concurrently. This leads to difficulties for readers reviewing the postings in not only following discussions but also in quickly identifying their essence. A two-step process, disentanglement of interleaved posts followed by summarization of each thread, addresses the issue, but disentanglement errors are propagated to the summarization step, degrading the overall performance. To address this, we propose an end-to-end trainable encoder-decoder network for summarizing interleaved posts. The interleaved posts are encoded hierarchically, i.e., word-to-word (words in a post) followed by post-to-post (posts in a channel). The decoder also generates summaries hierarchically, thread-to-thread (generate thread representations) followed by word-to-word (i.e., generate summary words). Additionally, we propose a hierarchical attention mechanism for interleaved text. 
Overall, our end-to-end trainable hierarchical framework enhances performance over a sequence to sequence framework by 8-10% on multiple synthetic interleaved texts datasets.",/pdf/df9803dcdd2ecb8e6fdcfee2a4e4f1b1c56be326.pdf,ICLR,2020, +jGeOQt3oUl1,IfGCxZuA0hR,1601310000000.0,1614990000000.0,2420,Representational aspects of depth and conditioning in normalizing flows,"[""~Frederic_Koehler1"", ""~Viraj_Mehta1"", ""~Andrej_Risteski2""]","[""Frederic Koehler"", ""Viraj Mehta"", ""Andrej Risteski""]","[""normalizing flows"", ""representational power"", ""conditioning"", ""depth"", ""theory""]","Normalizing flows are among the most popular paradigms in generative modeling, especially for images, primarily because we can efficiently evaluate the likelihood of a data point. This is desirable both for evaluating the fit of a model, and for ease of training, as maximizing the likelihood can be done by gradient descent. However, training normalizing flows comes with difficulties as well: models which produce good samples typically need to be extremely deep -- which comes with accompanying vanishing/exploding gradient problems. A very related problem is that they are often poorly \emph{conditioned}: since they are parametrized as invertible maps from $\mathbb{R}^d \to \mathbb{R}^d$, and typical training data like images intuitively is lower-dimensional, the learned maps often have Jacobians that are close to being singular. + +In our paper, we tackle representational aspects around depth and conditioning of normalizing flows---both for general invertible architectures, and for a particular common architecture---affine couplings. + +For general invertible architectures, we prove that invertibility comes at a cost in terms of depth: we show examples where a much deeper normalizing flow model may need to be used to match the performance of a non-invertible generator. 
+ +For affine couplings, we first show that the choice of partitions isn't a likely bottleneck for depth: we show that any invertible linear map (and hence a permutation) can be simulated by a constant number of affine coupling layers, using a fixed partition. This shows that the extra flexibility conferred by 1x1 convolution layers, as in GLOW, can in principle be simulated by increasing the size by a constant factor. Next, in terms of conditioning, we show that affine couplings are universal approximators -- provided the Jacobian of the model is allowed to be close to singular. We furthermore empirically explore the benefit of different kinds of padding -- a common strategy for improving conditioning.",/pdf/f1789f019e1d6b9e54c3d69f0434757317bc28fe.pdf,ICLR,2021,Provable tradeoffs between conditioning/depth and representational power for normalizing flows. +Hye4WaVYwr,H1livmvUDB,1569440000000.0,1577170000000.0,372,Bootstrapping the Expressivity with Model-based Planning,"[""dkf16@mails.tsinghua.edu.cn"", ""yupingl@cs.princeton.edu"", ""tengyuma@cs.stanford.edu""]","[""Kefan Dong"", ""Yuping Luo"", ""Tengyu Ma""]","[""reinforcement learning theory"", ""model-based reinforcement learning"", ""planning"", ""expressivity"", ""approximation theory"", ""deep reinforcement learning theory""]","We compare the model-free reinforcement learning with the model-based approaches through the lens of the expressive power of neural networks for policies, $Q$-functions, and dynamics. We show, theoretically and empirically, that even for one-dimensional continuous state space, there are many MDPs whose optimal $Q$-functions and policies are much more complex than the dynamics. We hypothesize many real-world MDPs also have a similar property. 
For these MDPs, model-based planning is a favorable algorithm, because the resulting policies can approximate the optimal policy significantly better than a neural network parameterization can, and model-free or model-based policy optimization rely on policy parameterization. Motivated by the theory, we apply a simple multi-step model-based bootstrapping planner (BOOTS) to bootstrap a weak $Q$-function into a stronger policy. Empirical results show that applying BOOTS on top of model-based or model-free policy optimization algorithms at the test time improves the performance on MuJoCo benchmark tasks. ",/pdf/bb6f0c5be53784489209cbfa6187607cea25c6e2.pdf,ICLR,2020,"We compare deep model-based and model-free RL algorithms by studying the approximability of $Q$-functions, policies, and dynamics by neural networks. " +Hk3mPK5gg,,1478300000000.0,1488580000000.0,499,Training Agent for First-Person Shooter Game with Actor-Critic Curriculum Learning,"[""ppwwyyxx@gmail.com"", ""yuandong@fb.com""]","[""Yuxin Wu"", ""Yuandong Tian""]","[""Reinforcement Learning"", ""Applications"", ""Games""]","In this paper, we propose a novel framework for training vision-based agent for First-Person Shooter (FPS) Game, in particular Doom. +Our framework combines the state-of-the-art reinforcement learning approach (Asynchronous Advantage Actor-Critic (A3C) model) with curriculum learning. Our model is simple in design and only uses game states from the AI side, rather than using opponents' information. On a known map, our agent won 10 out of the 11 attended games and the champion of Track1 in ViZDoom AI Competition 2016 by a large margin, 35\% higher score than the second place.",/pdf/e3b317861244ee2b16515b728c7294e2e3724b51.pdf,ICLR,2017,"We propose a novel framework for training vision-based agent for First-Person Shooter (FPS) Game, Doom, using actor-critic model and curriculum training. 
" +nVZtXBI6LNn,g5wPM-jbpK,1601310000000.0,1614300000000.0,3717,Fast and Complete: Enabling Complete Neural Network Verification with Rapid and Massively Parallel Incomplete Verifiers,"[""~Kaidi_Xu1"", ""~Huan_Zhang1"", ""~Shiqi_Wang2"", ""~Yihan_Wang2"", ""~Suman_Jana1"", ""~Xue_Lin1"", ""~Cho-Jui_Hsieh1""]","[""Kaidi Xu"", ""Huan Zhang"", ""Shiqi Wang"", ""Yihan Wang"", ""Suman Jana"", ""Xue Lin"", ""Cho-Jui Hsieh""]","[""neural network verification"", ""branch and bound""]","Formal verification of neural networks (NNs) is a challenging and important problem. Existing efficient complete solvers typically require the branch-and-bound (BaB) process, which splits the problem domain into sub-domains and solves each sub-domain using faster but weaker incomplete verifiers, such as Linear Programming (LP) on linearly relaxed sub-domains. In this paper, we propose to use the backward mode linear relaxation based perturbation analysis (LiRPA) to replace LP during the BaB process, which can be efficiently implemented on the typical machine learning accelerators such as GPUs and TPUs. However, unlike LP, LiRPA when applied naively can produce much weaker bounds and even cannot check certain conflicts of sub-domains during splitting, making the entire procedure incomplete after BaB. To address these challenges, we apply a fast gradient based bound tightening procedure combined with batch splits and the design of minimal usage of LP bound procedure, enabling us to effectively use LiRPA on the accelerator hardware for the challenging complete NN verification problem and significantly outperform LP-based approaches. On a single GPU, we demonstrate an order of magnitude speedup compared to existing LP-based approaches.",/pdf/e8ea009d8faf5987887d2dc7ca2b4d680a2f83dc.pdf,ICLR,2021,We use fast bound propagation methods on GPUs for complete neural network verification and achieve large speedup compared to SOTA. 
+HkMwHsCctm,rJeFi1OKtX,1538090000000.0,1545360000000.0,96,Principled Deep Neural Network Training through Linear Programming,"[""dano@columbia.edu"", ""gonzalo.munoz@polymtl.ca"", ""sebastian.pokutta@isye.gatech.edu""]","[""Daniel Bienstock"", ""Gonzalo Mu\u00f1oz"", ""Sebastian Pokutta""]","[""deep learning theory"", ""neural network training"", ""empirical risk minimization"", ""non-convex optimization"", ""treewidth""]","Deep Learning has received significant attention due to its impressive performance in many state-of-the-art learning tasks. Unfortunately, while very powerful, Deep Learning is not well understood theoretically and in particular only recently results for the complexity of training deep neural networks have been obtained. In this work we show that large classes of deep neural networks with various architectures (e.g., DNNs, CNNs, Binary Neural Networks, and ResNets), activation functions (e.g., ReLUs and leaky ReLUs), and loss functions (e.g., Hinge loss, Euclidean loss, etc) can be trained to near optimality with desired target accuracy using linear programming in time that is exponential in the input data and parameter space dimension and polynomial in the size of the data set; improvements of the dependence in the input dimension are known to be unlikely assuming $P\neq NP$, and improving the dependence on the parameter space dimension remains open. In particular, we obtain polynomial time algorithms for training for a given fixed network architecture. 
Our work applies more broadly to empirical risk minimization problems which allows us to generalize various previous results and obtain new complexity results for previously unstudied architectures in the proper learning setting.",/pdf/f9adbd336eb0c03112fbd8eb0d2cf20aeccde6df.pdf,ICLR,2019,Using linear programming we show that the computational complexity of approximate Deep Neural Network training depends polynomially on the data size for several architectures +cPZOyoDloxl,YpNTvUnAlPX,1601310000000.0,1615950000000.0,1385,SMiRL: Surprise Minimizing Reinforcement Learning in Unstable Environments,"[""~Glen_Berseth1"", ""dangengdg@berkeley.edu"", ""~Coline_Manon_Devin1"", ""~Nicholas_Rhinehart1"", ""~Chelsea_Finn1"", ""~Dinesh_Jayaraman2"", ""~Sergey_Levine1""]","[""Glen Berseth"", ""Daniel Geng"", ""Coline Manon Devin"", ""Nicholas Rhinehart"", ""Chelsea Finn"", ""Dinesh Jayaraman"", ""Sergey Levine""]","[""Reinforcement learning""]","Every living organism struggles against disruptive environmental forces to carve out and maintain an orderly niche. We propose that such a struggle to achieve and preserve order might offer a principle for the emergence of useful behaviors in artificial agents. We formalize this idea into an unsupervised reinforcement learning method called surprise minimizing reinforcement learning (SMiRL). SMiRL alternates between learning a density model to evaluate the surprise of a stimulus, and improving the policy to seek more predictable stimuli. The policy seeks out stable and repeatable situations that counteract the environment's prevailing sources of entropy. This might include avoiding other hostile agents, or finding a stable, balanced pose for a bipedal robot in the face of disturbance forces. We demonstrate that our surprise minimizing agents can successfully play Tetris, Doom, control a humanoid to avoid falls, and navigate to escape enemies in a maze without any task-specific reward supervision. 
We further show that SMiRL can be used together with standard task rewards to accelerate reward-driven learning.",/pdf/f9be390d21352320ef34c706df775ded52b79ebd.pdf,ICLR,2021,Using Bayesian surprise as an unsupervised intrinsic reward function to learn complex behaviors in unstable environments. +H1eRI04KPB,ByxC44wuwB,1569440000000.0,1577170000000.0,1157,Likelihood Contribution based Multi-scale Architecture for Generative Flows,"[""hpdas@eecs.berkeley.edu"", ""pabbeel@cs.berkeley.edu"", ""spanos@eecs.berkeley.edu""]","[""Hari Prasanna Das"", ""Pieter Abbeel"", ""Costas J. Spanos""]","[""Generative Flow"", ""Normalizing Flow"", ""Multi-scale Architecture"", ""RealNVP"", ""Dimension Factorization""]","Deep generative modeling using flows has gained popularity owing to the tractable exact log-likelihood estimation with efficient training and synthesis process. However, flow models suffer from the challenge of having high dimensional latent space, same in dimension as the input space. An effective solution to the above challenge as proposed by Dinh et al. (2016) is a multi-scale architecture, which is based on iterative early factorization of a part of the total dimensions at regular intervals. Prior works on generative flows involving a multi-scale architecture perform the dimension factorization based on a static masking. We propose a novel multi-scale architecture that performs data dependent factorization to decide which dimensions should pass through more flow layers. To facilitate the same, we introduce a heuristic based on the contribution of each dimension to the total log-likelihood which encodes the importance of the dimensions. Our proposed heuristic is readily obtained as part of the flow training process, enabling versatile implementation of our likelihood contribution based multi-scale architecture for generic flow models. We present such an implementation for the original flow introduced in Dinh et al. 
(2016), and demonstrate improvements in log-likelihood score and sampling quality on standard image benchmarks. We also conduct ablation studies to compare proposed method with other options for dimension factorization.",/pdf/d54ee9f4c41ae4064e3d071e91836c3e0ea91ae6.pdf,ICLR,2020,Data-dependent factorization of dimensions in a multi-scale architecture based on contribution to the total log-likelihood +qbH974jKUVy,zqGG95QcgD5,1601310000000.0,1616070000000.0,3567,The role of Disentanglement in Generalisation,"[""~Milton_Llera_Montero1"", ""c.ludwig@bristol.ac.uk"", ""~Rui_Ponte_Costa3"", ""~Gaurav_Malhotra1"", ""~Jeffrey_Bowers1""]","[""Milton Llera Montero"", ""Casimir JH Ludwig"", ""Rui Ponte Costa"", ""Gaurav Malhotra"", ""Jeffrey Bowers""]","[""disentanglement"", ""compositionality"", ""compositional generalization"", ""generalisation"", ""generative models"", ""variational autoencoders""]","Combinatorial generalisation — the ability to understand and produce novel combinations of familiar elements — is a core capacity of human intelligence that current AI systems struggle with. Recently, it has been suggested that learning disentangled representations may help address this problem. It is claimed that such representations should be able to capture the compositional structure of the world which can then be combined to support combinatorial generalisation. In this study, we systematically tested how the degree of disentanglement affects various forms of generalisation, including two forms of combinatorial generalisation that varied in difficulty. We trained three classes of variational autoencoders (VAEs) on two datasets on an unsupervised task by excluding combinations of generative factors during training. At test time we ask the models to reconstruct the missing combinations in order to measure generalisation performance. Irrespective of the degree of disentanglement, we found that the models supported only weak combinatorial generalisation. 
We obtained the same outcome when we directly input perfectly disentangled representations as the latents, and when we tested a model on a more complex task that explicitly required independent generative factors to be controlled. While learning disentangled representations does improve interpretability and sample efficiency in some downstream tasks, our results suggest that they are not sufficient for supporting more difficult forms of generalisation.",/pdf/b2891a422f7bcfc82a95c0587ba5da7e42c473db.pdf,ICLR,2021,Disentangled models do not achieve compositional generalization when tested systematically. +ryeUg0VFwr,SylzBw7_Dr,1569440000000.0,1577170000000.0,930,Striving for Simplicity in Off-Policy Deep Reinforcement Learning,"[""rishabhagarwal@google.com"", ""schuurmans@google.com"", ""mnorouzi@google.com""]","[""Rishabh Agarwal"", ""Dale Schuurmans"", ""Mohammad Norouzi""]","[""reinforcement learning"", ""off-policy"", ""batch RL"", ""offline RL"", ""benchmark""]","This paper advocates the use of offline (batch) reinforcement learning (RL) to help (1) isolate the contributions of exploitation vs. exploration in off-policy deep RL, (2) improve reproducibility of deep RL research, and (3) facilitate the design of simpler deep RL algorithms. We propose an offline RL benchmark on Atari 2600 games comprising all of the replay data of a DQN agent. Using this benchmark, we demonstrate that recent off-policy deep RL algorithms, even when trained solely on logged DQN data, can outperform online DQN. We present Random Ensemble Mixture (REM), a simple Q-learning algorithm that enforces optimal Bellman consistency on random convex combinations of multiple Q-value estimates. 
The REM algorithm outperforms more complex RL agents such as C51 and QR-DQN on the offline Atari benchmark and performs comparably in the online setting.",/pdf/42b11365fb292dd6063cb1b4ca1452d54bdedf47.pdf,ICLR,2020, +SJxWS64FwH,rJlKKNHvPS,1569440000000.0,1583910000000.0,512,Deep Network Classification by Scattering and Homotopy Dictionary Learning,"[""john.zarka@ens.fr"", ""louis.thiry@ens.fr"", ""tomas.angles@ens.fr"", ""stephane.mallat@ens.fr""]","[""John Zarka"", ""Louis Thiry"", ""Tomas Angles"", ""Stephane Mallat""]","[""dictionary learning"", ""scattering transform"", ""sparse coding"", ""imagenet""]","We introduce a sparse scattering deep convolutional neural network, which provides a simple model to analyze properties of deep representation learning for classification. Learning a single dictionary matrix with a classifier yields a higher classification accuracy than AlexNet over the ImageNet 2012 dataset. The network first applies a scattering transform that linearizes variabilities due to geometric transformations such as translations and small deformations. +A sparse $\ell^1$ dictionary coding reduces intra-class variability while preserving class separation through projections over unions of linear spaces. It is implemented in a deep convolutional network with a homotopy algorithm having an exponential convergence. A convergence proof is given in a general framework that includes ALISTA. Classification results are analyzed on ImageNet.",/pdf/e6235a059a0929ea35206c487924a1f355bb8c7d.pdf,ICLR,2020,A scattering transform followed by supervised dictionary learning reaches a higher accuracy than AlexNet on ImageNet. 
+71zCSP_HuBN,F-pr_S_inSt,1601310000000.0,1616020000000.0,2577,Individually Fair Rankings,"[""~Amanda_Bower1"", ""hamidef@umich.edu"", ""~Mikhail_Yurochkin1"", ""~Yuekai_Sun1""]","[""Amanda Bower"", ""Hamid Eftekhari"", ""Mikhail Yurochkin"", ""Yuekai Sun""]","[""algorithmic fairness"", ""learning to rank"", ""optimal transport""]",We develop an algorithm to train individually fair learning-to-rank (LTR) models. The proposed approach ensures items from minority groups appear alongside similar items from majority groups. This notion of fair ranking is based on the definition of individual fairness from supervised learning and is more nuanced than prior fair LTR approaches that simply ensure the ranking model provides underrepresented items with a basic level of exposure. The crux of our method is an optimal transport-based regularizer that enforces individual fairness and an efficient algorithm for optimizing the regularizer. We show that our approach leads to certifiably individually fair LTR models and demonstrate the efficacy of our method on ranking tasks subject to demographic biases.,/pdf/79474c3ea4a5449a9adbae8a72783142b915282b.pdf,ICLR,2021,We present an algorithm for training individually fair learning-to-rank systems using optimal transport tools. +ry8dvM-R-,Skr_PGWCW,1509140000000.0,1518730000000.0,859,Routing Networks: Adaptive Selection of Non-Linear Functions for Multi-Task Learning,"[""crosenbaum@umass.edu"", ""tklinger@us.ibm.com"", ""mdriemer@us.ibm.com""]","[""Clemens Rosenbaum"", ""Tim Klinger"", ""Matthew Riemer""]","[""multi-task"", ""transfer"", ""routing"", ""marl"", ""multi-agent"", ""reinforcement"", ""self-organizing""]","Multi-task learning (MTL) with neural networks leverages commonalities in tasks to improve performance, but often suffers from task interference which reduces the benefits of transfer. To address this issue we introduce the routing network paradigm, a novel neural network and training algorithm. 
A routing network is a kind of self-organizing neural network consisting of two components: a router and a set of one or more function blocks. A function block may be any neural network – for example a fully-connected or a convolutional layer. Given an input the router makes a routing decision, choosing a function block to apply and passing the output back to the router recursively, terminating when a fixed recursion depth is reached. In this way the routing network dynamically composes different function blocks for each input. We employ a collaborative multi-agent reinforcement learning (MARL) approach to jointly train the router and function blocks. We evaluate our model against cross-stitch networks and shared-layer baselines on multi-task settings of the MNIST, mini-imagenet, and CIFAR-100 datasets. Our experiments demonstrate a significant improvement in accuracy, with sharper convergence. In addition, routing networks have nearly constant per-task training cost while cross-stitch networks scale linearly with the number of tasks. On CIFAR100 (20 tasks) we obtain cross-stitch performance levels with an 85% average reduction in training time. +",/pdf/324423a36ea9d1aeab3230d85d59906efc89e29b.pdf,ICLR,2018,routing networks: a new kind of neural network which learns to adaptively route its input for multi-task learning +mDAZVlBeXWx,IP3xNjlhoLt,1601310000000.0,1614990000000.0,969,Towards Robust and Efficient Contrastive Textual Representation Learning,"[""~Liqun_Chen2"", ""~Yizhe_Zhang2"", ""~Dianqi_Li1"", ""~Chenyang_Tao1"", ""~Dong_Wang2"", ""~Lawrence_Carin2""]","[""Liqun Chen"", ""Yizhe Zhang"", ""Dianqi Li"", ""Chenyang Tao"", ""Dong Wang"", ""Lawrence Carin""]",[],"There has been growing interest in representation learning for text data, based on theoretical arguments and empirical evidence. One important direction involves leveraging contrastive learning to improve learned representations. 
We propose an application of contrastive learning for intermediate textual feature pairs, to explicitly encourage the model to learn more distinguishable representations. To overcome the learner's degeneracy due to vanishing contrasting signals, we impose Wasserstein constraints on the critic via spectral regularization. + Finally, to moderate such an objective from overly regularized training and to enhance learning efficiency, with theoretical justification, we further leverage an active negative-sample-selection procedure to only use high-quality contrast examples. We evaluate the proposed method over a wide range of natural language processing applications, from the perspectives of both supervised and unsupervised learning. Empirical results show consistent improvement over baselines. ",/pdf/f64ed910fda29f970ecddeaf84f2773f8f3287cc.pdf,ICLR,2021, +SyoDInJ0-,Hy5PI3yCb,1509050000000.0,1524510000000.0,171,Reinforcement Learning Algorithm Selection,"[""romain.laroche@gmail.com"", ""raphael.feraud@orange.com""]","[""Romain Laroche"", ""Raphael Feraud""]","[""Reinforcement Learning"", ""Multi-Armed Bandit"", ""Algorithm Selection""]","This paper formalises the problem of online algorithm selection in the context of Reinforcement Learning (RL). The setup is as follows: given an episodic task and a finite number of off-policy RL algorithms, a meta-algorithm has to decide which RL algorithm is in control during the next episode so as to maximize the expected return. The article presents a novel meta-algorithm, called Epochal Stochastic Bandit Algorithm Selection (ESBAS). Its principle is to freeze the policy updates at each epoch, and to leave a rebooted stochastic bandit in charge of the algorithm selection. Under some assumptions, a thorough theoretical analysis demonstrates its near-optimality considering the structural sampling budget limitations. 
ESBAS is first empirically evaluated on a dialogue task where it is shown to outperform each individual algorithm in most configurations. ESBAS is then adapted to a true online setting where algorithms update their policies after each transition, which we call SSBAS. SSBAS is evaluated on a fruit collection task where it is shown to adapt the stepsize parameter more efficiently than the classical hyperbolic decay, and on an Atari game, where it improves the performance by a wide margin.",/pdf/a9fd9eab0e2a522faf829a0c17fa35b6e877a93f.pdf,ICLR,2018,This paper formalises the problem of online algorithm selection in the context of Reinforcement Learning. +VlRqY4sV9FO,KjrME6RGyeY,1601310000000.0,1614990000000.0,2278,Human-interpretable model explainability on high-dimensional data,"[""damiendemijolla@gmail.com"", ""~Christopher_Frye1"", ""m.kunesch@cantab.net"", ""john.m@faculty.ai"", ""~Ilya_Feige1""]","[""Damien de Mijolla"", ""Christopher Frye"", ""Markus Kunesch"", ""John Mansir"", ""Ilya Feige""]",[],"The importance of explainability in machine learning continues to grow, as both neural-network architectures and the data they model become increasingly complex. Unique challenges arise when a model's input features become high-dimensional: on one hand, principled model-agnostic approaches to explainability become too computationally expensive; on the other, more efficient explainability algorithms lack natural interpretations for general users. In this work, we introduce a framework for human-interpretable explainability on high-dimensional data, consisting of two modules. First, we apply a semantically-meaningful latent representation, both to reduce the raw dimensionality of the data, and to ensure its human interpretability. These latent features can be learnt, e.g. explicitly as disentangled representations or implicitly through image-to-image translation, or they can be based on any computable quantities the user chooses. 
Second, we adapt the Shapley paradigm for model-agnostic explainability to operate on these latent features. This leads to interpretable model explanations that are both theoretically controlled and computationally tractable. We benchmark our approach on synthetic data and demonstrate its effectiveness on several image-classification tasks.",/pdf/73362c2ac3901aa16206a534aec32ba4a4407a24.pdf,ICLR,2021,We adapt Shapley explainability to operate on semantic latent features in order to produce human-interpretable model explanations. +BylA_C4tPr,BylMwWOOPH,1569440000000.0,1583910000000.0,1230,Composition-based Multi-Relational Graph Convolutional Networks,"[""shikhar@iisc.ac.in"", ""sanyal.soumya8@gmail.com"", ""vikram.nitin@columbia.edu"", ""ppt@iisc.ac.in""]","[""Shikhar Vashishth"", ""Soumya Sanyal"", ""Vikram Nitin"", ""Partha Talukdar""]","[""Graph Convolutional Networks"", ""Multi-relational Graphs"", ""Knowledge Graph Embeddings"", ""Link Prediction""]","Graph Convolutional Networks (GCNs) have recently been shown to be quite successful in modeling graph-structured data. However, the primary focus has been on handling simple undirected graphs. Multi-relational graphs are a more general and prevalent form of graphs where each edge has a label and direction associated with it. Most of the existing approaches to handle such graphs suffer from over-parameterization and are restricted to learning representations of nodes only. In this paper, we propose CompGCN, a novel Graph Convolutional framework which jointly embeds both nodes and relations in a relational graph. CompGCN leverages a variety of entity-relation composition operations from Knowledge Graph Embedding techniques and scales with the number of relations. It also generalizes several of the existing multi-relational GCN methods. We evaluate our proposed method on multiple tasks such as node classification, link prediction, and graph classification, and achieve demonstrably superior results. 
We make the source code of CompGCN available to foster reproducible research.",/pdf/bf70ad4afed02db2b8a47eaa28685546b45a47c9.pdf,ICLR,2020,A Composition-based Graph Convolutional framework for multi-relational graphs. +H1lFsREYPS,SkxIxmFuPB,1569440000000.0,1577170000000.0,1327,ASGen: Answer-containing Sentence Generation to Pre-Train Question Generator for Scale-up Data in Question Answering,"[""akhil.kedia@samsung.com"", ""sai.chetan@samsung.com"", ""scv.back@samsung.com"", ""haejun82.lee@samsung.com"", ""jchoo@korea.ac.kr""]","[""Akhil Kedia"", ""Sai Chetan Chinthakindi"", ""Seohyun Back"", ""Haejun Lee"", ""Jaegul Choo""]","[""Question Answering"", ""Machine Reading Comprehension"", ""Data Augmentation"", ""Question Generation"", ""Answer Generation""]","Numerous machine reading comprehension (MRC) datasets often involve manual annotation, requiring enormous human effort, and hence the size of the dataset remains significantly smaller than the size of the data available for unsupervised learning. Recently, researchers proposed a model for generating synthetic question-and-answer data from large corpora such as Wikipedia. This model is utilized to generate synthetic data for training an MRC model before fine-tuning it using the original MRC dataset. This technique shows better performance than other general pre-training techniques such as language modeling, because the characteristics of the generated data are similar to those of the downstream MRC data. However, it is difficult to have high-quality synthetic data comparable to human-annotated MRC datasets. To address this issue, we propose Answer-containing Sentence Generation (ASGen), a novel pre-training method for generating synthetic data involving two advanced techniques, (1) dynamically determining K answers and (2) pre-training the question generator on the answer-containing sentence generation task. 
We evaluate the question generation capability of our method by comparing the BLEU score with existing methods and test our method by fine-tuning the MRC model on the downstream MRC data after training on synthetic data. Experimental results show that our approach outperforms existing generation methods and increases the performance of the state-of-the-art MRC models across a range of MRC datasets such as SQuAD-v1.1, SQuAD-v2.0, KorQuAD and QUASAR-T without any architectural modifications to the original MRC model.",/pdf/9ada18c2e4db7a783a0a5c466950141a88e7f7f8.pdf,ICLR,2020,"We propose Answer-containing Sentence Generation (ASGen), a novel pre-training method for generating synthetic data for machine reading comprehension." +a4E6SL1rG3F,AHZRl026TGl,1601310000000.0,1614990000000.0,1979,Optimal allocation of data across training tasks in meta-learning,"[""g.batz97@gmail.com"", ""~Alberto_Bernacchia1"", ""ds.shiu@mtkresearch.com"", ""michael.bromberg@mtkresearch.com"", ""alexandru.cioba@mtkresearch.com""]","[""Georgios Batzolis"", ""Alberto Bernacchia"", ""Da-shan Shiu"", ""Michael Bromberg"", ""Alexandru Cioba""]",[],"Meta-learning models transfer the knowledge acquired from previous tasks to quickly learn new ones. They are tested on benchmarks with a fixed number of data-points for each training task, and this number is usually arbitrary, for example, 5 instances per class in few-shot classification. It is unknown how the performance of meta-learning is affected by the distribution of data across training tasks. Since labelling of data is expensive, finding the optimal allocation of labels across training tasks may reduce costs. +Given a fixed budget b of labels to distribute across tasks, should we use a small number of highly labelled tasks, or many tasks with few labels each? In MAML applied to mixed linear regression, we prove that the optimal number of tasks follows the scaling law sqrt{b}. 
We develop an online algorithm for data allocation across tasks, and show that the same scaling law applies to nonlinear regression. We also show preliminary experiments on few-shot image classification. Our work provides a theoretical guide for allocating labels across tasks in meta-learning, which we believe will prove useful in a large number of applications. +",/pdf/f1a8e0b06f9a293444486696e91b218a06f8e066.pdf,ICLR,2021,We study for the first time the problem of optimally allocating labels across tasks during meta-training +c77KhoLYSwF,qL4RIXIT2Y,1601310000000.0,1614990000000.0,835,Just How Toxic is Data Poisoning? A Benchmark for Backdoor and Data Poisoning Attacks,"[""~Avi_Schwarzschild1"", ""~Micah_Goldblum1"", ""~Arjun_Gupta2"", ""~John_P_Dickerson1"", ""~Tom_Goldstein1""]","[""Avi Schwarzschild"", ""Micah Goldblum"", ""Arjun Gupta"", ""John P Dickerson"", ""Tom Goldstein""]","[""Poisoning"", ""backdoor"", ""attack"", ""benchmark""]","Data poisoning and backdoor attacks manipulate training data in order to cause models to fail during inference. A recent survey of industry practitioners found that data poisoning is the number one concern among threats ranging from model stealing to adversarial attacks. However, we find that the impressive performance evaluations from data poisoning attacks are, in large part, artifacts of inconsistent experimental design. Moreover, we find that existing poisoning methods have been tested in contrived scenarios, and many fail in more realistic settings. In order to promote fair comparison in future work, we develop standardized benchmarks for data poisoning and backdoor attacks.",/pdf/51c0801d766028729a95dc23201520e869fdf8ad.pdf,ICLR,2021,"A novel benchmark for data poisoning and backdoor attacks offers fair comparison of attacks, filling a major gap in the literature to date." 
+Sy8gdB9xx,,1478280000000.0,1487960000000.0,259,Understanding deep learning requires rethinking generalization,"[""chiyuan@mit.edu"", ""bengio@google.com"", ""mrtz@google.com"", ""brecht@berkeley.edu"", ""vinyals@google.com""]","[""Chiyuan Zhang"", ""Samy Bengio"", ""Moritz Hardt"", ""Benjamin Recht"", ""Oriol Vinyals""]","[""Deep learning""]","Despite their massive size, successful deep artificial neural networks can +exhibit a remarkably small difference between training and test performance. +Conventional wisdom attributes small generalization error either to properties +of the model family, or to the regularization techniques used during training. + +Through extensive systematic experiments, we show how these traditional +approaches fail to explain why large neural networks generalize well in +practice. Specifically, our experiments establish that state-of-the-art +convolutional networks for image classification trained with stochastic +gradient methods easily fit a random labeling of the training data. This +phenomenon is qualitatively unaffected by explicit regularization, and occurs +even if we replace the true images by completely unstructured random noise. We +corroborate these experimental findings with a theoretical construction +showing that simple depth two neural networks already have perfect finite +sample expressivity as soon as the number of parameters exceeds the +number of data points as it usually does in practice. + +We interpret our experimental findings by comparison with traditional models.",/pdf/a667dbd533e9f018c023e21d1e3efd86cd61c365.pdf,ICLR,2017,"Through extensive systematic experiments, we show how the traditional approaches fail to explain why large neural networks generalize well in practice, and why understanding deep learning requires rethinking generalization." 
+HklFUlBKPB,HygQdcetwS,1569440000000.0,1577170000000.0,2331,Identifying Weights and Architectures of Unknown ReLU Networks,"[""drolnick@seas.upenn.edu"", ""koerding@gmail.com""]","[""David Rolnick"", ""Konrad P. Kording""]","[""deep neural network"", ""ReLU"", ""piecewise linear function"", ""linear region"", ""activation region"", ""weights"", ""parameters"", ""architecture""]","The output of a neural network depends on its parameters in a highly nonlinear way, and it is widely assumed that a network's parameters cannot be identified from its outputs. Here, we show that in many cases it is possible to reconstruct the architecture, weights, and biases of a deep ReLU network given the ability to query the network. ReLU networks are piecewise linear and the boundaries between pieces correspond to inputs for which one of the ReLUs switches between inactive and active states. Thus, first-layer ReLUs can be identified (up to sign and scaling) based on the orientation of their associated hyperplanes. Later-layer ReLU boundaries bend when they cross earlier-layer boundaries and the extent of bending reveals the weights between them. Our algorithm uses this to identify the units in the network and weights connecting them (up to isomorphism). The fact that considerable parts of deep networks can be identified from their outputs has implications for security, neuroscience, and our understanding of neural networks.",/pdf/0998379b513ba311f63f9adaab03d545b9eb37a4.pdf,ICLR,2020,"We show that in many cases it is possible to reconstruct the architecture, weights, and biases of a deep ReLU network given the network's output for specified inputs." 
+S1Bb3D5gg,,1478290000000.0,1490920000000.0,428,Learning End-to-End Goal-Oriented Dialog,"[""abordes@fb.com"", ""ylan@fb.com"", ""jase@fb.com""]","[""Antoine Bordes"", ""Y-Lan Boureau"", ""Jason Weston""]",[],"Traditional dialog systems used in goal-oriented applications require a lot of domain-specific handcrafting, which hinders scaling up to new domains. End-to-end dialog systems, in which all components are trained from the dialogs themselves, escape this limitation. But the encouraging success recently obtained in chit-chat dialog may not carry over to goal-oriented settings. This paper proposes a testbed to break down the strengths and shortcomings of end-to-end dialog systems in goal-oriented applications. Set in the context of restaurant reservation, our tasks require manipulating sentences and symbols, so as to properly conduct conversations, issue API calls and use the outputs of such calls. We show that an end-to-end dialog system based on Memory Networks can reach promising, yet imperfect, performance and learn to perform non-trivial operations. We confirm those results by comparing our system to a hand-crafted slot-filling baseline on data from the second Dialog State Tracking Challenge (Henderson et al., 2014a). We show similar result patterns on data extracted from an online concierge service.",/pdf/75d78e3fc63af8e7aef85fe58f799cb76c70905a.pdf,ICLR,2017,A new open dataset and testbed for training and evaluating end-to-end dialog systems in goal-oriented scenarios. +aKt7FHPQxVV,YK_Z_k2j8d,1601310000000.0,1614990000000.0,2166,Efficient Differentiable Neural Architecture Search with Model Parallelism,"[""~Yi-Wei_Chen1"", ""~Qingquan_Song1"", ""~Xia_Hu4""]","[""Yi-Wei Chen"", ""Qingquan Song"", ""Xia Hu""]","[""Neural Architecture Search"", ""Model Parallel""]","Neural architecture search (NAS) automatically designs effective network architectures. 
Differentiable NAS with supernets that encompass all potential architectures in a large graph cuts down search overhead to few GPU days or less. However, these algorithms consume massive GPU memory, which will restrain NAS from large batch sizes and large search spaces (e.g., more candidate operations, diverse cell structures, and large depth of supernets). In this paper, we present binary neural architecture search (NASB) with consecutive model parallel (CMP) to tackle the problem of insufficient GPU memory. CMP aggregates memory from multiple GPUs for supernets. It divides forward/backward phases into several sub-tasks and executes the same type of sub-tasks together to reduce waiting cycles. This approach improves the hardware utilization of model parallel, but it utilizes large GPU memory. NASB is proposed to reduce memory footprint, which excludes inactive operations from computation graphs and computes those operations on the fly for inactive architectural gradients in backward phases. Experiments show that NASB-CMP runs 1.2× faster than other model parallel approaches and outperforms state-of-the-art differentiable NAS. NASB can also save twice GPU memory more than PC-DARTS. Finally, we apply NASB-CMP to complicated supernet architectures. 
Although deep supernets with diverse cell structures do not improve NAS performance, NASB-CMP shows its potential to explore supernet architecture design in large search space.",/pdf/297bdc897383bdfca49a431f39ef0924c66b3c4b.pdf,ICLR,2021,"We scale up neural architecture search with consecutive model parallel, running 1.2x faster than using other model parallelism" +H1gTEj09FX,HkgPILZuK7,1538090000000.0,1550590000000.0,40,RotDCF: Decomposition of Convolutional Filters for Rotation-Equivariant Deep Networks,"[""xiuyuan.cheng@duke.edu"", ""qiang.qiu@duke.edu"", ""robert.calderbank@duke.edu"", ""guillermo.sapiro@duke.edu""]","[""Xiuyuan Cheng"", ""Qiang Qiu"", ""Robert Calderbank"", ""Guillermo Sapiro""]",[],"Explicit encoding of group actions in deep features makes it possible for convolutional neural networks (CNNs) to handle global deformations of images, which is critical to success in many vision tasks. This paper proposes to decompose the convolutional filters over joint steerable bases across the space and the group geometry simultaneously, namely a rotation-equivariant CNN with decomposed convolutional filters (RotDCF). This decomposition facilitates computing the joint convolution, which is proved to be necessary for the group equivariance. It significantly reduces the model size and computational complexity while preserving performance, and truncation of the bases expansion serves implicitly to regularize the filters. On datasets involving in-plane and out-of-plane object rotations, RotDCF deep features demonstrate greater robustness and interpretability than regular CNNs. The stability of the equivariant representation to input variations is also proved theoretically. 
The RotDCF framework can be extended to groups other than rotations, providing a general approach which achieves both group equivariance and representation stability at a reduced model size.",/pdf/0f82b92976c11c8b2ea93d2b6366907d1412176c.pdf,ICLR,2019, +JbuYF437WB6,WcQkTLtQPh,1601310000000.0,1613410000000.0,1297,Directed Acyclic Graph Neural Networks,"[""~Veronika_Thost1"", ""~Jie_Chen1""]","[""Veronika Thost"", ""Jie Chen""]","[""Graph Neural Networks"", ""Graph Representation Learning"", ""Directed Acyclic Graphs"", ""DAG"", ""Inductive Bias""]","Graph-structured data ubiquitously appears in science and engineering. Graph neural networks (GNNs) are designed to exploit the relational inductive bias exhibited in graphs; they have been shown to outperform other forms of neural networks in scenarios where structure information supplements node features. The most common GNN architecture aggregates information from neighborhoods based on message passing. Its generality has made it broadly applicable. In this paper, we focus on a special, yet widely used, type of graphs---DAGs---and inject a stronger inductive bias---partial ordering---into the neural network design. We propose the directed acyclic graph neural network, DAGNN, an architecture that processes information according to the flow defined by the partial order. DAGNN can be considered a framework that entails earlier works as special cases (e.g., models for trees and models updating node representations recurrently), but we identify several crucial components that prior architectures lack. 
We perform comprehensive experiments, including ablation studies, on representative DAG datasets (i.e., source code, neural architectures, and probabilistic graphical models) and demonstrate the superiority of DAGNN over simpler DAG architectures as well as general graph architectures.",/pdf/435a89f1a5c579c1c5fe7ba3ebef81c09224e075.pdf,ICLR,2021,"We propose DAGNN, a graph neural network tailored to directed acyclic graphs that outperforms conventional GNNs by leveraging the partial order as strong inductive bias besides other suitable architectural features." +ByeDl1BYvH,SyeMkHsuPS,1569440000000.0,1577170000000.0,1508,Global graph curvature,"[""ostroumova-la@yandex-team.ru"", ""sameg@yandex-team.ru"", ""pimvdhoorn@gmail.com""]","[""Liudmila Prokhorenkova"", ""Egor Samosvat"", ""Pim van der Hoorn""]","[""graph curvature"", ""graph embedding"", ""hyperbolic space"", ""distortion"", ""Ollivier curvature"", ""Forman curvature""]","Recently, non-Euclidean spaces became popular for embedding structured data. However, determining suitable geometry and, in particular, curvature for a given dataset is still an open problem. In this paper, we define a notion of global graph curvature, specifically catered to the problem of embedding graphs, and analyze the problem of estimating this curvature using only graph-based characteristics (without actual graph embedding). We show that optimal curvature essentially depends on dimensionality of the embedding space and loss function one aims to minimize via embedding. We review the existing notions of local curvature (e.g., Ollivier-Ricci curvature) and analyze their properties theoretically and empirically. In particular, we show that such curvatures are often unable to properly estimate the global one. 
Hence, we propose a new estimator of global graph curvature specifically designed for zero-one loss function.",/pdf/bb90b0a845a6405788f00ab9b24ef15947f86bc1.pdf,ICLR,2020,Introduce a concept of global graph curvature specifically catered to the problem of embedding graphs and find its connection with popular local graph curvatures. +rkvDssyRb,HyLwjskR-,1509040000000.0,1518730000000.0,161,Multi-Advisor Reinforcement Learning,"[""romain.laroche@gmail.com"", ""mehdi.fatemi@microsoft.com"", ""joshua.romoff@mail.mcgill.ca"", ""havansei@microsoft.com""]","[""Romain Laroche"", ""Mehdi Fatemi"", ""Joshua Romoff"", ""Harm van Seijen""]","[""Reinforcement Learning""]","We consider tackling a single-agent RL problem by distributing it to $n$ learners. These learners, called advisors, endeavour to solve the problem from a different focus. Their advice, taking the form of action values, is then communicated to an aggregator, which is in control of the system. We show that the local planning method for the advisors is critical and that none of the ones found in the literature is flawless: the \textit{egocentric} planning overestimates values of states where the other advisors disagree, and the \textit{agnostic} planning is inefficient around danger zones. We introduce a novel approach called \textit{empathic} and discuss its theoretical aspects. We empirically examine and validate our theoretical findings on a fruit collection task.",/pdf/ddca5f7f90a48b75b89585b6a66615715039c5d9.pdf,ICLR,2018,We consider tackling a single-agent RL problem by distributing it to $n$ learners. 
+D4QFCXGe_z2,dLZu_WaMP-p,1601310000000.0,1614990000000.0,2413,R-LAtte: Attention Module for Visual Control via Reinforcement Learning,"[""~Mandi_Zhao1"", ""~Qiyang_Li1"", ""~Aravind_Srinivas1"", ""~Ignasi_Clavera1"", ""~Kimin_Lee1"", ""~Pieter_Abbeel2""]","[""Mandi Zhao"", ""Qiyang Li"", ""Aravind Srinivas"", ""Ignasi Clavera"", ""Kimin Lee"", ""Pieter Abbeel""]",[],"Attention mechanisms are generic inductive biases that have played a critical role in improving the state-of-the-art in supervised learning, unsupervised pre-training and generative modeling for multiple domains including vision, language and speech. However, they remain relatively under-explored for neural network architectures typically used in reinforcement learning (RL) from high dimensional inputs such as pixels. In this paper, we propose and study the effectiveness of augmenting a simple attention module in the convolutional encoder of an RL agent. Through experiments on the widely benchmarked DeepMind Control Suite environments, we demonstrate that our proposed module can (i) extract interpretable task-relevant information such as agent locations and movements without the need for data augmentations or contrastive losses; (ii) significantly improve the sample-efficiency and final performance of the agents. We hope our simple and effective approach will serve as a strong baseline for future research incorporating attention mechanisms in reinforcement learning and control.",/pdf/59b9c668da89811ed8bafba2befce1d4936e5de1.pdf,ICLR,2021,We propose and study the effectiveness of augmenting a simple attention module in the convolutional encoder of an RL agent +b6BdrqTnFs7,svqhF35iwnF,1601310000000.0,1614990000000.0,120,Grounded Compositional Generalization with Environment Interactions,"[""~Yuanpeng_Li2""]","[""Yuanpeng Li""]","[""compositional generalization"", ""grounding""]","In this paper, we present a compositional generalization approach in grounded agent instruction learning. 
Compositional generalization is an important part of human intelligence, but current neural network models do not have such ability. This is more complicated in multi-modal problems with grounding. Our proposed approach has two main ideas. First, we use interactions between agent and the environment to find components in the output. Second, we apply entropy regularization to learn corresponding input components for each output component. The results show the proposed approach significantly outperforms baselines in most tasks, with more than 25% absolute average accuracy increase. We also investigate the impact of entropy regularization and other changes with ablation study. We hope this work is the first step to address grounded compositional generalization, and it will be helpful in advancing artificial intelligence research. +",/pdf/669be6b7068636016fdbe5ccbd133407c852d4c1.pdf,ICLR,2021, +4YzI0KpRQtZ,CbQ4zt_SR7l,1601310000000.0,1614990000000.0,1408,Streaming Probabilistic Deep Tensor Factorization,"[""~shikai_fang1"", ""~Zheng_Wang2"", ""z.pan@utah.edu"", ""~Ji_Liu1"", ""~Shandian_Zhe1""]","[""shikai fang"", ""Zheng Wang"", ""Zhimeng pan"", ""Ji Liu"", ""Shandian Zhe""]","[""Probabilistic Methods"", ""online learing"", ""tensor factorization""]","Despite the success of existing tensor factorization methods, most of them conduct a multilinear decomposition, and rarely exploit powerful modeling frameworks, like deep neural networks, to capture a variety of complicated interactions in data. More important, for highly expressive, deep factorization, we lack an effective approach to handle streaming data, which are ubiquitous in real-world applications. To address these issues, we propose SPIDER, a Streaming ProbabilistIc Deep tEnsoR factorization method. We first use Bayesian neural networks (NNs) to construct a deep tensor factorization model. We assign a spike-and-slab prior over the NN weights to encourage sparsity and prevent overfitting. 
We then use Taylor expansions and moment matching to approximate the posterior of the NN output and calculate the running model evidence, based on which we develop an efficient streaming posterior inference algorithm in the assumed-density-filtering and expectation propagation framework. Our algorithm provides responsive incremental updates for the posterior of the latent factors and NN weights upon receiving new tensor entries, and meanwhile select and inhibit redundant/useless weights. We show the advantages of our approach in four real-world applications.",/pdf/ea9feb73ad9a46d06e430e8f40b2b1d7f487efdb.pdf,ICLR,2021,A BNN(Bayesian neural networks)-based probabilistic methods for tensor factorization which allows Streaming update and Uncertainty measure +B1GMDsR5tm,rygFOND9tQ,1538090000000.0,1549470000000.0,245,Initialized Equilibrium Propagation for Backprop-Free Training,"[""peter.ed.oconnor@gmail.com"", ""egavves@uva.nl"", ""m.welling@uva.nl""]","[""Peter O'Connor"", ""Efstratios Gavves"", ""Max Welling""]","[""credit assignment"", ""energy-based models"", ""biologically plausible learning""]","Deep neural networks are almost universally trained with reverse-mode automatic differentiation (a.k.a. backpropagation). Biological networks, on the other hand, appear to lack any mechanism for sending gradients back to their input neurons, and thus cannot be learning in this way. In response to this, Scellier & Bengio (2017) proposed Equilibrium Propagation - a method for gradient-based training of neural networks which uses only local learning rules and, crucially, does not rely on neurons having a mechanism for back-propagating an error gradient. Equilibrium propagation, however, has a major practical limitation: inference involves doing an iterative optimization of neural activations to find a fixed-point, and the number of steps required to closely approximate this fixed point scales poorly with the depth of the network. 
In response to this problem, we propose Initialized Equilibrium Propagation, which trains a feedforward network to initialize the iterative inference procedure for Equilibrium propagation. This feed-forward network learns to approximate the state of the fixed-point using a local learning rule. After training, we can simply use this initializing network for inference, resulting in a learned feedforward network. Our experiments show that this network appears to work as well or better than the original version of Equilibrium propagation. This shows how we might go about training deep networks without using backpropagation.",/pdf/1a1d2a635d6ec647684b75c37abaa815a5131b9a.pdf,ICLR,2019,We train a feedforward network without backprop by using an energy-based model to provide local targets +H1LAqMbRW,r1BR5GbRZ,1509140000000.0,1518730000000.0,912,Latent forward model for Real-time Strategy game planning with incomplete information,"[""yuandong@fb.com"", ""qucheng@fb.com""]","[""Yuandong Tian"", ""Qucheng Gong""]","[""Real time strategy"", ""latent space"", ""forward model"", ""monte carlo tree search"", ""reinforcement learning"", ""planning""]","Model-free deep reinforcement learning approaches have shown superhuman performance in simulated environments (e.g., Atari games, Go, etc). During training, these approaches often implicitly construct a latent space that contains key information for decision making. In this paper, we learn a forward model on this latent space and apply it to model-based planning in miniature Real-time Strategy game with incomplete information (MiniRTS). We first show that the latent space constructed from existing actor-critic models contains relevant information of the game, and design training procedure to learn forward models. 
We also show that our learned forward model can predict meaningful future state and is usable for latent space Monte-Carlo Tree Search (MCTS), in terms of win rates against rule-based agents.",/pdf/179b5295fbb1c3bf0b81234694eba23e977e6805.pdf,ICLR,2018,"The paper analyzes the latent space learned by model-free approaches in a miniature incomplete information game, trains a forward model in the latent space and apply it to Monte-Carlo Tree Search, yielding positive performance." +SkgzYiRqtX,r1xQtHcqY7,1538090000000.0,1545360000000.0,424,Graph Neural Networks with Generated Parameters for Relation Extraction,"[""prokilchu@gmail.com"", ""linyk14@mails.tsinghua.edu.cn"", ""liuzy@tsinghua.edu.cn"", ""full.jeffrey@gmail.com"", ""chuats@comp.nus.edu.sg"", ""sms@tsinghua.edu.cn""]","[""Hao Zhu"", ""Yankai Lin"", ""Zhiyuan Liu"", ""Jie Fu"", ""Tat-seng Chua and Maosong Sun""]","[""Graph Neural Networks"", ""Relational Reasoning""]","Recently, progress has been made towards improving relational reasoning in machine learning field. Among existing models, graph neural networks (GNNs) is one of the most effective approaches for multi-hop relational reasoning. In fact, multi-hop relational reasoning is indispensable in many natural language processing tasks such as relation extraction. In this paper, we propose to generate the parameters of graph neural networks (GP-GNNs) according to natural language sentences, which enables GNNs to process relational reasoning on unstructured text inputs. We verify GP-GNNs in relation extraction from text. Experimental results on a human-annotated dataset and two distantly supervised datasets show that our model achieves significant improvements compared to the baselines. 
We also perform a qualitative analysis to demonstrate that our model could discover more accurate relations by multi-hop relational reasoning.",/pdf/b666e5c11e849ceec046b669f30df1b32ef7c3c6.pdf,ICLR,2019,"A graph neural network model with parameters generated from natural languages, which can perform multi-hop reasoning. " +SklgTkBKDr,S1lXrmJYPS,1569440000000.0,1577170000000.0,1976,Neural Non-additive Utility Aggregation,"[""mzopf@ke.tu-darmstadt.de""]","[""Markus Zopf""]",[],"Neural architectures for set regression problems aim at learning representations such that good predictions can be made based on the learned representations. This strategy, however, ignores the fact that meaningful intermediate results might be helpful to perform well. We study two new architectures that explicitly model latent intermediate utilities and use non-additive utility aggregation to estimate the set utility based on the latent utilities. We evaluate the new architectures with visual and textual datasets, which have non-additive set utilities due to redundancy and synergy effects. We find that the new architectures perform substantially better in this setup.",/pdf/04651b1b4f111f9ca9ebed7fce05c021fea048fc.pdf,ICLR,2020,We propose two new neural architectures that explicitly model latent intermediate element utilities for non-additive set utility estimation. +S1xxx64YwH,H1lXkJ0BDr,1569440000000.0,1577170000000.0,326,Ecological Reinforcement Learning,"[""jcoreyes@eecs.berkeley.edu"", ""suvansh@berkeley.edu"", ""gberseth@gmail.com"", ""abhigupta@berkeley.edu"", ""svlevine@eecs.berkeley.edu""]","[""John D. 
Co-Reyes"", ""Suvansh Sanjeev"", ""Glen Berseth"", ""Abhishek Gupta"", ""Sergey Levine""]","[""non-episodic"", ""environment analysis"", ""reward shaping"", ""curriculum learning""]","Reinforcement learning algorithms have been shown to effectively learn tasks in a variety of static, deterministic, and simplistic environments, but their application to environments which are characteristic of dynamic lifelong settings encountered in the real world has been limited. Understanding the impact of specific environmental properties on the learning dynamics of reinforcement learning algorithms is important as we want to align the environments in which we develop our algorithms with the real world, and this is strongly coupled with the type of intelligence which can be learned. In this work, we study what we refer to as ecological reinforcement learning: the interaction between properties of the environment and the reinforcement learning agent. To this end, we introduce environments with characteristics that we argue better reflect natural environments: non-episodic learning, uninformative ``fundamental drive'' reward signals, and natural dynamics that cause the environment to change even when the agent fails to take intelligent actions. We show these factors can have a profound effect on the learning progress of reinforcement learning algorithms. Surprisingly, we find that these seemingly more challenging learning conditions can often make reinforcement learning agents learn more effectively. Through this study, we hope to shift the focus of the community towards learning in realistic, natural environments with dynamic elements.",/pdf/ab410b0e9755809ecd4e5d471fe710b07f5148b0.pdf,ICLR,2020, +HkgrZ0EYwB,HJloiX4OPH,1569440000000.0,1583910000000.0,963,Unpaired Point Cloud Completion on Real Scans using Adversarial Training,"[""xuelin.chen.sdu@gmail.com"", ""baoquan.chen@gmail.com"", ""n.mitra@cs.ucl.ac.uk""]","[""Xuelin Chen"", ""Baoquan Chen"", ""Niloy J. 
Mitra""]","[""point cloud completion"", ""generative adversarial network"", ""real scans""]","As 3D scanning solutions become increasingly popular, several deep learning setups have been developed for the task of scan completion, i.e., plausibly filling in regions that were missed in the raw scans. These methods, however, largely rely on supervision in the form of paired training data, i.e., partial scans with corresponding desired completed scans. While these methods have been successfully demonstrated on synthetic data, the approaches cannot be directly used on real scans in absence of suitable paired training data. We develop a first approach that works directly on input point clouds, does not require paired training data, and hence can directly be applied to real scans for scan completion. We evaluate the approach qualitatively on several real-world datasets (ScanNet, Matterport3D, KITTI), quantitatively on 3D-EPN shape completion benchmark dataset, and demonstrate realistic completions under varying levels of incompleteness. +",/pdf/4a712a30b6fa44e882a1eacc340462f23b2d5210.pdf,ICLR,2020, +r1lohoCqY7,ryl98139tm,1538090000000.0,1555380000000.0,743,Learning-Based Frequency Estimation Algorithms,"[""cyhsu@mit.edu"", ""indyk@mit.edu"", ""dina@csail.mit.edu"", ""vakilian@mit.edu""]","[""Chen-Yu Hsu"", ""Piotr Indyk"", ""Dina Katabi"", ""Ali Vakilian""]","[""streaming algorithms"", ""heavy-hitters"", ""Count-Min"", ""Count-Sketch""]","Estimating the frequencies of elements in a data stream is a fundamental task in data analysis and machine learning. The problem is typically addressed using streaming algorithms which can process very large data using limited storage. Today's streaming algorithms, however, cannot exploit patterns in their input to improve performance. We propose a new class of algorithms that automatically learn relevant patterns in the input data and use them to improve its frequency estimates. 
The proposed algorithms combine the benefits of machine learning with the formal guarantees available through algorithm theory. We prove that our learning-based algorithms have lower estimation errors than their non-learning counterparts. We also evaluate our algorithms on two real-world datasets and demonstrate empirically their performance gains.",/pdf/933864fa625dad4f0cacd07d5d9ea5bfad36294e.pdf,ICLR,2019,"Data stream algorithms can be improved using deep learning, while retaining performance guarantees." +rklv-a4tDB,B1gzrPwUDH,1569440000000.0,1577170000000.0,378,Mesh-Free Unsupervised Learning-Based PDE Solver of Forward and Inverse problems,"[""barleah.libra@gmail.com"", ""sochen@tauex.tau.ac.il""]","[""Leah Bar"", ""Nir Sochen""]","[""PDEs"", ""forward problems"", ""inverse problems"", ""unsupervised learning"", ""deep networks"", ""EIT""]","We introduce a novel neural network-based partial differential equations solver for forward and inverse problems. The solver is grid free, mesh free and shape free, and the solution is approximated by a neural network. +We employ an unsupervised approach such that the input to the network is a points set in an arbitrary domain, and the output is the +set of the corresponding function values. The network is trained to minimize deviations of the learned function from the PDE solution and +satisfy the boundary conditions. +The resulting solution in turn is an explicit smooth differentiable function with a known analytical form. + +Unlike other numerical methods such as finite differences and finite elements, the derivatives of the desired function can be analytically calculated to any order. This framework therefore, enables the solution of high order non-linear PDEs. The proposed algorithm is a unified formulation of both forward and inverse problems +where the optimized loss function consists of few elements: fidelity terms of L2 and L infinity norms, boundary conditions constraints and additional regularizers. 
This setting is flexible in the sense that regularizers can be tailored to specific +problems. We demonstrate our method on a free shape 2D second order elliptical system with application to Electrical Impedance Tomography (EIT). ",/pdf/3a45fee976612793fe46797d1849497a95222c34.pdf,ICLR,2020,Solving PDEs with deep learning techniques in an unsupervised fashion with regularizers for forward and inverse problems. +SJxE8erKDH,S1xlAYgYwH,1569440000000.0,1586260000000.0,2319,Latent Normalizing Flows for Many-to-Many Cross-Domain Mappings,"[""mahajan@aiphes.tu-darmstadt.de"", ""gurevych@ukp.informatik.tu-darmstadt.de"", ""stefan.roth@visinf.tu-darmstadt.de""]","[""Shweta Mahajan"", ""Iryna Gurevych"", ""Stefan Roth""]",[],"Learned joint representations of images and text form the backbone of several important cross-domain tasks such as image captioning. Prior work mostly maps both domains into a common latent representation in a purely supervised fashion. This is rather restrictive, however, as the two domains follow distinct generative processes. Therefore, we propose a novel semi-supervised framework, which models shared information between domains and domain-specific information separately. +The information shared between the domains is aligned with an invertible neural network. Our model integrates normalizing flow-based priors for the domain-specific information, which allows us to learn diverse many-to-many mappings between the two domains. 
We demonstrate the effectiveness of our model on diverse tasks, including image captioning and text-to-image synthesis.",/pdf/9bf24cd3e7eed0d497fd4a2c0493680d463c1563.pdf,ICLR,2020, +#NAME?,y44_kmBrKKN,1601310000000.0,1614990000000.0,1612,Translation Memory Guided Neural Machine Translation,"[""~Shaohui_Kuang1"", ""~Heng_Yu1"", ""weihua.luowh@alibaba-inc.com"", ""~Qiang_Wang8""]","[""Shaohui Kuang"", ""Heng Yu"", ""Weihua Luo"", ""Qiang Wang""]","[""neural machine translation"", ""translation memory"", ""pre-train language model""]","Many studies have proven that Translation Memory (TM) can help improve the translation quality of neural machine translation (NMT). Existing ways either employ extra encoder to encode information from TM or concatenate source sentence and TM sentences as encoder's input. These previous methods don't model the semantic relationship between the source sentence and TM sentences. Meanwhile, the training corpus related to TM is limited, and the sentence level retrieval approach further limits its scale. +In this paper, we propose a novel method to combine the strengths of both TM and NMT. We treat the matched sentence pair of TM as the additional signal and apply one encoder enhanced by the pre-trained language model (PLM) to encode the TM information and source sentence together. Additionally, we extend the sentence level retrieval method to the n-gram retrieval method that we don't need to calculate the similarity score. Further, we explore new methods to manipulate the information flow from TM to the NMT decoder. We validate our proposed methods on a mixed test set of multiple domains. 
Experiment results demonstrate that the proposed methods can significantly improve the translation quality and show strong adaptation for an unknown or new domain.",/pdf/74cf7b8b193950ce79dd7e9a16023427b73e5f3b.pdf,ICLR,2021, +r1gzdhEKvH,HJxzKzAk8B,1569440000000.0,1577170000000.0,33,Neural Linear Bandits: Overcoming Catastrophic Forgetting through Likelihood Matching,"[""tomzahavy@gmail.com"", ""shiemannor@gmail.com""]","[""Tom Zahavy"", ""Shie Mannor""]",[],"We study neural-linear bandits for solving problems where both exploration and representation learning play an important role. Neural-linear bandits leverage the representation power of deep neural networks and combine it with efficient exploration mechanisms, designed for linear contextual bandits, on top of the last hidden layer. Since the representation is being optimized during learning, information regarding exploration with ""old"" features is lost. Here, we propose the first limited memory neural-linear bandit that is resilient to this catastrophic forgetting phenomenon. We perform simulations on a variety of real-world problems, including regression, classification, and sentiment analysis, and observe that our algorithm achieves superior performance and shows resilience to catastrophic forgetting. ",/pdf/f8137b4a6eef0c217ff45f8d5cbff4af2a37997d.pdf,ICLR,2020,Neural-linear bandits combine linear contextual bandits with deep neural networks to solve problems where both exploration and representation learning play an important role. 
+S1q_Cz-Cb,Syzf0GZR-,1509140000000.0,1518730000000.0,1058,Training Neural Machines with Partial Traces,"[""matthew.mirman@inf.ethz.ch"", ""dpavle@student.ethz.ch"", ""dimitar.dimitrov@inf.ethz.ch"", ""timon.gehr@inf.ethz.ch"", ""martin.vechev@inf.ethz.ch""]","[""Matthew Mirman"", ""Dimitar Dimitrov"", ""Pavle Djordjevich"", ""Timon Gehr"", ""Martin Vechev""]","[""Neural Abstract Machines"", ""Neural Turing Machines"", ""Neural Random Access Machines"", ""Program Synthesis"", ""Program Induction""]","We present a novel approach for training neural abstract architectures which incorporates (partial) supervision over the machine’s interpretable components. To cleanly capture the set of neural architectures to which our method applies, we introduce the concept of a differential neural computational machine (∂NCM) and show that several existing architectures (e.g., NTMs, NRAMs) can be instantiated as a ∂NCM and can thus benefit from any amount of additional supervision over their interpretable components. Based on our method, we performed a detailed experimental evaluation with both, the NTM and NRAM architectures, and showed that the approach leads to significantly better convergence and generalization capabilities of the learning phase than when training using only input-output examples. +",/pdf/2a42fa8aa2a9f58d0907a11bc1a5378f7b268690.pdf,ICLR,2018,We increase the amount of trace supervision possible to utilize when training fully differentiable neural machine architectures. +Hk4_qw5xe,,1478290000000.0,1484240000000.0,415,Towards Principled Methods for Training Generative Adversarial Networks,"[""martinarjovsky@gmail.com"", ""leonb@fb.com""]","[""Martin Arjovsky"", ""Leon Bottou""]",[],"The goal of this paper is not to introduce a single algorithm or method, but to make theoretical steps towards fully understanding the training dynamics of generative adversarial networks. 
In order to substantiate our theoretical analysis, we perform targeted experiments to verify our assumptions, illustrate our claims, and quantify the phenomena. This paper is divided into three sections. The first section introduces the problem at hand. The second section is dedicated to studying and proving rigorously the problems including instability and saturation that arise when training generative adversarial networks. The third section examines a practical and theoretically grounded direction towards solving these problems, while introducing new tools to study them.",/pdf/a39d6b9d0e4f4746a02732368a9fa08d458f0a45.pdf,ICLR,2017,We introduce a theory about generative adversarial networks and their issues. +NzTU59SYbNq,vbNObtxezub,1601310000000.0,1615920000000.0,2114,EigenGame: PCA as a Nash Equilibrium,"[""~Ian_Gemp1"", ""~Brian_McWilliams2"", ""~Claire_Vernade1"", ""~Thore_Graepel1""]","[""Ian Gemp"", ""Brian McWilliams"", ""Claire Vernade"", ""Thore Graepel""]","[""pca"", ""principal components analysis"", ""nash"", ""games"", ""eigendecomposition"", ""svd"", ""singular value decomposition""]",We present a novel view on principal components analysis as a competitive game in which each approximate eigenvector is controlled by a player whose goal is to maximize their own utility function. We analyze the properties of this PCA game and the behavior of its gradient based updates. The resulting algorithm---which combines elements from Oja's rule with a generalized Gram-Schmidt orthogonalization---is naturally decentralized and hence parallelizable through message passing. We demonstrate the scalability of the algorithm with experiments on large image datasets and neural network activations. 
We discuss how this new view of PCA as a differentiable game can lead to further algorithmic developments and insights.,/pdf/05bd1b2e95cfd532a3902be824f0945943dc7503.pdf,ICLR,2021,We formulate the solution to PCA as the Nash of a suitable game with accompanying algorithm that we demonstrate on a 200TB dataset. +0Zxk3ynq7jE,DRoBtVF0kP,1601310000000.0,1614990000000.0,695,An Empirical Exploration of Open-Set Recognition via Lightweight Statistical Pipelines,"[""~Shu_Kong1"", ""~Deva_Ramanan1""]","[""Shu Kong"", ""Deva Ramanan""]","[""open-set recognition"", ""anomaly detection"", ""statistical models"", ""Gaussian Mixture Models"", ""open-world image classification"", ""open-world semantic segmentation""]","Machine-learned safety-critical systems need to be self-aware and reliably know their unknowns in the open-world. This is often explored through the lens of anomaly/outlier detection or out-of-distribution modeling. One popular formulation is that of open-set classification, where an image classifier trained for 1-of-$K$ classes should also recognize images belonging to a $(K+1)^{th}$ ""other"" class, not present in the training set. Recent work has shown that, somewhat surprisingly, most if not all existing open-world methods do not work well on high-dimensional open-world images (Shafaei et al. 2019). In this paper, we carry out an empirical exploration of open-set classification, and find that combining classic statistical methods with carefully computed features can dramatically outperform prior work. We extract features from off-the-shelf (OTS) state-of-the-art networks for the underlying $K$-way closed-world task. 
We leverage insights from the retrieval community for computing feature descriptors that are low-dimensional (via pooling and PCA) and normalized (via L2-normalization), enabling the modeling of training data densities via classic statistical tools such as kmeans and Gaussian Mixture Models (GMMs).",/pdf/237dbcf59478d10a8ff2534706de86346d1f13a2.pdf,ICLR,2021,"The paper proposes an empirical pipeline for open-world recognition based on classic statistical models, which are built on properly processed deep off-the-shelf features and achieve state-of-the-art performance under various setups." +BkfEzz-0-,HkxNzMZCW,1509130000000.0,1518730000000.0,787,Neuron as an Agent,"[""ohsawa@weblab.t.u-tokyo.ac.jp"", ""akuzawa-kei@weblab.t.u-tokyo.ac.jp"", ""matsushima@weblab.t.u-tokyo.ac.jp"", ""gustavo@weblab.t.u-tokyo.ac.jp"", ""iwasawa@weblab.t.u-tokyo.ac.jp"", ""kjn@jp.ibm.com"", ""s.takenaka@aediworks.com"", ""matsuo@weblab.t.u-tokyo.ac.jp""]","[""Shohei Ohsawa"", ""Kei Akuzawa"", ""Tatsuya Matsushima"", ""Gustavo Bezerra"", ""Yusuke Iwasawa"", ""Hiroshi Kajino"", ""Seiya Takenaka"", ""Yutaka Matsuo""]","[""Multi-agent Reinforcement Learning"", ""Communication"", ""Reward Distribution"", ""Trusted Third Party"", ""Auction Theory""]","Existing multi-agent reinforcement learning (MARL) communication methods have relied on a trusted third party (TTP) to distribute reward to agents, leaving them inapplicable in peer-to-peer environments. This paper proposes reward distribution using {\em Neuron as an Agent} (NaaA) in MARL without a TTP with two key ideas: (i) inter-agent reward distribution and (ii) auction theory. Auction theory is introduced because inter-agent reward distribution is insufficient for optimization. Agents in NaaA maximize their profits (the difference between reward and cost) and, as a theoretical result, the auction mechanism is shown to have agents autonomously evaluate counterfactual returns as the values of other agents. 
NaaA enables representation trades in peer-to-peer environments, ultimately regarding unit in neural networks as agents. Finally, numerical experiments (a single-agent environment from OpenAI Gym and a multi-agent environment from ViZDoom) confirm that NaaA framework optimization leads to better performance in reinforcement learning.",/pdf/888d31698082eb121137da7477bc6a38f567fea0.pdf,ICLR,2018,Neuron as an Agent (NaaA) enable us to train multi-agent communication without a trusted third party. +SJxDDpEKvH,rygA2xjDwH,1569440000000.0,1588230000000.0,600,Counterfactuals uncover the modular structure of deep generative models,"[""michel.besserve@tuebingen.mpg.de"", ""mehrjou.arash@gmail.com"", ""remy.sun@ens-rennes.fr"", ""bs@tuebingen.mpg.de""]","[""Michel Besserve"", ""Arash Mehrjou"", ""R\u00e9my Sun"", ""Bernhard Sch\u00f6lkopf""]","[""generative models"", ""causality"", ""counterfactuals"", ""representation learning"", ""disentanglement"", ""generalization"", ""unsupervised learning""]","Deep generative models can emulate the perceptual properties of complex image datasets, providing a latent representation of the data. However, manipulating such representation to perform meaningful and controllable transformations in the data space remains challenging without some form of supervision. While previous work has focused on exploiting statistical independence to \textit{disentangle} latent factors, we argue that such requirement can be advantageously relaxed and propose instead a non-statistical framework that relies on identifying a modular organization of the network, based on counterfactual manipulations. Our experiments support that modularity between groups of channels is achieved to a certain degree on a variety of generative models. 
This allowed the design of targeted interventions on complex image datasets, opening the way to applications such as computationally efficient style transfer and the automated assessment of robustness to contextual changes in pattern recognition systems.",/pdf/805e6f0a14245c98c51f22eec52acd36934b359f.pdf,ICLR,2020,We develop a framework to find modular internal representations in generative models and manipulate then to generate counterfactual examples. +BylKwnEYvS,BkeuM7WXHB,1569440000000.0,1577170000000.0,12,Star-Convexity in Non-Negative Matrix Factorization,"[""njb225@cornell.edu"", ""gomes@cs.cornell.edu"", ""kilianweinberger@cornell.edu""]","[""Johan Bjorck"", ""Carla Gomes"", ""Kilian Weinberger""]","[""nmf"", ""convexity"", ""nonconvex optimization"", ""average-case-analysis""]","Non-negative matrix factorization (NMF) is a highly celebrated algorithm for matrix decomposition that guarantees strictly non-negative factors. The underlying optimization problem is computationally intractable, yet in practice gradient descent based solvers often find good solutions. This gap between computational hardness and practical success mirrors recent observations in deep learning, where it has been the focus of extensive discussion and analysis. In this paper we revisit the NMF optimization problem and analyze its loss landscape in non-worst-case settings. It has recently been observed that gradients in deep networks tend to point towards the final minimizer throughout the optimization. We show that a similar property holds (with high probability) for NMF, provably in a non-worst case model with a planted solution, and empirically across an extensive suite of real-world NMF problems. Our analysis predicts that this property becomes more likely with growing number of parameters, and experiments suggest that a similar trend might also hold for deep neural networks --- turning increasing data sets and models into a blessing from an optimization perspective. 
",/pdf/fa2634d5411cf9a5c139258768e86e55eda7d3d9.pdf,ICLR,2020, +BkgqL0EtPH,SygJxmvODH,1569440000000.0,1577170000000.0,1148,{COMPANYNAME}11K: An Unsupervised Representation Learning Dataset for Arrhythmia Subtype Discovery,"[""shawn@wtf.sg"", ""guillaume.androz@icentia.com"", ""doctor.ahmad89@gmail.com"", ""pierre.fecteau@icentia.com"", ""aaron.courville@gmail.com"", ""yoshua.bengio@mila.quebec"", ""joseph@josephpcohen.com""]","[""Shawn Tan"", ""Guillaume Androz"", ""Ahmad Chamseddine"", ""Pierre Fecteau"", ""Aaron Courville"", ""Yoshua Bengio"", ""Joseph Paul Cohen""]","[""representation learning"", ""healthcare"", ""medical"", ""clinical"", ""dataset"", ""ecg"", ""cardiology"", ""heart"", ""discovery"", ""anomaly detection"", ""out of distribution""]","We release the largest public ECG dataset of continuous raw signals for representation learning containing over 11k patients and 2 billion labelled beats. Our goal is to enable semi-supervised ECG models to be made as well as to discover unknown subtypes of arrhythmia and anomalous ECG signal events. To this end, we propose an unsupervised representation learning task, evaluated in a semi-supervised fashion. We provide a set of baselines for different feature extractors that can be built upon. Additionally, we perform qualitative evaluations on results from PCA embeddings, where we identify some clustering of known subtypes indicating the potential for representation learning in arrhythmia sub-type discovery.",/pdf/fd13274db8244868c0db85bb2b6235baa5050444.pdf,ICLR,2020,"We release a dataset constructed from single-lead ECG data from 11,000 patients who were prescribed to use the {DEVICENAME}(TM) device." 
+Twm9LnWK-zt,m2nA_9-EVs,1601310000000.0,1614990000000.0,2548,Searching towards Class-Aware Generators for Conditional Generative Adversarial Networks,"[""~Peng_Zhou2"", ""~Lingxi_Xie1"", ""~XIAOPENG_ZHANG7"", ""~Bingbing_Ni3"", ""~Qi_Tian3""]","[""Peng Zhou"", ""Lingxi Xie"", ""XIAOPENG ZHANG"", ""Bingbing Ni"", ""Qi Tian""]","[""NAS"", ""cGAN""]","Conditional Generative Adversarial Networks (cGAN) were designed to generate images based on the provided conditions, e.g., class-level distributions. However, existing methods have used the same generating architecture for all classes. This paper presents a novel idea that adopts NAS to find a distinct architecture for each class. The search space contains regular and class-modulated convolutions, where the latter is designed to introduce class-specific information while avoiding the reduction of training data for each class generator. The search algorithm follows a weight-sharing pipeline with mixed-architecture optimization so that the search cost does not grow with the number of classes. To learn the sampling policy, a Markov decision process is embedded into the search algorithm and a moving average is applied for better stability. We evaluate our approach on CIFAR10 and CIFAR100. Besides achieving better image generation quality in terms of FID scores, we discover several insights that are helpful in designing cGAN models.",/pdf/2a18bae05464859f556d49bb5dcefe61998a4063.pdf,ICLR,2021,We implement a class-aware generator model through NAS. +ryxjnREFwH,Byej35YOwH,1569440000000.0,1583910000000.0,1369,Neural Symbolic Reader: Scalable Integration of Distributed and Symbolic Representations for Reading Comprehension,"[""xinyun.chen@berkeley.edu"", ""crazydonkey@google.com"", ""adamsyuwei@google.com"", ""dennyzhou@google.com"", ""dawnsong.travel@gmail.com"", ""qvl@google.com""]","[""Xinyun Chen"", ""Chen Liang"", ""Adams Wei Yu"", ""Denny Zhou"", ""Dawn Song"", ""Quoc V. 
Le""]","[""neural symbolic"", ""reading comprehension"", ""question answering""]","Integrating distributed representations with symbolic operations is essential for reading comprehension requiring complex reasoning, such as counting, sorting and arithmetics, but most existing approaches are hard to scale to more domains or more complex reasoning. In this work, we propose the Neural Symbolic Reader (NeRd), which includes a reader, e.g., BERT, to encode the passage and question, and a programmer, e.g., LSTM, to generate a program that is executed to produce the answer. Compared to previous works, NeRd is more scalable in two aspects: (1) domain-agnostic, i.e., the same neural architecture works for different domains; (2) compositional, i.e., when needed, complex programs can be generated by recursively applying the predefined operators, which become executable and interpretable representations for more complex reasoning. Furthermore, to overcome the challenge of training NeRd with weak supervision, we apply data augmentation techniques and hard Expectation-Maximization (EM) with thresholding. On DROP, a challenging reading comprehension dataset that requires discrete reasoning, NeRd achieves 1.37%/1.18% absolute improvement over the state-of-the-art on EM/F1 metrics. With the same architecture, NeRd significantly outperforms the baselines on MathQA, a math problem benchmark that requires multiple steps of reasoning, by 25.5% absolute increment on accuracy when trained on all the annotated programs. 
More importantly, NeRd still beats the baselines even when only 20% of the program annotations are given.",/pdf/04395f69956d84261c9511e873b29bc6d012f29c.pdf,ICLR,2020, +BkMXkhA5Fm,r1l0zTTqKX,1538090000000.0,1545360000000.0,974,Learning State Representations in Complex Systems with Multimodal Data,"[""pavel.solovev.ilich@gmail.com"", ""vldr.aliev@gmail.com"", ""pavelosta@gmail.com"", ""sterkin.gleb@gmail.com"", ""elimohl@gmail.com"", ""troeshust96@gmail.com"", ""windj007@gmail.com"", ""antonagoo@gmail.com"", ""olegkhomenkoru@gmail.com"", ""snikolenko@gmail.com""]","[""Pavel Solovev"", ""Vladimir Aliev"", ""Pavel Ostyakov"", ""Gleb Sterkin"", ""Elizaveta Logacheva"", ""Stepan Troeshestov"", ""Roman Suvorov"", ""Anton Mashikhin"", ""Oleg Khomenko"", ""Sergey I. Nikolenko""]","[""deep learning"", ""representation learning"", ""state representation"", ""disentangled representation"", ""dataset"", ""autonomous system"", ""temporal multimodal data""]","Representation learning becomes especially important for complex systems with multimodal data sources such as cameras or sensors. Recent advances in reinforcement learning and optimal control make it possible to design control algorithms on these latent representations, but the field still lacks a large-scale standard dataset for unified comparison. In this work, we present a large-scale dataset and evaluation framework for representation learning for the complex task of landing an airplane. We implement and compare several approaches to representation learning on this dataset in terms of the quality of simple supervised learning tasks and disentanglement scores. 
The resulting representations can be used for further tasks such as anomaly detection, optimal control, model-based reinforcement learning, and other applications.",/pdf/4616b67729dd3160cade83f9b479b3a660da72d9.pdf,ICLR,2019,"Multimodal synthetic dataset, collected from X-plane flight simulator, used for learning state representation and unified evaluation framework for representation learning" +S1en0sRqKm,SylrH0p5tX,1538090000000.0,1545360000000.0,932,On the Computational Inefficiency of Large Batch Sizes for Stochastic Gradient Descent,"[""noah.golmant@berkeley.edu"", ""nikitavemuri@berkeley.edu"", ""zheweiy@berkeley.edu"", ""vladf@berkeley.edu"", ""amirgh@berkeley.edu"", ""kai.rothauge@berkeley.edu"", ""mmahoney@stat.berkeley.edu"", ""jegonzal@cs.berkeley.edu""]","[""Noah Golmant"", ""Nikita Vemuri"", ""Zhewei Yao"", ""Vladimir Feinberg"", ""Amir Gholami"", ""Kai Rothauge"", ""Michael Mahoney"", ""Joseph Gonzalez""]","[""Deep learning"", ""large batch training"", ""scaling rules"", ""stochastic gradient descent""]","Increasing the mini-batch size for stochastic gradient descent offers significant opportunities to reduce wall-clock training time, but there are a variety of theoretical and systems challenges that impede the widespread success of this technique (Das et al., 2016; Keskar et al., 2016). We investigate these issues, with an emphasis on time to convergence and total computational cost, through an extensive empirical analysis of network training across several architectures and problem domains, including image classification, image segmentation, and language modeling. Although it is common practice to increase the batch size in order to fully exploit available computational resources, we find a substantially more nuanced picture. Our main finding is that across a wide range of network architectures and problem domains, increasing the batch size beyond a certain point yields no decrease in wall-clock time to convergence for either train or test loss. 
This batch size is usually substantially below the capacity of current systems. We show that popular training strategies for large batch size optimization begin to fail before we can populate all available compute resources, and we show that the point at which these methods break down depends more on attributes like model architecture and data complexity than it does directly on the size of the dataset.",/pdf/779ab8c3f35f53393b64ff46209b4044b4427d59.pdf,ICLR,2019,Large batch training results in rapidly diminishing returns in wall-clock time to convergence to find a good model. +jPSYH47QSZL,lKKl_1f-3Q1,1601310000000.0,1614990000000.0,992,Pre-Training by Completing Point Clouds,"[""~Hanchen_Wang1"", ""~Qi_Liu5"", ""~Xiangyu_Yue1"", ""~Joan_Lasenby1"", ""~Matt_Kusner1""]","[""Hanchen Wang"", ""Qi Liu"", ""Xiangyu Yue"", ""Joan Lasenby"", ""Matt Kusner""]","[""self-supervised learning"", ""pre-training"", ""point clouds""]","There has recently been a flurry of exciting advances in deep learning models on point clouds. However, these advances have been hampered by the difficulty of creating labelled point cloud datasets: sparse point clouds often have unclear label identities for certain points, while dense point clouds are time-consuming to annotate. Inspired by mask-based pre-training in the natural language processing community, we propose a pre-training mechanism based on point cloud completion. It works by masking occluded points that result from observations at different camera views. It then optimizes a completion model that learns how to reconstruct the occluded points, given the partial point cloud. In this way, our method learns a pre-trained representation that can identify the visual constraints inherently embedded in real-world point clouds. We call our method Occlusion Completion (OcCo). 
We demonstrate that OcCo learns representations that improve the semantic understandings as well as generalization on downstream tasks over prior methods, transfer to different datasets, reduce training time and improve label efficiency.",/pdf/027dfc014acee997a48021c25e892b13ba566d18.pdf,ICLR,2021,We present a self-supervised pre-training technique that learns to complete occluded point clouds. +Byg5ZANtvH,rylIa8NuPB,1569440000000.0,1583910000000.0,976,Short and Sparse Deconvolution --- A Geometric Approach,"[""y.lau@columbia.edu"", ""qq213@nyu.edu"", ""hk2673@columbia.edu"", ""pz2230@columbia.edu"", ""yz2557@cornell.edu"", ""jw2966@columbia.edu""]","[""Yenson Lau"", ""Qing Qu"", ""Han-Wen Kuo"", ""Pengcheng Zhou"", ""Yuqian Zhang"", ""John Wright""]",[],"Short-and-sparse deconvolution (SaSD) is the problem of extracting localized, recurring motifs in signals with spatial or temporal structure. Variants of this problem arise in applications such as image deblurring, microscopy, neural spike sorting, and more. The problem is challenging in both theory and practice, as natural optimization formulations are nonconvex. Moreover, practical deconvolution problems involve smooth motifs (kernels) whose spectra decay rapidly, resulting in poor conditioning and numerical challenges. This paper is motivated by recent theoretical advances \citep{zhang2017global,kuo2019geometry}, which characterize the optimization landscape of a particular nonconvex formulation of SaSD. This is used to derive a provable algorithm that exactly solves certain non-practical instances of the SaSD problem. We leverage the key ideas from this theory (sphere constraints, data-driven initialization) to develop a practical algorithm, which performs well on data arising from a range of application areas. We highlight key additional challenges posed by the ill-conditioning of real SaSD problems and suggest heuristics (acceleration, continuation, reweighting) to mitigate them. 
Experiments demonstrate the performance and generality of the proposed method.",/pdf/8cb35e242b7d8e9629eb764ef53ab1bd262667ae.pdf,ICLR,2020, +TK_6nNb_C7q,wVNScSKeadq,1601310000000.0,1616010000000.0,2103,Hierarchical Autoregressive Modeling for Neural Video Compression,"[""~Ruihan_Yang1"", ""~Yibo_Yang1"", ""~Joseph_Marino1"", ""~Stephan_Mandt1""]","[""Ruihan Yang"", ""Yibo Yang"", ""Joseph Marino"", ""Stephan Mandt""]","[""Compression"", ""Video Compression"", ""Generative Models"", ""Autoregressive Models""]","Recent work by Marino et al. (2020) showed improved performance in sequential density estimation by combining masked autoregressive flows with hierarchical latent variable models. We draw a connection between such autoregressive generative models and the task of lossy video compression. Specifically, we view recent neural video compression methods (Lu et al., 2019; Yang et al., 2020b; Agustsson et al., 2020) as instances of a generalized stochastic temporal autoregressive transform, and propose avenues for enhancement based on this insight. Comprehensive evaluations on large-scale video data show improved rate-distortion performance over both state-of-the-art neural and conventional video compression methods.",/pdf/aa19952024b8087f98c378025770cfb48b752442.pdf,ICLR,2021, +BkxDxJHFDr,BklDyriuvB,1569440000000.0,1577170000000.0,1509,Power up! Robust Graph Convolutional Network based on Graph Powering,"[""jinming@berkeley.edu"", ""changh@berkeley.edu"", ""wwzhu@tsinghua.edu.cn"", ""sojoudi@berkeley.edu""]","[""Ming Jin"", ""Heng Chang"", ""Wenwu Zhu"", ""Somayeh Sojoudi""]","[""graph mining"", ""graph neural network"", ""adversarial robustness""]","Graph convolutional networks (GCNs) are powerful tools for graph-structured data. However, they have been recently shown to be vulnerable to topological attacks. To enhance adversarial robustness, we go beyond spectral graph theory to robust graph theory. 
By challenging the classical graph Laplacian, we propose a new convolution operator that is provably robust in the spectral domain and is incorporated in the GCN architecture to improve expressivity and interpretability. By extending the original graph to a sequence of graphs, we also propose a robust training paradigm that encourages transferability across graphs that span a range of spatial and spectral characteristics. The proposed approaches are demonstrated in extensive experiments to {simultaneously} improve performance in both benign and adversarial situations. ",/pdf/e1fc93c11f21fa399f0fc9184f912e50448516f3.pdf,ICLR,2020,We propose a framework for robust graph neural networks based on graph powering +Hye1kTVFDS,HyeuOKzSwB,1569440000000.0,1583910000000.0,285,The Variational Bandwidth Bottleneck: Stochastic Evaluation on an Information Budget,"[""anirudhgoyal9119@gmail.com"", ""yoshua.bengio@mila.quebec"", ""botvinick@google.com"", ""svlevine@eecs.berkeley.edu""]","[""Anirudh Goyal"", ""Yoshua Bengio"", ""Matthew Botvinick"", ""Sergey Levine""]","[""Variational Information Bottleneck"", ""Reinforcement learning""]","In many applications, it is desirable to extract only the relevant information from complex input data, which involves making a decision about which input features are relevant. +The information bottleneck method formalizes this as an information-theoretic optimization problem by maintaining an optimal tradeoff between compression (throwing away irrelevant input information), and predicting the target. In many problem settings, including the reinforcement learning problems we consider in this work, we might prefer to compress only part of the input. This is typically the case when we have a standard conditioning input, such as a state observation, and a ``privileged'' input, which might correspond to the goal of a task, the output of a costly planning algorithm, or communication with another agent. 
In such cases, we might prefer to compress the privileged input, either to achieve better generalization (e.g., with respect to goals) or to minimize access to costly information (e.g., in the case of communication). Practical implementations of the information bottleneck based on variational inference require access to the privileged input in order to compute the bottleneck variable, so although they perform compression, this compression operation itself needs unrestricted, lossless access. In this work, we propose the variational bandwidth bottleneck, which decides for each example on the estimated value of the privileged information before seeing it, i.e., only based on the standard input, and then accordingly chooses stochastically, whether to access the privileged input or not. We formulate a tractable approximation to this framework and demonstrate in a series of reinforcement learning experiments that it can improve generalization and reduce access to computationally costly information.",/pdf/b40be7a245f8e1c10d5e434f960b0a68288a91b5.pdf,ICLR,2020,Training agents with adaptive computation based on information bottleneck can promote generalization. +KtH8W3S_RE,n-a11YSVcJ_,1601310000000.0,1619030000000.0,1324,Multi-resolution modeling of a discrete stochastic process identifies causes of cancer,"[""~Adam_Uri_Yaari1"", ""~Maxwell_Sherman1"", ""~Oliver_Clarke_Priebe1"", ""~Po-Ru_Loh1"", ""~Boris_Katz1"", ""~Andrei_Barbu3"", ""~Bonnie_Berger1""]","[""Adam Uri Yaari"", ""Maxwell Sherman"", ""Oliver Clarke Priebe"", ""Po-Ru Loh"", ""Boris Katz"", ""Andrei Barbu"", ""Bonnie Berger""]","[""Computational Biology"", ""non-stationary stochastic processes"", ""cancer research"", ""deep learning"", ""probabelistic models"", ""graphical models""]","Detection of cancer-causing mutations within the vast and mostly unexplored human genome is a major challenge. 
Doing so requires modeling the background mutation rate, a highly non-stationary stochastic process, across regions of interest varying in size from one to millions of positions. Here, we present the split-Poisson-Gamma (SPG) distribution, an extension of the classical Poisson-Gamma formulation, to model a discrete stochastic process at multiple resolutions. We demonstrate that the probability model has a closed-form posterior, enabling efficient and accurate linear-time prediction over any length scale after the parameters of the model have been inferred a single time. We apply our framework to model mutation rates in tumors and show that model parameters can be accurately inferred from high-dimensional epigenetic data using a convolutional neural network, Gaussian process, and maximum-likelihood estimation. Our method is both more accurate and more efficient than existing models over a large range of length scales. We demonstrate the usefulness of multi-resolution modeling by detecting genomic elements that drive tumor emergence and are of vastly differing sizes.",/pdf/ef004741f232acac82caf795ea408cbd814a413c.pdf,ICLR,2021,"We integrate a deep learning framework with a probabilistic model to learn a discrete stochastic process at arbitrary length scales; the method accurately and efficiently models mutation load in a tumor and detects cancer driver mutations genome-wide" +rJerHlrYwH,ryx5d_etvr,1569440000000.0,1577170000000.0,2284,Data-Efficient Image Recognition with Contrastive Predictive Coding,"[""henaff@google.com"", ""aravind@cs.berkeley.edu"", ""defauw@google.com"", ""alirazavi@google.com"", ""doersch@google.com"", ""aeslami@google.com"", ""avdnoord@google.com""]","[""Olivier J Henaff"", ""Aravind Srinivas"", ""Jeffrey De Fauw"", ""Ali Razavi"", ""Carl Doersch"", ""S. M. 
Ali Eslami"", ""Aaron van den Oord""]","[""Deep learning"", ""representation learning"", ""contrastive methods"", ""unsupervised learning"", ""self-supervised learning"", ""vision"", ""data-efficiency""]","Human observers can learn to recognize new categories of objects from a handful of examples, yet doing so with machine perception remains an open challenge. We hypothesize that data-efficient recognition is enabled by representations which make the variability in natural signals more predictable, as suggested by recent perceptual evidence. We therefore revisit and improve Contrastive Predictive Coding, a recently-proposed unsupervised learning framework, and arrive at a representation which enables generalization from small amounts of labeled data. When provided with only 1% of ImageNet labels (i.e. 13 per class), this model retains a strong classification performance, 73% Top-5 accuracy, outperforming supervised networks by 28% (a 65% relative improvement) and state-of-the-art semi-supervised methods by 14%. We also find this representation to serve as a useful substrate for object detection on the PASCAL-VOC 2007 dataset, approaching the performance of representations trained with a fully annotated ImageNet dataset.",/pdf/760ff8205c69d78094ed1c5804a7d20ffe475c69.pdf,ICLR,2020,Unsupervised representations learned with Contrastive Predictive Coding enable data-efficient image classification. +HJTzHtqee,,1478300000000.0,1488520000000.0,481,A Compare-Aggregate Model for Matching Text Sequences,"[""shwang.2014@phdis.smu.edu.sg"", ""jingjiang@smu.edu.sg""]","[""Shuohang Wang"", ""Jing Jiang""]","[""Natural language processing"", ""Deep learning""]","Many NLP tasks including machine comprehension, answer selection and text entailment require the comparison between sequences. Matching the important units between sequences is a key to solve these problems. 
In this paper, we present a general ""compare-aggregate"" framework that performs word-level matching followed by aggregation using Convolutional Neural Networks. We particularly focus on the different comparison functions we can use to match two vectors. We use four different datasets to evaluate the model. We find that some simple comparison functions based on element-wise operations can work better than standard neural network and neural tensor network. ",/pdf/153a178fd1072b3ce3642cd6c281b60604cf9aed.pdf,ICLR,2017,"A general ""compare-aggregate"" framework that performs word-level matching followed by aggregation using Convolutional Neural Networks" +8W7LTo_zxdE,d2UxPTEDtU,1601310000000.0,1614990000000.0,3351,Variational Deterministic Uncertainty Quantification,"[""~Joost_van_Amersfoort1"", ""~Lewis_Smith1"", ""~Andrew_Jesson1"", ""~Oscar_Key1"", ""~Yarin_Gal1""]","[""Joost van Amersfoort"", ""Lewis Smith"", ""Andrew Jesson"", ""Oscar Key"", ""Yarin Gal""]","[""Uncertainty estimation"", ""gaussian processes"", ""deep learning"", ""variational inference""]","Building on recent advances in uncertainty quantification using a single deep deterministic model (DUQ), we introduce variational Deterministic Uncertainty Quantification (vDUQ). We overcome several shortcomings of DUQ by recasting it as a Gaussian process (GP) approximation. Our principled approximation is based on an inducing point GP in combination with Deep Kernel Learning. This enables vDUQ to use rigorous probabilistic foundations, and work not only on classification but also on regression problems. We avoid uncertainty collapse away from the training data by regularizing the spectral norm of the deep feature extractor. Our method matches SotA accuracy, 96.2\% on CIFAR-10, while maintaining the speed of softmax models, and provides uncertainty estimates competitive with Deep Ensembles. 
We demonstrate our method in regression problems and by estimating uncertainty in causal inference for personalized medicine",/pdf/2b0ce6cc6e1bb9c3633be645f8dfee1b398e1a6f.pdf,ICLR,2021,Uncertainty estimation using Deep Kernel Learning and a spectral normalized feature extractor. +3zaVN0M0BIb,xlxAXPvorQb,1601310000000.0,1614990000000.0,3249,Learning and Generalization in Univariate Overparameterized Normalizing Flows,"[""~Kulin_Shah1"", ""~Amit_Deshpande1"", ""~Navin_Goyal1""]","[""Kulin Shah"", ""Amit Deshpande"", ""Navin Goyal""]",[],"In supervised learning, it is known that overparameterized neural networks with one hidden layer provably and efficiently learn and generalize, when trained using Stochastic Gradient Descent (SGD). In contrast, the benefit of overparameterization in unsupervised learning is not well understood. Normalizing flows (NFs) learn to map complex real-world distributions into simple base distributions, and constitute an important class of models in unsupervised learning for sampling and density estimation. In this paper, we theoretically and empirically analyze these models when the underlying neural network is one hidden layer overparametrized network. On the one hand we provide evidence that for a class of NFs, overparametrization hurts training. On the other, we prove that another class of NFs, with similar underlying networks can efficiently learn any reasonable data distribution under minimal assumptions. We extend theoretical ideas on learning and generalization from overparameterized neural networks in supervised learning to overparameterized normalizing flows in unsupervised learning. 
We also provide experimental validation to support our theoretical analysis in practice.",/pdf/ffcee50f276b53bc7faebe19294a696f331c6616.pdf,ICLR,2021, +r1nSxrKPH,Hkehktltvr,1569440000000.0,1577170000000.0,2299,Learning Functionally Decomposed Hierarchies for Continuous Navigation Tasks,"[""lukas.jendele@gmail.com"", ""sammy.christen@inf.ethz.ch"", ""eaksan@inf.ethz.ch"", ""otmar.hilliges@inf.ethz.ch""]","[""Lukas Jendele"", ""Sammy Christen"", ""Emre Aksan"", ""Otmar Hilliges""]","[""Hierarchical reinforcement learning"", ""planning"", ""navigation""]","Solving long-horizon sequential decision making tasks in environments with sparse rewards is a longstanding problem in reinforcement learning (RL) research. Hierarchical Reinforcement Learning (HRL) has held the promise to enhance the capabilities of RL agents via operation on different levels of temporal abstraction. Despite the success of recent works in dealing with inherent nonstationarity and sample complexity, it remains difficult to generalize to unseen environments and to transfer different layers of the policy to other agents. In this paper, we propose a novel HRL architecture, Hierarchical Decompositional Reinforcement Learning (HiDe), which allows decomposition of the hierarchical layers into independent subtasks, yet allows for joint training of all layers in end-to-end manner. The main insight is to combine a control policy on a lower level with an image-based planning policy on a higher level. We evaluate our method on various complex continuous control tasks for navigation, demonstrating that generalization across environments and transfer of higher level policies can be achieved. 
See videos https://sites.google.com/view/hide-rl",/pdf/832e50cff0a270de709cfe2c10969b87e899eb41.pdf,ICLR,2020,Learning Functionally Decomposed Hierarchies for Continuous Navigation Tasks +rkgAGAVKPr,rJgY7mruPS,1569440000000.0,1586290000000.0,1019,Meta-Dataset: A Dataset of Datasets for Learning to Learn from Few Examples,"[""eleni@cs.toronto.edu"", ""tylerzhu@google.com"", ""vdumoulin@google.com"", ""lamblinp@google.com"", ""evcu@google.com"", ""kelvinxu@berkeley.edu"", ""goroshin@google.com"", ""cgel@google.com"", ""kswersky@google.com"", ""manzagop@google.com"", ""hugolarochelle@google.com""]","[""Eleni Triantafillou"", ""Tyler Zhu"", ""Vincent Dumoulin"", ""Pascal Lamblin"", ""Utku Evci"", ""Kelvin Xu"", ""Ross Goroshin"", ""Carles Gelada"", ""Kevin Swersky"", ""Pierre-Antoine Manzagol"", ""Hugo Larochelle""]","[""few-shot learning"", ""meta-learning"", ""few-shot classification""]","Few-shot classification refers to learning a classifier for new classes given only a few examples. While a plethora of models have emerged to tackle it, we find the procedure and datasets that are used to assess their progress lacking. To address this limitation, we propose Meta-Dataset: a new benchmark for training and evaluating models that is large-scale, consists of diverse datasets, and presents more realistic tasks. We experiment with popular baselines and meta-learners on Meta-Dataset, along with a competitive method that we propose. We analyze performance as a function of various characteristics of test tasks and examine the models’ ability to leverage diverse training sources for improving their generalization. We also propose a new set of baselines for quantifying the benefit of meta-learning in Meta-Dataset. 
Our extensive experimentation has uncovered important research challenges and we hope to inspire work in these directions.",/pdf/5bbab78661f015b59ef2f2bad5695aa5b98cd6c8.pdf,ICLR,2020,"We propose a new large-scale diverse environment for few-shot learning, and evaluate popular models' performance on it, revealing important research challenges." +xppLmXCbOw1,hdIf8v7kPj,1601310000000.0,1616060000000.0,3479,Self-supervised Visual Reinforcement Learning with Object-centric Representations,"[""~Andrii_Zadaianchuk1"", ""~Maximilian_Seitzer1"", ""~Georg_Martius1""]","[""Andrii Zadaianchuk"", ""Maximilian Seitzer"", ""Georg Martius""]","[""self-supervision"", ""autonomous learning"", ""object-centric representations"", ""visual reinforcement learning""]","Autonomous agents need large repertoires of skills to act reasonably on new tasks that they have not seen before. However, acquiring these skills using only a stream of high-dimensional, unstructured, and unlabeled observations is a tricky challenge for any autonomous agent. Previous methods have used variational autoencoders to encode a scene into a low-dimensional vector that can be used as a goal for an agent to discover new skills. Nevertheless, in compositional/multi-object environments it is difficult to disentangle all the factors of variation into such a fixed-length representation of the whole scene. We propose to use object-centric representations as a modular and structured observation space, which is learned with a compositional generative world model. +We show that the structure in the representations in combination with goal-conditioned attention policies helps the autonomous agent to discover and learn useful skills. 
These skills can be further combined to address compositional tasks like the manipulation of several different objects.",/pdf/17a02894cb2794484690465818a08e4b35ea6982.pdf,ICLR,2021,The combination of object-centric representations and goal-conditioned attention policies helps autonomous agents to learn useful multi-task policies in visual multi-object environments +7aL-OtQrBWD,agWkOIdZ7HF,1601310000000.0,1616070000000.0,1915,A Learning Theoretic Perspective on Local Explainability,"[""~Jeffrey_Li1"", ""~Vaishnavh_Nagarajan3"", ""~Gregory_Plumb2"", ""~Ameet_Talwalkar1""]","[""Jeffrey Li"", ""Vaishnavh Nagarajan"", ""Gregory Plumb"", ""Ameet Talwalkar""]","[""Interpretability"", ""Learning Theory"", ""Local Explanations"", ""Generalization""]","In this paper, we explore connections between interpretable machine learning and learning theory through the lens of local approximation explanations. First, we tackle the traditional problem of performance generalization and bound the test-time predictive accuracy of a model using a notion of how locally explainable it is. Second, we explore the novel problem of explanation generalization which is an important concern for a growing class of finite sample-based local approximation explanations. Finally, we validate our theoretical results empirically and show that they reflect what can be seen in practice.",/pdf/7bc360590e883f39ece56cc91233317a9d075582.pdf,ICLR,2021, +wXBt-7VM2JE,fmgqHzH7sS,1601310000000.0,1614990000000.0,2605,On Single-environment Extrapolations in Graph Classification and Regression Tasks,"[""~Beatrice_Bevilacqua1"", ""~Yangze_Zhou1"", ""~Ryan_L_Murphy1"", ""~Bruno_Ribeiro1""]","[""Beatrice Bevilacqua"", ""Yangze Zhou"", ""Ryan L Murphy"", ""Bruno Ribeiro""]","[""Extrapolation"", ""Graphs"", ""GNNs"", ""SCM"", ""Causality"", ""Counterfactual Inference""]","Extrapolation in graph classification/regression remains an underexplored area of an otherwise rapidly developing field. 
Our work contributes to a growing literature by providing the first systematic counterfactual modeling framework for extrapolations in graph classification/regression tasks. To show that extrapolation from a single training environment is possible, we develop a connection between certain extrapolation tasks on graph sizes and Lovasz's characterization of graph limits. For these extrapolations, standard graph neural networks (GNNs) will fail, while classifiers using induced homomorphism densities succeed, but mostly on unattributed graphs. Generalizing these density features through a GNN subgraph decomposition allows them to also succeed in more complex attributed graph extrapolation tasks. Finally, our experiments validate our theoretical results and showcase some shortcomings of common (interpolation) methods in the literature.",/pdf/2ff7e3b7be9531ec6770a9c5ba4eaac1a14f4f52.pdf,ICLR,2021,"To the best of our knowledge, this is the first work to formalize extrapolation in graph classification tasks using counterfactual inference, showing that extrapolation in graph tasks is possible even if given a single environment in training." +BygJKn4tPr,rkx8li4cUB,1569440000000.0,1577170000000.0,64,Effective Mechanism to Mitigate Injuries During NFL Plays ,"[""anzanfas@gmail.com"", ""arulanantham.arraamuthan@my.sliit.lk"", ""it16113800@my.sliit.lk"", ""krusanth7@gmail.com"", ""prasanna@sliit.lk""]","[""Arraamuthan Arulanantham"", ""Ahamed Arshad Ahamed Anzar"", ""Gowshalini Rajalingam"", ""Krusanth Ingran"", ""Prasanna S. Haddela""]","[""Concussion"", ""American football"", ""Predictive modelling"", ""Injuries"", ""NFL Plays"", ""Optimization""]","NFL (American football), which is regarded as the premier sports icon of America, has been severely accused in the recent years of being exposed to dangerous injuries that prove to be a bigger crisis as the players' lives have been increasingly at risk. 
Concussions, which refer to the serious brain traumas experienced during the passage of NFL play, have displayed a dramatic rise in the recent seasons concluding in an alarming rate in 2017/18. Acknowledging the potential risk, the NFL has been trying to fight via NeuroIntel AI mechanism as well as modifying existing game rules and risky play practices to reduce the rate of concussions. As a remedy, we are suggesting an effective mechanism to extensively analyse the potential concussion risks by adopting predictive analysis to project injury risk percentage per each play and positional impact analysis to suggest safer team formation pairs to lessen injuries to offer a comprehensive study on NFL injury analysis. The proposed data analytical approach differentiates itself from the other similar approaches that were focused only on the descriptive analysis rather than going for a bigger context with predictive modelling and formation pairs mining that would assist in modifying existing rules to tackle injury concerns. 
The predictive model that works with Kafka-stream processor real-time inputs and risky formation pairs identification by designing FP-Matrix, makes this far-reaching solution to analyse injury data on various grounds wherever applicable.",/pdf/efe733c5d12d8372e01311ee567c3e3d8058c167.pdf,ICLR,2020,Mitigate concussions in American Football using Machine learning and Optimization techniques +_XYzwxPIQu6,ZMFSKAkCJvr,1601310000000.0,1615490000000.0,2968,Identifying nonlinear dynamical systems with multiple time scales and long-range dependencies,"[""~Dominik_Schmidt1"", ""~Georgia_Koppe1"", ""zahra.monfared@zi-mannheim.de"", ""max.beutelspacher@mailbox.org"", ""~Daniel_Durstewitz1""]","[""Dominik Schmidt"", ""Georgia Koppe"", ""Zahra Monfared"", ""Max Beutelspacher"", ""Daniel Durstewitz""]","[""nonlinear dynamical systems"", ""recurrent neural networks"", ""attractors"", ""computational neuroscience"", ""vanishing gradient problem"", ""LSTM""]","A main theoretical interest in biology and physics is to identify the nonlinear dynamical system (DS) that generated observed time series. Recurrent Neural Networks (RNN) are, in principle, powerful enough to approximate any underlying DS, but in their vanilla form suffer from the exploding vs. vanishing gradients problem. Previous attempts to alleviate this problem resulted either in more complicated, mathematically less tractable RNN architectures, or strongly limited the dynamical expressiveness of the RNN. +Here we address this issue by suggesting a simple regularization scheme for vanilla RNN with ReLU activation which enables them to solve long-range dependency problems and express slow time scales, while retaining a simple mathematical structure which makes their DS properties partly analytically accessible. 
We prove two theorems that establish a tight connection between the regularized RNN dynamics and their gradients, illustrate on DS benchmarks that our regularization approach strongly eases the reconstruction of DS which harbor widely differing time scales, and show that our method is also on par with other long-range architectures like LSTMs on several tasks.",/pdf/fb39182856a0c466f6521fc91c26027f3907321b.pdf,ICLR,2021,We introduce a novel regularization for ReLU-based vanilla RNN that mitigates the exploding vs. vanishing gradient problem while retaining a simple mathematical structure that makes the RNN's dynamical systems properties partly analytically tractable +B1MRcPclx,,1478290000000.0,1488590000000.0,421,Query-Reduction Networks for Question Answering,"[""minjoon@cs.washington.edu"", ""shmsw25@snu.ac.kr"", ""ali@cs.washington.edu"", ""hannaneh@cs.washington.edu""]","[""Minjoon Seo"", ""Sewon Min"", ""Ali Farhadi"", ""Hannaneh Hajishirzi""]","[""Natural language processing"", ""Deep learning""]","In this paper, we study the problem of question answering when reasoning over multiple facts is required. We propose Query-Reduction Network (QRN), a variant of Recurrent Neural Network (RNN) that effectively handles both short-term (local) and long-term (global) sequential dependencies to reason over multiple facts. QRN considers the context sentences as a sequence of state-changing triggers, and reduces the original query to a more informed query as it observes each trigger (context sentence) through time. Our experiments show that QRN produces the state-of-the-art results in bAbI QA and dialog tasks, and in a real goal-oriented dialog dataset. In addition, QRN formulation allows parallelization on RNN's time axis, saving an order of magnitude in time complexity for training and inference. 
",https://arxiv.org/pdf/1606.04582.pdf,ICLR,2017, +YmA86Zo-P_t,upctvZC2Au_,1601310000000.0,1616000000000.0,878,What they do when in doubt: a study of inductive biases in seq2seq learners,"[""~Eugene_Kharitonov1"", ""~Rahma_Chaabouni1""]","[""Eugene Kharitonov"", ""Rahma Chaabouni""]","[""inductive biases"", ""description length"", ""sequence-to-sequence models""]","Sequence-to-sequence (seq2seq) learners are widely used, but we still have only limited knowledge about what inductive biases shape the way they generalize. We address that by investigating how popular seq2seq learners generalize in tasks that have high ambiguity in the training data. We use four new tasks to study learners' preferences for memorization, arithmetic, hierarchical, and compositional reasoning. Further, we connect to Solomonoff's theory of induction and propose to use description length as a principled and sensitive measure of inductive biases. In our experimental study, we find that LSTM-based learners can learn to perform counting, addition, and multiplication by a constant from a single training example. Furthermore, Transformer and LSTM-based learners show a bias toward the hierarchical induction over the linear one, while CNN-based learners prefer the opposite. The latter also show a bias toward a compositional generalization over memorization. Finally, across all our experiments, description length proved to be a sensitive measure of inductive biases.",/pdf/585c196f903498736cc63f2856f42b9c3fd4139d.pdf,ICLR,2021,"Standard seq2seq models can infer perfectly different rules when presented with as little as one training example, showing strikingly different inductive biases that can be studied via description length" +VRgITLy0l2,Hb1na6pChJQ,1601310000000.0,1614990000000.0,3441,A priori guarantees of finite-time convergence for Deep Neural Networks,"[""~Anushree_Rankawat1"", ""~Mansi_Rankawat1"", ""harshal.oza@sot.pdpu.ac.in""]","[""Anushree Rankawat"", ""Mansi Rankawat"", ""Harshal B. 
Oza""]",[],"In this paper, we perform Lyapunov based analysis of the loss function to derive an a priori upper bound on the settling time of deep neural networks. While previous studies have attempted to understand deep learning using control theory framework, there is limited work on a priori finite time convergence analysis. Drawing from the advances in analysis of finite-time control of non-linear systems, we provide a priori guarantees of finite-time convergence in a deterministic control theoretic setting. We formulate the supervised learning framework as a control problem where weights of the network are control inputs and learning translates into a tracking problem. An analytical formula for finite-time upper bound on settling time is provided a priori under the assumptions of boundedness of input. Finally, we prove that our loss function is robust against input perturbations. ",/pdf/4781058fdde8bdc83bf67fb9a420d558f2b8e991.pdf,ICLR,2021, +EBRTjOm_sl1,ZwAhx5HI6Ku,1601310000000.0,1614990000000.0,2006,Learning Active Learning in the Batch-Mode Setup with Ensembles of Active Learning Agents,"[""~Malte_Ebner1"", ""bkratzwald@ethz.ch"", ""sfeuerriegel@ethz.ch""]","[""Malte Ebner"", ""Bernhard Kratzwald"", ""Stefan Feuerriegel""]","[""active learning"", ""ensembles""]","Supervised learning models perform best when trained on a lot of data, but annotating training data is very costly in some domains. Active learning aims to choose only the most informative subset of unlabelled samples for annotation, thus saving annotation cost. Several heuristics for choosing this subset have been developed, which use fixed policies for this choice. They are easily understood and applied. However, no heuristic performs optimally in all settings. This led to the development of agents learning the best selection policy from data. They formulate active learning as a Markov decision process and apply reinforcement learning (RL) methods to it. 
Their advantage is that they are able to use many features and to adapt to the specific task. + +Our paper proposes a new approach combining these advantages of learning active learning and heuristics: We propose to learn active learning using a parametrised ensemble of agents, where the parameters are learned using Monte Carlo policy search. As this approach can incorporate any active learning agent into its ensemble, it allows to increase the performance of every active learning agent by learning how to combine it with others.",/pdf/3524e90f8500b156626d85f70065c261f5b645d6.pdf,ICLR,2021,This paper proposes to perform active learning with a parametrised ensemble of agents and evaluates the approach in the batch-mode setting. +HkNDsiC9KQ,H1lm8tCKYm,1538090000000.0,1550890000000.0,632,Meta-Learning Update Rules for Unsupervised Representation Learning,"[""lmetz@google.com"", ""nirum@google.com"", ""bcheung@berkeley.edu"", ""jaschasd@google.com""]","[""Luke Metz"", ""Niru Maheswaranathan"", ""Brian Cheung"", ""Jascha Sohl-Dickstein""]","[""Meta-learning"", ""unsupervised learning"", ""representation learning""]","A major goal of unsupervised learning is to discover data representations that are useful for subsequent tasks, without access to supervised labels during training. Typically, this involves minimizing a surrogate objective, such as the negative log likelihood of a generative model, with the hope that representations useful for subsequent tasks will arise as a side effect. In this work, we propose instead to directly target later desired tasks by meta-learning an unsupervised learning rule which leads to representations useful for those tasks. Specifically, we target semi-supervised classification performance, and we meta-learn an algorithm -- an unsupervised weight update rule -- that produces representations useful for this task. 
Additionally, we constrain our unsupervised update rule to be a biologically-motivated, neuron-local function, which enables it to generalize to different neural network architectures, datasets, and data modalities. We show that the meta-learned update rule produces useful features and sometimes outperforms existing unsupervised learning techniques. We further show that the meta-learned unsupervised update rule generalizes to train networks with different widths, depths, and nonlinearities. It also generalizes to train on data with randomly permuted input dimensions and even generalizes from image datasets to a text task.",/pdf/e572a334fe17a077ba8892af92200f9e00dd2656.pdf,ICLR,2019,"We learn an unsupervised learning algorithm that produces useful representations from a set of supervised tasks. At test-time, we apply this algorithm to new tasks without any supervision and show performance comparable to a VAE." +y13JLBiNMsf,dLyeBkuXOO,1601310000000.0,1614990000000.0,796,Learning Monotonic Alignments with Source-Aware GMM Attention,"[""~Tae_Gyoon_Kang1"", ""hogyeong.kim@samsung.com"", ""minjoong.lee@samsung.com"", ""jihyun.s.lee@samsung.com"", ""~Seongmin_Ok1"", ""hoshik.lee@samsung.com"", ""macho@samsung.com""]","[""Tae Gyoon Kang"", ""Ho-Gyeong Kim"", ""Min-Joong Lee"", ""Jihyun Lee"", ""Seongmin Ok"", ""Hoshik Lee"", ""Young Sang Choi""]","[""Monotonic alignments"", ""sequence-to-sequence model"", ""aligned attention"", ""streaming speech recognition"", ""long-form speech recognition""]","Transformers with soft attention have been widely adopted in various sequence-to-sequence (Seq2Seq) tasks. Whereas soft attention is effective for learning semantic similarities between queries and keys based on their contents, it does not explicitly model the order of elements in sequences which is crucial for monotonic Seq2Seq tasks. 
Learning monotonic alignments between input and output sequences may be beneficial for long-form and online inference applications that are still challenging for the conventional soft attention algorithm. Herein, we focus on monotonic Seq2Seq tasks and propose a source-aware Gaussian mixture model attention in which the attention scores are monotonically calculated considering both the content and order of the source sequence. We experimentally demonstrate that the proposed attention mechanism improved the performance on the online and long-form speech recognition problems without performance degradation in offline in-distribution speech recognition.",/pdf/40b2a952edd850d01b072503e4919dee464c895e.pdf,ICLR,2021,We focus on monotonic sequence-to-sequence tasks and propose source-aware GMM attention which enables online inference and improves long-form sequence generation performance in speech recognition. +SJx1URNKwH,ryx5i68dPS,1569440000000.0,1583910000000.0,1124,MetaPix: Few-Shot Video Retargeting,"[""jl5@cs.cmu.edu"", ""deva@cs.cmu.edu"", ""rgirdhar@cs.cmu.edu""]","[""Jessica Lee"", ""Deva Ramanan"", ""Rohit Girdhar""]","[""Meta-learning"", ""Few-shot Learning"", ""Generative Adversarial Networks"", ""Video Retargeting""]","We address the task of unsupervised retargeting of human actions from one video to another. We consider the challenging setting where only a few frames of the target is available. The core of our approach is a conditional generative model that can transcode input skeletal poses (automatically extracted with an off-the-shelf pose estimator) to output target frames. However, it is challenging to build a universal transcoder because humans can appear wildly different due to clothing and background scene geometry. Instead, we learn to adapt – or personalize – a universal generator to the particular human and background in the target. To do so, we make use of meta-learning to discover effective strategies for on-the-fly personalization. 
One significant benefit of meta-learning is that the personalized transcoder naturally enforces temporal coherence across its generated frames; all frames contain consistent clothing and background geometry of the target. We experiment on in-the-wild internet videos and images and show our approach improves over widely-used baselines for the task. +",/pdf/a63a19c5457caf7a6647d054cd03a139124fb7b4.pdf,ICLR,2020,"Video retargeting typically requires large amount of target data to be effective, which may not always be available; we propose a metalearning approach that improves over popular baselines while producing temporally coherent frames." +3X64RLgzY6O,8zoz0gh3NrT,1601310000000.0,1616800000000.0,1393,Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate,"[""~Jingfeng_Wu1"", ""~Difan_Zou1"", ""~Vladimir_Braverman1"", ""~Quanquan_Gu1""]","[""Jingfeng Wu"", ""Difan Zou"", ""Vladimir Braverman"", ""Quanquan Gu""]","[""SGD"", ""regularization"", ""implicit bias""]","Understanding the algorithmic bias of stochastic gradient descent (SGD) is one of the key challenges in modern machine learning and deep learning theory. Most of the existing works, however, focus on very small or even infinitesimal learning rate regime, and fail to cover practical scenarios where the learning rate is moderate and annealing. In this paper, we make an initial attempt to characterize the particular regularization effect of SGD in the moderate learning rate regime by studying its behavior for optimizing an overparameterized linear regression problem. In this case, SGD and GD are known to converge to the unique minimum-norm solution; however, with the moderate and annealing learning rate, we show that they exhibit different directional bias: SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions. 
Furthermore, we show that such directional bias does matter when early stopping is adopted, where the SGD output is nearly optimal but the GD output is suboptimal. Finally, our theory explains several folk arts in practice used for SGD hyperparameter tuning, such as (1) linearly scaling the initial learning rate with batch size; and (2) overrunning SGD with high learning rate even when the loss stops decreasing.",/pdf/48577f49989616efeba863fc8477794a13b5d90e.pdf,ICLR,2021,We show a directional regularization effect for SGD with moderate learning rate +SJc1hL5ee,,1478290000000.0,1478380000000.0,324,FastText.zip: Compressing text classification models,"[""ajoulin@fb.com"", ""egrave@fb.com"", ""bojanowski@fb.com"", ""matthijs@fb.com"", ""rvj@fb.com"", ""tmikolov@fb.com""]","[""Armand Joulin"", ""Edouard Grave"", ""Piotr Bojanowski"", ""Matthijs Douze"", ""Herve Jegou"", ""Tomas Mikolov""]","[""Natural language processing"", ""Supervised Learning"", ""Applications""]","We consider the problem of producing compact architectures for text classification, such that the full model fits in a limited amount of memory. After considering different solutions inspired by the hashing literature, we propose a method built upon product quantization to store the word embeddings. While the original technique leads to a loss in accuracy, we adapt this method to circumvent the quantization artifacts. As a result, our approach produces a text classifier, derived from the fastText approach, which at test time requires only a fraction of the memory compared to the original one, without noticeably sacrificing the quality in terms of classification accuracy. Our experiments carried out on several benchmarks show that our approach typically requires two orders of magnitude less memory than fastText while being only slightly inferior with respect to accuracy. 
As a result, it outperforms the state of the art by a good margin in terms of the compromise between memory usage and accuracy.",/pdf/c041ccf288b0ed8d1eccac2f3d20aa9ed97c155f.pdf,ICLR,2017,Compressing text classification models +OAdGsaptOXy,RnDk4l-tAV3,1601310000000.0,1614990000000.0,2337,Pretrain Knowledge-Aware Language Models,"[""~Corbin_L_Rosset1"", ""~Chenyan_Xiong1"", ""phan.minh@microsoft.com"", ""~Xia_Song1"", ""~Paul_N._Bennett1"", ""~saurabh_tiwary1""]","[""Corbin L Rosset"", ""Chenyan Xiong"", ""Minh Phan"", ""Xia Song"", ""Paul N. Bennett"", ""saurabh tiwary""]","[""Pretraining"", ""Natural Language Generation"", ""GPT-2"", ""QA"", ""Knowledge Graph""]","How much knowledge do pretrained language models hold? Recent research observed that pretrained transformers are adept at modeling semantics but it is unclear to what degree they grasp human knowledge, or how to ensure they do so. In this paper we incorporate knowledge-awareness in language model pretraining without changing the transformer architecture, inserting explicit knowledge layers, or adding external storage of semantic information. Rather, we simply signal the existence of entities to the input of the transformer in pretraining, with an entity-extended tokenizer; and at the output, with an additional entity prediction task. Our experiments show that solely by adding these entity signals in pretraining, significantly more knowledge is packed into the transformer parameters: we observe improved language modeling accuracy, factual correctness in LAMA knowledge probing tasks, and semantics in the hidden representations through edge probing. We also show that our knowledge-aware language model (KALM) can serve as a drop-in replacement for GPT-2 models, significantly improving downstream tasks like zero-shot question-answering with no task-related training. 
",/pdf/7ce6a2690102b4f71d551b08f9af37f04b3eb10a.pdf,ICLR,2021,"Without changing the transformer architecture, we can pack more knowledge into GPT-2 parameters by signaling the existence of entities, leading to better QA performance. " +B1MUroRct7,HyxwG7IYF7,1538090000000.0,1545360000000.0,91,Online Learning for Supervised Dimension Reduction,"[""ningzhang0123@gmail.com"", ""qwu@mtsu.edu""]","[""Ning Zhang"", ""Qiang Wu""]","[""Online Learning"", ""Supervised Dimension Reduction"", ""Incremental Sliced Inverse Regression"", ""Effective Dimension Reduction Space""]"," Online learning has attracted great attention due to the increasing demand for systems that have the ability of learning and evolving. When the data to be processed is also high dimensional and dimension reduction is necessary for visualization or prediction enhancement, online dimension reduction will play an essential role. The purpose of this paper is to propose new online learning approaches for supervised dimension reduction. Our first algorithm is motivated by adapting the sliced inverse regression (SIR), a pioneer and effective algorithm for supervised dimension reduction, and making it implementable in an incremental manner. The new algorithm, called incremental sliced inverse regression (ISIR), is able to update the subspace of significant factors with intrinsic lower dimensionality fast and efficiently when new observations come in. We also refine the algorithm by using an overlapping technique and develop an incremental overlapping sliced inverse regression (IOSIR) algorithm. We verify the effectiveness and efficiency of both algorithms by simulations and real data applications.",/pdf/d13a831ec42dbd62b5cf904145014c0bfc59695a.pdf,ICLR,2019,"We proposed two new approaches, the incremental sliced inverse regression and incremental overlapping sliced inverse regression, to implement supervised dimension reduction in an online learning manner." 
+S1nQvfgA-,HkiQPzeRb,1509070000000.0,1519330000000.0,224,Semantically Decomposing the Latent Spaces of Generative Adversarial Networks,"[""cdonahue@ucsd.edu"", ""zlipton@cmu.edu"", ""abalsubr@stanford.edu"", ""jmcauley@cs.ucsd.edu""]","[""Chris Donahue"", ""Zachary C. Lipton"", ""Akshay Balsubramani"", ""Julian McAuley""]","[""disentangled representations"", ""generative adversarial networks"", ""generative modeling"", ""image synthesis""]","We propose a new algorithm for training generative adversarial networks to jointly learn latent codes for both identities (e.g. individual humans) and observations (e.g. specific photographs). In practice, this means that by fixing the identity portion of latent codes, we can generate diverse images of the same subject, and by fixing the observation portion we can traverse the manifold of subjects while maintaining contingent aspects such as lighting and pose. Our algorithm features a pairwise training scheme in which each sample from the generator consists of two images with a common identity code. Corresponding samples from the real dataset consist of two distinct photographs of the same subject. In order to fool the discriminator, the generator must produce images that are both photorealistic, distinct, and appear to depict the same person. We augment both the DCGAN and BEGAN approaches with Siamese discriminators to accommodate pairwise training. Experiments with human judges and an off-the-shelf face verification system demonstrate our algorithm’s ability to generate convincing, identity-matched photographs.",/pdf/7d489bf176b131aab3fa9e3f5308aad74c7ea8c7.pdf,ICLR,2018,SD-GANs disentangle latent codes according to known commonalities in a dataset (e.g. photographs depicting the same person). 
+rkg6sJHYDr,BJexKJ1FPH,1569440000000.0,1583910000000.0,1931,Intrinsically Motivated Discovery of Diverse Patterns in Self-Organizing Systems,"[""chris.reinke@inria.fr"", ""mayalen.etcheverry@inria.fr"", ""chris.reinke@inria.fr"", ""pierre-yves.oudeyer@inria.fr""]","[""Chris Reinke"", ""Mayalen Etcheverry"", ""Pierre-Yves Oudeyer""]","[""deep learning"", ""unsupervised Learning"", ""self-organization"", ""game-of-life""]","In many complex dynamical systems, artificial or natural, one can observe self-organization of patterns emerging from local rules. Cellular automata, like the Game of Life (GOL), have been widely used as abstract models enabling the study of various aspects of self-organization and morphogenesis, such as the emergence of spatially localized patterns. However, findings of self-organized patterns in such models have so far relied on manual tuning of parameters and initial states, and on the human eye to identify interesting patterns. In this paper, we formulate the problem of automated discovery of diverse self-organized patterns in such high-dimensional complex dynamical systems, as well as a framework for experimentation and evaluation. Using a continuous GOL as a testbed, we show that recent intrinsically-motivated machine learning algorithms (POP-IMGEPs), initially developed for learning of inverse models in robotics, can be transposed and used in this novel application area. These algorithms combine intrinsically-motivated goal exploration and unsupervised learning of goal space representations. Goal space representations describe the interesting features of patterns for which diverse variations should be discovered. In particular, we compare various approaches to define and learn goal space representations from the perspective of discovering diverse spatially localized patterns. 
Moreover, we introduce an extension of a state-of-the-art POP-IMGEP algorithm which incrementally learns a goal representation using a deep auto-encoder, and the use of CPPN primitives for generating initialization parameters. We show that it is more efficient than several baselines and equally efficient as a system pre-trained on a hand-made database of patterns identified by human experts.",/pdf/bb2cff5de2f13ce19f25a957f363b530b7a63f59.pdf,ICLR,2020,We study how an unsupervised exploration and feature learning approach addresses efficiently a new problem: automatic discovery of diverse self-organized patterns in high-dim complex systems such as the game of life. +S1fUpoR5FQ,S1etVYa5YQ,1538090000000.0,1547000000000.0,805,Quasi-hyperbolic momentum and Adam for deep learning,"[""maj@fb.com"", ""denisy@fb.com""]","[""Jerry Ma"", ""Denis Yarats""]","[""sgd"", ""momentum"", ""nesterov"", ""adam"", ""qhm"", ""qhadam"", ""optimization""]","Momentum-based acceleration of stochastic gradient descent (SGD) is widely used in deep learning. We propose the quasi-hyperbolic momentum algorithm (QHM) as an extremely simple alteration of momentum SGD, averaging a plain SGD step with a momentum step. We describe numerous connections to and identities with other algorithms, and we characterize the set of two-state optimization algorithms that QHM can recover. Finally, we propose a QH variant of Adam called QHAdam, and we empirically demonstrate that our algorithms lead to significantly improved training in a variety of settings, including a new state-of-the-art result on WMT16 EN-DE. We hope that these empirical results, combined with the conceptual and practical simplicity of QHM and QHAdam, will spur interest from both practitioners and researchers. Code is immediately available.",/pdf/d2ea4b4ed5c48630b8c757bb107e9d470a5957ff.pdf,ICLR,2019,Mix plain SGD and momentum (or do something similar with Adam) for great profit. 
+Ef1nNHQHZ20,AOdHyqIZob9,1601310000000.0,1614990000000.0,303,Layer-wise Adversarial Defense: An ODE Perspective,"[""~Zonghan_Yang1"", ""liuyang2011@tsinghua.edu.cn"", ""~Chenglong_Bao3"", ""~Zuoqiang_Shi1""]","[""Zonghan Yang"", ""Yang Liu"", ""Chenglong Bao"", ""Zuoqiang Shi""]","[""adversarial training"", ""robustness"", ""ODE""]","Deep neural networks are observed to be fragile against adversarial attacks, which have dramatically limited their practical applicability. On improving model robustness, the adversarial training techniques have proven effective and gained increasing attention from research communities. Existing adversarial training approaches mainly focus on perturbations to inputs, while the effect of the perturbations in hidden layers remains underexplored. In this work, we propose layer-wise adversarial defense which improves adversarial training by a noticeable margin. The basic idea of our method is to strengthen all of the hidden layers with perturbations that are proportional to the back-propagated gradients. In order to study the layer-wise neural dynamics, we formulate our approach from the perspective of ordinary differential equations (ODEs) and build up its extended relationship with conventional adversarial training methods, which tightens the relationship between neural networks and ODEs. In the implementation, we propose two different training algorithms by discretizing the ODE model with the Lie-Trotter and the Strang-Marchuk splitting schemes from the operator-splitting theory. Experiments on CIFAR-10 and CIFAR-100 benchmarks show that our methods consistently improve adversarial model robustness on top of widely-used strong adversarial training techniques. ",/pdf/44076e76655a69ad1c97814a9ac7188eb6f05383.pdf,ICLR,2021,"We introduce layer-wise adversarial defense to improve adversarial training algorithms, and build up its extended relationship with current approaches from ordinary differential equation perspective." 
+WUTkGqErZ9,zayILXFH48l,1601310000000.0,1614990000000.0,2034,"Convolutional Neural Networks are not invariant to translation, but they can learn to be","[""~Valerio_Biscione1"", ""~Jeffrey_Bowers1""]","[""Valerio Biscione"", ""Jeffrey Bowers""]","[""Invariance"", ""Convolutional Networks"", ""Translation"", ""Internal Representations""]","When seeing a new object, humans can immediately recognize it across different retinal locations: we say that the internal object representation is invariant to translation. It is commonly believed that Convolutional Neural Networks (CNNs) are architecturally invariant to translation thanks to the convolution and/or pooling operations they are endowed with. In fact, several works have found that these networks systematically fail to recognise new objects on untrained locations. In this work we show how, even though CNNs are not 'architecturally invariant' to translation, they can indeed 'learn' to be invariant to translation. We verified that this can be achieved by pretraining on ImageNet, and we found that it is also possible with much simpler datasets in which the items are fully translated across the input canvas. Significantly, simply training everywhere on the canvas was not enough. We investigated how this pretraining affected the internal network representations, finding that the invariance was almost always acquired, even though it was some times disrupted by further training due to catastrophic forgetting/interference. + These experiments show how pretraining a network on an environment with the right 'latent' characteristics (a more naturalistic environment) can result in the network learning deep perceptual rules which would dramatically improve subsequent generalization.",/pdf/c6064e5c0f7f43c271676e5152f170a8e7d0d899.pdf,ICLR,2021,"CNNs are not, as commonly assumed, 'architecturally' invariant to translation, but we investigated the conditions in which they can learn to be invariant to translation." 
+NLuOUSp9zZd,BjLiigWsT96e,1601310000000.0,1614990000000.0,1602,DO-GAN: A Double Oracle Framework for Generative Adversarial Networks,"[""~Aye_Phyu_Phyu_Aung1"", ""~Xinrun_Wang1"", ""~Runsheng_Yu2"", ""~Bo_An2"", ""j_senthilnath@i2r.a-star.edu.sg"", ""~Xiaoli_Li1""]","[""Aye Phyu Phyu Aung"", ""Xinrun Wang"", ""Runsheng Yu"", ""Bo An"", ""Senthilnath Jayavelu"", ""Xiaoli Li""]","[""GAN"", ""Generative Models"", ""Adversarial Networks"", ""Game Theory""]","In this paper, we propose a new approach to train Generative Adversarial Networks (GAN) where we deploy a double-oracle framework using the generator and discriminator oracles. GAN is essentially a two-player zero-sum game between the generator and the discriminator. Training GANs is challenging as a pure Nash equilibrium may not exist and even finding the mixed Nash equilibrium is difficult as GANs have a large-scale strategy space. In DO-GAN, we extend the double oracle framework to GANs. We first generalize the player strategies as the trained models of generator and discriminator from the best response oracles. We then compute the meta-strategies using a linear program. Next, we prune the weakly-dominated player strategies to keep the oracles from becoming intractable. We apply our framework to established architectures such as vanilla GAN, Deep Convolutional GAN, Spectral Normalization GAN and Stacked GAN. Finally, we conduct evaluations on MNIST, CIFAR-10 and CelebA datasets and show that DO-GAN variants have significant improvements in both subjective qualitative evaluation and quantitative metrics, compared with their respective GAN architectures.",/pdf/532da5dcc7b0060f83b25e263a35d32d63ce91b1.pdf,ICLR,2021,We deploy a double-oracle framework to GAN architectures using the generator and discriminator oracles which results in significant improvements over the adapted models. 
+H1MjAnqxg,,1478310000000.0,1484530000000.0,543,Intelligible Language Modeling with Input Switched Affine Networks,"[""jakob.foerster@cs.ox.ac.uk"", ""gilmer@google.com"", ""jan.chorowski@cs.uni.wroc.pl"", ""jaschasd@google.com"", ""sussillo@google.com""]","[""Jakob Foerster"", ""Justin Gilmer"", ""Jan Chorowski"", ""Jascha Sohl-dickstein"", ""David Sussillo""]","[""Natural language processing"", ""Deep learning"", ""Supervised Learning""]","The computational mechanisms by which nonlinear recurrent neural networks (RNNs) achieve their goals remains an open question. There exist many problem domains where intelligibility of the network model is crucial for deployment. Here we introduce a recurrent architecture composed of input-switched affine transformations, in other words an RNN without any nonlinearity and with one set of weights per input. +We show that this architecture achieves near identical performance to traditional architectures on language modeling of Wikipedia text, for the same number of model parameters. +It can obtain this performance with the potential for computational speedup compared to existing methods, by precomputing the composed affine transformations corresponding to longer input sequences. +As our architecture is affine, we are able to understand the mechanisms by which it functions using linear methods. For example, we show how the network linearly combines contributions from the past to make predictions at the current time step. We show how representations for words can be combined in order to understand how context is transferred across word boundaries. Finally, we demonstrate how the system can be executed and analyzed in arbitrary bases to aid understanding.",/pdf/50e43b60e63890cb5a202e2055d529c2e9144178.pdf,ICLR,2017,Input Switched Affine Networks combine intelligibility with performance for character level language modeling. 
+Y0MgRifqikY,V4DLY69vxxE,1601310000000.0,1614990000000.0,1496,Visual Explanation using Attention Mechanism in Actor-Critic-based Deep Reinforcement Learning,"[""~Hidenori_Itaya1"", ""~Tsubasa_Hirakawa1"", ""~Takayoshi_Yamashita1"", ""~Hironobu_Fujiyoshi2"", ""~Komei_Sugiura1""]","[""Hidenori Itaya"", ""Tsubasa Hirakawa"", ""Takayoshi Yamashita"", ""Hironobu Fujiyoshi"", ""Komei Sugiura""]",[],"Deep reinforcement learning (DRL) has great potential for acquiring the optimal action in complex environments such as games and robot control. However, it is difficult to analyze the decision-making of the agent, i.e., the reasons it selects the action acquired by learning. In this work, we propose Mask-Attention A3C (Mask A3C) that introduced an attention mechanism into Asynchronous Advantage Actor-Critic (A3C) which is an actor-critic-based DRL method, and can analyze decision making of agent in DRL. A3C consists of a feature extractor that extracts features from an image, a policy branch that outputs the policy, value branch that outputs the state value. In our method, we focus on the policy branch and value branch and introduce an attention mechanism to each. In the attention mechanism, mask processing is performed on the feature maps of each branch using mask-attention that expresses the judgment reason for the policy and state value with a heat map. We visualized mask-attention maps for games on the Atari 2600 and found we could easily analyze the reasons behind an agent’s decision-making in various game tasks. Furthermore, experimental results showed that higher performance of the agent could be achieved by introducing the attention mechanism. +",/pdf/54ed73a6a2afd5bf5f2375c9a74bd5818b7b82c3.pdf,ICLR,2021, +KJNcAkY8tY4,Wd4Oqv3u-w1,1601310000000.0,1615960000000.0,2417,Do Wide and Deep Networks Learn the Same Things? 
Uncovering How Neural Network Representations Vary with Width and Depth,"[""~Thao_Nguyen3"", ""~Maithra_Raghu1"", ""~Simon_Kornblith1""]","[""Thao Nguyen"", ""Maithra Raghu"", ""Simon Kornblith""]","[""Representation learning""]","A key factor in the success of deep neural networks is the ability to scale models to improve performance by varying the architecture depth and width. This simple property of neural network design has resulted in highly effective architectures for a variety of tasks. Nevertheless, there is limited understanding of effects of depth and width on the learned representations. In this paper, we study this fundamental question. We begin by investigating how varying depth and width affects model hidden representations, finding a characteristic block structure in the hidden representations of larger capacity (wider or deeper) models. We demonstrate that this block structure arises when model capacity is large relative to the size of the training set, and is indicative of the underlying layers preserving and propagating the dominant principal component of their representations. This discovery has important ramifications for features learned by different models, namely, representations outside the block structure are often similar across architectures with varying widths and depths, but the block structure is unique to each model. We analyze the output predictions of different model architectures, finding that even when the overall accuracy is similar, wide and deep models exhibit distinctive error patterns and variations across classes.",/pdf/cb12ae8308060f86d8970f514c2a0e8a33d13c22.pdf,ICLR,2021,"We show that depth/width variations result in distinctive characteristics in the model internal representations, with resulting consequences for representations and output predictions across different model initializations and architectures. 
" +6Tm1mposlrM,R2Ml9ZcxdrS,1601310000000.0,1616050000000.0,1839,Sharpness-aware Minimization for Efficiently Improving Generalization,"[""~Pierre_Foret1"", ""akleiner@google.com"", ""~Hossein_Mobahi2"", ""~Behnam_Neyshabur1""]","[""Pierre Foret"", ""Ariel Kleiner"", ""Hossein Mobahi"", ""Behnam Neyshabur""]","[""Sharpness Minimization"", ""Generalization"", ""Regularization"", ""Training Method"", ""Deep Learning""]","In today's heavily overparameterized models, the value of the training loss provides few guarantees on model generalization ability. Indeed, optimizing only the training loss value, as is commonly done, can easily lead to suboptimal model quality. Motivated by the connection between geometry of the loss landscape and generalization---including a generalization bound that we prove here---we introduce a novel, effective procedure for instead simultaneously minimizing loss value and loss sharpness. In particular, our procedure, Sharpness-Aware Minimization (SAM), seeks parameters that lie in neighborhoods having uniformly low loss; this formulation results in a min-max optimization problem on which gradient descent can be performed efficiently. We present empirical results showing that SAM improves model generalization across a variety of benchmark datasets (e.g., CIFAR-{10, 100}, ImageNet, finetuning tasks) and models, yielding novel state-of-the-art performance for several. Additionally, we find that SAM natively provides robustness to label noise on par with that provided by state-of-the-art procedures that specifically target learning with noisy labels.",/pdf/6381fb621c36eb2213f5ef29237a9b4bb75eb839.pdf,ICLR,2021,"Motivated by the connection between geometry of the loss landscape and generalization, we introduce a procedure for simultaneously minimizing loss value and loss sharpness." 
+HJe-blSYvH,HkgcXkgFDH,1569440000000.0,1577170000000.0,2125,Unsupervised Learning of Efficient and Robust Speech Representations,"[""kawakamik@google.com"", ""luyuwang@google.com"", ""cdyer@google.com"", ""pblunsom@google.com"", ""avdnoord@google.com""]","[""Kazuya Kawakami"", ""Luyu Wang"", ""Chris Dyer"", ""Phil Blunsom"", ""Aaron van den Oord""]",[],"We present an unsupervised method for learning speech representations based on a bidirectional contrastive predictive coding that implicitly discovers phonetic structure from large-scale corpora of unlabelled raw audio signals. The representations, which we learn from up to 8000 hours of publicly accessible speech data, are evaluated by looking at their impact on the behaviour of supervised speech recognition systems. First, across a variety of datasets, we find that the features learned from the largest and most diverse pretraining dataset result in significant improvements over standard audio features as well as over features learned from smaller amounts of pretraining data. Second, they significantly improve sample efficiency in low-data scenarios. 
Finally, the features confer significant robustness advantages to the resulting recognition systems: we see significant improvements in out-of-domain transfer relative to baseline feature sets, and the features likewise provide improvements in four different low-resource African language datasets.",/pdf/ef7d5e36057351c491522b302edba745f51f0b3c.pdf,ICLR,2020, +8mVSD0ETOXl,iVym1P8y8Np,1601310000000.0,1614990000000.0,3662,Prediction of Enzyme Specificity using Protein Graph Convolutional Neural Networks,"[""~Changpeng_Lu1"", ""samuelstentz@gatech.edu"", ""jhl133@scarletmail.rutgers.edu"", ""sijian.wang@stat.rutgers.edu"", ""~Sagar_D_Khare1""]","[""Changpeng Lu"", ""Samuel Z Stentz"", ""Joseph H Lubin"", ""Sijian Wang"", ""Sagar D Khare""]","[""graph convolutional neural networks"", ""protease specificity"", ""proteins"", ""Rosetta energy function""]","Specific molecular recognition by proteins, for example, protease enzymes, is critical for maintaining the robustness of key life processes. The substrate specificity landscape of a protease enzyme comprises the set of all sequence motifs that are recognized/cut, or just as importantly, not recognized/cut by the enzyme. Current methods for predicting protease specificity landscapes rely on learning sequence patterns in experimentally derived data with a single enzyme, but are not robust to even small mutational changes. A comprehensive evaluation of specificity requires consideration of the three-dimensional structure and energetics of molecular interactions. In this work, we present a protein graph convolutional neural network (PGCN), which uses a physically intuitive, structure-based molecular interaction graph generated using the Rosetta energy function that describes the topology and energetic features, to determine substrate specificity. We use the PGCN to recapitulate and predict the specificity of the NS3/4 protease from the Hepatitis C virus. 
We compare our PGCN with previously used machine learning models and show that its performance in classification tasks is equivalent or better. Because PGCN is based on physical interactions, it is inherently more interpretable; determination of feature importance reveals key sub-graph patterns responsible for molecular recognition that are biochemically reasonable. The PGCN model also readily lends itself to the design of novel enzymes with tailored specificity against disease targets.",/pdf/6ecf14e2ea8e47bca1ea3cefb4e32341be09371d.pdf,ICLR,2021,A graphical convolutional neural network applied to protein structures predicts with high accuracy the specificity of molecular recognition and sets the stage for design of tailored enzymes against disease targets. +rJYFzMZC-,Bk_tGMW0W,1509130000000.0,1524970000000.0,795,Simulating Action Dynamics with Neural Process Networks,"[""antoineb@cs.washington.edu"", ""omerlevy@cs.washington.edu"", ""ahai@cs.washington.edu"", ""corin123@uw.edu"", ""fox@cs.washington.edu"", ""yejin@cs.washington.edu""]","[""Antoine Bosselut"", ""Omer Levy"", ""Ari Holtzman"", ""Corin Ennis"", ""Dieter Fox"", ""Yejin Choi""]","[""representation learning"", ""memory networks"", ""state tracking""]","Understanding procedural language requires anticipating the causal effects of actions, even when they are not explicitly stated. In this work, we introduce Neural Process Networks to understand procedural text through (neural) simulation of action dynamics. Our model complements existing memory architectures with dynamic entity tracking by explicitly modeling actions as state transformers. The model updates the states of the entities by executing learned action operators. 
Empirical results demonstrate that our proposed model can reason about the unstated causal effects of actions, allowing it to provide more accurate contextual information for understanding and generating procedural text, all while offering more interpretable internal representations than existing alternatives.",/pdf/a29993dd84c74d29ff8981b4a0a6bb6a5c2d5270.pdf,ICLR,2018,We propose a new recurrent memory architecture that can track common sense state changes of entities by simulating the causal effects of actions. +S1e0ZlHYDB,BJlErbeKDr,1569440000000.0,1577170000000.0,2156,Progressive Compressed Records: Taking a Byte Out of Deep Learning Data,"[""mkuchnik@andrew.cmu.edu"", ""gamvrosi@cmu.edu"", ""smithv@cmu.edu""]","[""Michael Kuchnik"", ""George Amvrosiadis"", ""Virginia Smith""]","[""Deep Learning"", ""Storage"", ""Bandwidth"", ""Compression""]","Deep learning training accesses vast amounts of data at high velocity, posing challenges for datasets retrieved over commodity networks and storage devices. We introduce a way to dynamically reduce the overhead of fetching and transporting training data with a method we term Progressive Compressed Records (PCRs). PCRs deviate from previous formats by leveraging progressive compression to split each training example into multiple examples of increasingly higher fidelity, without adding to the total data size. Training examples of similar fidelity are grouped together, which reduces both the system overhead and data bandwidth needed to train a model. We show that models can be trained on aggressively compressed representations of the training data and still retain high accuracy, and that PCRs can enable a 2x speedup on average over baseline formats using JPEG compression. 
Our results hold across deep learning architectures for a wide range of datasets: ImageNet, HAM10000, Stanford Cars, and CelebA-HQ.",/pdf/e28989ea5e11ee2e03007933b6fc61345e1f3fda.pdf,ICLR,2020,"We propose a simple, general, and space-efficient data format to accelerate deep learning training by allowing sample fidelity to be dynamically selected at training time" +BylJUTEKvB,SklcBCDvDr,1569440000000.0,1577170000000.0,545,Cross-Iteration Batch Normalization,"[""yaozhuliang13@gmail.com"", ""yuecao@microsoft.com"", ""shuxin.zheng@microsoft.com"", ""gaohuang@tsinghua.edu.cn"", ""stevelin@microsoft.com"", ""jifdai@microsoft.com""]","[""Zhuliang Yao"", ""Yue Cao"", ""Shuxin Zheng"", ""Gao Huang"", ""Stephen Lin"", ""Jifeng Dai""]","[""batch normalization"", ""small batch size""]","A well-known issue of Batch Normalization is its significantly reduced effectiveness in the case of small mini-batch sizes. When a mini-batch contains few examples, the statistics upon which the normalization is defined cannot be reliably estimated from it during a training iteration. To address this problem, we present Cross-Iteration Batch Normalization (CBN), in which examples from multiple recent iterations are jointly utilized to enhance estimation quality. A challenge of computing statistics over multiple iterations is that the network activations from different iterations are not comparable to each other due to changes in network weights. We thus compensate for the network weight changes via a proposed technique based on Taylor polynomials, so that the statistics can be accurately estimated and batch normalization can be effectively applied. 
On object detection and image classification with small mini-batch sizes, CBN is found to outperform the original batch normalization and a direct calculation of statistics over previous iterations without the proposed compensation technique.",/pdf/cb4db66f6ded623b263bd546deea3a492b5b7906.pdf,ICLR,2020,"We propose to borrow and compensate the statistics from previous iterations, to enhance statistics estimation quality in current iteration for batch normalization." +r1x0lxrFPS,B1gVLAJtPr,1569440000000.0,1583910000000.0,2118,BinaryDuo: Reducing Gradient Mismatch in Binary Activation Network by Coupling Binary Activations,"[""hyungjun.kim@postech.ac.kr"", ""kyungsu.kim@postech.ac.kr"", ""jinseok.kim@postech.ac.kr"", ""jaejoon@postech.ac.kr""]","[""Hyungjun Kim"", ""Kyungsu Kim"", ""Jinseok Kim"", ""Jae-Joon Kim""]",[],"Binary Neural Networks (BNNs) have been garnering interest thanks to their compute cost reduction and memory savings. However, BNNs suffer from performance degradation mainly due to the gradient mismatch caused by binarizing activations. Previous works tried to address the gradient mismatch problem by reducing the discrepancy between activation functions used at forward pass and its differentiable approximation used at backward pass, which is an indirect measure. In this work, we use the gradient of smoothed loss function to better estimate the gradient mismatch in quantized neural network. Analysis using the gradient mismatch estimator indicates that using higher precision for activation is more effective than modifying the differentiable approximation of activation function. Based on the observation, we propose a new training scheme for binary activation networks called BinaryDuo in which two binary activations are coupled into a ternary activation during training. 
Experimental results show that BinaryDuo outperforms state-of-the-art BNNs on various benchmarks with the same amount of parameters and computing cost.",/pdf/d2650d092434446c1f01a83f6ccdebefd03bfdd9.pdf,ICLR,2020, +SygD31HFvB,BJgEjbkYvr,1569440000000.0,1577170000000.0,1954,A Novel Analysis Framework of Lower Complexity Bounds for Finite-Sum Optimization,"[""smsxgz@pku.edu.cn"", ""rickyluoluo@gmail.com"", ""zhzhang@math.pku.edu.cn""]","[""Guangzeng Xie"", ""Luo Luo"", ""Zhihua Zhang""]","[""convex optimization"", ""lower bound complexity"", ""proximal incremental first-order oracle""]","This paper studies the lower bound complexity for the optimization problem whose objective function is the average of $n$ individual smooth convex functions. We consider the algorithm which gets access to gradient and proximal oracle for each individual component. +For the strongly-convex case, we prove such an algorithm can not reach an $\eps$-suboptimal point in fewer than $\Omega((n+\sqrt{\kappa n})\log(1/\eps))$ iterations, where $\kappa$ is the condition number of the objective function. This lower bound is tighter than previous results and perfectly matches the upper bound of the existing proximal incremental first-order oracle algorithm Point-SAGA. +We develop a novel construction to show the above result, which partitions the tridiagonal matrix of classical examples into $n$ groups to make the problem difficult enough to stochastic algorithms. 
+This construction is friendly to the analysis of proximal oracle and also could be used in general convex and average smooth cases naturally.",/pdf/746a8f488b060a784482d5ebf415fc66f692a2d3.pdf,ICLR,2020, +BXewfAYMmJw,iq7myUB0Qop,1601310000000.0,1611610000000.0,233,Counterfactual Generative Networks,"[""~Axel_Sauer1"", ""~Andreas_Geiger3""]","[""Axel Sauer"", ""Andreas Geiger""]","[""Causality"", ""Counterfactuals"", ""Generative Models"", ""Robustness"", ""Image Classification"", ""Data Augmentation""]","Neural networks are prone to learning shortcuts -- they often model simple correlations, ignoring more complex ones that potentially generalize better. Prior works on image classification show that instead of learning a connection to object shape, deep classifiers tend to exploit spurious correlations with low-level texture or the background for solving the classification task. In this work, we take a step towards more robust and interpretable classifiers that explicitly expose the task's causal structure. Building on current advances in deep generative modeling, we propose to decompose the image generation process into independent causal mechanisms that we train without direct supervision. By exploiting appropriate inductive biases, these mechanisms disentangle object shape, object texture, and background; hence, they allow for generating counterfactual images. We demonstrate the ability of our model to generate such images on MNIST and ImageNet. Further, we show that the counterfactual images can improve out-of-distribution robustness with a marginal drop in performance on the original classification task, despite being synthetic. Lastly, our generative model can be trained efficiently on a single GPU, exploiting common pre-trained models as inductive biases.",/pdf/ef0dd5122ec4977085ea0ac017f21a3cb33377c5.pdf,ICLR,2021,A generative model structured into independent causal mechanisms produces images for training invariant classifiers. 
+2CjEVW-RGOJ,Q2a2GivO7D7,1601310000000.0,1615900000000.0,2989,SkipW: Resource Adaptable RNN with Strict Upper Computational Limit,"[""mayet.tsiry@gmail.com"", ""anne.lambert@interdigital.com"", ""pascal.leguyadec@interdigital.com"", ""francoise.lebolzer@interdigital.com"", ""~Fran\u00e7ois_Schnitzler1""]","[""Tsiry Mayet"", ""Anne Lambert"", ""Pascal Leguyadec"", ""Francoise Le Bolzer"", ""Fran\u00e7ois Schnitzler""]","[""Recurrent neural networks"", ""Flexibility"", ""Computational resources""]","We introduce Skip-Window, a method to allow recurrent neural networks (RNNs) to trade off accuracy for computational cost during the analysis of a sequence. Similarly to existing approaches, Skip-Window extends existing RNN cells by adding a mechanism to encourage the model to process fewer inputs. Unlike existing approaches, Skip-Window is able to respect a strict computational budget, making this model more suitable for limited hardware. We evaluate this approach on two datasets: a human activity recognition task and adding task. Our results show that Skip-Window is able to exceed the accuracy of existing approaches for a lower computational cost while strictly limiting said cost.",/pdf/6c45c14eaa50cfd7a61ea01da21211148f40eccf.pdf,ICLR,2021,Skip-Window is a method to allow recurrent neural networks (RNNs) to trade off accuracy for computational cost during the analysis of a sequence while keeping a strict upper computational limit. +xoPj3G-OKNM,ZxCjmxn3bk,1601310000000.0,1614990000000.0,3093,Stochastic Normalized Gradient Descent with Momentum for Large Batch Training,"[""~Shen-Yi_Zhao2"", ""~Yin-Peng_Xie1"", ""~Wu-Jun_Li1""]","[""Shen-Yi Zhao"", ""Yin-Peng Xie"", ""Wu-Jun Li""]","[""normalized gradient"", ""large batch training size""]","Stochastic gradient descent (SGD) and its variants have been the dominating optimization methods in machine learning. 
Compared with small batch training, SGD with large batch training can better utilize the computational power of current multi-core systems like GPUs and can reduce the number of communication rounds in distributed training. Hence, SGD with large batch training has attracted more and more attention. However, existing empirical results show that large batch training typically leads to a drop of generalization accuracy. As a result, large batch training has also become a challenging topic. In this paper, we propose a novel method, called stochastic normalized gradient descent with momentum (SNGM), for large batch training. We theoretically prove that compared to momentum SGD (MSGD) which is one of the most widely used variants of SGD, SNGM can adopt a larger batch size to converge to the $\epsilon$-stationary point with the same computation complexity (total number of gradient computation). Empirical results on deep learning also show that SNGM can achieve the state-of-the-art accuracy with a large batch size.",/pdf/01ef2b3691bc635c1e9974597b16ff198581589b.pdf,ICLR,2021,A novel method called stochastic normalized gradient descent with momentum (SNGM) for large batch training. +EVV259WQuFG,VgVwKqOaa9I,1601310000000.0,1614990000000.0,3420,Machine Reading Comprehension with Enhanced Linguistic Verifiers,"[""~Xianchao_Wu1""]","[""Xianchao Wu""]","[""machine reading comprehension"", ""BERT"", ""linguistic verifiers"", ""hierarchical attention networks""]","We propose two linguistic verifiers for span-extraction style machine reading comprehension to respectively tackle two challenges: how to evaluate the syntactic completeness of predicted answers and how to utilize the rich context of long documents. 
Our first verifier rewrites a question through replacing its interrogatives by the predicted answer phrases and then builds a cross-attention scorer between the rewritten question and the segment, so that the answer candidates are scored in a \emph{position-sensitive} context. Our second verifier builds a hierarchical attention network to represent segments in a passage where neighbour segments in long passages are \emph{recurrently connected} and can contribute to current segment-question pair's inference for answerability classification and boundary determination. We then combine these two verifiers together into a pipeline and apply it to SQuAD2.0, NewsQA and TriviaQA benchmark sets. Our pipeline achieves significantly better improvements of both exact matching and F1 scores than state-of-the-art baselines.",/pdf/22a206fc285819511b311a0bc05890cd940ebafd.pdf,ICLR,2021,"Two novel linguistic verifiers for answerable questions in machine reading comprehension, one to judge the linguistic correctness of answer phrases and the other to enrich long paragraph contexts by hierarchical attentions." +ryXZmzNeg,,1477870000000.0,1484250000000.0,13,Improving Sampling from Generative Autoencoders with Markov Chains,"[""ac2211@imperial.ac.uk"", ""ka709@imperial.ac.uk"", ""aab01@imperial.ac.uk""]","[""Antonia Creswell"", ""Kai Arulkumaran"", ""Anil Anthony Bharath""]","[""Deep learning"", ""Unsupervised Learning"", ""Theory""]","We focus on generative autoencoders, such as variational or adversarial autoencoders, which jointly learn a generative model alongside an inference model. Generative autoencoders are those which are trained to softly enforce a prior on the latent distribution learned by the inference model. We call the distribution to which the inference model maps observed samples, the learned latent distribution, which may not be consistent with the prior. 
We formulate a Markov chain Monte Carlo (MCMC) sampling process, equivalent to iteratively decoding and encoding, which allows us to sample from the learned latent distribution. Since the generative model learns to map from the learned latent distribution, rather than the prior, we may use MCMC to improve the quality of samples drawn from the generative model, especially when the learned latent distribution is far from the prior. Using MCMC sampling, we are able to reveal previously unseen differences between generative autoencoders trained either with or without a denoising criterion.",/pdf/cf9d1cd0e67f6e9670e922a7cf867001bb4dabd9.pdf,ICLR,2017,Iteratively encoding and decoding samples from generative autoencoders recovers samples from the true latent distribution learned by the model +Hyg-JC4FDr,BJlAzSGuDB,1569440000000.0,1583910000000.0,882,Imitation Learning via Off-Policy Distribution Matching,"[""kostrikov@cs.nyu.edu"", ""ofirnachum@google.com"", ""tompson@google.com""]","[""Ilya Kostrikov"", ""Ofir Nachum"", ""Jonathan Tompson""]","[""reinforcement learning"", ""deep learning"", ""imitation learning"", ""adversarial learning""]","When performing imitation learning from expert demonstrations, distribution matching is a popular approach, in which one alternates between estimating distribution ratios and then using these ratios as rewards in a standard reinforcement learning (RL) algorithm. Traditionally, estimation of the distribution ratio requires on-policy data, which has caused previous work to either be exorbitantly data-inefficient or alter the original objective in a manner that can drastically change its optimum. In this work, we show how the original distribution ratio estimation objective may be transformed in a principled manner to yield a completely off-policy objective. In addition to the data-efficiency that this provides, we are able to show that this objective also renders the use of a separate RL optimization unnecessary. 
Rather, an imitation policy may be learned directly from this objective without the use of explicit rewards. We call the resulting algorithm ValueDICE and evaluate it on a suite of popular imitation learning benchmarks, finding that it can achieve state-of-the-art sample efficiency and performance.",/pdf/e760b8c2e400ca1d33a3bbac52125055587fe455.pdf,ICLR,2020, +BJevihVtwB,SkxagHgWwr,1569440000000.0,1577170000000.0,154,BOOSTING ENCODER-DECODER CNN FOR INVERSE PROBLEMS,"[""eunju.cha@kaist.ac.kr"", ""jduck.jang@samsung.com"", ""jh0325.lee@samsung.com"", ""eunhayo.lee@samsung.com"", ""jong.ye@kaist.ac.kr""]","[""Eunju Cha"", ""Jaeduck Jang"", ""Junho Lee"", ""Eunha Lee"", ""Jong Chul Ye""]","[""Prediction error"", ""Boosting"", ""Encoder-decoder convolutional neural network"", ""Inverse problem""]","Encoder-decoder convolutional neural networks (CNN) have been extensively used for various inverse problems. However, their prediction error for unseen test data is difficult to estimate a priori, since the neural networks are trained using only selected data and their architectures are largely considered black boxes. This poses a fundamental challenge in improving the performance of neural networks. Recently, it was shown that Stein’s unbiased risk estimator (SURE) can be used as an unbiased estimator of the prediction error for denoising problems. However, the computation of the divergence term in SURE is difficult to implement in a neural network framework, and the condition to avoid trivial identity mapping is not well defined. In this paper, inspired by the finding that an encoder-decoder CNN can be expressed as a piecewise linear representation, we provide a closed-form expression of the unbiased estimator for the prediction error. The closed-form representation leads to a novel boosting scheme to prevent a neural network from converging to an identity mapping so that it can enhance the performance. 
Experimental results show that the proposed algorithm provides consistent improvement in various inverse problems.",/pdf/8bc038924aef9dc0a4d004adf50f08a472fe6d60.pdf,ICLR,2020, +xgGS6PmzNq6,bGNJOQh9tI4,1601310000000.0,1615790000000.0,1434,On Dyadic Fairness: Exploring and Mitigating Bias in Graph Connections,"[""~Peizhao_Li1"", ""~Yifei_Wang3"", ""~Han_Zhao1"", ""~Pengyu_Hong1"", ""~Hongfu_Liu2""]","[""Peizhao Li"", ""Yifei Wang"", ""Han Zhao"", ""Pengyu Hong"", ""Hongfu Liu""]","[""algorithmic fairness"", ""graph-structured data""]","Disparate impact has raised serious concerns in machine learning applications and its societal impacts. In response to the need of mitigating discrimination, fairness has been regarded as a crucial property in algorithmic design. In this work, we study the problem of disparate impact on graph-structured data. Specifically, we focus on dyadic fairness, which articulates a fairness concept that a predictive relationship between two instances should be independent of the sensitive attributes. Based on this, we theoretically relate the graph connections to dyadic fairness on link predictive scores in learning graph neural networks, and reveal that regulating weights on existing edges in a graph contributes to dyadic fairness conditionally. Subsequently, we propose our algorithm, \textbf{FairAdj}, to empirically learn a fair adjacency matrix with proper graph structural constraints for fair link prediction, and in the meanwhile preserve predictive accuracy as much as possible. 
Empirical validation demonstrates that our method delivers effective dyadic fairness in terms of various statistics, and at the same time enjoys a favorable fairness-utility tradeoff.",/pdf/752c587f7b01688e1a6372a843b7256cf17bad0a.pdf,ICLR,2021,A new method on the fairness of predictive relationships in graph-structured data +kyaIeYj4zZ,B94otqlAzy,1601310000000.0,1616050000000.0,3175,GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing,"[""~Tao_Yu5"", ""~Chien-Sheng_Wu1"", ""~Xi_Victoria_Lin1"", ""~bailin_wang1"", ""yichern.tan@yale.edu"", ""x.yang@salesforce.com"", ""~Dragomir_Radev2"", ""~richard_socher1"", ""~Caiming_Xiong1""]","[""Tao Yu"", ""Chien-Sheng Wu"", ""Xi Victoria Lin"", ""bailin wang"", ""Yi Chern Tan"", ""Xinyi Yang"", ""Dragomir Radev"", ""richard socher"", ""Caiming Xiong""]","[""text-to-sql"", ""semantic parsing"", ""pre-training"", ""nlp""]","We present GraPPa, an effective pre-training approach for table semantic parsing that learns a compositional inductive bias in the joint representations of textual and tabular data. We construct synthetic question-SQL pairs over high-quality tables via a synchronous context-free grammar (SCFG). We pre-train our model on the synthetic data to inject important structural properties commonly found in semantic parsing into the pre-training language model. To maintain the model's ability to represent real-world data, we also include masked language modeling (MLM) on several existing table-related datasets to regularize our pre-training process. Our proposed pre-training strategy is highly data-efficient. When incorporated with strong base semantic parsers, GraPPa achieves new state-of-the-art results on four popular fully supervised and weakly supervised table semantic parsing tasks.",/pdf/41a8d65642880c0853bfa9f37d81b4fc15cba53e.pdf,ICLR,2021,Language model pre-training for table semantic parsing. 
+Bkul3t9ee,,1478300000000.0,1484370000000.0,521,Unsupervised Perceptual Rewards for Imitation Learning,"[""sermanet@google.com"", ""kelvinxx@google.com"", ""slevine@google.com""]","[""Pierre Sermanet"", ""Kelvin Xu"", ""Sergey Levine""]","[""Computer vision"", ""Deep learning"", ""Unsupervised Learning"", ""Reinforcement Learning"", ""Transfer Learning""]","Reward function design and exploration time are arguably the biggest obstacles to the deployment of reinforcement learning (RL) agents in the real world. In many real-world tasks, designing a suitable reward function takes considerable manual engineering and often requires additional and potentially visible sensors to be installed just to measure whether the task has been executed successfully. Furthermore, many interesting tasks consist of multiple steps that must be executed in sequence. Even when the final outcome can be measured, it does not necessarily provide useful feedback on these implicit intermediate steps or sub-goals. +To address these issues, we propose leveraging the abstraction power of intermediate visual representations learned by deep models to quickly infer perceptual reward functions from small numbers of demonstrations. We present a method that is able to identify the key intermediate steps of a task from only a handful of demonstration sequences, and automatically identify the most discriminative features for identifying these steps. This method makes use of the features in a pre-trained deep model, but does not require any explicit sub-goal supervision. The resulting reward functions, which are dense and smooth, can then be used by an RL agent to learn to perform the task in real-world settings. To evaluate the learned reward functions, we present qualitative results on two real-world tasks and a quantitative evaluation against a human-designed reward function. 
We also demonstrate that our method can be used to learn a complex real-world door opening skill using a real robot, even when the demonstration used for reward learning is provided by a human using their own hand. To our knowledge, these are the first results showing that complex robotic manipulation skills can be learned directly and without supervised labels from a video of a human performing the task. +",/pdf/939de21e00049bd95fa78820f7e7e99cc37df62b.pdf,ICLR,2017,Real robots learn new tasks from observing a few human demonstrations. +rLj5jTcCUpp,UMZZsHl0lv2,1601310000000.0,1614990000000.0,924,Distribution Embedding Network for Meta-Learning with Variable-Length Input,"[""~Lang_Liu1"", ""~Mahdi_Milani_Fard1"", ""~Sen_Zhao1""]","[""Lang Liu"", ""Mahdi Milani Fard"", ""Sen Zhao""]","[""meta-learning"", ""variable-length input"", ""distribution embedding""]","We propose Distribution Embedding Network (DEN) for meta-learning, which is designed for applications where both the distribution and the number of features could vary across tasks. DEN first transforms features using a learned piecewise linear function, then learns an embedding of the underlying data distribution after the transformation, and finally classifies examples based on the distribution embedding. We show that the parameters of the distribution embedding and the classification modules can be shared across tasks. We propose a novel methodology to mass-simulate binary classification training tasks, and demonstrate that DEN outperforms existing methods in a number of test tasks in numerical studies.",/pdf/910d44e155f9d26eac9ccb0a03a91aeec501a668.pdf,ICLR,2021,"In this paper we propose Distribution Embedding Network for meta-learning, which is designed for applications where both the distribution and the number of features could vary across tasks." 
+Hk6a8N5xe,,1478280000000.0,1479090000000.0,216,Classify or Select: Neural Architectures for Extractive Document Summarization,"[""nallapati@us.ibm.com"", ""zhou@us.ibm.com"", ""mam@oregonstate.edu""]","[""Ramesh Nallapati"", ""Bowen Zhou and Mingbo Ma""]","[""Natural language processing"", ""Supervised Learning"", ""Applications"", ""Deep learning""]","We present two novel and contrasting Recurrent Neural Network (RNN) based architectures for extractive summarization of documents. The Classifier based architecture sequentially accepts or rejects each sentence in the original document order for its membership in the summary. The Selector architecture, on the other hand, is free to pick one sentence at a time in any arbitrary order to generate the extractive summary. + +Our models under both architectures jointly capture the notions of salience and redundancy of sentences. In addition, these models have the advantage of being very interpretable, since they allow visualization of their predictions broken up by abstract features such as information content, salience and redundancy. + +We show that our models reach or outperform state-of-the-art supervised models on two different corpora. We also recommend the conditions under which one architecture is superior to the other based on experimental evidence.",/pdf/bb98272f8b9475b02be09aa9f64eb18db6e42de1.pdf,ICLR,2017,"This paper presents two different neural architectures for extractive document summarization whose predictions are very interpretable, and show that they reach or outperform state-of-the-art supervised models." +H1acq85gx,,1478290000000.0,1493400000000.0,317,Maximum Entropy Flow Networks,"[""gl2480@columbia.edu"", ""yg2312@columbia.edu"", ""jpc2181@columbia.edu""]","[""Gabriel Loaiza-Ganem *"", ""Yuanjun Gao *"", ""John P. Cunningham""]",[],"Maximum entropy modeling is a flexible and popular framework for formulating statistical models given partial knowledge. 
In this paper, rather than the traditional method of optimizing over the continuous density directly, we learn a smooth and invertible transformation that maps a simple distribution to the desired maximum entropy distribution. Doing so is nontrivial in that the objective being maximized (entropy) is a function of the density itself. By exploiting recent developments in normalizing flow networks, we cast the maximum entropy problem into a finite-dimensional constrained optimization, and solve the problem by combining stochastic optimization with the augmented Lagrangian method. Simulation results demonstrate the effectiveness of our method, and applications to finance and computer vision show the flexibility and accuracy of using maximum entropy flow networks.",/pdf/e66ee1788fa4282a33b0db6b577e9bc871a82d3b.pdf,ICLR,2017, +w_BtePbtmx4,Bac0lY_Dw0u,1601310000000.0,1614990000000.0,2641,Accelerating DNN Training through Selective Localized Learning ,"[""~Sarada_Krithivasan1"", ""~Sanchari_Sen1"", ""swagath.venkataramani@us.ibm.com"", ""~Anand_Raghunathan1""]","[""Sarada Krithivasan"", ""Sanchari Sen"", ""Swagath Venkataramani"", ""Anand Raghunathan""]","[""Efficient DNN Training""]","Training Deep Neural Networks (DNNs) places immense compute requirements on the underlying hardware platforms, expending large amounts of time and energy. We propose LoCal+SGD, a new algorithmic approach to accelerate DNN training by selectively combining localized or Hebbian learning within a Stochastic Gradient Descent (SGD) based training framework. Back-propagation is a computationally expensive process that requires 2 Generalized Matrix Multiply (GEMM) operations to compute the error and weight gradients for each layer. We alleviate this by selectively updating some layers’ weights using localized learning rules that require only 1 GEMM operation per layer. 
Further, since the weight update is performed during the forward pass itself, the layer activations for the mini-batch do not need to be stored until the backward pass, resulting in a reduced memory footprint. Localized updates can substantially boost training speed, but need to be used selectively and judiciously in order to preserve accuracy and convergence. We address this challenge through the design of a Learning Mode Selection Algorithm, where all layers start with SGD, and as epochs progress, layers gradually transition to localized learning. Specifically, for each epoch, the algorithm identifies a Localized→SGD transition layer, which delineates the network into two regions. Layers before the transition layer use localized updates, while the transition layer and later layers use gradient-based updates. The trend in the weight updates made to the transition layer across epochs is used to determine how the boundary between SGD and localized updates is shifted in future epochs. We also propose a low-cost weak supervision mechanism by controlling the learning rate of localized updates based on the overall training loss. We applied LoCal+SGD to 8 image recognition CNNs (including ResNet50 and MobileNetV2) across 3 datasets (Cifar10, Cifar100 and ImageNet). 
Our measurements on a Nvidia GTX 1080Ti GPU demonstrate up to 1.5× improvement in end-to-end training time with ∼0.5% loss in Top-1 classification accuracy.",/pdf/926351e27cb4089d69321ebca1dd73e07c22d83f.pdf,ICLR,2021, +K3qa-sMHpQX,UECJh8BrH2b,1601310000000.0,1614990000000.0,997,ForceNet: A Graph Neural Network for Large-Scale Quantum Chemistry Simulation,"[""~Weihua_Hu1"", ""~Muhammed_Shuaibi1"", ""~Abhishek_Das1"", ""~Siddharth_Goyal2"", ""~Anuroop_Sriram1"", ""~Jure_Leskovec1"", ""~Devi_Parikh1"", ""~Larry_Zitnick1""]","[""Weihua Hu"", ""Muhammed Shuaibi"", ""Abhishek Das"", ""Siddharth Goyal"", ""Anuroop Sriram"", ""Jure Leskovec"", ""Devi Parikh"", ""Larry Zitnick""]","[""Graph Neural Networks"", ""Physical simulation"", ""Quantum chemistry"", ""Catalysis""]","Machine Learning (ML) has a potential to dramatically accelerate large-scale physics-based simulations. However, practical models for real large-scale and complex problems remain out of reach. Here we present ForceNet, a model for accurate and fast quantum chemistry simulations to accelerate catalyst discovery for renewable energy applications. ForceNet is a graph neural network that uses surrounding 3D molecular structure to estimate per-atom forces---a central capability for performing atomic simulations. The key challenge is to accurately capture highly complex and non-linear quantum interactions of atoms in 3D space, on which forces are dependent. To this end, ForceNet adopts (1) expressive message passing architecture, (2) appropriate choice of basis and non-linear activation functions, and (3) model scaling in terms of network depth and width. We show ForceNet reduces the estimation error of atomic forces by 30% compared to existing ML models, and generalizes well to out-of-distribution structures. Finally, we apply ForceNet to the large-scale catalyst dataset, OC20. 
We use ForceNet to perform quantum chemistry simulations, where ForceNet is able to achieve 4x higher success rate than existing ML models. Overall, we demonstrate the potential for ML-based simulations to achieve practical usefulness while being orders of magnitude faster than physics-based simulations.",/pdf/7d0c2a4fc39a0ca804bfbba54aa583beac3610b6.pdf,ICLR,2021,Graph neural network designed to model the complex interactions in large systems of atoms for use in simulating atomic relaxations of catalysts. +H1eF3kStPS,SkgA4z1twH,1569440000000.0,1577170000000.0,1960,Redundancy-Free Computation Graphs for Graph Neural Networks,"[""zhihao@cs.stanford.edu"", ""silin@microsoft.com"", ""rexying@stanford.edu"", ""jiaxuan@stanford.edu"", ""jure@cs.stanford.edu"", ""aiken@cs.stanford.edu""]","[""Zhihao Jia"", ""Sina Lin"", ""Rex Ying"", ""Jiaxuan You"", ""Jure Leskovec"", ""Alex Aiken.""]","[""Graph Neural Networks"", ""Runtime Performance""]","Graph Neural Networks (GNNs) are based on repeated aggregations of information across nodes’ neighbors in a graph. However, because common neighbors are shared between different nodes, this leads to repeated and inefficient computations. We propose Hierarchically Aggregated computation Graphs (HAGs), a new GNN graph representation that explicitly avoids redundancy by managing intermediate aggregation results hierarchically, and eliminating repeated computations and unnecessary data transfers in GNN training and inference. We introduce an accurate cost function to quantitatively evaluate the runtime performance of different HAGs and use a novel search algorithm to find optimized HAGs. Experiments show that the HAG representation significantly outperforms the standard GNN graph representation by increasing the end-to-end training throughput by up to 2.8x and reducing the aggregations and data transfers in GNN training by up to 6.3x and 5.6x. 
Meanwhile, HAGs improve runtime performance by preserving GNN computation, and maintain the original model accuracy for arbitrary GNNs.",/pdf/a1ce3d20aa426d8f78214ce4b8e53940f245ff5a.pdf,ICLR,2020,"We present Hierarchically Aggregated computation Graphs (HAGs), a new GNN graph representation that explicitly avoids redundant computations in GNN training and inference." +BkVsEMYel,,1478200000000.0,1488280000000.0,81,Inductive Bias of Deep Convolutional Networks through Pooling Geometry,"[""cohennadav@cs.huji.ac.il"", ""shashua@cs.huji.ac.il""]","[""Nadav Cohen"", ""Amnon Shashua""]","[""Theory"", ""Deep learning""]","Our formal understanding of the inductive bias that drives the success of convolutional networks on computer vision tasks is limited. In particular, it is unclear what makes hypotheses spaces born from convolution and pooling operations so suitable for natural images. In this paper we study the ability of convolutional networks to model correlations among regions of their input. We theoretically analyze convolutional arithmetic circuits, and empirically validate our findings on other types of convolutional networks as well. Correlations are formalized through the notion of separation rank, which for a given partition of the input, measures how far a function is from being separable. We show that a polynomially sized deep network supports exponentially high separation ranks for certain input partitions, while being limited to polynomial separation ranks for others. The network's pooling geometry effectively determines which input partitions are favored, thus serves as a means for controlling the inductive bias. Contiguous pooling windows as commonly employed in practice favor interleaved partitions over coarse ones, orienting the inductive bias towards the statistics of natural images. Other pooling schemes lead to different preferences, and this allows tailoring the network to data that departs from the usual domain of natural imagery. 
In addition to analyzing deep networks, we show that shallow ones support only linear separation ranks, and by this gain insight into the benefit of functions brought forth by depth - they are able to efficiently model strong correlation under favored partitions of the input.",/pdf/fa9c2b48cf7eb5adef47a580055f4355cf51996e.pdf,ICLR,2017,"We study the ability of convolutional networks to model correlations among regions of their input, showing that this is controlled by shapes of pooling windows." +H1ldzA4tPr,H1lkq0EOwH,1569440000000.0,1588040000000.0,1005,Learning Compositional Koopman Operators for Model-Based Control,"[""liyunzhu@mit.edu"", ""haohe@mit.edu"", ""jiajunwu.cs@gmail.com"", ""dina@csail.mit.edu"", ""torralba@csail.mit.edu""]","[""Yunzhu Li"", ""Hao He"", ""Jiajun Wu"", ""Dina Katabi"", ""Antonio Torralba""]","[""Koopman operators"", ""graph neural networks"", ""compositionality""]","Finding an embedding space for a linear approximation of a nonlinear dynamical system enables efficient system identification and control synthesis. The Koopman operator theory lays the foundation for identifying the nonlinear-to-linear coordinate transformations with data-driven methods. Recently, researchers have proposed to use deep neural networks as a more expressive class of basis functions for calculating the Koopman operators. These approaches, however, assume a fixed dimensional state space; they are therefore not applicable to scenarios with a variable number of objects. In this paper, we propose to learn compositional Koopman operators, using graph neural networks to encode the state into object-centric embeddings and using a block-wise linear transition matrix to regularize the shared structure across objects. The learned dynamics can quickly adapt to new environments of unknown physical parameters and produce control signals to achieve a specified goal. 
Our experiments on manipulating ropes and controlling soft robots show that the proposed method has better efficiency and generalization ability than existing baselines.",/pdf/1d02dae1fcf8967c227d804b26a99c0e22e96d83.pdf,ICLR,2020,Learning compositional Koopman operators for efficient system identification and model-based control. +rJxyqkSYDH,rklaZu0ODH,1569440000000.0,1577170000000.0,1863,A Simple Dynamic Learning Rate Tuning Algorithm For Automated Training of DNNs,"[""koyelmjee@gmail.com"", ""kharealind@gmail.com"", ""ysabharwal@in.ibm.com"", ""ashish.verma1@ibm.com""]","[""Koyel Mukherjee"", ""Alind Khare"", ""Yogish Sabharwal"", ""Ashish Verma""]","[""adaptive LR tuning algorithm"", ""generalization""]","Training neural networks on image datasets generally require extensive experimentation to find the optimal learning rate regime. Especially, for the cases of adversarial training or for training a newly synthesized model, one would not know the best learning rate regime beforehand. We propose an automated algorithm for determining the learning rate trajectory, that works across datasets and models for both natural and adversarial training, without requiring any dataset/model specific tuning. It is a stand-alone, parameterless, adaptive approach with no computational overhead. We theoretically discuss the algorithm's convergence behavior. We empirically validate our algorithm extensively. Our results show that our proposed approach \emph{consistently} achieves top-level accuracy compared to SOTA baselines in the literature in natural training, as well as in adversarial training.",/pdf/4d0102d98c8084fc76dce764e0651f74c4781226.pdf,ICLR,2020,"We propose an automated, adaptive LR tuning algorithm for training DNNs that works as well or better than SOTA for different model-dataset combinations tried for natural as well as adversarial training, with theoretical convergence analysis." 
+Rq31tXaqXq,fDv-r4ol0dKw,1601310000000.0,1614990000000.0,2615,VideoFlow: A Framework for Building Visual Analysis Pipelines,"[""~Yue_Wu18"", ""jianqiang.hjq@alibaba-inc.com"", ""moevis.zjj@alibaba-inc.com"", ""guokun.wgk@alibaba-inc.com"", ""jason.sc@alibaba-inc.com"", ""zhouchang.zc@alibaba-inc.com"", ""xiansheng.hxs@alibaba-inc.com""]","[""Yue Wu"", ""Jianqiang Huang"", ""Jiangjie Zhen"", ""Guokun Wang"", ""Chen Shen"", ""Chang Zhou"", ""Xian-Sheng Hua""]","[""Computation graph"", ""Resource"", ""Computer vision"", ""Deep learning"", ""Framework"", ""Software""]","The past years have witnessed an explosion of deep learning frameworks like PyTorch and TensorFlow since the success of deep neural networks. These frameworks have significantly facilitated algorithm development in multimedia research and production. However, how to easily and efficiently build an end-to-end visual analysis pipeline with these algorithms is still an open issue. In most cases, developers have to spend a huge amount of time tackling data input and output, optimizing computation efficiency, or even debugging exhausting memory leaks together with algorithm development. VideoFlow aims to overcome these challenges by providing a flexible, efficient, extensible, and secure visual analysis framework for both the academia and industry. With VideoFlow, developers can focus on the improvement of algorithms themselves, as well as the construction of a complete visual analysis workflow. VideoFlow has been incubated in the practices of smart city innovation for more than three years. It has been widely used in tens of intelligent visual analysis systems. VideoFlow will be open-sourced at \url{https://github.com/xxx/videoflow}.",/pdf/75583d0c22c9ffb48a073f9a8f25209cbf0a6474.pdf,ICLR,2021,"a flexible, efficient, extensible and secure framework to build visual analysis pipelines for both the academia and industry." 
+8Xi5MLFE_IW,l8T3jC78gF3,1601310000000.0,1614990000000.0,3752,Episodic Memory for Learning Subjective-Timescale Models,"[""~Alexey_Zakharov1"", ""m.crosby@imperial.ac.uk"", ""~Zafeirios_Fountas1""]","[""Alexey Zakharov"", ""Matthew Crosby"", ""Zafeirios Fountas""]","[""Episodic Memory"", ""Time Perception"", ""Active Inference"", ""Model-based Reinforcement Learning""]","In model-based learning, an agent’s model is commonly defined over transitions between consecutive states of an environment even though planning often requires reasoning over multi-step timescales, with intermediate states either unnecessary, or worse, accumulating prediction error. In contrast, intelligent behaviour in biological organisms is characterised by the ability to plan over varying temporal scales depending on the context. Inspired by the recent works on human time perception, we devise a novel approach to learning a transition dynamics model, based on the sequences of episodic memories that define the agent's subjective timescale – over which it learns world dynamics and over which future planning is performed. We implement this in the framework of active inference and demonstrate that the resulting subjective-timescale model (STM) can systematically vary the temporal extent of its predictions while preserving the same computational efficiency. Additionally, we show that STM predictions are more likely to introduce future salient events (for example new objects coming into view), incentivising exploration of new areas of the environment. As a result, STM produces more informative action-conditioned roll-outs that assist the agent in making better decisions. 
We validate significant improvement in our STM agent's performance in the Animal-AI environment against a baseline system, trained using the environment's objective-timescale dynamics.",/pdf/bbaba857ce10fe90389750a85ffbb5dab2edb31d.pdf,ICLR,2021,A subjective-timescale transition model of episodic memory for planning over variable timescales in an active inference agent. +defQ1AG6IWn,bqp6R4974KK,1601310000000.0,1614990000000.0,482,Neighbor Class Consistency on Unsupervised Domain Adaptation,"[""liu.chang6@northeastern.edu"", ""~Kai_Li3"", ""~Yun_Fu1""]","[""Chang Liu"", ""Kai Li"", ""Yun Fu""]","[""Unsupervised domain adaptation"", ""Consistency regularization"", ""Neighbor""]","Unsupervised domain adaptation (UDA) is to make predictions for unlabeled data in a target domain with labeled data from source domain available. Recent advances exploit entropy minimization and self-training to align the feature of two domains. However, as decision boundary is largely biased towards source data, class-wise pseudo labels generated by target predictions are usually very noisy, and trusting those noisy supervisions might potentially deteriorate the intrinsic target discriminative feature. Motivated by agglomerative clustering which assumes that features in the near neighborhood should be clustered together, we observe that target features from source pre-trained model are highly intrinsic discriminative and have a high probability of sharing the same label with their neighbors. Based on those observations, we propose a simple but effective method to impose Neighbor Class Consistency on target features to preserve and further strengthen the intrinsic discriminative nature of target data while regularizing the unified classifier less biased towards source data. We also introduce an entropy-based weighting scheme to help our framework more robust to the potential noisy neighbor supervision. 
We conduct ablation studies and extensive experiments on three UDA image classification benchmarks. Our method outperforms all existing UDA state-of-the-art. ",/pdf/9f7b790dc1004f0c42d1c2507e9f564f7ebaa6f3.pdf,ICLR,2021, +rJg_NjCqtX,BJxfKXkKtX,1538090000000.0,1545360000000.0,15,CHEMICAL NAMES STANDARDIZATION USING NEURAL SEQUENCE TO SEQUENCE MODEL,"[""longmr.zhan@sjtu.edu.cn"", ""zhaohai@cs.sjtu.edu.cn""]","[""Junlang Zhan"", ""Hai Zhao""]","[""Chemical Names Standardization"", ""Byte Pair Encoding"", ""Sequence to Sequence Model""]","Chemical information extraction is to convert chemical knowledge in text into true chemical database, which is a text processing task heavily relying on chemical compound name identification and standardization. Once a systematic name for a chemical compound is given, it will naturally and much simply convert the name into the eventually required molecular formula. However, for many chemical substances, they have been shown in many other names besides their systematic names which poses a great challenge for this task. In this paper, we propose a framework to do the auto standardization from the non-systematic names to the corresponding systematic names by using the spelling error correction, byte pair encoding tokenization and neural sequence to sequence model. Our framework is trained end to end and is fully data-driven. Our standardization accuracy on the test dataset achieves 54.04% which has a great improvement compared to previous state-of-the-art result.",/pdf/b4b93fabd8b3532080ec439a11e3c6777d2fa9cf.pdf,ICLR,2019,We designed an end-to-end framework using sequence to sequence model to do the chemical names standardization. 
+mj7WsaHYxj,jQKKSh2__gZ,1601310000000.0,1614990000000.0,142,FLAG: Adversarial Data Augmentation for Graph Neural Networks,"[""~Kezhi_Kong1"", ""~Guohao_Li1"", ""~Mucong_Ding1"", ""~Zuxuan_Wu1"", ""~Chen_Zhu2"", ""~Bernard_Ghanem1"", ""~Gavin_Taylor1"", ""~Tom_Goldstein1""]","[""Kezhi Kong"", ""Guohao Li"", ""Mucong Ding"", ""Zuxuan Wu"", ""Chen Zhu"", ""Bernard Ghanem"", ""Gavin Taylor"", ""Tom Goldstein""]","[""Graph Neural Networks"", ""Data Augmentation"", ""Adversarial Training""]","Data augmentation helps neural networks generalize better, but it remains an open question how to effectively augment graph data to enhance the performance of GNNs (Graph Neural Networks). While most existing graph regularizers focus on augmenting graph topological structures by adding/removing edges, we offer a novel direction to augment in the input node feature space for better performance. We propose a simple but effective solution, FLAG (Free Large-scale Adversarial Augmentation on Graphs), which iteratively augments node features with gradient-based adversarial perturbations during training, and boosts performance at test time. Empirically, FLAG can be easily implemented with a dozen lines of code and is flexible enough to function with any GNN backbone, on a wide variety of large-scale datasets, and in both transductive and inductive settings. Without modifying a model's architecture or training setup, FLAG yields a consistent and salient performance boost across both node and graph classification tasks. Using FLAG, we reach state-of-the-art performance on the large-scale ogbg-molpcba, ogbg-ppa, and ogbg-code datasets. ",/pdf/f154c19f743894f4c4fdaf90a2439109a3374ada.pdf,ICLR,2021,We show that adversarial data augmentation generalizes Graph Neural Networks on large-scale datasets. 
+SJlsFpVtDB,rkgw676DPH,1569440000000.0,1583910000000.0,681,Continual Learning with Bayesian Neural Networks for Non-Stationary Data,"[""richard.kurle@tum.de"", ""botond.cseke@argmax.ai"", ""a.klushyn@tum.de"", ""smagt@argmax.ai"", ""guennemann@in.tum.de""]","[""Richard Kurle"", ""Botond Cseke"", ""Alexej Klushyn"", ""Patrick van der Smagt"", ""Stephan G\u00fcnnemann""]","[""Continual Learning"", ""Online Variational Bayes"", ""Non-Stationary Data"", ""Bayesian Neural Networks"", ""Variational Inference"", ""Lifelong Learning"", ""Concept Drift"", ""Episodic Memory""]","This work addresses continual learning for non-stationary data, using Bayesian neural networks and memory-based online variational Bayes. We represent the posterior approximation of the network weights by a diagonal Gaussian distribution and a complementary memory of raw data. This raw data corresponds to likelihood terms that cannot be well approximated by the Gaussian. We introduce a novel method for sequentially updating both components of the posterior approximation. Furthermore, we propose Bayesian forgetting and a Gaussian diffusion process for adapting to non-stationary data. The experimental results show that our update method improves on existing approaches for streaming data. Additionally, the adaptation methods lead to better predictive performance for non-stationary data. ",/pdf/0fdd74c6a01317f2e8eaece257838f798a2776e0.pdf,ICLR,2020,"This work addresses continual learning for non-stationary data, using Bayesian neural networks and memory-based online variational Bayes." 
+d8Q1mt2Ghw,Y70VEefGNvg,1601310000000.0,1613060000000.0,2656,Emergent Road Rules In Multi-Agent Driving Environments,"[""~Avik_Pal1"", ""~Jonah_Philion1"", ""~Yuan-Hong_Liao2"", ""~Sanja_Fidler1""]","[""Avik Pal"", ""Jonah Philion"", ""Yuan-Hong Liao"", ""Sanja Fidler""]",[],"For autonomous vehicles to safely share the road with human drivers, autonomous vehicles must abide by specific ""road rules"" that human drivers have agreed to follow. ""Road rules"" include rules that drivers are required to follow by law – such as the requirement that vehicles stop at red lights – as well as more subtle social rules – such as the implicit designation of fast lanes on the highway. In this paper, we provide empirical evidence that suggests that – instead of hard-coding road rules into self-driving algorithms – a scalable alternative may be to design multi-agent environments in which road rules emerge as optimal solutions to the problem of maximizing traffic flow. We analyze what ingredients in driving environments cause the emergence of these road rules and find that two crucial factors are noisy perception and agents’ spatial density. We provide qualitative and quantitative evidence of the emergence of seven social driving behaviors, ranging from obeying traffic signals to following lanes, all of which emerge from training agents to drive quickly to destinations without colliding. 
Our results add empirical support for the social road rules that countries worldwide have agreed on for safe, efficient driving.",/pdf/858edcf2544391055e14f4c41482bcc25bb9ae3f.pdf,ICLR,2021,"In multi-agent driving environments with noisy perception, driving conventions emerge" +4TSiOTkKe5P,RGG-wT2dMlt,1601310000000.0,1616000000000.0,3016,Latent Convergent Cross Mapping,"[""~Edward_De_Brouwer1"", ""~Adam_Arany1"", ""~Jaak_Simm1"", ""~Yves_Moreau2""]","[""Edward De Brouwer"", ""Adam Arany"", ""Jaak Simm"", ""Yves Moreau""]","[""Causality"", ""Time Series"", ""Chaos"", ""Neural ODE"", ""Missing Values""]","Discovering causal structures of temporal processes is a major tool of scientific inquiry because it helps us better understand and explain the mechanisms driving a phenomenon of interest, thereby facilitating analysis, reasoning, and synthesis for such systems. +However, accurately inferring causal structures within a phenomenon based on observational data only is still an open problem. Indeed, this type of data usually consists in short time series with missing or noisy values for which causal inference is increasingly difficult. In this work, we propose a method to uncover causal relations in chaotic dynamical systems from short, noisy and sporadic time series (that is, incomplete observations at infrequent and irregular intervals) where the classical convergent cross mapping (CCM) fails. Our method works by learning a Neural ODE latent process modeling the state-space dynamics of the time series and by checking the existence of a continuous map between the resulting processes. We provide theoretical analysis and show empirically that Latent-CCM can reliably uncover the true causal pattern, unlike traditional methods.",/pdf/973e4e487f91472cfee202c1353ca7932b83a942.pdf,ICLR,2021,Latent CCM uses reconstruction between latent processes of dynamical systems to infer causality between short and sporadic time series. 
+EGdFhBzmAwB,J8wb3ie1xcu,1601310000000.0,1618240000000.0,3178,Generalization bounds via distillation,"[""~Daniel_Hsu1"", ""~Ziwei_Ji1"", ""~Matus_Telgarsky1"", ""~Lan_Wang4""]","[""Daniel Hsu"", ""Ziwei Ji"", ""Matus Telgarsky"", ""Lan Wang""]","[""Generalization"", ""statistical learning theory"", ""theory"", ""distillation""]","This paper theoretically investigates the following empirical phenomenon: given a high-complexity network with poor generalization bounds, one can distill it into a network with nearly identical predictions but low complexity and vastly smaller generalization bounds. The main contribution is an analysis showing that the original network inherits this good generalization bound from its distillation, assuming the use of well-behaved data augmentation. This bound is presented both in an abstract and in a concrete form, the latter complemented by a reduction technique to handle modern computation graphs featuring convolutional layers, fully-connected layers, and skip connections, to name a few. To round out the story, a (looser) classical uniform convergence analysis of compression is also presented, as well as a variety of experiments on cifar and mnist demonstrating similar generalization performance between the original network and its distillation. +",/pdf/dd9e76980b62282e3d975f3ba1c97b2e689d7f50.pdf,ICLR,2021,This paper provides a suite of mathematical tools to bound the generalization error of networks which possess low-complexity distillations. +ByxLl309Ym,HkgomVn5tQ,1538090000000.0,1545360000000.0,1077,Conditional Inference in Pre-trained Variational Autoencoders via Cross-coding,"[""wuga@mie.utoronto.ca"", ""domke@cs.umass.edu"", ""ssanner@mie.utoronto.ca""]","[""Ga Wu"", ""Justin Domke"", ""Scott Sanner""]",[],"Variational Autoencoders (VAEs) are a popular generative model, but one in which conditional inference can be challenging. 
If the decomposition into query and evidence variables is fixed, conditional VAEs provide an attractive solution. To support arbitrary queries, one is generally reduced to Markov Chain Monte Carlo sampling methods that can suffer from long mixing times. In this paper, we propose an idea we term cross-coding to approximate the distribution over the latent variables after conditioning on an evidence assignment to some subset of the variables. This allows generating query samples without retraining the full VAE. We experimentally evaluate three variations of cross-coding showing that (i) can be quickly optimized for different decompositions of evidence and query and (ii) they quantitatively and qualitatively outperform Hamiltonian Monte Carlo.",/pdf/2acbda5f3f5a9ca4de0b727c8e5892acaea113cd.pdf,ICLR,2019, +_X_4Akcd8Re,8qOHByTgaPM,1601310000000.0,1617250000000.0,2388,Learning Long-term Visual Dynamics with Region Proposal Interaction Networks,"[""~Haozhi_Qi1"", ""~Xiaolong_Wang3"", ""~Deepak_Pathak1"", ""~Yi_Ma4"", ""~Jitendra_Malik2""]","[""Haozhi Qi"", ""Xiaolong Wang"", ""Deepak Pathak"", ""Yi Ma"", ""Jitendra Malik""]","[""dynamics prediction"", ""interaction networks"", ""physical reasoning""]","Learning long-term dynamics models is the key to understanding physical common sense. Most existing approaches on learning dynamics from visual input sidestep long-term predictions by resorting to rapid re-planning with short-term models. This not only requires such models to be super accurate but also limits them only to tasks where an agent can continuously obtain feedback and take action at each step until completion. In this paper, we aim to leverage the ideas from success stories in visual recognition tasks to build object representations that can capture inter-object and object-environment interactions over a long range. 
To this end, we propose Region Proposal Interaction Networks (RPIN), which reason about each object's trajectory in a latent region-proposal feature space. Thanks to the simple yet effective object representation, our approach outperforms prior methods by a significant margin both in terms of prediction quality and their ability to plan for downstream tasks, and also generalize well to novel environments. Code, pre-trained models, and more visualization results are available at https://haozhi.io/RPIN.",/pdf/5da931176d4a8bcb421d7a6a087fce6475f7c406.pdf,ICLR,2021,"We propose Region Proposal Interaction Networks for physical interaction prediction, which is applied across both simulation and real world environments for long-range prediction and planning." +4_57x7xhymn,MVcv1R38bm0,1601310000000.0,1614990000000.0,1083,Action Concept Grounding Network for Semantically-Consistent Video Generation,"[""~Wei_Yu10"", ""~Wenxin_Chen1"", ""~Animesh_Garg1""]","[""Wei Yu"", ""Wenxin Chen"", ""Animesh Garg""]","[""action-conditional video prediction"", ""self-supervised learning"", ""counterfactual generation""]","Recent works in self-supervised video prediction have mainly focused on passive forecasting and low-level action-conditional prediction, which sidesteps the problem of semantic learning. We introduce the task of semantic action-conditional video prediction, which can be regarded as an inverse problem of action recognition. The challenge of this new task primarily lies in how to effectively inform the model of semantic action information. To bridge vision and language, we utilize the idea of capsule and propose a novel video prediction model Action Concept Grounding Network (AGCN). Our method is evaluated on two newly designed synthetic datasets, CLEVR-Building-Blocks and Sapien-Kitchen, and experiments show that given different action labels, our ACGN can correctly condition on instructions and generate corresponding future frames without need of bounding boxes. 
We further demonstrate our trained model can make out-of-distribution predictions for concurrent actions, be quickly adapted to new object categories and exploit its learnt features for object detection. Additional visualizations can be found at https://iclr-acgn.github.io/ACGN/.",/pdf/79f2735a5c4383dbbd8aaacc20f4f217bf251966.pdf,ICLR,2021, +B1EjKsRqtQ,SJgmSSocK7,1538090000000.0,1545360000000.0,478,Hierarchical Attention: What Really Counts in Various NLP Tasks,"[""zehaodou@pku.edu.cn"", ""zhzhang@math.pku.edu.cn""]","[""Zehao Dou"", ""Zhihua Zhang""]","[""attention"", ""hierarchical"", ""machine reading comprehension"", ""poem generation""]","Attention mechanisms in sequence to sequence models have shown great ability and wonderful performance in various natural language processing (NLP) tasks, such as sentence embedding, text generation, machine translation, machine reading comprehension, etc. Unfortunately, existing attention mechanisms only learn either high-level or low-level features. In this paper, we think that the lack of hierarchical mechanisms is a bottleneck in improving the performance of the attention mechanisms, and propose a novel Hierarchical Attention Mechanism (Ham) based on the weighted sum of different layers of a multi-level attention. +Ham achieves a state-of-the-art BLEU score of 0.26 on Chinese poem generation task and a nearly 6.5% averaged improvement compared with the existing machine reading comprehension models such as BIDAF and Match-LSTM. Furthermore, our experiments and theorems reveal that Ham has greater generalization and representation ability than existing attention mechanisms. ",/pdf/2de1a2d23536899248aea15fe0c19a5a9ea7cd65.pdf,ICLR,2019,The paper proposed a novel hierarchical model to replace the original attention model in various NLP tasks. 
+GtCq61UFDId,7Ch3kNlE3bD,1601310000000.0,1614990000000.0,1637,SoCal: Selective Oracle Questioning for Consistency-based Active Learning of Cardiac Signals,"[""~Dani_Kiyasseh1"", ""tingting.zhu@eng.ox.ac.uk"", ""~David_A._Clifton1""]","[""Dani Kiyasseh"", ""Tingting Zhu"", ""David A. Clifton""]","[""Active learning"", ""consistency-training"", ""cardiac signals"", ""healthcare""]","The ubiquity and rate of collection of cardiac signals produce large, unlabelled datasets. Active learning (AL) can exploit such datasets by incorporating human annotators (oracles) to improve generalization performance. However, the over-reliance of existing algorithms on oracles continues to burden physicians. To minimize this burden, we propose SoCal, a consistency-based AL framework that dynamically determines whether to request a label from an oracle or to generate a pseudo-label instead. We show that our framework decreases the labelling burden while maintaining strong performance, even in the presence of a noisy oracle.",/pdf/97103ddbd75c300a165dc2476470266458d929e4.pdf,ICLR,2021, +S1vuO-bCW,HJIudZ-A-,1509130000000.0,1519430000000.0,698,Leave no Trace: Learning to Reset for Safe and Autonomous Reinforcement Learning,"[""eysenbach@google.com"", ""sg717@cam.ac.uk"", ""julianibarz@google.com"", ""slevine@google.com""]","[""Benjamin Eysenbach"", ""Shixiang Gu"", ""Julian Ibarz"", ""Sergey Levine""]","[""manual reset"", ""continual learning"", ""reinforcement learning"", ""safety""]","Deep reinforcement learning algorithms can learn complex behavioral skills, but real-world application of these methods requires a considerable amount of experience to be collected by the agent. In practical settings, such as robotics, this involves repeatedly attempting a task, resetting the environment between each attempt. However, not all tasks are easily or automatically reversible. In practice, this learning process requires considerable human intervention. 
In this work, we propose an autonomous method for safe and efficient reinforcement learning that simultaneously learns a forward and backward policy, with the backward policy resetting the environment for a subsequent attempt. By learning a value function for the backward policy, we can automatically determine when the forward policy is about to enter a non-reversible state, providing for uncertainty-aware safety aborts. Our experiments illustrate that proper use of the backward policy can greatly reduce the number of manual resets required to learn a task and can reduce the number of unsafe actions that lead to non-reversible states.",/pdf/2cbc181de7f174a5de23a150263e0ea74d256c52.pdf,ICLR,2018,"We propose an autonomous method for safe and efficient reinforcement learning that simultaneously learns a forward and backward policy, with the backward policy resetting the environment for a subsequent attempt." +jnRqf0CzBK,LC_FZDZ_hL6,1601310000000.0,1614990000000.0,380,Hierarchical Probabilistic Model for Blind Source Separation via Legendre Transformation,"[""~Simon_Luo1"", ""lamiae.azizi@sydney.edu.au"", ""~Mahito_Sugiyama1""]","[""Simon Luo"", ""Lamiae Azizi"", ""Mahito Sugiyama""]","[""blind source separation"", ""log-linear model"", ""energy-based model"", ""information geometry""]","We present a novel blind source separation (BSS) method, called information geometric blind source separation (IGBSS). Our formulation is based on the log-linear model equipped with a hierarchically structured sample space, which has theoretical guarantees to uniquely recover a set of source signals by minimizing the KL divergence from a set of mixed signals. Source signals, received signals, and mixing matrices are realized as different layers in our hierarchical sample space. 
Our empirical results have demonstrated on images and time series data that our approach is superior to well established techniques and is able to separate signals with complex interactions.",/pdf/e41860fdcd1484415bd2a007f80236b65f829059.pdf,ICLR,2021,A novel formulation of blind source separation using a hierarchical energy-based model. +SJ4Z72Rctm,Hklsyep9t7,1538090000000.0,1545360000000.0,1333,Composing Entropic Policies using Divergence Correction,"[""jjhunt@google.com"", ""andrebarreto@google.com"", ""countzero@google.com"", ""heess@google.com""]","[""Jonathan J Hunt"", ""Andre Barreto"", ""Timothy P Lillicrap"", ""Nicolas Heess""]","[""maximum entropy RL"", ""policy composition"", ""deep rl""]","Deep reinforcement learning (RL) algorithms have made great strides in recent years. An important remaining challenge is the ability to quickly transfer existing skills to novel tasks, and to combine existing skills with newly acquired ones. In domains where tasks are solved by composing skills this capacity holds the promise of dramatically reducing the data requirements of deep RL algorithms, and hence increasing their applicability. Recent work has studied ways of composing behaviors represented in the form of action-value functions. We analyze these methods to highlight their strengths and weaknesses, and point out situations where each of them is susceptible to poor performance. To perform this analysis we extend generalized policy improvement to the max-entropy framework and introduce a method for the practical implementation of successor features in continuous action spaces. Then we propose a novel approach which, in principle, recovers the optimal policy during transfer. This method works by explicitly learning the (discounted, future) divergence between policies. We study this approach in the tabular case and propose a scalable variant that is applicable in multi-dimensional continuous action spaces. 
+We compare our approach with existing ones on a range of non-trivial continuous control problems with compositional structure, and demonstrate qualitatively better performance despite not requiring simultaneous observation of all task rewards.",/pdf/ff0490b1a60d7f22b4ae76e88f68c465f3ab25c2.pdf,ICLR,2019,"Two new methods for combining entropic policies: maximum entropy generalized policy improvement, and divergence correction." +SkguE30ct7,Hyll2gAcFQ,1538090000000.0,1545360000000.0,1463,Neural Model-Based Reinforcement Learning for Recommendation,"[""xinshi.chen@gatech.edu"", ""sli370@gatech.edu"", ""ken.lh@alibaba-inc.com"", ""shaohua.jsh@alipay.com"", ""lsong@cc.gatech.edu""]","[""Xinshi Chen"", ""Shuang Li"", ""Hui Li"", ""Shaohua Jiang"", ""Le Song""]","[""Generative adversarial user model"", ""Recommendation system"", ""combinatorial recommendation policy"", ""model-based reinforcement learning"", ""deep Q-networks""]","There are great interests as well as many challenges in applying reinforcement learning (RL) to recommendation systems. In this setting, an online user is the environment; neither the reward function nor the environment dynamics are clearly defined, making the application of RL challenging. +In this paper, we propose a novel model-based reinforcement learning framework for recommendation systems, where we develop a generative adversarial network to imitate user behavior dynamics and learn her reward function. Using this user model as the simulation environment, we develop a novel DQN algorithm to obtain a combinatorial recommendation policy which can handle a large number of candidate items efficiently. 
In our experiments with real data, we show this generative adversarial user model can better explain user behavior than alternatives, and the RL policy based on this model can lead to a better long-term reward for the user and higher click rate for the system.",/pdf/4bfc4f492b5760e3efdc9aa53c4ee7d98b0d6507.pdf,ICLR,2019,A new insight of designing a RL recommendation policy based on a generative adversarial user model. +rJedV3R5tm,B1xaOP8FtQ,1538090000000.0,1558020000000.0,1462,RelGAN: Relational Generative Adversarial Networks for Text Generation,"[""wn8@rice.edu"", ""nnarodytska@vmware.com"", ""abp4@rice.edu""]","[""Weili Nie"", ""Nina Narodytska"", ""Ankit Patel""]","[""RelGAN"", ""text generation"", ""relational memory"", ""Gumbel-Softmax relaxation"", ""multiple embedded representations""]","Generative adversarial networks (GANs) have achieved great success at generating realistic images. However, the text generation still remains a challenging task for modern GAN architectures. In this work, we propose RelGAN, a new GAN architecture for text generation, consisting of three main components: a relational memory based generator for the long-distance dependency modeling, the Gumbel-Softmax relaxation for training GANs on discrete data, and multiple embedded representations in the discriminator to provide a more informative signal for the generator updates. Our experiments show that RelGAN outperforms current state-of-the-art models in terms of sample quality and diversity, and we also reveal via ablation studies that each component of RelGAN contributes critically to its performance improvements. Moreover, a key advantage of our method, that distinguishes it from other GANs, is the ability to control the trade-off between sample quality and diversity via the use of a single adjustable parameter. 
Finally, RelGAN is the first architecture that makes GANs with Gumbel-Softmax relaxation succeed in generating realistic text.",/pdf/28ff6712d62fef4d4846fca5be5df06a8ffd41d2.pdf,ICLR,2019, +rJl3yM-Ab,BkCoJfZR-,1509130000000.0,1524760000000.0,758,Evidence Aggregation for Answer Re-Ranking in Open-Domain Question Answering,"[""shwang.2014@phdis.smu.edu.sg"", ""yum@us.ibm.com"", ""jingjiang@smu.edu.sg"", ""zhangwei@us.ibm.com"", ""xiaoxiao.guo@ibm.com"", ""shiyu.chang@ibm.com"", ""zhigwang@us.ibm.com"", ""tklinger@us.ibm.com"", ""gtesauro@us.ibm.com"", ""mcam@us.ibm.com""]","[""Shuohang Wang"", ""Mo Yu"", ""Jing Jiang"", ""Wei Zhang"", ""Xiaoxiao Guo"", ""Shiyu Chang"", ""Zhiguo Wang"", ""Tim Klinger"", ""Gerald Tesauro"", ""Murray Campbell""]","[""Question Answering"", ""Deep Learning""]","Very recently, it comes to be a popular approach for answering open-domain questions by first searching question-related passages, then applying reading comprehension models to extract answers. Existing works usually extract answers from single passages independently, thus not fully make use of the multiple searched passages, especially for the some questions requiring several evidences, which can appear in different passages, to be answered. The above observations raise the problem of evidence aggregation from multiple passages. In this paper, we deal with this problem as answer re-ranking. Specifically, based on the answer candidates generated from the existing state-of-the-art QA model, we propose two different re-ranking methods, strength-based and coverage-based re-rankers, which make use of the aggregated evidences from different passages to help entail the ground-truth answer for the question. Our model achieved state-of-the-arts on three public open-domain QA datasets, Quasar-T, SearchQA and the open-domain version of TriviaQA, with about 8\% improvement on the former two datasets. 
",/pdf/b0c1baf9e43fe143f6e401553ce6e1774c0fe58f.pdf,ICLR,2018,We propose a method that can make use of the multiple passages information for open-domain QA. +r1G4z8cge,,1478280000000.0,1490380000000.0,292,Mollifying Networks,"[""gulcehrc@iro.umontreal.ca"", ""marcin-m@post.pl"", ""fvisin@gmail.com"", ""yoshua.umontreal@gmail.com""]","[""Caglar Gulcehre"", ""Marcin Moczulski"", ""Francesco Visin"", ""Yoshua Bengio""]","[""Deep learning"", ""Optimization""]","The optimization of deep neural networks can be more challenging than the traditional convex optimization problems due to highly non-convex nature of the loss function, e.g. it can involve pathological landscapes such as saddle-surfaces that can be difficult to escape from for algorithms based on simple gradient descent. In this paper, we attack the problem of optimization of highly non-convex neural networks objectives by starting with a smoothed -- or mollified -- objective function which becomes more complex as the training proceeds. Our proposition is inspired by the recent studies in continuation methods: similarly to curriculum methods, we begin by learning an easier (possibly convex) objective function and let it evolve during training until it eventually becomes the original, difficult to optimize objective function. The complexity of the mollified networks is controlled by a single hyperparameter that is annealed during training. We show improvements on various difficult optimization tasks and establish a relationship between recent works on continuation methods for neural networks and mollifiers. +",/pdf/dd73f7709a01b83346fed4edee7b39b192b53406.pdf,ICLR,2017,"We are proposing a new continuation method for neural networks, that starts from optimizing a convex objective function and gradually during the training the function evolves into more non-convex function." 
+rkx1m2C5YQ,rklhjGAqKm,1538090000000.0,1545360000000.0,1319,Recurrent Kalman Networks: Factorized Inference in High-Dimensional Deep Feature Spaces,"[""philippbecker93@googlemail.com"", ""hpandya@lincoln.ac.uk"", ""gebhardt@ias.tu-darmstadt.de"", ""irobotcheng@gmail.com"", ""gneumann@lincoln.ac.uk""]","[""Philipp Becker"", ""Harit Pandya"", ""Gregor H.W. Gebhardt"", ""Cheng Zhao"", ""Gerhard Neumann""]","[""state estimation"", ""recurrent neural networks"", ""Kalman Filter"", ""deep learning""]","In order to integrate uncertainty estimates into deep time-series modelling, Kalman Filters (KFs) (Kalman et al., 1960) have been integrated with deep learning models. Yet, such approaches typically rely on approximate inference techniques such as variational inference which makes learning more complex and often less scalable due to approximation errors. We propose a new deep approach to Kalman filtering which can be learned directly in an end-to-end manner using backpropagation without additional approximations. Our approach uses a high-dimensional factorized latent state representation for which the Kalman updates simplify to scalar operations and thus avoids hard to backpropagate, computationally heavy and potentially unstable matrix inversions. Moreover, we use locally linear dynamic models to efficiently propagate the latent state to the next time +step. While our locally linear modelling and factorization assumptions are in general not true for the original low-dimensional state space of the system, the network finds a high-dimensional latent space where these assumptions hold to perform efficient inference. This state representation is learned jointly with the transition and noise models. The resulting network architecture, which we call Recurrent Kalman Network (RKN), can be used for any time-series data, similar to a LSTM (Hochreiter and Schmidhuber, 1997) but uses an explicit representation of uncertainty. 
As shown by our experiments, the RKN obtains much more accurate uncertainty estimates than an LSTM or Gated Recurrent Units (GRUs) (Cho et al., 2014) while also showing a slightly improved prediction performance and outperforms various recent generative models on an image imputation task.",/pdf/a4e11950337d63d757521b324d8c10575c7d19dd.pdf,ICLR,2019,"Kalman Filter based recurrent model for efficient state estimation, principled uncertainty handling and end to end learning of dynamic models in high dimensional spaces." +Bkxonh4Ywr,HkxEnl5zDr,1569440000000.0,1577170000000.0,201,Localizing and Amortizing: Efficient Inference for Gaussian Processes,"[""linfeng.liu@tufts.edu"", ""liping.liu@tufts.edu""]","[""Linfeng Liu"", ""Liping Liu""]","[""Gaussian Processes"", ""Variational Inference"", ""Amortized Inference"", ""Nearest Neighbors""]","The inference of Gaussian Processes concerns the distribution of the underlying function given observed data points. GP inference based on local ranges of data points is able to capture fine-scale correlations and allow fine-grained decomposition of the computation. Following this direction, we propose a new inference model that considers the correlations and observations of the K nearest neighbors for the inference at a data point. Compared with previous works, we also eliminate the data ordering prerequisite to simplify the inference process. Additionally, the inference task is decomposed to small subtasks with several technique innovations, making our model well suits the stochastic optimization. Since the decomposed small subtasks have the same structure, we further speed up the inference procedure with amortized inference. Our model runs efficiently and achieves good performances on several benchmark tasks.",/pdf/18e4c9fb7f7d12e5f27156372db25a01a171cb92.pdf,ICLR,2020,A scalable variational inference for GP leveraging nearest neighbors and amortization. 
+HJMCdsC5tX,SkxupM_9FX,1538090000000.0,1545360000000.0,401,A fully automated periodicity detection in time series,"[""tom.puech@craft.ai"", ""matthieu.boussard@craft.ai""]","[""Tom Puech"", ""Matthieu Boussard""]","[""Time series"", ""feature engineering"", ""period detection"", ""machine learning""]","This paper presents a method to autonomously find periodicities in a signal. It is based on the same idea of using Fourier Transform and autocorrelation function presented in Vlachos et al. 2005. While showing interesting results this method does not perform well on noisy signals or signals with multiple periodicities. Thus, our method adds several new extra steps (hints clustering, filtering and detrending) to fix these issues. Experimental results show that the proposed method outperforms the state of the art algorithms. ",/pdf/002cedfe0eb81c22661f1b9535da822b7ab19e0a.pdf,ICLR,2019,"This paper presents a method to autonomously find multiple periodicities in a signal, using FFT and ACF and add three news steps (clustering/filtering/detrending)" +IMPnRXEWpvr,tLOgJ44z1lR,1601310000000.0,1615970000000.0,331,Towards Impartial Multi-task Learning,"[""~Liyang_Liu1"", ""~Yi_Li15"", ""~Zhanghui_Kuang4"", ""~Jing-Hao_Xue1"", ""~Yimin_Chen1"", ""~Wenming_Yang1"", ""~Qingmin_Liao1"", ""~Wayne_Zhang2""]","[""Liyang Liu"", ""Yi Li"", ""Zhanghui Kuang"", ""Jing-Hao Xue"", ""Yimin Chen"", ""Wenming Yang"", ""Qingmin Liao"", ""Wayne Zhang""]","[""Multi-task Learning"", ""Impartial Learning"", ""Scene Understanding""]","Multi-task learning (MTL) has been widely used in representation learning. However, naively training all tasks simultaneously may lead to the partial training issue, where specific tasks are trained more adequately than others. In this paper, we propose to learn multiple tasks impartially. 
Specifically, for the task-shared parameters, we optimize the scaling factors via a closed-form solution, such that the aggregated gradient (sum of raw gradients weighted by the scaling factors) has equal projections onto individual tasks. For the task-specific parameters, we dynamically weigh the task losses so that all of them are kept at a comparable scale. Further, we find the above gradient balance and loss balance are complementary and thus propose a hybrid balance method to further improve the performance. Our impartial multi-task learning (IMTL) can be end-to-end trained without any heuristic hyper-parameter tuning, and is general to be applied on all kinds of losses without any distribution assumption. Moreover, our IMTL can converge to similar results even when the task losses are designed to have different scales, and thus it is scale-invariant. We extensively evaluate our IMTL on the standard MTL benchmarks including Cityscapes, NYUv2 and CelebA. It outperforms existing loss weighting methods under the same experimental settings.",/pdf/1641df474b8e0e2f7dd6c0dda99081b06fed400c.pdf,ICLR,2021,We propose an impartial multi-task learning method that treats all tasks equally without bias towards any task. 
+hu2aMLzOxC,xPtQOphoaYL,1601310000000.0,1614990000000.0,905,Asymmetric self-play for automatic goal discovery in robotic manipulation,"[""robotics@openai.com"", ""~Matthias_Plappert1"", ""raul@openai.com"", ""tao@openai.com"", ""ilge@openai.com"", ""~Vineet_Kosaraju1"", ""pw@openai.com"", ""ruben@openai.com"", ""arthur@openai.com"", ""hponde@openai.com"", ""atpaino@openai.com"", ""~Hyeonwoo_Noh1"", ""~Lilian_Weng1"", ""qiming@openai.com"", ""~Casey_Chu1"", ""~Wojciech_Zaremba1""]","[""OpenAI OpenAI"", ""Matthias Plappert"", ""Raul Sampedro"", ""Tao Xu"", ""Ilge Akkaya"", ""Vineet Kosaraju"", ""Peter Welinder"", ""Ruben D'Sa"", ""Arthur Petron"", ""Henrique Ponde de Oliveira Pinto"", ""Alex Paino"", ""Hyeonwoo Noh"", ""Lilian Weng"", ""Qiming Yuan"", ""Casey Chu"", ""Wojciech Zaremba""]","[""self-play"", ""asymmetric self-play"", ""automatic curriculum"", ""automatic goal generation"", ""robotic learning"", ""robotic manipulation"", ""reinforcement learning""]","We train a single, goal-conditioned policy that can solve many robotic manipulation tasks, including tasks with previously unseen goals and objects. To do so, we rely on asymmetric self-play for goal discovery, where two agents, Alice and Bob, play a game. Alice is asked to propose challenging goals and Bob aims to solve them. We show that this method is able to discover highly diverse and complex goals without any human priors. We further show that Bob can be trained with only sparse rewards, because the interaction between Alice and Bob results in a natural curriculum and Bob can learn from Alice's trajectory when relabeled as a goal-conditioned demonstration. Finally, we show that our method scales, resulting in a single policy that can transfer to many unseen hold-out tasks such as setting a table, stacking blocks, and solving simple puzzles. 
Videos of a learned policy is available at https://robotics-self-play.github.io.",/pdf/ecffb81d4771f43bd0747615bc0b5cd2738f7440.pdf,ICLR,2021,"We use asymmetric self-play to train a goal-conditioned policy for complex object manipulation tasks, and the learned policy can zero-shot generalize to many manually designed holdout tasks." +B1evfa4tPB,Hkeu3go8wr,1569440000000.0,1583910000000.0,415,Neural Network Branching for Neural Network Verification ,"[""jingyue.lu@spc.ox.ac.uk"", ""pawan@robots.ox.ac.uk""]","[""Jingyue Lu"", ""M. Pawan Kumar""]","[""Neural Network Verification"", ""Branch and Bound"", ""Graph Neural Network"", ""Learning to branch""]","Formal verification of neural networks is essential for their deployment in safety-critical areas. Many available formal verification methods have been shown to be instances of a unified Branch and Bound (BaB) formulation. We propose a novel framework for designing an effective branching strategy for BaB. Specifically, we learn a graph neural network (GNN) to imitate the strong branching heuristic behaviour. Our framework differs from previous methods for learning to branch in two main aspects. Firstly, our framework directly treats the neural network we want to verify as a graph input for the GNN. Secondly, we develop an intuitive forward and backward embedding update schedule. Empirically, our framework achieves roughly $50\%$ reduction in both the number of branches and the time required for verification on various convolutional networks when compared to the best available hand-designed branching strategy. In addition, we show that our GNN model enjoys both horizontal and vertical transferability. Horizontally, the model trained on easy properties performs well on properties of increased difficulty levels. 
Vertically, the model trained on small neural networks achieves similar performance on large neural networks.",/pdf/96a157501dba2f792a2f89e40bb400c11511a741.pdf,ICLR,2020,We propose a novel learning to branch framework using graph neural networks to improve branch and bound based neural network verification methods. +BJxiqxSYPB,rJfzMkWYPS,1569440000000.0,1577170000000.0,2483,Learning to Prove Theorems by Learning to Generate Theorems,"[""mingzhew@cs.princeton.edu"", ""jiadeng@princeton.edu""]","[""Mingzhe Wang"", ""Jia Deng""]",[],"We consider the task of automated theorem proving, a key AI task. Deep learning has shown promise for training theorem provers, but there are limited human-written theorems and proofs available for supervised learning. To address this limitation, we propose to learn a neural generator that automatically synthesizes theorems and proofs for the purpose of training a theorem prover. Experiments on real-world tasks demonstrate that synthetic data from our approach significantly improves the theorem prover and advances the state of the art of automated theorem proving in Metamath.",/pdf/d8653a382c07e83deb00a30960950a5e35c667a5.pdf,ICLR,2020, +HJluEeHKwH,HJxZywgFvH,1569440000000.0,1577170000000.0,2254,The Differentiable Cross-Entropy Method,"[""brandon.amos.cs@gmail.com"", ""denisyarats@cs.nyu.edu""]","[""Brandon Amos"", ""Denis Yarats""]","[""machine learning"", ""differentiable optimization"", ""control"", ""reinforcement learning""]",We study the Cross-Entropy Method (CEM) for the non-convex optimization of a continuous and parameterized objective function and introduce a differentiable variant (DCEM) that enables us to differentiate the output of CEM with respect to the objective function's parameters. In the machine learning setting this brings CEM inside of the end-to-end learning pipeline in cases this has otherwise been impossible. 
We show applications in a synthetic energy-based structured prediction task and in non-convex continuous control. In the control setting we show on the simulated cheetah and walker tasks that we can embed their optimal action sequences with DCEM and then use policy optimization to fine-tune components of the controller as a step towards combining model-based and model-free RL.,/pdf/797146043f1ef167cdf8a892ccd77bed492208d2.pdf,ICLR,2020,DCEM learns latent domains for optimization problems and helps bridge the gap between model-based and model-free RL --- we create a differentiable controller and fine-tune parts of it with PPO +xCm8kiWRiBT,wPxr8KYUs-e,1601310000000.0,1614990000000.0,1181,Adversarial Attacks on Binary Image Recognition Systems,"[""~Eric_Balkanski2"", ""harrison@robustintelligence.com"", ""kojin@robustintelligence.com"", ""rilee@robustintelligence.com"", ""yaron@robustintelligence.com"", ""richard@robustintelligence.com""]","[""Eric Balkanski"", ""Harrison Chase"", ""Kojin Oshiba"", ""Alexander Rilee"", ""Yaron Singer"", ""Richard Wang""]","[""Adversarial attacks"", ""Binary images"", ""Image Recognition"", ""Check processing systems""]","We initiate the study of adversarial attacks on models for binary (i.e. black and white) image classification. Although there has been a great deal of work on attacking models for colored and grayscale images, little is known about attacks on models for binary images. Models trained to classify binary images are used in text recognition applications such as check processing, license plate recognition, invoice processing, and many others. In contrast to colored and grayscale images, the search space of attacks on binary images is extremely restricted and noise cannot be hidden with minor perturbations in each pixel. Thus, the optimization landscape of attacks on binary images introduces new fundamental challenges. 
+ +In this paper we introduce a new attack algorithm called Scar, designed to fool classifiers of binary images. We show that Scar significantly outperforms existing L0 attacks applied to the binary setting and use it to demonstrate the vulnerability of real-world text recognition systems. Scar’s strong performance in practice contrasts with hardness results that show the existence of worst-case classifiers for binary images that are robust to large perturbations. In many cases, altering a single pixel is sufficient to trick Tesseract, a popular open-source text recognition system, to misclassify a word as a different word in the English dictionary. We also demonstrate the vulnerability of check recognition by fooling commercial check processing systems used by major US banks for mobile deposits. These systems are substantially harder to fool since they classify both the handwritten amounts in digits and letters, independently. Nevertheless, we generalize Scar to design attacks that fool state-of-the-art check processing systems using unnoticeable perturbations that lead to misclassification of deposit amounts. Consequently, this is a powerful method to perform financial fraud.",/pdf/3f21ccc27b4d08f6b5124ce802699df8807aa788.pdf,ICLR,2021,We study adversarial attacks on models designed to classify binary (i.e. black and white) images. 
+1Q-CqRjUzf,sqUSa8QxOw,1601310000000.0,1614990000000.0,677,On the Reproducibility of Neural Network Predictions,"[""~Srinadh_Bhojanapalli1"", ""~Kimberly_Jenney_Wilber1"", ""~Andreas_Veit1"", ""~Ankit_Singh_Rawat1"", ""~Seungyeon_Kim1"", ""~Aditya_Krishna_Menon1"", ""~Sanjiv_Kumar1""]","[""Srinadh Bhojanapalli"", ""Kimberly Jenney Wilber"", ""Andreas Veit"", ""Ankit Singh Rawat"", ""Seungyeon Kim"", ""Aditya Krishna Menon"", ""Sanjiv Kumar""]","[""reproducibility"", ""churn"", ""confidence""]","Standard training techniques for neural networks involve multiple sources of randomness, e.g., initialization, mini-batch ordering and in some cases data augmentation. Given that neural networks are heavily over-parameterized in practice, such randomness can cause {\em churn} -- disagreements between predictions of the two models independently trained by the same algorithm, contributing to the `reproducibility challenges' in modern machine learning. In this paper, we study this problem of churn, identify factors that cause it, and propose two simple means of mitigating it. We first demonstrate that churn is indeed an issue, even for standard image classification tasks (CIFAR and ImageNet), and study the role of the different sources of training randomness that cause churn. By analyzing the relationship between churn and prediction confidences, we pursue an approach with two components for churn reduction. First, we propose using \emph{minimum entropy regularizers} to increase prediction confidences. Second, we present a novel variant of co-distillation approach~\citep{anil2018large} to increase model agreement and reduce churn. 
We present empirical results showing the effectiveness of both techniques in reducing churn while improving the accuracy of the underlying model.",/pdf/91ee8d486b3470f0fcf401b87a5d3af312a137db.pdf,ICLR,2021,We propose new methods to reduce model churn and improve reproducibility of predictions for classification +VyENEGiEYAQ,KNp5B6lmjN,1601310000000.0,1614990000000.0,2754,Cluster-Former: Clustering-based Sparse Transformer for Question Answering,"[""~Shuohang_Wang1"", ""~Luowei_Zhou1"", ""~Zhe_Gan1"", ""~Yen-Chun_Chen1"", ""yuwfan@microsoft.com"", ""~Siqi_Sun2"", ""~Yu_Cheng1"", ""~Jingjing_Liu2""]","[""Shuohang Wang"", ""Luowei Zhou"", ""Zhe Gan"", ""Yen-Chun Chen"", ""Yuwei Fang"", ""Siqi Sun"", ""Yu Cheng"", ""Jingjing Liu""]","[""Transformer"", ""Question Answering""]","Transformer has become ubiquitous in the deep learning field. One of the key ingredients that destined its success is the self-attention mechanism, which allows fully-connected contextual encoding over input tokens. +However, despite its effectiveness in modeling short sequences, self-attention suffers when handling inputs with extreme long-range dependencies, as its complexity grows quadratically with respect to the sequence length. +Therefore, long sequences are often encoded by Transformer in chunks using a sliding window. +In this paper, we propose Cluster-Former, a novel clustering-based sparse Transformer to perform attention across chunked sequences. The proposed framework is pivoted on two unique types of Transformer layer: Sliding-Window Layer and Cluster-Former Layer, which encode local sequence information and global context jointly and iteratively. +This new design allows information integration beyond local windows, which is especially beneficial for question answering (QA) tasks that rely on long-range dependencies. 
Experiments show that Cluster-Former achieves state-of-the-art performance on several major QA benchmarks.",/pdf/52aab0027c3af6178c79fa053d1eea8a8e6f3931.pdf,ICLR,2021,We propose Cluster-Former to encode long context in question answering tasks and achieve SOTA on several QA datasets. +S1vyujVye,,1476860000000.0,1484070000000.0,5,Deep unsupervised learning through spatial contrasting,"[""ehoffer@tx.technion.ac.il"", ""itayh@tx.technion.ac.il"", ""nailon@cs.technion.ac.il""]","[""Elad Hoffer"", ""Itay Hubara"", ""Nir Ailon""]","[""Unsupervised Learning"", ""Deep learning"", ""Computer vision""]","Convolutional networks have marked their place over the last few years as the +best performing model for various visual tasks. They are, however, most suited +for supervised learning from large amounts of labeled data. Previous attempts +have been made to use unlabeled data to improve model performance by applying +unsupervised techniques. These attempts require different architectures and training methods. +In this work we present a novel approach for unsupervised training +of Convolutional networks that is based on contrasting between spatial regions +within images. This criterion can be employed within conventional neural net- +works and trained using standard techniques such as SGD and back-propagation, +thus complementing supervised methods.",/pdf/e23ea1d97fdcd1c975c97ee27d237ba6424569d5.pdf,ICLR,2017, +pAJ3svHLDV,rELYkFjzkI,1601310000000.0,1614990000000.0,2404,R-MONet: Region-Based Unsupervised Scene Decomposition and Representation via Consistency of Object Representations,"[""~Shengxin_Qian1""]","[""Shengxin Qian""]","[""unsupervised representation learning"", ""unsupervised scene representation"", ""unsupervised scene decomposition"", ""generative models""]","Decomposing a complex scene into multiple objects is a natural instinct of an intelligent vision system. 
Recently, the interest in unsupervised scene representation learning emerged and many previous works tackle this by decomposing scenes into object representations either in the form of segmentation masks or position and scale latent variables (i.e. bounding boxes). We observe that these two types of representation both contain object geometric information and should be consistent with each other. Inspired by this observation, we provide an unsupervised generative framework called R-MONet that can generate objects geometric representation in the form of bounding boxes and segmentation masks simultaneously. While bounding boxes can represent the region of interest (ROI) for generating foreground segmentation masks, the foreground segmentation masks can also be used to supervise bounding boxes learning with the Multi-Otsu Thresholding method. Through the experiments on CLEVR and Multi-dSprites datasets, we show that ensuring the consistency of two types of representation can help the model to decompose the scene and learn better object geometric representations.",/pdf/590d7e573a3fec726508cffac4afe0e57912345b.pdf,ICLR,2021,We propose an unsupervised generative framework which can have better scene decomposition capability by ensuring the consistency between two kinds of object geometric representations (i.e. bounding boxes and foreground segmentation masks) +r1eCy0NtDH,Bkef8k7_vS,1569440000000.0,1577170000000.0,912,Mean-field Behaviour of Neural Tangent Kernel for Deep Neural Networks,"[""soufiane.hayou@stats.ox.ac.uk"", ""doucet@stats.ox.ac.uk"", ""judith.rousseau@stats.ox.ac.uk""]","[""Soufiane Hayou"", ""Arnaud Doucet"", ""Judith Rousseau""]",[],"Recent work by Jacot et al. (2018) has showed that training a neural network of any kind with gradient descent in parameter space is equivalent to kernel gradient descent in function space with respect to the Neural Tangent Kernel (NTK). Lee et al. 
(2019) built on this result to show that the output of a neural network trained using full batch gradient descent can be approximated by a linear model for wide networks. In parallel, a recent line of studies (Schoenholz et al. (2017), Hayou et al. (2019)) suggested that a special initialization known as the Edge of Chaos leads to good performance. In this paper, we bridge the gap between these two concepts and show the impact of the initialization and the activation function on the NTK as the network depth becomes large. We provide experiments illustrating our theoretical results.",/pdf/cf528fa872ea7b8faa5ccd655e6b2ee9966d087c.pdf,ICLR,2020,Impact of the Initialization and the Activation function on the Neural Tangent Kernel +jN5y-zb5Q7m,qOFcVMYTqNm,1601310000000.0,1613060000000.0,2052,Uncertainty Estimation in Autoregressive Structured Prediction,"[""~Andrey_Malinin1"", ""~Mark_Gales1""]","[""Andrey Malinin"", ""Mark Gales""]","[""ensembles"", ""structures prediction"", ""uncertainty estimation"", ""knowledge uncertainty"", ""autoregressive models"", ""information theory"", ""machine translation"", ""speech recognition.""]","Uncertainty estimation is important for ensuring safety and robustness of AI systems. While most research in the area has focused on un-structured prediction tasks, limited work has investigated general uncertainty estimation approaches for structured prediction. Thus, this work aims to investigate uncertainty estimation for structured prediction tasks within a single unified and interpretable probabilistic ensemble-based framework. We consider: uncertainty estimation for sequence data at the token-level and complete sequence-level; interpretations for, and applications of, various measures of uncertainty; and discuss both the theoretical and practical challenges associated with obtaining them. 
This work also provides baselines for token-level and sequence-level error detection, and sequence-level out-of-domain input detection on the WMT’14 English-French and WMT’17 English-German translation and LibriSpeech speech recognition datasets.",/pdf/e79c8b8744afc4b43fd9ff97d80f237b28c3125b.pdf,ICLR,2021,A Deep Investigation of Ensemble-based Uncertainty Estimation for Autoregressive ASR and NMT models. +BJxlmeBKwS,BJeU5Xetvr,1569440000000.0,1577170000000.0,2197,FRICATIVE PHONEME DETECTION WITH ZERO DELAY,"[""metehan.yurt@fau.de"", ""alberto.escalante@sivantos.com"", ""veniamin.morgenshtern@fau.de""]","[""Metehan Yurt"", ""Alberto N. Escalante B."", ""Veniamin I. Morgenshtern""]","[""fricative detection"", ""phoneme detection"", ""speech recognition"", ""deep learning"", ""hearing aids"", ""zero delay"", ""extrapolation"", ""TIMIT""]","People with high-frequency hearing loss rely on hearing aids that employ frequency lowering algorithms. These algorithms shift some of the sounds from the high frequency band to the lower frequency band where the sounds become more perceptible for the people with the condition. Fricative phonemes have an important part of their content concentrated in high frequency bands. It is important that the frequency lowering algorithm is activated exactly for the duration of a fricative phoneme, and kept off at all other times. Therefore, timely (with zero delay) and accurate fricative phoneme detection is a key problem for high quality hearing aids. In this paper we present a deep learning based fricative phoneme detection algorithm that has zero detection delay and achieves state-of-the-art fricative phoneme detection accuracy on the TIMIT Speech Corpus. All reported results are reproducible and come with easy to use code that could serve as a baseline for future research. 
+",/pdf/60574943b0539c6d459b72133cd280cb328c5d56.pdf,ICLR,2020,A deep learning based approach for zero delay fricative phoneme detection +HJePXkHtvS,HJec6L2_Dr,1569440000000.0,1577170000000.0,1620,Deep Generative Classifier for Out-of-distribution Sample Detection,"[""dongha0914@postech.ac.kr"", ""hunu12@postech.ac.kr"", ""hwanjoyu@postech.ac.kr""]","[""Dongha Lee"", ""Sehun Yu"", ""Hwanjo Yu""]","[""Out-of-distribution Detection"", ""Generative Classifier"", ""Deep Neural Networks"", ""Multi-class Classification"", ""Gaussian Discriminant Analysis""]","The capability of reliably detecting out-of-distribution samples is one of the key factors in deploying a good classifier, as the test distribution always does not match with the training distribution in most real-world applications. In this work, we propose a deep generative classifier which is effective to detect out-of-distribution samples as well as classify in-distribution samples, by integrating the concept of Gaussian discriminant analysis into deep neural networks. Unlike the discriminative (or softmax) classifier that only focuses on the decision boundary partitioning its latent space into multiple regions, our generative classifier aims to explicitly model class-conditional distributions as separable Gaussian distributions. Thereby, we can define the confidence score by the distance between a test sample and the center of each distribution. Our empirical evaluation on multi-class images and tabular data demonstrate that the generative classifier achieves the best performances in distinguishing out-of-distribution samples, and also it can be generalized well for various types of deep neural networks.",/pdf/11ac0898cd3ebc18d54a20f8e51424f430299881.pdf,ICLR,2020,"This paper proposes a deep generative classifier which is effective to detect out-of-distribution samples as well as classify in-distribution samples, by integrating the concept of Gaussian discriminant analysis into deep neural networks." 
+_QnwcbR-GG,IXt1REsvxJm,1601310000000.0,1614990000000.0,654,On the Effectiveness of Weight-Encoded Neural Implicit 3D Shapes ,"[""~Thomas_Ryan_Davies1"", ""~Derek_Nowrouzezahrai1"", ""~Alec_Jacobson1""]","[""Thomas Ryan Davies"", ""Derek Nowrouzezahrai"", ""Alec Jacobson""]",[],"A neural implicit outputs a number indicating whether the given query point in space is inside, outside, or on a surface. Many prior works have focused on _latent-encoded_ neural implicits, where a latent vector encoding of a specific shape is also fed as input. While affording latent-space interpolation, this comes at the cost of reconstruction accuracy for any _single_ shape. Training a specific network for each 3D shape, a _weight-encoded_ neural implicit may forgo the latent vector and focus reconstruction accuracy on the details of a single shape. While previously considered as an intermediary representation for 3D scanning tasks or as a toy-problem leading up to latent-encoding tasks, weight-encoded neural implicits have not yet been taken seriously as a 3D shape representation. In this paper, we establish that weight-encoded neural implicits meet the criteria of a first-class 3D shape representation. We introduce a suite of technical contributions to improve reconstruction accuracy, convergence, and robustness when learning the signed distance field induced by a polygonal mesh --- the _de facto_ standard representation. Viewed as a lossy compression, our conversion outperforms standard techniques from geometry processing. Compared to previous latent- and weight-encoded neural implicits we demonstrate superior robustness, scalability, and performance.",/pdf/16f83074f008f57d118717843e02333bd4ba3b13.pdf,ICLR,2021,Purposefully overfit neural networks are an efficient surface representation for solid 3D shapes. 
+ryzECoAcY7,rkeLkpaqKX,1538090000000.0,1567620000000.0,886,Learning Multi-Level Hierarchies with Hindsight,"[""andrew_levy2@brown.edu"", ""gdk@cs.brown.edu"", ""saenko@bu.edu"", ""rplatt@ccs.neu.edu""]","[""Andrew Levy"", ""George Konidaris"", ""Robert Platt"", ""Kate Saenko""]","[""Hierarchical Reinforcement Learning"", ""Reinforcement Learning"", ""Deep Reinforcement Learning""]","Hierarchical agents have the potential to solve sequential decision making tasks with greater sample efficiency than their non-hierarchical counterparts because hierarchical agents can break down tasks into sets of subtasks that only require short sequences of decisions. In order to realize this potential of faster learning, hierarchical agents need to be able to learn their multiple levels of policies in parallel so these simpler subproblems can be solved simultaneously. Yet, learning multiple levels of policies in parallel is hard because it is inherently unstable: changes in a policy at one level of the hierarchy may cause changes in the transition and reward functions at higher levels in the hierarchy, making it difficult to jointly learn multiple levels of policies. In this paper, we introduce a new Hierarchical Reinforcement Learning (HRL) framework, Hierarchical Actor-Critic (HAC), that can overcome the instability issues that arise when agents try to jointly learn multiple levels of policies. The main idea behind HAC is to train each level of the hierarchy independently of the lower levels by training each level as if the lower level policies are already optimal. We demonstrate experimentally in both grid world and simulated robotics domains that our approach can significantly accelerate learning relative to other non-hierarchical and hierarchical methods. 
Indeed, our framework is the first to successfully learn 3-level hierarchies in parallel in tasks with continuous state and action spaces.",/pdf/30e53961e2765131e7a72310088973ae42f8e70d.pdf,ICLR,2019,We introduce the first Hierarchical RL approach to successfully learn 3-level hierarchies in parallel in tasks with continuous state and action spaces. +0rNLjXgchOC,Kg1uoLheOvH,1601310000000.0,1623820000000.0,1404,Dissecting Hessian: Understanding Common Structure of Hessian in Neural Networks,"[""~Yikai_Wu1"", ""~Xingyu_Zhu1"", ""~Chenwei_Wu1"", ""~Annie_N._Wang1"", ""~Rong_Ge1""]","[""Yikai Wu"", ""Xingyu Zhu"", ""Chenwei Wu"", ""Annie N. Wang"", ""Rong Ge""]","[""Hessian"", ""neural network"", ""Kronecker factorization"", ""PAC-Bayes bound"", ""eigenspace"", ""eigenvalue""]","Hessian captures important properties of the deep neural network loss landscape. We observe that eigenvectors and eigenspaces of the layer-wise Hessian for neural network objective have several interesting structures -- top eigenspaces for different models have high overlap, and top eigenvectors form low rank matrices when they are reshaped into the same shape as the weight matrix of the corresponding layer. These structures, as well as the low rank structure of the Hessian observed in previous studies, can be explained by approximating the Hessian using Kronecker factorization. Our new understanding can also explain why some of these structures become weaker when the network is trained with batch normalization. Finally, we show that the Kronecker factorization can be combined with PAC-Bayes techniques to get better generalization bounds.",/pdf/8284aa813b0ffddc95eac423a60be3c703e59152.pdf,ICLR,2021,"We investigate several interesting structures of layer-wise Hessian by approximating the Hessian using Kronecker factorization, and provide a nonvacuous PAC-Bayes generalization bound using the approximated Hessian eigenbasis." 
+HJgRCyHFDr,SkgXS_1Kvr,1569440000000.0,1577170000000.0,2044,On Weight-Sharing and Bilevel Optimization in Architecture Search,"[""khodak@cmu.edu"", ""me@liamcli.com"", ""ninamf@cs.cmu.edu"", ""talwalkar@cmu.edu""]","[""Mikhail Khodak"", ""Liam Li"", ""Maria-Florina Balcan"", ""Ameet Talwalkar""]","[""neural architecture search"", ""weight-sharing"", ""bilevel optimization"", ""non-convex optimization"", ""hyperparameter optimization"", ""model selection""]","Weight-sharing—the simultaneous optimization of multiple neural networks using the same parameters—has emerged as a key component of state-of-the-art neural architecture search. However, its success is poorly understood and often found to be surprising. We argue that, rather than just being an optimization trick, the weight-sharing approach is induced by the relaxation of a structured hypothesis space, and introduces new algorithmic and theoretical challenges as well as applications beyond neural architecture search. Algorithmically, we show how the geometry of ERM for weight-sharing requires greater care when designing gradient- based minimization methods and apply tools from non-convex non-Euclidean optimization to give general-purpose algorithms that adapt to the underlying structure. We further analyze the learning-theoretic behavior of the bilevel optimization solved by practical weight-sharing methods. Next, using kernel configuration and NLP feature selection as case studies, we demonstrate how weight-sharing applies to the architecture search generalization of NAS and effectively optimizes the resulting bilevel objective. Finally, we use our optimization analysis to develop a simple exponentiated gradient method for NAS that aligns with the underlying optimization geometry and matches state-of-the-art approaches on CIFAR-10.",/pdf/edea8b2b5b1ba3dadca4ca28d087178a5bfbf57f.pdf,ICLR,2020,An analysis of the learning and optimization structures of architecture search in neural networks and beyond. 
+uUlGTEbBRL,IGwQUQSu0Uu,1601310000000.0,1614990000000.0,3328,Rethinking Compressed Convolution Neural Network from a Statistical Perspective,"[""~Feiqing_Huang1"", ""yuefeng_si@hku.hk"", ""gdli@hku.hk""]","[""Feiqing Huang"", ""Yuefeng Si"", ""Guodong Li""]","[""Compressed Convolutional Neural Network"", ""Tensor Decomposition"", ""Sample Complexity Analysis""]","Many designs have recently been proposed to improve the model efficiency of convolutional neural networks (CNNs) at a fixed resource budget, while there is a lack of theoretical analysis to justify them. This paper first formulates CNNs with high-order inputs into statistical models, which have a special ""Tucker-like"" formulation. This makes it possible to further conduct the sample complexity analysis to CNNs as well as compressed CNNs via tensor decomposition. Tucker and CP decompositions are commonly adopted to compress CNNs in the literature. The low rank assumption is usually imposed on the output channels, which according to our study, may not be beneficial to obtain a computationally efficient model while a similar accuracy can be maintained. Our finding is further supported by ablation studies on CIFAR10, SVHN and UCF101 datasets.",/pdf/75079f3e38a8f20b02334adc2aebbd29a7a5e4b6.pdf,ICLR,2021,We theoretically explore the mechanism of tensor factorized convolutional neural networks. +zCu1BZYCueE,TVnM4QOkty,1601310000000.0,1614990000000.0,2309,Response Modeling of Hyper-Parameters for Deep Convolutional Neural Networks,"[""mathieutuli@cs.toronto.edu"", ""~Mahdi_S._Hosseini1"", ""~Konstantinos_N_Plataniotis1""]","[""Mathieu Tuli"", ""Mahdi S. Hosseini"", ""Konstantinos N Plataniotis""]","[""Hyper-Parameter Optimization"", ""Response Surface Modeling"", ""Convolution Neural Network"", ""Low-Rank Factorization""]","Hyper-parameter optimization (HPO) is critical in training high performing Deep Neural Networks (DNN). 
Current methodologies fail to define an analytical response surface and remain a training bottleneck due to their use of additional internal hyper-parameters and lengthy evaluation cycles. We demonstrate that the low-rank factorization of the convolution weights of intermediate layers of a CNN can define an analytical response surface. We quantify how this surface acts as an auxiliary to optimizing training metrics. We introduce a dynamic tracking algorithm -- autoHyper -- that performs HPO on the order of hours for various datasets including ImageNet and requires no manual tuning. Our method -- using a single RTX2080Ti -- is able to select a learning rate within 59 hours for AdaM on ResNet34 applied to ImageNet and improves in testing accuracy by 4.93% over the default learning rate. In contrast to previous methods, we empirically prove that our algorithm and response surface generalize well across model, optimizer, and dataset selection removing the need for extensive domain knowledge to achieve high levels of performance.",/pdf/a54cfc8092264652b2460e00aed151a18021ef33.pdf,ICLR,2021,A new response surface model is proposed to dynamically track the optimum Hyper-Parameters for training Convolution Neural Network. +SJCPLLpaW,r1aPILapZ,1508890000000.0,1518730000000.0,75,Exploring the Hidden Dimension in Accelerating Convolutional Neural Networks,"[""zhihao@cs.stanford.edu"", ""silin@microsoft.com"", ""rqi@stanford.edu"", ""aiken@cs.stanford.edu""]","[""Zhihao Jia"", ""Sina Lin"", ""Charles R. Qi"", ""Alex Aiken""]","[""Parallelism of Convolutional Neural Networks"", ""Accelerating Convolutional Neural Networks""]",DeePa is a deep learning framework that explores parallelism in all parallelizable dimensions to accelerate the training process of convolutional neural networks. DeePa optimizes parallelism at the granularity of each individual layer in the network. We present an elimination-based algorithm that finds an optimal parallelism configuration for every layer. 
Our evaluation shows that DeePa achieves up to 6.5× speedup compared to state-of-the-art deep learning frameworks and reduces data transfers by up to 23×.,/pdf/d2b0dd98ffe73945fb816a531593117ab80d7b8c.pdf,ICLR,2018,"To the best of our knowledge, DeePa is the first deep learning framework that controls and optimizes the parallelism of CNNs in all parallelizable dimensions at the granularity of each layer." +ryj38zWRb,H1PhLzb0W,1509140000000.0,1518730000000.0,849,Optimizing the Latent Space of Generative Networks,"[""bojanowski@fb.com"", ""ajoulin@fb.com"", ""dlp@fb.com"", ""aszlam@fb.com""]","[""Piotr Bojanowski"", ""Armand Joulin"", ""David Lopez-Paz"", ""Arthur Szlam""]","[""generative models"", ""latent variable models"", ""image generation"", ""generative adversarial networks"", ""convolutional neural networks""]","Generative Adversarial Networks (GANs) have achieved remarkable results in the task of generating realistic natural images. In most applications, GAN models share two aspects in common. On the one hand, GANs training involves solving a challenging saddle point optimization problem, interpreted as an adversarial game between a generator and a discriminator functions. On the other hand, the generator and the discriminator are parametrized in terms of deep convolutional neural networks. The goal of this paper is to disentangle the contribution of these two factors to the success of GANs. In particular, we introduce Generative Latent Optimization (GLO), a framework to train deep convolutional generators without using discriminators, thus avoiding the instability of adversarial optimization problems. 
Throughout a variety of experiments, we show that GLO enjoys many of the desirable properties of GANs: learning from large data, synthesizing visually-appealing samples, interpolating meaningfully between samples, and performing linear arithmetic with noise vectors.",/pdf/305fd049211de562d6a10bc2abc11952de4632cf.pdf,ICLR,2018,Are GANs successful because of adversarial training or the use of ConvNets? We show a ConvNet generator trained with a simple reconstruction loss and learnable noise vectors leads to many of the desirable properties of a GAN. +H1lNPxHKDH,rye3LilFwS,1569440000000.0,1583910000000.0,2356,A Function Space View of Bounded Norm Infinite Width ReLU Nets: The Multivariate Case,"[""gongie@uchicago.edu"", ""willett@uchicago.edu"", ""daniel.soudry@technion.ac.il"", ""nati@ttic.edu""]","[""Greg Ongie"", ""Rebecca Willett"", ""Daniel Soudry"", ""Nathan Srebro""]","[""inductive bias"", ""regularization"", ""infinite-width networks"", ""ReLU networks""]","We give a tight characterization of the (vectorized Euclidean) norm of weights required to realize a function $f:\mathbb{R}\rightarrow \mathbb{R}^d$ as a single hidden-layer ReLU network with an unbounded number of units (infinite width), extending the univariate characterization of Savarese et al. (2019) to the multivariate case.",/pdf/a4decc1cb86c7fbba42a437c2315e29ac79c8894.pdf,ICLR,2020,"We characterize the space of functions realizable as a ReLU network with an unbounded number of units (infinite width), but where the Euclidean norm of the weights is bounded." 
+HkeGhoA5FX,r1eMkfaqt7,1538090000000.0,1550690000000.0,693,Residual Non-local Attention Networks for Image Restoration,"[""yulun100@gmail.com"", ""kunpengli@ece.neu.edu"", ""li.kai.gml@gmail.com"", ""bnzhong@hqu.edu.cn"", ""yunfu@ece.neu.edu""]","[""Yulun Zhang"", ""Kunpeng Li"", ""Kai Li"", ""Bineng Zhong"", ""Yun Fu""]","[""Non-local network"", ""attention network"", ""image restoration"", ""residual learning""]","In this paper, we propose a residual non-local attention network for high-quality image restoration. Without considering the uneven distribution of information in the corrupted images, previous methods are restricted by local convolutional operation and equal treatment of spatial- and channel-wise features. To address this issue, we design local and non-local attention blocks to extract features that capture the long-range dependencies between pixels and pay more attention to the challenging parts. Specifically, we design trunk branch and (non-)local mask branch in each (non-)local attention block. The trunk branch is used to extract hierarchical features. Local and non-local mask branches aim to adaptively rescale these hierarchical features with mixed attentions. The local mask branch concentrates on more local structures with convolutional operations, while non-local attention considers more about long-range dependencies in the whole feature map. Furthermore, we propose residual local and non-local attention learning to train the very deep network, which further enhance the representation ability of the network. Our proposed method can be generalized for various image restoration applications, such as image denoising, demosaicing, compression artifacts reduction, and super-resolution. 
Experiments demonstrate that our method obtains comparable or better results compared with recently leading methods quantitatively and visually.",/pdf/04121d8887f84077f6486db18444e9113aa62a24.pdf,ICLR,2019,New state-of-the-art framework for image restoration +ryeHuJBtPH,HkeJYe0uwH,1569440000000.0,1583910000000.0,1803,Hyper-SAGNN: a self-attention based graph neural network for hypergraphs,"[""ruochiz@andrew.cmu.edu"", ""logic.zys@gmail.com"", ""jianma@cs.cmu.edu""]","[""Ruochi Zhang"", ""Yuesong Zou"", ""Jian Ma""]","[""graph neural network"", ""hypergraph"", ""representation learning""]","Graph representation learning for hypergraphs can be utilized to extract patterns among higher-order interactions that are critically important in many real world problems. Current approaches designed for hypergraphs, however, are unable to handle different types of hypergraphs and are typically not generic for various learning tasks. Indeed, models that can predict variable-sized heterogeneous hyperedges have not been available. Here we develop a new self-attention based graph neural network called Hyper-SAGNN applicable to homogeneous and heterogeneous hypergraphs with variable hyperedge sizes. We perform extensive evaluations on multiple datasets, including four benchmark network datasets and two single-cell Hi-C datasets in genomics. We demonstrate that Hyper-SAGNN significantly outperforms state-of-the-art methods on traditional tasks while also achieving great performance on a new task called outsider identification. We believe that Hyper-SAGNN will be useful for graph representation learning to uncover complex higher-order interactions in different applications. ",/pdf/78bbb8eaf72d13a892c120e09b784166991d78f9.pdf,ICLR,2020,We develop a new self-attention based graph neural network called Hyper-SAGNN applicable to homogeneous and heterogeneous hypergraphs with variable hyperedge sizes that can fulfill tasks like node classification and hyperedge prediction. 
+HJjiFK5gx,,1478300000000.0,1489580000000.0,513,Neural Program Lattices,"[""ctli@mit.edu"", ""dtarlow@microsoft.com"", ""algaunt@microsoft.com"", ""mabrocks@microsoft.com"", ""nkushman@microsoft.com""]","[""Chengtao Li"", ""Daniel Tarlow"", ""Alexander L. Gaunt"", ""Marc Brockschmidt"", ""Nate Kushman""]","[""Deep learning"", ""Semi-Supervised Learning""]","We propose the Neural Program Lattice (NPL), a neural network that learns to perform complex tasks by composing low-level programs to express high-level programs. Our starting point is the recent work on Neural Programmer-Interpreters (NPI), which can only learn from strong supervision that contains the whole hierarchy of low-level and high-level programs. NPLs remove this limitation by providing the ability to learn from weak supervision consisting only of sequences of low-level operations. We demonstrate the capability of NPL to learn to perform long-hand addition and arrange blocks in a grid-world environment. Experiments show that it performs on par with NPI while using weak supervision in place of most of the strong supervision, thus indicating its ability to infer the high-level program structure from examples containing only the low-level operations.",/pdf/8d4a4645cf9a906b7cc8e5da7d7d8d54045bcda1.pdf,ICLR,2017, +7_G8JySGecm,eCLGeSxUXtj,1601310000000.0,1616050000000.0,3067,Monte-Carlo Planning and Learning with Language Action Value Estimates,"[""~Youngsoo_Jang2"", ""~Seokin_Seo1"", ""~Jongmin_Lee1"", ""~Kee-Eung_Kim4""]","[""Youngsoo Jang"", ""Seokin Seo"", ""Jongmin Lee"", ""Kee-Eung Kim""]","[""natural language processing"", ""Monte-Carlo tree search"", ""reinforcement learning"", ""interactive fiction""]","Interactive Fiction (IF) games provide a useful testbed for language-based reinforcement learning agents, posing significant challenges of natural language understanding, commonsense reasoning, and non-myopic planning in the combinatorial search space. 
Agents based on standard planning algorithms struggle to play IF games due to the massive search space of language actions. Thus, language-grounded planning is a key ability of such agents, since inferring the consequence of language action based on semantic understanding can drastically improve search. In this paper, we introduce Monte-Carlo planning with Language Action Value Estimates (MC-LAVE) that combines a Monte-Carlo tree search with language-driven exploration. MC-LAVE invests more search effort into semantically promising language actions using locally optimistic language value estimates, yielding a significant reduction in the effective search space of language actions. We then present a reinforcement learning approach via MC-LAVE, which alternates between MC-LAVE planning and supervised learning of the self-generated language actions. In the experiments, we demonstrate that our method achieves new high scores in various IF games.",/pdf/255385188b591f81f5ec4cb8c99ea2b92467f6be.pdf,ICLR,2021,We present Monte-Carlo planning with Language Action Value Estimates (MC-LAVE) that combines a Monte-Carlo tree search with language-driven exploration for Interactive Fiction games. +kVZ6WBYazFq,by1dGWpsfrvM,1601310000000.0,1614990000000.0,2880,Constraint-Driven Explanations of Black-Box ML Models,"[""~Aditya_Aniruddha_Shrotri1"", ""~Nina_Narodytska1"", ""~Alexey_Ignatiev1"", ""~Joao_Marques-Silva1"", ""~Kuldeep_S._Meel2"", ""~Moshe_Vardi1""]","[""Aditya Aniruddha Shrotri"", ""Nina Narodytska"", ""Alexey Ignatiev"", ""Joao Marques-Silva"", ""Kuldeep S. Meel"", ""Moshe Vardi""]","[""Explainability"", ""constraints"", ""uniform sampling""]","Modern machine learning techniques have enjoyed widespread success, but are plagued by lack of transparency in their decision making, which has led to the emergence of the field of explainable AI. 
One popular approach, called LIME, seeks to explain an opaque model's behavior by training a surrogate interpretable model to be locally faithful on perturbed instances. +Despite being model-agnostic and easy-to-use, it is known that LIME's explanations can be unstable and are susceptible to adversarial attacks as a result of Out-Of-Distribution (OOD) sampling. Quality of explanations is also calculated heuristically, and lacks a strong theoretical foundation. In spite of numerous attempts to remedy some of these issues, making the LIME framework more trustworthy and reliable remains an open problem. + +In this work, we demonstrate that the OOD sampling problem stems from rigidity of the perturbation procedure. To resolve this issue, we propose a theoretically sound framework based on uniform sampling of user-defined subspaces. Through logical constraints, we afford the end-user the flexibility to delineate the precise subspace of the input domain to be explained. This not only helps mitigate the problem of OOD sampling, but also allows experts to drill down and uncover bugs deep inside the model. For testing the quality of generated explanations, we develop an efficient estimation algorithm that is able to certifiably measure the true value of metrics such as fidelity up to any desired degree of accuracy, which can help in building trust in the generated explanations. Our framework, called CLIME, can be applied to any ML model, and extensive experiments demonstrate its versatility on real-world problems. +",/pdf/f5d72292d87a9ce138c6d936f752cf72628c9bea.pdf,ICLR,2021,Trustworthy and reliable explanations on user-defined constrained subspaces. 
+BJemQ209FQ,S1gzuKpctX,1538090000000.0,1550880000000.0,1342,Learning to Navigate the Web,"[""izzeddingur@gmail.com"", ""rueckert@google.com"", ""sandrafaust@google.com"", ""dilek@ieee.org""]","[""Izzeddin Gur"", ""Ulrich Rueckert"", ""Aleksandra Faust"", ""Dilek Hakkani-Tur""]","[""navigating web pages"", ""reinforcement learning"", ""q learning"", ""curriculum learning"", ""meta training""]","Learning in environments with large state and action spaces, and sparse rewards, can hinder a Reinforcement Learning (RL) agent’s learning through trial-and-error. For instance, following natural language instructions on the Web (such as booking a flight ticket) leads to RL settings where input vocabulary and number of actionable elements on a page can grow very large. Even though recent approaches improve the success rate on relatively simple environments with the help of human demonstrations to guide the exploration, they still fail in environments where the set of possible instructions can reach millions. We approach the aforementioned problems from a different perspective and propose guided RL approaches that can generate an unbounded amount of experience for an agent to learn from. Instead of learning from a complicated instruction with a large vocabulary, we decompose it into multiple sub-instructions and schedule a curriculum in which an agent is tasked with a gradually increasing subset of these relatively easier sub-instructions. In addition, when the expert demonstrations are not available, we propose a novel meta-learning framework that generates new instruction following tasks and trains the agent more effectively. We train DQN, a deep reinforcement learning agent, with the Q-value function approximated by a novel QWeb neural network architecture on these smaller, synthetic instructions. We evaluate the ability of our agent to generalize to new instructions on the World of Bits benchmark, on forms with up to 100 elements, supporting 14 million possible instructions. 
The QWeb agent outperforms the baseline without using any human demonstration achieving 100% success rate on several difficult environments.",/pdf/1ebd2cc00d943a087baca34ea2d616ac522de5d5.pdf,ICLR,2019,"We train reinforcement learning policies using reward augmentation, curriculum learning, and meta-learning to successfully navigate web pages." +H1lBYCEFDB,ByxXnQ__wB,1569440000000.0,1577170000000.0,1247,A Coordinate-Free Construction of Scalable Natural Gradient,"[""kevin.kh.luk@gmail.com"", ""rgrosse@cs.toronto.edu""]","[""Kevin Luk"", ""Roger Grosse""]","[""Natural gradient"", ""second-order optimization"", ""K-FAC"", ""parameterization invariance"", ""deep learning""]","Most neural networks are trained using first-order optimization methods, which are sensitive to the parameterization of the model. Natural gradient descent is invariant to smooth reparameterizations because it is defined in a coordinate-free way, but tractable approximations are typically defined in terms of coordinate systems, and hence may lose the invariance properties. We analyze the invariance properties of the Kronecker-Factored Approximate Curvature (K-FAC) algorithm by constructing the algorithm in a coordinate-free way. We explicitly construct a Riemannian metric under which the natural gradient matches the K-FAC update; invariance to affine transformations of the activations follows immediately. We extend our framework to analyze the invariance properties of K-FAC applied to convolutional networks and recurrent neural networks, as well as metrics other than the usual Fisher metric.",/pdf/fdf34b6c7974a5c2cf37092307eae848039e99d4.pdf,ICLR,2020,We explicitly construct a Riemannian metric under which the natural gradient matches the K-FAC update; exact affine invariance follows immediately. 
+B1MXz20cYQ,HklMDB2qKQ,1538090000000.0,1550960000000.0,1250,Explaining Image Classifiers by Counterfactual Generation,"[""kingsley@cs.toronto.edu"", ""creager@cs.toronto.edu"", ""anna.goldenberg@utoronto.ca"", ""duvenaud@cs.toronto.edu""]","[""Chun-Hao Chang"", ""Elliot Creager"", ""Anna Goldenberg"", ""David Duvenaud""]","[""Explainability"", ""Interpretability"", ""Generative Models"", ""Saliency Map"", ""Machine Learning"", ""Deep Learning""]","When an image classifier makes a prediction, which parts of the image are relevant and why? We can rephrase this question to ask: which parts of the image, if they were not seen by the classifier, would most change its decision? Producing an answer requires marginalizing over images that could have been seen but weren't. We can sample plausible image in-fills by conditioning a generative model on the rest of the image. We then optimize to find the image regions that most change the classifier's decision after in-fill. Our approach contrasts with ad-hoc in-filling approaches, such as blurring or injecting noise, which generate inputs far from the data distribution, and ignore informative relationships between different parts of the image. Our method produces more compact and relevant saliency maps, with fewer artifacts compared to previous methods.",/pdf/d0923e3c5fe15a4d5f098cd8ca7faadebf24673e.pdf,ICLR,2019,"We compute saliency by using a strong generative model to efficiently marginalize over plausible alternative inputs, revealing concentrated pixel areas that preserve label information." 
+Syxc1yrKvr,HJxijCc_DS,1569440000000.0,1577170000000.0,1478,Implicit λ-Jeffreys Autoencoders: Taking the Best of Both Worlds,"[""alanov.aibek@gmail.com"", ""maxim.v.kochurov@gmail.com"", ""asobolev@bayesgroup.ru"", ""daniil.yashkov@phystech.edu"", ""vetrovd@yandex.ru""]","[""Aibek Alanov"", ""Max Kochurov"", ""Artem Sobolev"", ""Daniil Yashkov"", ""Dmitry Vetrov""]","[""Variational Inference"", ""Generative Adversarial Networks""]","We propose a new form of an autoencoding model which incorporates the best properties of variational autoencoders (VAE) and generative adversarial networks (GAN). It is known that GAN can produce very realistic samples while VAE does not suffer from the mode collapsing problem. Our model optimizes λ-Jeffreys divergence between the model distribution and the true data distribution. We show that it takes the best properties of VAE and GAN objectives. It consists of two parts. One of these parts can be optimized by using the standard adversarial training, and the second one is the very objective of the VAE model. However, the straightforward way of substituting the VAE loss does not work well if we use an explicit likelihood such as Gaussian or Laplace which have limited flexibility in high dimensions and are unnatural for modelling images in the space of pixels. To tackle this problem we propose a novel approach to train the VAE model with an implicit likelihood by an adversarially trained discriminator. In an extensive set of experiments on CIFAR-10 and TinyImageNet datasets, we show that our model achieves the state-of-the-art generation and reconstruction quality and demonstrate how we can balance between mode-seeking and mode-covering behaviour of our model by adjusting the weight λ in our objective. 
",/pdf/6aa5830656c806a3918bf3e1786ff0ca6e2750bb.pdf,ICLR,2020,We propose a new form of an autoencoding model which incorporates the best properties of variational autoencoders (VAE) and generative adversarial networks (GAN) +ryUprTOv7q0,ivCaFnZ1ZQF,1601310000000.0,1614990000000.0,2931,Quantum Deformed Neural Networks,"[""~Roberto_Bondesan1"", ""~Max_Welling1""]","[""Roberto Bondesan"", ""Max Welling""]","[""Quantum machine learning"", ""Binary neural networks"", ""Bayesian deep learning""]","We develop a new quantum neural network layer designed to run efficiently on a quantum computer but that can be simulated on a classical computer when restricted in the way it entangles input states. We first ask how a classical neural network architecture, both fully connected or convolutional, can be executed on a quantum computer using quantum phase estimation. We then deform the classical layer into a quantum design which entangles activations and weights into quantum superpositions. While the full model would need the exponential speedups delivered by a quantum computer, a restricted class of designs represent interesting new classical network layers that still use quantum features. We show that these quantum deformed neural networks can be trained and executed on normal data such as images, and even classically deliver modest improvements over standard architectures. ",/pdf/0885d298ac112d563719ea1c7a1b55d2f97ffdff.pdf,ICLR,2021,"We develop a new quantum neural network and simulate a restricted version classically for real world data sizes for the first time, showing modest improvements over standard architectures." 
+r1lfga4KvS,H1xcTYRHvS,1569440000000.0,1577170000000.0,331,Extreme Value k-means Clustering,"[""sxzheng18@fudan.edu.cn"", ""yxhou@fudan.edu.cn"", ""yanweifu@fudan.edu.cn"", ""jffeng@fudan.edu.cn""]","[""Sixiao Zheng"", ""Yanxi Hou"", ""Yanwei Fu"", ""Jianfeng Feng""]","[""unsupervised learning"", ""clustering"", ""k-means"", ""Extreme Value Theory""]","Clustering is the central task in unsupervised learning and data mining. k-means is one of the most widely used clustering algorithms. Unfortunately, it is generally non-trivial to extend k-means to cluster data points beyond Gaussian distribution, particularly, the clusters with non-convex shapes (Beliakov & King, 2006). To this end, we, for the first time, introduce Extreme Value Theory (EVT) to improve the clustering ability of k-means. Particularly, the Euclidean space was transformed into a novel probability space denoted as extreme value space by EVT. We thus propose a novel algorithm called Extreme Value k-means (EV k-means), including GEV k-means and GPD k-means. In addition, we also introduce the tricks to accelerate Euclidean distance computation in improving the computational efficiency of classical k-means. Furthermore, our EV k-means is extended to an online version, i.e., online Extreme Value k-means, in utilizing the Mini Batch k-means to cluster streaming data. Extensive experiments are conducted to validate our EV k-means and online EV k-means on synthetic datasets and real datasets. Experimental results show that our algorithms significantly outperform competitors in most cases.",/pdf/f66f977827a975c0124021b647b49ec3b5ee2229.pdf,ICLR,2020,This paper introduces Extreme Value Theory into k-means to measure similarity and proposes a novel algorithm called Extreme Value k-means for clustering. 
+r1gIa0NtDH,SkeJdWquPr,1569440000000.0,1577170000000.0,1394,MelNet: A Generative Model for Audio in the Frequency Domain,"[""seanjv@mit.edu"", ""mikelewis@fb.com""]","[""Sean Vasquez"", ""Mike Lewis""]",[],"Capturing high-level structure in audio waveforms is challenging because a single second of audio spans tens of thousands of timesteps. While long-range dependencies are difficult to model directly in the time domain, we show that they can be more tractably modelled in two-dimensional time-frequency representations such as spectrograms. By leveraging this representational advantage, in conjunction with a highly expressive probabilistic model and a multiscale generation procedure, we design a model capable of generating high-fidelity audio samples which capture structure at timescales which time-domain models have yet to achieve. We demonstrate that our model captures longer-range dependencies than time-domain models such as WaveNet across a diverse set of unconditional generation tasks, including single-speaker speech generation, multi-speaker speech generation, and music generation.",/pdf/684fb7b4e9c29e405246f942f000fdd6b49a1d17.pdf,ICLR,2020,We introduce an autoregressive generative model for spectrograms and demonstrate applications to speech and music generation +HkMvEOlAb,SJZwV_e0W,1509090000000.0,1518730000000.0,307,Learning Latent Representations in Neural Networks for Clustering through Pseudo Supervision and Graph-based Activity Regularization,"[""ozsel@mail.usf.edu"", ""iuysal@usf.edu""]","[""Ozsel Kilinc"", ""Ismail Uysal""]","[""representation learning"", ""unsupervised clustering"", ""pseudo supervision"", ""graph-based activity regularization"", ""auto-clustering output layer""]","In this paper, we propose a novel unsupervised clustering approach exploiting the hidden information that is indirectly introduced through a pseudo classification objective. 
Specifically, we randomly assign a pseudo parent-class label to each observation which is then modified by applying the domain specific transformation associated with the assigned label. Generated pseudo observation-label pairs are subsequently used to train a neural network with Auto-clustering Output Layer (ACOL) that introduces multiple softmax nodes for each pseudo parent-class. Due to the unsupervised objective based on Graph-based Activity Regularization (GAR) terms, softmax duplicates of each parent-class are specialized as the hidden information captured through the help of domain specific transformations is propagated during training. Ultimately we obtain a k-means friendly latent representation. Furthermore, we demonstrate how the chosen transformation type impacts performance and helps propagate the latent information that is useful in revealing unknown clusters. Our results show state-of-the-art performance for unsupervised clustering tasks on MNIST, SVHN and USPS datasets, with the highest accuracies reported to date in the literature.",/pdf/fc70dd2e264c0bdaae111681f42c8cce12e13ecb.pdf,ICLR,2018, +eqBwg3AcIAK,nZ57PUwYhK,1601310000000.0,1615950000000.0,1333,Off-Dynamics Reinforcement Learning: Training for Transfer with Domain Classifiers,"[""~Benjamin_Eysenbach1"", ""~Shreyas_Chaudhari1"", ""~Swapnil_Asawa1"", ""~Sergey_Levine1"", ""~Ruslan_Salakhutdinov1""]","[""Benjamin Eysenbach"", ""Shreyas Chaudhari"", ""Swapnil Asawa"", ""Sergey Levine"", ""Ruslan Salakhutdinov""]","[""reinforcement learning"", ""transfer learning"", ""domain adaptation""]","We propose a simple, practical, and intuitive approach for domain adaptation in reinforcement learning. Our approach stems from the idea that the agent's experience in the source domain should look similar to its experience in the target domain. Building off of a probabilistic view of RL, we achieve this goal by compensating for the difference in dynamics by modifying the reward function. 
This modified reward function is simple to estimate by learning auxiliary classifiers that distinguish source-domain transitions from target-domain transitions. Intuitively, the agent is penalized for transitions that would indicate that the agent is interacting with the source domain, rather than the target domain. Formally, we prove that applying our method in the source domain is guaranteed to obtain a near-optimal policy for the target domain, provided that the source and target domains satisfy a lightweight assumption. Our approach is applicable to domains with continuous states and actions and does not require learning an explicit model of the dynamics. On discrete and continuous control tasks, we illustrate the mechanics of our approach and demonstrate its scalability to high-dimensional tasks.",/pdf/cc68f287d3416da646309bf4689396ac35e2be5c.pdf,ICLR,2021,"We propose a method for addressing domain adaptation in RL by using a (learned) modified reward, and prove that our method recovers a near-optimal policy for the target domain." +VcB4QkSfyO,I5tdC3N-HgK,1601310000000.0,1616030000000.0,2234,Estimating Lipschitz constants of monotone deep equilibrium models,"[""~Chirag_Pabbaraju1"", ""~Ezra_Winston1"", ""~J_Zico_Kolter1""]","[""Chirag Pabbaraju"", ""Ezra Winston"", ""J Zico Kolter""]","[""deep equilibrium models"", ""Lipschitz constants""]","Several methods have been proposed in recent years to provide bounds on the Lipschitz constants of deep networks, which can be used to provide robustness guarantees, generalization bounds, and characterize the smoothness of decision boundaries. However, existing bounds get substantially weaker with increasing depth of the network, which makes it unclear how to apply such bounds to recently proposed models such as the deep equilibrium (DEQ) model, which can be viewed as representing an infinitely-deep network. 
In this paper, we show that monotone DEQs, a recently-proposed subclass of DEQs, have Lipschitz constants that can be bounded as a simple function of the strong monotonicity parameter of the network. We derive simple-yet-tight bounds on both the input-output mapping and the weight-output mapping defined by these networks, and demonstrate that they are small relative to those for comparable standard DNNs. We show that one can use these bounds to design monotone DEQ models, even with e.g. multi-scale convolutional structure, that still have constraints on the Lipschitz constant. We also highlight how to use these bounds to develop PAC-Bayes generalization bounds that do not depend on any depth of the network, and which avoid the exponential depth-dependence of comparable DNN bounds.",/pdf/62c8f87a22f20b30e037ebb6a618d34b540f0e93.pdf,ICLR,2021,"Monotone deep equilibrium models have Lipschitz constants which are simple to bound and small relative to those of standard DNNs, which suffer with depth." +3JI45wPuReY,qJ832i6O6P0,1601310000000.0,1614990000000.0,783,Neural Network Surgery: Combining Training with Topology Optimization,"[""~Elisabeth_Schiessler1"", ""~Roland_Aydin1"", ""kevin.linka@tuhh.de"", ""christian.cyron@hzg.de""]","[""Elisabeth Schiessler"", ""Roland Aydin"", ""Kevin Linka"", ""Christian Cyron""]","[""Neural Architecture Search"", ""Genetic Algorithm"", ""SVD""]","With ever increasing computational capacities, neural networks become more and more proficient at solving complex tasks. However, picking a sufficiently good network topology usually relies on expert human knowledge. Neural architecture search aims to reduce the extent of expertise that is needed. Modern architecture search techniques often rely on immense computational power, or apply trained meta controllers for decision making. We develop a framework for a genetic algorithm that is both computationally cheap and makes decisions based on mathematical criteria rather than trained parameters. 
It is a hybrid approach that fuses training and topology optimization together into one process. Structural modifications that are performed include adding or removing layers of neurons, with some re-training applied to make up for incurred change in input-output behaviour. Our ansatz is tested on both the SVHN and (augmented) CIFAR-10 datasets with limited computational overhead compared to training only the baseline. This algorithm can achieve a significant increase in accuracy (as compared to a fully trained baseline), rescue insufficient topologies that in their current state are only able to learn to a limited extent, and dynamically reduce network size without loss in achieved accuracy.",/pdf/f610120725facb9cb89b7d61686b983bdb2863d0.pdf,ICLR,2021,We demonstrate a hybrid approach for combining neural network training with a genetic-algorithm based architecture optimization. +sgJJjd3-Y3,WqWyezIFHLX,1601310000000.0,1614990000000.0,771,Semi-supervised regression with skewed data via adversarially forcing the distribution of predicted values,"[""~Dae-Woong_Jeong1"", ""elgee.kim@lgsp.co.kr"", ""changyoung.park@lgsp.co.kr"", ""hanssse.han@lgsp.co.kr"", ""~Woohyung_Lim1""]","[""Dae-Woong Jeong"", ""Kiyoung Kim"", ""Changyoung Park"", ""Sehui Han"", ""Woohyung Lim""]","[""Semi-supervised learning"", ""Adversarial"", ""regression""]","Advances in scientific fields including drug discovery or material design are accompanied by numerous trials and errors. However, generally only representative experimental results are reported. Because of this reporting bias, the distribution of labeled result data can deviate from their true distribution. A regression model can be erroneous if it is built on these skewed data. In this work, we propose a new approach to improve the accuracy of regression models that are trained using a skewed dataset. 
The method forces the regression outputs to follow the true distribution; the forcing algorithm regularizes the regression results while keeping the information of the training data. We assume the existence of enough unlabeled data that follow the true distribution, and that the true distribution can be roughly estimated from domain knowledge or a few samples. During training neural networks to generate a regression model, an adversarial network is used to force the distribution of predicted values to follow the estimated ‘true’ distribution. We evaluated the proposed approach on four real-world datasets (pLogP, Diamond, House, Elevators). In all four datasets, the proposed approach reduced the root mean squared error of the regression by around 55 percent to 75 percent compared to regression models without adjustment of the distribution.",/pdf/260d59afaa1f8be711f92094726319562f854705.pdf,ICLR,2021,We propose a new approach to improve the regression models trained with a skewed dataset by using a semi-supervised learning framework with an adversarial network to force the distribution of the predicted values to follow the true distribution. +SJgsCjCqt7,rJeTqnk_KQ,1538090000000.0,1551120000000.0,924,Variational Autoencoders with Jointly Optimized Latent Dependency Structure,"[""jha203@sfu.ca"", ""yu_gong@sfu.ca"", ""jmarino@caltech.edu"", ""mori@cs.sfu.ca"", ""andreas.lehrmann@gmail.com""]","[""Jiawei He"", ""Yu Gong"", ""Joseph Marino"", ""Greg Mori"", ""Andreas Lehrmann""]","[""deep generative models"", ""structure learning""]","We propose a method for learning the dependency structure between latent variables in deep latent variable models. Our general modeling and inference framework combines the complementary strengths of deep generative models and probabilistic graphical models. In particular, we express the latent variable space of a variational autoencoder (VAE) in terms of a Bayesian network with a learned, flexible dependency structure. 
The network parameters, variational parameters as well as the latent topology are optimized simultaneously with a single objective. Inference is formulated via a sampling procedure that produces expectations over latent variable structures and incorporates top-down and bottom-up reasoning over latent variable values. We validate our framework in extensive experiments on MNIST, Omniglot, and CIFAR-10. Comparisons to state-of-the-art structured variational autoencoder baselines show improvements in terms of the expressiveness of the learned model.",/pdf/88bbd3208909f2eadfa6e69c4d31b77326774b32.pdf,ICLR,2019,We propose a method for learning latent dependency structure in variational autoencoders. +jrA5GAccy_,oBv3F6B61u9,1601310000000.0,1615860000000.0,1481,Empirical or Invariant Risk Minimization? A Sample Complexity Perspective,"[""~Kartik_Ahuja1"", ""wangsidi76@gmail.com"", ""~Amit_Dhurandhar1"", ""~Karthikeyan_Shanmugam1"", ""~Kush_R._Varshney1""]","[""Kartik Ahuja"", ""Jun Wang"", ""Amit Dhurandhar"", ""Karthikeyan Shanmugam"", ""Kush R. Varshney""]","[""invariant risk minimization"", ""IRM""]","Recently, invariant risk minimization (IRM) was proposed as a promising solution to address out-of-distribution (OOD) generalization. However, it is unclear when IRM should be preferred over the widely-employed empirical risk minimization (ERM) framework. In this work, we analyze both these frameworks from the perspective of sample complexity, thus taking a firm step towards answering this important question. We find that depending on the type of data generation mechanism, the two approaches might have very different finite sample and asymptotic behavior. For example, in the covariate shift setting we see that the two approaches not only arrive at the same asymptotic solution, but also have similar finite sample behavior with no clear winner. 
For other distribution shifts such as those involving confounders or anti-causal variables, however, the two approaches arrive at different asymptotic solutions where IRM is guaranteed to be close to the desired OOD solutions in the finite sample regime, while ERM is biased even asymptotically. We further investigate how different factors --- the number of environments, complexity of the model, and IRM penalty weight --- impact the sample complexity of IRM in relation to its distance from the OOD solutions. ",/pdf/276b48d5dd233a104c8c4d2cffbcf0e687e83747.pdf,ICLR,2021,"In this work, we provide a sample complexity comparison of the recent invariant risk minimization (IRM) framework with the classic empirical risk minimization (ERM) to answer when is IRM better than ERM in terms of out-of-distribution generalization?" +B1eZRiC9YX,HJxg1PlgKQ,1538090000000.0,1545360000000.0,867,Sufficient Conditions for Robustness to Adversarial Examples: a Theoretical and Empirical Study with Bayesian Neural Networks,"[""yarin@cs.ox.ac.uk"", ""lsgs@robots.ox.ac.uk""]","[""Yarin Gal"", ""Lewis Smith""]","[""Bayesian deep learning"", ""Bayesian neural networks"", ""adversarial examples""]","We prove, under two sufficient conditions, that idealised models can have no adversarial examples. We discuss which idealised models satisfy our conditions, and show that idealised Bayesian neural networks (BNNs) satisfy these. We continue by studying near-idealised BNNs using HMC inference, demonstrating the theoretical ideas in practice. We experiment with HMC on synthetic data derived from MNIST for which we know the ground-truth image density, showing that near-perfect epistemic uncertainty correlates to density under image manifold, and that adversarial images lie off the manifold in our setting. 
This suggests why MC dropout, which can be seen as performing approximate inference, has been observed to be an effective defence against adversarial examples in practice; We highlight failure-cases of non-idealised BNNs relying on dropout, suggesting a new attack for dropout models and a new defence as well. Lastly, we demonstrate the defence on a cats-vs-dogs image classification task with a VGG13 variant.",/pdf/5f32fed6bf08335d741e0d2db8d94aaa32cc2244.pdf,ICLR,2019,"We prove that idealised Bayesian neural networks can have no adversarial examples, and give empirical evidence with real-world BNNs." +cU0a02VF8ZG,fvu4qJi5cQl,1601310000000.0,1614990000000.0,2528,Globetrotter: Unsupervised Multilingual Translation from Visual Alignment,"[""~Didac_Suris_Coll-Vinent1"", ""~Dave_Epstein1"", ""~Carl_Vondrick2""]","[""Didac Suris Coll-Vinent"", ""Dave Epstein"", ""Carl Vondrick""]","[""cross-modal"", ""multilingual"", ""unsupervised translation"", ""visual similarity""]","Machine translation in a multi-language scenario requires large-scale parallel corpora for every language pair. Unsupervised translation is challenging because there is no explicit connection between languages, and the existing methods have to rely on topological properties of the language representations. We introduce a framework that leverages visual similarity to align multiple languages, using images as the bridge between them. We estimate the cross-modal alignment between language and images, and use this estimate to guide the learning of cross-lingual representations. Our language representations are trained jointly in one model with a single stage. 
Experiments with fifty-two languages show that our method outperforms prior work on unsupervised word-level and sentence-level translation using retrieval.",/pdf/e4b8f3d103c2e0e02613c5c826d2406665386cec.pdf,ICLR,2021,We propose a method that leverages cross- modal alignment between language and vision to train a multilingual translation system without any parallel corpora. +rkeIq2VYPr,BklZC8HkDH,1569440000000.0,1583910000000.0,117,Deep Learning of Determinantal Point Processes via Proper Spectral Sub-gradient,"[""tianshuy@asu.edu"", ""yikang.li@asu.edu"", ""baoxin.li@asu.edu""]","[""Tianshu Yu"", ""Yikang Li"", ""Baoxin Li""]","[""determinantal point processes"", ""deep learning"", ""optimization""]","Determinantal point processes (DPPs) is an effective tool to deliver diversity on multiple machine learning and computer vision tasks. Under deep learning framework, DPP is typically optimized via approximation, which is not straightforward and has some conflict with diversity requirement. We note, however, there has been no deep learning paradigms to optimize DPP directly since it involves matrix inversion which may result in highly computational instability. This fact greatly hinders the wide use of DPP on some specific objectives where DPP serves as a term to measure the feature diversity. In this paper, we devise a simple but effective algorithm to address this issue to optimize DPP term directly expressed with L-ensemble in spectral domain over gram matrix, which is more flexible than learning on parametric kernels. By further taking into account some geometric constraints, our algorithm seeks to generate valid sub-gradients of DPP term in case when the DPP gram matrix is not invertible (no gradients exist in this case). In this sense, our algorithm can be easily incorporated with multiple deep learning tasks. Experiments show the effectiveness of our algorithm, indicating promising performance for practical learning problems. 
",/pdf/1fea5d1a4d56edfc0b9b0c8a5b5a15d9f7aa7f43.pdf,ICLR,2020,We proposed a specific back-propagation method via proper spectral sub-gradient to integrate determinantal point process to deep learning framework. +ByxJjlHKwr,H1e9ryZFDS,1569440000000.0,1577170000000.0,2493,Learning Latent State Spaces for Planning through Reward Prediction,"[""ahavens2@illinois.edu"", ""ouyangyi@preferred-america.com"", ""prabhat@preferred.jp"", ""fujita@preferred.jp""]","[""Aaron Havens"", ""Yi Ouyang"", ""Prabhat Nagarajan"", ""Yasuhiro Fujita""]","[""Deep Reinforcement Learning"", ""Representation Learning"", ""Model Based Reinforcement Learning""]","Model-based reinforcement learning methods typically learn models for high-dimensional state spaces by aiming to reconstruct and predict the original observations. However, drawing inspiration from model-free reinforcement learning, we propose learning a latent dynamics model directly from rewards. In this work, we introduce a model-based planning framework which learns a latent reward prediction model and then plan in the latent state-space. The latent representation is learned exclusively from multi-step reward prediction which we show to be the only necessary information for successful planning. With this framework, we are able to benefit from the concise model-free representation, while still enjoying the data-efficiency of model-based algorithms. We demonstrate our framework in multi-pendulum and multi-cheetah environments where several pendulums or cheetahs are shown to the agent but only one of them produces rewards. In these environments, it is important for the agent to construct a concise latent representation to filter out irrelevant observations. We find that our method can successfully learn an accurate latent reward prediction model in the presence of the irrelevant information while existing model-based methods fail. 
Planning in the learned latent state-space shows strong performance and high sample efficiency over model-free and model-based baselines.",/pdf/df686e9e83f2c1db42382ff78bbeaba8b22d2e92.pdf,ICLR,2020,A latent reward prediction model is learned to achieve concise representation and plan efficiently using MPC. +dOiHyqVaFkg,buaKV73miKX,1601310000000.0,1614990000000.0,2342,Unsupervised Progressive Learning and the STAM Architecture,"[""~James_Smith1"", ""~Cameron_Ethan_Taylor1"", ""~Seth_Baer1"", ""~Constantine_Dovrolis1""]","[""James Smith"", ""Cameron Ethan Taylor"", ""Seth Baer"", ""Constantine Dovrolis""]","[""continual learning"", ""unsupervised learning"", ""representation learning"", ""online learning""]","We first pose the Unsupervised Progressive Learning (UPL) problem: an online representation learning problem in which the learner observes a non-stationary and unlabeled data stream, and identifies a growing number of features that persist over time even though the data is not stored or replayed. To solve the UPL problem we propose the Self-Taught Associative Memory (STAM) architecture. Layered hierarchies of STAM modules learn based on a combination of online clustering, novelty detection, forgetting outliers, and storing only prototypical features rather than specific examples. We evaluate STAM representations using classification and clustering tasks. While there are no existing learning scenarios which are directly comparable to UPL, we compare the STAM architecture with two recent continual learning works; Memory Aware Synapses (MAS), and Gradient Episodic Memories (GEM), which have been modified to be suitable for the UPL setting. 
",/pdf/54fda03f0eb32fe631f6691e376faf22fad5f54b.pdf,ICLR,2021,"We pose and solve a new online representation learning problem in which the learner observes a non-stationary and unlabeled data stream, and identifies a growing number of features that persist over time without data storage or replay" +X6YPReSv5CX,S_hTSVg66yP,1601310000000.0,1614990000000.0,24,Mixture of Step Returns in Bootstrapped DQN,"[""~PoHan_Chiang2"", ""~Hsuan-Kung_Yang1"", ""~Zhang-Wei_Hong1"", ""~Chun-Yi_Lee1""]","[""PoHan Chiang"", ""Hsuan-Kung Yang"", ""Zhang-Wei Hong"", ""Chun-Yi Lee""]","[""Reinforcement Learning""]","The concept of utilizing multi-step returns for updating value functions has been adopted in deep reinforcement learning (DRL) for a number of years. Updating value functions with different backup lengths provides advantages in different aspects, including bias and variance of value estimates, convergence speed, and exploration behavior of the agent. Conventional methods such as TD-lambda leverage these advantages by using a target value equivalent to an exponential average of different step returns. Nevertheless, integrating step returns into a single target sacrifices the diversity of the advantages offered by different step return targets. To address this issue, we propose Mixture Bootstrapped DQN (MB-DQN) built on top of bootstrapped DQN, and uses different backup lengths for different bootstrapped heads. MB-DQN enables heterogeneity of the target values that is unavailable in approaches relying only on a single target value. As a result, it is able to maintain the advantages offered by different backup lengths. In this paper, we first discuss the motivational insights through a simple maze environment. In order to validate the effectiveness of MB-DQN, we perform experiments on the Atari 2600 benchmark environments and demonstrate the performance improvement of MB-DQN over a number of baseline methods. 
We further provide a set of ablation studies to examine the impacts of different design configurations of MB-DQN.",/pdf/c455ade8e97b9391818a226509a2c500b861cecb.pdf,ICLR,2021,Utilize multi-step returns into Bootstrapped DQN +n5ej38Vfuup,XP3n_gADm9i-,1601310000000.0,1614990000000.0,3276,Deep Quotient Manifold Modeling,"[""~Jiseob_Kim1"", ""sjjung@bi.snu.ac.kr"", ""hdlee@bi.snu.ac.kr"", ""~Byoung-Tak_Zhang1""]","[""Jiseob Kim"", ""Seungjae Jung"", ""Hyundo Lee"", ""Byoung-Tak Zhang""]","[""deep generative models"", ""manifold learning""]","One of the difficulties in modeling real-world data is their complex multi-manifold structure due to discrete features. In this paper, we propose quotient manifold modeling (QMM), a new data-modeling scheme that considers generic manifold structure independent of discrete features, thereby deriving efficiency in modeling and allowing generalization over untrained manifolds. QMM considers a deep encoder inducing an equivalence between manifolds; but we show it is sufficient to consider it only implicitly via a bias-regularizer we derive. This makes QMM easily applicable to existing models such as GANs and VAEs, and experiments show that these models not only present superior FID scores but also make good generalizations across different datasets. In particular, we demonstrate an MNIST model that synthesizes EMNIST alphabets.",/pdf/4d130b56d3ac99b2ac43ad2800000193a7deb4f6.pdf,ICLR,2021,"We propose quotient manifold modeling, a new generative modeling scheme that considers generic manifold structure, thereby allowing generalizations over untrained manifolds." 
+S1xipR4FPB,H1eHamq_wr,1569440000000.0,1577170000000.0,1406,Teacher-Student Compression with Generative Adversarial Networks,"[""ruishan@stanford.edu"", ""lmackey@stanford.edu"", ""fusi@microsoft.com""]","[""Ruishan Liu"", ""Nicolo Fusi"", ""Lester Mackey""]",[],"More accurate machine learning models often demand more computation and memory at test time, making them difficult to deploy on CPU- or memory-constrained devices. Teacher-student compression (TSC), also known as distillation, alleviates this burden by training a less expensive student model to mimic the expensive teacher model while maintaining most of the original accuracy. However, when fresh data is unavailable for the compression task, the teacher's training data is typically reused, leading to suboptimal compression. In this work, we propose to augment the compression dataset with synthetic data from a generative adversarial network (GAN) designed to approximate the training data distribution. Our GAN-assisted TSC (GAN-TSC) significantly improves student accuracy for expensive models such as large random forests and deep neural networks on both tabular and image datasets. Building on these results, we propose a comprehensive metric—the TSC Score—to evaluate the quality of synthetic datasets based on their induced TSC performance. 
The TSC Score captures both data diversity and class affinity, and we illustrate its benefits over the popular Inception Score in the context of image classification.",/pdf/fd0e3200bca8c724c46a57aacde218a5f93d226f.pdf,ICLR,2020, +9sF3n8eAco,i_Ak6SOz4bU,1601310000000.0,1614990000000.0,1646,All-You-Can-Fit 8-Bit Flexible Floating-Point Format for Accurate and Memory-Efficient Inference of Deep Neural Networks,"[""~Juinn-Dar_Huang1"", ""~Cheng-Wei_Huang1"", ""~Tim-Wei_Chen1""]","[""Juinn-Dar Huang"", ""Cheng-Wei Huang"", ""Tim-Wei Chen""]","[""8-bit floating-point format"", ""accuracy loss minimization"", ""numerics"", ""memory-efficient inference"", ""deep learning""]","Modern deep neural network (DNN) models generally require a huge amount of weight and activation values to achieve good inference outcomes. Those data inevitably demand a massive off-chip memory capacity/bandwidth, and the situation gets even worse if they are represented in high-precision floating-point formats. Effort has been made for representing those data in different 8-bit floating-point formats, nevertheless, a notable accuracy loss is still unavoidable. In this paper we introduce an extremely flexible 8-bit floating-point (FFP8) format whose defining factors – the bit width of exponent/fraction field, the exponent bias, and even the presence of the sign bit – are all configurable. We also present a methodology to properly determine those factors so that the accuracy of model inference can be maximized. The foundation of this methodology is based on a key observation – both the maximum magnitude and the value distribution are quite dissimilar between weights and activations in most DNN models. Experimental results demonstrate that the proposed FFP8 format achieves an extremely low accuracy loss of $0.1\%\sim 0.3\%$ for several representative image classification models even without the need of model retraining. 
Besides, it is easy to turn a classical floating-point processing unit into an FFP8-compliant one, and the extra hardware cost is minor.",/pdf/93c33281fea918bdbcd3edc58fc0a0bbdd16909f.pdf,ICLR,2021,"An extremely flexible 8-bit floating-point format, where all parameters (bit width of sign/exponent/fraction field and exponent bias) are configurable, is proposed to achieve more accurate inference even without the need of model retraining." +ryl5CJSFPS,r1lcpPJFwB,1569440000000.0,1577170000000.0,2035,GENERALIZATION GUARANTEES FOR NEURAL NETS VIA HARNESSING THE LOW-RANKNESS OF JACOBIAN,"[""sametoymak@gmail.com"", ""zfabian@usc.edu"", ""mli176@ucr.edu"", ""msoltoon@gmail.com""]","[""Samet Oymak"", ""Zalan Fabian"", ""Mingchen Li"", ""Mahdi Soltanolkotabi""]","[""Theory of neural nets"", ""low-rank structure of Jacobian"", ""optimization and generalization theory""]","Modern neural network architectures often generalize well despite containing many more parameters than the size of the training dataset. This paper explores the generalization capabilities of neural networks trained via gradient descent. We develop a data-dependent optimization and generalization theory which leverages the low-rank structure of the Jacobian matrix associated with the network. Our results help demystify why training and generalization is easier on clean and structured datasets and harder on noisy and unstructured datasets as well as how the network size affects the evolution of the train and test errors during training. Specifically, we use a control knob to split the Jacobian spectrum into ``information"" and ``nuisance"" spaces associated with the large and small singular values. We show that over the information space learning is fast and one can quickly train a model with zero training loss that can also generalize well. Over the nuisance space training is slower and early stopping can help with generalization at the expense of some bias. 
We also show that the overall generalization capability of the network is controlled by how well the labels are aligned with the information space. A key feature of our results is that even constant width neural nets can provably generalize for sufficiently nice datasets. We conduct various numerical experiments on deep networks that corroborate our theoretical findings and demonstrate that: (i) the Jacobian of typical neural networks exhibit low-rank structure with a few large singular values and many small ones leading to a low-dimensional information space, (ii) over the information space learning is fast and most of the labels falls on this space, and (iii) label noise falls on the nuisance space and impedes optimization/generalization.",/pdf/a115824d8aaf25a4d9fc8c8cd600b16f60733f85.pdf,ICLR,2020,We empirically demonstrate that the Jacobian of neural networks exhibit a low-rank structure and harness this property to develop new optimization and generalization guarantees. +ByleB2CcKm,ByeZXaact7,1538090000000.0,1550900000000.0,1508,Learning Procedural Abstractions and Evaluating Discrete Latent Temporal Structure,"[""kgoel93@gmail.com"", ""ebrun@cs.stanford.edu""]","[""Karan Goel"", ""Emma Brunskill""]","[""learning procedural abstractions"", ""latent variable modeling"", ""evaluation criteria""]","Clustering methods and latent variable models are often used as tools for pattern mining and discovery of latent structure in time-series data. In this work, we consider the problem of learning procedural abstractions from possibly high-dimensional observational sequences, such as video demonstrations. Given a dataset of time-series, the goal is to identify the latent sequence of steps common to them and label each time-series with the temporal extent of these procedural steps. We introduce a hierarchical Bayesian model called Prism that models the realization of a common procedure across multiple time-series, and can recover procedural abstractions with supervision. 
We also bring to light two characteristics ignored by traditional evaluation criteria when evaluating latent temporal labelings (temporal clusterings) -- segment structure, and repeated structure -- and develop new metrics tailored to their evaluation. We demonstrate that our metrics improve interpretability and ease of analysis for evaluation on benchmark time-series datasets. Results on benchmark and video datasets indicate that Prism outperforms standard sequence models as well as state-of-the-art techniques in identifying procedural abstractions.",/pdf/92f0d580aea5fa8ee4288a7da135a90c09932c69.pdf,ICLR,2019, +SyMvJrdaW,SkZD1r_p-,1508560000000.0,1518730000000.0,33,Decoupling the Layers in Residual Networks,"[""ricky.fok3@gmail.com"", ""aan@cse.yorku.ca"", ""rashidi.zana@gmail.com"", ""stevenw@mathstat.yorku.ca""]","[""Ricky Fok"", ""Aijun An"", ""Zana Rashidi"", ""Xiaogang Wang""]","[""Warped residual networks"", ""residual networks""]","We propose a Warped Residual Network (WarpNet) using a parallelizable warp operator for forward and backward propagation to distant layers that trains faster than the original residual neural network. We apply a perturbation theory on residual networks and decouple the interactions between residual units. The resulting warp operator is a first order approximation of the output over multiple layers. The first order perturbation theory exhibits properties such as binomial path lengths and exponential gradient scaling found experimentally by Veit et al (2016). +We demonstrate through an extensive performance study that the proposed network achieves comparable predictive performance to the original residual network with the same number of parameters, while achieving a significant speed-up on the total training time. 
As WarpNet performs model parallelism in residual network training in which weights are distributed over different GPUs, it offers speed-up and capability to train larger networks compared to original residual networks.",/pdf/ff734f0bfe41aadd89d9688534b90a0ad868e223.pdf,ICLR,2018,We propose the Warped Residual Network using a parallelizable warp operator for forward and backward propagation to distant layers that trains faster than the original residual neural network. +BJg8_xHtPr,rkl7p3xtDH,1569440000000.0,1577170000000.0,2400,OBJECT-ORIENTED REPRESENTATION OF 3D SCENES,"[""chang.chen@rutgers.edu"", ""sjn.ahn@gmail.com""]","[""Chang Chen"", ""Sungjin Ahn""]","[""unsupervised learning"", ""representation learning"", ""3D scene decomposition"", ""3D detection""]","In this paper, we propose a generative model, called ROOTS (Representation of Object-Oriented Three-dimension Scenes), for unsupervised object-wise 3D-scene decomposition and rendering. For 3D scene modeling, ROOTS bases on the Generative Query Networks (GQN) framework, but unlike GQN, provides object-oriented representation decomposition. The inferred object-representation of ROOTS is 3D in the sense that it is viewpoint invariant as the full scene representation of GQN is so. ROOTS also provides hierarchical object-oriented representation: at 3D global-scene level and at 2D local-image level. We achieve this without performance degradation. 
In experiments on datasets of 3D rooms with multiple objects, we demonstrate the above properties by focusing on its abilities for disentanglement, compositionality, and generalization in comparison to GQN.",/pdf/3835be9f6ab1ec9a2242b0899ff67b3fa93e0de6.pdf,ICLR,2020, +rvosiWfMoMR,ZEwUgElqG9,1601310000000.0,1614990000000.0,3427,Automatic Music Production Using Generative Adversarial Networks,"[""~Giorgio_Barnab\u00f21"", ""~Giovanni_Trappolini1"", ""~Lorenzo_Lastilla1"", ""campagnano.1615033@studenti.uniroma1.it"", ""~Angela_Fan2"", ""~Fabio_Petroni1"", ""~Fabrizio_Silvestri2""]","[""Giorgio Barnab\u00f2"", ""Giovanni Trappolini"", ""Lorenzo Lastilla"", ""Cesare Campagnano"", ""Angela Fan"", ""Fabio Petroni"", ""Fabrizio Silvestri""]","[""music arrangement"", ""generative adversarial networks"", ""music generation""]","When talking about computer-based music generation, two are the main threads of research: the construction of $\textit{autonomous music-making systems}$, and the design of $\textit{computer-based environments to assist musicians}$. However, even though creating accompaniments for melodies is an essential part of every producer's and songwriter's work, little effort has been done in the field of automatic music arrangement in the audio domain. In this contribution, we propose a novel framework for $\textit{automatic music accompaniment}$ $\textit{in the Mel-frequency domain}$. Using several songs converted into Mel-spectrograms, a two-dimensional time-frequency representation of audio signals, we were able to automatically generate original arrangements for both bass and voice lines. Treating music pieces as images (Mel-spectrograms) allowed us to reformulate our problem as an $\textit{unpaired image-to-image translation}$ problem, and to tackle it with CycleGAN, a well-established framework. 
Moreover, the choice to deploy raw audio and Mel-spectrograms enabled us to more effectively model long-range dependencies, to better represent how humans perceive music, and to potentially draw sounds for new arrangements from the vast collection of music recordings accumulated in the last century. Our approach was tested on two different downstream tasks: given a bass line creating credible and on-time drums, and given an acapella song arranging it to a full song. In absence of an objective way of evaluating the output of music generative systems, we also defined a possible metric for the proposed task, partially based on human (and expert) judgment.",/pdf/c4cb548600c38c1361fc7ca2d9e463b8c12ac6f4.pdf,ICLR,2021,We propose a novel framework for music arrangement from raw audio in the frequency domain +HkgaETNtDB,rygm6BNwDB,1569440000000.0,1583910000000.0,503,Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models,"[""bloodwass@kaist.ac.kr"", ""kyunghyun.cho@nyu.edu"", ""wanmo.kang@kaist.edu""]","[""Cheolhyoung Lee"", ""Kyunghyun Cho"", ""Wanmo Kang""]","[""regularization"", ""finetuning"", ""dropout"", ""dropconnect"", ""adaptive L2-penalty"", ""BERT"", ""pretrained language model""]","In natural language processing, it has been observed recently that generalization could be greatly improved by finetuning a large-scale language model pretrained on a large unlabeled corpus. Despite its recent success and wide adoption, finetuning a large pretrained language model on a downstream task is prone to degenerate performance when there are only a small number of training instances available. In this paper, we introduce a new regularization technique, to which we refer as “mixout”, motivated by dropout. Mixout stochastically mixes the parameters of two models. We show that our mixout technique regularizes learning to minimize the deviation from one of the two models and that the strength of regularization adapts along the optimization trajectory. 
We empirically evaluate the proposed mixout and its variants on finetuning a pretrained language model on downstream tasks. More specifically, we demonstrate that the stability of finetuning and the average accuracy greatly increase when we use the proposed approach to regularize finetuning of BERT on downstream tasks in GLUE.",/pdf/59c0d10f928cbfbf9716975d3ad21923241e9a01.pdf,ICLR,2020, +rJgSV3AqKQ,rJx187qYYX,1538090000000.0,1545360000000.0,1447,Combining adaptive algorithms and hypergradient method: a performance and robustness study,"[""akram.er-raqabi@umontreal.ca"", ""nicolas@le-roux.name""]","[""Akram Erraqabi"", ""Nicolas Le Roux""]","[""optimization"", ""adaptive methods"", ""learning rate decay""]","Wilson et al. (2017) showed that, when the stepsize schedule is properly designed, stochastic gradient generalizes better than ADAM (Kingma & Ba, 2014). In light of recent work on hypergradient methods (Baydin et al., 2018), we revisit these claims to see if such methods close the gap between the most popular optimizers. As a byproduct, we analyze the true benefit of these hypergradient methods compared to more classical schedules, such as the fixed decay of Wilson et al. (2017). In particular, we observe they are of marginal help since their performance varies significantly when tuning their hyperparameters. Finally, as robustness is a critical quality of an optimizer, we provide a sensitivity analysis of these gradient based optimizers to assess how challenging their tuning is.",/pdf/8ecf4971fff5e08d16f8b3a45e73f7ad52932a84.pdf,ICLR,2019,"We provide a study trying to see how the recent online learning rate adaptation extends the conclusion made by Wilson et al. 2018 about adaptive gradient methods, along with comparison and sensitivity analysis." 
+BJAFbaolg,,1478380000000.0,1488570000000.0,589,Learning to Generate Samples from Noise through Infusion Training,"[""florian.bordes@umontreal.ca"", ""sina.honari@umontreal.ca"", ""pascal.vincent@umontreal.ca""]","[""Florian Bordes"", ""Sina Honari"", ""Pascal Vincent""]","[""Deep learning"", ""Unsupervised Learning""]","In this work, we investigate a novel training procedure to learn a generative model as the transition operator of a Markov chain, such that, when applied repeatedly on an unstructured random noise sample, it will denoise it into a sample that matches the target distribution from the training set. The novel training procedure to learn this progressive denoising operation involves sampling from a slightly different chain than the model chain used for generation in the absence of a denoising target. In the training chain we infuse information from the training target example that we would like the chains to reach with a high probability. The thus learned transition operator is able to produce quality and varied samples in a small number of steps. Experiments show competitive results compared to the samples generated with a basic Generative Adversarial Net. ",/pdf/388d81ae0dd044220d7a880fbc2b8d904de2eedf.pdf,ICLR,2017,"We learn a markov transition operator acting on inputspace, to denoise random noise into a target distribution. We use a novel target injection technique to guide the training." +B1xybgSKwB,SkxGF0ktPB,1569440000000.0,1577170000000.0,2121,Self-Attentional Credit Assignment for Transfer in Reinforcement Learning,"[""jferret@google.com"", ""raphaelm@google.com"", ""mfgeist@google.com"", ""pietquin@google.com""]","[""Johan Ferret"", ""Rapha\u00ebl Marinier"", ""Matthieu Geist"", ""Olivier Pietquin""]","[""reinforcement learning"", ""transfer learning"", ""credit assignment""]","The ability to transfer knowledge to novel environments and tasks is a sensible desiderata for general learning agents. 
Despite the apparent promises, transfer in RL is still an open and little exploited research area. In this paper, we take a brand-new perspective about transfer: we suggest that the ability to assign credit unveils structural invariants in the tasks that can be transferred to make RL more sample efficient. Our main contribution is Secret, a novel approach to transfer learning for RL that uses a backward-view credit assignment mechanism based on a self-attentive architecture. Two aspects are key to its generality: it learns to assign credit as a separate offline supervised process and exclusively modifies the reward function. Consequently, it can be supplemented by transfer methods that do not modify the reward function and it can be plugged on top of any RL algorithm.",/pdf/34673868c47510898bdc3c0f7999b38a352fa818.pdf,ICLR,2020,Secret is a transfer method for RL based on the transfer of credit assignment. +rylVHR4FPB,HylQKjIOvr,1569440000000.0,1583910000000.0,1108,Sampling-Free Learning of Bayesian Quantized Neural Networks,"[""jiahaosu@terpmail.umd.edu"", ""mcvitkov@caltech.edu"", ""furongh@cs.umd.edu""]","[""Jiahao Su"", ""Milan Cvitkovic"", ""Furong Huang""]","[""Bayesian neural networks"", ""Quantized neural networks""]","Bayesian learning of model parameters in neural networks is important in scenarios where estimates with well-calibrated uncertainty are important. In this paper, we propose Bayesian quantized networks (BQNs), quantized neural networks (QNNs) for which we learn a posterior distribution over their discrete parameters. We provide a set of efficient algorithms for learning and prediction in BQNs without the need to sample from their parameters or activations, which not only allows for differentiable learning in quantized models but also reduces the variance in gradients estimation. We evaluate BQNs on MNIST, Fashion-MNIST and KMNIST classification datasets compared against bootstrap ensemble of QNNs (E-QNN). 
We demonstrate BQNs achieve both lower predictive errors and better-calibrated uncertainties than E-QNN (with less than 20% of the negative log-likelihood).",/pdf/4e9ae8acdc2bbbe6d8141d2db25449c70422d445.pdf,ICLR,2020,"We propose Bayesian quantized networks, for which we learn a posterior distribution over their quantized parameters." +SJlyta4YPS,HyeUMj3PvS,1569440000000.0,1577170000000.0,654,DeepEnFM: Deep neural networks with Encoder enhanced Factorization Machine,"[""sunqiang85@gmail.com"", ""zhinancheng.bryan@gmail.com"", ""yanweifu@fudan.edu.cn"", ""wxwang.iris@gmail.com"", ""ygj@fudan.edu.cn"", ""xyxue@fudan.edu.cn""]","[""Qiang Sun"", ""Zhinan Cheng"", ""Yanwei Fu"", ""Wenxuan Wang"", ""Yu-Gang Jiang"", ""Xiangyang Xue""]","[""CTR"", ""Attention"", ""Transformer"", ""Encoder""]","Click Through Rate (CTR) prediction is a critical task in industrial applications, especially for online social and commerce applications. It is challenging to find a proper way to automatically discover the effective cross features in CTR tasks. We propose a novel model for CTR tasks, called Deep neural networks with Encoder enhanced Factorization Machine (DeepEnFM). Instead of learning the cross features directly, DeepEnFM adopts the Transformer encoder as a backbone to align the feature embeddings with the clues of other fields. The embeddings generated from encoder are beneficial for the further feature interactions. Particularly, DeepEnFM utilizes a bilinear approach to generate different similarity functions with respect to different field pairs. Furthermore, the max-pooling method makes DeepEnFM feasible to capture both the supplementary and suppressing information among different attention heads. 
Our model is validated on the Criteo and Avazu datasets, and achieves state-of-art performance.",/pdf/172f7a073710dcf32034814f99bd0b87fa2d863c.pdf,ICLR,2020,DNN and Encoder enhanced FM with bilinear attention and max-pooling for CTR +cvNYovr16SB,p2AONBQGElC,1601310000000.0,1614990000000.0,3013,Unsupervised Active Pre-Training for Reinforcement Learning,"[""~Hao_Liu1"", ""~Pieter_Abbeel2""]","[""Hao Liu"", ""Pieter Abbeel""]","[""Reinforcement Learning"", ""Unsupervised Learning"", ""Entropy Maximization"", ""Contrastive Learning"", ""Self-supervised Learning"", ""Exploration""]","We introduce a new unsupervised pre-training method for reinforcement learning called $\textbf{APT}$, which stands for $\textbf{A}\text{ctive}\textbf{P}\text{re-}\textbf{T}\text{raining}$. APT learns a representation and a policy initialization by actively searching for novel states in reward-free environments. We use the contrastive learning framework for learning the representation from collected transitions. The key novel idea is to collect data during pre-training by maximizing a particle based entropy computed in the learned latent representation space. By doing particle based entropy maximization, we alleviate the need for challenging density modeling and are thus able to scale our approach to image observations. APT successfully learns meaningful representations as well as policy initializations without using any reward. We empirically evaluate APT on the Atari game suite and DMControl suite by exposing task-specific reward to agent after a long unsupervised pre-training phase. On Atari games, APT achieves human-level performance on $12$ games and obtains highly competitive performance compared to canonical fully supervised RL algorithms. On DMControl suite, APT beats all baselines in terms of asymptotic performance and data efficiency and dramatically improves performance on tasks that are extremely difficult for training from scratch. 
Importantly, the pre-trained models can be fine-tuned to solve different tasks as long as the environment does not change. Finally, we also pre-train multi-environment encoders on data from multiple environments and show generalization to a broad set of RL tasks.",/pdf/9ac72e75d093e745cd5e2b2e45f794c5ccfbed65.pdf,ICLR,2021,"We propose APT, a reward-free pre-training approach which is based on maximizing particle-based entropy in contrastive representation space for learning pre-trained models that can be leveraged for solving downstream tasks efficiently" +cYr2OPNyTz7,yVr_BrDIhVO,1601310000000.0,1614990000000.0,2797,Improving Self-supervised Pre-training via a Fully-Explored Masked Language Model,"[""~Mingzhi_Zheng1"", ""~Dinghan_Shen1"", ""~yelong_shen1"", ""~Weizhu_Chen1"", ""~Lin_Xiao1""]","[""Mingzhi Zheng"", ""Dinghan Shen"", ""yelong shen"", ""Weizhu Chen"", ""Lin Xiao""]","[""representation learning"", ""natural language processing""]","Masked Language Model (MLM) framework has been widely adopted for self-supervised language pre-training. In this paper, we argue that randomly sampled masks in MLM would lead to undesirably large gradient variance. Thus, we theoretically quantify the gradient variance via correlating the gradient covariance with the Hamming distance between two different masks (given a certain text sequence). To reduce the variance due to the sampling of masks, we propose a fully-explored masking strategy, where a text sequence is divided into a certain number of non-overlapping segments. Thereafter, the tokens within one segment are masked for training. We prove, from a theoretical perspective, that the gradients derived from this new masking schema have a smaller variance and can lead to more efficient self-supervised training. We conduct extensive experiments on both continual pre-training and general pre-training from scratch. Empirical results confirm that this new masking strategy can consistently outperform standard random masking. 
Detailed efficiency analysis and ablation studies further validate the advantages of our fully-explored masking strategy under the MLM framework.",/pdf/12a234af3e2a32134c9c7df02f15efde990757fe.pdf,ICLR,2021,A novel masking strategy is proposed to improve the training efficiency of the Masked Language Model (MLM) framework. +B1l1qnEFwH,ByxTQYRaUr,1569440000000.0,1577170000000.0,101,Deep Audio Prior,"[""yapengtian@rochester.edu"", ""chenliang.xu@rochester.edu"", ""dinli@adobe.com""]","[""Yapeng Tian"", ""Chenliang Xu"", ""Dingzeyu Li""]","[""deep audio prior"", ""blind sound separation"", ""deep learning"", ""audio representation""]","Deep convolutional neural networks are known to specialize in distilling compact and robust prior from a large amount of data. We are interested in applying deep networks in the absence of training dataset. In this paper, we introduce deep audio prior (DAP) which leverages the structure of a network and the temporal information in a single audio file. Specifically, we demonstrate that a randomly-initialized neural network can be used with carefully designed audio prior to tackle challenging audio problems such as universal blind source separation, interactive audio editing, audio texture synthesis, and audio co-separation. + +To understand the robustness of the deep audio prior, we construct a benchmark dataset Universal-150 for universal sound source separation with a diverse set of sources. We show superior audio results than previous work on both qualitatively and quantitative evaluations. 
We also perform a thorough ablation study to validate our design choices.",/pdf/95237b1b77b7369930afa4964e33ce96ed63f932.pdf,ICLR,2020,a deep audio network that does not require any external training data +1Kxxduqpd3E,eQFZuIPbd6y,1601310000000.0,1614990000000.0,3517,Rotograd: Dynamic Gradient Homogenization for Multitask Learning,"[""~Adri\u00e1n_Javaloy1"", ""~Isabel_Valera1""]","[""Adri\u00e1n Javaloy"", ""Isabel Valera""]","[""multitask learning"", ""deep learning"", ""gradnorm""]","GradNorm (Chen et al., 2018) is a broadly used gradient-based approach for training multitask networks, where different tasks share, and thus compete during learning, for the network parameters. GradNorm eases the fitting of all individual tasks by dynamically equalizing the contribution of each task to the overall gradient magnitude. However, it does not prevent the individual tasks’ gradients from conflicting, i.e., pointing towards opposite directions, and thus resulting in a poor multitask performance. In this work we propose Rotograd, an extension to GradNorm that addresses this problem by dynamically homogenizing not only the gradient magnitudes but also their directions across tasks. For this purpose, Rotograd adds a layer of task-specific rotation matrices that aligns all the task gradients. Importantly, we then analyze Rotograd (and its predecessor) through the lens of game theory, providing theoretical guarantees on the algorithm stability and convergence. Finally, our experiments on several real-world datasets and network architectures show that Rotograd outperforms previous approaches for multitask learning. + +",/pdf/e70b1e6005adfec9a0672806889087bb7fdc5b83.pdf,ICLR,2021,Rotograd is a gradient based multitask learning approach that dynamically homogenizes the gradient magnitudes and directions across tasks. 
+HyefgnCqFm,B1esL46qYX,1538090000000.0,1545360000000.0,1054,Learning Partially Observed PDE Dynamics with Neural Networks,"[""ayedibrahim@gmail.com"", ""emmanuel.de-bezenac@lip6.fr"", ""arthur.pajot@lip6.fr"", ""patrick.gallinari@lip6.fr""]","[""Ibrahim Ayed"", ""Emmanuel De B\u00e9zenac"", ""Arthur Pajot"", ""Patrick Gallinari""]","[""deep learning"", ""spatio-temporal dynamics"", ""physical processes"", ""differential equations"", ""dynamical systems""]","Spatio-Temporal processes bear a central importance in many applied scientific fields. Generally, differential equations are used to describe these processes. In this work, we address the problem of learning spatio-temporal dynamics with neural networks when only partial information on the system's state is available. Taking inspiration from the dynamical system approach, we outline a general framework in which complex dynamics generated by families of differential equations can be learned in a principled way. Two models are derived from this framework. We demonstrate how they can be applied in practice by considering the problem of forecasting fluid flows. We show how the underlying equations fit into our formalism and evaluate our method by comparing with standard baselines.",/pdf/aed2c0d36425c2f25a4815ef9ea85ee383149a75.pdf,ICLR,2019, +rkPLzgZAZ,Hy88fgbA-,1509130000000.0,1518730000000.0,561,Modular Continual Learning in a Unified Visual Environment,"[""feigelis@stanford.edu"", ""bsheffer@stanford.edu"", ""yamins@stanford.edu""]","[""Kevin T. Feigelis"", ""Blue Sheffer"", ""Daniel L. K. Yamins""]","[""Continual Learning"", ""Neural Modules"", ""Interface Learning"", ""Task Switching"", ""Reinforcement Learning"", ""Visual Decision Making""]"," A core aspect of human intelligence is the ability to learn new tasks quickly and switch between them flexibly. Here, we describe a modular continual reinforcement learning paradigm inspired by these abilities. 
We first introduce a visual interaction environment that allows many types of tasks to be unified in a single framework. We then describe a reward map prediction scheme that learns new tasks robustly in the very large state and action spaces required by such an environment. We investigate how properties of module architecture influence efficiency of task learning, showing that a module motif incorporating specific design principles (e.g. early bottlenecks, low-order polynomial nonlinearities, and symmetry) significantly outperforms more standard neural network motifs, needing fewer training examples and fewer neurons to achieve high levels of performance. Finally, we present a meta-controller architecture for task switching based on a dynamic neural voting scheme, which allows new modules to use information learned from previously-seen tasks to substantially improve their own learning efficiency. ",/pdf/da129bffd8e78561ae75e6dadcf9902a3ed4b88a.pdf,ICLR,2018,We propose a neural module approach to continual learning using a unified visual environment with a large action space. +HJel76NYPS,HJxAqlC8wS,1569440000000.0,1577170000000.0,437,Collaborative Generated Hashing for Market Analysis and Fast Cold-start Recommendation,"[""yixianqianzy@gmail.com"", ""ivor.tsang@uts.edu.au"", ""lxduan@gmail.com"", ""guowu@uestc.edu.cn""]","[""Yan Zhang"", ""Ivor W. Tsang"", ""Lixin Duan"", ""Guowu Yang""]","[""Recommender system"", ""generated model"", ""market analysis"", ""hash"", ""cold start""]","Cold-start and efficiency issues of the Top-k recommendation are critical to large-scale recommender systems. Previous hybrid recommendation methods are effective to deal with the cold-start issues by extracting real latent factors of cold-start items(users) from side information, but they still suffer low efficiency in online recommendation caused by the expensive similarity search in real latent space. 
This paper presents a collaborative generated hashing (CGH) to improve the efficiency by denoting users and items as binary codes, which applies to various settings: cold-start users, cold-start items and warm-start ones. Specifically, CGH is designed to learn hash functions of users and items through the Minimum Description Length (MDL) principle; thus, it can deal with various recommendation settings. In addition, CGH initiates a new marketing strategy through mining potential users by a generative step. To reconstruct effective users, the MDL principle is used to learn compact and informative binary codes from the content data. Extensive experiments on two public datasets show the advantages for recommendations in various settings over competing baselines and analyze the feasibility of the application in marketing.",/pdf/b8a2c3083c13e8abc2afeebea61322a1ba00d076.pdf,ICLR,2020,It can generate effective hash codes for efficient cold-start recommendation and meanwhile provide a feasible marketing strategy. +HJgpugrKPS,rJgxU6eKPH,1569440000000.0,1591620000000.0,2415,Scale-Equivariant Steerable Networks,"[""sosnovikivan@gmail.com"", ""szmajamichal@gmail.com"", ""a.w.m.smeulders@uva.nl""]","[""Ivan Sosnovik"", ""Micha\u0142 Szmaja"", ""Arnold Smeulders""]","[""Scale Equivariance"", ""Steerable Filters""]","The effectiveness of Convolutional Neural Networks (CNNs) has been substantially attributed to their built-in property of translation equivariance. However, CNNs do not have embedded mechanisms to handle other types of transformations. In this work, we pay attention to scale changes, which regularly appear in various tasks due to the changing distances between the objects and the camera. First, we introduce the general theory for building scale-equivariant convolutional networks with steerable filters. We develop scale-convolution and generalize other common blocks to be scale-equivariant. 
We demonstrate the computational efficiency and numerical stability of the proposed method. We compare the proposed models to the previously developed methods for scale equivariance and local scale invariance. We demonstrate state-of-the-art results on the MNIST-scale dataset and on the STL-10 dataset in the supervised learning setting.",/pdf/ce6c00156479706a1ff9e63741b0551427f47911.pdf,ICLR,2020, +Rd138pWXMvG,ARtyFR2c4Hy,1601310000000.0,1615290000000.0,3112,A statistical theory of cold posteriors in deep neural networks,"[""~Laurence_Aitchison1""]","[""Laurence Aitchison""]","[""Bayesian inference"", ""cold posteriors"", ""sgld""]","To get Bayesian neural networks to perform comparably to standard neural networks it is usually necessary to artificially reduce uncertainty using a tempered or cold posterior. This is extremely concerning: if the prior is accurate, Bayes inference/decision theory is optimal, and any artificial changes to the posterior should harm performance. While this suggests that the prior may be at fault, here we argue that in fact, BNNs for image classification use the wrong likelihood. In particular, standard image benchmark datasets such as CIFAR-10 are carefully curated. 
We develop a generative model describing curation which gives a principled Bayesian account of cold posteriors, because the likelihood under this new generative model closely matches the tempered likelihoods used in past work.",/pdf/ad6b61823bafd130bfd5c821fd1ceb7913a54d2d.pdf,ICLR,2021,We develop a generative model of dataset curation that explains the cold-posterior effect +r1eBeyHFDH,SkeO74o_wS,1569440000000.0,1594220000000.0,1504,A Theory of Usable Information under Computational Constraints,"[""xuyilun@pku.edu.cn"", ""sjzhao@stanford.edu"", ""tsong@cs.stanford.edu"", ""russell.sb.nebel@gmail.com"", ""ermon@cs.stanford.edu""]","[""Yilun Xu"", ""Shengjia Zhao"", ""Jiaming Song"", ""Russell Stewart"", ""Stefano Ermon""]",[],"We propose a new framework for reasoning about information in complex systems. Our foundation is based on a variational extension of Shannon’s information theory that takes into account the modeling power and computational constraints of the observer. The resulting predictive V-information encompasses mutual information and other notions of informativeness such as the coefficient of determination. Unlike Shannon’s mutual information and in violation of the data processing inequality, V-information can be created through computation. This is consistent with deep neural networks extracting hierarchies of progressively more informative features in representation learning. Additionally, we show that by incorporating computational constraints, V-information can be reliably estimated from data even in high dimensions with PAC-style guarantees. Empirically, we demonstrate predictive V-information is more effective than mutual information for structure learning and fair representation learning. 
Codes are available at https://github.com/Newbeeer/V-information .",/pdf/8d57c8f0f8fa4453c6e1f25f4d364e1d23bf0d3e.pdf,ICLR,2020, +rJ6iJmWCW,HkjVJXWR-,1509140000000.0,1518730000000.0,1095,POLICY DRIVEN GENERATIVE ADVERSARIAL NETWORKS FOR ACCENTED SPEECH GENERATION,"[""prannayk@iitk.ac.in"", ""pjyothi@cse.iitb.ac.in"", ""vinaypn@cse.iitk.ac.in"", ""msrinivasan@nvidia.com""]","[""Prannay Khosla"", ""Preethi Jyothi"", ""Vinay P. Namboodiri"", ""Mukundhan Srinivasan""]","[""speech"", ""generation"", ""accent"", ""gan"", ""adversarial"", ""reinforcement"", ""memory"", ""lstm"", ""policy"", ""gradients"", ""human""]","In this paper, we propose the generation of accented speech using generative adversarial +networks. Through this work we make two main contributions a) The +ability to condition latent representations while generating realistic speech samples +b) The ability to efficiently generate long speech samples by using a novel +latent variable transformation module that is trained using policy gradients. Previous +methods are limited in being able to generate only relatively short samples +or are not very efficient at generating long samples. The generated speech samples +are validated through a number of various evaluation measures viz, a WGAN +critic loss and through subjective scores on user evaluations against competitive +speech synthesis baselines and detailed ablation analysis of the proposed model. 
+The evaluations demonstrate that the model generates realistic long speech samples +conditioned on accent efficiently.",/pdf/42b2803ce78dc167a45e73cbb6d1a1e2fd7384e1.pdf,ICLR,2018, +rJe8pxSFwr,S1gCzWbtPH,1569440000000.0,1577170000000.0,2574,End-to-end learning of energy-based representations for irregularly-sampled signals and images,"[""ronan.fablet@imt-atlantique.fr"", ""lucas.drumetz@imt-atlantique.fr"", ""francois.rousseau@imt-atlantique.fr""]","[""Ronan Fablet"", ""Lucas Drumetz"", ""Fran\u00e7ois Rousseau""]","[""end-to-end-learning"", ""irregularly-sampled data"", ""energy representations"", ""optimal interpolation""]","For numerous domains, including for instance earth observation, medical imaging, astrophysics,..., available image and signal datasets often irregular space-time sampling patterns and large missing data rates. These sampling properties is a critical issue to apply state-of-the-art learning-based (e.g., auto-encoders, CNNs,...) to fully benefit from the available large-scale observations and reach breakthroughs in the reconstruction and identification of processes of interest. In this paper, we address the end-to-end learning of representations of signals, images and image sequences from irregularly-sampled data, {\em i.e.} when the training data involved missing data. From an analogy to Bayesian formulation, we consider energy-based representations. Two energy forms are investigated: one derived from auto-encoders and one relating to Gibbs energies. The learning stage of these energy-based representations (or priors) involve a joint interpolation issue, which resorts to solving an energy minimization problem under observation constraints. Using a neural-network-based implementation of the considered energy forms, we can state an end-to-end learning scheme from irregularly-sampled data. 
We demonstrate the relevance of the proposed representations for different case-studies: namely, multivariate time series, 2D images and image sequences.",/pdf/62920292397c23d4bb5d3e3018070b8654a39baa.pdf,ICLR,2020,We address the end-to-end learning of energy-based representations for signal and image observation dataset with irregular sampling patterns. +SkgkJn05YX,Hke6nDo9Fm,1538090000000.0,1578340000000.0,951,RANDOM MASK: Towards Robust Convolutional Neural Networks,"[""luotg@pku.edu.cn"", ""caitianle1998@pku.edu.cn"", ""zhan147@usc.edu"", ""siyuchen@pku.edu.cn"", ""wanglw@cis.pku.edu.cn""]","[""Tiange Luo"", ""Tianle Cai"", ""Mengxiao Zhang"", ""Siyu Chen"", ""Liwei Wang""]","[""adversarial examples"", ""robust machine learning"", ""cnn structure"", ""metric"", ""deep feature representations""]","Robustness of neural networks has recently been highlighted by the adversarial examples, i.e., inputs added with well-designed perturbations which are imperceptible to humans but can cause the network to give incorrect outputs. In this paper, we design a new CNN architecture that by itself has good robustness. We introduce a simple but powerful technique, Random Mask, to modify existing CNN structures. We show that CNN with Random Mask achieves state-of-the-art performance against black-box adversarial attacks without applying any adversarial training. We next investigate the adversarial examples which “fool” a CNN with Random Mask. Surprisingly, we find that these adversarial examples often “fool” humans as well. This raises fundamental questions on how to define adversarial examples and robustness properly.",/pdf/065faae5add172add31a55eb6b5c3e2b514086d1.pdf,ICLR,2019,"We propose a technique that modifies CNN structures to enhance robustness while keeping high test accuracy, and raise doubt on whether current definition of adversarial examples is appropriate by generating adversarial examples able to fool humans." 
+Byxr73R5FQ,Hye8Ba2qF7,1538090000000.0,1545360000000.0,1354,Successor Options : An Option Discovery Algorithm for Reinforcement Learning,"[""manan.tomar@gmail.com"", ""rahul13ramesh@gmail.com"", ""ravi@cse.iitm.ac.in""]","[""Manan Tomar*"", ""Rahul Ramesh*"", ""Balaraman Ravindran""]","[""Hierarchical Reinforcement Learning""]","Hierarchical Reinforcement Learning is a popular method to exploit temporal abstractions in order to tackle the curse of dimensionality. The options framework is one such hierarchical framework that models the notion of skills or options. However, learning a collection of task-agnostic transferable skills is a challenging task. Option discovery typically entails using heuristics, the majority of which revolve around discovering bottleneck states. In this work, we adopt a method complementary to the idea of discovering bottlenecks. Instead, we attempt to discover ``landmark"" sub-goals which are prototypical states of well connected regions. These sub-goals are points from which densely connected set of states are easily accessible. We propose a new model called Successor options that leverages Successor Representations to achieve the same. We also design a novel pseudo-reward for learning the intra-option policies. Additionally, we describe an Incremental Successor options model that iteratively builds options and explores in environments where exploration through primitive actions is inadequate to form the Successor Representations. Finally, we demonstrate the efficacy of our approach on a collection of grid worlds and on complex high dimensional environments like Deepmind-Lab. 
+",/pdf/d69ac7f7bf46d73eec624abb16c14f05dc577575.pdf,ICLR,2019,An option discovery method for Reinforcement Learning using the Successor Representation +SkffVjUaW,B1-MViLpb,1508450000000.0,1518730000000.0,21,Building effective deep neural networks one feature at a time,"[""mundt@fias.uni-frankfurt.de"", ""weis@ccc.cs.uni-frankfurt.de"", ""kishore.konda@insofe.edu.in"", ""ramesh@fias.uni-frankfurt.de""]","[""Martin Mundt"", ""Tobias Weis"", ""Kishore Konda"", ""Visvanathan Ramesh""]","[""convolution neural networks"", ""architecture search"", ""meta-learning"", ""representational capacity""]","Successful training of convolutional neural networks is often associated with suffi- +ciently deep architectures composed of high amounts of features. These networks +typically rely on a variety of regularization and pruning techniques to converge +to less redundant states. We introduce a novel bottom-up approach to expand +representations in fixed-depth architectures. These architectures start from just a +single feature per layer and greedily increase width of individual layers to attain +effective representational capacities needed for a specific task. While network +growth can rely on a family of metrics, we propose a computationally efficient +version based on feature time evolution and demonstrate its potency in determin- +ing feature importance and a networks’ effective capacity. We demonstrate how +automatically expanded architectures converge to similar topologies that benefit +from lesser amount of parameters or improved accuracy and exhibit systematic +correspondence in representational complexity with the specified task. 
In contrast +to conventional design patterns with a typical monotonic increase in the amount of +features with increased depth, we observe that CNNs perform better when there is +more learnable parameters in intermediate, with falloffs to earlier and later layers.",/pdf/8100928dc43121b2c543537f6f03fd071fdd8180.pdf,ICLR,2018,A bottom-up algorithm that expands CNNs starting with one feature per layer to architectures with sufficient representational capacity. +rJg9OANFwS,ryelXJuuDS,1569440000000.0,1577170000000.0,1221,Topic Models with Survival Supervision: Archetypal Analysis and Neural Approaches,"[""georgechen@cmu.edu"", ""linhongl@andrew.cmu.edu"", ""renzuo.wren@gmail.com"", ""acoston@cs.cmu.edu"", ""jeremyweiss@cmu.edu""]","[""George H. Chen"", ""Linhong Li"", ""Ren Zuo"", ""Amanda Coston"", ""Jeremy C. Weiss""]",[],"We introduce two approaches to topic modeling supervised by survival analysis. Both approaches predict time-to-event outcomes while simultaneously learning topics over features that help prediction. The high-level idea is to represent each data point as a distribution over topics using some underlying topic model. Then each data point's distribution over topics is fed as input to a survival model. The topic and survival models are jointly learned. The two approaches we propose differ in the generality of topic models they can learn. The first approach finds topics via archetypal analysis, a nonnegative matrix factorization method that optimizes over a wide class of topic models encompassing latent Dirichlet allocation (LDA), correlated topic models, and topic models based on the ``anchor word'' assumption; the resulting survival-supervised variant solves an alternating minimization problem. Our second approach builds on recent work that approximates LDA in a neural net framework. We add a survival loss layer to this neural net to form an approximation to survival-supervised LDA. 
Both of our approaches can be combined with a variety of survival models. We demonstrate our approach on two survival datasets, showing that survival-supervised topic models can achieve competitive time-to-event prediction accuracy while outputting clinically interpretable topics.",/pdf/4c329e488cd12ff024884da5a3cf1f1bf7727c2d.pdf,ICLR,2020, +SyVOjfbRb,r1rwoGZAW,1509140000000.0,1518730000000.0,939,LSH-SAMPLING BREAKS THE COMPUTATIONAL CHICKEN-AND-EGG LOOP IN ADAPTIVE STOCHASTIC GRADIENT ESTIMATION,"[""beidi.chen@rice.edu"", ""yingchen.xu@rice.edu"", ""anshumali@rice.edu""]","[""Beidi Chen"", ""Yingchen Xu"", ""Anshumali Shrivastava""]","[""Stochastic Gradient Descent"", ""Optimization"", ""Sampling"", ""Estimation""]","Stochastic Gradient Descent or SGD is the most popular optimization algorithm for large-scale problems. SGD estimates the gradient by uniform sampling with sample size one. There have been several other works that suggest faster epoch wise convergence by using weighted non-uniform sampling for better gradient estimates. Unfortunately, the per-iteration cost of maintaining this adaptive distribution for gradient estimation is more than calculating the full gradient. As a result, the false impression of faster convergence in iterations leads to slower convergence in time, which we call a chicken-and-egg loop. In this paper, we break this barrier by providing the first demonstration of a sampling scheme, which leads to superior gradient estimation, while keeping the sampling cost per iteration similar to that of the uniform sampling. Such an algorithm is possible due to the sampling view of Locality Sensitive Hashing (LSH), which came to light recently. As a consequence of superior and fast estimation, we reduce the running time of all existing gradient descent algorithms. 
We demonstrate the benefits of our proposal on both SGD and AdaGrad.",/pdf/8e9ff307abb6c03d5f9613f697b6fbd4f7435692.pdf,ICLR,2018,We improve the running of all existing gradient descent algorithms. +D9pSaTGUemb,cunYmVTtsCn,1601310000000.0,1614990000000.0,3576,Implicit Acceleration of Gradient Flow in Overparameterized Linear Models,"[""~Salma_Tarmoun1"", ""~Guilherme_Fran\u00e7a1"", ""~Benjamin_David_Haeffele1"", ""~Rene_Vidal1""]","[""Salma Tarmoun"", ""Guilherme Fran\u00e7a"", ""Benjamin David Haeffele"", ""Rene Vidal""]",[],"We study the implicit acceleration of gradient flow in over-parameterized two-layer linear models. We show that implicit acceleration emerges from a conservation law that constrains the dynamics to follow certain trajectories. More precisely, gradient flow preserves the difference of the Gramian~matrices of the input and output weights and we show that the amount of acceleration depends on both the magnitude of that difference (which is fixed at initialization) and the spectrum of the data. In addition, and generalizing prior work, we prove our results without assuming small, balanced or spectral initialization for the weights, and establish interesting connections between the matrix factorization problem and Riccati type differential equations.",/pdf/f5b2842ced6c7dff84dbd5375532bdf4c72d5173.pdf,ICLR,2021, +tckGH8K9y6o,5FWLkYaoO5a,1601310000000.0,1614990000000.0,1010,Symmetric Wasserstein Autoencoders,"[""~Sun_Sun1"", ""~Hongyu_Guo1""]","[""Sun Sun"", ""Hongyu Guo""]","[""generative models"", ""variational autoencoders""]","Leveraging the framework of Optimal Transport, we introduce a new family of generative autoencoders with a learnable prior, called Symmetric Wasserstein Autoencoders (SWAEs). We propose to symmetrically match the joint distributions of the observed data and the latent representation induced by the encoder and the decoder. 
The resultant algorithm jointly optimizes the modelling losses in both the data and the latent spaces with the loss in the data space leading to the denoising effect. With the symmetric treatment of the data and the latent representation, the algorithm implicitly preserves the local structure of the data in the latent space. To further improve the latent representation, we incorporate a reconstruction loss into the objective, which significantly benefits both the generation and reconstruction. We empirically show the superior performance of SWAEs over several state-of-the-art generative autoencoders in terms of classification, reconstruction, and generation.",/pdf/407226e7bdf9f60116f69b12845fbda80d27a4db.pdf,ICLR,2021,"We introduce a new family of generative autoencoders with a learnable prior, called Symmetric Wasserstein Autoencoders." +1AyPW2Emp6,SroGW-jS6VV,1601310000000.0,1614990000000.0,2218,Tight Second-Order Certificates for Randomized Smoothing,"[""~Alexander_Levine2"", ""aounon@umd.edu"", ""~Tom_Goldstein1"", ""~Soheil_Feizi2""]","[""Alexander Levine"", ""Aounon Kumar"", ""Tom Goldstein"", ""Soheil Feizi""]","[""certificates"", ""adversarial"", ""robustness"", ""defenses"", ""smoothing"", ""curvature""]","Randomized smoothing is a popular way of providing robustness guarantees against adversarial attacks: randomly-smoothed functions have a universal Lipschitz-like bound, allowing for robustness certificates to be easily computed. In this work, we show that there also exists a universal curvature-like bound for Gaussian random smoothing: given the exact value and gradient of a smoothed function, we compute a lower bound on the distance of a point to its closest adversarial example, called the Second-order Smoothing (SoS) robustness certificate. In addition to proving the correctness of this novel certificate, we show that SoS certificates are realizable and therefore tight. 
Interestingly, we show that the maximum achievable benefits, in terms of certified robustness, from using the additional information of the gradient norm are relatively small: because our bounds are tight, this is a fundamental negative result. The gain of SoS certificates further diminishes if we consider the estimation error of the gradient norms, for which we have developed an estimator. We therefore additionally develop a variant of Gaussian smoothing, called Gaussian dipole smoothing, which provides similar bounds to randomized smoothing with gradient information, but with much-improved sample efficiency. This allows us to achieve (marginally) improved robustness certificates on high-dimensional datasets such as CIFAR-10 and ImageNet.",/pdf/4c0b4423c4a3d333109157743b960b4757b9796b.pdf,ICLR,2021,We provide a tight robustness certificate for Gaussian smoothed classifiers using the gradient of the smoothed classifier in addition to its value. +H15RufWAW,ByUROzbCW,1509140000000.0,1518730000000.0,876,GraphGAN: Generating Graphs via Random Walks,"[""a.bojchevski@in.tum.de"", ""shchur@in.tum.de"", ""daniel.zuegner@gmail.com"", ""guennemann@in.tum.de""]","[""Aleksandar Bojchevski"", ""Oleksandr Shchur"", ""Daniel Z\u00fcgner"", ""Stephan G\u00fcnnemann""]","[""GAN"", ""graphs"", ""random walks"", ""implicit generative models""]","We propose GraphGAN - the first implicit generative model for graphs that enables to mimic real-world networks. +We pose the problem of graph generation as learning the distribution of biased random walks over a single input graph. +Our model is based on a stochastic neural network that generates discrete output samples, and is trained using the Wasserstein GAN objective. GraphGAN enables us to generate sibling graphs, which have similar properties yet are not exact replicas of the original graph. Moreover, GraphGAN learns a semantic mapping from the latent input space to the generated graph's properties. 
We discover that sampling from certain regions of the latent space leads to varying properties of the output graphs, with smooth transitions between them. Strong generalization properties of GraphGAN are highlighted by its competitive performance in link prediction as well as promising results on node classification, even though not specifically trained for these tasks.",/pdf/8c5e5e47f68250aa71704879c1c2d3ebffef4db5.pdf,ICLR,2018,Using GANs to generate graphs via random walks. +rke7geHtwH,rJxdj3JYwB,1569440000000.0,1591890000000.0,2093,Keep Doing What Worked: Behavior Modelling Priors for Offline Reinforcement Learning,"[""siegeln@google.com"", ""springenberg@google.com"", ""befelix@inf.ethz.ch"", ""aabdolmaleki@google.com"", ""neunertm@google.com"", ""thomaslampe@google.com"", ""rhafner@google.com"", ""heess@google.com"", ""riedmiller@google.com""]","[""Noah Siegel"", ""Jost Tobias Springenberg"", ""Felix Berkenkamp"", ""Abbas Abdolmaleki"", ""Michael Neunert"", ""Thomas Lampe"", ""Roland Hafner"", ""Nicolas Heess"", ""Martin Riedmiller""]","[""Reinforcement Learning"", ""Off-policy"", ""Multitask"", ""Continuous Control""]","Off-policy reinforcement learning algorithms promise to be applicable in settings where only a fixed data-set (batch) of environment interactions is available and no new experience can be acquired. This property makes these algorithms appealing for real world problems such as robot control. In practice, however, standard off-policy algorithms fail in the batch setting for continuous control. In this paper, we propose a simple solution to this problem. It admits the use of data generated by arbitrary behavior policies and uses a learned prior -- the advantage-weighted behavior model (ABM) -- to bias the RL policy towards actions that have previously been executed and are likely to be successful on the new task. Our method can be seen as an extension of recent work on batch-RL that enables stable learning from conflicting data-sources. 
We find improvements on competitive baselines in a variety of RL tasks -- including standard continuous control benchmarks and multi-task learning for simulated and real-world robots. ",/pdf/7a837306b7576a90082f30f3b81cfadb1515fdfd.pdf,ICLR,2020,"We develop a method for stable offline reinforcement learning from logged data. The key is to regularize the RL policy towards a learned ""advantage weighted"" model of the data." +BkggGREKvS,SkguGqEOPS,1569440000000.0,1577170000000.0,989,Promoting Coordination through Policy Regularization in Multi-Agent Deep Reinforcement Learning,"[""paul.b.barde@gmail.com"", ""jul.roy1311@gmail.com"", ""c212.felixh@gmail.com"", ""derek@cim.mcgill.ca"", ""christopher.pal@polymtl.ca""]","[""Paul Barde"", ""Julien Roy"", ""F\u00e9lix G. Harvey"", ""Derek Nowrouzezahrai"", ""Christopher Pal""]","[""Reinforcement Learning"", ""Multi-Agent"", ""Continuous Control"", ""Regularization"", ""Coordination"", ""Inductive biases""]","A central challenge in multi-agent reinforcement learning is the induction of coordination between agents of a team. In this work, we investigate how to promote inter-agent coordination using policy regularization and discuss two possible avenues respectively based on inter-agent modelling and synchronized sub-policy selection. We test each approach in four challenging continuous control tasks with sparse rewards and compare them against three baselines including MADDPG, a state-of-the-art multi-agent reinforcement learning algorithm. To ensure a fair comparison, we rely on a thorough hyper-parameter selection and training methodology that allows a fixed hyper-parameter search budget for each algorithm and environment. We consequently assess both the hyper-parameter sensitivity, sample-efficiency and asymptotic performance of each learning method. Our experiments show that the proposed methods lead to significant improvements on cooperative problems. 
We further analyse the effects of the proposed regularizations on the behaviors learned by the agents.",/pdf/788cce74b3c6db550a2c5e4699e31192d2923f30.pdf,ICLR,2020,We propose regularization objectives for multi-agent RL algorithms that foster coordination on cooperative tasks. +rJe_cyrKPB,S1lV39R_DS,1569440000000.0,1577170000000.0,1883,GroSS Decomposition: Group-Size Series Decomposition for Whole Search-Space Training,"[""henryhj@robots.ox.ac.uk"", ""kate@robots.ox.ac.uk"", ""victor@robots.ox.ac.uk""]","[""Henry Howard-Jenkins"", ""Yiwen Li"", ""Victor Adrian Prisacariu""]","[""architecture search"", ""block term decomposition"", ""network decomposition"", ""network acceleration"", ""group convolution""]","We present Group-size Series (GroSS) decomposition, a mathematical formulation of tensor factorisation into a series of approximations of increasing rank terms. GroSS allows for dynamic and differentiable selection of factorisation rank, which is analogous to a grouped convolution. Therefore, to the best of our knowledge, GroSS is the first method to simultaneously train differing numbers of groups within a single layer, as well as all possible combinations between layers. In doing so, GroSS trains an entire grouped convolution architecture search-space concurrently. We demonstrate this with a proof-of-concept exhaustive architecture search with a performance objective. GroSS represents a significant step towards liberating network architecture search from the burden of training and finetuning.",/pdf/0448de88c011f64503c25cc4c3ee2ea214a2bfa3.pdf,ICLR,2020,A decomposition method which allows for simultaneous training of an entire search space of group convolution architectures. 
+BJeem3C9F7,Byg4dbo5YQ,1538090000000.0,1545360000000.0,1327,Pix2Scene: Learning Implicit 3D Representations from Images,"[""rajsai24@gmail.com"", ""fmannan@gmail.com"", ""florian.golemo@inria.fr"", ""dvazquez@cvc.uab.es"", ""dereknow@gmail.com"", ""aaron.courville@gmail.com""]","[""Sai Rajeswar"", ""Fahim Mannan"", ""Florian Golemo"", ""David Vazquez"", ""Derek Nowrouzezahrai"", ""Aaron Courville""]","[""Representation learning"", ""generative model"", ""adversarial learning"", ""implicit 3D generation"", ""scene generation""]","Modelling 3D scenes from 2D images is a long-standing problem in computer vision with implications in, e.g., simulation and robotics. We propose pix2scene, a deep generative-based approach that implicitly models the geometric properties of a scene from images. Our method learns the depth and orientation of scene points visible in images. Our model can then predict the structure of a scene from various, previously unseen view points. It relies on a bi-directional adversarial learning mechanism to generate scene representations from a latent code, inferring the 3D representation of the underlying scene geometry. We showcase a novel differentiable renderer to train the 3D model in an end-to-end fashion, using only images. We demonstrate the generative ability of our model qualitatively on both a custom dataset and on ShapeNet. 
Finally, we evaluate the effectiveness of the learned 3D scene representation in supporting 3D spatial reasoning.",/pdf/3987a05476adb8b8de429d2bd2fe9d8b1d6caade.pdf,ICLR,2019,pix2scene: a deep generative based approach for implicitly modelling the geometrical properties of a 3D scene from images +xsx58rmaW2p,j-24yOUIDvx,1601310000000.0,1614990000000.0,977,Making Coherence Out of Nothing At All: Measuring Evolution of Gradient Alignment,"[""~Satrajit_Chatterjee1"", ""zielinski@google.com""]","[""Satrajit Chatterjee"", ""Piotr Zielinski""]","[""generalization"", ""deep learning""]","We propose a new metric ($m$-coherence) to experimentally study the alignment of per-example gradients during training. Intuitively, given a sample of size $m$, $m$-coherence is the number of examples in the sample that benefit from a small step along the gradient of any one example on average. We show that compared to other commonly used metrics, $m$-coherence is more interpretable, cheaper to compute ($O(m)$ instead of $O(m^2)$) and mathematically cleaner. (We note that $m$-coherence is closely connected to gradient diversity, a quantity previously used in some theoretical bounds.) Using $m$-coherence, we study the evolution of alignment of per-example gradients in ResNet and EfficientNet models on ImageNet and several variants with label noise, particularly from the perspective of the recently proposed Coherent Gradients (CG) theory that provides a simple, unified explanation for memorization and generalization [Chatterjee, ICLR 20]. Although we have several interesting takeaways, our most surprising result concerns memorization. Naively, one might expect that when training with completely random labels, each example is fitted independently, and so $m$-coherence should be close to 1. 
However, this is not the case: $m$-coherence reaches moderately high values during training (though still much smaller than with real labels), indicating that over-parameterized neural networks find common patterns even in scenarios where generalization is not possible. A detailed analysis of this phenomenon provides a deeper confirmation of CG, but at the same time puts into sharp relief what is missing from the theory in order to provide a complete explanation of generalization in neural networks.",/pdf/d01ba44bf7b327fb8ac7e32942d1de26eaad6ec1.pdf,ICLR,2021,We present a new metric to study per-example gradient alignment that is mathematically cleaner and more interpretable than previously proposed metrics and use it to empirically study the evolution of gradient alignment in large scale training. +S1xRnxSYwS,H1xupg-YPH,1569440000000.0,1577170000000.0,2556,Goten: GPU-Outsourcing Trusted Execution of Neural Network Training and Prediction,"[""nkl018@ie.cuhk.edu.hk"", ""smchow@ie.cuhk.edu.hk"", ""woopuiyung@gmail.com"", ""foreverjun.zhao@gmail.com""]","[""Lucien K.L. Ng"", ""Sherman S.M. Chow"", ""Anna P.Y. Woo"", ""Donald P. H. Wong"", ""Yongjun Zhao""]","[""machine learning"", ""security"", ""privacy"", ""TEE"", ""trusted processors"", ""Intel SGX"", ""GPU"", ""high-performance""]","Before we see worldwide collaborative efforts in training machine-learning models or widespread deployments of prediction-as-a-service, we need to devise an efficient privacy-preserving mechanism which guarantees the privacy of all stakeholders (data contributors, model owner, and queriers). Slalom (ICLR ’19) preserves privacy only for prediction by leveraging both a trusted environment (e.g., Intel SGX) and an untrusted GPU. The challenges for enabling private training are explicitly left open – its pre-computation technique does not hide the model weights and fails to support dynamic quantization corresponding to the large changes in weight magnitudes during training. 
Moreover, it is not a truly outsourcing solution since (offline) pre-computation for a job takes as much time as computing the job locally by SGX, i.e., it only works before all pre-computations are exhausted. + +We propose Goten, a privacy-preserving framework supporting both training and prediction. We tackle all the above challenges by proposing a secure outsourcing protocol which 1) supports dynamic quantization, 2) hides the model weight from GPU, and 3) performs better than a pure-SGX solution even if we perform the precomputation online. Our solution leverages a non-colluding assumption which is often employed by cryptographic solutions aiming for practical efficiency (IEEE SP ’13, Usenix Security ’17, PoPETs ’19). We use three servers, which can be reduced to two if the pre-computation is done offline. Furthermore, we implement our tailor-made memory-aware measures for minimizing the overhead when the SGX memory limit is exceeded (cf., EuroSys ’17, Usenix ATC ’19). Compared to a pure-SGX solution, our experiments show that Goten can speed up linear-layer computations in VGG up to 40×, and overall speed up by 8.64× on VGG11.",/pdf/000e4ea3976d4a0200e0282b01af7d1e8bc40739.pdf,ICLR,2020,"Leveraging GPU and Intel SGX to protect privacy of training data, model, and queries while achieving high-performance training and prediction" +ByxoqJrtvr,H1g4XsAuwr,1569440000000.0,1577170000000.0,1890,Learning to Reach Goals Without Reinforcement Learning,"[""dibya.ghosh@berkeley.edu"", ""abhigupta@berkeley.edu"", ""justinjfu@eecs.berkeley.edu"", ""adreddy@berkeley.edu"", ""coline@berkeley.edu"", ""beysenba@cs.cmu.edu"", ""svlevine@eecs.berkeley.edu""]","[""Dibya Ghosh"", ""Abhishek Gupta"", ""Justin Fu"", ""Ashwin Reddy"", ""Coline Devin"", ""Benjamin Eysenbach"", ""Sergey Levine""]","[""Reinforcement Learning"", ""Goal Reaching"", ""Imitation Learning""]","Imitation learning algorithms provide a simple and straightforward approach for training control policies via standard 
supervised learning methods. By maximizing the likelihood of good actions provided by an expert demonstrator, supervised imitation learning can produce effective policies without the algorithmic complexities and optimization challenges of reinforcement learning, at the cost of requiring an expert demonstrator -- typically a person -- to provide the demonstrations. In this paper, we ask: can we use imitation learning to train effective policies without any expert demonstrations? The key observation that makes this possible is that, in the multi-task setting, trajectories that are generated by a suboptimal policy can still serve as optimal examples for other tasks. In particular, in the setting where the tasks correspond to different goals, every trajectory is a successful demonstration for the state that it actually reaches. Informed by this observation, we propose a very simple algorithm for learning behaviors without any demonstrations, user-provided reward functions, or complex reinforcement learning methods. Our method simply maximizes the likelihood of actions the agent actually took in its own previous rollouts, conditioned on the goal being the state that it actually reached. Although related variants of this approach have been proposed previously in imitation learning settings with example demonstrations, we present the first instance of this approach as a method for learning goal-reaching policies entirely from scratch. 
We present a theoretical result linking self-supervised imitation learning and reinforcement learning, and empirical results showing that it performs competitively with more complex reinforcement learning methods on a range of challenging goal reaching problems.",/pdf/3de72eb6a5b0a5c6c39123200091631309d327e7.pdf,ICLR,2020,Learning how to reach goals from scratch by using imitation learning with data relabeling +rJ33wwxRb,rysnvweCW,1509090000000.0,1521150000000.0,294,SGD Learns Over-parameterized Networks that Provably Generalize on Linearly Separable Data,"[""alonbrutzkus@mail.tau.ac.il"", ""amir.globerson@gmail.com"", ""eran.malach@mail.huji.ac.il"", ""shais@cs.huji.ac.il""]","[""Alon Brutzkus"", ""Amir Globerson"", ""Eran Malach"", ""Shai Shalev-Shwartz""]","[""Deep Learning"", ""Non-convex Optmization"", ""Generalization"", ""Learning Theory"", ""Neural Networks""]","Neural networks exhibit good generalization behavior in the +over-parameterized regime, where the number of network parameters +exceeds the number of observations. Nonetheless, +current generalization bounds for neural networks fail to explain this +phenomenon. In an attempt to bridge this gap, we study the problem of +learning a two-layer over-parameterized neural network, when the data is generated by a linearly separable function. In the case where the network has Leaky +ReLU activations, we provide both optimization and generalization guarantees for over-parameterized networks. +Specifically, we prove convergence rates of SGD to a global +minimum and provide generalization guarantees for this global minimum +that are independent of the network size. +Therefore, our result clearly shows that the use of SGD for optimization both finds a global minimum, and avoids overfitting despite the high capacity of the model. 
This is the first theoretical demonstration that SGD can avoid overfitting, when learning over-specified neural network classifiers.",/pdf/e740be2d100d0467d675468bc2fe2252acc906d0.pdf,ICLR,2018,We show that SGD learns two-layer over-parameterized neural networks with Leaky ReLU activations that provably generalize on linearly separable data. +giit4HdDNa,iYzF4WOT89i,1601310000000.0,1618470000000.0,1916,Go with the flow: Adaptive control for Neural ODEs,"[""~Mathieu_Chalvidal1"", ""~Matthew_Ricci1"", ""rufin.vanrullen@cnrs.fr"", ""~Thomas_Serre1""]","[""Mathieu Chalvidal"", ""Matthew Ricci"", ""Rufin VanRullen"", ""Thomas Serre""]","[""Neural ODEs"", ""Optimal Control Theory"", ""Hypernetworks"", ""Normalizing flows""]","Despite their elegant formulation and lightweight memory cost, neural ordinary differential equations (NODEs) suffer from known representational limitations. In particular, the single flow learned by NODEs cannot express all homeomorphisms from a given data space to itself, and their static weight parameterization restricts the type of functions they can learn compared to discrete architectures with layer-dependent weights. Here, we describe a new module called neurally-controlled ODE (N-CODE) designed to improve the expressivity of NODEs. The parameters of N-CODE modules are dynamic variables governed by a trainable map from initial or current activation state, resulting in forms of open-loop and closed-loop control, respectively. A single module is sufficient for learning a distribution on non-autonomous flows that adaptively drive neural representations. We provide theoretical and empirical evidence that N-CODE circumvents limitations of previous NODEs models and show how increased model expressivity manifests in several supervised and unsupervised learning problems. 
These favorable empirical results indicate the potential of using data- and activity-dependent plasticity in neural networks across numerous domains.",/pdf/8dc540aca5360588117ac82153292b6a2d2f4897.pdf,ICLR,2021,This paper presents a new method to enhance Neural ODEs representation power by dynamically controlling their weight parametrization. +rkl6As0cF7,SkgpPa69Y7,1538090000000.0,1548270000000.0,938,Probabilistic Recursive Reasoning for Multi-Agent Reinforcement Learning,"[""ying.wen@cs.ucl.ac.uk"", ""yaodong.yang@cs.ucl.ac.uk"", ""rui.luo@cs.ucl.ac.uk"", ""jun.wang@cs.ucl.ac.uk"", ""wei.pan@tudelft.nl""]","[""Ying Wen"", ""Yaodong Yang"", ""Rui Luo"", ""Jun Wang"", ""Wei Pan""]","[""Multi-agent Reinforcement Learning"", ""Recursive Reasoning""]","Humans are capable of attributing latent mental contents such as beliefs, or intentions to others. The social skill is critical in everyday life to reason about the potential consequences of their behaviors so as to plan ahead. It is known that humans use this reasoning ability recursively, i.e. considering what others believe about their own beliefs. In this paper, we start from level-$1$ recursion and introduce a probabilistic recursive reasoning (PR2) framework for multi-agent reinforcement learning. Our hypothesis is that it is beneficial for each agent to account for how the opponents would react to its future behaviors. Under the PR2 framework, we adopt variational Bayes methods to approximate the opponents' conditional policy, to which each agent finds the best response and then improve their own policy. We develop decentralized-training-decentralized-execution algorithms, PR2-Q and PR2-Actor-Critic, that are proved to converge in the self-play scenario when there is one Nash equilibrium. Our methods are tested on both the matrix game and the differential game, which have a non-trivial equilibrium where common gradient-based methods fail to converge. 
Our experiments show that it is critical to reason about what the opponents believe about what the agent believes. We expect our work to contribute a new idea of modeling the opponents to the multi-agent reinforcement learning community. +",/pdf/12f430fb5a28c8adc39784fd4f45dda62a737de4.pdf,ICLR,2019,We proposed a novel probabilistic recursive reasoning (PR2) framework for multi-agent deep reinforcement learning tasks. +9uvhpyQwzM_,Y96cYIfLHB0,1601310000000.0,1616050000000.0,2857,Evaluation of Similarity-based Explanations,"[""~Kazuaki_Hanawa1"", ""~Sho_Yokoi1"", ""~Satoshi_Hara1"", ""~Kentaro_Inui1""]","[""Kazuaki Hanawa"", ""Sho Yokoi"", ""Satoshi Hara"", ""Kentaro Inui""]","[""Interpretability"", ""Explainability""]","Explaining the predictions made by complex machine learning models helps users to understand and accept the predicted outputs with confidence. One promising way is to use similarity-based explanation that provides similar instances as evidence to support model predictions. Several relevance metrics are used for this purpose. In this study, we investigated relevance metrics that can provide reasonable explanations to users. Specifically, we adopted three tests to evaluate whether the relevance metrics satisfy the minimal requirements for similarity-based explanation. Our experiments revealed that the cosine similarity of the gradients of the loss performs best, which would be a recommended choice in practice. In addition, we showed that some metrics perform poorly in our tests and analyzed the reasons for their failure. We expect our insights to help practitioners in selecting appropriate relevance metrics and also aid further research on designing better relevance metrics for explanations.",/pdf/ede4daa61cd87856ebce2c047d94f9fdc6149edf.pdf,ICLR,2021,"We investigated empirically which of the relevance metrics (e.g. similarity of hidden layer, influence function, etc.) are appropriate for similarity-based explanation." 
+SJxRKT4Fwr,H1l9ASpvvB,1569440000000.0,1577170000000.0,688,"Cross-Dimensional Self-Attention for Multivariate, Geo-tagged Time Series Imputation","[""jm4743@columbia.edu"", ""zs2262@columbia.edu"", ""alireza@cs.columbia.edu"", ""mansour@merl.com"", ""avetro@merl.com"", ""sc250@columbia.edu""]","[""Jiawei Ma*"", ""Zheng Shou*"", ""Alireza Zareian"", ""Hassan Mansour"", ""Anthony Vetro"", ""Shih-Fu Chang""]","[""self-attention"", ""cross-dimensional"", ""multivariate time series"", ""imputation""]","Many real-world applications involve multivariate, geo-tagged time series data: at each location, multiple sensors record corresponding measurements. For example, an air quality monitoring system records PM2.5, CO, etc. The resulting time-series data often has missing values due to device outages or communication errors. In order to impute the missing values, state-of-the-art methods are built on Recurrent Neural Networks (RNN), which process each time stamp sequentially, prohibiting the direct modeling of the relationship between distant time stamps. Recently, the self-attention mechanism has been proposed for sequence modeling tasks such as machine translation, significantly outperforming RNN because the relationship between each two time stamps can be modeled explicitly. In this paper, we are the first to adapt the self-attention mechanism for multivariate, geo-tagged time series data. In order to jointly capture the self-attention across different dimensions (i.e. time, location and sensor measurements) while keeping the size of attention maps reasonable, we propose a novel approach called Cross-Dimensional Self-Attention (CDSA) to process each dimension sequentially, yet in an order-independent manner. On three real-world datasets, including our newly collected NYC-traffic dataset, extensive experiments demonstrate the superiority of our approach compared to state-of-the-art methods for both imputation and forecasting tasks. 
+",/pdf/2ee06ec1b4ac28a835efef63d1ee7f95441b9cf3.pdf,ICLR,2020,"A novel self-attention mechanism for multivariate, geo-tagged time series imputation." +AAes_3W-2z,yUHzDpTQyDF,1601310000000.0,1614650000000.0,692,Wasserstein Embedding for Graph Learning,"[""~Soheil_Kolouri1"", ""nnaderializadeh@hrl.com"", ""~Gustavo_K._Rohde1"", ""hhoffmann@hrl.com""]","[""Soheil Kolouri"", ""Navid Naderializadeh"", ""Gustavo K. Rohde"", ""Heiko Hoffmann""]","[""Wasserstein"", ""graph embedding"", ""graph-level prediction""]","We present Wasserstein Embedding for Graph Learning (WEGL), a novel and fast framework for embedding entire graphs in a vector space, in which various machine learning models are applicable for graph-level prediction tasks. We leverage new insights on defining similarity between graphs as a function of the similarity between their node embedding distributions. Specifically, we use the Wasserstein distance to measure the dissimilarity between node embeddings of different graphs. Unlike prior work, we avoid pairwise calculation of distances between graphs and reduce the computational complexity from quadratic to linear in the number of graphs. WEGL calculates Monge maps from a reference distribution to each node embedding and, based on these maps, creates a fixed-sized vector representation of the graph. We evaluate our new graph embedding approach on various benchmark graph-property prediction tasks, showing state-of-the-art classification performance while having superior computational efficiency. The code is available at https://github.com/navid-naderi/WEGL.",/pdf/91a2b065854f096c0ed827b88b9fc26dff36f359.pdf,ICLR,2021,Wasserstein Embedding for Graph Learning (WEGL) is a novel and fast framework for embedding entire graphs into a vector space in which the Euclidean distance between representations approximates the 2-Wasserstein distance. 
+zsKWh2pRSBK,KnWU3Er8H0I7,1601310000000.0,1614990000000.0,2044,"Poisoned classifiers are not only backdoored, they are fundamentally broken","[""~Mingjie_Sun1"", ""~Siddhant_Agarwal1"", ""~J_Zico_Kolter1""]","[""Mingjie Sun"", ""Siddhant Agarwal"", ""J Zico Kolter""]","[""Backdoor Attacks"", ""Denoised Smoothing"", ""Perceptually-Aligned Gradients""]","Under a commonly-studied “backdoor” poisoning attack against classification models, an attacker adds a small “trigger” to a subset of the training data, such that the presence of this trigger at test time causes the classifier to always predict some target class. It is often implicitly assumed that the poisoned classifier is vulnerable exclusively to the adversary who possesses the trigger. In this paper, we show empirically that this view of backdoored classifiers is fundamentally incorrect. We demonstrate that anyone with access to the classifier, even without access to any original training data or trigger, can construct several alternative triggers that are as effective or more so at eliciting the target class at test time. We construct these alternative triggers by first generating adversarial examples for a smoothed version of the classifier, created with a recent process called Denoised Smoothing, and then extracting colors or cropped portions of adversarial images. We demonstrate the effectiveness of our attack through extensive experiments on ImageNet and TrojAI datasets, including a user study which demonstrates that our method allows users to easily determine the existence of such backdoors in existing poisoned classifiers. Furthermore, we demonstrate that our alternative triggers can in fact look entirely different from the original trigger, highlighting that the backdoor actually learned by the classifier differs substantially from the trigger image itself. 
Thus, we argue that there is no such thing as a “secret” backdoor in poisoned classifiers: poisoning a classifier invites attacks not just by the party that possesses the trigger, but by anyone with access to the classifier.",/pdf/4959459ccc8a6c2d401fe6ca978ce4b82b4f3ff0.pdf,ICLR,2021,"We show that backdoored classifiers can be easily attacked without access to the original trigger, by constructing alternative triggers that are just as effective as, or even more effective than, the original one." +B1lXGnRctX,SyxGNb09YQ,1538090000000.0,1545360000000.0,1252,Classification in the dark using tactile exploration,"[""mudigonda@berkeley.edu"", ""btickell@berkeley.edu"", ""pulkitag@berkeley.edu""]","[""Mayur Mudigonda"", ""Blake Tickell"", ""Pulkit Agrawal""]","[""tactile sensing"", ""multimodal representations"", ""vision"", ""object identification""]","Combining information from different sensory modalities to execute goal directed actions is a key aspect of human intelligence. Specifically, human agents are very easily able to translate the task communicated in one sensory domain (say vision) into a representation that enables them to complete this task when they can only sense their environment using a separate sensory modality (say touch). In order to build agents with similar capabilities, in this work we consider the problem of retrieving a target object from a drawer. The agent is provided with an image of a previously unseen object and it explores objects in the drawer using only tactile sensing to retrieve the object that was shown in the image without receiving any visual feedback. Success at this task requires close integration of visual and tactile sensing. We present a method for performing this task in a simulated environment using an anthropomorphic hand. 
We hope that future research in the direction of combining sensory signals for acting will find the object retrieval from a drawer to be a useful benchmark problem.",/pdf/f45a4ba2821dc271070e94959752b3fb440e5af8.pdf,ICLR,2019,"In this work, we study the problem of learning representations to identify novel objects by exploring objects using tactile sensing. The key point is that the query is provided in the image domain." +HfnQjEN_ZC,HUl0ToOqgbsU,1601310000000.0,1614990000000.0,1999,Ballroom Dance Movement Recognition Using a Smart Watch and Representation Learning,"[""~Varun_Badrinath_Krishna1""]","[""Varun Badrinath Krishna""]","[""ballroom"", ""sequence"", ""deep"", ""learning"", ""machine"", ""markov"", ""prior""]","Smart watches are being increasingly used to detect human gestures and movements. Using a single smart watch, whole body movement recognition remains a hard problem because movements may not be adequately captured by the sensors in the watch. In this paper, we present a whole body movement detection study using a single smart watch in the context of ballroom dancing. Deep learning representations are used to classify well-defined sequences of movements, called \emph{figures}. Those representations are found to outperform ensembles of random forests and hidden Markov models. 
The classification accuracy of 85.95\% was improved to 92.31\% by modeling a dance as a first-order Markov chain of figures.",/pdf/146e4cd8da3bf1cd841042a3b7aa15dc42d0002a.pdf,ICLR,2021,Deep learning combined with Markov priors are used +BJg_2JHKvH,HkxkA-yKvS,1569440000000.0,1577170000000.0,1957,Semi-Supervised Learning with Normalizing Flows,"[""izmailovpavel@gmail.com"", ""pk1822@nyu.edu"", ""maf820@nyu.edu"", ""andrew@cornell.edu""]","[""Pavel Izmailov"", ""Polina Kirichenko"", ""Marc Finzi"", ""Andrew Wilson""]","[""Semi-Supervised Learning"", ""Normalizing Flows""]","We propose Flow Gaussian Mixture Model (FlowGMM), a general-purpose method for semi-supervised learning based on a simple and principled probabilistic framework. We approximate the joint distribution of the labeled and unlabeled data with a flexible mixture model implemented as a Gaussian mixture transformed by a normalizing flow. We train the model by maximizing the exact joint likelihood of the labeled and unlabeled data. We evaluate FlowGMM on a wide range of semi-supervised classification problems across different data types: AG-News and Yahoo Answers text data, MNIST, SVHN and CIFAR-10 image classification problems as well as tabular UCI datasets. FlowGMM achieves promising results on image classification problems and outperforms the competing methods on other types of data. FlowGMM learns an interpretable latent representation space and allows hyper-parameter free feature visualization at real time rates. Finally, we show that FlowGMM can be calibrated to produce meaningful uncertainty estimates for its predictions. 
",/pdf/19509e2eafbe499c8e9237ea96a9f792dab1cd1a.pdf,ICLR,2020,Probabilistic semi-supervised learning method based on normalizing flows +r1MSBjA9Ym,rkgLXrB3Om,1538090000000.0,1545360000000.0,85,Collapse of deep and narrow neural nets,"[""lu_lu_1@brown.edu"", ""suyh@fzu.edu.cn"", ""george_karniadakis@brown.edu""]","[""Lu Lu"", ""Yanhui Su"", ""George Em Karniadakis""]","[""neural networks"", ""deep and narrow"", ""ReLU"", ""collapse""]","Recent theoretical work has demonstrated that deep neural networks have superior performance over shallow networks, but their training is more difficult, e.g., they suffer from the vanishing gradient problem. This problem can be typically resolved by the rectified linear unit (ReLU) activation. However, here we show that even for such activation, deep and narrow neural networks (NNs) will converge to erroneous mean or median states of the target function depending on the loss with high probability. Deep and narrow NNs are encountered in solving partial differential equations with high-order derivatives. We demonstrate this collapse of such NNs both numerically and theoretically, and provide estimates of the probability of collapse. We also construct a diagram of a safe region for designing NNs that avoid the collapse to erroneous states. Finally, we examine different ways of initialization and normalization that may avoid the collapse problem. Asymmetric initializations may reduce the probability of collapse but do not totally eliminate it.",/pdf/533fc8cc6871b2c6327297619fdf4337883ac59c.pdf,ICLR,2019,Deep and narrow neural networks will converge to erroneous mean or median states of the target function depending on the loss with high probability. 
+xpx9zj7CUlY,umcBHvvuZ4Z,1601310000000.0,1615660000000.0,2076,Randomized Automatic Differentiation,"[""~Deniz_Oktay2"", ""mcgreivy@princeton.edu"", ""jaduol@princeton.edu"", ""~Alex_Beatson1"", ""~Ryan_P_Adams1""]","[""Deniz Oktay"", ""Nick McGreivy"", ""Joshua Aduol"", ""Alex Beatson"", ""Ryan P Adams""]","[""automatic differentiation"", ""autodiff"", ""backprop"", ""deep learning"", ""pdes"", ""stochastic optimization""]","The successes of deep learning, variational inference, and many other fields have been aided by specialized implementations of reverse-mode automatic differentiation (AD) to compute gradients of mega-dimensional objectives. The AD techniques underlying these tools were designed to compute exact gradients to numerical precision, but modern machine learning models are almost always trained with stochastic gradient descent. Why spend computation and memory on exact (minibatch) gradients only to use them for stochastic optimization? We develop a general framework and approach for randomized automatic differentiation (RAD), which can allow unbiased gradient estimates to be computed with reduced memory in return for variance. We examine limitations of the general approach, and argue that we must leverage problem specific structure to realize benefits. We develop RAD techniques for a variety of simple neural network architectures, and show that for a fixed memory budget, RAD converges in fewer iterations than using a small batch size for feedforward networks, and in a similar number for recurrent networks. 
We also show that RAD can be applied to scientific computing, and use it to develop a low-memory stochastic gradient method for optimizing the control parameters of a linear reaction-diffusion PDE representing a fission reactor.",/pdf/e0f487101840a7b0e72a5ec77e3d0b6b57734740.pdf,ICLR,2021,"We develop a general framework and approach for randomized automatic differentiation (RAD), which can allow unbiased gradient estimates to be computed with reduced memory in return for variance." +rJGZq6g0-,SkM-qTxR-,1509120000000.0,1523910000000.0,423,"Emergent Communication in a Multi-Modal, Multi-Step Referential Game","[""kve216@nyu.edu"", ""apd283@nyu.edu"", ""dkiela@fb.com"", ""kyunghyun.cho@nyu.edu""]","[""Katrina Evtimova"", ""Andrew Drozdov"", ""Douwe Kiela"", ""Kyunghyun Cho""]","[""emergent communication"", ""multi-agent systems"", ""multi-modal""]","Inspired by previous work on emergent communication in referential games, we propose a novel multi-modal, multi-step referential game, where the sender and receiver have access to distinct modalities of an object, and their information exchange is bidirectional and of arbitrary duration. The multi-modal multi-step setting allows agents to develop an internal communication significantly closer to natural language, in that they share a single set of messages, and that the length of the conversation may vary according to the difficulty of the task. We examine these properties empirically using a dataset consisting of images and textual descriptions of mammals, where the agents are tasked with identifying the correct object. 
Our experiments indicate that a robust and efficient communication protocol emerges, where gradual information exchange informs better predictions and higher communication bandwidth improves generalization.",/pdf/951c34f661117068b27a2a80d637e3d2ea13c961.pdf,ICLR,2018, +r1e74a4twH,ByeMndGPwr,1569440000000.0,1577170000000.0,479,CZ-GEM: A FRAMEWORK FOR DISENTANGLED REPRESENTATION LEARNING,"[""akash.srivastava@me.com"", ""ybansal@g.harvard.edu"", ""yding5@nd.edu"", ""egger@mit.edu"", ""psattig@us.ibm.com"", ""jbt@mit.edu"", ""david.d.cox@ibm.com"", ""dgutfre@us.ibm.com""]","[""Akash Srivastava"", ""Yamini Bansal"", ""Yukun Ding"", ""Bernhard Egger"", ""Prasanna Sattigeri"", ""Josh Tenenbaum"", ""David D. Cox"", ""Dan Gutfreund""]","[""disentangled representation learning"", ""gan"", ""generative model"", ""simulator""]","Learning disentangled representations of data is one of the central themes in unsupervised learning in general and generative modelling in particular. In this work, we tackle a slightly more intricate scenario where the observations are generated from a conditional distribution of some known control variate and some latent noise variate. To this end, we present a hierarchical model and a training method (CZ-GEM) that leverages some of the recent developments in likelihood-based and likelihood-free generative models. We show that by formulation, CZ-GEM introduces the right inductive biases that ensure the disentanglement of the control from the noise variables, while also keeping the components of the control variate disentangled. This is achieved without compromising on the quality of the generated samples. Our approach is simple, general, and can be applied both in supervised and unsupervised settings.",/pdf/fcd1d9121a29f51e2159b0911cc63d946dd23cb4.pdf,ICLR,2020,Hierarchical generative model (hybrid of VAE and GAN) that learns a disentangled representation of data without compromising the generative quality. 
+B1mvVm-C-,rkGINQ-R-,1509140000000.0,1520070000000.0,1160,Universal Agent for Disentangling Environments and Tasks,"[""mjy14@mails.tsinghua.edu.cn"", ""dhh14@mails.tsinghua.edu.cn"", ""limjj@usc.edu""]","[""Jiayuan Mao"", ""Honghua Dong"", ""Joseph J. Lim""]","[""reinforcement learning"", ""transfer learning""]","Recent state-of-the-art reinforcement learning algorithms are trained under the goal of excelling in one specific task. Hence, both environment and task specific knowledge are entangled into one framework. However, there are often scenarios where the environment (e.g. the physical world) is fixed while only the target task changes. Hence, borrowing the idea from hierarchical reinforcement learning, we propose a framework that disentangles task and environment specific knowledge by separating them into two units. The environment-specific unit handles how to move from one state to the target state; and the task-specific unit plans for the next target state given a specific task. The extensive results in simulators indicate that our method can efficiently separate and learn two independent units, and also adapt to a new task more efficiently than the state-of-the-art methods.",/pdf/28775c0797029152d1a7938db55ce9fbab82bae7.pdf,ICLR,2018,We propose a DRL framework that disentangles task and environment specific knowledge. +BkllBpEKDH,H1gNj3VvwB,1569440000000.0,1577170000000.0,510,Continuous Adaptation in Multi-agent Competitive Environments,"[""fuj30089@gmail.com"", ""shengjyh@faculty.nctu.edu.tw""]","[""Kuei-Tso Lee"", ""Sheng-Jyh Wang""]","[""multi-agent environment"", ""continuous adaptation"", ""Nash equilibrium"", ""deep counterfactual regret minimization"", ""reinforcement learning"", ""stochastic game"", ""baseball""]","In a multi-agent competitive environment, we would expect an agent who can quickly adapt to environmental changes may have a higher probability to survive and beat other agents. 
In this paper, to discuss whether the adaptation capability can help a learning agent to improve its competitiveness in a multi-agent environment, we construct a simplified baseball game scenario to develop and evaluate the adaptation capability of learning agents. Our baseball game scenario is modeled as a two-player zero-sum stochastic game with only the final reward. We propose a modified Deep CFR algorithm to learn a strategy that approximates the Nash equilibrium strategy. We also form several teams, with different teams adopting different playing strategies, trying to analyze (1) whether an adaptation mechanism can help in increasing the winning percentage and (2) what kind of initial strategies can help a team to get a higher winning percentage. The experimental results show that the learned Nash-equilibrium strategy is very similar to real-life baseball game strategy. Besides, with the proposed strategy adaptation mechanism, the winning percentage can be increased for the team with a Nash-equilibrium initial strategy. Nevertheless, based on the same adaptation mechanism, those teams with deterministic initial strategies actually become less competitive.",/pdf/f52ea026c6f2d702c23348c0554562bc545a59a2.pdf,ICLR,2020,We construct a simplified baseball game scenario to develop and evaluate the adaptation capability of learning agents. +HygDF1rYDB,Byl0mLAOwH,1569440000000.0,1577170000000.0,1843,Explaining Time Series by Counterfactuals,"[""stonekaboni@cs.toronto.edu"", ""shalmali@vectorinstitute.ai"", ""duvenaud@cs.toronto.edu"", ""anna.goldenberg@utoronto.ca""]","[""Sana Tonekaboni"", ""Shalmali Joshi"", ""David Duvenaud"", ""Anna Goldenberg""]","[""explainability"", ""counterfactual modeling"", ""time series""]","We propose a method to automatically compute the importance of features at every observation in time series, by simulating counterfactual trajectories given previous observations. 
We define the importance of each observation as the change in the model output caused by replacing the observation with a generated one. Our method can be applied to arbitrarily complex time series models. We compare the generated feature importance to existing methods like sensitivity analyses, feature occlusion, and other explanation baselines to show that our approach generates more precise explanations and is less sensitive to noise in the input signals.",/pdf/46ace7efc65e99328c4fa8b2007a85cd0a4f7111.pdf,ICLR,2020,Explaining Multivariate Time Series Models by finding important observations in time using Counterfactuals +BkeC_J-R-,HyJ0Ok-0Z,1509120000000.0,1518730000000.0,494,Combination of Supervised and Reinforcement Learning For Vision-Based Autonomous Control,"[""d.kangin@exeter.ac.uk"", ""n.pugeault@exeter.ac.uk""]","[""Dmitry Kangin"", ""Nicolas Pugeault""]","[""Reinforcement learning"", ""deep learning"", ""autonomous control""]"," Reinforcement learning methods have recently achieved impressive results on a wide range of control problems. However, especially with complex inputs, they still require an extensive amount of training data in order to converge to a meaningful solution. This limitation largely prohibits their usage for complex input spaces such as video signals, and it is still impossible to use it for a number of complex problems in a real world environments, including many of those for video based control. Supervised learning, on the contrary, is capable of learning on a relatively small number of samples, however it does not take into account reward-based control policies and is not capable to provide independent control policies. In this article we propose a model-free control method, which uses a combination of reinforcement and supervised learning for autonomous control and paves the way towards policy based control in real world environments. 
We use SpeedDreams/TORCS video game to demonstrate that our approach requires much less samples (hundreds of thousands against millions or tens of millions) comparing to the state-of-the-art reinforcement learning techniques on similar data, and at the same time overcomes both supervised and reinforcement learning approaches in terms of quality. Additionally, we demonstrate the applicability of the method to MuJoCo control problems. ",/pdf/3ef5574c4ec2f92196a3000f1c1f8b1a6f7056c8.pdf,ICLR,2018,"The new combination of reinforcement and supervised learning, dramatically decreasing the number of required samples for training on video" +yUxUNaj2Sl,YQvXgaTU8TB,1601310000000.0,1616020000000.0,2111,Does enhanced shape bias improve neural network robustness to common corruptions?,"[""~Chaithanya_Kumar_Mummadi1"", ""~Ranjitha_Subramaniam1"", ""~Robin_Hutmacher1"", ""~Julien_Vitay1"", ""~Volker_Fischer1"", ""~Jan_Hendrik_Metzen1""]","[""Chaithanya Kumar Mummadi"", ""Ranjitha Subramaniam"", ""Robin Hutmacher"", ""Julien Vitay"", ""Volker Fischer"", ""Jan Hendrik Metzen""]","[""neural network robustness"", ""shape bias"", ""corruptions"", ""distribution shift""]","Convolutional neural networks (CNNs) learn to extract representations of complex features, such as object shapes and textures to solve image recognition tasks. Recent work indicates that CNNs trained on ImageNet are biased towards features that encode textures and that these alone are sufficient to generalize to unseen test data from the same distribution as the training data but often fail to generalize to out-of-distribution data. It has been shown that augmenting the training data with different image styles decreases this texture bias in favor of increased shape bias while at the same time improving robustness to common corruptions, such as noise and blur. Commonly, this is interpreted as shape bias increasing corruption robustness. However, this relationship is only hypothesized. 
We perform a systematic study of different ways of composing inputs based on natural images, explicit edge information, and stylization. While stylization is essential for achieving high corruption robustness, we do not find a clear correlation between shape bias and robustness. We conclude that the data augmentation caused by style-variation accounts for the improved corruption robustness and increased shape bias is only a byproduct.",/pdf/afd34372762e3e8c1a3e742ee5fa8ae88e866c47.pdf,ICLR,2021,We show that robustness to common corruptions does not correlate with strong shape bias but with effective data augmentation strategies like stylization +H1ewdiR5tQ,HkeuD5y9Ym,1538090000000.0,1547120000000.0,363,Graph Wavelet Neural Network,"[""xubingbing@ict.ac.cn"", ""shenhuawei@ict.ac.cn"", ""caoqi@ict.ac.cn"", ""qiuyunqi@ict.ac.cn"", ""cxq@ict.ac.cn""]","[""Bingbing Xu"", ""Huawei Shen"", ""Qi Cao"", ""Yunqi Qiu"", ""Xueqi Cheng""]","[""graph convolution"", ""graph wavelet transform"", ""graph Fourier transform"", ""semi-supervised learning""]","We present graph wavelet neural network (GWNN), a novel graph convolutional neural network (CNN), leveraging graph wavelet transform to address the shortcomings of previous spectral graph CNN methods that depend on graph Fourier transform. Different from graph Fourier transform, graph wavelet transform can be obtained via a fast algorithm without requiring matrix eigendecomposition with high computational cost. Moreover, graph wavelets are sparse and localized in vertex domain, offering high efficiency and good interpretability for graph convolution. 
The proposed GWNN significantly outperforms previous spectral graph CNNs in the task of graph-based semi-supervised classification on three benchmark datasets: Cora, Citeseer and Pubmed.",/pdf/3298c2f91e55e505a6cb5cc98588ba2b5a76ad5a.pdf,ICLR,2019,"We present graph wavelet neural network (GWNN), a novel graph convolutional neural network (CNN), leveraging graph wavelet transform to address the shortcoming of previous spectral graph CNN methods that depend on graph Fourier transform." +nzKv5vxZfge,KvVn-e3oKX,1601310000000.0,1614990000000.0,1666,Causal Screening to Interpret Graph Neural Networks,"[""~Xiang_Wang6"", ""wuyxin@mail.ustc.edu.cn"", ""~An_Zhang2"", ""~Xiangnan_He1"", ""~Tat-seng_Chua1""]","[""Xiang Wang"", ""Yingxin Wu"", ""An Zhang"", ""Xiangnan He"", ""Tat-seng Chua""]","[""Feature Attribution"", ""Graph Neural Networks"", ""Explainable Methods"", ""Causal Effect""]","With the growing success of graph neural networks (GNNs), the explainability of GNN is attracting considerable attention. However, current works on feature attribution, which frame explanation generation as attributing a prediction to the graph features, mostly focus on the statistical interpretability. They may struggle to distinguish causal and noncausal effects of features, and quantify redundancy among features, thus resulting in unsatisfactory explanations. In this work, we focus on the causal interpretability in GNNs and propose a method, Causal Screening, from the perspective of cause-effect. It incrementally selects a graph feature (i.e., edge) with large causal attribution, which is formulated as the individual causal effect on the model outcome. As a model-agnostic tool, Causal Screening can be used to generate faithful and concise explanations for any GNN model. Further, by conducting extensive experiments on three graph classification datasets, we observe that Causal Screening achieves significant improvements over state-of-the-art approaches w.r.t. 
two quantitative metrics, predictive accuracy and contrastivity, and safely passes sanity checks.",/pdf/6cbc051b2a41c91c6991ee72c22b3b26b40304ba.pdf,ICLR,2021,"We explore the causal interpretability of graph neural networks, and propose a new method, Causal Screening, to identify the most influential edges and generate post-hoc explanations for model predictions." +SkeATxrKwH,ByeMPefYPS,1569440000000.0,1577170000000.0,2592,"Generative Hierarchical Models for Parts, Objects, and Scenes","[""fei.deng@rutgers.edu"", ""zhizz001@stu.xjtu.edu.cn"", ""sjn.ahn@gmail.com""]","[""Fei Deng"", ""Zhuo Zhi"", ""Sungjin Ahn""]",[],"Hierarchical structures such as part-whole relationships in objects and scenes are the most inherent structure in natural scenes. Learning such representations via unsupervised learning can provide various benefits such as interpretability, compositionality, and transferability, which are important in many downstream tasks. In this paper, we propose the first hierarchical generative model for learning multiple latent part-whole relationships in a scene. During inference, taking a top-down approach, our model infers the representation of more abstract concepts (e.g., objects) and then infers that of more specific concepts (e.g., parts) by conditioning on the corresponding abstract concept. This makes the model avoid a difficult problem of routing between parts and whole. 
In experiments on images containing multiple objects with different shapes and part compositions, we demonstrate that our model can learn the latent hierarchical structure between parts and wholes and generate imaginary scenes.",/pdf/f4f0b7f203b69465ae873732db88cf107f0f7561.pdf,ICLR,2020, +R5M7Mxl1xZ,mXbPiTUHfAO,1601310000000.0,1614990000000.0,1516,Minimal Geometry-Distortion Constraint for Unsupervised Image-to-Image Translation,"[""~Jiaxian_Guo2"", ""~Jiachen_Li4"", ""~Mingming_Gong1"", ""~Huan_Fu1"", ""~Kun_Zhang1"", ""~Dacheng_Tao1""]","[""Jiaxian Guo"", ""Jiachen Li"", ""Mingming Gong"", ""Huan Fu"", ""Kun Zhang"", ""Dacheng Tao""]","[""Unsupervised image translation"", ""Geometry distortion""]","Unsupervised image-to-image (I2I) translation, which aims to learn a domain mapping function without paired data, is very challenging because the function is highly under-constrained. Despite the significant progress in constraining the mapping function, current methods suffer from the \textit{geometry distortion} problem: the geometry structure of the translated image is inconsistent with the input source image, which may cause undesired distortions in the translated images. To remedy this issue, we propose a novel I2I translation constraint, called \textit{Minimal Geometry-Distortion Constraint} (MGC), which promotes the consistency of geometry structures and reduces the unwanted distortions in translation by reducing the randomness of color transformation in the translation process. To facilitate estimation and maximization of MGC, we propose an approximate representation of mutual information called relative Squared-loss Mutual Information (rSMI) that can be efficiently estimated analytically. We demonstrate the effectiveness of our MGC by providing quantitative and qualitative comparisons with the state-of-the-art methods on several benchmark datasets. 
+",/pdf/d279752dbfa33ccac5e1371eb483bc1890adf3de.pdf,ICLR,2021,We propose the Minimal Geometry-Distortion Constraint to promote the consistency of geometry structures and reduce the unwanted distortions in I2I translation. +uQfOy7LrlTR,FytV1OySACs,1601310000000.0,1611610000000.0,126,Scaling the Convex Barrier with Active Sets,"[""~Alessandro_De_Palma1"", ""~Harkirat_Behl1"", ""~Rudy_R_Bunel1"", ""~Philip_Torr1"", ""~M._Pawan_Kumar1""]","[""Alessandro De Palma"", ""Harkirat Behl"", ""Rudy R Bunel"", ""Philip Torr"", ""M. Pawan Kumar""]","[""Neural Network Verification"", ""Neural Network Bounding"", ""Optimisation for Deep Learning""]","Tight and efficient neural network bounding is of critical importance for the scaling of neural network verification systems. A number of efficient specialised dual solvers for neural network bounds have been presented recently, but they are often too loose to verify more challenging properties. This lack of tightness is linked to the weakness of the employed relaxation, which is usually a linear program of size linear in the number of neurons. While a tighter linear relaxation for piecewise linear activations exists, it comes at the cost of exponentially many constraints and thus currently lacks an efficient customised solver. We alleviate this deficiency via a novel dual algorithm that realises the full potential of the new relaxation by operating on a small active set of dual variables. Our method recovers the strengths of the new relaxation in the dual space: tightness and a linear separation oracle. At the same time, it shares the benefits of previous dual approaches for weaker relaxations: massive parallelism, GPU implementation, low cost per iteration and valid bounds at any time. As a consequence, we obtain better bounds than off-the-shelf solvers in only a fraction of their running time and recover the speed-accuracy trade-offs of looser dual solvers if the computational budget is small. 
We demonstrate that this results in significant formal verification speed-ups.",/pdf/2458e1a8af2e8df1e3da86a42fb7ebcbad62dc2f.pdf,ICLR,2021,We present a specialised dual solver for a tight ReLU convex relaxation and show that it speeds up formal network verification. +Sy7m72Ogg,,1478180000000.0,1481730000000.0,63,An Actor-critic Algorithm for Learning Rate Learning,"[""changxu@nbjl.nankai.edu.cn"", ""taoqin@microsoft.com"", ""wgzwp@nbjl.nankai.edu.cn"", ""tie-yan.liu@microsoft.com""]","[""Chang Xu"", ""Tao Qin"", ""Gang Wang"", ""Tie-Yan Liu""]","[""Deep learning"", ""Reinforcement Learning""]","Stochastic gradient descent (SGD), which updates the model parameters by adding a local gradient times a learning rate at each step, is widely used in model training of machine learning algorithms such as neural networks. It is observed that the models trained by SGD are sensitive to learning rates and good learning rates are problem specific. To avoid manually searching of learning rates, which is tedious and inefficient, we propose an algorithm to automatically learn learning rates using actor-critic methods from reinforcement learning (RL). In particular, we train a policy network called actor to decide the learning rate at each step during training, and a value network called critic to give feedback about quality of the decision (e.g., the goodness of the learning rate outputted by the actor) that the actor made. Experiments show that our method leads to good convergence of SGD and can prevent overfitting to a certain extent, resulting in better performance than human-designed competitors.",/pdf/7092cca2d341f4546ef5dd82256bdaf0045a988d.pdf,ICLR,2017,We propose an algorithm to automatically learn learning rates using actor-critic methods from reinforcement learning. 
+Yj4mmVB_l6,nfDhDnbIcKq,1601310000000.0,1614990000000.0,3446,Two steps at a time --- taking GAN training in stride with Tseng's method,"[""~Axel_B\u00f6hm1"", ""~Michael_Sedlmayer1"", ""~Ern\u00f6_Robert_Csetnek1"", ""radu.bot@univie.ac.at""]","[""Axel B\u00f6hm"", ""Michael Sedlmayer"", ""Ern\u00f6 Robert Csetnek"", ""Radu Ioan Bot""]",[],"Motivated by the training of Generative Adversarial Networks (GANs), we study methods for solving minimax problems with additional nonsmooth regularizers. +We do so by employing \emph{monotone operator} theory, in particular the \emph{Forward-Backward-Forward (FBF)} method, which avoids the known issue of limit cycling by correcting each update by a second gradient evaluation. +Furthermore, we propose a seemingly new scheme which recycles old gradients to mitigate the additional computational cost. +In doing so we rediscover a known method, related to \emph{Optimistic Gradient Descent Ascent (OGDA)}. +For both schemes we prove novel convergence rates for convex-concave minimax problems via a unifying approach. The derived error bounds are in terms of the gap function for the ergodic iterates. +For the deterministic and the stochastic problem we show a convergence rate of $\mathcal{O}(\nicefrac{1}{k})$ and $\mathcal{O}(\nicefrac{1}{\sqrt{k}})$, respectively. 
+We complement our theoretical results with empirical improvements in the training of Wasserstein GANs on the CIFAR10 dataset.",/pdf/aa426a1a631cad4c4a07c1ef515c2ae6fe1799d8.pdf,ICLR,2021, +S1eSoeSYwr,rkeSjk-KwB,1569440000000.0,1577170000000.0,2508,Deep Evidential Uncertainty,"[""amini@mit.edu"", ""wilkos@mit.edu"", ""asolei@mit.edu"", ""rus@csail.mit.edu""]","[""Alexander Amini"", ""Wilko Schwarting"", ""Ava Soleimany"", ""Daniela Rus""]","[""Evidential deep learning"", ""Uncertainty estimation"", ""Epistemic uncertainty""]","Deterministic neural networks (NNs) are increasingly being deployed in safety critical domains, where calibrated, robust and efficient measures of uncertainty are crucial. While it is possible to train regression networks to output the parameters of a probability distribution by maximizing a Gaussian likelihood function, the resulting model remains oblivious to the underlying confidence of its predictions. In this paper, we propose a novel method for training deterministic NNs to not only estimate the desired target but also the associated evidence in support of that target. We accomplish this by placing evidential priors over our original Gaussian likelihood function and training our NN to infer the hyperparameters of our evidential distribution. We impose priors during training such that the model is penalized when its predicted evidence is not aligned with the correct output. Thus the model estimates not only the probabilistic mean and variance of our target but also the underlying uncertainty associated with each of those parameters. We observe that our evidential regression method learns well-calibrated measures of uncertainty on various benchmarks, scales to complex computer vision tasks, and is robust to adversarial input perturbations. 
+",/pdf/b0dd6c91b99250caa18315b28e61ded20499c327.pdf,ICLR,2020,"Fast, calibrated uncertainty estimation for neural networks without sampling" +KcImcc3j-qS,f97pBzDbDHD,1601310000000.0,1614990000000.0,1075,Fast Predictive Uncertainty for Classification with Bayesian Deep Networks,"[""~Marius_Hobbhahn1"", ""~Agustinus_Kristiadi1"", ""~Philipp_Hennig1""]","[""Marius Hobbhahn"", ""Agustinus Kristiadi"", ""Philipp Hennig""]","[""Bayesian Deep Learning"", ""Approximate Inference""]","In Bayesian Deep Learning, distributions over the output of classification neural networks are approximated by first constructing a Gaussian distribution over the weights, then sampling from it to receive a distribution over the categorical output distribution. This is costly. We reconsider old work to construct a Dirichlet approximation of this output distribution, which yields an analytic map between Gaussian distributions in logit space and Dirichlet distributions (the conjugate prior to the categorical) in the output space. We argue that the resulting Dirichlet distribution has theoretical and practical advantages, in particular, more efficient computation of the uncertainty estimate, scaling to large datasets and networks like ImageNet and DenseNet. We demonstrate the use of this Dirichlet approximation by using it to construct a lightweight uncertainty-aware output ranking for the ImageNet setup.",/pdf/1f4878fd0266b1a7e51dced2cb4182777f19ec02.pdf,ICLR,2021,We re-use an old method (the Laplace Bridge) in the context of Bayesian Deep Learning to improve computation time of the posterior predictive significantly for all Networks that have a Gaussian over the logits. 
+BJC_jUqxe,,1478290000000.0,1487630000000.0,319,A STRUCTURED SELF-ATTENTIVE SENTENCE EMBEDDING,"[""lin.zhouhan@gmail.com"", ""mfeng@us.ibm.com"", ""cicerons@us.ibm.com"", ""yum@us.ibm.com"", ""bingxia@us.ibm.com"", ""zhou@us.ibm.com"", ""yoshua.bengio@umontreal.ca""]","[""Zhouhan Lin"", ""Minwei Feng"", ""Cicero Nogueira dos Santos"", ""Mo Yu"", ""Bing Xiang"", ""Bowen Zhou"", ""Yoshua Bengio""]","[""Natural language processing"", ""Deep learning"", ""Supervised Learning""]","This paper proposes a new model for extracting an interpretable sentence embedding by introducing self-attention. Instead of using a vector, we use a 2-D matrix to represent the embedding, with each row of the matrix attending on a different part of the sentence. We also propose a self-attention mechanism and a special regularization term for the model. As a side effect, the embedding comes with an easy way of visualizing what specific parts of the sentence are encoded into the embedding. We evaluate our model on 3 different tasks: author profiling, sentiment classification and textual entailment. Results show that our model yields a significant performance gain compared to other sentence embedding methods in all of the 3 tasks.",/pdf/04457e6e3ea211450caf6a06cf0981744ba33849.pdf,ICLR,2017,a new model for extracting an interpretable sentence embedding by introducing self-attention and matrix representation. +PEcNk5Bad7z,DFhZQVjAyin,1601310000000.0,1614990000000.0,461,Learning Irreducible Representations of Noncommutative Lie Groups,"[""~Noah_Shutty1"", ""casimir.wierzynski@intel.com""]","[""Noah Shutty"", ""Casimir Wierzynski""]","[""equivariance"", ""object tracking"", ""equivariant neural networks"", ""deep learning"", ""point cloud"", ""lie group"", ""lie algebra"", ""lorentz group"", ""poincar\u00e9 group""]","Recent work has constructed neural networks that are equivariant to continuous symmetry groups such as 2D and 3D rotations. 
This is accomplished using explicit group representations to derive the equivariant kernels and nonlinearities. We present two contributions motivated by frontier applications of equivariance beyond rotations and translations. First, we relax the requirement for explicit Lie group representations, presenting a novel algorithm that finds irreducible representations of noncommutative Lie groups given only the structure constants of the associated Lie algebra. Second, we demonstrate that Lorentz-equivariance is a useful prior for object-tracking tasks and construct the first object-tracking model equivariant to the Poincaré group.",/pdf/300467f87b4ce05bc5cfb25b02dac8d598a4c6dd.pdf,ICLR,2021,We automate an essential task in equivariant deep learning and apply Lorentz-equivariance to object tracking. +SyyGPP0TZ,B11fDPAab,1508960000000.0,1518730000000.0,94,Regularizing and Optimizing LSTM Language Models,"[""smerity@smerity.com"", ""keskar.nitish@u.northwestern.edu"", ""richard@socher.org""]","[""Stephen Merity"", ""Nitish Shirish Keskar"", ""Richard Socher""]","[""language model"", ""LSTM"", ""regularization"", ""optimization"", ""ASGD"", ""dropconnect""]","In this paper, we consider the specific problem of word-level language modeling and investigate strategies for regularizing and optimizing LSTM-based models. We propose the weight-dropped LSTM, which uses DropConnect on hidden-to-hidden weights, as a form of recurrent regularization. Further, we introduce NT-ASGD, a non-monotonically triggered (NT) variant of the averaged stochastic gradient method (ASGD), wherein the averaging trigger is determined using a NT condition as opposed to being tuned by the user. Using these and other regularization strategies, our ASGD Weight-Dropped LSTM (AWD-LSTM) achieves state-of-the-art word level perplexities on two data sets: 57.3 on Penn Treebank and 65.8 on WikiText-2. 
In exploring the effectiveness of a neural cache in conjunction with our proposed model, we achieve an even lower state-of-the-art perplexity of 52.8 on Penn Treebank and 52.0 on WikiText-2. We also explore the viability of the proposed regularization and optimization strategies in the context of the quasi-recurrent neural network (QRNN) and demonstrate comparable performance to the AWD-LSTM counterpart. The code for reproducing the results is open sourced and is available at https://github.com/salesforce/awd-lstm-lm.",/pdf/8f5893fee58907e88ff15962c29dc93ee1bac2df.pdf,ICLR,2018,Effective regularization and optimization strategies for LSTM-based language models achieves SOTA on PTB and WT2. +BkgosiRcKm,HylXU3jqK7,1538090000000.0,1545360000000.0,650,Deep Recurrent Gaussian Process with Variational Sparse Spectrum Approximation,"[""foell@mathematik.uni-stuttgart.de"", ""haasdonk@mathematik.uni-stuttgart.de"", ""markus.hanselmann@etas.com"", ""holger.ulmer@etas.com""]","[""Roman F\u00f6ll"", ""Bernard Haasdonk"", ""Markus Hanselmann"", ""Holger Ulmer""]","[""Deep Gaussian Process Model"", ""Recurrent Model"", ""State-Space Model"", ""Nonlinear system identification"", ""Dynamical modeling""]","Modeling sequential data has become more and more important in practice. Some applications are autonomous driving, virtual sensors and weather forecasting. To model such systems, so called recurrent models are frequently used. In this paper we introduce several new Deep Recurrent Gaussian Process (DRGP) models based on the Sparse Spectrum Gaussian Process (SSGP) and the improved version, called Variational Sparse Spectrum Gaussian Process (VSSGP). We follow the recurrent structure given by an existing DRGP based on a specific variational sparse Nyström approximation, the recurrent Gaussian Process (RGP). Similar to previous work, we also variationally integrate out the input-space and hence can propagate uncertainty through the Gaussian Process (GP) layers. 
Our approach can deal with a larger class of covariance functions than the RGP, because its spectral nature allows variational integration in all stationary cases. Furthermore, we combine the (Variational) Sparse Spectrum ((V)SS) approximations with a well known inducing-input regularization framework. For the DRGP extension of these combined approximations and the simple (V)SS approximations an optimal variational distribution exists. We improve over current state of the art methods in prediction accuracy for experimental data-sets used for their evaluation and introduce a new data-set for engine control, named Emission.",/pdf/37b4cfef6eb4b737fdfdc6707ba01fa59cfe44be.pdf,ICLR,2019,Modeling time-series with several Gaussian Processes in a row via a specific Variational Sparse Spectrum Approximation +xboZWqM_ELA,5HW9SDnzKqz,1601310000000.0,1614990000000.0,561,Debiased Graph Neural Networks with Agnostic Label Selection Bias,"[""~Shaohua_Fan1"", ""xiaowang@bupt.edu.cn"", ""shichuan@bupt.edu.cn"", ""~Kun_Kuang1"", ""nianliu@bupt.edu.cn"", ""wangbai@bupt.edu.cn""]","[""Shaohua Fan"", ""Xiao Wang"", ""Chuan Shi"", ""Kun Kuang"", ""Nian Liu"", ""Bai Wang""]","[""GRAPH NEURAL NETWORKS"", ""LABEL SELECTION BIAS""]","Most existing Graph Neural Networks (GNNs) are proposed without considering the selection bias in data, i.e., the inconsistent distribution between the training set with test set. In reality, the test data is not even available during the training process, making selection bias agnostic. Training GNNs with biased selected nodes leads to significant parameter estimation bias and greatly impacts the generalization ability on test nodes. In this paper, we first present an experimental investigation, which clearly shows that the selection bias drastically hinders the generalization ability of GNNs, and theoretically prove that the selection bias will cause the biased estimation on GNN parameters. 
Then to remove the bias in GNN estimation, we propose a novel Debiased Graph Neural Networks (DGNN) with a differentiated decorrelation regularizer. The differentiated decorrelation regularizer estimates a sample weight for each labeled node such that the spurious correlation of learned embeddings could be eliminated. We analyze the regularizer in causal view and it motivates us to differentiate the weights of the variables based on their contribution on the confounding bias. Then, these sample weights are used for reweighting GNNs to eliminate the estimation bias, thus help to improve the stability of prediction on unknown test nodes. Comprehensive experiments are conducted on several challenging graph datasets with two kinds of label selection bias. The results well verify that our proposed model outperforms the state-of-the-art methods and DGNN is a flexible framework to enhance existing GNNs.",/pdf/538fa3073a8c62b947c3a467363c5e8fce25c5f3.pdf,ICLR,2021,It is the first work to propose and solve the agnostic label selection bias problem in graph neural networks. +Bkl87h09FX,r1gsiO4BF7,1538090000000.0,1545360000000.0,1360,Looking for ELMo's friends: Sentence-Level Pretraining Beyond Language Modeling,"[""bowman@nyu.edu"", ""ellie_pavlick@brown.edu"", ""egrave@fb.com"", ""vandurme@cs.jhu.edu"", ""alexwang@nyu.edu"", ""jan.hula21@gmail.com"", ""paxia@jhu.edu"", ""raghu1991.p@gmail.com"", ""tom.mccoy@jhu.edu"", ""romapatel@brown.edu"", ""n.kim@jhu.edu"", ""iftenney@google.com"", ""huangyi@us.ibm.com"", ""yukatherin@fb.com"", ""jinxx596@d.umn.edu"", ""bchen6@swarthmore.edu""]","[""Samuel R. Bowman"", ""Ellie Pavlick"", ""Edouard Grave"", ""Benjamin Van Durme"", ""Alex Wang"", ""Jan Hula"", ""Patrick Xia"", ""Raghavendra Pappagari"", ""R. 
Thomas McCoy"", ""Roma Patel"", ""Najoung Kim"", ""Ian Tenney"", ""Yinghui Huang"", ""Katherin Yu"", ""Shuning Jin"", ""Berlin Chen""]","[""natural language processing"", ""transfer learning"", ""multitask learning""]","Work on the problem of contextualized word representation—the development of reusable neural network components for sentence understanding—has recently seen a surge of progress centered on the unsupervised pretraining task of language modeling with methods like ELMo (Peters et al., 2018). This paper contributes the first large-scale systematic study comparing different pretraining tasks in this context, both as complements to language modeling and as potential alternatives. The primary results of the study support the use of language modeling as a pretraining task and set a new state of the art among comparable models using multitask learning with language models. However, a closer look at these results reveals worryingly strong baselines and strikingly varied results across target tasks, suggesting that the widely-used paradigm of pretraining and freezing sentence encoders may not be an ideal platform for further work. +",/pdf/05feb28cf06d05de51879e7f22f79a7fc78e8272.pdf,ICLR,2019,"We compare many tasks and task combinations for pretraining sentence-level BiLSTMs for NLP tasks. Language modeling is the best single pretraining task, but simple baselines also do well." +BJe-91BtvH,BJlouuAOvH,1569440000000.0,1583910000000.0,1866,Masked Based Unsupervised Content Transfer,"[""sagiebenaim@gmail.com"", ""ron.mokady@gmail.com"", ""wolf@fb.com"", ""amit.bermano@gmail.com""]","[""Ron Mokady"", ""Sagie Benaim"", ""Lior Wolf"", ""Amit Bermano""]",[],"We consider the problem of translating, in an unsupervised manner, between two domains where one contains some additional information compared to the other. 
The proposed method disentangles the common and separate parts of these domains and, through the generation of a mask, focuses the attention of the underlying network to the desired augmentation alone, without wastefully reconstructing the entire target. This enables state-of-the-art quality and variety of content translation, as demonstrated through extensive quantitative and qualitative evaluation. Our method is also capable of adding the separate content of different guide images and domains as well as remove existing separate content. Furthermore, our method enables weakly-supervised semantic segmentation of the separate part of each domain, where only class labels are provided. Our code is available at https://github.com/rmokady/mbu-content-tansfer. +",/pdf/b13c0059577e9b0336cdb2648d98320b62954582.pdf,ICLR,2020, +HkehD3VtvS,SkxsatLhBB,1569440000000.0,1577170000000.0,20,"Deep Reasoning Networks: Thinking Fast and Slow, for Pattern De-mixing","[""di@cs.cornell.edu"", ""bywbilly@cs.cornell.edu"", ""wzhao@cs.cornell.edu"", ""ament@cs.cornell.edu"", ""gregoire@caltech.edu"", ""gomes@cs.cornell.edu""]","[""Di Chen"", ""Yiwei Bai"", ""Wenting Zhao"", ""Sebastian Ament"", ""John M. Gregoire"", ""Carla P. Gomes""]","[""Deep Reasoning Network"", ""Pattern De-mixing""]","We introduce Deep Reasoning Networks (DRNets), an end-to-end framework that combines deep learning with reasoning for solving pattern de-mixing problems, typically in an unsupervised or weakly-supervised setting. DRNets exploit problem structure and prior knowledge by tightly combining logic and constraint reasoning with stochastic-gradient-based neural network optimization. We illustrate the power of DRNets on de-mixing overlapping hand-written Sudokus (Multi-MNIST-Sudoku) and on a substantially more complex task in scientific discovery that concerns inferring crystal structures of materials from X-ray diffraction data (Crystal-Structure-Phase-Mapping). 
DRNets significantly outperform the state of the art and experts' capabilities on Crystal-Structure-Phase-Mapping, recovering more precise and physically meaningful crystal structures. On Multi-MNIST-Sudoku, DRNets perfectly recovered the mixed Sudokus' digits, with 100% digit accuracy, outperforming the supervised state-of-the-art MNIST de-mixing models.",/pdf/910209ff687cf6f78ea3b184527b8252031ab1d3.pdf,ICLR,2020,"We introduce Deep Reasoning Networks (DRNets), an end-to-end framework that combines deep learning with reasoning for solving pattern de-mixing tasks, typically in an unsupervised or weakly-supervised setting. " +g75kUi1jAc_,1Yo7B2LM4o8,1601310000000.0,1614990000000.0,1283,WAFFLe: Weight Anonymized Factorization for Federated Learning,"[""~Weituo_Hao1"", ""~Nikhil_Mehta1"", ""~Kevin_J_Liang1"", ""~Pengyu_Cheng1"", ""~Mostafa_El-Khamy1"", ""~Lawrence_Carin2""]","[""Weituo Hao"", ""Nikhil Mehta"", ""Kevin J Liang"", ""Pengyu Cheng"", ""Mostafa El-Khamy"", ""Lawrence Carin""]","[""Federated Learning"", ""Fairness"", ""Privacy""]","In domains where data are sensitive or private, there is great value in methods that can learn in a distributed manner without the data ever leaving the local devices. In light of this need, federated learning has emerged as a popular training paradigm. However, many federated learning approaches trade transmitting data for communicating updated weight parameters for each local device. Therefore, a successful breach that would have otherwise directly compromised the data instead grants whitebox access to the local model, which opens the door to a number of attacks, including exposing the very data federated learning seeks to protect. Additionally, in distributed scenarios, individual client devices commonly exhibit high statistical heterogeneity. Many common federated approaches learn a single global model; while this may do well on average, performance degrades when the i.i.d. 
assumption is violated, underfitting individuals further from the mean and raising questions of fairness. To address these issues, we propose Weight Anonymized Factorization for Federated Learning (WAFFLe), an approach that combines the Indian Buffet Process with a shared dictionary of weight factors for neural networks. Experiments on MNIST, FashionMNIST, and CIFAR-10 demonstrate WAFFLe's significant improvement to local test performance and fairness while simultaneously providing an extra layer of security.",/pdf/181a39c92de7bea29245f4b73eca8539b7d0e465.pdf,ICLR,2021,"We propose WAFFLe, combining IBP with a shared dictionary of weight factors to model statistical heterogeneity, leading to a fairer FL method with better data security." +S1NHaMW0b,B15GafZCW,1509140000000.0,1518730000000.0,1003,ShakeDrop regularization,"[""yamada@m.cs.osakafu-u.ac.jp"", ""masa@cs.osakafu-u.ac.jp"", ""kise@cs.osakafu-u.ac.jp""]","[""Yoshihiro Yamada"", ""Masakazu Iwamura"", ""Koichi Kise""]",[],"This paper proposes a powerful regularization method named \textit{ShakeDrop regularization}. +ShakeDrop is inspired by Shake-Shake regularization that decreases error rates by disturbing learning. +While Shake-Shake can be applied to only ResNeXt which has multiple branches, ShakeDrop can be applied to not only ResNeXt but also ResNet, Wide ResNet and PyramidNet in a memory efficient way. +Important and interesting feature of ShakeDrop is that it strongly disturbs learning by multiplying even a negative factor to the output of a convolutional layer in the forward training pass. +The effectiveness of ShakeDrop is confirmed by experiments on CIFAR-10/100 and Tiny ImageNet datasets. 
+",/pdf/9ee17c9a4b8705ee53c2cda971bb1e0d1c4d142d.pdf,ICLR,2018, +igkmo23BgzB,eQra2SpU7ew,1601310000000.0,1614990000000.0,2000,End-to-end Quantized Training via Log-Barrier Extensions,"[""~Juncheng_B_Li1"", ""~Shuhui_Qu1"", ""~Xinjian_Li2"", ""~Emma_Strubell1"", ""~Florian_Metze1""]","[""Juncheng B Li"", ""Shuhui Qu"", ""Xinjian Li"", ""Emma Strubell"", ""Florian Metze""]","[""Quantization"", ""Constrained Optimization"", ""Mu-law"", ""8-bit training""]","Quantization of neural network parameters and activations has emerged as a successful approach to reducing the model size and inference time on hardware that supports native low-precision arithmetic. Fully quantized training would facilitate further computational speed-ups as well as enable model training on embedded devices, a feature that would alleviate privacy concerns resulting from the transfer of sensitive data and models that is necessitated by off-device training. Existing approaches to quantization-aware training (QAT) perform “fake” quantization in the forward pass in order to learn model parameters that will perform well when quantized, but rely on higher precision variables to avoid overflow in large matrix multiplications, which is unsuitable for training on fully low-precision (e.g. 8-bit) hardware. To enable fully end-to-end quantized training, we propose Log Barrier Tail-bounded Quantization (LogBTQ). LogBTQ introduces a loss term, inspired by the log-barrier for constrained optimization, that enforces soft constraints on the range of values that model parameters can take on. By constraining and sparsifying model parameters, activations and inputs, our approach eliminates overflow in practice, allowing for fully quantized 8-bit training of deep neural network models.
We show that models trained using our approach achieve results competitive with state-of-the-art full-precision networks on the MNIST, CIFAR-10 and ImageNet classification benchmarks.",/pdf/c1cb38d33d7e5bdc6c2af131f3e589700e945491.pdf,ICLR,2021,fully 8-bit quantized training using log-barrier constraint and MU8(8-bit) encoding +B1ElR4cgg,,1478280000000.0,1487690000000.0,222,Adversarially Learned Inference,"[""vincent.dumoulin@umontreal.ca"", ""ishmael.belghazi@gmail.com"", ""poole@cs.stanford.edu"", ""alex6200@gmail.com"", ""martinarjovsky@gmail.com"", ""oli.mastro@gmail.com"", ""aaron.courville@gmail.com""]","[""Vincent Dumoulin"", ""Ishmael Belghazi"", ""Ben Poole"", ""Alex Lamb"", ""Martin Arjovsky"", ""Olivier Mastropietro"", ""Aaron Courville""]","[""Computer vision"", ""Deep learning"", ""Unsupervised Learning"", ""Semi-Supervised Learning""]","We introduce the adversarially learned inference (ALI) model, which jointly +learns a generation network and an inference network using an adversarial +process. The generation network maps samples from stochastic latent variables to +the data space while the inference network maps training examples in data space +to the space of latent variables. An adversarial game is cast between these two +networks and a discriminative network that is trained to distinguish between +joint latent/data-space samples from the generative network and joint samples +from the inference network. We illustrate the ability of the model to learn +mutually coherent inference and generation networks through the inspections of +model samples and reconstructions and confirm the usefulness of the learned +representations by obtaining a performance competitive with other recent +approaches on the semi-supervised SVHN task.",/pdf/7a3fa67c5f7f97e5095d822afd76a76eaf6b9551.pdf,ICLR,2017,We present an adversarially trained generative model with an inference network. Sample quality is high. Competitive semi-supervised results are achieved.
+rJ4qXnCqFX,HklagkpcKm,1538090000000.0,1545360000000.0,1387,Probabilistic Knowledge Graph Embeddings,"[""farnood.salehi@epfl.ch"", ""robert.bamler@gmail.com"", ""stephan.mandt@gmail.com""]","[""Farnood Salehi"", ""Robert Bamler"", ""Stephan Mandt""]","[""knowledge graph"", ""variational inference"", ""probabilistic models"", ""representation learning""]","We develop a probabilistic extension of state-of-the-art embedding models for link prediction in relational knowledge graphs. Knowledge graphs are collections of relational facts, where each fact states that a certain relation holds between two entities, such as people, places, or objects. We argue that knowledge graphs should be treated within a Bayesian framework because even large knowledge graphs typically contain only few facts per entity, leading effectively to a small data problem where parameter uncertainty matters. We introduce a probabilistic reinterpretation of the DistMult (Yang et al., 2015) and ComplEx (Trouillon et al., 2016) models and employ variational inference to estimate a lower bound on the marginal likelihood of the data. We find that the main benefit of the Bayesian approach is that it allows for efficient, gradient based optimization over hyperparameters, which would lead to divergences in a non-Bayesian treatment. Models with such learned hyperparameters improve over the state-of-the-art by a significant margin, as we demonstrate on several benchmarks.",/pdf/3185dc96d7609dca3448484dd1104906ee592300.pdf,ICLR,2019,Scalable hyperparameter learning for knowledge graph embedding models using variational EM +S1HcOI5le,,1478290000000.0,1481660000000.0,304,OMG: Orthogonal Method of Grouping With Application of K-Shot Learning,"[""haoqif@andrew.cmu.edu"", ""kkitani@cs.cmu.edu""]","[""Haoqi Fan"", ""Yu Zhang"", ""Kris M. Kitani""]",[],"Training a classifier with only a few examples remains a significant barrier when using neural networks with large number of parameters. 
Though various specialized network architectures have been proposed for these k-shot learning tasks to avoid overfitting, a question remains: is there a generalizable framework for the k-shot learning problem that can leverage existing deep models as well as avoid model overfitting? In this paper, we proposed a generalizable k-shot learning framework that can be used on any pre-trained network, by grouping network parameters to produce a low-dimensional representation of the parameter space. The grouping of the parameters is based on an orthogonal decomposition of the parameter space. To avoid overfitting, groups of parameters will be updated together during the k-shot training process. Furthermore, this framework can be integrated with any existing popular deep neural networks such as VGG, GoogleNet, ResNet, without any changes in the original network structure or any sacrifices in performance. We evaluate our framework on a wide range of intra/inter-dataset k-shot learning tasks and show state-of-the-art performance.",/pdf/369a5da0a0a4abeec41c3541591979aa7cdff827.pdf,ICLR,2017, +SkHDoG-Cb,rJOmsMWRb,1509140000000.0,1519360000000.0,930,Simulated+Unsupervised Learning With Adaptive Data Generation and Bidirectional Mappings,"[""kw1jjang@gmail.com"", ""gnsrla12@kaist.ac.kr"", ""chsuh@kaist.ac.kr""]","[""Kangwook Lee"", ""Hoon Kim"", ""Changho Suh""]",[],"Collecting a large dataset with high quality annotations is expensive and time-consuming. Recently, Shrivastava et al. (2017) propose Simulated+Unsupervised (S+U) learning: It first learns a mapping from synthetic data to real data, translates a large amount of labeled synthetic data to the ones that resemble real data, and then trains a learning model on the translated data. Bousmalis et al. (2017) propose a similar framework that jointly trains a translation mapping and a learning model. 
+While these algorithms are shown to achieve state-of-the-art performance on various tasks, they may have room for improvement, as they do not fully leverage the flexibility of the data simulation process and consider only the forward (synthetic to real) mapping. Inspired by this limitation, we propose a new S+U learning algorithm, which fully leverages the flexibility of data simulators and bidirectional mappings between synthetic data and real data. We show that our approach achieves improved performance on the gaze estimation task, outperforming (Shrivastava et al., 2017).",/pdf/decd8235d55231c06a5f59e936a69ddd75a8b164.pdf,ICLR,2018, +eo6U4CAwVmg,A0mturSAr1J,1601310000000.0,1615970000000.0,2902,Training GANs with Stronger Augmentations via Contrastive Discriminator,"[""~Jongheon_Jeong1"", ""~Jinwoo_Shin1""]","[""Jongheon Jeong"", ""Jinwoo Shin""]","[""generative adversarial networks"", ""contrastive learning"", ""data augmentation"", ""visual representation learning"", ""unsupervised learning""]","Recent works in Generative Adversarial Networks (GANs) are actively revisiting various data augmentation techniques as an effective way to prevent discriminator overfitting. It is still unclear, however, which augmentations could actually improve GANs, and in particular, how to apply a wider range of augmentations in training. In this paper, we propose a novel way to address these questions by incorporating a recent contrastive representation learning scheme into the GAN discriminator, coined ContraD. This ""fusion"" enables the discriminators to work with much stronger augmentations without increasing their training instability, thereby preventing the discriminator overfitting issue in GANs more effectively.
Even better, we observe that the contrastive learning itself also benefits from our GAN training, i.e., by maintaining discriminative features between real and fake samples, suggesting a strong coherence between the two worlds: good contrastive representations are also good for GAN discriminators, and vice versa. Our experimental results show that GANs with ContraD consistently improve FID and IS compared to other recent techniques incorporating data augmentations, still maintaining highly discriminative features in the discriminator in terms of the linear evaluation. Finally, as a byproduct, we also show that our GANs trained in an unsupervised manner (without labels) can induce many conditional generative models via a simple latent sampling, leveraging the learned features of ContraD. Code is available at https://github.com/jh-jeong/ContraD.",/pdf/2d308c93802630f8c000471788307eb87a9027fd.pdf,ICLR,2021,"We propose a novel discriminator of GAN showing that contrastive representation learning, e.g., SimCLR, and GAN can benefit each other when they are jointly trained. " +H1loF2NFwr,HJe5nDB6IB,1569440000000.0,1583910000000.0,91,Evaluating The Search Phase of Neural Architecture Search,"[""kaicheng.yu@epfl.ch"", ""sciutochristian@gmail.com"", ""martin.jaggi@epfl.ch"", ""claudiu.musat@swisscom.com"", ""mathieu.salzmann@epfl.ch""]","[""Kaicheng Yu"", ""Christian Sciuto"", ""Martin Jaggi"", ""Claudiu Musat"", ""Mathieu Salzmann""]","[""Neural architecture search"", ""parameter sharing"", ""random search"", ""evaluation framework""]"," +Neural Architecture Search (NAS) aims to facilitate the design of deep networks for new tasks. Existing techniques rely on two stages: searching over the architecture space and validating the best architecture. NAS algorithms are currently compared solely based on their results on the downstream task. While intuitive, this fails to explicitly evaluate the effectiveness of their search strategies. 
In this paper, we propose to evaluate the NAS search phase. +To this end, we compare the quality of the solutions obtained by NAS search policies with that of random architecture selection. We find that: (i) On average, the state-of-the-art NAS algorithms perform similarly to the random policy; (ii) the widely-used weight sharing strategy degrades the ranking of the NAS candidates to the point of not reflecting their true performance, thus reducing the effectiveness of the search process. +We believe that our evaluation framework will be key to designing NAS strategies that consistently discover architectures superior to random ones.",/pdf/7f33208e11494d72d68aa514cdd657396e02728f.pdf,ICLR,2020,We empirically disprove a fundamental hypothesis of the widely-adopted weight sharing strategy in neural architecture search and explain why the state-of-the-arts NAS algorithms performs similarly to random search. +Bke9u1HFwB,ryxw8GAdvr,1569440000000.0,1577170000000.0,1814,Do recent advancements in model-based deep reinforcement learning really improve data efficiency?,"[""k.kielak@bham.ac.uk""]","[""Kacper Piotr Kielak""]","[""deep learning"", ""reinforcement learning"", ""data efficiency"", ""DQN"", ""Rainbow"", ""SimPLe""]","Reinforcement learning (RL) has seen great advancements in the past few years. Nevertheless, the consensus among the RL community is that currently used model-free methods, despite all their benefits, suffer from extreme data inefficiency. To circumvent this problem, novel model-based approaches were introduced that often claim to be much more efficient than their model-free counterparts. In this paper, however, we demonstrate that the state-of-the-art model-free Rainbow DQN algorithm can be trained using a much smaller number of samples than it is commonly reported. 
By simply allowing the algorithm to execute network updates more frequently we manage to reach similar or better results than existing model-based techniques, at a fraction of complexity and computational costs. Furthermore, based on the outcomes of the study, we argue that the agent similar to the modified Rainbow DQN that is presented in this paper should be used as a baseline for any future work aimed at improving sample efficiency of deep reinforcement learning.",/pdf/8b2b49acf4b1cc4124eb9a228de0ae68029c92bc.pdf,ICLR,2020,Recent advancements in data-efficient model-based reinforcement learning are not any more data efficient than existing model-free approaches. +BkepbpNFwr,BJednad8wS,1569440000000.0,1583910000000.0,392,Progressive Memory Banks for Incremental Domain Adaptation,"[""nasghar@uwaterloo.ca"", ""doublepower.mou@gmail.com"", ""kaselby@uwaterloo.ca"", ""kevin.pantasdo@uwaterloo.ca"", ""ppoupart@uwaterloo.ca"", ""jiang.xin@huawei.com""]","[""Nabiha Asghar"", ""Lili Mou"", ""Kira A. Selby"", ""Kevin D. Pantasdo"", ""Pascal Poupart"", ""Xin Jiang""]","[""natural language processing"", ""domain adaptation""]","This paper addresses the problem of incremental domain adaptation (IDA) in natural language processing (NLP). We assume each domain comes one after another, and that we could only access data in the current domain. The goal of IDA is to build a unified model performing well on all the domains that we have encountered. We adopt the recurrent neural network (RNN) widely used in NLP, but augment it with a directly parameterized memory bank, which is retrieved by an attention mechanism at each step of RNN transition. The memory bank provides a natural way of IDA: when adapting our model to a new domain, we progressively add new slots to the memory bank, which increases the number of parameters, and thus the model capacity. We learn the new memory slots and fine-tune existing parameters by back-propagation. 
Experimental results show that our approach achieves significantly better performance than fine-tuning alone. Compared with expanding hidden states, our approach is more robust for old domains, shown by both empirical and theoretical results. Our model also outperforms previous work of IDA including elastic weight consolidation and progressive neural networks in the experiments.",/pdf/09cab4009ea99ae5ec0339eca94cdcb853cfba52.pdf,ICLR,2020,"We present a neural memory-based architecture for incremental domain adaptation, and provide theoretical and empirical results." +Byg5flHFDr,rkx_lmlKwr,1569440000000.0,1577170000000.0,2183,EvoNet: A Neural Network for Predicting the Evolution of Dynamic Graphs,"[""changmin.wu@polytechnique.edu"", ""giannisnik@hotmail.com"", ""mvazirg@lix.polytechnique.fr""]","[""Changmin Wu"", ""Giannis Nikolentzos"", ""Michalis Vazirgiannis""]","[""temporal graphs"", ""graph neural network"", ""graph generative model"", ""graph topology prediction""]","Neural networks for structured data like graphs have been studied extensively in recent years. +To date, the bulk of research activity has focused mainly on static graphs. +However, most real-world networks are dynamic since their topology tends to change over time. +Predicting the evolution of dynamic graphs is a task of high significance in the area of graph mining. +Despite its practical importance, the task has not been explored in depth so far, mainly due to its challenging nature. +In this paper, we propose a model that predicts the evolution of dynamic graphs. +Specifically, we use a graph neural network along with a recurrent architecture to capture the temporal evolution patterns of dynamic graphs. +Then, we employ a generative model which predicts the topology of the graph at the next time step and constructs a graph instance that corresponds to that topology. 
+We evaluate the proposed model on several artificial datasets following common network evolving dynamics, as well as on real-world datasets. +Results demonstrate the effectiveness of the proposed model. ",/pdf/d18adcbbd9667d537eeac3d834c059c183da5bdb.pdf,ICLR,2020,"Combining graph neural networks and the RNN graph generative model, we propose a novel architecture that is able to learn from a sequence of evolving graphs and predict the graph topology evolution for the future timesteps" +ByJIWUnpW,S1CBZL2T-,1508820000000.0,1519710000000.0,60,Automatically Inferring Data Quality for Spatiotemporal Forecasting,"[""sungyons@usc.edu"", ""mohegh@usc.edu"", ""banweiss@usc.edu"", ""yanliu.cs@usc.edu""]","[""Sungyong Seo"", ""Arash Mohegh"", ""George Ban-Weiss"", ""Yan Liu""]","[""spatiotemporal data"", ""graph convolutional network"", ""data quality""]","Spatiotemporal forecasting has become an increasingly important prediction task in machine learning and statistics due to its vast applications, such as climate modeling, traffic prediction, video caching predictions, and so on. While numerous studies have been conducted, most existing works assume that the data from different sources or across different locations are equally reliable. Due to cost, accessibility, or other factors, it is inevitable that the data quality could vary, which introduces significant biases into the model and leads to unreliable prediction results. The problem could be exacerbated in black-box prediction models, such as deep neural networks. In this paper, we propose a novel solution that can automatically infer data quality levels of different sources through local variations of spatiotemporal signals without explicit labels. Furthermore, we integrate the estimate of data quality level with graph convolutional networks to exploit their efficient structures. 
We evaluate our proposed method on forecasting temperatures in Los Angeles.",/pdf/2034c5db32d7f1ca4da1de1573b67e85b0f48aee.pdf,ICLR,2018,We propose a method that infers the time-varying data quality level for spatiotemporal forecasting without explicitly assigned labels. +S1esMkHYPr,ryl_ImhdDr,1569440000000.0,1583910000000.0,1591,GraphAF: a Flow-based Autoregressive Model for Molecular Graph Generation,"[""chenceshi@pku.edu.cn"", ""mkxu@apex.sjtu.edu.cn"", ""zhaocheng.zhu@umontreal.ca"", ""wnzhang@sjtu.edu.cn"", ""mzhang_cs@pku.edu.cn"", ""jian.tang@hec.ca""]","[""Chence Shi*"", ""Minkai Xu*"", ""Zhaocheng Zhu"", ""Weinan Zhang"", ""Ming Zhang"", ""Jian Tang""]","[""Molecular graph generation"", ""deep generative models"", ""normalizing flows"", ""autoregressive models""]","Molecular graph generation is a fundamental problem for drug discovery and has been attracting growing attention. The problem is challenging since it requires not only generating chemically valid molecular structures but also optimizing their chemical properties in the meantime. Inspired by the recent progress in deep generative models, in this paper we propose a flow-based autoregressive model for graph generation called GraphAF. GraphAF combines the advantages of both autoregressive and flow-based approaches and enjoys: (1) high model flexibility for data density estimation; (2) efficient parallel computation for training; (3) an iterative sampling process, which allows leveraging chemical domain knowledge for valency checking. Experimental results show that GraphAF is able to generate 68\% chemically valid molecules even without chemical knowledge rules and 100\% valid molecules with chemical rules. The training process of GraphAF is two times faster than the existing state-of-the-art approach GCPN. 
After fine-tuning the model for goal-directed property optimization with reinforcement learning, GraphAF achieves state-of-the-art performance on both chemical property optimization and constrained property optimization. ",/pdf/e2ef8a6407f03fbdb526bf73073ba5c5c4d81678.pdf,ICLR,2020,A flow-based autoregressive model for molecular graph generation. Reaching state-of-the-art results on molecule generation and properties optimization. +V8YXffoDUSa,i7s6YrU_8U,1601310000000.0,1614990000000.0,3597,Iterative convergent computation is not a useful inductive bias for ResNets,"[""~Samuel_Lippl1"", ""benjamin.peters@columbia.edu"", ""~Nikolaus_Kriegeskorte3""]","[""Samuel Lippl"", ""Benjamin Peters"", ""Nikolaus Kriegeskorte""]","[""Residual neural networks"", ""Recurrent neural networks"", ""Computer vision""]","Recent work has suggested that feedforward residual neural networks (ResNets) approximate iterative recurrent computations. Iterative computations are useful in many domains, so they might provide good solutions for neural networks to learn. Here we quantify the degree to which ResNets learn iterative solutions and introduce a regularization approach that encourages learning of iterative solutions. Iterative methods are characterized by two properties: iteration and convergence. To quantify these properties, we define three indices of iterative convergence. Consistent with previous work, we show that, even though ResNets can express iterative solutions, they do not learn them when trained conventionally on computer vision tasks. We then introduce regularizations to encourage iterative convergent computation and test whether this provides a useful inductive bias. To make the networks more iterative, we manipulate the degree of weight sharing across layers using soft gradient coupling. 
This new method provides a form of recurrence regularization and can interpolate smoothly between an ordinary ResNet and a ``recurrent"" ResNet (i.e., one that uses identical weights across layers and thus could be physically implemented with a recurrent network computing the successive stages iteratively across time). To make the networks more convergent we impose a Lipschitz constraint on the residual functions using spectral normalization. The three indices of iterative convergence reveal that the gradient coupling and the Lipschitz constraint succeed at making the networks iterative and convergent, respectively. However, neither recurrence regularization nor spectral normalization improve classification accuracy on standard visual recognition tasks (MNIST, CIFAR-10, CIFAR-100) or on challenging recognition tasks with partial occlusions (Digitclutter). Iterative convergent computation, in these tasks, does not provide a useful inductive bias for ResNets.",/pdf/7cd77fb7aa58f58b772a6a2a88bc185c97a302f5.pdf,ICLR,2021,We present methods to make ResNets more iterative and convergent and demonstrate that this does not provide a useful inductive bias on the examined tasks. +F8lXvXpZdrL,6SpA2563Q3s,1601310000000.0,1614990000000.0,1262,Reintroducing Straight-Through Estimators as Principled Methods for Stochastic Binary Networks,"[""~Alexander_Shekhovtsov1"", ""yanushviktor@gmail.com""]","[""Alexander Shekhovtsov"", ""Viktor Yanush""]","[""straight-through"", ""binary"", ""stochastic binary"", ""mirror descent""]","Training neural networks with binary weights and activations is a challenging problem due to the lack of gradients and difficulty of optimization over discrete weights. +Many successful experimental results have been achieved with empirical straight-through (ST) approaches, proposing a variety of ad-hoc rules for propagating gradients through non-differentiable activations and updating discrete weights. 
At the same time, ST methods can be truly derived as estimators in the stochastic binary network (SBN) model with Bernoulli weights. We advance these derivations to a more complete and systematic study. We analyze properties, estimation accuracy, obtain different forms of correct ST estimators for activations and weights, explain existing empirical approaches and their shortcomings, explain how latent weights arise from the mirror descent method when optimizing over probabilities. This allows to reintroduce, once empirical, ST methods as sound approximations, apply them with clarity and develop further improvements.",/pdf/033f4300b7a3daa75919b4d94f36ddd28969d70f.pdf,ICLR,2021,"Straight-through estimators, wide-spread in the empirical form, are given a proper theoretical treatment for the first time." +S18Su--CW,H1SBu-WRW,1509130000000.0,1519360000000.0,696,Thermometer Encoding: One Hot Way To Resist Adversarial Examples,"[""buckman@google.com"", ""aurkor@google.com"", ""craffel@google.com"", ""goodfellow@google.com""]","[""Jacob Buckman"", ""Aurko Roy"", ""Colin Raffel"", ""Ian Goodfellow""]","[""Adversarial examples"", ""robust neural networks""]","It is well known that it is possible to construct ""adversarial examples"" +for neural networks: inputs which are misclassified by the network +yet indistinguishable from true data. We propose a simple +modification to standard neural network architectures, thermometer +encoding, which significantly increases the robustness of the network to +adversarial examples. We demonstrate this robustness with experiments +on the MNIST, CIFAR-10, CIFAR-100, and SVHN datasets, and show that +models with thermometer-encoded inputs consistently have higher accuracy +on adversarial examples, without decreasing generalization. +State-of-the-art accuracy under the strongest known white-box attack was +increased from 93.20% to 94.30% on MNIST and 50.00% to 79.16% on CIFAR-10. 
+We explore the properties of these networks, providing evidence +that thermometer encodings help neural networks to +find more-non-linear decision boundaries.",/pdf/9d862ed16434958a6b8c2b1739c180de37a0be6e.pdf,ICLR,2018,Input discretization leads to robustness against adversarial examples +PrvaKdJcKhX,64FRW-HHfVv,1601310000000.0,1614990000000.0,3170,Differentiable Approximations for Multi-resource Spatial Coverage Problems,"[""~Nitin_Kamra1"", ""~Yan_Liu1""]","[""Nitin Kamra"", ""Yan Liu""]","[""Multi-agent coverage"", ""Multi-resource coverage"", ""Areal coverage"", ""Differentiable approximations""]","Resource allocation for coverage of physical spaces is a challenging problem in robotic surveillance, mobile sensor networks and security domains. Recent gradient-based optimization approaches to this problem estimate utilities of actions by using neural networks to learn a differentiable approximation to spatial coverage objectives. In this work, we empirically show that spatial coverage objectives with multiple-resources are combinatorially hard to approximate for neural networks and lead to sub-optimal policies. As our major contribution, we propose a tractable framework to approximate a general class of spatial coverage objectives and their gradients using a combination of Newton-Leibniz theorem, spatial discretization and implicit boundary differentiation. We empirically demonstrate the efficacy of our proposed framework on single and multi-agent spatial coverage problems.",/pdf/ebcba115018f393a9d1dd5293273959abb5a4425.pdf,ICLR,2021,"Tractable approximations for a large class of spatial coverage objectives and their gradients using a combination of Newton-Leibniz theorem, spatial discretization and implicit boundary differentiation." 
+S1Auv-WRZ,B1idv--Ab,1509130000000.0,1518730000000.0,687,Data Augmentation Generative Adversarial Networks,"[""a.antoniou@sms.ed.ac.uk"", ""a.storkey@ed.ac.uk"", ""h.l.edwards@sms.ed.ac.uk""]","[""Anthreas Antoniou"", ""Amos Storkey"", ""Harrison Edwards""]",[],"Effective training of neural networks requires much data. In the low-data regime, +parameters are underdetermined, and learnt networks generalise poorly. Data +Augmentation (Krizhevsky et al., 2012) alleviates this by using existing data +more effectively. However standard data augmentation produces only limited +plausible alternative data. Given there is potential to generate a much broader set +of augmentations, we design and train a generative model to do data augmentation. +The model, based on image conditional Generative Adversarial Networks, takes +data from a source domain and learns to take any data item and generalise it +to generate other within-class data items. As this generative process does not +depend on the classes themselves, it can be applied to novel unseen classes of data. +We show that a Data Augmentation Generative Adversarial Network (DAGAN) +augments standard vanilla classifiers well. We also show a DAGAN can enhance +few-shot learning systems such as Matching Networks. We demonstrate these +approaches on Omniglot, on EMNIST having learnt the DAGAN on Omniglot, and +VGG-Face data. 
In our experiments we can see over 13% increase in accuracy in +the low-data regime experiments in Omniglot (from 69% to 82%), EMNIST (73.9% +to 76%) and VGG-Face (4.5% to 12%); in Matching Networks for Omniglot we +observe an increase of 0.5% (from 96.9% to 97.4%) and an increase of 1.8% in +EMNIST (from 59.5% to 61.3%).",/pdf/30db496b2453da8d96dde909f8aadd97369fc82a.pdf,ICLR,2018,Conditional GANs trained to generate data augmented samples of their conditional inputs used to enhance vanilla classification and one shot learning systems such as matching networks and pixel distance +Byx0iAEYPH,SkgV2mF_PB,1569440000000.0,1577170000000.0,1338,Fully Polynomial-Time Randomized Approximation Schemes for Global Optimization of High-Dimensional Folded Concave Penalized Generalized Linear Models,"[""cdhernandez@ufl.edu"", ""hungyilee@ufl.edu"", ""hliu@ise.ufl.edu""]","[""Charles Hernandez"", ""Hungyi Lee"", ""Hongchen Liu""]","[""statistical learning"", ""FPRAS"", ""global optimization"", ""folded concave penalty"", ""GLM"", ""high dimensional learning""]","Global solutions to high-dimensional sparse estimation problems with a folded concave penalty (FCP) have been shown to be statistically desirable but are strongly NP-hard to compute, which implies the non-existence of a pseudo-polynomial time global optimization schemes in the worst case. This paper shows that, with high probability, a global solution to the formulation for a FCP-based high-dimensional generalized linear model coincides with a stationary point characterized by the significant subspace second order necessary conditions (S$^3$ONC). Since the desired S$^3$ONC solution admits a fully polynomial-time approximation schemes (FPTAS), we thus have shown the existence of fully polynomial-time randomized approximation scheme (FPRAS) for a strongly NP-hard problem. We further demonstrate two versions of the FPRAS for generating the desired S$^3$ONC solutions. 
One follows the paradigm of an interior point trust region algorithm and the other is the well-studied local linear approximation (LLA). Our analysis thus provides new techniques for global optimization of certain NP-Hard problems and new insights on the effectiveness of LLA.",/pdf/c6318ee7392f14129a772140d539bc5efb96df12.pdf,ICLR,2020,This paper primarily demonstrates a technique to find the global optima of FCP regularized GLMs which is to our knowledge the first of its kind. +WweBNiwWkZh,G0PGXMBU7X,1601310000000.0,1614990000000.0,693,Skinning a Parameterization of Three-Dimensional Space for Neural Network Cloth,"[""~Jane_Wu2"", ""zhenglin@stanford.edu"", ""hui.zhou@jd.com"", ""~Ronald_Fedkiw1""]","[""Jane Wu"", ""Zhenglin Geng"", ""Hui Zhou"", ""Ronald Fedkiw""]",[],"We present a novel learning framework for cloth deformation by embedding virtual cloth into a tetrahedral mesh that parametrizes the volumetric region of air surrounding the underlying body. In order to maintain this volumetric parameterization during character animation, the tetrahedral mesh is constrained to follow the body surface as it deforms. We embed the cloth mesh vertices into this parameterization of three-dimensional space in order to automatically capture much of the nonlinear deformation due to both joint rotations and collisions. We then train a convolutional neural network to recover ground truth deformation by learning cloth embedding offsets for each skeletal pose. Our experiments show significant improvement over learning cloth offsets from body surface parameterizations, both quantitatively and visually, with prior state of the art having a mean error five standard deviations higher than ours. Without retraining, our neural network generalizes to other body shapes and T-shirt sizes, giving the user some indication of how well clothing might fit. 
Our results demonstrate the efficacy of a general learning paradigm where high-frequency details can be embedded into low-frequency parameterizations.",/pdf/b2b28b4157cb072d1405ac828a4df74ddb57686c.pdf,ICLR,2021,We present a novel learning framework for cloth deformation by embedding virtual cloth into a tetrahedral mesh that parametrizes the volumetric region of air surrounding the underlying body. +HkePNpVKPB,BJx3sbmwvr,1569440000000.0,1583910000000.0,489,Compositional languages emerge in a neural iterated learning model,"[""y.ren-18@sms.ed.ac.uk"", ""s.guo-16@sms.ed.ac.uk"", ""matthieu.labeau@gmail.com"", ""scohen@inf.ed.ac.uk"", ""simon.kirby@ed.ac.uk""]","[""Yi Ren"", ""Shangmin Guo"", ""Matthieu Labeau"", ""Shay B. Cohen"", ""Simon Kirby""]","[""Compositionality"", ""Multi-agent"", ""Emergent language"", ""Iterated learning""]","The principle of compositionality, which enables natural language to represent complex concepts via a structured combination of simpler ones, allows us to convey an open-ended set of messages using a limited vocabulary. If compositionality is indeed a natural property of language, we may expect it to appear in communication protocols that are created by neural agents via grounded language learning. Inspired by the iterated learning framework, which simulates the process of language evolution, we propose an effective neural iterated learning algorithm that, when applied to interacting neural agents, facilitates the emergence of a more structured type of language. Indeed, these languages provide specific advantages to neural agents during training, which translates as a larger posterior probability, which is then incrementally amplified via the iterated learning procedure. 
Our experiments confirm our analysis, and also demonstrate that the emerged languages largely improve the generalization of the neural agent communication.",/pdf/5cbabffbd566946a98615003cccab2b738d36740.pdf,ICLR,2020,Use iterated learning framework to facilitate the dominance of high compositional language in multi-agent games. +ADWd4TJO13G,NWza-6JnuCp,1601310000000.0,1615930000000.0,1652,Lifelong Learning of Compositional Structures,"[""~Jorge_A_Mendez1"", ""~ERIC_EATON1""]","[""Jorge A Mendez"", ""ERIC EATON""]","[""lifelong learning"", ""continual learning"", ""compositional learning"", ""modular networks""]","A hallmark of human intelligence is the ability to construct self-contained chunks of knowledge and adequately reuse them in novel combinations for solving different yet structurally related problems. Learning such compositional structures has been a significant challenge for artificial systems, due to the combinatorial nature of the underlying search problem. To date, research into compositional learning has largely proceeded separately from work on lifelong or continual learning. We integrate these two lines of work to present a general-purpose framework for lifelong learning of compositional structures that can be used for solving a stream of related tasks. Our framework separates the learning process into two broad stages: learning how to best combine existing components in order to assimilate a novel problem, and learning how to adapt the set of existing components to accommodate the new problem. 
This separation explicitly handles the trade-off between the stability required to remember how to solve earlier tasks and the flexibility required to solve new tasks, as we show empirically in an extensive evaluation.",/pdf/4098c1ce3205eddbf5c4ba54920410460b0861cb.pdf,ICLR,2021,"We create a general-purpose framework for lifelong learning of compositional structures that splits the learning process into two stages: assimilation of new tasks with existing components, and accommodation of new knowledge into the components." +jjKzfD9vP9,KoA6RNIaS7n,1601310000000.0,1614990000000.0,3330,Saliency Grafting: Innocuous Attribution-Guided Mixup with Calibrated Label Mixing,"[""~Joonhyung_Park1"", ""~June_Yong_Yang1"", ""~Jinwoo_Shin1"", ""~Sung_Ju_Hwang1"", ""~Eunho_Yang1""]","[""Joonhyung Park"", ""June Yong Yang"", ""Jinwoo Shin"", ""Sung Ju Hwang"", ""Eunho Yang""]","[""Deep learning"", ""Data augmentation"", ""Input attribution""]","The Mixup scheme of mixing a pair of samples to create an augmented training sample has gained much attention recently for better training of neural networks. A straightforward and widely used extension is to combine Mixup and regional dropout methods: removing random patches from a sample and replacing it with the features from another sample. Albeit their simplicity and effectiveness, these methods are prone to create harmful samples due to their randomness. In recent studies, attempts to prevent such a phenomenon by selecting only the most informative features are gradually emerging. However, this maximum saliency strategy acts against their fundamental duty of sample diversification as they always deterministically select regions with maximum saliency, injecting bias into the augmented data. To address this problem, we present Saliency Grafting, a novel Mixup-like data augmentation method that captures the best of both ways. 
By stochastically sampling the features and ‘grafting’ them onto another sample, our method effectively generates diverse yet meaningful samples. The second ingredient of Saliency Grafting is to produce the label of the grafted sample by mixing the labels in a saliency-calibrated fashion, which rectifies supervision misguidance introduced by the random sampling procedure. Our experiments under CIFAR and ImageNet datasets show that our scheme outperforms the current state-of-the-art augmentation strategies not only in terms of classification accuracy, but is also superior in coping under stress conditions such as data corruption and data scarcity. The code will be released.",/pdf/77e0bcfbf560f069ccf70103195b950cb850c86c.pdf,ICLR,2021, +wxRwhSdORKG,7J7ayT_ZXja,1601310000000.0,1616060000000.0,275,Learning Subgoal Representations with Slow Dynamics,"[""~Siyuan_Li1"", ""zll19@mails.tsinghua.edu.cn"", ""wjh19@mails.tsinghua.edu.cn"", ""~Chongjie_Zhang1""]","[""Siyuan Li"", ""Lulu Zheng"", ""Jianhao Wang"", ""Chongjie Zhang""]","[""Hierarchical Reinforcement Learning"", ""Representation Learning"", ""Exploration""]","In goal-conditioned Hierarchical Reinforcement Learning (HRL), a high-level policy periodically sets subgoals for a low-level policy, and the low-level policy is trained to reach those subgoals. A proper subgoal representation function, which abstracts a state space to a latent subgoal space, is crucial for effective goal-conditioned HRL, since different low-level behaviors are induced by reaching subgoals in the compressed representation space. Observing that the high-level agent operates at an abstract temporal scale, we propose a slowness objective to effectively learn the subgoal representation (i.e., the high-level action space). We provide a theoretical grounding for the slowness objective. That is, selecting slow features as the subgoal space can achieve efficient hierarchical exploration. 
As a result of better exploration ability, our approach significantly outperforms state-of-the-art HRL and exploration methods on a number of benchmark continuous-control tasks. Thanks to the generality of the proposed subgoal representation learning method, empirical results also demonstrate that the learned representation and corresponding low-level policies can be transferred between distinct tasks.",/pdf/72f082bc7ce485bf0a24a86ed48e9d0b9f0fb493.pdf,ICLR,2021,We propose a slowness objective to learn subgoal representations in hierarchical reinforcement learning. +rJleFREKDr,BygXff__PH,1569440000000.0,1577170000000.0,1234,Learning to Control Latent Representations for Few-Shot Learning of Named Entities,"[""omar.florez@aggiemail.usu.edu"", ""erikmueller@capitalone.com""]","[""Omar U. Florez"", ""Erik Mueller""]","[""Memory management"", ""neuroscience"", ""reinforcement learning"", ""learning with small data""]","Humans excel in continuously learning with small data without forgetting how to solve old problems. +However, neural networks require large datasets to compute latent representations across different tasks while minimizing a loss function. For example, a natural language understanding (NLU) system will often deal with emerging entities during its deployment as interactions with users in realistic scenarios will generate new and infrequent names, events, and locations. Here, we address this scenario by introducing a RL trainable controller that disentangles the representation learning of a neural encoder from its memory management role. + +Our proposed solution is straightforward and simple: we train a controller to execute an optimal sequence of read and write operations on an external memory with the goal of leveraging diverse activations from the past and provide accurate predictions. Our approach is named Learning to Control (LTC) and allows few-shot learning with two degrees of memory plasticity. 
We experimentally show that our system obtains accurate results for few-shot learning of entity recognition in the Stanford Task-Oriented Dialogue dataset.",/pdf/8bb50d62bffa3dd0978ea59b1bdaea335c2ce876.pdf,ICLR,2020,We learn from small data by introducing an RL-trainable controller that learns to write to and read from an external memory. +SJgMK64Ywr,SkgWVn2PDS,1569440000000.0,1583910000000.0,661,AssembleNet: Searching for Multi-Stream Neural Connectivity in Video Architectures,"[""mryoo@google.com"", ""ajpiergi@indiana.edu"", ""tanmingxing@google.com"", ""anelia@google.com""]","[""Michael S. Ryoo"", ""AJ Piergiovanni"", ""Mingxing Tan"", ""Anelia Angelova""]","[""video representation learning"", ""video understanding"", ""activity recognition"", ""neural architecture search""]","Learning to represent videos is a very challenging task both algorithmically and computationally. Standard video CNN architectures have been designed by directly extending architectures devised for image understanding to include the time dimension, using modules such as 3D convolutions, or by using a two-stream design to capture both appearance and motion in videos. We interpret a video CNN as a collection of multi-stream convolutional blocks connected to each other, and propose the approach of automatically finding neural architectures with better connectivity and spatio-temporal interactions for video understanding. This is done by evolving a population of overly-connected architectures guided by connection weight learning. +Architectures combining representations that abstract different input types (i.e., RGB and optical flow) at multiple temporal resolutions are searched for, allowing different types or sources of information to interact with each other. Our method, referred to as AssembleNet, outperforms prior approaches on public video datasets, in some cases by a great margin. 
We obtain 58.6% mAP on Charades and 34.27% accuracy on Moments-in-Time.",/pdf/47259a18219d759a9fe8ebefbe0829933b4edcaa.pdf,ICLR,2020,We search for multi-stream neural architectures with better connectivity and spatio-temporal interactions for video understanding. +rkgBHoCqYX,SklWTYwRu7,1538090000000.0,1545410000000.0,82,A Kernel Random Matrix-Based Approach for Sparse PCA,"[""melaseddik@gmail.com"", ""mohamed.tamaazousti@cea.fr"", ""romain.couillet@gmail.com""]","[""Mohamed El Amine Seddik"", ""Mohamed Tamaazousti"", ""Romain Couillet""]","[""Random Matrix Theory"", ""Concentration of Measure"", ""Sparse PCA"", ""Covariance Thresholding""]","In this paper, we present a random matrix approach to recover sparse principal components from n p-dimensional vectors. Specifically, considering the large dimensional setting where n, p → ∞ with p/n → c ∈ (0, ∞) and under Gaussian vector observations, we study kernel random matrices of the type f (Ĉ), where f is a three-times continuously differentiable function applied entry-wise to the sample covariance matrix Ĉ of the data. Then, assuming that the principal components are sparse, we show that taking f in such a way that f'(0) = f''(0) = 0 allows for powerful recovery of the principal components, thereby generalizing previous ideas involving more specific f functions such as the soft-thresholding function.",/pdf/7692056668caa94f38432a24a84e811aa511ccf3.pdf,ICLR,2019, +Syee1pVtDS,H1eDMWNrPH,1569440000000.0,1577170000000.0,289,Distributed Online Optimization with Long-Term Constraints,"[""dmyuan1012@gmail.com"", ""alepro@kth.se"", ""guodong.shi@anu.edu.au""]","[""Deming Yuan"", ""Alexandre Proutiere"", ""Guodong Shi""]",[],"We consider distributed online convex optimization problems, where the distributed system consists of various computing units connected through a time-varying communication graph. 
In each time step, each computing unit selects a constrained vector, experiences a loss equal to an arbitrary convex function evaluated at this vector, and may communicate to its neighbors in the graph. The objective is to minimize the system-wide loss accumulated over time. We propose a decentralized algorithm with regret and cumulative constraint violation in ${\cal O}(T^{\max\{c,1-c\} })$ and ${\cal O}(T^{1-c/2})$, respectively, for any $c\in (0,1)$, where $T$ is the time horizon. When the loss functions are strongly convex, we establish improved regret and constraint violation upper bounds in ${\cal O}(\log(T))$ and ${\cal O}(\sqrt{T\log(T)})$. These regret scalings match those obtained by state-of-the-art algorithms and fundamental limits in the corresponding centralized online optimization problem (for both convex and strongly convex loss functions). In the case of bandit feedback, the proposed algorithms achieve a regret and constraint violation in ${\cal O}(T^{\max\{c,1-c/3 \} })$ and ${\cal O}(T^{1-c/2})$ for any $c\in (0,1)$. We numerically illustrate the performance of our algorithms for the particular case of distributed online regularized linear regression problems.",/pdf/f1466ecffdf06f9bea9930a57b0392183d5382ee.pdf,ICLR,2020, +YTWGvpFOQD-,CowXrVXVuP,1601310000000.0,1613620000000.0,1641,Differentially Private Learning Needs Better Features (or Much More Data),"[""~Florian_Tramer1"", ""~Dan_Boneh1""]","[""Florian Tramer"", ""Dan Boneh""]","[""Differential Privacy"", ""Privacy"", ""Deep Learning""]","We demonstrate that differentially private machine learning has not yet reached its ''AlexNet moment'' on many canonical vision tasks: linear models trained on handcrafted features significantly outperform end-to-end deep neural networks for moderate privacy budgets. +To exceed the performance of handcrafted features, we show that private learning requires either much more private data, or access to features learned on public data from a similar domain. 
+Our work introduces simple yet strong baselines for differentially private learning that can inform the evaluation of future progress in this area.",/pdf/63107901e325896b18874aad193314befc47c7ae.pdf,ICLR,2021,Linear models with handcrafted features outperform end-to-end CNNs for differentially private learning +WMUSP41HQWS,nD99jmp5cqu,1601310000000.0,1614990000000.0,3119,DISE: Dynamic Integrator Selection to Minimize Forward Pass Time in Neural ODEs,"[""sy_k@yonsei.ac.kr"", ""pgh300@yonsei.ac.kr"", ""~Kwang-Sung_Jun1"", ""~Noseong_Park1""]","[""Soyoung Kang"", ""Ganghyeon Park"", ""Kwang-Sung Jun"", ""Noseong Park""]","[""Neural ODE"", ""DOPRI""]","Neural ordinary differential equations (Neural ODEs) are appreciated for their ability to significantly reduce the number of parameters when constructing a neural network. On the other hand, they are sometimes blamed for their long forward-pass inference time, which is incurred by solving integral problems. To improve the model accuracy, they rely on advanced solvers, such as the Dormand--Prince (DOPRI) method. To solve an integral problem, however, it requires at least tens (or sometimes thousands) of steps in many Neural ODE experiments. In this work, we propose to i) directly regularize the step size of DOPRI to make the forward-pass faster and ii) dynamically choose a simpler integrator than DOPRI for a carefully selected subset of input. Because it is not the case that every input requires the advanced integrator, we design an auxiliary neural network to choose an appropriate integrator given input to decrease the overall inference time without significantly sacrificing accuracy. We consider the Euler method, the fourth-order Runge--Kutta (RK4) method, and DOPRI as selection candidates. We found that 10-30% of cases can be solved with simple integrators in our experiments. 
Therefore, the overall number of functional evaluations (NFE) decreases by up to 78% with improved accuracy.",/pdf/5a827e83a7d83d3db793ecf71d7de31e6d0463c4.pdf,ICLR,2021,To make Neural ODEs' forward-pass inference faster +S1erHoR5t7,HklMWLIYu7,1538090000000.0,1546620000000.0,87, The relativistic discriminator: a key element missing from standard GAN,"[""alexia.jolicoeur-martineau@mail.mcgill.ca""]","[""Alexia Jolicoeur-Martineau""]","[""AI"", ""deep learning"", ""generative models"", ""GAN""]","In the standard generative adversarial network (SGAN), the discriminator estimates the probability that the input data is real. The generator is trained to increase the probability that fake data is real. We argue that it should also simultaneously decrease the probability that real data is real because 1) this would account for a priori knowledge that half of the data in the mini-batch is fake, 2) this would be observed with divergence minimization, and 3) in optimal settings, SGAN would be equivalent to integral probability metric (IPM) GANs. + +We show that this property can be induced by using a relativistic discriminator which estimates the probability that the given real data is more realistic than randomly sampled fake data. We also present a variant in which the discriminator estimates the probability that the given real data is more realistic than fake data, on average. We generalize both approaches to non-standard GAN loss functions and we refer to them respectively as Relativistic GANs (RGANs) and Relativistic average GANs (RaGANs). We show that IPM-based GANs are a subset of RGANs which use the identity function. 
+ +Empirically, we observe that 1) RGANs and RaGANs are significantly more stable and generate higher quality data samples than their non-relativistic counterparts, 2) Standard RaGAN with gradient penalty generates data of better quality than WGAN-GP while only requiring a single discriminator update per generator update (reducing the time taken for reaching the state-of-the-art by 400%), and 3) RaGANs are able to generate plausible high-resolution images (256x256) from a very small sample (N=2011), while GAN and LSGAN cannot; these images are of significantly better quality than the ones generated by WGAN-GP and SGAN with spectral normalization. + +The code is freely available on https://github.com/AlexiaJM/RelativisticGAN.",/pdf/9d1feb597fbc059ec906836cb0234e5b2ef0731c.pdf,ICLR,2019,Improving the quality and stability of GANs using a relativistic discriminator; IPM GANs (such as WGAN-GP) are a special case. +rJeINp4KwH,Hyx-Eb7vPB,1569440000000.0,1583910000000.0,488,Population-Guided Parallel Policy Search for Reinforcement Learning,"[""wy.jung@kaist.ac.kr"", ""gs.park@kaist.ac.kr"", ""ycsung@kaist.ac.kr""]","[""Whiyoung Jung"", ""Giseung Park"", ""Youngchul Sung""]","[""Reinforcement Learning"", ""Parallel Learning"", ""Population Based Learning""]","In this paper, a new population-guided parallel learning scheme is proposed to enhance the performance of off-policy reinforcement learning (RL). In the proposed scheme, multiple identical learners with their own value-functions and policies share a common experience replay buffer, and search for a good policy in collaboration with the guidance of the best policy information. The key point is that the information of the best policy is fused in a soft manner by constructing an augmented loss function for policy update to enlarge the overall search region by the multiple learners. 
The guidance by the previous best policy and the enlarged range enable faster and better policy search, and monotone improvement of the expected cumulative return by the proposed scheme is proved theoretically. Working algorithms are constructed by applying the proposed scheme to the twin delayed deep deterministic (TD3) policy gradient algorithm, and numerical results show that the constructed P3S-TD3 outperforms most of the current state-of-the-art RL algorithms, and the gain is significant in the case of sparse reward environment.",/pdf/17b46ff13242a110aae198cfeb220153596d2b1d.pdf,ICLR,2020, +SyW2QSige,,1478350000000.0,1481010000000.0,554,Towards Information-Seeking Agents,"[""phil.bachman@maluuba.com"", ""alessandro.sordoni@maluuba.com"", ""adam.trischler@maluuba.com""]","[""Philip Bachman"", ""Alessandro Sordoni"", ""Adam Trischler""]",[],"We develop a general problem setting for training and testing the ability of agents to gather information efficiently. Specifically, we present a collection of tasks in which success requires searching through a partially-observed environment, for fragments of information which can be pieced together to accomplish various goals. We combine deep architectures with techniques from reinforcement learning to develop agents that solve our tasks. We shape the behavior of these agents by combining extrinsic and intrinsic rewards. We empirically demonstrate that these agents learn to search actively and intelligently for new information to reduce their uncertainty, and to exploit information they have already acquired.",/pdf/1f5ea4d7a782bc373d3b7149d1e96313e80ce8a4.pdf,ICLR,2017,We investigate the behavior of models trained to answer questions by asking sequences of simple questions. 
+ByeSdsC9Km,HJxadct9FQ,1538090000000.0,1551840000000.0,350,Adaptive Posterior Learning: few-shot learning with a surprise-based memory module,"[""tiago.mpramalho@gmail.com"", ""garnelo@google.com""]","[""Tiago Ramalho"", ""Marta Garnelo""]","[""metalearning"", ""memory"", ""few-shot"", ""relational"", ""self-attention"", ""classification"", ""sequential"", ""reasoning"", ""working memory"", ""episodic memory""]","The ability to generalize quickly from few observations is crucial for intelligent systems. In this paper we introduce APL, an algorithm that approximates probability distributions by remembering the most surprising observations it has encountered. These past observations are recalled from an external memory module and processed by a decoder network that can combine information from different memory slots to generalize beyond direct recall. We show this algorithm can perform as well as state of the art baselines on few-shot classification benchmarks with a smaller memory footprint. In addition, its memory compression allows it to scale to thousands of unknown labels. Finally, we introduce a meta-learning reasoning task which is more challenging than direct classification. In this setting, APL is able to generalize with fewer than one example per class via deductive reasoning.",/pdf/956c46c6a073db95ea165145a217984e60e5d2fb.pdf,ICLR,2019,We introduce a model which generalizes quickly from few observations by storing surprising information and attending over the most relevant data at each time point. 
+SkeK3s0qKQ,rklCE_aUFX,1538090000000.0,1557410000000.0,731,Episodic Curiosity through Reachability,"[""nikolay.savinov@inf.ethz.ch"", ""raveman@google.com"", ""damienv@google.com"", ""raphaelm@google.com"", ""marc.pollefeys@inf.ethz.ch"", ""countzero@google.com"", ""sylvaingelly@google.com""]","[""Nikolay Savinov"", ""Anton Raichuk"", ""Damien Vincent"", ""Raphael Marinier"", ""Marc Pollefeys"", ""Timothy Lillicrap"", ""Sylvain Gelly""]","[""deep learning"", ""reinforcement learning"", ""curiosity"", ""exploration"", ""episodic memory""]","Rewards are sparse in the real world and most of today's reinforcement learning algorithms struggle with such sparsity. One solution to this problem is to allow the agent to create rewards for itself - thus making rewards dense and more suitable for learning. In particular, inspired by curious behaviour in animals, observing something novel could be rewarded with a bonus. Such bonus is summed up with the real task reward - making it possible for RL algorithms to learn from the combined reward. We propose a new curiosity method which uses episodic memory to form the novelty bonus. To determine the bonus, the current observation is compared with the observations in memory. Crucially, the comparison is done based on how many environment steps it takes to reach the current observation from those in memory - which incorporates rich information about environment dynamics. This allows us to overcome the known ""couch-potato"" issues of prior work - when the agent finds a way to instantly gratify itself by exploiting actions which lead to hardly predictable consequences. We test our approach in visually rich 3D environments in ViZDoom, DMLab and MuJoCo. In navigational tasks from ViZDoom and DMLab, our agent outperforms the state-of-the-art curiosity method ICM. In MuJoCo, an ant equipped with our curiosity module learns locomotion out of the first-person-view curiosity only. 
The code is available at https://github.com/google-research/episodic-curiosity/.",/pdf/a81ed9381b42322b6099ca895beadba6a985c8d9.pdf,ICLR,2019,"We propose a novel model of curiosity based on episodic memory and the ideas of reachability which allows us to overcome the known ""couch-potato"" issues of prior work." +rklVOnNtwH,ryg4avQQLS,1569440000000.0,1577170000000.0,38,Out-of-Distribution Detection Using Layerwise Uncertainty in Deep Neural Networks,"[""h-okamoto@weblab.t.u-tokyo.ac.jp"", ""masa@weblab.t.u-tokyo.ac.jp"", ""matsuo@weblab.t.u-tokyo.ac.jp""]","[""Hirono Okamoto"", ""Masahiro Suzuki"", ""Yutaka Matsuo""]","[""out-of-distribution"", ""uncertainty""]","In this paper, we tackle the problem of detecting samples that are not drawn from the training distribution, i.e., out-of-distribution (OOD) samples, in classification. Many previous studies have attempted to solve this problem by regarding samples with low classification confidence as OOD examples using deep neural networks (DNNs). However, on difficult datasets or models with low classification ability, these methods incorrectly regard in-distribution samples close to the decision boundary as OOD samples. This problem arises because their approaches use only the features close to the output layer and disregard the uncertainty of the features. Therefore, we propose a method that extracts the uncertainties of features in each layer of DNNs using a reparameterization trick and combines them. In experiments, our method outperforms the existing methods by a large margin, achieving state-of-the-art detection performance on several datasets and classification models. 
For example, our method increases the AUROC score of prior work (83.8%) to 99.8% in DenseNet on the CIFAR-100 and Tiny-ImageNet datasets.",/pdf/699ccba976fc09e43c43f59a9c7053400d6f6abf.pdf,ICLR,2020,We propose a method that extracts the uncertainties of features in each layer of DNNs and combines them for detecting OOD samples when solving classification tasks. +BJgNJgSFPS,Skx4pFJKPr,1569440000000.0,1583910000000.0,2058,Building Deep Equivariant Capsule Networks,"[""vsairaam@sssihl.edu.in"", ""sbalasubramanian@sssihl.edu.in"", ""rraghunathasarma@sssihl.edu.in""]","[""Sai Raam Venkataraman"", ""S. Balasubramanian"", ""R. Raghunatha Sarma""]","[""Capsule networks"", ""equivariance""]","Capsule networks are constrained by the parameter-expensive nature of their layers, and the general lack of provable equivariance guarantees. We present a variation of capsule networks that aims to remedy this. We identify that learning all pair-wise part-whole relationships between capsules of successive layers is inefficient. Further, we also realise that the choice of prediction networks and the routing mechanism are both key to equivariance. Based on these, we propose an alternative framework for capsule networks that learns to projectively encode the manifold of pose-variations, termed the space-of-variation (SOV), for every capsule-type of each layer. This is done using a trainable, equivariant function defined over a grid of group-transformations. Thus, the prediction-phase of routing involves projection into the SOV of a deeper capsule using the corresponding function. As a specific instantiation of this idea, and also in order to reap the benefits of increased parameter-sharing, we use type-homogeneous group-equivariant convolutions of shallower capsules in this phase. We also introduce an equivariant routing mechanism based on degree-centrality. 
We show that this particular instance of our general model is equivariant, and hence preserves the compositional representation of an input under transformations. We conduct several experiments on standard object-classification datasets that showcase the increased transformation-robustness, as well as general performance, of our model to several capsule baselines.",/pdf/211c58a41ccc234f38d664aeb9596cbde287d441.pdf,ICLR,2020,"A new scalable, group-equivariant model for capsule networks that preserves compositionality under transformations, and is empirically more transformation-robust to older capsule network models." +YZ-NHPj6c6O,Zn_uk_rKK9m,1601310000000.0,1614990000000.0,3369,Quantifying and Learning Disentangled Representations with Limited Supervision,"[""~Loek_Tonnaer1"", ""~Luis_Armando_P\u00e9rez_Rey1"", ""~Vlado_Menkovski2"", ""~Mike_Holenderski1"", ""j.w.portegies@tue.nl""]","[""Loek Tonnaer"", ""Luis Armando P\u00e9rez Rey"", ""Vlado Menkovski"", ""Mike Holenderski"", ""Jacobus W. Portegies""]","[""Representation Learning"", ""Disentanglement"", ""Group Theory""]","Learning low-dimensional representations that disentangle the underlying factors of variation in data has been posited as an important step towards interpretable machine learning with good generalization. To address the fact that there is no consensus on what disentanglement entails, Higgins et al. (2018) propose a formal definition for Linear Symmetry-Based Disentanglement, or LSBD, arguing that underlying real-world transformations give exploitable structure to data. + +Although several works focus on learning LSBD representations, such methods require supervision on the underlying transformations for the entire dataset, and cannot deal with unlabeled data. Moreover, none of these works provide a metric to quantify LSBD. + +We propose a metric to quantify LSBD representations that is easy to compute under certain well-defined assumptions. 
Furthermore, we present a method that can leverage unlabeled data, such that LSBD representations can be learned with limited supervision on transformations. Using our LSBD metric, our results show that limited supervision is indeed sufficient to learn LSBD representations.",/pdf/616a8ebd6659a6bf44f358bbeea0b6e20ca4fbf7.pdf,ICLR,2021,"We propose a metric to quantify linearly symmetry-based disentangled representations, as well as a method to learn such representations with limited supervision." +Qk-Wq5AIjpq,l0mmYyzcKUS,1601310000000.0,1616010000000.0,1383,PAC Confidence Predictions for Deep Neural Network Classifiers,"[""~Sangdon_Park1"", ""lishuo1@seas.upenn.edu"", ""~Insup_Lee1"", ""~Osbert_Bastani1""]","[""Sangdon Park"", ""Shuo Li"", ""Insup Lee"", ""Osbert Bastani""]","[""classification"", ""calibration"", ""probably approximated correct guarantee"", ""fast DNN inference"", ""safe planning""]","A key challenge for deploying deep neural networks (DNNs) in safety critical settings is the need to provide rigorous ways to quantify their uncertainty. In this paper, we propose a novel algorithm for constructing predicted classification confidences for DNNs that comes with provable correctness guarantees. Our approach uses Clopper-Pearson confidence intervals for the Binomial distribution in conjunction with the histogram binning approach to calibrated prediction. In addition, we demonstrate how our predicted confidences can be used to enable downstream guarantees in two settings: (i) fast DNN inference, where we demonstrate how to compose a fast but inaccurate DNN with an accurate but slow DNN in a rigorous way to improve performance without sacrificing accuracy, and (ii) safe planning, where we guarantee safety when using a DNN to predict whether a given action is safe based on visual observations. 
In our experiments, we demonstrate that our approach can be used to provide guarantees for state-of-the-art DNNs.",/pdf/722efb9d50df44cb8e7fca2dbab280da02707515.pdf,ICLR,2021,"We propose a novel algorithm for constructing predicted classification confidences for DNNs that comes with provable correctness guarantees, and demonstrate how our predicted confidences can be used to enable downstream guarantees in two settings." +rylbWhC5Ym,SJg2EHictm,1538090000000.0,1545360000000.0,1143,HR-TD: A Regularized TD Method to Avoid Over-Generalization,"[""ishand@cs.utexas.edu"", ""liubo19831214@gmail.com"", ""pstone@cs.utexas.edu""]","[""Ishan Durugkar"", ""Bo Liu"", ""Peter Stone""]","[""Reinforcement Learning"", ""TD Learning"", ""Deep Learning""]","Temporal Difference learning with function approximation has been widely used recently and has led to several successful results. However, compared with the original tabular-based methods, one major drawback of temporal difference learning with neural networks and other function approximators is that they tend to over-generalize across temporally successive states, resulting in slow convergence and even instability. In this work, we propose a novel TD learning method, Hadamard product Regularized TD (HR-TD), that reduces over-generalization and thus leads to faster convergence. This approach can be easily applied to both linear and nonlinear function approximators. 
+HR-TD is evaluated on several linear and nonlinear benchmark domains, where we show improvement in learning behavior and performance.",/pdf/7c8f69e154640f8fa0e114300691784da51551aa.pdf,ICLR,2019,"A regularization technique for TD learning that avoids temporal over-generalization, especially in Deep Networks" +K9bw7vqp_s,a9XZdz3fRzE,1601310000000.0,1618740000000.0,255,Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch,"[""~Aojun_Zhou2"", ""~Yukun_Ma2"", ""junnan.zhu@nlpr.ia.ac.cn"", ""~Jianbo_Liu3"", ""~Zhijie_Zhang1"", ""~Kun_Yuan1"", ""~Wenxiu_Sun1"", ""~Hongsheng_Li3""]","[""Aojun Zhou"", ""Yukun Ma"", ""Junnan Zhu"", ""Jianbo Liu"", ""Zhijie Zhang"", ""Kun Yuan"", ""Wenxiu Sun"", ""Hongsheng Li""]","[""sparsity"", ""efficient training and inference.""]","Sparsity in Deep Neural Networks (DNNs) has been widely studied to compress and accelerate the models on resource-constrained environments. It can be generally categorized into unstructured fine-grained sparsity that zeroes out multiple individual weights distributed across the neural network, and structured coarse-grained sparsity which prunes blocks of sub-networks of a neural network. Fine-grained sparsity can achieve a high compression ratio but is not hardware friendly and hence receives limited speed gains. On the other hand, coarse-grained sparsity cannot simultaneously achieve both apparent acceleration on modern GPUs and +decent performance. In this paper, we are the first to study training from scratch an N:M fine-grained structured sparse network, which can maintain the advantages of both unstructured fine-grained sparsity and structured coarse-grained sparsity simultaneously on specifically designed GPUs. Specifically, a 2 : 4 sparse network could achieve 2× speed-up without performance drop on Nvidia A100 GPUs. 
Furthermore, we propose a novel and effective ingredient, sparse-refined straight-through estimator (SR-STE), to alleviate the negative influence of the approximated gradients computed by vanilla STE during optimization. We also define a metric, Sparse Architecture Divergence (SAD), to measure the sparse network’s topology change during the training process. Finally, we justify SR-STE’s advantages with SAD and demonstrate the effectiveness of SR-STE by performing +comprehensive experiments on various tasks. Anonymous code and model will be available at https://github.com/anonymous-NM-sparsity/NM-sparsity.",/pdf/75cc29fc217b7a42a135a1f1ee7a57e605181177.pdf,ICLR,2021,a simple yet universal recipe to learn N:M sparse neural networks from scratch +lvRTC669EY_,rvbQRV-iyJM,1601310000000.0,1615520000000.0,2803,Discovering Diverse Multi-Agent Strategic Behavior via Reward Randomization,"[""tangzhenggang@pku.edu.cn"", ""yc19@mails.tsinghua.edu.cn"", ""~Boyuan_Chen2"", ""~Huazhe_Xu1"", ""~Xiaolong_Wang3"", ""~Fei_Fang1"", ""~Simon_Shaolei_Du1"", ""~Yu_Wang3"", ""~Yi_Wu1""]","[""Zhenggang Tang"", ""Chao Yu"", ""Boyuan Chen"", ""Huazhe Xu"", ""Xiaolong Wang"", ""Fei Fang"", ""Simon Shaolei Du"", ""Yu Wang"", ""Yi Wu""]","[""strategic behavior"", ""multi-agent reinforcement learning"", ""reward randomization"", ""diverse strategies""]","We propose a simple, general and effective technique, Reward Randomization, for discovering diverse strategic policies in complex multi-agent games. Combining reward randomization and policy gradient, we derive a new algorithm, Reward-Randomized Policy Gradient (RPG). 
RPG is able to discover a set of multiple distinctive human-interpretable strategies in challenging temporal trust dilemmas, including grid-world games and a real-world game Agar.io, where multiple equilibria exist but standard multi-agent policy gradient algorithms always converge to a fixed one with a sub-optimal payoff for every player even using state-of-the-art exploration techniques. Furthermore, with the set of diverse strategies from RPG, we can (1) achieve higher payoffs by fine-tuning the best policy from the set; and (2) obtain an adaptive agent by using this set of strategies as its training opponents. ",/pdf/2062fdf1e8a1dbc3c1d293239ad291f853463ba8.pdf,ICLR,2021,"We propose an MARL algorithm, RPG, which discovers diverse non-trivial strategic behavior in several challenging multi-agent games." +uXl3bZLkr3c,okq1j9FkZc,1601310000000.0,1616080000000.0,1015,Tent: Fully Test-Time Adaptation by Entropy Minimization,"[""~Dequan_Wang1"", ""~Evan_Shelhamer2"", ""~Shaoteng_Liu1"", ""~Bruno_Olshausen1"", ""~Trevor_Darrell2""]","[""Dequan Wang"", ""Evan Shelhamer"", ""Shaoteng Liu"", ""Bruno Olshausen"", ""Trevor Darrell""]","[""deep learning"", ""unsupervised learning"", ""domain adaptation"", ""self-supervision"", ""robustness""]","A model must adapt itself to generalize to new and different data during testing. In this setting of fully test-time adaptation the model has only the test data and its own parameters. We propose to adapt by test entropy minimization (tent): we optimize the model for confidence as measured by the entropy of its predictions. Our method estimates normalization statistics and optimizes channel-wise affine transformations to update online on each batch. Tent reduces generalization error for image classification on corrupted ImageNet and CIFAR-10/100 and reaches a new state-of-the-art error on ImageNet-C. 
Tent handles source-free domain adaptation on digit recognition from SVHN to MNIST/MNIST-M/USPS, on semantic segmentation from GTA to Cityscapes, and on the VisDA-C benchmark. These results are achieved in one epoch of test-time optimization without altering training.",/pdf/4de0af9691a5dcc52de7de756676fded33d037ef.pdf,ICLR,2021,Deep networks can generalize better during testing by adapting to feedback from their own predictions. +fESskTMMSv,be6QuLLGGNw,1601310000000.0,1614990000000.0,1409,Practical Marginalized Importance Sampling with the Successor Representation,"[""~Scott_Fujimoto1"", ""~David_Meger2"", ""~Doina_Precup1""]","[""Scott Fujimoto"", ""David Meger"", ""Doina Precup""]","[""marginalized importance sampling"", ""off-policy evaluation"", ""deep reinforcement learning"", ""successor representation""]","Marginalized importance sampling (MIS), which measures the density ratio between the state-action occupancy of a target policy and that of a sampling distribution, is a promising approach for off-policy evaluation. However, current state-of-the-art MIS methods rely on complex optimization tricks and succeed mostly on simple toy problems. We bridge the gap between MIS and deep reinforcement learning by observing that the density ratio can be computed from the successor representation of the target policy. The successor representation can be trained through deep reinforcement learning methodology and decouples the reward optimization from the dynamics of the environment, making the resulting algorithm stable and applicable to high-dimensional domains. We evaluate the empirical performance of our approach on a variety of challenging Atari and MuJoCo environments.",/pdf/38e5bc45c0a26b126084f859c71c135bc0c58324.pdf,ICLR,2021,We develop an approach for MIS that can be computed from the successor representation and scales to high-dimensional systems. 
+mhEd8uOyNTI,aTkITAhqyY,1601310000000.0,1614990000000.0,2553,Representational correlates of hierarchical phrase structure in deep language models,"[""ma3811@columbia.edu"", ""jonathan.mamou@intel.com"", ""drmiguel@alum.mit.edu"", ""~Hanlin_Tang1"", ""~Yoon_Kim1"", ""~SueYeon_Chung1""]","[""Matteo Alleman"", ""Jonathan Mamou"", ""Miguel A Del Rio"", ""Hanlin Tang"", ""Yoon Kim"", ""SueYeon Chung""]","[""bertology"", ""interpretability"", ""computational neuroscience"", ""population coding""]","While contextual representations from Transformer-based architectures have set a new standard for many NLP tasks, there is not yet a complete accounting of their inner workings. In particular, it is not entirely clear what aspects of sentence-level syntax are captured by these representations, nor how (if at all) they are built along the stacked layers of the network. In this paper, we aim to address such questions with a general class of input perturbation-based analyses of representations from Transformer networks pretrained on self-supervised objectives. Importing from computational and cognitive neuroscience the notion of representational invariance, we perform a series of probes designed to test the sensitivity of Transformer representations to several kinds of structure in sentences. Each probe involves swapping words in a sentence and comparing the representations from perturbed sentences against the original. We experiment with three different perturbations: (1) random permutations of n-grams of varying width, to test the scale at which a representation is sensitive to word position; (2) swapping of two spans which do or do not form a syntactic phrase, to test sensitivity to global phrase structure; and (3) swapping of two adjacent words which do or do not break apart a syntactic phrase, to test sensitivity to local phrase structure. 
We also connect our probe results to the Transformer architecture by relating the attention mechanism to syntactic distance between two words. Results from the three probes collectively suggest that Transformers build sensitivity to larger parts of the sentence along their layers, and that hierarchical phrase structure plays a role in this process. In particular, sensitivity to local phrase structure increases along deeper layers. Based on our analysis of attention, we show that this is at least partly explained by generally larger attention weights between syntactically distant words.",/pdf/8c924ae31a130e7245a197be4f5a3d54b5919adf.pdf,ICLR,2021,We use methods from computational neuroscience to analyze representational correlates of syntax in BERT-models and find that these models gradually build up sensitivities to hierarchical phrase structure along its layers. +Hye1RJHKwB,HJloHI1Kvr,1569440000000.0,1583910000000.0,2009,Training Generative Adversarial Networks from Incomplete Observations using Factorised Discriminators,"[""d.stoller@qmul.ac.uk"", ""sewert@spotify.com"", ""s.e.dixon@qmul.ac.uk""]","[""Daniel Stoller"", ""Sebastian Ewert"", ""Simon Dixon""]","[""Adversarial Learning"", ""Semi-supervised Learning"", ""Image generation"", ""Image segmentation"", ""Missing Data""]","Generative adversarial networks (GANs) have shown great success in applications such as image generation and inpainting. +However, they typically require large datasets, which are often not available, especially in the context of prediction tasks such as image segmentation that require labels. Therefore, methods such as the CycleGAN use more easily available unlabelled data, but do not offer a way to leverage additional labelled data for improved performance. To address this shortcoming, we show how to factorise the joint data distribution into a set of lower-dimensional distributions along with their dependencies. 
This allows splitting the discriminator in a GAN into multiple ""sub-discriminators"" that can be independently trained from incomplete observations. Their outputs can be combined to estimate the density ratio between the joint real and the generator distribution, which enables training generators as in the original GAN framework. We apply our method to image generation, image segmentation and audio source separation, and obtain improved performance over a standard GAN when additional incomplete training examples are available. For the Cityscapes segmentation task in particular, our method also improves accuracy by an absolute 14.9% over CycleGAN while using only 25 additional paired examples.",/pdf/dd2f617b749f8375bb46bfd434be0338b70e4cd5.pdf,ICLR,2020,"We decompose the discriminator in a GAN in a principled way so that each component can be independently trained on different parts of the input. The resulting ""FactorGAN"" can be used for semi-supervised learning and in missing data scenarios." +H-SPvQtMwm,eK0VQI6vLvc,1601310000000.0,1614990000000.0,1899,Synthesizer: Rethinking Self-Attention for Transformer Models,"[""~Yi_Tay1"", ""~Dara_Bahri1"", ""metzler@google.com"", ""~Da-Cheng_Juan1"", ""~Zhe_Zhao3"", ""chezheng@google.com""]","[""Yi Tay"", ""Dara Bahri"", ""Donald Metzler"", ""Da-Cheng Juan"", ""Zhe Zhao"", ""Che Zheng""]","[""Transformers"", ""Deep Learning"", ""Attention""]","The dot product self-attention is known to be central and indispensable to state-of-the-art Transformer models. But is it really required? This paper investigates the true importance and contribution of the dot product-based self-attention mechanism on the performance of Transformer models. Via extensive experiments, we find that (1) random alignment matrices surprisingly perform quite competitively and (2) learning attention weights from token-token (query-key) interactions is useful but not that important after all. 
To this end, we propose \textsc{Synthesizer}, a model that learns synthetic attention weights without token-token interactions. In our experiments, we first show that simple Synthesizers achieve highly competitive performance when compared against vanilla Transformer models across a range of tasks, including machine translation, language modeling, text generation and GLUE/SuperGLUE benchmarks. When composed with dot product attention, we find that Synthesizers consistently outperform Transformers. Moreover, we conduct additional comparisons of Synthesizers against Dynamic Convolutions, showing that simple Random Synthesizer is not only $60\%$ faster but also improves perplexity by a relative $3.5\%$. Finally, we show that simple factorized Synthesizers can outperform Linformers on encoding only tasks. ",/pdf/f26d5eaa38649d3b05fa49ed55255e6d5c2c02cb.pdf,ICLR,2021,"We propose synthesizing the attention matrix and achieve simple, efficient and competitive performance." +_CrmWaJ2uvP,KoT6myxgsqk,1601310000000.0,1614990000000.0,1695,Recurrent Neural Network Architecture based on Dynamic Systems Theory for Data Driven Modelling of Complex Physical Systems,"[""~Deniz_Neufeld1""]","[""Deniz Neufeld""]","[""dynamic system identification"", ""recurrent networks"", ""explainable AI"", ""time series modelling""]","While dynamic systems can be modelled as sequence-to-sequence tasks by deep learning using different network architectures like DNN, CNN, RNNs or neural ODEs, the resulting models often provide poor understanding of the underlying system properties. We propose a new recurrent network architecture, the Dynamic Recurrent Network, where the computation function is based on the discrete difference equations of basic linear system transfer functions known from dynamic system identification. This results in a more explainable model, since the learnt weights can provide insight on a system's time dependent behaviour. 
It also introduces the sequences' sampling rate as an additional model parameter, which can be leveraged, for example, for time series data augmentation and model robustness checks. The network is trained using traditional gradient descent optimization and can be used in combination with other state of the art neural network layers. We show that our new layer type yields results comparable to or better than other recurrent layer types on several system identification tasks.",/pdf/674dca5d1daa6e05f8f943716c0740db6f5c652e.pdf,ICLR,2021,A new recurrent network structure consisting of basic linear building blocks from dynamic system identification. +HyGySsAct7,S1xgEdnLKX,1538090000000.0,1545360000000.0,52,Targeted Adversarial Examples for Black Box Audio Systems,"[""rohantaori@berkeley.edu"", ""amogkamsetty@berkeley.edu"", ""brentonlongchu@berkeley.edu"", ""nikitavemuri@berkeley.edu""]","[""Rohan Taori"", ""Amog Kamsetty"", ""Brenton Chu"", ""Nikita Vemuri""]","[""adversarial attack"", ""adversarial examples"", ""audio processing"", ""speech to text"", ""deep learning"", ""adversarial audio"", ""black box"", ""machine learning""]","The application of deep recurrent networks to audio transcription has led to impressive gains in automatic speech recognition (ASR) systems. Many have demonstrated that small adversarial perturbations can fool deep neural networks into incorrectly predicting a specified target with high confidence. Current work on fooling ASR systems have focused on white-box attacks, in which the model architecture and parameters are known. In this paper, we adopt a black-box approach to adversarial generation, combining the approaches of both genetic algorithms and gradient estimation to solve the task. 
We achieve an 89.25% targeted attack similarity after 3000 generations while maintaining 94.6% audio file similarity.",/pdf/1d511d4e96a5890caa42c60884c55bfda8ae1cd5.pdf,ICLR,2019,We present a novel black-box targeted attack on speech to text systems that supports arbitrarily long adversarial transcriptions and achieves state of the art performance. +HygwvC4tPH,SkgluOwuPr,1569440000000.0,1577170000000.0,1179,Learning Cross-Context Entity Representations from Text,"[""jeffreyling@google.com"", ""nfitz@google.com"", ""zifeis@google.com"", ""liviobs@google.com"", ""tfevry@google.com"", ""djweiss@google.com"", ""tomkwiat@google.com""]","[""Jeffrey Ling"", ""Nicholas FitzGerald"", ""Zifei Shan"", ""Livio Baldini Soares"", ""Thibault F\u00e9vry"", ""David Weiss"", ""Tom Kwiatkowski""]","[""entities"", ""entity representations"", ""knowledge representation"", ""entity linking"", ""entity typing""]","Language modeling tasks, in which words, or word-pieces, are predicted on the basis of a local context, have been very effective for learning word embeddings and context dependent representations of phrases. Motivated by the observation that efforts to code world knowledge into machine readable knowledge bases or human readable encyclopedias tend to be entity-centric, we investigate the use of a fill-in-the-blank task to learn context independent representations of entities from the text contexts in which those entities were mentioned. 
We show that large scale training of neural models allows us to learn high quality entity representations, and we demonstrate successful results on four domains: (1) existing entity-level typing benchmarks, including a 64% error reduction over previous work on TypeNet (Murty et al., 2018); (2) a novel few-shot category reconstruction task; (3) existing entity linking benchmarks, where we achieve a score of 87.3% on TAC-KBP 2010 without using any alias table, external knowledge base or in domain training data and (4) answering trivia questions, which uniquely identify entities. Our global entity representations encode fine-grained type categories, such as ""Scottish footballers"", and can answer trivia questions such as ""Who was the last inmate of Spandau jail in Berlin?"".",/pdf/903a7eddf7f5665b3caa8f227ecb62a9da529e8c.pdf,ICLR,2020,We investigate the use of a fill-in-the-blank task to learn context independent representations of entities from the text contexts in which those entities were mentioned +rC8sJ4i6kaH,mygGL1W6SiY,1601310000000.0,1616740000000.0,2250,Theoretical Analysis of Self-Training with Deep Networks on Unlabeled Data,"[""~Colin_Wei1"", ""kshen6@stanford.edu"", ""~Yining_Chen1"", ""~Tengyu_Ma1""]","[""Colin Wei"", ""Kendrick Shen"", ""Yining Chen"", ""Tengyu Ma""]","[""deep learning theory"", ""domain adaptation theory"", ""unsupervised learning theory"", ""semi-supervised learning theory""]","Self-training algorithms, which train a model to fit pseudolabels predicted by another previously-learned model, have been very successful for learning with unlabeled data using neural networks. However, the current theoretical understanding of self-training only applies to linear models. This work provides a unified theoretical analysis of self-training with deep networks for semi-supervised learning, unsupervised domain adaptation, and unsupervised learning. 
At the core of our analysis is a simple but realistic “expansion” assumption, which states that a low-probability subset of the data must expand to a neighborhood with large probability relative to the subset. We also assume that neighborhoods of examples in different classes have minimal overlap. We prove that under these assumptions, the minimizers of population objectives based on self-training and input-consistency regularization will achieve high accuracy with respect to ground-truth labels. By using off-the-shelf generalization bounds, we immediately convert this result to sample complexity guarantees for neural nets that are polynomial in the margin and Lipschitzness. Our results help explain the empirical successes of recently proposed self-training algorithms which use input consistency regularization.",/pdf/fe47b1d2bc77e31125aa9437f2979edc98a6d02c.pdf,ICLR,2021,"This paper provides accuracy guarantees for self-training with deep networks on polynomial unlabeled samples for semi-supervised learning, unsupervised domain adaptation, and unsupervised learning." +Hkx7xRVYDr,HkxuvVQ_PB,1569440000000.0,1583910000000.0,924,Duration-of-Stay Storage Assignment under Uncertainty,"[""mlli@mit.edu"", ""ewolf@lineagelogistics.com"", ""dwintz@lineagelogistics.com""]","[""Michael Lingzhi Li"", ""Elliott Wolf"", ""Daniel Wintz""]","[""Storage Assignment"", ""Deep Learning"", ""Duration-of-Stay"", ""Application"", ""Natural Language Processing"", ""Parallel Network""]","Storage assignment, the act of choosing what goods are placed in what locations in a warehouse, is a central problem of supply chain logistics. Past literature has shown that the optimal method to assign pallets is to arrange them in increasing duration of stay in the warehouse (the Duration-of-Stay, or DoS, method), but the methodology requires perfect prior knowledge of DoS for each pallet, which is unknown and uncertain under realistic conditions. 
Attempts to predict DoS have largely been unfruitful due to the multi-valuedness nature (every shipment contains multiple identical pallets with different DoS) and data sparsity induced by lack of matching historical conditions. In this paper, we introduce a new framework for storage assignment that provides a solution to the DoS prediction problem through a distributional reformulation and a novel neural network, ParallelNet. Through collaboration with a world-leading cold storage company, we show that the system is able to predict DoS with a MAPE of 29%, a decrease of ~30% compared to a CNN-LSTM model, and suffers less performance decay into the future. The framework is then integrated into a first-of-its-kind Storage Assignment system, which is being deployed in warehouses across United States, with initial results showing up to 21% in labor savings. We also release the first publicly available set of warehousing records to facilitate research into this central problem.",/pdf/f7efcc023fa6e51de2e0abcda63968dd94ea6de0.pdf,ICLR,2020,We develop a new storage assignment framework with a novel neural network that enables large efficiency gains in the warehouse. +SJlRDCVtwr,rJxLCqv_wS,1569440000000.0,1577170000000.0,1195,Simplicial Complex Networks,"[""mfirouzi@alphabist.com"", ""sadra.boreiri@epfl.ch"", ""hfirouzi@alphabist.com""]","[""Mohammad Firouzi"", ""Sadra Boreiri"", ""Hamed Firouzi""]","[""topological data analysis"", ""supervised learning"", ""simplicial approximation""]","Universal approximation property of neural networks is one of the motivations to use these models in various real-world problems. However, this property is not the only characteristic that makes neural networks unique as there is a wide range of other approaches with similar property. 
Another characteristic which makes these models interesting is that they can be trained with the backpropagation algorithm which allows an efficient gradient computation and gives these universal approximators the ability to efficiently learn complex manifolds from a large amount of data in different domains. Despite their abundant use in practice, neural networks are still not well understood and a broad range of ongoing research is to study the interpretability of neural networks. On the other hand, topological data analysis (TDA) relies on strong theoretical framework of (algebraic) topology along with other mathematical tools for analyzing possibly complex datasets. In this work, we leverage a universal approximation theorem originating from algebraic topology to build a connection between TDA and common neural network training framework. We introduce the notion of automatic subdivisioning and devise a particular type of neural networks for regression tasks: Simplicial Complex Networks (SCNs). SCN's architecture is defined with a set of bias functions along with a particular policy during the forward pass which alternates the common architecture search framework in neural networks. We believe the view of SCNs can be used as a step towards building interpretable deep learning models. Finally, we verify its performance on a set of regression problems.",/pdf/5635fefb9533068eb43b1f628d68824431d7f884.pdf,ICLR,2020,A novel method for supervised learning through subdivisioning the input space along with function approximation. 
+roNqYL0_XP,Akho39sajOp,1601310000000.0,1611610000000.0,3177,Learning Mesh-Based Simulation with Graph Networks,"[""~Tobias_Pfaff1"", ""~Meire_Fortunato1"", ""~Alvaro_Sanchez-Gonzalez1"", ""~Peter_Battaglia1""]","[""Tobias Pfaff"", ""Meire Fortunato"", ""Alvaro Sanchez-Gonzalez"", ""Peter Battaglia""]","[""graph networks"", ""simulation"", ""mesh"", ""physics""]","Mesh-based simulations are central to modeling complex physical systems in many disciplines across science and engineering. Mesh representations support powerful numerical integration methods and their resolution can be adapted to strike favorable trade-offs between accuracy and efficiency. However, high-dimensional scientific simulations are very expensive to run, and solvers and parameters must often be tuned individually to each system studied. +Here we introduce MeshGraphNets, a framework for learning mesh-based simulations using graph neural networks. Our model can be trained to pass messages on a mesh graph and to adapt the mesh discretization during forward simulation. Our results show it can accurately predict the dynamics of a wide range of physical systems, including aerodynamics, structural mechanics, and cloth. The model's adaptivity supports learning resolution-independent dynamics and can scale to more complex state spaces at test time. Our method is also highly efficient, running 1-2 orders of magnitude faster than the simulation on which it is trained. 
Our approach broadens the range of problems on which neural network simulators can operate and promises to improve the efficiency of complex, scientific modeling tasks.",/pdf/25e22a812f559c7389d64412f32a87195fb7acbb.pdf,ICLR,2021,We introduce a general method for learning the dynamics of complex physics systems accurately and efficiently on meshes +H1eCw3EKvH,SJlRHzH6Br,1569440000000.0,1583910000000.0,24,On the Weaknesses of Reinforcement Learning for Neural Machine Translation,"[""leshem.choshen@mail.huji.ac.il"", ""lior.fox@mail.huji.ac.il"", ""zohar.aizenbud@mail.huji.ac.il"", ""oabend@cs.huji.ac.il""]","[""Leshem Choshen"", ""Lior Fox"", ""Zohar Aizenbud"", ""Omri Abend""]","[""Reinforcement learning"", ""MRT"", ""minimum risk training"", ""reinforce"", ""machine translation"", ""peakkiness"", ""generation""]","Reinforcement learning (RL) is frequently used to increase performance in text generation tasks, +including machine translation (MT), +notably through the use of Minimum Risk Training (MRT) and Generative Adversarial Networks (GAN). +However, little is known about what and how these methods learn in the context of MT. +We prove that one of the most common RL methods for MT does not optimize the +expected reward, as well as show that other methods take an infeasibly long time to converge. +In fact, our results suggest that RL practices in MT are likely to improve performance +only where the pre-trained parameters are already close to yielding the correct translation. +Our findings further suggest that observed gains may be due to effects unrelated to the training signal, concretely, changes in the shape of the distribution curve.",/pdf/05e6f95de9f8eb952e3cbd7f1c21ac9a755cf9d6.pdf,ICLR,2020,Reinforcement practices for machine translation performance gains might not come from better predictions. 
+ByME42AqK7,S1xC3liqFQ,1538090000000.0,1551090000000.0,1442,Efficient Multi-Objective Neural Architecture Search via Lamarckian Evolution,"[""thomas.elsken@de.bosch.com"", ""janhendrik.metzen@de.bosch.com"", ""fh@cs.uni-freiburg.de""]","[""Thomas Elsken"", ""Jan Hendrik Metzen"", ""Frank Hutter""]","[""Neural Architecture Search"", ""AutoML"", ""AutoDL"", ""Deep Learning"", ""Evolutionary Algorithms"", ""Multi-Objective Optimization""]","Architecture search aims at automatically finding neural architectures that are competitive with architectures designed by human experts. While recent approaches have achieved state-of-the-art predictive performance for image recognition, they are problematic under resource constraints for two reasons: (1) the neural architectures found are solely optimized for high predictive performance, without penalizing excessive resource consumption; (2) most architecture search methods require vast computational resources. We address the first shortcoming by proposing LEMONADE, an evolutionary algorithm for multi-objective architecture search that allows approximating the Pareto-front of architectures under multiple objectives, such as predictive performance and number of parameters, in a single run of the method. We address the second shortcoming by proposing a Lamarckian inheritance mechanism for LEMONADE which generates children networks that are warmstarted with the predictive performance of their trained parents. This is accomplished by using (approximate) network morphism operators for generating children. 
The combination of these two contributions allows finding models that are on par or even outperform different-sized NASNets, MobileNets, MobileNets V2 and Wide Residual Networks on CIFAR-10 and ImageNet64x64 within only one week on eight GPUs, which is about 20-40x less compute power than previous architecture search methods that yield state-of-the-art performance.",/pdf/5e81ec86474db6f04ea7dc4b6b0683a16c1f099a.pdf,ICLR,2019,We propose a method for efficient Multi-Objective Neural Architecture Search based on Lamarckian inheritance and evolutionary algorithms. +4Nt1F3qf9Gn,Kt3uaeyUAX1,1601310000000.0,1614990000000.0,1639,"CLOCS: Contrastive Learning of Cardiac Signals Across Space, Time, and Patients","[""~Dani_Kiyasseh1"", ""tingting.zhu@eng.ox.ac.uk"", ""~David_A._Clifton1""]","[""Dani Kiyasseh"", ""Tingting Zhu"", ""David A. Clifton""]","[""Contrastive learning"", ""physiological signals"", ""healthcare""]","The healthcare industry generates troves of unlabelled physiological data. This data can be exploited via contrastive learning, a self-supervised pre-training method that encourages representations of instances to be similar to one another. We propose a family of contrastive learning methods, CLOCS, that encourages representations across space, time, \textit{and} patients to be similar to one another. We show that CLOCS consistently outperforms the state-of-the-art methods, BYOL and SimCLR, when performing a linear evaluation of, and fine-tuning on, downstream tasks. We also show that CLOCS achieves strong generalization performance with only 25\% of labelled training data. Furthermore, our training procedure naturally generates patient-specific representations that can be used to quantify patient-similarity. 
",/pdf/a8c3a853fe897be275cb53f7b029e666f23158bb.pdf,ICLR,2021, +D62nJAdpijt,TfSifKj7sw,1601310000000.0,1614990000000.0,1254,Trojans and Adversarial Examples: A Lethal Combination,"[""~Guanxiong_Liu1"", ""ikhalil@hbku.edu.qa"", ""~Abdallah_Khreishah1"", ""~Hai_Phan1""]","[""Guanxiong Liu"", ""Issa Khalil"", ""Abdallah Khreishah"", ""Hai Phan""]",[],"In this work, we naturally unify adversarial examples and Trojan backdoors into a new stealthy attack, that is activated only when 1) adversarial perturbation is injected into the input examples and 2) a Trojan backdoor is used to poison the training process simultaneously. Different from traditional attacks, we leverage adversarial noise in the input space to move Trojan-infected examples across the model decision boundary, thus making it difficult to be detected. Our attack can fool the user into accidentally trusting the infected model as a robust classifier against adversarial examples. We perform a thorough analysis and conduct an extensive set of experiments on several benchmark datasets to show that our attack can bypass existing defenses with a success rate close to 100%.",/pdf/6c70f3c04ce551e071f1ed5fb2fb7910bd437f23.pdf,ICLR,2021, +HyxLRTVKPH,Byer_FZOvS,1569440000000.0,1583910000000.0,856,Budgeted Training: Rethinking Deep Neural Network Training Under Resource Constraints,"[""mtli@cs.cmu.edu"", ""meyumer@gmail.com"", ""deva@cs.cmu.edu""]","[""Mengtian Li"", ""Ersin Yumer"", ""Deva Ramanan""]","[""budgeted training"", ""learning rate schedule"", ""linear schedule"", ""annealing"", ""learning rate decay""]","In most practical settings and theoretical analyses, one assumes that a model can be trained until convergence. However, the growing complexity of machine learning datasets and models may violate such assumptions. Indeed, current approaches for hyper-parameter tuning and neural architecture search tend to be limited by practical resource constraints. 
Therefore, we introduce a formal setting for studying training under the non-asymptotic, resource-constrained regime, i.e., budgeted training. We analyze the following problem: ""given a dataset, algorithm, and fixed resource budget, what is the best achievable performance?"" We focus on the number of optimization iterations as the representative resource. Under such a setting, we show that it is critical to adjust the learning rate schedule according to the given budget. Among budget-aware learning schedules, we find simple linear decay to be both robust and high-performing. We support our claim through extensive experiments with state-of-the-art models on ImageNet (image classification), Kinetics (video classification), MS COCO (object detection and instance segmentation), and Cityscapes (semantic segmentation). We also analyze our results and find that the key to a good schedule is budgeted convergence, a phenomenon whereby the gradient vanishes at the end of each allowed budget. We also revisit existing approaches for fast convergence and show that budget-aware learning schedules readily outperform such approaches under (the practical but under-explored) budgeted training setting.",/pdf/bc6efbe44896472edd87187c87d89ea3d4b0ed22.pdf,ICLR,2020,Introduce a formal setting for budgeted training and propose a budget-aware linear learning rate schedule +SJx0oAEYwH,ByxM7VFdDS,1569440000000.0,1577170000000.0,1339,Cover Filtration and Stable Paths in the Mapper,"[""dustin.arendt@pnnl.gov"", ""matthew.broussard@wsu.edu"", ""kbala@wsu.edu"", ""nat@riverasaul.com""]","[""Dustin L. Arendt"", ""Matthew Broussard"", ""Bala Krishnamoorthy"", ""Nathaniel Saul""]","[""cover and nerve"", ""Jaccard distance"", ""stable paths in filtration"", ""Mapper"", ""recommender systems"", ""explainable machine learning""]","The contributions of this paper are two-fold. 
We define a new filtration called the cover filtration built from a single cover based on a generalized Steinhaus distance, which is a generalization of Jaccard distance. We then develop a language and theory for stable paths within this filtration, inspired by ideas of persistent homology. This framework can be used to develop several new learning representations in applications where an obvious metric may not be defined but a cover is readily available. We demonstrate the utility of our framework as applied to recommendation systems and explainable machine learning. + +We demonstrate a new perspective for modeling recommendation system data sets that does not require manufacturing a bespoke metric. As a direct application, we find that the stable paths identified by our framework in a movies data set represent a sequence of movies constituting a gentle transition and ordering from one genre to another. + +For explainable machine learning, we apply the Mapper for model induction, providing explanations in the form of paths between subpopulations. Our framework provides an alternative way of building a filtration from a single mapper that is then used to explore stable paths. As a direct illustration, we build a mapper from a supervised machine learning model trained on the FashionMNIST data set. We show that the stable paths in the cover filtration provide improved explanations of relationships between subpopulations of images. +",/pdf/f9836d54e997066d9c2a51877933c6b892940a49.pdf,ICLR,2020,"A new filtration from a SINGLE cover, with applications to movie recommendations and explainable machine learning" +S1lF8xHYwS,B1eRw9xFPB,1569440000000.0,1577170000000.0,2330,Unsupervised Domain Adaptation through Self-Supervision,"[""yusun@berkeley.edu"", ""etzeng@eecs.berkeley.edu"", ""trevor@eecs.berkeley.edu"", ""efros@eecs.berkeley.edu""]","[""Yu Sun"", ""Eric Tzeng"", ""Trevor Darrell"", ""Alexei A. 
Efros""]","[""unsupervised domain adaptation""]","This paper addresses unsupervised domain adaptation, the setting where labeled training data is available on a source domain, but the goal is to have good performance on a target domain with only unlabeled data. Like much of previous work, we seek to align the learned representations of the source and target domains while preserving discriminability. The way we accomplish alignment is by learning to perform auxiliary self-supervised task(s) on both domains simultaneously. Each self-supervised task brings the two domains closer together along the direction relevant to that task. Training this jointly with the main task classifier on the source domain is shown to successfully generalize to the unlabeled target domain. The presented objective is straightforward to implement and easy to optimize. We achieve state-of-the-art results on four out of seven standard benchmarks, and competitive results on segmentation adaptation. We also demonstrate that our method composes well with another popular pixel-level adaptation method.",/pdf/0698da137652ee4960ac278be5e1e42204ffedbb.pdf,ICLR,2020,We use self-supervision on both domain to align them for unsupervised domain adaptation. +04LZCAxMSco,PN-Qs9lNQ7x,1601310000000.0,1616560000000.0,2604,Learning a Latent Simplex in Input Sparsity Time,"[""~Ainesh_Bakshi1"", ""~Chiranjib_Bhattacharyya1"", ""~Ravi_Kannan1"", ""~David_Woodruff1"", ""~Samson_Zhou1""]","[""Ainesh Bakshi"", ""Chiranjib Bhattacharyya"", ""Ravi Kannan"", ""David Woodruff"", ""Samson Zhou""]","[""Latent Simplex"", ""numerical linear algebra"", ""low-rank approximation""]","We consider the problem of learning a latent $k$-vertex simplex $K\in\mathbb{R}^d$, given $\mathbf{A}\in\mathbb{R}^{d\times n}$, which can be viewed as $n$ data points that are formed by randomly perturbing some latent points in $K$, possibly beyond $K$. 
A large class of latent variable models, such as adversarial clustering, mixed membership stochastic block models, and topic models can be cast in this view of learning a latent simplex. Bhattacharyya and Kannan (SODA 2020) give an algorithm for learning such a $k$-vertex latent simplex in time roughly $O(k\cdot\text{nnz}(\mathbf{A}))$, where $\text{nnz}(\mathbf{A})$ is the number of non-zeros in $\mathbf{A}$. We show that the dependence on $k$ in the running time is unnecessary given a natural assumption about the mass of the top $k$ singular values of $\mathbf{A}$, which holds in many of these applications. Further, we show this assumption is necessary, as otherwise an algorithm for learning a latent simplex would imply a better low rank approximation algorithm than what is known. + +We obtain a spectral low-rank approximation to $\mathbf{A}$ in input-sparsity time and show that the column space thus obtained has small $\sin\Theta$ (angular) distance to the right top-$k$ singular space of $\mathbf{A}$. Our algorithm then selects $k$ points in the low-rank subspace with the largest inner product (in absolute value) with $k$ carefully chosen random vectors. By working in the low-rank subspace, we avoid reading the entire matrix in each iteration and thus circumvent the $\Theta(k\cdot\text{nnz}(\mathbf{A}))$ running time.",/pdf/f27cd4887ef1ff6feb1973dc1b610ff04bfe30d3.pdf,ICLR,2021,We obtain the first input sparsity runtime algorithm for the problem of learning a latent simplex. +T3RyQtRHebj,0k1cbxDtITD,1601310000000.0,1614990000000.0,943,Slot Machines: Discovering Winning Combinations of Random Weights in Neural Networks,"[""~Maxwell_Mbabilla_Aladago1"", ""~Lorenzo_Torresani1""]","[""Maxwell Mbabilla Aladago"", ""Lorenzo Torresani""]","[""initialization"", ""optimization""]","In contrast to traditional weight optimization in a continuous space, we demonstrate the existence of effective random networks whose weights are never updated. 
By selecting a weight among a fixed set of random values for each individual connection, our method uncovers combinations of random weights that match the performance of trained networks of the same capacity. We refer to our networks as ``slot machines'' where each reel (connection) contains a fixed set of symbols (random values). Our backpropagation algorithm ``spins'' the reels to seek ``winning'' combinations, i.e., selections of random weight values that minimize the given loss. Quite surprisingly, we find that allocating just a few random values to each connection (e.g., 8 values per connection) yields highly competitive combinations despite being dramatically more constrained compared to traditionally learned weights. Moreover, finetuning these combinations often improves performance over the trained baselines. A randomly initialized VGG-19 with 8 values per connection contains a combination that achieves 90% test accuracy on CIFAR-10. Our method also achieves an impressive performance of 98.1% on MNIST for neural networks containing only random weights. ",/pdf/8bd6c77eafac7db4429250697bc5da67ff8e73c1.pdf,ICLR,2021,"In contrast to traditional weight optimization in a continuous space, we demonstrate the existence of effective random networks whose weights are never updated." +aFvG-DNPNB9,eUKZIpuieI3,1601310000000.0,1614990000000.0,1711,Self-Reflective Variational Autoencoder,"[""~Ifigeneia_Apostolopoulou1"", ""~Elan_Rosenfeld1"", ""~Artur_Dubrawski2""]","[""Ifigeneia Apostolopoulou"", ""Elan Rosenfeld"", ""Artur Dubrawski""]","[""deep generative models"", ""variational inference"", ""approximate inference"", ""variational auto encoder""]","The Variational Autoencoder (VAE) is a powerful framework for learning probabilistic latent variable generative models. However, typical assumptions on the approximate posterior distributions can substantially restrict its capacity for inference and generative modeling. 
Variational inference based on neural autoregressive models respects the conditional dependencies of the exact posterior, but this flexibility comes at a cost: the resulting models are expensive to train in high-dimensional regimes and can be slow to produce samples. In this work, we introduce an orthogonal solution, which we call self-reflective inference. By redesigning the hierarchical structure of existing VAE architectures, self-reflection ensures that the stochastic flow preserves the factorization of the exact posterior, sequentially updating the latent codes in a manner consistent with the generative model. We empirically demonstrate the advantages of matching the variational posterior to the exact posterior---on binarized MNIST self-reflective inference achieves state-of-the-art performance without resorting to complex, computationally expensive components such as autoregressive layers. Moreover, we design a variational normalizing flow that employs the proposed architecture, yielding predictive benefits compared to its purely generative counterpart. Our proposed modification is quite general and it complements the existing literature; self-reflective inference can naturally leverage advances in distribution estimation and generative modeling to improve the capacity of each layer in the hierarchy.",/pdf/8a3d69197c2bccdd595cb4a1346c9dbfe95f7b07.pdf,ICLR,2021,We present the first deep probabilistic model without modeling mismatches between the true and variational posterior yielding computational and predictive benefits. 
+r1xRW3A9YX,BkeNpqTttm,1538090000000.0,1545360000000.0,1221,Riemannian TransE: Multi-relational Graph Embedding in Non-Euclidean Space,"[""atsushi-suzuki@g.ecc.u-tokyo.ac.jp"", ""xenolay@g.ecc.u-tokyo.ac.jp"", ""yamanishi@mist.i.u-tokyo.ac.jp""]","[""Atsushi Suzuki"", ""Yosuke Enokida"", ""Kenji Yamanishi""]","[""Riemannian TransE"", ""graph embedding"", ""multi-relational graph"", ""Riemannian manifold"", ""TransE"", ""hyperbolic space"", ""sphere"", ""knowledge base""]","Multi-relational graph embedding, which aims at achieving effective representations with reduced low-dimensional parameters, has been widely used in knowledge base completion. Although knowledge base data usually contains tree-like or cyclic structure, none of the existing approaches can embed these data into a compatible space that is in line with the structure. To overcome this problem, a novel framework, called Riemannian TransE, is proposed in this paper to embed the entities in a Riemannian manifold. Riemannian TransE models each relation as a move to a point and defines specific novel distance dissimilarity for each relation, so that all the relations are naturally embedded in correspondence to the structure of data. Experiments on several knowledge base completion tasks have shown that, based on an appropriate choice of manifold, Riemannian TransE achieves good performance even with a significantly reduced number of parameters.",/pdf/54f8007973526d44affa09a68244a15c15fb5075.pdf,ICLR,2019,Multi-relational graph embedding with Riemannian manifolds and TransE-like loss function. 
+Hyx5qhEYvH,SJgTdIR1Dr,1569440000000.0,1577170000000.0,126,A SPIKING SEQUENTIAL MODEL: RECURRENT LEAKY INTEGRATE-AND-FIRE,"[""samuel.gao023@gmail.com"", ""hongwei.wang@lynxi.com"", ""zhh@bupt.edu.cn"", ""wangmeng_wm@bupt.edu.cn"", ""zhenzhi.wu@lynxi.com""]","[""Daiheng Gao"", ""Hongwei Wang"", ""Hehui Zhang"", ""Meng Wang"", ""Zhenzhi Wu""]","[""spiking neural network"", ""RNN"", ""spiking mode"", ""brain-inspired"", ""text summarization"", ""DVS""]","Stemming from neuroscience, Spiking neural networks (SNNs), a brain-inspired neural network that is a versatile solution to fault-tolerant and energy efficient information processing pertains to the ”event-driven” characteristic as the analogy of the behavior of biological neurons. However, they are inferior to artificial neural networks (ANNs) in real complicated tasks and only had it been achieved good results in rather simple applications. When ANNs usually being questioned about it expensive processing costs and lack of essential biological plausibility, the temporal characteristic of RNN-based architecture makes it suitable to incorporate SNN inside as imitating the transition of membrane potential through time, and a brain-inspired Recurrent Leaky Integrate-and-Fire (RLIF) model has been put forward to overcome a series of challenges, such as discrete binary output and dynamical trait. The experiment results show that our recurrent architecture has an ultra anti-interference ability and strictly follows the guideline of SNN that spike output through it is discrete. 
Furthermore, this architecture achieves a good result on neuromorphic datasets and can be extended to tasks like text summarization and video understanding.",/pdf/02486a4cdbdec72831586b064e284a167f0a882d.pdf,ICLR,2020, +iKQAk8a2kM0,RTXXssPhjz6,1601310000000.0,1613870000000.0,289,Targeted Attack against Deep Neural Networks via Flipping Limited Weight Bits,"[""~Jiawang_Bai2"", ""~Baoyuan_Wu1"", ""~Yong_Zhang6"", ""~Yiming_Li1"", ""~Zhifeng_Li5"", ""~Shu-Tao_Xia1""]","[""Jiawang Bai"", ""Baoyuan Wu"", ""Yong Zhang"", ""Yiming Li"", ""Zhifeng Li"", ""Shu-Tao Xia""]","[""targeted attack"", ""bit-flip"", ""weight attack""]","To explore the vulnerability of deep neural networks (DNNs), many attack paradigms have been well studied, such as the poisoning-based backdoor attack in the training stage and the adversarial attack in the inference stage. In this paper, we study a novel attack paradigm, which modifies model parameters in the deployment stage for malicious purposes. Specifically, our goal is to misclassify a specific sample into a target class without any sample modification, while not significantly reducing the prediction accuracy of other samples, to ensure stealthiness. To this end, we formulate this problem as a binary integer programming (BIP) problem, since the parameters are stored as binary bits ($i.e.$, 0 and 1) in the memory. By utilizing the latest technique in integer programming, we equivalently reformulate this BIP problem as a continuous optimization problem, which can be effectively and efficiently solved using the alternating direction method of multipliers (ADMM). Consequently, the flipped critical bits can be easily determined through optimization, rather than using a heuristic strategy. Extensive experiments demonstrate the superiority of our method in attacking DNNs.",/pdf/ed4d75e28ae70ba28f4895cf7097cf634745d11a.pdf,ICLR,2021,We propose a targeted attack method against the deployed DNN via flipping a few binary weight bits. 
+SkgewU5ll,,1478290000000.0,1484260000000.0,299,GRAM: Graph-based Attention Model for Healthcare Representation Learning,"[""mp2893@gatech.edu"", ""bahadori@gatech.edu"", ""lsong@cc.gatech.edu"", ""stewarwf@sutterhealth.org"", ""jsun@cc.gatech.edu""]","[""Edward Choi"", ""Mohammad Taha Bahadori"", ""Le Song"", ""Walter F. Stewart"", ""Jimeng Sun""]","[""Deep learning"", ""Applications""]","Deep learning methods exhibit promising performance for predictive modeling in healthcare, but two important challenges remain: +- Data insufficiency: Often in healthcare predictive modeling, the sample size is insufficient for deep learning methods to achieve satisfactory results. +- Interpretation: The representations learned by deep learning models should align with medical knowledge. +To address these challenges, we propose a GRaph-based Attention Model, GRAM that supplements electronic health records (EHR) with hierarchical information inherent to medical ontologies. +Based on the data volume and the ontology structure, GRAM represents a medical concept as a combination of its ancestors in the ontology via an attention mechanism. +We compared predictive performance (i.e. accuracy, data needs, interpretability) of GRAM to various methods including the recurrent neural network (RNN) in two sequential diagnoses prediction tasks and one heart failure prediction task. +Compared to the basic RNN, GRAM achieved 10% higher accuracy for predicting diseases rarely observed in the training data and 3% improved area under the ROC curve for predicting heart failure using an order of magnitude less training data. Additionally, unlike other methods, the medical concept representations learned by GRAM are well aligned with the medical ontology. 
Finally, GRAM exhibits intuitive attention behaviors by adaptively generalizing to higher level concepts when facing data insufficiency at the lower level concepts.",/pdf/e986742070dbf1d3a51ef5901c4360422b21498b.pdf,ICLR,2017,We propose a novel attention mechanism on graphs to learn representations for medical concepts from both data and medical ontologies to cope with insufficient data volume. +KIS8jqLp4fQ,FxDZjRSNB1,1601310000000.0,1614990000000.0,2054,On Dynamic Noise Influence in Differential Private Learning,"[""~Junyuan_Hong1"", ""~Zhangyang_Wang1"", ""~Jiayu_Zhou1""]","[""Junyuan Hong"", ""Zhangyang Wang"", ""Jiayu Zhou""]","[""privacy"", ""private learning"", ""dynamic policy""]","Protecting privacy in learning while maintaining the model performance has become increasingly critical in many applications that involve sensitive data. Private Gradient Descent (PGD) is a commonly used private learning framework, which adds noise according to the Differential Privacy protocol. Recent studies show that dynamic privacy schedules of decreasing noise magnitudes can improve loss at the final iteration, and yet theoretical understandings of the effectiveness of such schedules and their connections to optimization algorithms remain limited. In this paper, we provide a comprehensive analysis of noise influence in dynamic privacy schedules to answer these critical questions. We first present a dynamic noise schedule minimizing the utility upper bound of PGD, and show how the noise influence from each optimization step collectively impacts utility of the final model. Our study also reveals how impacts from dynamic noise influence change when momentum is used. 
We empirically show the connection exists for general non-convex losses, and the influence is greatly impacted by the loss curvature.",/pdf/e88daa18d9d5316bd1ad22aa67216c64d00226b5.pdf,ICLR,2021,Improve utility upper bound for differential private learning by dynamic noise influence +Y5TgO3J_Glc,vd6ls9886Lv,1601310000000.0,1614990000000.0,444,Neurosymbolic Deep Generative Models for Sequence Data with Relational Constraints,"[""~Halley_Young1"", ""maxdu@seas.upenn.edu"", ""~Osbert_Bastani1""]","[""Halley Young"", ""Maxwell Du"", ""Osbert Bastani""]","[""neurosymbolic"", ""sequence"", ""program synthesis"", ""generative"", ""constraint"", ""music"", ""poetry""]","Recently, there has been significant progress designing deep generative models that generate realistic sequence data such as text or music. Nevertheless, it remains difficult to incorporate high-level structure to guide the generative process. We propose a novel approach for incorporating structure in the form of relational constraints between different subcomponents of an example (e.g., lines of a poem or measures of music). Our generative model has two parts: (i) one model to generate a realistic set of relational constraints, and (ii) a second model to generate realistic data satisfying these constraints. To train model (i), we propose a novel program synthesis algorithm that infers the relational constraints present in the training data, and then train the models based on the resulting relational constraints. In our experiments, we show that our approach significantly improves over state-of-the-art approaches in terms of capturing high-level structure in the data, while performing comparably or better in terms of low-level structure.",/pdf/b2658086a36edaa830b58609127b751d4e0f2d60.pdf,ICLR,2021,We use program synthesis to learn and condition on the relational structure of real sequential data. 
+Sye2s2VtDr,HyxPefzWvS,1569440000000.0,1577170000000.0,166,Automatically Learning Feature Crossing from Model Interpretation for Tabular Data,"[""zhaocheng.liu@realai.ai"", ""qiang.liu@realai.ai"", ""haoli.zhang@realai.ai""]","[""Zhaocheng Liu"", ""Qiang Liu"", ""Haoli Zhang""]","[""AutoML"", ""feature crossing"", ""interpretation""]","Automatic feature generation is a major topic of automated machine learning. Among various feature generation approaches, feature crossing, which takes the cross-product of sparse features, is a promising way to effectively capture the interactions among categorical features in tabular data. Previous works on feature crossing try to search in the set of all the possible cross feature fields. This is obviously not efficient when the size of original feature fields is large. Meanwhile, some deep learning-based methods combine deep neural networks and various interaction components. However, due to the existence of Deep Neural Networks (DNN), only a few cross features can be explicitly generated by the interaction components. Recently, piece-wise interpretation of DNN has been widely studied, and the piece-wise interpretations are usually inconsistent in different samples. Inspired by this, we give a definition of interpretation inconsistency in DNN, and propose a novel method called CrossGO, which selects useful cross features according to the interpretation inconsistency. The whole process of learning feature crossing can be done via simply training a DNN model and a logistic regression (LR) model. CrossGO can generate a compact candidate set of cross feature fields, and improve the efficiency of searching. Extensive experiments have been conducted on several real-world datasets. 
Cross features generated by CrossGO can enable a simple LR model to achieve comparable or even better performance than complex DNN models.",/pdf/c39b15440e8a22dfb2c8ffd1b37118d17d498950.pdf,ICLR,2020,"We propose a novel method called CrossGO, which automatically and efficiently selects useful cross features according to the interpretation inconsistency computed in deep neural networks." +BJjBnN9a-,rk5B3EqaW,1508690000000.0,1518730000000.0,43,Continuous Convolutional Neural Networks for Image Classification,"[""vitor.guizilini@sydney.edu.au"", ""fabio.ramos@sydney.edu.au""]","[""Vitor Guizilini"", ""Fabio Ramos""]","[""convolutional neural networks"", ""image classification"", ""deep learning"", ""feature representation"", ""hilbert maps"", ""reproducing kernel hilbert space""]","This paper introduces the concept of continuous convolution to neural networks and deep learning applications in general. Rather than directly using discretized information, input data is first projected into a high-dimensional Reproducing Kernel Hilbert Space (RKHS), where it can be modeled as a continuous function using a series of kernel bases. We then proceed to derive a closed-form solution to the continuous convolution operation between two arbitrary functions operating in different RKHS. Within this framework, convolutional filters also take the form of continuous functions, and the training procedure involves learning the RKHS to which each of these filters is projected, alongside their weight parameters. This results in much more expressive filters, that do not require spatial discretization and benefit from properties such as adaptive support and non-stationarity. 
Experiments on image classification are performed, using classical datasets, with results indicating that the proposed continuous convolutional neural network is able to achieve competitive accuracy rates with far fewer parameters and a faster convergence rate.",/pdf/00cfd9978a15230bc5705eb19c22b0d536589c2b.pdf,ICLR,2018,This paper proposes a novel convolutional layer that operates in a continuous Reproducing Kernel Hilbert Space. +B1l08oAct7,HkgXkJ8qK7,1538090000000.0,1550360000000.0,220,Deterministic Variational Inference for Robust Bayesian Neural Networks,"[""anqiw@princeton.edu"", ""sebastian.nowozin@microsoft.com"", ""ted.meeds@microsoft.com"", ""ret26@cam.ac.uk"", ""jmh233@cam.ac.uk"", ""algaunt@microsoft.com""]","[""Anqi Wu"", ""Sebastian Nowozin"", ""Edward Meeds"", ""Richard E. Turner"", ""Jos\u00e9 Miguel Hern\u00e1ndez-Lobato"", ""Alexander L. Gaunt""]","[""Bayesian neural network"", ""variational inference"", ""variational bayes"", ""variance reduction"", ""empirical bayes""]","Bayesian neural networks (BNNs) hold great promise as a flexible and principled solution to deal with uncertainty when learning from finite data. Among approaches to realize probabilistic inference in deep neural networks, variational Bayes (VB) is theoretically grounded, generally applicable, and computationally efficient. With wide recognition of potential advantages, why is it that variational Bayes has seen very limited practical use for BNNs in real applications? We argue that variational inference in neural networks is fragile: successful implementations require careful initialization and tuning of prior variances, as well as controlling the variance of Monte Carlo gradient estimates. 
We provide two innovations that aim to turn VB into a robust inference tool for Bayesian neural networks: first, we introduce a novel deterministic method to approximate moments in neural networks, eliminating gradient variance; second, we introduce a hierarchical prior for parameters and a novel Empirical Bayes procedure for automatically selecting prior variances. Combining these two innovations, the resulting method is highly efficient and robust. On the application of heteroscedastic regression we demonstrate good predictive performance over alternative approaches.",/pdf/08cba0bf95bb1cc37174e137630045570fec8abc.pdf,ICLR,2019,A method for eliminating gradient variance and automatically tuning priors for effective training of bayesian neural networks +SklKcRNYDH,BklHNau_vS,1569440000000.0,1583910000000.0,1291,Extreme Tensoring for Low-Memory Preconditioning ,"[""xinyic@google.com"", ""namanagarwal@google.com"", ""ehazan@cs.princeton.edu"", ""cyril.zhang@cs.princeton.edu"", ""y.zhang@cs.princeton.edu""]","[""Xinyi Chen"", ""Naman Agarwal"", ""Elad Hazan"", ""Cyril Zhang"", ""Yi Zhang""]","[""optimization"", ""deep learning""]","State-of-the-art models are now trained with billions of parameters, reaching hardware limits in terms of memory consumption. This has created a recent demand for memory-efficient optimizers. To this end, we investigate the limits and performance tradeoffs of memory-efficient adaptively preconditioned gradient methods. We propose \emph{extreme tensoring} for high-dimensional stochastic optimization, showing that an optimizer needs very little memory to benefit from adaptive preconditioning. Our technique applies to arbitrary models (not necessarily with tensor-shaped parameters), and is accompanied by regret and convergence guarantees, which shed light on the tradeoffs between preconditioner quality and expressivity. 
On a large-scale NLP model, we reduce the optimizer memory overhead by three orders of magnitude, without degrading performance.",/pdf/cbc8beb0d5c1eb1fa1422f556e190231cbd11556.pdf,ICLR,2020, +RovX-uQ1Hua,3fALgKvvZN_,1601310000000.0,1614740000000.0,655,Text Generation by Learning from Demonstrations,"[""~Richard_Yuanzhe_Pang1"", ""~He_He2""]","[""Richard Yuanzhe Pang"", ""He He""]","[""text generation"", ""learning from demonstrations"", ""nlp""]","Current approaches to text generation largely rely on autoregressive models and maximum likelihood estimation. This paradigm leads to (i) diverse but low-quality samples due to mismatched learning objective and evaluation metric (likelihood vs. quality) and (ii) exposure bias due to mismatched history distributions (gold vs. model-generated). To alleviate these problems, we frame text generation as an offline reinforcement learning (RL) problem with expert demonstrations (i.e., the reference), where the goal is to maximize quality given model-generated histories. We propose GOLD (generation by off-policy learning from demonstrations): an easy-to-optimize algorithm that learns from the demonstrations by importance weighting. Intuitively, GOLD upweights confident tokens and downweights unconfident ones in the reference during training, avoiding optimization issues faced by prior RL approaches that rely on online data collection. According to both automatic and human evaluation, models trained by GOLD outperform those trained by MLE and policy gradient on summarization, question generation, and machine translation. 
Further, our models are less sensitive to decoding algorithms and alleviate exposure bias.",/pdf/b85a24ba77b30aff76c1f56b4e90e23fea31f402.pdf,ICLR,2021, +ryljMpNtwr,H1eJi4hLPH,1569440000000.0,1577170000000.0,424,Benchmarking Robustness in Object Detection: Autonomous Driving when Winter is Coming,"[""claudio.michaelis@uni-tuebingen.de"", ""benjamin.mitzkus@uni-tuebingen.de"", ""robert@geirhos.de"", ""evgenia.rusak@bethgelab.org"", ""oliver.bringmann@uni-tuebingen.de"", ""alexander.ecker@uni-tuebingen.de"", ""matthias@bethgelab.org"", ""wieland.brendel@bethgelab.org""]","[""Claudio Michaelis"", ""Benjamin Mitzkus"", ""Robert Geirhos"", ""Evgenia Rusak"", ""Oliver Bringmann"", ""Alexander S. Ecker"", ""Matthias Bethge"", ""Wieland Brendel""]","[""deep learning"", ""object detection"", ""robustness"", ""neural networks"", ""data augmentation"", ""autonomous driving""]","The ability to detect objects regardless of image distortions or weather conditions is crucial for real-world applications of deep learning like autonomous driving. We here provide an easy-to-use benchmark to assess how object detection models perform when image quality degrades. The three resulting benchmark datasets, termed PASCAL-C, COCO-C and Cityscapes-C, contain a large variety of image corruptions. We show that a range of standard object detection models suffer a severe performance loss on corrupted images (down to 30-60% of the original performance). However, a simple data augmentation trick - stylizing the training images - leads to a substantial increase in robustness across corruption type, severity and dataset. We envision our comprehensive benchmark to track future progress towards building robust object detection models. Benchmark, code and data are available at: (hidden for double blind review)",/pdf/b9a472db882a5c76b0fcddc5dfc073b3a146d87a.pdf,ICLR,2020,"A benchmark to assess the robustness of object detection models towards common image corruptions. 
Like classification models, object detection models perform worse on corrupted images. Training with stylized data reduces the gap for all corruptions." +3FkrodAXdk,v8WTco5-v53,1601310000000.0,1614990000000.0,2023,Deep Ensembles with Hierarchical Diversity Pruning,"[""~Yanzhao_Wu1"", ""~Ling_Liu3""]","[""Yanzhao Wu"", ""Ling Liu""]","[""Ensemble"", ""Diversity Metrics"", ""Hierarchical Pruning"", ""Ensemble Accuracy"", ""Deep Neural Networks""]","Diverse deep ensembles hold the potential for improving accuracy and robustness of deep learning models. Both pairwise and non-pairwise ensemble diversity metrics have been proposed over the past two decades. However, it is also challenging to find the right metrics that can effectively prune those deep ensembles with insufficient ensemble diversity, thus failing to deliver effective ensemble accuracy. In this paper, we first compare six popular diversity metrics in the literature, coined as Q metrics, including both pairwise and non-pairwise representatives. We analyze their inherent limitations in capturing the negative correlation of ensemble member models, and thus inefficient in identifying and pruning low quality ensembles. We next present six HQ ensemble diversity metrics by extending the existing Q-metrics with three novel optimizations: (1) We introduce the concept of focal model and separately measure the ensemble diversity among the deep ensembles of the same team size with the concept of focal model, aiming to better capture the negative correlations of member models of an ensemble. (2) We introduce six HQ-diversity metrics to optimize the corresponding Q-metrics respectively in terms of measuring negative correlation among member models of an ensemble using its ensemble diversity score. 
(3) We introduce a two-phase hierarchical pruning method to effectively identify and prune those deep ensembles with high HQ diversity scores, aiming to increase the lower and upper bounds on ensemble accuracy for the selected ensembles. By combining these three optimizations, deep ensembles selected based on our hierarchical diversity pruning approach significantly outperform those selected by the corresponding Q-metrics. Comprehensive experimental evaluation over several benchmark datasets shows that our HQ-metrics can effectively select high diversity deep ensembles by pruning out those ensembles with insufficient diversity, and successfully increase the lower bound (worst case) accuracy of the selected deep ensembles, compared to those selected using the state-of-the-art Q-metrics.",/pdf/c3485a7b58c4fe8648e8aa6aebe729e1d40ed481.pdf,ICLR,2021,Our proposed HQ-metrics significantly outperformed the state-of-the-art Q-metrics in effectively selecting high quality deep ensembles. +ryjw_eAaZ,Sy9DugR6b,1508930000000.0,1518730000000.0,83,Unsupervised Deep Structure Learning by Recursive Dependency Analysis,"[""raanan.y.yehezkel.rohekar@intel.com"", ""guy.koren@intel.com"", ""shami.nisimov@intel.com"", ""gal.novik@intel.com""]","[""Raanan Y. Yehezkel Rohekar"", ""Guy Koren"", ""Shami Nisimov"", ""Gal Novik""]","[""unsupervised learning"", ""structure learning"", ""deep belief networks"", ""probabilistic graphical models"", ""Bayesian networks""]","We introduce an unsupervised structure learning algorithm for deep, feed-forward, neural networks. We propose a new interpretation for depth and inter-layer connectivity where a hierarchy of independencies in the input distribution is encoded in the network structure. This results in structures allowing neurons to connect to neurons in any deeper layer skipping intermediate layers. 
Moreover, neurons in deeper layers encode low-order (small condition sets) independencies and have a wide scope of the input, whereas neurons in the first layers encode higher-order (larger condition sets) independencies and have a narrower scope. Thus, the depth of the network is automatically determined---equal to the maximal order of independence in the input distribution, which is the recursion-depth of the algorithm. The proposed algorithm constructs two main graphical models: 1) a generative latent graph (a deep belief network) learned from data and 2) a deep discriminative graph constructed from the generative latent graph. We prove that conditional dependencies between the nodes in the learned generative latent graph are preserved in the class-conditional discriminative graph. Finally, a deep neural network structure is constructed based on the discriminative graph. We demonstrate on image classification benchmarks that the algorithm replaces the deepest layers (convolutional and dense layers) of common convolutional networks, achieving high classification accuracy, while constructing significantly smaller structures. The proposed structure learning algorithm requires a small computational cost and runs efficiently on a standard desktop CPU.",/pdf/7d3eb1f9185363784dbf71f24bcbfcadfefece13.pdf,ICLR,2018,A principled approach for structure learning of deep neural networks with a new interpretation for depth and inter-layer connectivity. +HkeoOo09YX,HJeCew8KY7,1538090000000.0,1550790000000.0,383,Meta-Learning For Stochastic Gradient MCMC,"[""wg242@cam.ac.uk"", ""yl494@cam.ac.uk"", ""jmh233@cam.ac.uk""]","[""Wenbo Gong"", ""Yingzhen Li"", ""Jos\u00e9 Miguel Hern\u00e1ndez-Lobato""]","[""Meta Learning"", ""MCMC""]","Stochastic gradient Markov chain Monte Carlo (SG-MCMC) has become increasingly popular for simulating posterior samples in large-scale Bayesian modeling. 
However, existing SG-MCMC schemes are not tailored to any specific probabilistic model, even a simple modification of the underlying dynamical system requires significant physical intuition. This paper presents the first meta-learning algorithm that allows automated design for the underlying continuous dynamics of an SG-MCMC sampler. The learned sampler generalizes Hamiltonian dynamics with state-dependent drift and diffusion, enabling fast traversal and efficient exploration of energy landscapes. Experiments validate the proposed approach on Bayesian fully connected neural network, Bayesian convolutional neural network and Bayesian recurrent neural network tasks, showing that the learned sampler outperforms generic, hand-designed SG-MCMC algorithms, and generalizes to different datasets and larger architectures.",/pdf/6f93aee3c40e2984e37f8f6266759f9a29a3ff0e.pdf,ICLR,2019,This paper proposes a method to automate the design of stochastic gradient MCMC proposal using meta learning approach. +SJe8DsR9tm,rkxoFADcY7,1538090000000.0,1545360000000.0,266,Dynamic Early Terminating of Multiply Accumulate Operations for Saving Computation Cost in Convolutional Neural Networks,"[""wwball34@gmail.com"", ""ycchen.phi@gmail.com"", ""jaubau999@gmail.com"", ""scchang@cs.nthu.edu.tw""]","[""Yu-Yi Su"", ""Yung-Chih Chen"", ""Xiang-Xiu Wu"", ""Shih-Chieh Chang""]","[""Convolutional neural network"", ""Early terminating"", ""Dynamic model optimization""]","Deep learning has been attracting enormous attention from academia as well as industry due to its great success in many artificial intelligence applications. As more applications are developed, the need for implementing a complex neural network model on an energy-limited edge device becomes more critical. To this end, this paper proposes a new optimization method to reduce the computation efforts of convolutional neural networks. 
The method takes advantage of the fact that some convolutional operations are actually wasteful since their outputs are pruned by the following activation or pooling layers. Basically, a convolutional filter conducts a series of multiply-accumulate (MAC) operations. We propose to set a checkpoint in the MAC process to determine whether a filter could terminate early based on the intermediate result. Furthermore, a fine-tuning process is conducted to recover the accuracy drop due to the applied checkpoints. The experimental results show that the proposed method can save approximately 50% of MAC operations with less than 1% accuracy drop for the CIFAR-10 example model and Network in Network on the CIFAR-10 and CIFAR-100 datasets. Additionally, compared with the state-of-the-art method, the proposed method is more effective on the CIFAR-10 dataset and is competitive on the CIFAR-100 dataset.",/pdf/95800c1b0e945927bac449345917e54d9a864023.pdf,ICLR,2019, +Aj4_e50nB8,FAi6xnogK2R,1601310000000.0,1614990000000.0,1577,Contextual Knowledge Distillation for Transformer Compression,"[""~Geondo_Park1"", ""~Gyeongman_Kim1"", ""~Eunho_Yang1""]","[""Geondo Park"", ""Gyeongman Kim"", ""Eunho Yang""]","[""Knowledge Distillation"", ""Transformer Compression"", ""BERT""]","A computationally expensive and memory intensive neural network lies behind the recent success of language representation learning. Knowledge distillation, a major technique for deploying such a vast language model in resource-scarce environments, transfers the knowledge on individual word representations learned without restrictions. In this paper, inspired by the recent observations that language representations are relatively positioned and have more semantic knowledge as a whole, we present a new knowledge distillation strategy for language representation learning that transfers the contextual knowledge via two types of relationships across representations: Word Relation and Layer Transforming Relation. 
We validate the effectiveness of our method on challenging benchmarks of language understanding tasks. The code will be released.",/pdf/51ae662b48d589de07d367eea4fc6fab1594cc12.pdf,ICLR,2021, +B1xFxh0cKX,SklPRGh5tm,1538090000000.0,1545360000000.0,1094,Guided Evolutionary Strategies: Escaping the curse of dimensionality in random search,"[""nirum@google.com"", ""lmetz@google.com"", ""gjt@google.com"", ""damichoi@google.com"", ""jaschasd@google.com""]","[""Niru Maheswaranathan"", ""Luke Metz"", ""George Tucker"", ""Dami Choi"", ""Jascha Sohl-Dickstein""]","[""evolutionary strategies"", ""optimization"", ""gradient estimators"", ""biased gradients""]","Many applications in machine learning require optimizing a function whose true gradient is unknown, but where surrogate gradient information (directions that may be correlated with, but not necessarily identical to, the true gradient) is available instead. This arises when an approximate gradient is easier to compute than the full gradient (e.g. in meta-learning or unrolled optimization), or when a true gradient is intractable and is replaced with a surrogate (e.g. in certain reinforcement learning applications or training networks with discrete variables). We propose Guided Evolutionary Strategies, a method for optimally using surrogate gradient directions along with random search. We define a search distribution for evolutionary strategies that is elongated along a subspace spanned by the surrogate gradients. This allows us to estimate a descent direction which can then be passed to a first-order optimizer. We analytically and numerically characterize the tradeoffs that result from tuning how strongly the search distribution is stretched along the guiding subspace, and use this to derive a setting of the hyperparameters that works well across problems. 
Finally, we apply our method to example problems including truncated unrolled optimization and training neural networks with discrete variables, demonstrating improvement over both standard evolutionary strategies and first-order methods (that directly follow the surrogate gradient). We provide a demo of Guided ES at: redacted URL",/pdf/0b14c36140480f115b0847908574fc08e2bfb320.pdf,ICLR,2019,"We propose an optimization method for when only biased gradients are available--we define a new gradient estimator for this scenario, derive the bias and variance of this estimator, and apply it to example problems." +sSjqmfsk95O,dk9HqzSG7H,1601310000000.0,1616080000000.0,1061,Large Scale Image Completion via Co-Modulated Generative Adversarial Networks,"[""~Shengyu_Zhao1"", ""~Jonathan_Cui1"", ""simon1727@qq.com"", ""dongyue8@gmail.com"", ""liangx@rdfz.cn"", ""echang@microsoft.com"", ""~Yan_Xu2""]","[""Shengyu Zhao"", ""Jonathan Cui"", ""Yilun Sheng"", ""Yue Dong"", ""Xiao Liang"", ""Eric I-Chao Chang"", ""Yan Xu""]","[""image completion"", ""generative adversarial networks"", ""co-modulation""]","Numerous task-specific variants of conditional generative adversarial networks have been developed for image completion. Yet, a serious limitation remains that all existing algorithms tend to fail when handling large-scale missing regions. To overcome this challenge, we propose a generic new approach that bridges the gap between image-conditional and recent modulated unconditional generative architectures via co-modulation of both conditional and stochastic style representations. Also, due to the lack of good quantitative metrics for image completion, we propose the new Paired/Unpaired Inception Discriminative Score (P-IDS/U-IDS), which robustly measures the perceptual fidelity of inpainted images compared to real images via linear separability in a feature space. 
Experiments demonstrate superior performance in terms of both quality and diversity over state-of-the-art methods in free-form image completion and easy generalization to image-to-image translation. Code is available at https://github.com/zsyzzsoft/co-mod-gan.",/pdf/9a3cfa3a1710ee23378772a3be3070ef32a29e17.pdf,ICLR,2021,Bridging the gap between image-conditional and unconditional GAN architectures via co-modulation +OQ08SN70M1V,jkKRjyPmcVG,1601310000000.0,1616730000000.0,260,Better Fine-Tuning by Reducing Representational Collapse,"[""~Armen_Aghajanyan1"", ""akshats@fb.com"", ""anchit@fb.com"", ""~Naman_Goyal1"", ""~Luke_Zettlemoyer1"", ""sonalgupta@fb.com""]","[""Armen Aghajanyan"", ""Akshat Shrivastava"", ""Anchit Gupta"", ""Naman Goyal"", ""Luke Zettlemoyer"", ""Sonal Gupta""]","[""finetuning"", ""nlp"", ""representational learning"", ""glue""]","Although widely adopted, existing approaches for fine-tuning pre-trained language models have been shown to be unstable across hyper-parameter settings, motivating recent work on trust region methods. In this paper, we present a simplified and efficient method rooted in trust region theory that replaces previously used adversarial objectives with parametric noise (sampling from either a normal or uniform distribution), thereby discouraging representation change during fine-tuning when possible without hurting performance. We also introduce a new analysis to motivate the use of trust region methods more generally, by studying representational collapse; the degradation of generalizable representations from pre-trained models as they are fine-tuned for a specific end task. Extensive experiments show that our fine-tuning method matches or exceeds the performance of previous trust region methods on a range of understanding and generation tasks (including DailyMail/CNN, Gigaword, Reddit TIFU, and the GLUE benchmark), while also being much faster. 
We also show that it is less prone to representation collapse; the pre-trained models maintain more generalizable representations every time they are fine-tuned.",/pdf/2e098788f4b0c69470d410cbd519d3e0346848d3.pdf,ICLR,2021,"We present a lightweight augmentation to standard fine-tuning which outperforms previous methods across the board (i.e. SOTA on 3 summarization tasks, XNLI, RoBERTa on GLUE) while being computationally cheaper than other fine-tuning approaches." +guetrIHLFGI,c0GjEzTkZKb,1601310000000.0,1616010000000.0,2172,The Deep Bootstrap Framework: Good Online Learners are Good Offline Generalizers,"[""~Preetum_Nakkiran1"", ""~Behnam_Neyshabur1"", ""~Hanie_Sedghi1""]","[""Preetum Nakkiran"", ""Behnam Neyshabur"", ""Hanie Sedghi""]","[""generalization"", ""optimization"", ""online learning"", ""understanding deep learning"", ""empirical investigation""]","We propose a new framework for reasoning about generalization in deep learning. +The core idea is to couple the Real World, where optimizers take stochastic gradient steps on the empirical loss, to an Ideal World, where optimizers take steps on the population loss. This leads to an alternate decomposition of test error into: (1) the Ideal World test error plus (2) the gap between the two worlds. If the gap (2) is universally small, this reduces the problem of generalization in offline learning to the problem of optimization in online learning. +We then give empirical evidence that this gap between worlds can be small in realistic deep learning settings, in particular supervised image classification. For example, CNNs generalize better than MLPs on image distributions in the Real World, but this is ""because"" they optimize faster on the population loss in the Ideal World. This suggests our framework is a useful tool for understanding generalization in deep learning, and lays the foundation for future research in this direction. 
",/pdf/04581f1e8f49dff9317ee5405d413f8b8a245a82.pdf,ICLR,2021,We show empirical evidence that the performance gap between offline generalization and online optimization is small and propose an alternative framework for studying generalization. +1-Mh-cWROZ,MlWG4TDYqRV,1601310000000.0,1614990000000.0,1872,Fold2Seq: A Joint Sequence(1D)-Fold(3D) Embedding-based Generative Model for Protein Design,"[""~Yue_Cao4"", ""~Payel_Das1"", ""~Pin-Yu_Chen1"", ""~Vijil_Chenthamarakshan1"", ""~Igor_Melnyk1"", ""~Yang_Shen4""]","[""Yue Cao"", ""Payel Das"", ""Pin-Yu Chen"", ""Vijil Chenthamarakshan"", ""Igor Melnyk"", ""Yang Shen""]","[""Joint Embedding Learning"", ""Generative Model"", ""Transformer Autoencoder"", ""Inverse Protein Folding"", ""Sequence Design""]","Designing novel protein sequences consistent with a desired 3D structure or fold, often referred to as the inverse protein folding problem, is a central, but non-trivial, task in protein engineering. It has a wide range of applications in energy, biomedicine, and materials science. However, challenges exist due to the complex sequence-fold relationship and difficulties associated with modeling 3D folds. To overcome these challenges, we propose Fold2Seq, a novel transformer-based generative framework for designing protein sequences conditioned on a specific fold. Our model learns a fold embedding from the density of the secondary structural elements in 3D voxels, and then models the complex sequence-structure relationship by learning a joint sequence-fold embedding. Experiments on high-resolution, complete, and single-structure test set demonstrate improved performance of Fold2Seq in terms of speed and reliability for sequence design, compared to existing baselines including the state-of-the-art RosettaDesign and other neural net-based approaches. 
The unique advantages of fold-based Fold2Seq becomes more evident on diverse real-world test sets comprised of low-resolution, incomplete, or ensemble structures, in comparison to a structure-based model. ",/pdf/fc0b5f49d460d4b4eb9200226c5b3754e6e2e529.pdf,ICLR,2021,A novel transformer-based generative model for learning joint sequence-fold embedding and designing protein sequences shows superior performance and efficiency against existing methods. +m0ECRXO6QlP,PuNsVI2OPOx,1601310000000.0,1614990000000.0,2064,Supervision Accelerates Pre-training in Contrastive Semi-Supervised Learning of Visual Representations,"[""~Mido_Assran1"", ""~Nicolas_Ballas1"", ""~Lluis_Castrejon1"", ""~Michael_Rabbat1""]","[""Mido Assran"", ""Nicolas Ballas"", ""Lluis Castrejon"", ""Michael Rabbat""]","[""semi-supervised learning"", ""contrastive learning"", ""self-supervised learning"", ""deep learning"", ""representation learning"", ""metric learning"", ""visual representations""]","We investigate a strategy for improving the efficiency of contrastive learning of visual representations by leveraging a small amount of supervised information during pre-training. We propose a semi-supervised loss, SuNCEt, based on noise-contrastive estimation and neighbourhood component analysis, that aims to distinguish examples of different classes in addition to the self-supervised instance-wise pretext tasks. On ImageNet, we find that SuNCEt can be used to match the semi-supervised learning accuracy of previous contrastive approaches while using less than half the amount of pre-training and compute. Our main insight is that leveraging even a small amount of labeled data during pre-training, and not only during fine-tuning, provides an important signal that can significantly accelerate contrastive learning of visual representations.",/pdf/c572a34bebc88aa1c0fdd6c0f4e6d2cf1d0b2511.pdf,ICLR,2021,A few labeled samples can accelerate contrastive pre-training. 
+HkNKFiGex,,1477780000000.0,1487780000000.0,11,Neural Photo Editing with Introspective Adversarial Networks,"[""ajb5@hw.ac.uk"", ""t.lim@hw.ac.uk"", ""j.m.ritchie@hw.ac.uk"", ""Nick.Weston@renishaw.com""]","[""Andrew Brock"", ""Theodore Lim"", ""J.M. Ritchie"", ""Nick Weston""]","[""Computer vision"", ""Unsupervised Learning"", ""Applications""]","The increasingly photorealistic sample quality of generative image models suggests their feasibility in applications beyond image generation. We present the Neural Photo Editor, an interface that leverages the power of generative neural networks to make large, semantically coherent changes to existing images. To tackle the challenge of achieving accurate reconstructions without loss of feature quality, we introduce the Introspective Adversarial Network, +a novel hybridization of the VAE and GAN. Our model efficiently captures long-range dependencies through use of a computational block based on weight-shared dilated convolutions, and improves generalization performance with Orthogonal Regularization, a novel weight regularization method. We validate our contributions on CelebA, SVHN, and CIFAR-100, and produce samples and reconstructions with high visual fidelity.",/pdf/23875ca42bb892377b060f746dbfc12423c074e2.pdf,ICLR,2017,An interface for editing photos using generative image models. +2234Pp-9ikZ,Jib4JPtwXvT,1601310000000.0,1614990000000.0,3272,"Don't be picky, all students in the right family can learn from good teachers","[""~Roy_Henha_Eyono1"", ""~Fabio_Maria_Carlucci2"", ""~Pedro_M_Esperan\u00e7a1"", ""~Binxin_Ru1"", ""~Philip_Torr1""]","[""Roy Henha Eyono"", ""Fabio Maria Carlucci"", ""Pedro M Esperan\u00e7a"", ""Binxin Ru"", ""Philip Torr""]","[""knowledge distillation"", ""neural architecture search"", ""nas"", ""automl"", ""knowledge trasfer"", ""model compression""]","State-of-the-art results in deep learning have been improving steadily, in good part due to the use of larger models. 
However, widespread use is constrained by device hardware limitations, resulting in a substantial performance gap between state-of-the-art models and those that can be effectively deployed on small devices. + +While Knowledge Distillation (KD) theoretically enables small student models to emulate larger teacher models, in practice selecting a good student architecture requires considerable human expertise. Neural Architecture Search (NAS) appears as a natural solution to this problem but most approaches can be inefficient, as most of the computation is spent comparing architectures sampled from the same distribution, with negligible differences in performance. + +In this paper, we propose to instead search for a family of student architectures sharing the property of being good at learning from a given teacher. +Our approach AutoKD, powered by Bayesian Optimization, explores a flexible graph-based search space, enabling us to automatically learn the optimal student architecture distribution and KD parameters, while being 20x more sample efficient compared to existing state-of-the-art. We evaluate our method on 3 datasets; on large images specifically, we reach the teacher performance while using 3x less memory and 10x less parameters. Finally, while AutoKD uses the traditional KD loss, it outperforms more advanced KD variants using hand-designed students.",/pdf/4cb6cf60386e8d067634a38399e7e4d71211830e.pdf,ICLR,2021,An efficient method for emulating large models by searching for the optimal family of student architectures. 
+P42rXLGZQ07,kf2GsLcw_k,1601310000000.0,1614990000000.0,302,Direct Evolutionary Optimization of Variational Autoencoders with Binary Latents,"[""~Enrico_Guiraud1"", ""~Jakob_Drefs1"", ""~Jorg_Lucke1""]","[""Enrico Guiraud"", ""Jakob Drefs"", ""Jorg Lucke""]","[""variational optimization"", ""variational autoencoders"", ""denoising"", ""evolutionary algorithms""]","Discrete latent variables are considered important to model the generation process of real world data, which has motivated research on Variational Autoencoders (VAEs) with discrete latents. However, standard VAE training is not possible in this case, which has motivated different strategies to manipulate discrete distributions in order to train discrete VAEs similarly to conventional ones. +Here we ask if it is also possible to keep the discrete nature of the latents fully intact by applying a direct discrete optimization for the encoding model. The studied approach is consequently strongly diverting from standard VAE training by altogether sidestepping absolute standard VAE mechanisms such as sampling approximation, reparameterization trick and amortization. + +Discrete optimization is realized in a variational setting using truncated posteriors in conjunction with evolutionary algorithms (using a recently suggested approach). For VAEs with binary latents, we first show how such a discrete variational method (A)~ties into gradient ascent for network weights and (B)~uses the decoder network to select latent states for training. + +More conventional amortized training is, as may be expected, more efficient than direct discrete optimization, and applicable to large neural networks. +However, we here find direct optimization to be efficiently scalable to hundreds of latent variables using smaller networks. +More importantly, we find the effectiveness of direct optimization to be highly competitive in 'zero-shot' learning (where high effectiveness for small networks is required). 
+In contrast to large supervised neural networks, the here investigated VAEs can denoise a single image without previous training on clean data and/or training on large image datasets. + +More generally, the studied approach shows that training of VAEs is indeed possible without sampling-based approximation and reparameterization, which may be interesting for the analysis of VAE training in general. In the regime of few data, direct optimization, furthermore, makes VAEs competitive for denoising where they have previously been outperformed by non-generative approaches.",/pdf/3c127dd43d94575a3919fdf86aa17ebbec922939.pdf,ICLR,2021,We investigate a novel approach to optimize Variational Autoencoders with binary latents which does not alter the discrete latent distribution. +jQSBcVURlpW,PITiYzGVggHz,1601310000000.0,1614990000000.0,950,Learning Algebraic Representation for Abstract Spatial-Temporal Reasoning,"[""~Chi_Zhang12"", ""~Sirui_Xie1"", ""~Baoxiong_Jia1"", ""~Yixin_Zhu1"", ""~Ying_Nian_Wu1"", ""~Song-Chun_Zhu1""]","[""Chi Zhang"", ""Sirui Xie"", ""Baoxiong Jia"", ""Yixin Zhu"", ""Ying Nian Wu"", ""Song-Chun Zhu""]",[],"Is intelligence realized by connectionist or classicist? While connectionist approaches have achieved superhuman performance, there has been growing evidence that such task-specific superiority is particularly fragile in systematic generalization. This observation lies in the central debate (Fodor et al., 1988; Fodor &McLaughlin, 1990) between connectionist and classicist, wherein the latter continually advocates an algebraic treatment in cognitive architectures. In this work, we follow the classicist's call and propose a hybrid approach to improve systematic generalization in reasoning. Specifically, we showcase a prototype with algebraic representations for the abstract spatial-temporal reasoning task of Raven’s Progressive Matrices (RPM) and present the ALgebra-Aware Neuro-Semi-Symbolic (ALANS$^2$) learner. 
The ALANS$^2$ learner is motivated by abstract algebra and the representation theory. It consists of a neural visual perception frontend and an algebraic abstract reasoning backend: the frontend summarizes the visual information from object-based representations, while the backend transforms it into an algebraic structure and induces the hidden operator on-the-fly. The induced operator is later executed to predict the answer's representation, and the choice most similar to the prediction is selected as the solution. Extensive experiments show that by incorporating an algebraic treatment, the ALANS$^2$ learner outperforms various pure connectionist models in domains requiring systematic generalization. We further show that the algebraic representation learned can be decoded by isomorphism and used to generate an answer.",/pdf/41ae92cb56eedb90e61a7e95df7ebc94984b6696.pdf,ICLR,2021, +CHLhSw9pSw8,cehS_2Fll4Y,1601310000000.0,1611610000000.0,2148,Single-Photon Image Classification,"[""~Thomas_Fischbacher1"", ""~Luciano_Sbaiz1""]","[""Thomas Fischbacher"", ""Luciano Sbaiz""]","[""quantum mechanics"", ""image classification"", ""quantum machine learning"", ""theoretical limits""]","Quantum Computing based Machine Learning mainly focuses on quantum computing hardware that is experimentally challenging to realize due to requiring quantum gates that operate at very low temperature. We demonstrate the existence of a ""quantum computing toy model"" that illustrates key aspects of quantum information processing while being experimentally accessible with room temperature optics. 
Pondering the question of the theoretical classification accuracy performance limit for MNIST (respectively ""Fashion-MNIST"") classifiers, subject to the constraint that a decision has to be made after detection of the very first photon that passed through an image-filter, we show that a machine learning system that is permitted to use quantum interference on the photon's state can substantially outperform any machine learning system that can not. Specifically, we prove that a ""classical"" MNIST (respectively ""Fashion-MNIST"") classifier cannot achieve an accuracy of better than $21.28\%$ (respectively $18.28\%$ for ""Fashion-MNIST"") if it must make a decision after seeing a single photon falling on one of the $28\times 28$ image pixels of a detector array. We further demonstrate that a classifier that is permitted to employ quantum interference by optically transforming the photon state prior to detection can achieve a classification accuracy of at least $41.27\%$ for MNIST (respectively $36.14\%$ for ""Fashion-MNIST""). We show in detail how to train the corresponding quantum state transformation with TensorFlow and also explain how this example can serve as a teaching tool for the measurement process in quantum mechanics. +",/pdf/d7978e041131e45cff5b7c28261bcb70fb3b1655.pdf,ICLR,2021,Mathematical proof that the classical accuracy limit for single-photon image classification can be exceeded very substantially by employing a problem-tailored quantum transformation on the photon state. +QpU7n-6l0n,TENMjaHe-KT,1601310000000.0,1614990000000.0,1460,On the Consistency Loss for Leveraging Augmented Data to Learn Robust and Invariant Representations,"[""~Haohan_Wang1"", ""~Zeyi_Huang3"", ""~Xindi_Wu1"", ""~Eric_Xing1""]","[""Haohan Wang"", ""Zeyi Huang"", ""Xindi Wu"", ""Eric Xing""]","[""robustness"", ""invariance"", ""data augmentation"", ""consistency loss""]","Data augmentation is one of the most popular techniques for improving the robustness of neural networks. 
In addition to directly training the model with original samples and augmented samples, a torrent of methods regularizing the distance between embeddings/representations of the original samples and their augmented counterparts have been introduced. In this paper, we explore these various regularization choices, seeking to provide a general understanding of how we should regularize the embeddings. Our analysis suggests how the ideal choices of regularization correspond to various assumptions. With an invariance test, we show that regularization is important if the model is to be used in a broader context than the in-lab setting because non-regularized approaches are limited in learning the concept of invariance, despite equally high accuracy. Finally, we also show that the generic approach we identified (squared $\ell_2$ norm regularized augmentation) performs better than several recent methods, which are each specially designed for one task and significantly more complicated than ours, over three different tasks.",/pdf/eb0467a92cd3e0d20c27e36dac002730ed90d26f.pdf,ICLR,2021,"We show that consistency loss when using data augmentation is important to learn robust and invariant representations, and show that squared $\ell_2$ norm regularization is the best candidate" +uKZsVyFKbaj,I_Qtwf8jXz,1601310000000.0,1614990000000.0,2514,It's Hard for Neural Networks to Learn the Game of Life,"[""~Jacob_M._Springer1"", ""~Garrett_T._Kenyon1""]","[""Jacob M. Springer"", ""Garrett T. Kenyon""]","[""Deep Learning"", ""Game of Life""]","Efforts to improve the learning abilities of neural networks have focused mostly on the role of optimization methods rather than on weight initializations. Recent findings, however, suggest that neural networks rely on lucky random initial weights of subnetworks called ""lottery tickets"" that converge quickly to a solution. 
To investigate how weight initializations affect performance, we examine small convolutional networks that are trained to predict $n$ steps of the two-dimensional cellular automaton Conway’s Game of Life, the update rules of which can be implemented efficiently in a small CNN. We find that networks of this architecture trained on this task rarely converge. Rather, networks require substantially more parameters to consistently converge. Furthermore, we find that the initialization parameters that gradient descent converges to a solution are sensitive to small perturbations, such as a single sign change. Finally, we observe a critical value $d_0$ such that training minimal networks with examples in which cells are alive with probability $d_0$ dramatically increases the chance of convergence to a solution. Our results are consistent with the lottery ticket hypothesis.",/pdf/0c4aa344cbec43189d298eab156818b809122477.pdf,ICLR,2021,"We show that Conway's Game of Life can be represented by a simple neural network, yet find that traditional gradient descent methods do not often converge on a solution without significant overparameterization." +rkxdexBYPB,HJxKqTyKvS,1569440000000.0,1577170000000.0,2104,Group-Transformer: Towards A Lightweight Character-level Language Model,"[""sungrae.park@navercorp.com"", ""geewook@sys.i.kyoto-u.ac.jp"", ""junyeop.lee@navercorp.com"", ""junbum.cha@navercorp.com"", ""genesis.kim@navercorp.com"", ""hwalsuk.lee@navercorp.com""]","[""Sungrae Park"", ""Geewook Kim"", ""Junyeop Lee"", ""Junbum Cha"", ""Ji-Hoon Kim Hwalsuk Lee""]","[""Transformer"", ""Lightweight model"", ""Language Modeling"", ""Character-level language modeling""]","Character-level language modeling is an essential but challenging task in Natural Language Processing. +Prior works have focused on identifying long-term dependencies between characters and have built deeper and wider networks for better performance. 
However, their models require substantial computational resources, which hinders the usability of character-level language models in applications with limited resources. In this paper, we propose a lightweight model, called Group-Transformer, that reduces the resource requirements for a Transformer, a promising method for modeling sequence with long-term dependencies. Specifically, the proposed method partitions linear operations to reduce the number of parameters and computational cost. As a result, Group-Transformer only uses 18.2\% of parameters compared to the best performing LSTM-based model, while providing better performance on two benchmark tasks, enwik8 and text8. When compared to Transformers with a comparable number of parameters and time complexity, the proposed model shows better performance. The implementation code will be available.",/pdf/83c45b5a04f84cebce80fae24860ff0397365df8.pdf,ICLR,2020,"This paper proposes a novel lightweight Transformer for character-level language modeling, utilizing group-wise operations." +B1ZZTfZAW,r1DCnMZCb,1509140000000.0,1518730000000.0,991,Real-valued (Medical) Time Series Generation with Recurrent Conditional GANs,"[""stephanie.hyland@inf.ethz.ch"", ""cr_est@ethz.ch"", ""raetsch@inf.ethz.ch""]","[""Stephanie Hyland"", ""Crist\u00f3bal Esteban"", ""Gunnar R\u00e4tsch""]","[""GAN"", ""medical"", ""records"", ""time"", ""series"", ""generation"", ""privacy""]","Generative Adversarial Networks (GANs) have shown remarkable success as a framework for training models to produce realistic-looking data. In this work, we propose a Recurrent GAN (RGAN) and Recurrent Conditional GAN (RCGAN) to produce realistic real-valued multi-dimensional time series, with an emphasis on their application to medical data. RGANs make use of recurrent neural networks (RNNs) in the generator and the discriminator. In the case of RCGANs, both of these RNNs are conditioned on auxiliary information. 
We demonstrate our models in a set of toy datasets, where we show visually and quantitatively (using sample likelihood and maximum mean discrepancy) that they can successfully generate realistic time-series. We also describe novel evaluation methods for GANs, where we generate a synthetic labelled training dataset, and evaluate on a real test set the performance of a model trained on the synthetic data, and vice-versa. We illustrate with these metrics that RCGANs can generate time-series data useful for supervised training, with only minor degradation in performance on real test data. This is demonstrated on digit classification from ‘serialised’ MNIST and by training an early warning system on a medical dataset of 17,000 patients from an intensive care unit. We further discuss and analyse the privacy concerns that may arise when using RCGANs to generate realistic synthetic medical time series data, and demonstrate results from differentially private training of the RCGAN.",/pdf/147131acb559b07c0a3d7595451bc56c793993c8.pdf,ICLR,2018,"Conditional recurrent GANs for real-valued medical sequences generation, showing novel evaluation approaches and an empirical privacy analysis." +B1NOXfWR-,BkQuQzWRZ,1509140000000.0,1518730000000.0,808,Neural Task Graph Execution,"[""srsohn@umich.edu"", ""junhyuk@umich.edu"", ""honglak@eecs.umich.edu""]","[""Sungryull Sohn"", ""Junhyuk Oh"", ""Honglak Lee""]","[""deep reinforcement learning"", ""task execution"", ""instruction execution""]","In order to develop a scalable multi-task reinforcement learning (RL) agent that is able to execute many complex tasks, this paper introduces a new RL problem where the agent is required to execute a given task graph which describes a set of subtasks and dependencies among them. 
Unlike existing approaches which explicitly describe what the agent should do, our problem only describes properties of subtasks and relationships between them, which requires the agent to perform a complex reasoning to find the optimal subtask to execute. To solve this problem, we propose a neural task graph solver (NTS) which encodes the task graph using a recursive neural network. To overcome the difficulty of training, we propose a novel non-parametric gradient-based policy that performs back-propagation over a differentiable form of the task graph to compute the influence of each subtask on the other subtasks. Our NTS is pre-trained to approximate the proposed gradient-based policy and fine-tuned through actor-critic method. The experimental results on a 2D visual domain show that our method to pre-train from the gradient-based policy significantly improves the performance of NTS. We also demonstrate that our agent can perform a complex reasoning to find the optimal way of executing the task graph and generalize well to unseen task graphs. In addition, we compare our agent with a Monte-Carlo Tree Search (MCTS) method showing that our method is much more efficient than MCTS, and the performance of our agent can be further improved by combining with MCTS. The demo video is available at https://youtu.be/e_ZXVS5VutM.",/pdf/6d103cc202b5c22f312e6754270e4d892920c8ef.pdf,ICLR,2018, +rkMW1hRqKX,rke4OqhcK7,1538090000000.0,1546450000000.0,963,Optimal Completion Distillation for Sequence Learning,"[""sasabour@google.com"", ""williamchan@google.com"", ""mnorouzi@google.com""]","[""Sara Sabour"", ""William Chan"", ""Mohammad Norouzi""]","[""Sequence Learning"", ""Edit Distance"", ""Speech Recognition"", ""Deep Reinforcement Learning""]","We present Optimal Completion Distillation (OCD), a training procedure for optimizing sequence to sequence models based on edit distance. 
OCD is efficient, has no hyper-parameters of its own, and does not require pre-training or joint optimization with conditional log-likelihood. Given a partial sequence generated by the model, we first identify the set of optimal suffixes that minimize the total edit distance, using an efficient dynamic programming algorithm. Then, for each position of the generated sequence, we use a target distribution which puts equal probability on the first token of all the optimal suffixes. OCD achieves the state-of-the-art performance on end-to-end speech recognition, on both Wall Street Journal and Librispeech datasets, achieving $9.3\%$ WER and $4.5\%$ WER, respectively.",/pdf/de7507e029f8d5fced948dfa043fa8a39a5a4be6.pdf,ICLR,2019,Optimal Completion Distillation (OCD) is a training procedure for optimizing sequence to sequence models based on edit distance which achieves state-of-the-art on end-to-end Speech Recognition tasks. +rJehVyrKwH,rJgY2RhODB,1569440000000.0,1583910000000.0,1668,And the Bit Goes Down: Revisiting the Quantization of Neural Networks,"[""pstock@fb.com"", ""ajoulin@fb.com"", ""remi.gribonval@inria.fr"", ""benjamingraham@fb.com"", ""rvj@fb.com""]","[""Pierre Stock"", ""Armand Joulin"", ""R\u00e9mi Gribonval"", ""Benjamin Graham"", ""Herv\u00e9 J\u00e9gou""]","[""compression"", ""quantization""]","In this paper, we address the problem of reducing the memory footprint of convolutional network architectures. We introduce a vector quantization method that aims at preserving the quality of the reconstruction of the network outputs rather than its weights. The principle of our approach is that it minimizes the loss reconstruction error for in-domain inputs. Our method only requires a set of unlabelled data at quantization time and allows for efficient inference on CPU by using byte-aligned codebooks to store the compressed weights. 
We validate our approach by quantizing a high performing ResNet-50 model to a memory size of 5MB (20x compression factor) while preserving a top-1 accuracy of 76.1% on ImageNet object classification and by compressing a Mask R-CNN with a 26x factor.",/pdf/b2bdf3ea140d4beb3be2a63f813a0e04da7088d0.pdf,ICLR,2020,Using a structured quantization technique aiming at better in-domain reconstruction to compress convolutional neural networks +S1_pAu9xl,,1478300000000.0,1487840000000.0,467,Trained Ternary Quantization,"[""zhucz13@mails.tsinghua.edu.cn"", ""songhan@stanford.edu"", ""huizi@stanford.edu"", ""dally@stanford.edu""]","[""Chenzhuo Zhu"", ""Song Han"", ""Huizi Mao"", ""William J. Dally""]","[""Deep learning""]","Deep neural networks are widely used in machine learning applications. However, the deployment of large neural networks models can be difficult to deploy on mobile devices with limited power budgets. To solve this problem, we propose Trained Ternary Quantization (TTQ), a method that can reduce the precision of weights in neural networks to ternary values. This method has very little accuracy degradation and can even improve the accuracy of some models (32, 44, 56-layer ResNet) on CIFAR-10 and AlexNet on ImageNet. And our AlexNet model is trained from scratch, which means it’s as easy as to train normal full precision model. We highlight our trained quantization method that can learn both ternary values and ternary assignment. During inference, only ternary values (2-bit weights) and scaling factors are needed, therefore our models are nearly 16× smaller than full- precision models. Our ternary models can also be viewed as sparse binary weight networks, which can potentially be accelerated with custom circuit. Experiments on CIFAR-10 show that the ternary models obtained by trained quantization method outperform full-precision models of ResNet-32,44,56 by 0.04%, 0.16%, 0.36%, respectively. 
On ImageNet, our model outperforms full-precision AlexNet model by 0.3% of Top-1 accuracy and outperforms previous ternary models by 3%.",/pdf/8534394d7c823fa401a91f5d642e0fd63b51263b.pdf,ICLR,2017,Ternary Neural Network with accuracy close to or even higher than the full-precision one +rywUcQogx,,1478340000000.0,1478370000000.0,550,Differentiable Canonical Correlation Analysis,"[""matthias.dorfer@jku.at"", ""jan.schlueter@ofai.at"", ""gerhard.widmer@jku.at""]","[""Matthias Dorfer"", ""Jan Schl\u00fcter"", ""Gerhard Widmer""]","[""Multi-modal learning""]","Canonical Correlation Analysis (CCA) computes maximally-correlated +linear projections of two modalities. We propose Differentiable CCA, a +formulation of CCA that can be cast as a layer within a multi-view +neural network. Unlike Deep CCA, an earlier extension of CCA to +nonlinear projections, our formulation enables gradient flow through the +computation of the CCA projection matrices, and free choice of the final +optimization target. We show the effectiveness of this approach in +cross-modality retrieval experiments on two public image-to-text +datasets, surpassing both Deep CCA and a multi-view network with +freely-learned projections. We assume that Differentiable CCA could be a +useful building block for many multi-modality tasks.",/pdf/6820cff043fabaac08ad991cb9057cb0721e58c7.pdf,ICLR,2017,We propose Differentiable CCA a formulation of CCA that enables gradient flow through the computation of the CCA projection matrices. +HylVB3AqYm,HygMGr1qFm,1538090000000.0,1550850000000.0,1534,ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware,"[""hancai@mit.edu"", ""ligeng@mit.edu"", ""songhan@mit.edu""]","[""Han Cai"", ""Ligeng Zhu"", ""Song Han""]","[""Neural Architecture Search"", ""Efficient Neural Networks""]","Neural architecture search (NAS) has a great impact by automatically designing effective neural network architectures. 
However, the prohibitive computational demand of conventional NAS algorithms (e.g. $10^4$ GPU hours) makes it difficult to directly search the architectures on large-scale tasks (e.g. ImageNet). Differentiable NAS can reduce the cost of GPU hours via a continuous representation of network architecture but suffers from the high GPU memory consumption issue (grow linearly w.r.t. candidate set size). As a result, they need to utilize proxy tasks, such as training on a smaller dataset, or learning with only a few blocks, or training just for a few epochs. These architectures optimized on proxy tasks are not guaranteed to be optimal on the target task. In this paper, we present ProxylessNAS that can directly learn the architectures for large-scale target tasks and target hardware platforms. We address the high memory consumption issue of differentiable NAS and reduce the computational cost (GPU hours and GPU memory) to the same level of regular training while still allowing a large candidate set. Experiments on CIFAR-10 and ImageNet demonstrate the effectiveness of directness and specialization. On CIFAR-10, our model achieves 2.08% test error with only 5.7M parameters, better than the previous state-of-the-art architecture AmoebaNet-B, while using 6× fewer parameters. On ImageNet, our model achieves 3.1% better top-1 accuracy than MobileNetV2, while being 1.2× faster with measured GPU latency. We also apply ProxylessNAS to specialize neural architectures for hardware with direct hardware metrics (e.g. latency) and provide insights for efficient CNN architecture design.",/pdf/8d4f27771fa0d882feaf780ae2b1d1dfb5b9b66f.pdf,ICLR,2019,Proxy-less neural architecture search for directly learning architectures on large-scale target task (ImageNet) while reducing the cost to the same level of normal training. 
+HJGODLqgx,,1478290000000.0,1488590000000.0,300,Recurrent Hidden Semi-Markov Model,"[""hanjundai@gatech.edu"", ""bodai@gatech.edu"", ""ymzhang@nlpr.ia.ac.cn"", ""sli370@gatech.edu"", ""lsong@cc.gatech.edu""]","[""Hanjun Dai"", ""Bo Dai"", ""Yan-Ming Zhang"", ""Shuang Li"", ""Le Song""]","[""Deep learning"", ""Unsupervised Learning"", ""Structured prediction""]","Segmentation and labeling of high dimensional time series data has wide applications in behavior understanding and medical diagnosis. Due to the difficulty in obtaining the label information for high dimensional data, realizing this objective in an unsupervised way is highly desirable. Hidden Semi-Markov Model (HSMM) is a classical tool for this problem. However, existing HSMM and its variants has simple conditional assumptions of observations, thus the ability to capture the nonlinear and complex dynamics within segments is limited. To tackle this limitation, we propose to incorporate the Recurrent Neural Network (RNN) to model the generative process in HSMM, resulting the Recurrent HSMM (R-HSMM). To accelerate the inference while preserving accuracy, we designed a structure encoding function to mimic the exact inference. By generalizing the penalty method to distribution space, we are able to train the model and the encoding function simultaneously. Empirical results show that the proposed R-HSMM achieves the state-of-the-art performances on both synthetic and real-world datasets. ",/pdf/3a5404264bf2fcc0ee5fab95ac85b33b25e7bd5c.pdf,ICLR,2017,We propose to incorporate the RNN to model the generative process in Hidden Semi-Markov Model for unsupervised segmentation and labeling. 
+awnQ2qTLSwn,Hnlae1HtZF4,1601310000000.0,1614990000000.0,3486,Learning to Share in Multi-Agent Reinforcement Learning,"[""~Yuxuan_Yi1"", ""~Ge_Li2"", ""~Yaowei_Wang1"", ""~Zongqing_Lu2""]","[""Yuxuan Yi"", ""Ge Li"", ""Yaowei Wang"", ""Zongqing Lu""]",[],"In this paper, we study the problem of networked multi-agent reinforcement learning (MARL), where a number of agents are deployed as a partially connected network. Networked MARL requires all agents make decision in a decentralized manner to optimize a global objective with restricted communication between neighbors over the network. We propose a hierarchically decentralized MARL method, \textit{LToS}, which enables agents to learn to dynamically share reward with neighbors so as to encourage agents to cooperate on the global objective. For each agent, the high-level policy learns how to share reward with neighbors to decompose the global objective, while the low-level policy learns to optimize local objective induced by the high-level policies in the neighborhood. The two policies form a bi-level optimization and learn alternately. We empirically demonstrate that LToS outperforms existing methods in both social dilemma and two networked MARL scenarios.",/pdf/4d5792165c0bb7bd5a710c4d1ca6dc24060b16aa.pdf,ICLR,2021, +rke-f6NKvS,BJgQVDtUvB,1569440000000.0,1583910000000.0,401,Learning Self-Correctable Policies and Value Functions from Demonstrations with Negative Sampling,"[""yupingl@cs.princeton.edu"", ""huazhe_xu@eecs.berkeley.edu"", ""tengyuma@stanford.edu""]","[""Yuping Luo"", ""Huazhe Xu"", ""Tengyu Ma""]","[""imitation learning"", ""model-based imitation learning"", ""model-based RL"", ""behavior cloning"", ""covariate shift""]","Imitation learning, followed by reinforcement learning algorithms, is a promising paradigm to solve complex control tasks sample-efficiently. However, learning from demonstrations often suffers from the covariate shift problem, which results +in cascading errors of the learned policy. 
We introduce a notion of conservatively extrapolated value functions, which provably lead to policies with self-correction. We design an algorithm Value Iteration with Negative Sampling (VINS) that practically learns such value functions with conservative extrapolation. We show that VINS can correct mistakes of the behavioral cloning policy on simulated robotics benchmark tasks. We also propose the algorithm of using VINS to initialize a reinforcement learning algorithm, which is shown to outperform prior works in sample efficiency.",/pdf/ffd4070fdb6f1ce2691d35def62153b18778730d.pdf,ICLR,2020,"We introduce a notion of conservatively-extrapolated value functions, which provably lead to policies that can self-correct to stay close to the demonstration states, and learn them with a novel negative sampling technique." +SJGyFiRqK7,S1ljaQu5Ym,1538090000000.0,1545360000000.0,408,Decoupling Gating from Linearity,"[""jonathan.fiat@gmail.com"", ""eran.malach@mail.huji.ac.il"", ""shais@cs.huji.ac.il""]","[""Yonathan Fiat"", ""Eran Malach"", ""Shai Shalev-Shwartz""]","[""Artificial Neural Networks"", ""Neural Networks"", ""ReLU"", ""GaLU"", ""Deep Learning""]","The gap between the empirical success of deep learning and the lack of strong theoretical guarantees calls for studying simpler models. By observing that a ReLU neuron is a product of a linear function with a gate (the latter determines whether the neuron is active or not), where both share a jointly trained weight vector, we propose to decouple the two. We introduce GaLU networks — networks in which each neuron is a product of a Linear Unit, defined by a weight vector which is being trained, with a Gate, defined by a different weight vector which is not being trained. 
Generally speaking, given a base model and a simpler version of it, the two parameters that determine the quality of the simpler version are whether its practical performance is close enough to the base model and whether it is easier to analyze it theoretically. We show that GaLU networks perform similarly to ReLU networks on standard datasets and we initiate a study of their theoretical properties, demonstrating that they are indeed easier to analyze. We believe that further research of GaLU networks may be fruitful for the development of a theory of deep learning.",/pdf/692ed15150f994737dc37adfb71c45ac8519e277.pdf,ICLR,2019,We propose Gated Linear Unit networks — a model that performs similarly to ReLU networks on real data while being much easier to analyze theoretically. +B1eY_pVYvB,HyxZQGnPvS,1569440000000.0,1583910000000.0,639,Efficient and Information-Preserving Future Frame Prediction and Beyond,"[""gnosis@cs.toronto.edu"", ""yichao@cs.toronto.edu"", ""sme@cs.toronto.edu"", ""fidler@cs.toronto.edu""]","[""Wei Yu"", ""Yichao Lu"", ""Steve Easterbrook"", ""Sanja Fidler""]","[""self-supervised learning"", ""generative pre-training"", ""video prediction"", ""reversible architecture""]","Applying resolution-preserving blocks is a common practice to maximize information preservation in video prediction, yet their high memory consumption greatly limits their application scenarios. We propose CrevNet, a Conditionally Reversible Network that uses reversible architectures to build a bijective two-way autoencoder and its complementary recurrent predictor. Our model enjoys the theoretically guaranteed property of no information loss during the feature extraction, much lower memory consumption and computational efficiency. The lightweight nature of our model enables us to incorporate 3D convolutions without concern of memory bottleneck, enhancing the model's ability to capture both short-term and long-term temporal dependencies. 
Our proposed approach achieves state-of-the-art results on Moving MNIST, Traffic4cast and KITTI datasets. We further demonstrate the transferability of our self-supervised learning method by exploiting its learnt features for object detection on KITTI. Our competitive results indicate the potential of using CrevNet as a generative pre-training strategy to guide downstream tasks.",/pdf/11473a76f42b3b89d9492e081d40d8772af5e26a.pdf,ICLR,2020, +yOkSW62hqq2,L7KOOSxs8pP,1601310000000.0,1614990000000.0,1130,Explicit Connection Distillation,"[""~Lujun_Li1"", ""~Yikai_Wang2"", ""~Anbang_Yao1"", ""~Yi_Qian2"", ""~Xiao_Zhou3"", ""~Ke_He1""]","[""Lujun Li"", ""Yikai Wang"", ""Anbang Yao"", ""Yi Qian"", ""Xiao Zhou"", ""Ke He""]",[],"One effective way to ease the deployment of deep neural networks on resource constrained devices is Knowledge Distillation (KD), which boosts the accuracy of a low-capacity student model by mimicking the learnt information of a high-capacity teacher (either a single model or a multi-model ensemble). Although great progress has been attained on KD research, existing efforts are primarily invested to design better distillation losses by using soft logits or intermediate feature representations of the teacher as the extra supervision. In this paper, we present Explicit Connection Distillation (ECD), a new KD framework, which addresses the knowledge distillation problem in a novel perspective of bridging dense intermediate feature connections between a student network and its corresponding teacher generated automatically in the training, achieving knowledge transfer goal via direct cross-network layer-to-layer gradients propagation. ECD has two interdependent modules. 
In the first module, given a student network, an auxiliary teacher architecture is temporarily generated conditioned on strengthening feature representations of basic convolutions of the student network via replacing them with dynamic additive convolutions and keeping the other layers unchanged in structure. The teacher generated in this way guarantees its superior capacity and makes a perfect feature alignment (both in input and output dimensions) to the student at every convolutional layer. In the second module, dense feature connections between the aligned convolutional layers from the student to its auxiliary teacher are introduced, which allows explicit layer-to-layer gradients propagation from the teacher to the student via the merged model training from scratch. Intriguingly, as feature connection direction is one-way, all feature connections together with the auxiliary teacher merely exist during training phase. Experiments on popular image classification tasks validate the effectiveness of our method. Code will be made publicly available.",/pdf/ab053bf673bacba208f889cf35ee87a0b83616bb.pdf,ICLR,2021, +5mhViEOQxaV,X04AbU8bz1,1601310000000.0,1614990000000.0,1412,Controllable Pareto Multi-Task Learning,"[""~Xi_Lin2"", ""~Zhiyuan_YANG1"", ""qingfu.zhang@cityu.edu.hk"", ""~Sam_Kwong1""]","[""Xi Lin"", ""Zhiyuan YANG"", ""Qingfu Zhang"", ""Sam Kwong""]","[""Multi-Task Learning"", ""Multi-Objective Optimization""]","A multi-task learning (MTL) system aims at solving multiple related tasks at the same time. With a fixed model capacity, the tasks would be conflicted with each other, and the system usually has to make a trade-off among learning all of them together. Multiple models with different preferences over tasks have to be trained and stored for many real-world applications where the trade-off has to be made online. 
This work proposes a novel controllable Pareto multi-task learning framework, to enable the system to make real-time trade-off switch among different tasks with a single model. To be specific, we formulate the MTL as a preference-conditioned multiobjective optimization problem, for which there is a parametric mapping from the preferences to the Pareto stationary solutions. A single hypernetwork-based multi-task neural network is built to learn all tasks with different trade-off preferences among them, where the hypernetwork generates the model parameters conditioned on the preference. At the inference time, MTL practitioners can easily control the model performance based on different trade-off preferences in real-time. Experiments on different applications demonstrate that the proposed model is efficient for solving various multi-task learning problems. ",/pdf/87a652b7496cdaa6f48b33918fc2ff3ff44aa8b7.pdf,ICLR,2021,This work proposes a novel approach to learn the entire trade-off curve for MTL problems. +eMP1j9efXtX,DRLIpb0z2s,1601310000000.0,1615910000000.0,1325,DeepAveragers: Offline Reinforcement Learning By Solving Derived Non-Parametric MDPs,"[""~Aayam_Kumar_Shrestha1"", ""~Stefan_Lee1"", ""~Prasad_Tadepalli1"", ""~Alan_Fern1""]","[""Aayam Kumar Shrestha"", ""Stefan Lee"", ""Prasad Tadepalli"", ""Alan Fern""]","[""Offline Reinforcement Learning"", ""Planning""]","We study an approach to offline reinforcement learning (RL) based on optimally solving finitely-represented MDPs derived from a static dataset of experience. This approach can be applied on top of any learned representation and has the potential to easily support multiple solution objectives as well as zero-shot adjustment to changing environments and goals. Our main contribution is to introduce the Deep Averagers with Costs MDP (DAC-MDP) and to investigate its solutions for offline RL. 
DAC-MDPs are a non-parametric model that can leverage deep representations and account for limited data by introducing costs for exploiting under-represented parts of the model. In theory, we show conditions that allow for lower-bounding the performance of DAC-MDP solutions. We also investigate the empirical behavior in a number of environments, including those with image-based observations. Overall, the experiments demonstrate that the framework can work in practice and scale to large complex offline RL problems.",/pdf/41ec2c7a3d80d8e07956f446e858586b83aa7620.pdf,ICLR,2021,The paper introduces and investigates an offline RL approach based on optimally solving a finite-state MDP that is derived from the experience dataset using any latent state representation. +HyEtjoCqFX,rJeV_T2qFX,1538090000000.0,1550670000000.0,643,Soft Q-Learning with Mutual-Information Regularization,"[""jordi@prowler.io"", ""felix@prowler.io"", ""peter@prowler.io""]","[""Jordi Grau-Moya"", ""Felix Leibfried"", ""Peter Vrancx""]","[""reinforcement learning"", ""regularization"", ""entropy"", ""mutual information""]","We propose a reinforcement learning (RL) algorithm that uses mutual-information regularization to optimize a prior action distribution for better performance and exploration. Entropy-based regularization has previously been shown to improve both exploration and robustness in challenging sequential decision-making tasks. It does so by encouraging policies to put probability mass on all actions. However, entropy regularization might be undesirable when actions have significantly different importance. In this paper, we propose a theoretically motivated framework that dynamically weights the importance of actions by using the mutual-information. In particular, we express the RL problem as an inference problem where the prior probability distribution over actions is subject to optimization. 
We show that the prior optimization introduces a mutual-information regularizer in the RL objective. This regularizer encourages the policy to be close to a non-uniform distribution that assigns higher probability mass to more important actions. We empirically demonstrate that our method significantly improves over entropy regularization methods and unregularized methods.",/pdf/c24bca780db2b04b984014cfbe241d1d607ec4eb.pdf,ICLR,2019, +Syl89aNYwS,HJeLi06wDB,1569440000000.0,1577170000000.0,709,Robust saliency maps with distribution-preserving decoys,"[""ylu465@uw.edu"", ""wzg13@ist.psu.edu"", ""xxing@ist.psu.edu"", ""william-noble@uw.edu""]","[""Yang Young Lu"", ""Wenbo Guo"", ""Xinyu Xing"", ""William Stafford Noble""]","[""explainable machine learning"", ""explainable AI"", ""deep learning interpretability"", ""saliency maps"", ""perturbation"", ""convolutional neural network""]","Saliency methods help to make deep neural network predictions more interpretable by identifying particular features, such as pixels in an image, that contribute most strongly to the network's prediction. Unfortunately, recent evidence suggests that many saliency methods perform poorly when gradients are saturated or in the presence of strong inter-feature dependence or noise injected by an adversarial attack. In this work, we propose a data-driven technique that uses the distribution-preserving decoys to infer robust saliency scores in conjunction with a pre-trained convolutional neural network classifier and any off-the-shelf saliency method. We formulate the generation of decoys as an optimization problem, potentially applicable to any convolutional network architecture. We also propose a novel decoy-enhanced saliency score, which provably compensates for gradient saturation and considers joint activation patterns of pixels in a single-layer convolutional neural network. 
Empirical results on the ImageNet data set using three different deep neural network architectures---VGGNet, AlexNet and ResNet---show both qualitatively and quantitatively that decoy-enhanced saliency scores outperform raw scores produced by three existing saliency methods.",/pdf/90a3b1b2b3dd043070227c8b243a618da3f616c9.pdf,ICLR,2020,We propose a robust saliency method which alleviate the limitations of mainstream competing methods with theoretical soundness +HJMjW3RqtX,r1eNyHhcFX,1538090000000.0,1545360000000.0,1205,One-Shot High-Fidelity Imitation: Training Large-Scale Deep Nets with RL,"[""tpaine@google.com"", ""sergomez@google.com"", ""ziyu@google.com"", ""reedscot@google.com"", ""yusufaytar@google.com"", ""tpfaff@google.com"", ""mwhoffman@google.com"", ""gabrielbm@google.com"", ""cabi@google.com"", ""budden@google.com"", ""nandodefreitas@google.com""]","[""Tom Le Paine"", ""Sergio Gomez"", ""Ziyu Wang"", ""Scott Reed"", ""Yusuf Aytar"", ""Tobias Pfaff"", ""Matt Hoffman"", ""Gabriel Barth-Maron"", ""Serkan Cabi"", ""David Budden"", ""Nando de Freitas""]","[""Imitation Learning"", ""Deep Learning""]","Humans are experts at high-fidelity imitation -- closely mimicking a demonstration, often in one attempt. Humans use this ability to quickly solve a task instance, and to bootstrap learning of new tasks. Achieving these abilities in autonomous agents is an open problem. In this paper, we introduce an off-policy RL algorithm (MetaMimic) to narrow this gap. MetaMimic can learn both (i) policies for high-fidelity one-shot imitation of diverse novel skills, and (ii) policies that enable the agent to solve tasks more efficiently than the demonstrators. MetaMimic relies on the principle of storing all experiences in a memory and replaying these to learn massive deep neural network policies by off-policy RL. 
This paper introduces, to the best of our knowledge, the largest existing neural networks for deep RL and shows that larger networks with normalization are needed to achieve one-shot high-fidelity imitation on a challenging manipulation task. +The results also show that both types of policy can be learned from vision, in spite of the task rewards being sparse, and without access to demonstrator actions. ",/pdf/34848bbbdaf97736acefe815555df81d55db8d2b.pdf,ICLR,2019,"We present MetaMimic, an algorithm that takes as input a demonstration dataset and outputs (i) a one-shot high-fidelity imitation policy (ii) an unconditional task policy." +oGq4d9TbyIA,YBccQv77NtA,1601310000000.0,1614990000000.0,2686,Uniform-Precision Neural Network Quantization via Neural Channel Expansion,"[""skstjdals@hanyang.ac.kr"", ""qjatjr4828@hanyang.ac.kr"", ""kysim@nota.ai"", ""jieun.lim@nota.ai"", ""~Tae-Ho_Kim2"", ""~Jungwook_Choi1""]","[""Seongmin Park"", ""Beomseok Kwon"", ""Kyuyoung Sim"", ""Jieun Lim"", ""Tae-Ho Kim"", ""Jungwook Choi""]","[""deep neural network"", ""quantization"", ""neural architecture search"", ""image classification"", ""reduced precision"", ""inference""]","Uniform-precision neural network quantization has gained popularity thanks to its simple arithmetic unit densely packed for high computing capability. However, it ignores heterogeneous sensitivity to the impact of quantization across the layers, resulting in sub-optimal inference accuracy. This work proposes a novel approach to adjust the network structure to alleviate the impact of uniform-precision quantization. The proposed neural architecture search selectively expands channels for the quantization sensitive layers while satisfying hardware constraints (e.g., FLOPs). We provide substantial insights and empirical evidence that the proposed search method called neural channel expansion can adapt several popular networks' channels to achieve superior 2-bit quantization accuracy on CIFAR10 and ImageNet. 
In particular, we demonstrate the best-to-date Top-1/Top-5 accuracy for 2-bit ResNet50 with smaller FLOPs and the parameter size.",/pdf/8e122d62bcbf69f1d3f574a691a55fefa5a193fd.pdf,ICLR,2021,"a new NAS-based quantization algorithm called neural channel expansion (NCE), which is equipped with a simple yet innovative channel expansion mechanism to balance the number of channels across the layers under uniform-precision quantization." +S1xI_TEtwS,B1luJ-nvDB,1569440000000.0,1577170000000.0,634,Amata: An Annealing Mechanism for Adversarial Training Acceleration,"[""yn272@cam.ac.uk"", ""qianxiao@nus.edu.sg"", ""zhanxing.zhu@pku.edu.cn""]","[""Nanyang Ye"", ""Qianxiao Li"", ""Zhanxing Zhu""]",[],"Despite of the empirical success in various domains, it has been revealed that deep neural networks are vulnerable to maliciously perturbed input data that much degrade their performance. This is known as adversarial attacks. To counter adversarial attacks, adversarial training formulated as a form of robust optimization has been demonstrated to be effective. However, conducting adversarial training brings much computational overhead compared with standard training. In order to reduce the computational cost, we propose a simple yet effective modification to the commonly used projected gradient descent (PGD) adversarial training by increasing the number of adversarial training steps and decreasing the adversarial training step size gradually as training proceeds. We analyze the optimality of this annealing mechanism through the lens of optimal control theory, and we also prove the convergence of our proposed algorithm. Numerical experiments on standard datasets, such as MNIST and CIFAR10, show that our method can achieve similar or even better robustness with around 1/3 to 1/2 computation time compared with PGD.",/pdf/55c5a2001a7fbfa471ff85c4c32470b0a82bc585.pdf,ICLR,2020,Amata: a simple modification to PGD reduces the adversarial training time to 1/2~1/3. 
+Hy0L4t5el,,1478300000000.0,1478300000000.0,476,Tree-Structured Variational Autoencoder,"[""ricshin@cs.berkeley.edu"", ""alemi@google.com"", ""geoffreyi@google.com"", ""vinyals@google.com""]","[""Richard Shin"", ""Alexander A. Alemi"", ""Geoffrey Irving"", ""Oriol Vinyals""]",[],"Many kinds of variable-sized data we would like to model contain an internal hierarchical structure in the form of a tree, including source code, formal logical statements, and natural language sentences with parse trees. For such data it is natural to consider a model with matching computational structure. In this work, we introduce a variational autoencoder-based generative model for tree-structured data. We evaluate our model on a synthetic dataset, and a dataset with applications to automated theorem proving. By learning a latent representation over trees, our model can achieve similar test log likelihood to a standard autoregressive decoder, but with the number of sequentially dependent computations proportional to the depth of the tree instead of the number of nodes in the tree.",/pdf/29f62a14085790233f2c2036e86a2b2425898ae5.pdf,ICLR,2017, +LDSeViRs4-Q,kozD0hdJtaL,1601310000000.0,1614990000000.0,1309,Increasing-Margin Adversarial (IMA) training to Improve Adversarial Robustness of Neural Networks,"[""~Linhai_Ma1"", ""~Liang_Liang2""]","[""Linhai Ma"", ""Liang Liang""]","[""Robustness"", ""CNN"", ""Medical image classification""]","Deep neural networks (DNNs), including convolutional neural networks, are known to be vulnerable to adversarial attacks, which may lead to disastrous consequences in life-critical applications. Adversarial samples are usually generated by attack algorithms and can also be induced by white noises, and therefore the threats are real. In this study, we propose a novel training method, named Increasing Margin Adversarial (IMA) Training, to improve DNN robustness against adversarial noises. 
During training, the IMA method increases the margins of training samples by moving the decision boundaries of the DNN model far away from the training samples to improve robustness. The IMA method is evaluated on six publicly available datasets (including a COVID-19 CT image dataset) under strong 100-PGD white-box adversarial attacks, and the results show that the proposed method significantly improved classification accuracy on noisy data while keeping a relatively high accuracy on clean data. We hope our approach may facilitate the development of robust DNN applications, especially for COVID-19 diagnosis using CT images.",/pdf/8884767e3a049fa5a9b870efd97785195a020d0b.pdf,ICLR,2021,A new adversarial training method with individualized margin estimation to improve robustness against adversarial noises. +jOQbDGngsg8,_npfgzHJa7K,1601310000000.0,1614990000000.0,130,Secure Network Release with Link Privacy,"[""~Carl_Yang1"", ""~Haonan_Wang1"", ""~Ke_ZHANG7"", ""~Lichao_Sun1""]","[""Carl Yang"", ""Haonan Wang"", ""Ke ZHANG"", ""Lichao Sun""]","[""generative model"", ""graph neural network"", ""data release""]","Many data mining and analytical tasks rely on the abstraction of networks (graphs) to summarize relational structures among individuals (nodes). Since relational data are often sensitive, we aim to seek effective approaches to release utility-preserved yet privacy-protected structured data. In this paper, we leverage the differential privacy (DP) framework, to formulate and enforce rigorous privacy constraints on deep graph generation models, with a focus on edge-DP to guarantee individual link privacy. In particular, we enforce edge-DP by injecting Gaussian noise to the gradients of a link reconstruction based graph generation model, and ensure data utility by improving structure learning with structure-oriented graph comparison. 
Extensive experiments on two real-world network datasets show that our proposed DPGGAN model is able to generate networks with effectively preserved global structure and rigorously protected individual link privacy.",/pdf/50eec3fb07071b95e187988b57b1b0bdecbdf109.pdf,ICLR,2021,We study secure network release by learning to generate globally useful graphs with individual link privacy. +Bke8UR4FPB,rJlpeZwuvB,1569440000000.0,1588530000000.0,1139,Oblique Decision Trees from Derivatives of ReLU Networks,"[""guanghe@csail.mit.edu"", ""tommi@csail.mit.edu""]","[""Guang-He Lee"", ""Tommi S. Jaakkola""]","[""oblique decision trees"", ""ReLU networks""]","We show how neural models can be used to realize piece-wise constant functions such as decision trees. The proposed architecture, which we call locally constant networks, builds on ReLU networks that are piece-wise linear and hence their associated gradients with respect to the inputs are locally constant. We formally establish the equivalence between the classes of locally constant networks and decision trees. Moreover, we highlight several advantageous properties of locally constant networks, including how they realize decision trees with parameter sharing across branching / leaves. Indeed, only $M$ neurons suffice to implicitly model an oblique decision tree with $2^M$ leaf nodes. The neural representation also enables us to adopt many tools developed for deep networks (e.g., DropConnect (Wan et al., 2013)) while implicitly training decision trees. We demonstrate that our method outperforms alternative techniques for training oblique decision trees in the context of molecular property classification and regression tasks. ",/pdf/4be2fc47fc1d018292995db8b286af2469c91b53.pdf,ICLR,2020,A novel neural architecture which implicitly realizes (oblique) decision trees. 
+P3WG6p6Jnb,6cB4_lIEvwO,1601310000000.0,1614990000000.0,2724,Offline Policy Optimization with Variance Regularization,"[""~Riashat_Islam1"", ""~Samarth_Sinha1"", ""~Homanga_Bharadhwaj1"", ""~Samin_Yeasar_Arnob1"", ""~Zhuoran_Yang1"", ""~Zhaoran_Wang1"", ""~Animesh_Garg1"", ""~Lihong_Li1"", ""~Doina_Precup1""]","[""Riashat Islam"", ""Samarth Sinha"", ""Homanga Bharadhwaj"", ""Samin Yeasar Arnob"", ""Zhuoran Yang"", ""Zhaoran Wang"", ""Animesh Garg"", ""Lihong Li"", ""Doina Precup""]","[""reinforcement learning"", ""offline batch RL"", ""off-policy"", ""policy optimization"", ""variance regularization""]","Learning policies from fixed offline datasets is a key challenge to scale up reinforcement learning (RL) algorithms towards practical applications. This is often because off-policy RL algorithms suffer from distributional shift, due to mismatch between dataset and the target policy, leading to high variance and over-estimation of value functions. In this work, we propose variance regularization for offline RL algorithms, using stationary distribution corrections. We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer. The proposed algorithm for offline variance regularization can be used to augment any existing offline policy optimization algorithms. We show that the regularizer leads to a lower bound to the offline policy optimization objective, which can help avoid over-estimation errors, and explains the benefits of our approach across a range of continuous control domains when compared to existing algorithms. 
",/pdf/f08def68d69cab97de6c7d7622021ef57ae59346.pdf,ICLR,2021,Variance regularization based on stationary state-action distribution corrections in offline policy optimization +CGFN_nV1ql,aJbS8XefaJ,1601310000000.0,1614990000000.0,960,Non-Attentive Tacotron: Robust and controllable neural TTS synthesis including unsupervised duration modeling,"[""~Jonathan_Shen1"", ""~Ye_Jia1"", ""~Mike_Chrzanowski2"", ""~Yu_Zhang2"", ""isaace@google.com"", ""~Heiga_Zen1"", ""~Yonghui_Wu1""]","[""Jonathan Shen"", ""Ye Jia"", ""Mike Chrzanowski"", ""Yu Zhang"", ""Isaac Elias"", ""Heiga Zen"", ""Yonghui Wu""]","[""tts"", ""text-to-speech""]","This paper presents Non-Attentive Tacotron based on the Tacotron 2 text-to-speech model, replacing the attention mechanism with an explicit duration predictor. This improves robustness significantly as measured by unaligned duration ratio and word deletion rate, two metrics introduced in this paper for large-scale robustness evaluation using a pre-trained speech recognition model. With the use of Gaussian upsampling, Non-Attentive Tacotron achieves a 5-scale mean opinion score for naturalness of 4.41, slightly outperforming Tacotron 2. The duration predictor enables both utterance-wide and per-phoneme control of duration at inference time. When accurate target durations are scarce or unavailable in the training data, we propose a method using a fine-grained variational auto-encoder to train the duration predictor in a semi-supervised or unsupervised manner, with results almost as good as supervised training.",/pdf/1868713d01e66c6f590aeb4eeb35dff5848ec601.pdf,ICLR,2021,"Non-Attentive Tacotron replaces the attention mechanism in Tacotron 2 with a duration predictor leading to improved robustness, and can be trained with reasonable performance even without duration labels." 
+GHCu1utcBvX,-61VXWyKGAv,1601310000000.0,1614990000000.0,220,Transferability of Compositionality,"[""~Yuanpeng_Li2"", ""~Liang_Zhao2"", ""~Joel_Hestness2"", ""kayeelun@gmail.com"", ""~Kenneth_Church1"", ""~Mohamed_Elhoseiny1""]","[""Yuanpeng Li"", ""Liang Zhao"", ""Joel Hestness"", ""Ka Yee Lun"", ""Kenneth Church"", ""Mohamed Elhoseiny""]","[""Compositionality""]","Compositional generalization is the algebraic capacity to understand and produce large amount of novel combinations from known components. It is a key element of human intelligence for out-of-distribution generalization. To equip neural networks with such ability, many algorithms have been proposed to extract compositional representations from the training distribution. However, it has not been discussed whether the trained model can still extract such representations in the test distribution. In this paper, we argue that the extraction ability does not transfer naturally, because the extraction network suffers from the divergence of distributions. To address this problem, we propose to use an auxiliary reconstruction network with regularized hidden representations as input, and optimize the representations during inference. The proposed approach significantly improves accuracy, showing more than a 20% absolute increase in various experiments compared with baselines. To our best knowledge, this is the first work to focus on the transferability of compositionality, and it is orthogonal to existing efforts of learning compositional representations in training distribution. 
We hope this work will help to advance compositional generalization and artificial intelligence research.",/pdf/f864332bfbd9bb1f654c5789701af400f2ecfa62.pdf,ICLR,2021, +BklRFpVKPH,S1eu4I6wPH,1569440000000.0,1577170000000.0,690,Demonstration Actor Critic,"[""lgq1001@mail.ustc.edu.cn"", ""lizo@microsoft.com"", ""zpschang@gmail.com"", ""jiang.bian@microsoft.com"", ""taoqin@microsoft.com"", ""ynh@ustc.edu.cn"", ""tyliu@microsoft.com""]","[""Guoqing Liu"", ""Li Zhao"", ""Pushi Zhang"", ""Jiang Bian"", ""Tao Qin"", ""Nenghai Yu"", ""Tie-Yan Liu""]","[""Deep Reinforcement Learning"", ""Reinforcement Learning from Demonstration""]","We study the problem of \textit{Reinforcement learning from demonstrations (RLfD)}, where the learner is provided with both some expert demonstrations and reinforcement signals from the environment. One approach leverages demonstration data in a supervised manner, which is simple and direct, but can only provide a supervision signal over those states seen in the demonstrations. Another approach uses demonstration data for reward shaping. By contrast, the latter approach can provide guidance on how to take actions, even for states that are not seen in the demonstrations. However, existing algorithms of the latter kind adopt a shaping reward that does not depend directly on the current policy, so they treat demonstrated states the same as other states and fail to directly exploit the supervision signal in the demonstration data. In this paper, we propose a novel objective function with policy-dependent shaping reward, so as to get the best of both worlds. We present a convergence proof for policy iteration of the proposed objective, under the tabular setting. Then we develop a new practical algorithm, termed Demonstration Actor Critic (DAC). 
Experiments on a range of popular benchmark sparse-reward tasks show that our DAC method obtains a significant performance gain over five strong and off-the-shelf baselines.",/pdf/da604872980e8c14b2fedbca47e6be8ab7ea3bc7.pdf,ICLR,2020, +zbEupOtJFF,Pqau_PN5C99,1601310000000.0,1614990000000.0,704,On interaction between augmentations and corruptions in natural corruption robustness,"[""~Eric_Mintun1"", ""~Alexander_Kirillov1"", ""~Saining_Xie2""]","[""Eric Mintun"", ""Alexander Kirillov"", ""Saining Xie""]","[""corruption robustness"", ""data augmentation"", ""perceptual similarity"", ""deep learning""]","Invariance to a broad array of image corruptions, such as warping, noise, or color shifts, is an important aspect of building robust models in computer vision. Recently, several new data augmentations have been proposed that significantly improve performance on ImageNet-C, a benchmark of such corruptions. However, there is still a lack of basic understanding of the relationship between data augmentations and test-time corruptions. To this end, we develop a feature space for image transforms, and then use a new measure in this space between augmentations and corruptions called the Minimal Sample Distance to demonstrate there is a strong correlation between similarity and performance. We then investigate recent data augmentations and observe a significant degradation in corruption robustness when the test-time corruptions are sampled to be perceptually dissimilar from ImageNet-C in this feature space. Our results suggest that test error can be improved by training on perceptually similar augmentations, and data augmentations may risk overfitting to the existing benchmark. 
We hope our results and tools will allow for more robust progress towards improving robustness to image corruptions.",/pdf/a58ff8e728278c3fecd495d3fcd7da6e22f17e0a.pdf,ICLR,2021,We show that data augmentation improves error on images corrupted by transforms that are visually similar to the augmentations and that this leads to overfitting on a common corruption benchmark. +r1y1aawlg,,1478120000000.0,1481820000000.0,49,Iterative Refinement for Machine Translation,"[""roman.novak@polytechnique.edu"", ""michaelauli@fb.com"", ""grangier@fb.com""]","[""Roman Novak"", ""Michael Auli"", ""David Grangier""]","[""Natural language processing""]","Existing machine translation decoding algorithms generate translations in a strictly monotonic fashion and never revisit previous decisions. As a result, earlier mistakes cannot be corrected at a later stage. In this paper, we present a translation scheme that starts from an initial guess and then makes iterative improvements that may revisit previous decisions. We parameterize our model as a convolutional neural network that predicts discrete substitutions to an existing translation based on an attention mechanism over both the source sentence as well as the current translation output. By making less than one modification per sentence, we improve the output of a phrase-based translation system by up to 0.4 BLEU on WMT15 German-English translation.",/pdf/3ec7e1271fc04842c6d37b32886da4d92cd78424.pdf,ICLR,2017,"We propose of novel decoding strategy for MT, after producing a full sentence the model can revisit its choice and substitute words; multiple words can iteratively be edited." 
+rkgQL6VFwr,S1gRzvuDDH,1569440000000.0,1577170000000.0,555,Learning Generative Image Object Manipulations from Language Instructions,"[""martin.langkvist@oru.se"", ""andreas.persson@oru.se"", ""amy.loutfi@oru.se""]","[""Martin L\u00e4ngkvist"", ""Andreas Persson"", ""Amy Loutfi""]",[],"The use of adequate feature representations is essential for achieving high performance in high-level human cognitive tasks in computational modeling. Recent developments in deep convolutional and recurrent neural network architectures enable learning powerful feature representations from both images and natural language text. In addition, other types of networks such as Relational Networks (RN) can learn relations between objects and Generative Adversarial Networks (GAN) have been shown to generate realistic images. In this paper, we combine these four techniques to acquire a shared feature representation of the relation between objects in an input image and an object manipulation action description in the form of human language encodings to generate an image that shows the resulting end-effect the action would have on a computer-generated scene. The system is trained and evaluated on a simulated dataset and experimentally used on real-world photos.",/pdf/54891ff2c30230948a513990c45f44e42578656f.pdf,ICLR,2020, +SkeUG30cFQ,r1giUzRqY7,1538090000000.0,1545360000000.0,1265,The Expressive Power of Deep Neural Networks with Circulant Matrices,"[""alexandre.araujo@dauphine.eu"", ""benjamin.negrevergne@dauphine.fr"", ""yann.chevaleyre@lamsade.dauphine.fr"", ""jamal.atif@lamsade.dauphine.fr""]","[""Alexandre Araujo"", ""Benjamin Negrevergne"", ""Yann Chevaleyre"", ""Jamal Atif""]","[""deep learning"", ""circulant matrices"", ""universal approximation""]","Recent results from linear algebra stating that any matrix can be decomposed into products of diagonal and circulant matrices have led to the design of compact deep neural network architectures that perform well in practice. 
In this paper, we bridge the gap between these good empirical results +and the theoretical approximation capabilities of Deep diagonal-circulant ReLU networks. More precisely, we first demonstrate that a Deep diagonal-circulant ReLU networks of +bounded width and small depth can approximate a deep ReLU network in which the dense matrices are +of low rank. Based on this result, we provide new bounds on the expressive power and universal approximativeness of this type of networks. We support our experimental results with thorough experiments on a large, real world video classification problem.",/pdf/6640b3c1a3ecb4248b3fbb31ce90ec578e2ab4b0.pdf,ICLR,2019,We provide a theoretical study of the properties of Deep circulant-diagonal ReLU Networks and demonstrate that they are bounded width universal approximators. +3F0Qm7TzNDM,PZ90I0ouRnIQ,1601310000000.0,1614990000000.0,429,Variance Based Sample Weighting for Supervised Deep Learning,"[""~Paul_Novello1"", ""gael.poette@cea.fr"", ""david.lugato@cea.fr"", ""pietro.congedo@inria.fr""]","[""Paul Novello"", ""Ga\u00ebl Po\u00ebtte"", ""David Lugato"", ""Pietro Congedo""]","[""supervised learning"", ""sample distribution"", ""statistical methods"", ""sample weighting"", ""approximation theory"", ""Taylor expansion""]","In the context of supervised learning of a function by a Neural Network (NN), we claim and empirically justify that a NN yields better results when the distribution of the data set focuses on regions where the function to learn is steeper. We first traduce this assumption in a mathematically workable way using Taylor expansion. Then, theoretical derivations allow to construct a methodology that we call Variance Based Samples Weighting (VBSW). VBSW uses local variance of the labels to weight the training points. 
This methodology is general, scalable, cost effective, and significantly increases the performances of a large class of models for various classification and regression tasks on image, text and multivariate data. We highlight its benefits with experiments involving NNs from shallow linear NN to ResNet or Bert.",/pdf/1253738cef5d2840c118e7f751e0ded10e5c6bd1.pdf,ICLR,2021,This paper constructs a new training distribution to weight the training data set and boost the performances of a Neural Network. +HkgsPhNYPS,BJe3rTSqrr,1569440000000.0,1583910000000.0,16,SELF: Learning to Filter Noisy Labels with Self-Ensembling,"[""ductam.nguyen08@gmail.com"", ""chaithanyakumar.mummadi@de.bosch.com"", ""thiphuongnhung.ngo@de.bosch.com"", ""hoai.phuong.nguyen198@gmail.com"", ""laura.beggel@de.bosch.com"", ""brox@cs.uni-freiburg.de""]","[""Duc Tam Nguyen"", ""Chaithanya Kumar Mummadi"", ""Thi Phuong Nhung Ngo"", ""Thi Hoai Phuong Nguyen"", ""Laura Beggel"", ""Thomas Brox""]","[""Ensemble Learning"", ""Robust Learning"", ""Noisy Labels"", ""Labels Filtering""]","Deep neural networks (DNNs) have been shown to over-fit a dataset when being trained with noisy labels for a long enough time. To overcome this problem, we present a simple and effective method self-ensemble label filtering (SELF) to progressively filter out the wrong labels during training. Our method improves the task performance by gradually allowing supervision only from the potentially non-noisy (clean) labels and stops learning on the filtered noisy labels. For the filtering, we form running averages of predictions over the entire training dataset using the network output at different training epochs. We show that these ensemble estimates yield more accurate identification of inconsistent predictions throughout training than the single estimates of the network at the most recent training epoch. 
While filtered samples are removed entirely from the supervised training loss, we dynamically leverage them via semi-supervised learning in the unsupervised loss. We demonstrate the positive effect of such an approach on various image classification tasks under both symmetric and asymmetric label noise and at different noise ratios. It substantially outperforms all previous works on noise-aware learning across different datasets and can be applied to a broad set of network architectures.",/pdf/3bafa3876cfccbe25d2fc5c4db5cd495aae2e6c7.pdf,ICLR,2020,We propose a self-ensemble framework to train more robust deep learning models under noisy labeled datasets. +xyEx4_lHqvB,7IB_JsV_G8L,1601310000000.0,1614990000000.0,1277,Ensemble-based Adversarial Defense Using Diversified Distance Mapping,"[""~Ehsan_Kazemi3"", ""~Mohamed_E._Hussein1"", ""~Wael_AbdAlmgaeed1""]","[""Ehsan Kazemi"", ""Mohamed E. Hussein"", ""Wael AbdAlmgaeed""]","[""adversarial machine learning"", ""ensemble"", ""mahalanobis distance""]","We propose an ensemble-based defense against adversarial examples using distance map layers (DMLs). Similar to fully connected layers, DMLs can be used to output logits for a multi-class classification model. We show in this paper how DMLs can be deployed to prevent transferability of attacks across ensemble members by adapting pairwise (almost) orthogonal covariance matrices. We also illustrate how DMLs provide an efficient way to regularize the Lipschitz constant of the ensemble's member models, which further boosts the resulting robustness. 
Through empirical evaluations across multiple datasets and attack models, we demonstrate that the ensembles based on DMLs can achieve high benign accuracy while exhibiting robustness against adversarial attacks using multiple white-box techniques along with AutoAttack.",/pdf/3921d418bbe5c82fe5c66c604dfc4836a44343df.pdf,ICLR,2021, +ryZElGZ0Z,BJeVlM-C-,1509130000000.0,1518730000000.0,766,Discovery of Predictive Representations With a Network of General Value Functions,"[""mkschleg@ualberta.ca"", ""andnpatt@indiana.edu"", ""amw8@ualberta.ca"", ""whitem@ualberta.ca""]","[""Matthew Schlegel"", ""Andrew Patterson"", ""Adam White"", ""Martha White""]","[""Reinforcement Learning"", ""General Value Functions"", ""Predictive Representations""]","The ability of an agent to {\em discover} its own learning objectives has long been considered a key ingredient for artificial general intelligence. Breakthroughs in autonomous decision making and reinforcement learning have primarily been in domains where the agent's goal is outlined and clear: such as playing a game to win, or driving safely. Several studies have demonstrated that learning extramural sub-tasks and auxiliary predictions can improve (1) single human-specified task learning, (2) transfer of learning, (3) and the agent's learned representation of the world. In all these examples, the agent was instructed what to learn about. We investigate a framework for discovery: curating a large collection of predictions, which are used to construct the agent's representation of the world. Specifically, our system maintains a large collection of predictions, continually pruning and replacing predictions. We highlight the importance of considering stability rather than convergence for such a system, and develop an adaptive, regularized algorithm towards that aim. 
We provide several experiments in computational micro-worlds demonstrating that this simple approach can be effective for discovering useful predictions autonomously.",/pdf/9260c857ba02d6c19e84d2102b68c18bab39deb3.pdf,ICLR,2018,"We investigate a framework for discovery: curating a large collection of predictions, which are used to construct the agent’s representation in partially observable domains." +BJ0hF1Z0b,rJR2F1-A-,1509120000000.0,1519430000000.0,504,Learning Differentially Private Recurrent Language Models,"[""mcmahan@google.com"", ""dramage@google.com"", ""kunal@google.com"", ""liqzhang@google.com""]","[""H. Brendan McMahan"", ""Daniel Ramage"", ""Kunal Talwar"", ""Li Zhang""]","[""differential privacy"", ""LSTMs"", ""language models"", ""privacy""]","We demonstrate that it is possible to train large recurrent language models with user-level differential privacy guarantees with only a negligible cost in predictive accuracy. Our work builds on recent advances in the training of deep networks on user-partitioned data and privacy accounting for stochastic gradient descent. In particular, we add user-level privacy protection to the federated averaging algorithm, which makes large step updates from user-level data. Our work demonstrates that given a dataset with a sufficiently large number of users (a requirement easily met by even small internet-scale datasets), achieving differential privacy comes at the cost of increased computation, rather than in decreased utility as in most prior work. We find that our private LSTM language models are quantitatively and qualitatively similar to un-noised models when trained on a large dataset.",/pdf/3f8fd2b61e7e83c63a36b191a9a9881f9a8602e6.pdf,ICLR,2018,User-level differential privacy for recurrent neural network language models is possible with a sufficiently large dataset. 
+PdauS7wZBfC,grzlOdOEpPy,1601310000000.0,1614990000000.0,1764,Predictive Coding Approximates Backprop along Arbitrary Computation Graphs,"[""~Beren_Millidge1"", ""~Alexander_Tschantz1"", ""~Christopher_Buckley1""]","[""Beren Millidge"", ""Alexander Tschantz"", ""Christopher Buckley""]","[""Predictive Coding"", ""Backprop"", ""Biological plausibility"", ""neural networks""]","The backpropagation of error (backprop) is a powerful algorithm for training machine learning architectures through end-to-end differentiation. Recently it has been shown that backprop in multilayer-perceptrons (MLPs) can be approximated using predictive coding, a biologically-plausible process theory of cortical computation which relies solely on local and Hebbian updates. The power of backprop, however, lies not in its instantiation in MLPs, but rather in the concept of automatic differentiation which allows for the optimisation of any differentiable program expressed as a computation graph. Here, we demonstrate that predictive coding converges asymptotically (and in practice rapidly) to exact backprop gradients on arbitrary computation graphs using only local learning rules. We apply this result to develop a straightforward strategy to translate core machine learning architectures into their predictive coding equivalents. We construct predictive coding CNNs, RNNs, and the more complex LSTMs, which include a non-layer-like branching internal graph structure and multiplicative interactions. Our models perform equivalently to backprop on challenging machine learning benchmarks, while utilising only local and (mostly) Hebbian plasticity. 
Our method raises the potential that standard machine learning algorithms could in principle be directly implemented in neural circuitry, and may also contribute to the development of completely distributed neuromorphic architectures.",/pdf/cf265d888b10e6a8f351a7945c93545f5dfa3048.pdf,ICLR,2021,We show that predictive coding algorithms from neuroscience can be setup to approximate the backpropagation of error algorithm on any computational graph. +SyEiHNKxx,,1478210000000.0,1485270000000.0,91,A Differentiable Physics Engine for Deep Learning in Robotics,"[""Jonas.Degrave@UGent.be"", ""x@UGent.be"", ""Joni.Dambre@UGent.be"", ""Francis.wyffels@UGent.be""]","[""Jonas Degrave"", ""Michiel Hermans"", ""Joni Dambre"", ""Francis wyffels""]","[""Deep learning""]","One of the most important fields in robotics is the optimization of controllers. Currently, robots are often treated as a black box in this optimization process, which is the reason why derivative-free optimization methods such as evolutionary algorithms or reinforcement learning are omnipresent. When gradient-based methods are used, models are kept small or rely on finite difference approximations for the Jacobian. This method quickly grows expensive with increasing numbers of parameters, such as found in deep learning. We propose an implementation of a modern physics engine, which can differentiate control parameters. This engine is implemented for both CPU and GPU. Firstly, this paper shows how such an engine speeds up the optimization process, even for small problems. Furthermore, it explains why this is an alternative approach to deep Q-learning, for using deep learning in robotics. 
Finally, we argue that this is a big step for deep learning in robotics, as it opens up new possibilities to optimize robots, both in hardware and software.",/pdf/37c4713bd37eda2768b800ec1c2fe9a9ef7384d4.pdf,ICLR,2017,We wrote a framework to differentiate through physics and show that this makes training deep learned controllers for robotics remarkably fast and straightforward +B1EA-M-0Z,rJXAbfZAW,1509130000000.0,1519440000000.0,783,Deep Neural Networks as Gaussian Processes,"[""jaehlee@google.com"", ""yasamanb@google.com"", ""romann@google.com"", ""schsam@google.com"", ""jpennin@google.com"", ""jaschasd@google.com""]","[""Jaehoon Lee"", ""Yasaman Bahri"", ""Roman Novak"", ""Samuel S. Schoenholz"", ""Jeffrey Pennington"", ""Jascha Sohl-Dickstein""]","[""Gaussian process"", ""Bayesian regression"", ""deep networks"", ""kernel methods""]","It has long been known that a single-layer fully-connected neural network with an i.i.d. prior over its parameters is equivalent to a Gaussian process (GP), in the limit of infinite network width. This correspondence enables exact Bayesian inference for infinite width neural networks on regression tasks by means of evaluating the corresponding GP. Recently, kernel functions which mimic multi-layer random neural networks have been developed, but only outside of a Bayesian framework. As such, previous work has not identified that these kernels can be used as covariance functions for GPs and allow fully Bayesian prediction with a deep neural network. + +In this work, we derive the exact equivalence between infinitely wide, deep, networks and GPs with a particular covariance function. We further develop a computationally efficient pipeline to compute this covariance function. We then use the resulting GP to perform Bayesian inference for deep neural networks on MNIST and CIFAR-10. 
We observe that the trained neural network accuracy approaches that of the corresponding GP with increasing layer width, and that the GP uncertainty is strongly correlated with trained network prediction error. We further find that test performance increases as finite-width trained networks are made wider and more similar to a GP, and that the GP-based predictions typically outperform those of finite-width networks. Finally we connect the prior distribution over weights and variances in our GP formulation to the recent development of signal propagation in random neural networks.",/pdf/1c3d32c01d40a6f0226aa4656940f4a8299f3b5b.pdf,ICLR,2018,"We show how to make predictions using deep networks, without training deep networks." +HkgSEnA5KQ,rJeh69n5FX,1538090000000.0,1548790000000.0,1449,Guiding Policies with Language via Meta-Learning,"[""jcoreyes@eecs.berkeley.edu"", ""abhigupta@berkeley.edu"", ""suvansh@berkeley.edu"", ""naltieri@berkeley.edu"", ""j.d.andreas@gmail.com"", ""denero@berkeley.edu"", ""pabbeel@cs.berkeley.edu"", ""svlevine@eecs.berkeley.edu""]","[""John D. Co-Reyes"", ""Abhishek Gupta"", ""Suvansh Sanjeev"", ""Nick Altieri"", ""Jacob Andreas"", ""John DeNero"", ""Pieter Abbeel"", ""Sergey Levine""]","[""meta-learning"", ""language grounding"", ""interactive""]","Behavioral skills or policies for autonomous agents are conventionally learned from reward functions, via reinforcement learning, or from demonstrations, via imitation learning. However, both modes of task specification have their disadvantages: reward functions require manual engineering, while demonstrations require a human expert to be able to actually perform the task in order to generate the demonstration. Instruction following from natural language instructions provides an appealing alternative: in the same way that we can specify goals to other humans simply by speaking or writing, we would like to be able to specify tasks for our machines. 
However, a single instruction may be insufficient to fully communicate our intent or, even if it is, may be insufficient for an autonomous agent to actually understand how to perform the desired task. In this work, we propose an interactive formulation of the task specification problem, where iterative language corrections are provided to an autonomous agent, guiding it in acquiring the desired skill. Our proposed language-guided policy learning algorithm can integrate an instruction and a sequence of corrections to acquire new skills very quickly. In our experiments, we show that this method can enable a policy to follow instructions and corrections for simulated navigation and manipulation tasks, substantially outperforming direct, non-interactive instruction following.",/pdf/0d2fa487022c6bef09fe6c4993c83b261997cbe4.pdf,ICLR,2019,We propose a meta-learning method for interactively correcting policies with natural language. +BJxVT3EKDH,HygSyEUQwS,1569440000000.0,1577170000000.0,222,Corpus Based Amharic Sentiment Lexicon Generation,"[""girma1978@gmail.com"", ""rauber@ifs.tuwien.ac.at"", ""solomon.atnafu@aau.edu.et""]","[""Girma Neshir"", ""Andeas Rauber"", ""and Solomon Atnafu""]","[""Amharic sentiment lexicon"", ""Amharic sentiment classification"", ""seed words""]","Sentiment classification is an active research area with several applications including analysis of political opinions, classifying comments, movie reviews, news reviews and product reviews. To employ rule based sentiment classification, we require sentiment lexicons. However, manual construction of sentiment lexicon is time consuming and costly for resource-limited languages. To bypass manual development time and costs, we tried to build Amharic Sentiment Lexicons relying on corpus based approach. The intention of this approach is to handle sentiment terms specific to Amharic language from Amharic Corpus. 
Small set of seed terms are manually prepared from three parts of speech such as noun, adjective and verb. We developed algorithms for constructing Amharic sentiment lexicons automatically from Amharic news corpus. Corpus based approach is proposed relying on the word co-occurrence distributional embedding including frequency based embedding (i.e. Positive Point-wise Mutual Information PPMI). First we build word-context unigram frequency count matrix and transform it to point-wise mutual Information matrix. Using this matrix, we computed the cosine distance of mean vector of seed lists and each word in the corpus vocabulary. Based on the threshold value, the top closest words to the mean vector of seed list are added to the lexicon. Then the mean vector of the new sentiment seed list is updated and process is repeated until we get sufficient terms in the lexicon. Using PPMI with threshold value of 100 and 200, we got corpus based Amharic Sentiment lexicons of size 1811 and 3794 respectively by expanding 519 seeds. Finally, the lexicon generated in corpus based approach is evaluated. +",/pdf/83490390395f4cc7d4f43ec60ce7be223b2b6f19.pdf,ICLR,2020,Corpus based Algorithm is developed generate Amharic Sentiment lexicon relying on corpus +H8hgu4XsTXi,TQTmJ1vl8mv,1601310000000.0,1617090000000.0,2888,Estimating Treatment Effects via Orthogonal Regularization,"[""~Tobias_Hatt1"", ""sfeuerriegel@ethz.ch""]","[""Tobias Hatt"", ""Stefan Feuerriegel""]","[""Treatment Effects"", ""Regularization"", ""Neural Networks""]","Decision-making often requires accurate estimation of causal effects from observational data. This is challenging as outcomes of alternative decisions are not observed and have to be estimated. Previous methods estimate outcomes based on unconfoundedness but neglect any constraints that unconfoundedness imposes on the outcomes. In this paper, we propose a novel regularization framework in which we formalize unconfoundedness as an orthogonality constraint. 
We provide theoretical guarantees that this yields an asymptotically normal estimator for the average causal effect. Compared to other estimators, its asymptotic variance is strictly smaller. Based on our regularization framework, we develop deep orthogonal networks for unconfounded treatments (DONUT) which learn outcomes that are orthogonal to the treatment assignment. Using a variety of benchmark datasets for causal inference, we demonstrate that DONUT outperforms the state-of-the-art substantially.",/pdf/45d5026be4e295467e379c42f7816cdb6407c240.pdf,ICLR,2021,"In order to estimate average causal effects, we develop a regularization framework in which we formalize unconfoundedness as an orthogonality constraint." +Ske-ih4FPS,ryx47b3evr,1569440000000.0,1577170000000.0,141,Unsupervised Few Shot Learning via Self-supervised Training,"[""jizilong@mail.bnu.edu.cn"", ""xiaolz@pku.edu.cn"", ""tjhuang@pku.edu.cn"", ""siwu@pku.edu.cn""]","[""Zilong Ji"", ""Xiaolong Zou"", ""Tiejun Huang"", ""Si Wu""]","[""few shot learning"", ""self-supervised learning"", ""meta-learning""]","Learning from limited exemplars (few-shot learning) is a fundamental, unsolved problem that has been laboriously explored in the machine learning community. However, current few-shot learners are mostly supervised and rely heavily on a large amount of labeled examples. Unsupervised learning is a more natural procedure for cognitive mammals and has produced promising results in many machine learning tasks. In the current study, we develop a method to learn an unsupervised few-shot learner via self-supervised training (UFLST), which can effectively generalize to novel but related classes. The proposed model consists of two alternate processes, progressive clustering and episodic training. 
The former generates pseudo-labeled training examples for constructing episodic tasks, and the latter trains the few-shot learner using the generated episodic tasks, which further optimizes the feature representations of data. The two processes facilitate each other, and eventually produce a high quality few-shot learner. Using the benchmark dataset Omniglot, we show that our model outperforms other unsupervised few-shot learning methods to a large extent and approaches the performance of supervised methods. Using the benchmark dataset Market1501, we further demonstrate the feasibility of applying our model to a real-world application, person re-identification.",/pdf/75028ec0c103388f336c9bf3b755491fcfb1f86d.pdf,ICLR,2020, +HyezmlBKwr,HJleamgKPr,1569440000000.0,1577170000000.0,2201,Test-Time Training for Out-of-Distribution Generalization,"[""yusun@berkeley.edu"", ""dragonwxl123@gmail.com"", ""zhuangl@berkeley.edu"", ""miller_john@berkeley.edu"", ""efros@eecs.berkeley.edu"", ""hardt@berkeley.edu""]","[""Yu Sun"", ""Xiaolong Wang"", ""Zhuang Liu"", ""John Miller"", ""Alexei A. Efros"", ""Moritz Hardt""]","[""out-of-distribution"", ""distribution shifts""]","We introduce a general approach, called test-time training, for improving the performance of predictive models when test and training data come from different distributions. Test-time training turns a single unlabeled test instance into a self-supervised learning problem, on which we update the model parameters before making a prediction on the test sample. We show that this simple idea leads to surprising improvements on diverse image classification benchmarks aimed at evaluating robustness to distribution shifts. Theoretical investigations on a convex model reveal helpful intuitions for when we can expect our approach to help.",/pdf/bda25e59cb37b4e3b3e97f2bdee093ff4882125d.pdf,ICLR,2020,Training on a single test input with self-supervision makes the prediction better on this input when it is out-of-distribution. 
+rkr1UDeC-,rk4kIPlA-,1509090000000.0,1518730000000.0,291,Large scale distributed neural network training through online distillation,"[""rohananil@google.com"", ""pereyra@google.com"", ""apassos@google.com"", ""ormandi@google.com"", ""gdahl@google.com"", ""geoffhinton@google.com""]","[""Rohan Anil"", ""Gabriel Pereyra"", ""Alexandre Passos"", ""Robert Ormandi"", ""George E. Dahl"", ""Geoffrey E. Hinton""]","[""distillation"", ""distributed training"", ""neural networks"", ""deep learning""]","Techniques such as ensembling and distillation promise model quality improvements when paired with almost any base model. However, due to increased test-time cost (for ensembles) and increased complexity of the training pipeline (for distillation), these techniques are challenging to use in industrial settings. In this paper we explore a variant of distillation which is relatively straightforward to use as it does not require a complicated multi-stage setup or many new hyperparameters. Our first claim is that online distillation enables us to use extra parallelism to fit very large datasets about twice as fast. Crucially, we can still speed up training even after we have already reached the point at which additional parallelism provides no benefit for synchronous or asynchronous stochastic gradient descent. Two neural networks trained on disjoint subsets of the data can share knowledge by encouraging each model to agree with the predictions the other model would have made. These predictions can come from a stale version of the other model so they can be safely computed using weights that only rarely get transmitted. Our second claim is that online distillation is a cost-effective way to make the exact predictions of a model dramatically more reproducible. 
We support our claims using experiments on the Criteo Display Ad Challenge dataset, ImageNet, and the largest to-date dataset used for neural language modeling, containing $6\times 10^{11}$ tokens and based on the Common Crawl repository of web data.",/pdf/ff8cab7c0c994f6fc91eb801de63225e997cf626.pdf,ICLR,2018,We perform large scale experiments to show that a simple online variant of distillation can help us scale distributed neural network training to more machines. +SJfZKiC5FX,SyeuxYEjuX,1538090000000.0,1551060000000.0,419,Dynamically Unfolding Recurrent Restorer: A Moving Endpoint Control Method for Image Restoration,"[""jet@pku.edu.cn"", ""luyiping9712@pku.edu.cn"", ""liujiaying@pku.edu.cn"", ""dongbin@math.pku.edu.cn""]","[""Xiaoshuai Zhang"", ""Yiping Lu"", ""Jiaying Liu"", ""Bin Dong""]","[""image restoration"", ""differential equation""]","In this paper, we propose a new control framework called the moving endpoint control to restore images corrupted by different degradation levels in one model. The proposed control problem contains a restoration dynamics which is modeled by an RNN. The moving endpoint, which is essentially the terminal time of the associated dynamics, is determined by a policy network. We call the proposed model the dynamically unfolding recurrent restorer (DURR). Numerical experiments show that DURR is able to achieve state-of-the-art performances on blind image denoising and JPEG image deblocking. Furthermore, DURR can well generalize to images with higher degradation levels that are not included in the training stage.",/pdf/ffb6c7d397b724ca3a2fd6065f2b317d77f07a55.pdf,ICLR,2019,We propose a novel method to handle image degradations of different levels by learning a diffusion terminal time. Our model can generalize to unseen degradation level and different noise statistic. 
+TWDczblpqE,9gJn1Cr6OX,1601310000000.0,1614990000000.0,1352,Semi-Supervised Audio Representation Learning for Modeling Beehive Strengths,"[""~Tony_Zhang2"", ""szymek@google.com"", ""nsv@google.com"", ""matthsmith@google.com"", ""bhopkins@wsu.edu""]","[""Tony Zhang"", ""Szymon Zmyslony"", ""Sergei Nozdrenkov"", ""Matthew Smith"", ""Brandon Kingsley Hopkins""]","[""bee"", ""beehive"", ""audio"", ""sound"", ""computational ethology"", ""deep learning"", ""representation learning"", ""semi-supervised learning"", ""modeling"", ""population"", ""disease""]","Honey bees are critical to our ecosystem and food security as a pollinator, contributing 35% of our global agriculture yield. In spite of their importance, beekeeping is exclusively dependent on human labor and experience-derived heuristics, while requiring frequent human checkups to ensure the colony is healthy, which can disrupt the colony. Increasingly, pollinator populations are declining due to threats from climate change, pests, environmental toxicity, making their management even more critical than ever before in order to ensure sustained global food security. To start addressing this pressing challenge, we developed an integrated hardware sensing system for beehive monitoring through audio and environment measurements, and a hierarchical semi-supervised deep learning model, composed of an audio modeling module and a predictor, to model the strength of beehives. The model is trained jointly on audio reconstruction and prediction losses based on human inspections, in order to model both low-level audio features and circadian temporal dynamics. We show that this model performs well despite limited labels, and can learn an audio embedding that is useful for characterizing different sound profiles of beehives. 
This is the first instance to our knowledge of applying audio-based deep learning to model beehives and population size in an observational setting across a large number of hives.",/pdf/6b672e627ff7ec410f515b50e313d8edf29430b3.pdf,ICLR,2021,"We collected multi-modal observational beehive data, and used semi-supervised audio deep learning to model population and disease states." +BJ6anzb0Z,ry5FnGZRW,1509140000000.0,1518730000000.0,978,Multimodal Sentiment Analysis To Explore the Structure of Emotions,"[""anthony.hu@stats.ox.ac.uk"", ""s.flaxman@imperial.ac.uk""]","[""Anthony Hu"", ""Seth Flaxman""]",[],"We propose a novel approach to multimodal sentiment analysis using deep neural +networks combining visual recognition and natural language processing. Our +goal is different than the standard sentiment analysis goal of predicting whether +a sentence expresses positive or negative sentiment; instead, we aim to infer the +latent emotional state of the user. Thus, we focus on predicting the emotion word +tags attached by users to their Tumblr posts, treating these as “self-reported emotions.” +We demonstrate that our multimodal model combining both text and image +features outperforms separate models based solely on either images or text. Our +model’s results are interpretable, automatically yielding sensible word lists associated +with emotions. We explore the structure of emotions implied by our model +and compare it to what has been posited in the psychology literature, and validate +our model on a set of images that have been used in psychology studies. 
Finally, +our work also provides a useful tool for the growing academic study of images— +both photographs and memes—on social networks.",/pdf/c67ea97ba6ec82c7f5cdc05854baf5de21205158.pdf,ICLR,2018, +SJgzXaNFwS,rJxGIsR8DH,1569440000000.0,1577170000000.0,441,HyperEmbed: Tradeoffs Between Resources and Performance in NLP Tasks with Hyperdimensional Computing enabled embedding of n-gram statistics ,"[""pedro.alonso@ltu.se"", ""kumar@neuralspace.ai"", ""denis.kleyko@ltu.se"", ""evgeny.osipov@ltu.se"", ""marcus.liwicki@ltu.se""]","[""Pedro Alonso"", ""Kumar Shridhar"", ""Denis Kleyko"", ""Evgeny Osipov"", ""Marcus Liwicki""]","[""NLP"", ""Hyperdimensional computing"", ""n-gram statistics"", ""word representation"", ""semantic hashing""]","Recent advances in Deep Learning have led to a significant performance increase on several NLP tasks, however, the models become more and more computationally demanding. Therefore, this paper tackles the domain of computationally efficient algorithms for NLP tasks. In particular, it investigates distributed representations of n-gram statistics of texts. The representations are formed using hyperdimensional computing enabled embedding. These representations then serve as features, which are used as input to standard classifiers. We investigate the applicability of the embedding on one large and three small standard datasets for classification tasks using nine classifiers. The embedding achieved on par F1 scores while decreasing the time and memory requirements by several times compared to the conventional n-gram statistics, e.g., for one of the classifiers on a small dataset, the memory reduction was 6.18 times; while train and test speed-ups were 4.62 and 3.84 times, respectively. For many classifiers on the large dataset, the memory reduction was about 100 times and train and test speed-ups were over 100 times. 
More importantly, the usage of distributed representations formed via hyperdimensional computing allows dissecting the strict dependency between the dimensionality of the representation and the parameters of n-gram statistics, thus, opening a room for tradeoffs.",/pdf/ab4a892e7a3cf8a2d57d6bafd685cd94211fbf2a.pdf,ICLR,2020,Tradeoffs Between Resources and Performance in NLP Tasks with Hyperdimensional Computing enabled embedding of n-gram statistics +SkBsEQYll,,1478210000000.0,1486410000000.0,85,Learning similarity preserving representations with neural similarity and context encoders,"[""franziska.horn@campus.tu-berlin.de"", ""klaus-robert.mueller@tu-berlin.de""]","[""Franziska Horn"", ""Klaus-Robert M\u00fcller""]","[""Natural language processing"", ""Unsupervised Learning"", ""Supervised Learning""]","We introduce similarity encoders (SimEc), which learn similarity preserving representations by using a feed-forward neural network to map data into an embedding space where the original similarities can be approximated linearly. The model can easily compute representations for novel (out-of-sample) data points, even if the original pairwise similarities of the training set were generated by an unknown process such as human ratings. This is demonstrated by creating embeddings of both image and text data. +Furthermore, the idea behind similarity encoders gives an intuitive explanation of the optimization strategy used by the continuous bag-of-words (CBOW) word2vec model trained with negative sampling. Based on this insight, we define context encoders (ConEc), which can improve the word embeddings created with word2vec by using the local context of words to create out-of-vocabulary embeddings and representations for words with multiple meanings. 
The benefit of this is illustrated by using these word embeddings as features in the CoNLL 2003 named entity recognition task.",/pdf/5779204f81750d711baecf5fca47c5fc7cb2705c.pdf,ICLR,2017,Neural network way of doing kernel PCA and an extension of word2vec to compute out-of-vocabulary embeddings and distinguish between multiple meanings of a word based on its local context. +Q5ZxoD2LqcI,wISZO2CsSuU,1601310000000.0,1614990000000.0,3267,On the use of linguistic similarities to improve Neural Machine Translation for African Languages,"[""~Tikeng_Notsawo_Pascal1"", ""~NANDA_ASSOBJIO_Brice_Yvan1"", ""~James_Assiene1""]","[""Tikeng Notsawo Pascal"", ""NANDA ASSOBJIO Brice Yvan"", ""James Assiene""]","[""Machine Translation"", ""Multilingualism"", ""Linguistic similarity"", ""Dataset"", ""African languages"", ""Multi-task learning""]","In recent years, there has been a resurgence in research on empirical methods for machine translation. Most of this research has been focused on high-resource, European languages. Despite the fact that around 30% of all languages spoken worldwide are African, the latter have been heavily under investigated and this, partly due to the lack of public parallel corpora online. Furthermore, despite their large number (more than 2,000) and the similarities between them, there is currently no publicly available study on how to use this multilingualism (and associated similarities) to improve machine translation systems performance on African languages. So as to address these issues: +We propose a new dataset for African languages that provides parallel data for vernaculars not present in commonly used dataset like JW300 [1]. To exploit multilingualism, we first use a historical approach based on historical origins of these languages, their morphologies, their geographical and cultural distributions as well as migrations of population to identify similar vernaculars. 
+We also propose a new metric to automatically evaluate similarities between languages. This new metric does not require word level parallelism like traditional methods but only paragraph level parallelism. +We then show that performing Masked Language Modelling and Translation Language Modeling in addition to multi-task learning on a cluster of similar languages leads to a strong boost of performance in translating individual pairs inside this cluster. +In particular, we record an improvement of 29 BLEU on the pair Bafia-Ewondo using our approaches compared to previous work methods that did not exploit multilingualism in any way. + +[1] http://opus.nlpl.eu/JW300.php",/pdf/6792cd44baca61173e335b3c3cb853a39213b586.pdf,ICLR,2021,"In this work, we show that performing multi-task learning on a cluster of similar languages leads to a strong boost of performance in translating individual pairs inside this cluster." +rJxt0JHKvS,Byl2qwyYvB,1569440000000.0,1577170000000.0,2032,Coloring graph neural networks for node disambiguation,"[""george.dasoulas1@gmail.com"", ""kevin.scaman@gmail.com"", ""ludovic.dos.santos@huawei.com"", ""aladin.virmaux@huawei.com""]","[""George Dasoulas"", ""Ludovic Dos Santos"", ""Kevin Scaman"", ""Aladin Virmaux""]","[""Graph neural networks"", ""separability"", ""node disambiguation"", ""universal approximation"", ""representation learning""]","In this paper, we show that a simple coloring scheme can improve, both theoretically and empirically, the expressive power of Message Passing Neural Networks (MPNNs). More specifically, we introduce a graph neural network called Colored Local Iterative Procedure (CLIP) that uses colors to disambiguate identical node attributes, and show that this representation is a universal approximator of continuous functions on graphs with node attributes. Our method relies on separability, a key topological characteristic that allows to extend well-chosen neural networks into universal representations. 
Finally, we show experimentally that CLIP is capable of capturing structural characteristics that traditional MPNNs fail to distinguish, while being state-of-the-art on benchmark graph classification datasets.",/pdf/acdeede4e42d2617fab425b102c9779bc34ea1ee.pdf,ICLR,2020,"This paper introduces a coloring scheme for node disambiguation in graph neural networks based on separability, proven to be a universal MPNN extension." +HJxkvlBtwH,rJM5R9eKDS,1569440000000.0,1577170000000.0,2345,Certifying Neural Network Audio Classifiers,"[""wryou@student.ethz.ch"", ""bmislav@student.ethz.ch"", ""gsingh@inf.ethz.ch"", ""martin.vechev@inf.ethz.ch""]","[""Wonryong Ryou"", ""Mislav Balunovic"", ""Gagandeep Singh"", ""Martin Vechev""]","[""Adversarial Examples"", ""Audio Classifier"", ""Speech Recognition"", ""Certified Robustness"", ""Deep Learning""]","We present the first end-to-end verifier of audio classifiers. Compared to existing methods, our approach enables analysis of both, the entire audio processing stage as well as recurrent neural network architectures (e.g., LSTM). The audio processing is verified using novel convex relaxations tailored to feature extraction operations used in audio (e.g., Fast Fourier Transform) while recurrent architectures are certified via a novel binary relaxation for the recurrent unit update. We show the verifier scales to large networks while computing significantly tighter bounds than existing methods for common audio classification benchmarks: on the challenging Google Speech Commands dataset we certify 95% more inputs than the interval approximation (only prior scalable method), for a perturbation of -90dB.",/pdf/4c0c258cff5bf1bbbd8f41c7cfcd37934f5e01d3.pdf,ICLR,2020,We present the first approach to certify robustness of neural networks against noise-based perturbations in the audio domain. 
+HylTBhA5tQ,ByxzSCs5K7,1538090000000.0,1546590000000.0,1584,The Limitations of Adversarial Training and the Blind-Spot Attack,"[""huan@huan-zhang.com"", ""chenhg@mit.edu"", ""zhaos@utexas.edu"", ""boning@mtl.mit.edu"", ""inderjit@cs.utexas.edu"", ""chohsieh@cs.ucla.edu""]","[""Huan Zhang*"", ""Hongge Chen*"", ""Zhao Song"", ""Duane Boning"", ""Inderjit S. Dhillon"", ""Cho-Jui Hsieh""]","[""Adversarial Examples"", ""Adversarial Training"", ""Blind-Spot Attack""]","The adversarial training procedure proposed by Madry et al. (2018) is one of the most effective methods to defend against adversarial examples in deep neural networks (DNNs). In our paper, we shed some lights on the practicality and the hardness of adversarial training by showing that the effectiveness (robustness on test set) of adversarial training has a strong correlation with the distance between a test point and the manifold of training data embedded by the network. Test examples that are relatively far away from this manifold are more likely to be vulnerable to adversarial attacks. Consequentially, an adversarial training based defense is susceptible to a new class of attacks, the “blind-spot attack”, where the input images reside in “blind-spots” (low density regions) of the empirical distribution of training data but is still on the ground-truth data manifold. For MNIST, we found that these blind-spots can be easily found by simply scaling and shifting image pixel values. Most importantly, for large datasets with high dimensional and complex data manifold (CIFAR, ImageNet, etc), the existence of blind-spots in adversarial training makes defending on any valid test examples difficult due to the curse of dimensionality and the scarcity of training data. 
Additionally, we find that blind-spots also exist on provable defenses including (Kolter & Wong, 2018) and (Sinha et al., 2018) because these trainable robustness certificates can only be practically optimized on a limited set of training data.",/pdf/8a378790c2b538af164f53d751d30d33811f0018.pdf,ICLR,2019,We show that even the strongest adversarial training methods cannot defend against adversarial examples crafted on slightly scaled and shifted test images. +e8W-hsu_q5,nDEatK2aPQa,1601310000000.0,1616070000000.0,1535,Group Equivariant Conditional Neural Processes,"[""~Makoto_Kawano1"", ""~Wataru_Kumagai2"", ""~Akiyoshi_Sannai1"", ""~Yusuke_Iwasawa1"", ""~Yutaka_Matsuo1""]","[""Makoto Kawano"", ""Wataru Kumagai"", ""Akiyoshi Sannai"", ""Yusuke Iwasawa"", ""Yutaka Matsuo""]","[""Neural Processes"", ""Conditional Neural Processes"", ""Stochastic Processes"", ""Regression"", ""Group Equivariance"", ""Symmetry""]","We present the group equivariant conditional neural process (EquivCNP), a meta-learning method with permutation invariance in a data set as in conventional conditional neural processes (CNPs), and it also has transformation equivariance in data space. Incorporating group equivariance, such as rotation and scaling equivariance, provides a way to consider the symmetry of real-world data. We give a decomposition theorem for permutation-invariant and group-equivariant maps, which leads us to construct EquivCNPs with an infinite-dimensional latent space to handle group symmetries. In this paper, we build architecture using Lie group convolutional layers for practical implementation. We show that EquivCNP with translation equivariance achieves comparable performance to conventional CNPs in a 1D regression task. 
Moreover, we demonstrate that incorporating an appropriate Lie group equivariance, EquivCNP is capable of zero-shot generalization for an image-completion task by selecting an appropriate Lie group equivariance.",/pdf/97ca04674c54635beb05af46402f012365bc8226.pdf,ICLR,2021,"A model for regression that learns conditional distributions of a stochastic process, by incorporating group equivariance into Conditional Neural Processes." +DAaaaqPv9-q,ELd3j8_pDVu,1601310000000.0,1614990000000.0,177,Self-supervised Graph-level Representation Learning with Local and Global Structure,"[""~Minghao_Xu1"", ""~Hang_Wang1"", ""~Bingbing_Ni3"", ""~Hongyu_Guo1"", ""~Jian_Tang1""]","[""Minghao Xu"", ""Hang Wang"", ""Bingbing Ni"", ""Hongyu Guo"", ""Jian Tang""]","[""Self-supervised Representation Learning"", ""Graph Representation Learning"", ""Hierarchical Semantic Learning""]","This paper focuses on unsupervised/self-supervised whole-graph representation learning, which is critical in many tasks including drug and material discovery. Current methods can effectively model the local structure between different graph instances, but they fail to discover the global semantic structure of the entire dataset. In this work, we propose a unified framework called Local-instance and Global-semantic Learning (GraphLoG) for self-supervised whole-graph representation learning. Specifically, besides preserving the local instance-level structure, GraphLoG leverages a nonparametric strategy to learn hierarchical prototypes of the data. These prototypes capture the semantic clusters in the latent space, and the number of prototypes can automatically adapt to different feature distributions. We evaluate GraphLoG by pre-training it on massive unlabeled graphs followed by fine-tuning on downstream tasks. Extensive experiments on both chemical and biological benchmark datasets demonstrate the effectiveness of our approach. 
",/pdf/2fe3e8d8c91ff90b4960cc121baf1bcb727dde10.pdf,ICLR,2021,This work seeks to learn the local-instance and global-semantic structure of a set of unlabeled graphs. +S1fQSiCcYm,HkgqhPiF_m,1538090000000.0,1551480000000.0,73,Understanding and Improving Interpolation in Autoencoders via an Adversarial Regularizer,"[""dberth@google.com"", ""craffel@gmail.com"", ""aurkor@google.com"", ""goodfellow@google.com""]","[""David Berthelot*"", ""Colin Raffel*"", ""Aurko Roy"", ""Ian Goodfellow""]","[""autoencoders"", ""interpolation"", ""unsupervised learning"", ""representation learning"", ""adversarial learning""]","Autoencoders provide a powerful framework for learning compressed representations by encoding all of the information needed to reconstruct a data point in a latent code. In some cases, autoencoders can ""interpolate"": By decoding the convex combination of the latent codes for two datapoints, the autoencoder can produce an output which semantically mixes characteristics from the datapoints. In this paper, we propose a regularization procedure which encourages interpolated outputs to appear more realistic by fooling a critic network which has been trained to recover the mixing coefficient from interpolated data. We then develop a simple benchmark task where we can quantitatively measure the extent to which various autoencoders can interpolate and show that our regularizer dramatically improves interpolation in this setting. We also demonstrate empirically that our regularizer produces latent codes which are more effective on downstream tasks, suggesting a possible link between interpolation abilities and learning useful representations.",/pdf/6f8c41dcd45651f410da3b3fbe9d0fbdfe7765cf.pdf,ICLR,2019,We propose a regularizer that improves interpolation and autoencoders and show that it also improves the learned representation for downstream tasks. 
+XvOH0v2hsph,V6JlG_HK37A,1601310000000.0,1614990000000.0,1643,Revisiting the Train Loss: an Efficient Performance Estimator for Neural Architecture Search,"[""~Binxin_Ru1"", ""~Clare_Lyle1"", ""~Lisa_Schut2"", ""~Mark_van_der_Wilk1"", ""~Yarin_Gal1""]","[""Binxin Ru"", ""Clare Lyle"", ""Lisa Schut"", ""Mark van der Wilk"", ""Yarin Gal""]","[""performance estimation"", ""neural architecture search""]","Reliable yet efficient evaluation of generalisation performance of a proposed architecture is crucial to the success of neural architecture search (NAS). Traditional approaches face a variety of limitations: training each architecture to completion is prohibitively expensive, early stopping estimates may correlate poorly with fully trained performance, and model-based estimators require large training sets. Instead, motivated by recent results linking training speed and generalisation with stochastic gradient descent, we propose to estimate the final test performance based on the sum of training losses. Our estimator is inspired by the marginal likelihood, which is used for Bayesian model selection. Our model-free estimator is simple, efficient, and cheap to implement, and does not require hyperparameter-tuning or surrogate training before deployment. We demonstrate empirically that our estimator consistently outperforms other baselines under various settings and can achieve a rank correlation of 0.95 with final test accuracy on the NAS-Bench201 dataset within 50 epochs.",/pdf/1c3d611d342229a14856d44acecfa11843c0a879.pdf,ICLR,2021,We propose a simple yet reliable method for estimating the generalisation performance of neural architectures; our method utilises early training losses and has theoretical interpretation based on training speed and marginal likelihood. 
+SkxpDT4YvS,BJlAtvsvwH,1569440000000.0,1577170000000.0,613,Policy Optimization with Stochastic Mirror Descent,"[""yanglong@zju.edu.cn"", ""gang_zheng@zju.edu.cn"", ""21721269@zju.edu.cn"", ""hzzhangyu@zju.edu.cn"", ""csqianzheng@gmail.com"", ""junwen@zju.edu.cn"", ""gpan@zju.edu.cn""]","[""Long Yang"", ""Gang Zheng"", ""Zavier Zhang"", ""Yu Zhang"", ""Qian Zheng"", ""Jun Wen"", ""Gang Pan""]","[""reinforcement learning"", ""policy gradient"", ""stochastic variance reduce gradient"", ""sample efficiency"", ""stochastic mirror descent""]","Improving sample efficiency has been a longstanding goal in reinforcement learning. +In this paper, we propose the $\mathtt{VRMPO}$: a sample efficient policy gradient method with stochastic mirror descent. +A novel variance reduced policy gradient estimator is the key of $\mathtt{VRMPO}$ to improve sample efficiency. +Our $\mathtt{VRMPO}$ needs only $\mathcal{O}(\epsilon^{-3})$ sample trajectories to achieve an $\epsilon$-approximate first-order stationary point, +which matches the best-known sample complexity. +We conduct extensive experiments to show our algorithm outperforms state-of-the-art policy gradient methods in various settings.",/pdf/f1d37dc71a0930967cd1b61379b12654d359ae56.pdf,ICLR,2020,We propose a sample efficient policy gradient method with stochastic mirror descent via conducting a variance reduced policy gradient estimator. 
+Bke89JBtvB,Skx9vcRODB,1569440000000.0,1585900000000.0,1879,Batch-shaping for learning conditional channel gated networks,"[""behtesha@qti.qualcomm.com"", ""tijmen@qti.qualcomm.com"", ""mwelling@qti.qualcomm.com""]","[""Babak Ehteshami Bejnordi"", ""Tijmen Blankevoort"", ""Max Welling""]","[""Conditional computation"", ""channel gated networks"", ""gating"", ""Batch-shaping"", ""distribution matching"", ""image classification"", ""semantic segmentation""]","We present a method that trains large capacity neural networks with significantly improved accuracy and lower dynamic computational cost. This is achieved by gating the deep-learning architecture on a fine-grained-level. Individual convolutional maps are turned on/off conditionally on features in the network. To achieve this, we introduce a new residual block architecture that gates convolutional channels in a fine-grained manner. We also introduce a generally applicable tool batch-shaping that matches the marginal aggregate posteriors of features in a neural network to a pre-specified prior distribution. We use this novel technique to force gates to be more conditional on the data. We present results on CIFAR-10 and ImageNet datasets for image classification, and Cityscapes for semantic segmentation. Our results show that our method can slim down large architectures conditionally, such that the average computational cost on the data is on par with a smaller architecture, but with higher accuracy. In particular, on ImageNet, our ResNet50 and ResNet34 gated networks obtain 74.60% and 72.55% top-1 accuracy compared to the 69.76% accuracy of the baseline ResNet18 model, for similar complexity. 
We also show that the resulting networks automatically learn to use more features for difficult examples and fewer features for simple examples.",/pdf/a94357b0df417addd9e81f8f1bb81f45fae91aa1.pdf,ICLR,2020,A method that trains large capacity neural networks with significantly improved accuracy and lower dynamic computational cost +H1fl8S9ee,,1478280000000.0,1488480000000.0,249,Learning and Policy Search in Stochastic Dynamical Systems with Bayesian Neural Networks,"[""stefan.depeweg@siemens.com"", ""jmh233@cam.ac.uk"", ""finale@seas.harvard.edu"", ""steffen.udluft@siemens.com""]","[""Stefan Depeweg"", ""Jos\u00e9 Miguel Hern\u00e1ndez-Lobato"", ""Finale Doshi-Velez"", ""Steffen Udluft""]","[""Deep learning"", ""Reinforcement Learning""]","We present an algorithm for policy search in stochastic dynamical systems using +model-based reinforcement learning. The system dynamics are described with +Bayesian neural networks (BNNs) that include stochastic input variables. These +input variables allow us to capture complex statistical +patterns in the transition dynamics (e.g. multi-modality and +heteroskedasticity), which are usually missed by alternative modeling approaches. After +learning the dynamics, our BNNs are then fed into an algorithm that performs +random roll-outs and uses stochastic optimization for policy learning. We train +our BNNs by minimizing $\alpha$-divergences with $\alpha = 0.5$, which usually produces better +results than other techniques such as variational Bayes. 
We illustrate the performance of our method by +solving a challenging problem where model-based approaches usually fail and by +obtaining promising results in real-world scenarios including the control of a +gas turbine and an industrial benchmark.",/pdf/0f8cf63dfd4729681243a85630c19e3a14793321.pdf,ICLR,2017, +piek7LGx7j,9OTtc5phXTe,1601310000000.0,1614990000000.0,2784,Improving the Reconstruction of Disentangled Representation Learners via Multi-Stage Modelling,"[""~Akash_Srivastava1"", ""~Yamini_Bansal1"", ""~Yukun_Ding1"", ""~Cole_Lincoln_Hurwitz1"", ""~Kai_Xu4"", ""~Bernhard_Egger1"", ""~Prasanna_Sattigeri1"", ""~Joshua_B._Tenenbaum1"", ""~Dan_Gutfreund1""]","[""Akash Srivastava"", ""Yamini Bansal"", ""Yukun Ding"", ""Cole Lincoln Hurwitz"", ""Kai Xu"", ""Bernhard Egger"", ""Prasanna Sattigeri"", ""Joshua B. Tenenbaum"", ""Dan Gutfreund""]","[""disentanglement"", ""disentangled representation learning"", ""vae"", ""generative model""]","Current autoencoder-based disentangled representation learning methods achieve disentanglement by penalizing the (aggregate) posterior to encourage statistical independence of the latent factors. This approach introduces a trade-off between disentangled representation learning and reconstruction quality since the model does not have enough capacity to learn correlated latent variables that capture detail information present in most image data. To overcome this trade-off, we present a novel multi-stage modelling approach where the disentangled factors are first learned using a preexisting disentangled representation learning method (such as $\beta$-TCVAE); then, the low-quality reconstruction is improved with another deep generative model that is trained to model the missing correlated latent variables, adding detail information while maintaining conditioning on the previously learned disentangled factors. 
Taken together, our multi-stage modelling approach results in single, coherent probabilistic model that is theoretically justified by the principal of D-separation and can be realized with a variety of model classes including likelihood-based models such as variational autoencoders, implicit models such as generative adversarial networks, and tractable models like normalizing flows or mixtures of Gaussians. We demonstrate that our multi-stage model has much higher reconstruction quality than current state-of-the-art methods with equivalent disentanglement performance across multiple standard benchmarks.",/pdf/44494c8c56eb5e4ab0c9e4c0b504fd647946b1cc.pdf,ICLR,2021, +pAbm1qfheGk,ZuRl-rw9mu,1601310000000.0,1620960000000.0,1699,Learning Neural Generative Dynamics for Molecular Conformation Generation,"[""~Minkai_Xu1"", ""~Shitong_Luo1"", ""~Yoshua_Bengio1"", ""~Jian_Peng1"", ""~Jian_Tang1""]","[""Minkai Xu"", ""Shitong Luo"", ""Yoshua Bengio"", ""Jian Peng"", ""Jian Tang""]","[""Molecular conformation generation"", ""deep generative models"", ""continuous normalizing flow"", ""energy-based models""]","We study how to generate molecule conformations (i.e., 3D structures) from a molecular graph. Traditional methods, such as molecular dynamics, sample conformations via computationally expensive simulations. Recently, machine learning methods have shown great potential by training on a large collection of conformation data. Challenges arise from the limited model capacity for capturing complex distributions of conformations and the difficulty in modeling long-range dependencies between atoms. Inspired by the recent progress in deep generative models, in this paper, we propose a novel probabilistic framework to generate valid and diverse conformations given a molecular graph. 
We propose a method combining the advantages of both flow-based and energy-based models, enjoying: (1) a high model capacity to estimate the multimodal conformation distribution; (2) explicitly capturing the complex long-range dependencies between atoms in the observation space. Extensive experiments demonstrate the superior performance of the proposed method on several benchmarks, including conformation generation and distance modeling tasks, with a significant improvement over existing generative models for molecular conformation sampling.",/pdf/90d50e0ca739c22ebb906023d446dc8c8f98a7e0.pdf,ICLR,2021,A novel probabilistic framework to generate valid and diverse molecular conformations. Reaching state-of-the-art results on conformation generation and inter-atomic distance modeling. +nEMiSX_ipXr,5OOKegimoZH,1601310000000.0,1614990000000.0,36,Proper Measure for Adversarial Robustness,"[""~Hyeongji_Kim1"", ""ketil@malde.org""]","[""Hyeongji Kim"", ""Ketil Malde""]","[""adversarial examples"", ""adversarial robustness"", ""adversarial accuracy"", ""nearest neighbor classifiers""]","This paper analyzes the problems of adversarial accuracy and adversarial training. We argue that standard adversarial accuracy fails to properly measure the robustness of classifiers. Its definition has a tradeoff with standard accuracy even when we neglect generalization. In order to handle the problems of the standard adversarial accuracy, we introduce a new measure for the robustness of classifiers called genuine adversarial accuracy. It can measure the adversarial robustness of classifiers without trading off accuracy on clean data and accuracy on the adversarially perturbed samples. In addition, it does not favor a model with invariance-based adversarial examples, samples whose predicted classes are unchanged even if the perceptual classes are changed. 
We prove that a single nearest neighbor (1-NN) classifier is the most robust classifier according to genuine adversarial accuracy for given data and a norm-based distance metric when the class for each data point is unique. Based on this result, we suggest that using poor distance metrics might be one factor for the tradeoff between test accuracy and $l_p$ norm-based test adversarial robustness.",/pdf/5eb4172ce3ce5ec92a0a0af189a50a4a1423da29.pdf,ICLR,2021,This paper introduces a new measure for the robustness of classifiers and suggests one possible factor for the tradeoff between test accuracy and adversarial robustness. +rJx8ylSKvr,rkl6Y9kKvr,1569440000000.0,1577170000000.0,2064,Leveraging Entanglement Entropy for Deep Understanding of Attention Matrix in Text Matching,"[""pzhang@tju.edu.cn"", ""xiaoliumao@tju.edu.cn"", ""xindianma@tju.edu.cn"", ""wang@dei.unipd.it"", ""18738996120@163.com"", ""jun.wang@cs.ucl.ac.uk"", ""dwsong@bit.edu.cn""]","[""Peng Zhang"", ""XiaoLiu Mao"", ""XinDian Ma"", ""BenYou Wang"", ""Jing Zhang"", ""Jun Wang"", ""DaWei Song""]","[""Quantum entanglement entropy"", ""Attention Matrix""]","The formal understanding of deep learning has made great progress based on quantum many-body physics. For example, the entanglement entropy in quantum many-body systems can interpret the inductive bias of neural network and then guide the design of network structure and parameters for certain tasks. However, there are two unsolved problems in the current study of entanglement entropy, which limits its application potential. First, the theoretical benefits of entanglement entropy was only investigated in the representation of a single object (e.g., an image or a sentence), but has not been well studied in the matching of two objects (e.g., question-answering pairs). Second, the entanglement entropy can not be qualitatively calculated since the exponentially increasing dimension of the matching matrix. 
In this paper, we address these two problems by investigating the fundamental connections between the entanglement entropy and the attention matrix. We prove that by a mapping (via the trace operator) on the high-dimensional matching matrix, a low-dimensional attention matrix can be derived. Based on such an attention matrix, we can provide a feasible solution to the entanglement entropy that describes the correlation between the two objects in matching tasks. Inspired by the theoretical property of the entanglement entropy, we can design the network architecture adaptively in a typical text matching task, i.e., a question-answering task.",/pdf/fa813fbd3ae8bcf6343c7ff87d2fe8d00c6cffad.pdf,ICLR,2020, +BJa0ECFxe,,1478250000000.0,1484090000000.0,155,Information Dropout: learning optimal representations through noise,"[""achille@cs.ucla.edu"", ""soatto@cs.ucla.edu""]","[""Alessandro Achille"", ""Stefano Soatto""]","[""Theory"", ""Deep learning""]","We introduce Information Dropout, a generalization of dropout that is motivated by the Information Bottleneck principle and highlights the way in which injecting noise in the activations can help in learning optimal representations of the data. Information Dropout is rooted in information theoretic principles, it includes as special cases several existing dropout methods, like Gaussian Dropout and Variational Dropout, and, unlike classical dropout, it can learn and build representations that are invariant to nuisances of the data, like occlusions and clutter. When the task is the reconstruction of the input, we show that the information dropout method yields a variational autoencoder as a special case, thus providing a link between representation learning, information theory and variational inference. 
Our experiments validate the theoretical intuitions behind our method, and we find that information dropout achieves a comparable or better generalization performance than binary dropout, especially on smaller models, since it can automatically adapt the noise to the structure of the network, as well as to the test sample.",/pdf/5e733bdd823f29c481dbd25788c7897cd4977ef6.pdf,ICLR,2017,"We introduce Information Dropout, an information theoretic generalization of dropout that highlights how injecting noise can help in learning invariant representations." +ptbb7olhGHd,2JG6CTMehzk,1601310000000.0,1614990000000.0,2665,On the Robustness of Sentiment Analysis for Stock Price Forecasting,"[""~Gabriel_Deza1"", ""c.rowat@bham.ac.uk"", ""~Nicolas_Papernot1""]","[""Gabriel Deza"", ""Colin Rowat"", ""Nicolas Papernot""]","[""adversarial machine learning"", ""adversarial examples"", ""stock price forecasting"", ""finance""]","Machine learning (ML) models are known to be vulnerable to attacks both at training and test time. Despite the extensive literature on adversarial ML, prior efforts focus primarily on applications of computer vision to object recognition or sentiment analysis to movie reviews. In these settings, the incentives for adversaries to manipulate the model's prediction are often unclear and attacks require extensive control of direct inputs to the model. This makes it difficult to evaluate how severe the impact of vulnerabilities exposed is on systems deploying ML with little provenance guarantees for the input data. In this paper, we study adversarial ML with stock price forecasting. Adversarial incentives are clear and may be quantified experimentally through a simulated portfolio. We replicate an industry standard pipeline, which performs a sentiment analysis of Twitter data to forecast trends in stock prices. 
We show that an adversary can exploit the lack of provenance to indirectly use tweets to manipulate the model's perceived sentiment about a target company and in turn force the model to forecast price erroneously. Our attack is mounted at test time and does not modify the training data. Given past market anomalies, we conclude with a series of recommendations for the use of machine learning as input signal to trading algorithms. ",/pdf/e09c9c680c27d4ae123a208f4e30957460bb97f3.pdf,ICLR,2021,"Replicate industry standard sentiment analysis of Twitter data for stock price forecasting, and demonstrate its vulnerability to adversarial manipulations of the model's inputs." +P__qBPffIlK,dhkxEZpDS7UE,1601310000000.0,1614990000000.0,598,Adversarial representation learning for synthetic replacement of private attributes,"[""~John_Martinsson1"", ""~Edvin_Listo_Zec1"", ""~Daniel_Gillblad1"", ""~Olof_Mogren1""]","[""John Martinsson"", ""Edvin Listo Zec"", ""Daniel Gillblad"", ""Olof Mogren""]","[""Deep learning"", ""privacy"", ""generative adversarial networks""]","Data privacy is an increasingly important aspect of many real-world big data analytics tasks. Data sources that contain sensitive information may have immense potential which could be unlocked using privacy enhancing transformations, but current methods often fail to produce convincing output. Furthermore, finding the right balance between privacy and utility is often a tricky trade-off. In this work, we propose a novel approach for data privatization, which involves two steps: in the first step, it removes the sensitive information, and in the second step, it replaces this information with an independent random sample. Our method builds on adversarial representation learning which ensures strong privacy by training the model to fool an increasingly strong adversary. 
While previous methods only aim at obfuscating the sensitive information, we find that adding new random information in its place strengthens the provided privacy and provides better utility at any given level of privacy. The result is an approach that can provide stronger privatization on image data, and yet be preserving both the domain and the utility of the inputs, entirely independent of the downstream task.",/pdf/333c93dcf05309dcd4a2a3e6f612dfb0854993e4.pdf,ICLR,2021,"Explores if realistic synthetic replacement of sensitive attributes leads to stronger privacy, and empirically studies the privacy vs. utility trade-off for learned privacy preserving image transformations." +RmcPm9m3tnk,qT6RYm5zCzg,1601310000000.0,1616080000000.0,2593,Generative Scene Graph Networks,"[""~Fei_Deng1"", ""~Zhuo_Zhi1"", ""donghun@etri.re.kr"", ""~Sungjin_Ahn1""]","[""Fei Deng"", ""Zhuo Zhi"", ""Donghun Lee"", ""Sungjin Ahn""]","[""object-centric representations"", ""generative modeling"", ""scene generation"", ""variational autoencoders""]","Human perception excels at building compositional hierarchies of parts and objects from unlabeled scenes that help systematic generalization. Yet most work on generative scene modeling either ignores the part-whole relationship or assumes access to predefined part labels. In this paper, we propose Generative Scene Graph Networks (GSGNs), the first deep generative model that learns to discover the primitive parts and infer the part-whole relationship jointly from multi-object scenes without supervision and in an end-to-end trainable way. We formulate GSGN as a variational autoencoder in which the latent representation is a tree-structured probabilistic scene graph. The leaf nodes in the latent tree correspond to primitive parts, and the edges represent the symbolic pose variables required for recursively composing the parts into whole objects and then the full scene. 
This allows novel objects and scenes to be generated both by sampling from the prior and by manual configuration of the pose variables, as we do with graphics engines. We evaluate GSGN on datasets of scenes containing multiple compositional objects, including a challenging Compositional CLEVR dataset that we have developed. We show that GSGN is able to infer the latent scene graph, generalize out of the training regime, and improve data efficiency in downstream tasks.",/pdf/4972f3189bc1990cd88f0c12abbe7111acfe3c15.pdf,ICLR,2021,We propose the first object-centric generative model capable of unsupervised scene graph discovery from multi-object scenes without access to predefined parts. +Hyxsl2AqKm,rylWVZRctQ,1538090000000.0,1545360000000.0,1105,ON THE EFFECTIVENESS OF TASK GRANULARITY FOR TRANSFER LEARNING,"[""farzaneh@cs.toronto.edu"", ""guillaume.berger@twentybn.com"", ""waseem.gharbieh@twentybn.com"", ""fleet@cs.toronto.edu"", ""roland.memisevic@twentybn.com""]","[""Farzaneh Mahdisoltani"", ""Guillaume Berger"", ""Waseem Gharbieh"", ""David Fleet"", ""Roland Memisevic""]","[""Transfer Learning"", ""Video Understanding"", ""Fine-grained Video Classification"", ""Video Captioning"", ""Common Sense"", ""Something-Something Dataset.""]","We describe a DNN for video classification and captioning, trained end-to-end, +with shared features, to solve tasks at different levels of granularity, exploring the +link between granularity in a source task and the quality of learned features for +transfer learning. For solving the new task domain in transfer learning, we freeze +the trained encoder and fine-tune an MLP on the target domain. We train on the +Something-Something dataset with over 220, 000 videos, and multiple levels of +target granularity, including 50 action groups, 174 fine-grained action categories +and captions. 
Classification and captioning with Something-Something are challenging +because of the subtle differences between actions, applied to thousands +of different object classes, and the diversity of captions penned by crowd actors. +Our model performs better than existing classification baselines for Something-Something, +with impressive fine-grained results. And it yields a strong baseline on +the new Something-Something captioning task. Experiments reveal that training +with more fine-grained tasks tends to produce better features for transfer learning.",/pdf/59be2f74b76794a3b5e5b2d439bd3d5f00e28e49.pdf,ICLR,2019,"If the model architecture is fixed, how would the complexity and granularity of the task affect the quality of learned features for transferring to a new task." +uSYfytRBh-f,5jyW1uSHs05,1601310000000.0,1614990000000.0,1480,Efficiently Troubleshooting Image Segmentation Models with Human-In-The-Loop,"[""~Haotao_Wang1"", ""~Tianlong_Chen1"", ""~Zhangyang_Wang1"", ""~Kede_Ma2""]","[""Haotao Wang"", ""Tianlong Chen"", ""Zhangyang Wang"", ""Kede Ma""]",[],"Image segmentation lays the foundation for many high-stakes vision applications such as autonomous driving and medical image analysis. It is, therefore, of great importance to not only improve the accuracy of segmentation models on well-established benchmarks, but also enhance their robustness in the real world so as to avoid sparse but fatal failures. In this paper, instead of chasing state-of-the-art performance on existing benchmarks, we turn our attention to a new challenging problem: how to efficiently expose failures of ``top-performing'' segmentation models in the real world and how to leverage such counterexamples to rectify the models. To achieve this with minimal human labelling effort, we first automatically sample a small set of images that are likely to falsify the target model from a large corpus of web images via the maximum discrepancy competition principle. 
We then propose a weakly labelling strategy to further reduce the number of false positives, before time-consuming pixel-level labelling by humans. Finally, we fine-tune the model to harness the identified failures, and repeat the whole process, resulting in an efficient and progressive framework for troubleshooting segmentation models. We demonstrate the feasibility of our framework using the semantic segmentation task in PASCAL VOC, and find that the fine-tuned model exhibits significantly improved generalization when applied to real-world images with greater content diversity. All experimental codes will be publicly released upon acceptance.",/pdf/ec18d4997aeddc30aa654603f7e02225104c4427.pdf,ICLR,2021, +B1lGU64tDr,HJeSd7_wvr,1569440000000.0,1587930000000.0,552,Relational State-Space Model for Stochastic Multi-Object Systems,"[""fanyang01@zju.edu.cn"", ""lingchen@cs.zju.edu.cn"", ""fanzhou@zju.edu.cn"", ""jianchuan.gys@alibaba-inc.com"", ""mingsong.cw@alibaba-inc.com""]","[""Fan Yang"", ""Ling Chen"", ""Fan Zhou"", ""Yusong Gao"", ""Wei Cao""]","[""state-space model"", ""time series"", ""deep sequential model"", ""graph neural network""]","Real-world dynamical systems often consist of multiple stochastic subsystems that interact with each other. Modeling and forecasting the behavior of such dynamics are generally not easy, due to the inherent hardness in understanding the complicated interactions and evolutions of their constituents. This paper introduces the relational state-space model (R-SSM), a sequential hierarchical latent variable model that makes use of graph neural networks (GNNs) to simulate the joint state transitions of multiple correlated objects. By letting GNNs cooperate with SSM, R-SSM provides a flexible way to incorporate relational information into the modeling of multi-object dynamics. 
We further suggest augmenting the model with normalizing flows instantiated for vertex-indexed random variables and propose two auxiliary contrastive objectives to facilitate the learning. The utility of R-SSM is empirically evaluated on synthetic and real time series datasets.",/pdf/de07d28ba461bc0a4cc08286cfc79cb112ffee03.pdf,ICLR,2020,A deep hierarchical state-space model in which the state transitions of correlated objects are coordinated by graph neural networks. +rJf0BjAqYX,HylJkiLDYm,1538090000000.0,1545360000000.0,134,Like What You Like: Knowledge Distill via Neuron Selectivity Transfer,"[""zehaohuang18@gmail.com"", ""winsty@gmail.com""]","[""Zehao Huang"", ""Naiyan Wang""]","[""Knowledge Distill""]","Despite deep neural networks have demonstrated extraordinary power in various applications, their superior performances are at expense of high storage and computational costs. Consequently, the acceleration and compression of neural networks have attracted much attention recently. Knowledge Transfer (KT), which aims at training a smaller student network by transferring knowledge from a larger teacher model, is one of the popular solutions. In this paper, we propose a novel knowledge transfer method by treating it as a distribution matching problem. Particularly, we match the distributions of neuron selectivity patterns between teacher and student networks. To achieve this goal, we devise a new KT loss function by minimizing the Maximum Mean Discrepancy (MMD) metric between these distributions. Combined with the original loss function, our method can significantly improve the performance of student networks. We validate the effectiveness of our method across several datasets, and further combine it with other KT methods to explore the best possible results. Last but not least, we fine-tune the model to other tasks such as object detection. 
The results are also encouraging, which confirm the transferability of the learned features.",/pdf/3f4fd0cca1cf1530bd1eb068eddb668ec8c2826e.pdf,ICLR,2019,We treat knowledge distill as a distribution matching problem and adopt Maximum Mean Discrepancy to minimize the distances between student features and teacher features. +HkxlcnVFwB,HkgurdeRUS,1569440000000.0,1583910000000.0,102,GenDICE: Generalized Offline Estimation of Stationary Values,"[""ryzhang@cs.duke.edu"", ""bodai@google.com"", ""lihongli.cs@gmail.com"", ""schuurmans@google.com""]","[""Ruiyi Zhang*"", ""Bo Dai*"", ""Lihong Li"", ""Dale Schuurmans""]","[""Off-policy Policy Evaluation"", ""Reinforcement Learning"", ""Stationary Distribution Correction Estimation"", ""Fenchel Dual""]","An important problem that arises in reinforcement learning and Monte Carlo methods is estimating quantities defined by the stationary distribution of a Markov chain. In many real-world applications, access to the underlying transition operator is limited to a fixed set of data that has already been collected, without additional interaction with the environment being available. We show that consistent estimation remains possible in this scenario, and that effective estimation can still be achieved in important applications. Our approach is based on estimating a ratio that corrects for the discrepancy between the stationary and empirical distributions, derived from fundamental properties of the stationary distribution, and exploiting constraint reformulations based on variational divergence minimization. The resulting algorithm, GenDICE, is straightforward and effective. 
We prove the consistency of the method under general conditions, provide a detailed error analysis, and demonstrate strong empirical performance on benchmark tasks, including off-line PageRank and off-policy policy evaluation.",/pdf/e359c9f4a7a14094671ffc723863544111ede3e2.pdf,ICLR,2020,"In this paper, we proposed a novel algorithm, GenDICE, for general stationary distribution correction estimation, which can handle both discounted and average off-policy evaluation on multiple behavior-agnostic samples." +SyxIterYwS,H1lv1RlKwH,1569440000000.0,1577170000000.0,2435,Dynamical System Embedding for Efficient Intrinsically Motivated Artificial Agents,"[""philipzhao@berkeley.edu"", ""stas@berkeley.edu"", ""pabbeel@cs.berkeley.edu""]","[""Ruihan Zhao"", ""Stas Tiomkin"", ""Pieter Abbeel""]","[""intrinsic motivation"", ""empowerment"", ""latent representation"", ""encoder""]","Mutual Information between agent Actions and environment States (MIAS) quantifies the influence of agent on its environment. Recently, it was found that intrinsic motivation in artificial agents emerges from the maximization of MIAS. +For example, empowerment is an information-theoretic approach to intrinsic motivation, which has been shown to solve a broad range of standard RL benchmark problems. The estimation of empowerment for arbitrary dynamics is a challenging problem because it relies on the estimation of MIAS. Existing approaches rely on sampling, which have formal limitations, requiring exponentially many samples. In this work, we develop a novel approach for the estimation of empowerment in unknown arbitrary dynamics from visual stimulus only, without sampling for the estimation of MIAS. The core idea is to represent the relation between action sequences and future states by a stochastic dynamical system in latent space, which admits an efficient estimation of MIAS by the ``Water-Filling"" algorithm from information theory. 
We construct this embedding with deep neural networks trained on a novel objective function and demonstrate our approach by numerical simulations of non-linear continuous-time dynamical systems. We show that the designed embedding preserves information-theoretic properties of the original dynamics, and enables us to solve the standard AI benchmark problems.",/pdf/b2e293a273cf11cb81cf992cc2b7d3a7981d7901.pdf,ICLR,2020,A faster approach to calculate empowerment from images. +Hkg1csA5Y7,SygP3lwFFX,1538090000000.0,1545360000000.0,496,A fast quasi-Newton-type method for large-scale stochastic optimisation,"[""adrian.wills@newcastle.edu.au"", ""thomas.schon@it.uu.se"", ""carl.jidling@it.uu.se""]","[""Adrian Wills"", ""Thomas B. Sch\u00f6n"", ""Carl Jidling""]","[""optimisation"", ""large-scale"", ""stochastic""]","During recent years there has been an increased interest in stochastic adaptations of limited memory quasi-Newton methods, which compared to pure gradient-based routines can improve the convergence by incorporating second order information. In this work we propose a direct least-squares approach conceptually similar to the limited memory quasi-Newton methods, but that computes the search direction in a slightly different way. This is achieved in a fast and numerically robust manner by maintaining a Cholesky factor of low dimension. This is combined with a stochastic line search relying upon fulfilment of the Wolfe condition in a backtracking manner, where the step length is adaptively modified with respect to the optimisation progress. We support our new algorithm by providing several theoretical results guaranteeing its performance. 
The performance is demonstrated on real-world benchmark problems, where it shows improved results in comparison with already established methods.",/pdf/7d1445801d8532955f393e2d065de27633f40da7.pdf,ICLR,2019, +rJlMBjAcYX,B1lnJ_ActX,1538090000000.0,1545360000000.0,66,Optimizing for Generalization in Machine Learning with Cross-Validation Gradients,"[""sbarratt@stanford.edu"", ""rsh@stanford.edu""]","[""Barratt"", ""Shane"", ""Sharma"", ""Rishi""]",[],"Cross-validation is the workhorse of modern applied statistics and machine learning, as it provides a principled framework for selecting the model that maximizes generalization performance. In this paper, we show that the cross-validation risk is differentiable with respect to the hyperparameters and training data for many common machine learning algorithms, including logistic regression, elastic-net regression, and support vector machines. Leveraging this property of differentiability, we propose a cross-validation gradient method (CVGM) for hyperparameter optimization. Our method enables efficient optimization in high-dimensional hyperparameter spaces of the cross-validation risk, the best surrogate of the true generalization ability of our learning algorithm.",/pdf/1e5875aa94670102890369a0ee6874ac48b77ee2.pdf,ICLR,2019, +ryDNZZZAW,BkIEbW-A-,1509130000000.0,1518730000000.0,651,Multiple Source Domain Adaptation with Adversarial Learning,"[""han.zhao@cs.cmu.edu"", ""shanghaz@andrew.cmu.edu"", ""guanhanw@andrew.cmu.edu"", ""jpc@isr.ist.utl.pt"", ""moura@andrew.cmu.edu"", ""ggordon@cs.cmu.edu""]","[""Han Zhao"", ""Shanghang Zhang"", ""Guanhang Wu"", ""Jo\\~{a}o P. Costeira"", ""Jos\\'{e} M. F. Moura"", ""Geoffrey J. Gordon""]","[""adversarial learning"", ""domain adaptation""]","While domain adaptation has been actively researched in recent years, most theoretical results and algorithms focus on the single-source-single-target adaptation setting. 
Naive application of such algorithms on multiple source domain adaptation problem may lead to suboptimal solutions. We propose a new generalization bound for domain adaptation when there are multiple source domains with labeled instances and one target domain with unlabeled instances. Compared with existing bounds, the new bound does not require expert knowledge about the target distribution, nor the optimal combination rule for multisource domains. Interestingly, our theory also leads to an efficient learning strategy using adversarial neural networks: we show how to interpret it as learning feature representations that are invariant to the multiple domain shifts while still being discriminative for the learning task. To this end, we propose two models, both of which we call multisource domain adversarial networks (MDANs): the first model optimizes directly our bound, while the second model is a smoothed approximation of the first one, leading to a more data-efficient and task-adaptive model. The optimization tasks of both models are minimax saddle point problems that can be optimized by adversarial training. To demonstrate the effectiveness of MDANs, we conduct extensive experiments showing superior adaptation performance on three real-world datasets: sentiment analysis, digit classification, and vehicle counting. +",/pdf/9bcfbe55ee3b895c1d3dca964002144f747f1088.pdf,ICLR,2018, +SJloA0EYDr,HJgxN95dvr,1569440000000.0,1577170000000.0,1444,A⋆MCTS: SEARCH WITH THEORETICAL GUARANTEE USING POLICY AND VALUE FUNCTIONS,"[""xwu20@stanford.edu"", ""yuandong@fb.com"", ""lexing@stanford.edu""]","[""Xian Wu"", ""Yuandong Tian"", ""Lexing Ying""]","[""tree search"", ""reinforcement learning"", ""value neural network"", ""policy neural network""]","Combined with policy and value neural networks, Monte Carlos Tree Search (MCTS) is a critical component of the recent success of AI agents in learning to play board games like Chess and Go (Silver et al., 2017). 
However, the theoretical foundations of MCTS with policy and value networks remain open. Inspired by MCTS, we propose A⋆MCTS, a novel search algorithm that uses both the policy and value predictors to guide search and enjoys theoretical guarantees. Specifically, assuming that value and policy networks give reasonably accurate signals of the values of each state and action, the sample complexity (number of calls to the value network) to estimate the value of the current state, as well as the optimal one-step action to take from the current state, can be bounded. We apply our theoretical framework to different models for the noise distribution of the policy and value network as well as the distribution of rewards, and show that for these general models, the sample complexity is polynomial in D, where D is the depth of the search tree. Empirically, our method outperforms MCTS in these models.",/pdf/3bb8205f689d7e79b60b75b3ac8809eba449f44d.pdf,ICLR,2020,theoretical and experimental results for novel tree search algorithm that efficiently finds optimal policy +ryg8WJSKPr,SylZicjOvr,1569440000000.0,1577170000000.0,1544,ConQUR: Mitigating Delusional Bias in Deep Q-Learning,"[""andy.2008.su@gmail.com"", ""jayden@alum.mit.edu"", ""tyler.lu@gmail.com"", ""schuurmans@google.com"", ""cboutilier@google.com""]","[""DiJia-Andy Su"", ""Jayden Ooi"", ""Tyler Lu"", ""Dale Schuurmans"", ""Craig Boutilier\u200e""]","[""reinforcement learning"", ""q-learning"", ""deep reinforcement learning"", ""Atari""]","Delusional bias is a fundamental source of error in approximate Q-learning. To date, the only techniques that explicitly address delusion require comprehensive search using tabular value estimates. In this paper, we develop efficient methods to mitigate delusional bias by training Q-approximators with labels that are ""consistent"" with the underlying greedy policy class. 
We introduce a simple penalization scheme that encourages Q-labels used across training batches to remain (jointly) consistent with the expressible policy class. We also propose a search framework that allows multiple Q-approximators to be generated and tracked, thus mitigating the effect of premature (implicit) policy commitments. Experimental results demonstrate that these methods can improve the performance of Q-learning in a variety of Atari games, sometimes dramatically.",/pdf/a1a85197d02a4ccac04f04b4cb6ffcbe512d96ba.pdf,ICLR,2020,We developed a search framework and consistency penalty to mitigate delusional bias. +ryxtCpNtDS,BJg_ZpZ_DB,1569440000000.0,1577170000000.0,863,Autoencoders and Generative Adversarial Networks for Imbalanced Sequence Classification,"[""stephanieger@u.northwestern.edu"", ""d-klabjan@northwestern.edu""]","[""Stephanie Ger"", ""Diego Klabjan""]","[""imbalanced multivariate time series classification""]","We introduce a novel synthetic oversampling method for variable length, multi-feature sequence datasets based on autoencoders and generative adversarial networks. We show that this method improves classification accuracy for highly imbalanced sequence classification tasks. We show that this method outperforms standard oversampling techniques that use techniques such as SMOTE and autoencoders. We also use generative adversarial networks on the majority class as an outlier detection method for novelty detection, with limited classification improvement. We show that the use of generative adversarial network based synthetic data improves classification model performance on a variety of sequence data sets. +",/pdf/7d53de57a3ebd9cc3bb7d99ba0243ebaa4fc389a.pdf,ICLR,2020,"We introduce a novel oversampling method for variable length, multivariate time series data that significantly improves classification accuracy." 
+Syl5o2EFPB,rJgs21f-Pr,1569440000000.0,1577170000000.0,162,Learning Compact Reward for Image Captioning,"[""live@whu.edu.cn"", ""zzchen@whu.edu.cn""]","[""Nannan Li"", ""Zhenzhong Chen""]","[""image captioning"", ""adversarial learning"", ""inverse reinforcement learning"", ""vision"", ""language""]","Adversarial learning has shown its advances in generating natural and diverse descriptions in image captioning. However, the learned reward of existing adversarial methods is vague and ill-defined due to the reward ambiguity problem. In this paper, we propose a refined Adversarial Inverse Reinforcement Learning (rAIRL) method to handle the reward ambiguity problem by disentangling reward for each word in a sentence, as well as achieve stable adversarial training by refining the loss function to shift the stationary point towards Nash equilibrium. In addition, we introduce a conditional term in the loss function to mitigate mode collapse and to increase the diversity of the generated descriptions. Our experiments on MS COCO show that our method can learn compact reward for image captioning.",/pdf/6768c8ee42d22f75f4f1faba43be039f971af73a.pdf,ICLR,2020,a refined AIRL algorithm that learns compact reward for image captioning +SP5RHi-rdlJ,a2H26irNbLG,1601310000000.0,1614990000000.0,3133,Sparse Binary Neural Networks,"[""~Riccardo_Schiavone1"", ""~Maria_A_Zuluaga1""]","[""Riccardo Schiavone"", ""Maria A Zuluaga""]","[""Binary Neural Networks"", ""Sparsity"", ""Deep Neural Network Compression""]","Quantized neural networks are gaining popularity thanks to their ability to solve complex tasks with comparable accuracy as full-precision Deep Neural Networks (DNNs), while also reducing computational power and storage requirements and increasing the processing speed. These properties make them an attractive alternative for the development and deployment of DNN-based applications in Internet-Of-Things (IoT) devices. 
Among quantized networks, Binary Neural Networks (BNNs) have reported the largest speed-up. However, they suffer from a fixed and limited compression factor that may result insufficient for certain devices with very limited resources. In this work, we propose Sparse Binary Neural Networks, a novel model and training scheme that allows to introduce sparsity in BNNs by using positive 0/1 binary weights, instead of the -1/+1 weights used by state-of-the-art binary networks. As a result, our method is able to achieve a high compression factor and reduces the number of operations and parameters at inference time. We study the properties of our method through experiments on linear and convolutional networks over MNIST and CIFAR-10 datasets. Experiments confirm that SBNNs can achieve high compression rates and good generalization, while further reducing the operations of BNNs, making it a viable option for deploying DNNs in very cheap and low-cost IoT devices and sensors.",/pdf/dc09ab7442310fe345de4c5fc099fb92f2026848.pdf,ICLR,2021, +Dw8vAUKYq8C,A2-lwtCdEQK,1601310000000.0,1614990000000.0,2567,Near-Optimal Glimpse Sequences for Training Hard Attention Neural Networks,"[""~William_Harvey1"", ""~Michael_Teng1"", ""~Frank_Wood2""]","[""William Harvey"", ""Michael Teng"", ""Frank Wood""]","[""attention"", ""hard attention"", ""variational inference"", ""bayesian optimal experimental design""]","Hard visual attention is a promising approach to reduce the computational burden of modern computer vision methodologies. Hard attention mechanisms are typically non-differentiable. They can be trained with reinforcement learning but the high-variance training this entails hinders more widespread application. We show how hard attention for image classification can be framed as a Bayesian optimal experimental design (BOED) problem. 
From this perspective, the optimal locations to attend to are those which provide the greatest expected reduction in the entropy of the classification distribution. We introduce methodology from the BOED literature to approximate this optimal behaviour, and use it to generate `near-optimal' sequences of attention locations. We then show how to use such sequences to partially supervise, and therefore speed up, the training of a hard attention mechanism. Although generating these sequences is computationally expensive, they can be reused by any other networks later trained on the same task.",/pdf/2e46fc754d2e2f2c4a13c872553c7a59daf514e5.pdf,ICLR,2021,We use Bayesian experimental design to produce sequences which are later used to provide a supervision signal for a hard attention network and greatly speed up its training. +SyeHPgHFDr,ryX4_ogtDB,1569440000000.0,1577170000000.0,2359,Finding Deep Local Optima Using Network Pruning,"[""yguo@math.fsu.edu"", ""yshe@stat.fsu.edu"", ""ywu@stat.ucla.edu"", ""abarbu@stat.fsu.edu""]","[""Yangzi Guo"", ""Yiyuan She"", ""Ying Nian Wu"", ""Adrian Barbu""]","[""network pruning"", ""non-convex optimization""]","Artificial neural networks (ANNs) are very popular nowadays and offer reliable solutions to many classification problems. However, training deep neural networks (DNN) is time-consuming due to the large number of parameters. Recent research indicates that these DNNs might be over-parameterized and different solutions have been proposed to reduce the complexity both in the number of parameters and in the training time of the neural networks. Furthermore, some researchers argue that after reducing the neural network complexity via connection pruning, the remaining weights are irrelevant and retraining the sub-network would obtain a comparable accuracy with the original one. 
+This may hold true in most vision problems where we always enjoy a large number of training samples and research indicates that most local optima of the convolutional neural networks may be equivalent. However, in non-vision sparse datasets, especially with many irrelevant features where a standard neural network would overfit, this might not be the case and there might be many non-equivalent local optima. This paper presents empirical evidence for these statements and an empirical study of the learnability of neural networks (NNs) on some challenging non-linear real and simulated data with irrelevant variables. +Our simulation experiments indicate that the cross-entropy loss function on XOR-like data has many local optima, and the number of local optima grows exponentially with the number of irrelevant variables. +We also introduce a connection pruning method to improve the capability of NNs to find a deep local minimum even when there are irrelevant variables. +Furthermore, the performance of the discovered sparse sub-network degrades considerably either by retraining from scratch or the corresponding original initialization, due to the existence of many bad optima around. 
+Finally, we will show that the performance of neural networks for real-world experiments on sparse datasets can be recovered or even improved by discovering a good sub-network architecture via connection pruning.",/pdf/b9fda564dbfa824e5fdd7fb8b8207fc683303fe1.pdf,ICLR,2020, +BJl2_nVFPB,S1gZcxztUH,1569440000000.0,1583910000000.0,57,Automatically Discovering and Learning New Visual Categories with Ranking Statistics,"[""khan@robots.ox.ac.uk"", ""srebuffi@robots.ox.ac.uk"", ""hyenal@robots.ox.ac.uk"", ""vedaldi@robots.ox.ac.uk"", ""az@robots.ox.ac.uk""]","[""Kai Han"", ""Sylvestre-Alvise Rebuffi"", ""Sebastien Ehrhardt"", ""Andrea Vedaldi"", ""Andrew Zisserman""]","[""deep learning"", ""classification"", ""novel classes"", ""transfer learning"", ""clustering"", ""incremental learning""]","We tackle the problem of discovering novel classes in an image collection given labelled examples of other classes. This setting is similar to semi-supervised learning, but significantly harder because there are no labelled examples for the new classes. The challenge, then, is to leverage the information contained in the labelled images in order to learn a general-purpose clustering model and use the latter to identify the new classes in the unlabelled data. In this work we address this problem by combining three ideas: (1) we suggest that the common approach of bootstrapping an image representation using the labeled data only introduces an unwanted bias, and that this can be avoided by using self-supervised learning to train the representation from scratch on the union of labelled and unlabelled data; (2) we use rank statistics to transfer the model's knowledge of the labelled classes to the problem of clustering the unlabelled images; and, (3) we train the data representation by optimizing a joint objective function on the labelled and unlabelled subsets of the data, improving both the supervised classification of the labelled data, and the clustering of the unlabelled data. 
We evaluate our approach on standard classification benchmarks and outperform current methods for novel category discovery by a significant margin.",/pdf/f36740e753d9dfd53293136d078ffa6b3a6b7ea1.pdf,ICLR,2020,"A method to automatically discover new categories in unlabelled data, by effectively transferring knowledge from labelled data of other different categories using feature rank statistics." +YhhEarKSli9,Zufi2ZxeuC,1601310000000.0,1614990000000.0,3687,AutoBayes: Automated Bayesian Graph Exploration for Nuisance-Robust Inference,"[""~Andac_Demir1"", ""~Toshiaki_Koike-Akino1"", ""~Ye_Wang2"", ""~Deniz_Erdogmus1""]","[""Andac Demir"", ""Toshiaki Koike-Akino"", ""Ye Wang"", ""Deniz Erdogmus""]",[],"Learning data representations that capture task-related features, but are invariant to nuisance variations remains a key challenge in machine learning. We introduce an automated Bayesian inference framework, called AutoBayes, that explores different graphical models linking classifier, encoder, decoder, estimator and adversarial network blocks to optimize nuisance-invariant machine learning pipelines. AutoBayes also enables learning disentangled representations, where the latent variable is split into multiple pieces to impose various relationships with the nuisance variation and task labels. We benchmark the framework on several public datasets, and provide analysis of its capability for subject-transfer learning with/without variational modeling and adversarial training. We demonstrate a significant performance improvement with ensemble learning across explored graphical models.",/pdf/da959653b77f7553ab65344627b8facfd3905eb5.pdf,ICLR,2021, +SylGpT4FPS,Hkgi-YxuvH,1569440000000.0,1577170000000.0,809,Last-iterate convergence rates for min-max optimization,"[""prof@gatech.edu"", ""nykal212@gmail.com"", ""andrwbsn@gmail.com""]","[""Jacob Abernethy"", ""Kevin A. 
Lai"", ""Andre Wibisono""]","[""min-max optimization"", ""zero-sum game"", ""saddle point"", ""last-iterate convergence"", ""non-asymptotic convergence"", ""global rates"", ""Hamiltonian"", ""sufficiently bilinear""]","While classic work in convex-concave min-max optimization relies on average-iterate convergence results, the emergence of nonconvex applications such as training Generative Adversarial Networks has led to renewed interest in last-iterate convergence guarantees. Proving last-iterate convergence is challenging because many natural algorithms, such as Simultaneous Gradient Descent/Ascent, provably diverge or cycle even in simple convex-concave min-max settings, and previous work on global last-iterate convergence rates has been limited to the bilinear and convex-strongly concave settings. In this work, we show that the Hamiltonian Gradient Descent (HGD) algorithm achieves linear convergence in a variety of more general settings, including convex-concave problems that satisfy a “sufficiently bilinear” condition. We also prove similar convergence rates for some parameter settings of the Consensus Optimization (CO) algorithm of Mescheder et al. 2017.",/pdf/6f72820a437b2f6c98d2a204229acc0e5dfa4486.pdf,ICLR,2020,We prove that global linear last-iterate convergence rates are achievable for more general classes of convex-concave min-max optimization problems than had previously been shown. +gBpYGXH9J7F,DTmMAJIu6-C,1601310000000.0,1614990000000.0,2092,Online Learning under Adversarial Corruptions,"[""~Pranjal_Awasthi3"", ""~Sreenivas_Gollapudi2"", ""kostaskollias@google.com"", ""apaar@google.com""]","[""Pranjal Awasthi"", ""Sreenivas Gollapudi"", ""Kostas Kollias"", ""Apaar Sadhwani""]","[""Online Learning"", ""Learning Theory"", ""Bandits"", ""Robustness"", ""Adversarial Corruptions""]","We study the design of efficient online learning algorithms tolerant to adversarially corrupted rewards. 
In particular, we study settings where an online algorithm makes a prediction at each time step, and receives a stochastic reward from the environment that can be arbitrarily corrupted with probability $\epsilon \in [0,\frac 1 2)$. Here $\epsilon$ is the noise rate that characterizes the strength of the adversary. As is standard in online learning, we study the design of algorithms with small regret over a period of time steps. However, while the algorithm observes corrupted rewards, we require its regret to be small with respect to the true uncorrupted reward distribution. We build upon recent advances in robust estimation for unsupervised learning problems to design robust online algorithms with near optimal regret in three different scenarios: stochastic multi-armed bandits, linear contextual bandits, and Markov Decision Processes~(MDPs) with stochastic rewards and transitions. Finally, we provide empirical evidence regarding the robustness of our proposed algorithms on synthetic and real datasets.",/pdf/7563d0042179c1b32a1b549e92bd3688b909cd4e.pdf,ICLR,2021,We initiate a theoretical study and design near-optimal algorithms for online learning in stochastic environments with adversarially corrupted rewards. +5NsEIflpbSv,QgJp4bY5PXJ,1601310000000.0,1614150000000.0,2910,Learning Better Structured Representations Using Low-rank Adaptive Label Smoothing,"[""~Asish_Ghoshal2"", ""~Xilun_Chen1"", ""sonalgupta@fb.com"", ""~Luke_Zettlemoyer1"", ""~Yashar_Mehdad2""]","[""Asish Ghoshal"", ""Xilun Chen"", ""Sonal Gupta"", ""Luke Zettlemoyer"", ""Yashar Mehdad""]","[""label smoothing"", ""calibration"", ""semantic parsing"", ""structured prediction""]","Training with soft targets instead of hard targets has been shown to improve performance and calibration of deep neural networks. Label smoothing is a popular way of computing soft targets, where one-hot encoding of a class is smoothed with a uniform distribution. 
Owing to its simplicity, label smoothing has found wide-spread use for training deep neural networks on a wide variety of tasks, ranging from image and text classification to machine translation and semantic parsing. Complementing recent empirical justification for label smoothing, we obtain PAC-Bayesian generalization bounds for label smoothing and show that the generalization error depends on the choice of the noise (smoothing) distribution. Then we propose low-rank adaptive label smoothing (LORAS): a simple yet novel method for training with learned soft targets that generalizes label smoothing and adapts to the latent structure of the label space in structured prediction tasks. Specifically, we evaluate our method on semantic parsing tasks and show that training with appropriately smoothed soft targets can significantly improve accuracy and model calibration, especially in low-resource settings. Used in conjunction with pre-trained sequence-to-sequence models, our method achieves state of the art performance on four semantic parsing data sets. LORAS can be used with any model, improves performance and implicit model calibration without increasing the number of model parameters, and can be scaled to problems with large label spaces containing tens of thousands of labels.",/pdf/4538feaf2c0ace4bc3472484186d4cda25dc7c01.pdf,ICLR,2021,We propose an extension of label smoothing which improves generalization performance by adapting to the structure present in label space of structured prediction tasks. 
+HyVxPsC9tm,rJgxHi7tFX,1538090000000.0,1545360000000.0,235,DynCNN: An Effective Dynamic Architecture on Convolutional Neural Network for Surveillance Videos,"[""b10113120@gmail.com"", ""pctsainb@gmail.com"", ""sjruan@mail.ntust.edu.tw""]","[""De-Qin Gao"", ""Ping-Chen Tsai"", ""Shanq-Jang Ruan""]","[""CNN optimization"", ""Reduction on convolution calculation"", ""dynamic convolution"", ""surveillance video""]","The large-scale surveillance video analysis becomes important as the development of intelligent city. The heavy computation resources necessary for state-of-the-art deep learning model makes the real-time processing hard to be implemented. This paper exploits the characteristic of high scene similarity generally existing in surveillance videos and proposes dynamic convolution reusing the previous feature map to reduce the computation amount. We tested the proposed method on 45 surveillance videos with various scenes. The experimental results show that dynamic convolution can reduce up to 75.7% of FLOPs while preserving the precision within 0.7% mAP. Furthermore, the dynamic convolution can enhance the processing time up to 2.2 times.",/pdf/61635e98b377ec6686d0dc9478ca1e9c7a0616c1.pdf,ICLR,2019,An optimizing architecture on CNN for surveillance videos with 75.7% reduction on FLOPs and 2.2 times improvement on FPS +SyfIfnC5Ym,HygMqg65F7,1538090000000.0,1556160000000.0,1267,Improving the Generalization of Adversarial Training with Domain Adaptation,"[""cbsong@hust.edu.cn"", ""brooklet60@hust.edu.cn"", ""wanglw@pku.edu.cn"", ""jeh@cs.cornell.edu""]","[""Chuanbiao Song"", ""Kun He"", ""Liwei Wang"", ""John E. Hopcroft""]","[""adversarial training"", ""domain adaptation"", ""adversarial example"", ""deep learning""]","By injecting adversarial examples into training data, adversarial training is promising for improving the robustness of deep learning models. However, most existing adversarial training approaches are based on a specific type of adversarial attack. 
It may not provide sufficiently representative samples from the adversarial domain, leading to a weak generalization ability on adversarial examples from other attacks. Moreover, during the adversarial training, adversarial perturbations on inputs are usually crafted by fast single-step adversaries so as to scale to large datasets. This work is mainly focused on the adversarial training yet efficient FGSM adversary. In this scenario, it is difficult to train a model with great generalization due to the lack of representative adversarial samples, aka the samples are unable to accurately reflect the adversarial domain. To alleviate this problem, we propose a novel Adversarial Training with Domain Adaptation (ATDA) method. Our intuition is to regard the adversarial training on FGSM adversary as a domain adaption task with limited number of target domain samples. The main idea is to learn a representation that is semantically meaningful and domain invariant on the clean domain as well as the adversarial domain. Empirical evaluations on Fashion-MNIST, SVHN, CIFAR-10 and CIFAR-100 demonstrate that ATDA can greatly improve the generalization of adversarial training and the smoothness of the learned models, and outperforms state-of-the-art methods on standard benchmark datasets. To show the transfer ability of our method, we also extend ATDA to the adversarial training on iterative attacks such as PGD-Adversarial Training (PAT) and the defense performance is improved considerably.",/pdf/162ba462e1a11da0c9e74559bd6ef96f918f3b59.pdf,ICLR,2019,We propose a novel adversarial training with domain adaptation method that significantly improves the generalization ability on adversarial examples from different attacks. 
+SyeKGgStDB,rygsszxtvS,1569440000000.0,1577170000000.0,2181,Training a Constrained Natural Media Painting Agent using Reinforcement Learning ,"[""biao@cs.umd.edu"", ""jbrandt@adobe.com"", ""rmech@adobe.com"", ""nxu@adobe.com"", ""bmkim@adobe.com"", ""dm@cs.umd.edu""]","[""Biao Jia"", ""Jonathan Brandt"", ""Radomir Mech"", ""Ning Xu"", ""Byungmoon Kim"", ""Dinesh Manocha""]",[],"We present a novel approach to train a natural media painting using reinforcement learning. Given a reference image, our formulation is based on stroke-based rendering that imitates human drawing and can be learned from scratch without supervision. Our painting agent computes a sequence of actions that represent the primitive painting strokes. In order to ensure that the generated policy is predictable and controllable, we use a constrained learning method and train the painting agent using the environment model and follows the commands encoded in an observation. We have applied our approach on many benchmarks and our results demonstrate that our constrained agent can handle different painting media and different constraints in the action space to collaborate with humans or other agents. +",/pdf/afa890445030c115fa0f638977e7a1915a4355fd.pdf,ICLR,2020,"We train a natural media painting agent using environment model. Based on our painting agent, we present a novel approach to train a constrained painting agent that follows the command encoded in the observation." 
+rylZKTNYPr,Syxjz2hDPS,1569440000000.0,1577170000000.0,659,Inferring Dynamical Systems with Long-Range Dependencies through Line Attractor Regularization,"[""dominik.schmidt@zi-mannheim.de"", ""georgia.koppe@zi-mannheim.de"", ""max.beutelspacher@mailbox.org"", ""daniel.durstewitz@zi-mannheim.de""]","[""Dominik Schmidt"", ""Georgia Koppe"", ""Max Beutelspacher"", ""Daniel Durstewitz""]","[""Recurrent Neural Networks"", ""Nonlinear State Space Models"", ""Generative Models"", ""Long short-term memory"", ""vanishing/exploding gradient problem"", ""Nonlinear dynamics"", ""Interpretable machine learning"", ""Time series analysis""]","Vanilla RNN with ReLU activation have a simple structure that is amenable to systematic dynamical systems analysis and interpretation, but they suffer from the exploding vs. vanishing gradients problem. Recent attempts to retain this simplicity while alleviating the gradient problem are based on proper initialization schemes or orthogonality/unitary constraints on the RNN’s recurrency matrix, which, however, comes with limitations to its expressive power with regards to dynamical systems phenomena like chaos or multi-stability. Here, we instead suggest a regularization scheme that pushes part of the RNN’s latent subspace toward a line attractor configuration that enables long short-term memory and arbitrarily slow time scales. We show that our approach excels on a number of benchmarks like the sequential MNIST or multiplication problems, and enables reconstruction of dynamical systems which harbor widely different time scales.",/pdf/cc754318adfd2d26846dc3fad71cca5568adf5f5.pdf,ICLR,2020,We develop a new optimization approach for vanilla ReLU-based RNN that enables long short-term memory and identification of arbitrary nonlinear dynamical systems with widely differing time scales. 
+HklpCzC6-,r1ka0fCTW,1508940000000.0,1518730000000.0,87,Image Segmentation by Iterative Inference from Conditional Score Estimation,"[""adriana.romsor@gmail.com"", ""michal.drozdzal@gmail.com"", ""akram.er-raqabi@umontreal.ca"", ""simon.jegou@gmail.com"", ""yoshua.umontreal@gmail.com""]","[""Adriana Romero"", ""Michal Drozdzal"", ""Akram Erraqabi"", ""Simon J\u00e9gou"", ""Yoshua Bengio""]","[""semantic segmentation"", ""conditional denoising autoencoders"", ""iterative inference""]","Inspired by the combination of feedforward and iterative computations in the visual cortex, and taking advantage of the ability of denoising autoencoders to estimate the score of a joint distribution, we propose a novel approach to iterative inference for capturing and exploiting the complex joint distribution of output variables conditioned on some input variables. This approach is applied to image pixel-wise segmentation, with the estimated conditional score used to perform gradient ascent towards a mode of the estimated conditional distribution. This extends previous work on score estimation by denoising autoencoders to the case of a conditional distribution, with a novel use of a corrupted feedforward predictor replacing Gaussian corruption. An advantage of this approach over more classical ways to perform iterative inference for structured outputs, like conditional random fields (CRFs), is that it is not any more necessary to define an explicit energy function linking the output variables. To keep computations tractable, such energy function parametrizations are typically fairly constrained, involving only a few neighbors of each of the output variables in each clique. 
We experimentally find that the proposed iterative inference from conditional score estimation by conditional denoising autoencoders performs better than comparable models based on CRFs or those not using any explicit modeling of the conditional joint distribution of outputs.",/pdf/f844e5d7c6118e10815244aaa2dfa2f63e4e7cc9.pdf,ICLR,2018,Refining segmentation proposals by performing iterative inference with conditional denoising autoencoders. +rJl2E3AcF7,H1xd-Fa5F7,1538090000000.0,1545360000000.0,1485,Doubly Sparse: Sparse Mixture of Sparse Experts for Efficient Softmax Inference,"[""sliao3@cs.toronto.edu"", ""tingchen@cs.ucla.edu"", ""tianlin@google.com"", ""dennyzhou@google.com"", ""chongw@google.com""]","[""Shun Liao"", ""Ting Chen"", ""Tian Lin"", ""Chong Wang"", ""Dengyong Zhou""]","[""hierarchical softmax"", ""model compression""]","Computations for the softmax function in neural network models are expensive when the number of output classes is large. This can become a significant issue in both training and inference for such models. In this paper, we present Doubly Sparse Softmax (DS-Softmax), Sparse Mixture of Sparse of Sparse Experts, to improve the efficiency for softmax inference. During training, our method learns a two-level class hierarchy by dividing entire output class space into several partially overlapping experts. Each expert is responsible for a learned subset of the output class space and each output class only belongs to a small number of those experts. During inference, our method quickly locates the most probable expert to compute small-scale softmax. Our method is learning-based and requires no knowledge of the output class partition space a priori. 
We empirically evaluate our method on several real-world tasks and demonstrate that we can achieve significant computation reductions without loss of performance.",/pdf/78ddd47243202590c917cc43cd6f1121dc3f967d.pdf,ICLR,2019,"We present doubly sparse softmax, the sparse mixture of sparse of sparse experts, to improve the efficiency for softmax inference through exploiting the two-level overlapping hierarchy. " +H1l0O6EYDH,r1gZE9hPDB,1569440000000.0,1577170000000.0,653,A NEW POINTWISE CONVOLUTION IN DEEP NEURAL NETWORKS THROUGH EXTREMELY FAST AND NON PARAMETRIC TRANSFORMS,"[""doublejtoh@khu.ac.kr"", ""shbae@khu.ac.kr""]","[""Joonhyun Jeong"", ""Sung-Ho Bae""]","[""Pointwise Convolution"", ""Discrete Walsh-Hadamard Transform"", ""Discrete Cosine-Transform""]"," Some conventional transforms such as Discrete Walsh-Hadamard Transform (DWHT) and Discrete Cosine Transform (DCT) have been widely used as feature extractors in image processing but rarely applied in neural networks. However, we found that these conventional transforms have the ability to capture the cross-channel correlations without any learnable parameters in DNNs. This paper firstly proposes to apply conventional transforms on pointwise convolution, showing that such transforms significantly reduce the computational complexity of neural networks without accuracy performance degradation. Especially for DWHT, it requires no floating point multiplications but only additions and subtractions, which can considerably reduce computation overheads. In addition, its fast algorithm further reduces complexity of floating point addition from O(n^2) to O(nlog n). These non-parametric and low computational properties construct extremely efficient networks in the number parameters and operations, enjoying accuracy gain. 
Our proposed DWHT-based model gained 1.49% accuracy increase with 79.4% reduced parameters and 48.4% reduced FLOPs compared with its baseline model (MobileNet-V1) on the CIFAR 100 dataset.",/pdf/4fbb3e8b7b61fb95f7a5b15d112b579833a9ab4c.pdf,ICLR,2020,We introduce new pointwise convolution layers equipped with extremely fast conventional transforms in deep neural network. +ByeqORgAW,B1JcOCgAb,1509120000000.0,1519140000000.0,458,Proximal Backpropagation,"[""thomas.frerix@tum.de"", ""thomas.moellenhoff@in.tum.de"", ""michael.moeller@uni-siegen.de"", ""cremers@tum.de""]","[""Thomas Frerix"", ""Thomas M\u00f6llenhoff"", ""Michael Moeller"", ""Daniel Cremers""]",[],"We propose proximal backpropagation (ProxProp) as a novel algorithm that takes implicit instead of explicit gradient steps to update the network parameters during neural network training. Our algorithm is motivated by the step size limitation of explicit gradient descent, which poses an impediment for optimization. ProxProp is developed from a general point of view on the backpropagation algorithm, currently the most common technique to train neural networks via stochastic gradient descent and variants thereof. Specifically, we show that backpropagation of a prediction error is equivalent to sequential gradient descent steps on a quadratic penalty energy, which comprises the network activations as variables of the optimization. We further analyze theoretical properties of ProxProp and in particular prove that the algorithm yields a descent direction in parameter space and can therefore be combined with a wide variety of convergent algorithms. Finally, we devise an efficient numerical implementation that integrates well with popular deep learning frameworks. 
We conclude by demonstrating promising numerical results and show that ProxProp can be effectively combined with common first order optimizers such as Adam.",/pdf/929a91844f9632ae796c5758312d1a6e18bb8119.pdf,ICLR,2018, +rJaE2alRW,r1hN3axAb,1509120000000.0,1518730000000.0,428,Autoregressive Convolutional Neural Networks for Asynchronous Time Series,"[""mikbinkowski@gmail.com"", ""gautier.marti@gmail.com"", ""pdonnat@helleborecapital.com""]","[""Mikolaj Binkowski"", ""Gautier Marti"", ""Philippe Donnat""]","[""neural networks"", ""convolutional neural networks"", ""time series"", ""asynchronous data"", ""regression""]","We propose Significance-Offset Convolutional Neural Network, a deep convolutional network architecture for regression of multivariate asynchronous time series. The model is inspired by standard autoregressive (AR) models and gating mechanisms used in recurrent neural networks. It involves an AR-like weighting system, where the final predictor is obtained as a weighted sum of adjusted regressors, while the weights are data-dependent functions learnt through a convolutional network. The architecture was designed for applications on asynchronous time series and is evaluated on such datasets: a hedge fund proprietary dataset of over 2 million quotes for a credit derivative index, an artificially generated noisy autoregressive series and household electricity consumption dataset. The proposed architecture achieves promising results as compared to convolutional and recurrent neural networks. The code for the numerical experiments and the architecture implementation will be shared online to make the research reproducible.",/pdf/bbb435e3486be2a2e9b5d652b9d0a00d2fe0d411.pdf,ICLR,2018,Convolutional architecture for learning data-dependent weights for autoregressive forecasting of time series. 
+7JSTDTZtn7-,uOyZ0y_dk0,1601310000000.0,1614990000000.0,450,Byzantine-Robust Learning on Heterogeneous Datasets via Resampling,"[""~Lie_He1"", ""~Sai_Praneeth_Karimireddy1"", ""~Martin_Jaggi1""]","[""Lie He"", ""Sai Praneeth Karimireddy"", ""Martin Jaggi""]","[""Byzantine robustness"", ""distributed training"", ""heterogeneous dataset""]","In Byzantine-robust distributed optimization, a central server wants to train a machine learning model over data distributed across multiple workers. However, a fraction of these workers may deviate from the prescribed algorithm and send arbitrary messages to the server. While this problem has received significant attention recently, most current defenses assume that the workers have identical data distribution. For realistic cases when the data across workers are heterogeneous (non-iid), we design new attacks that circumvent these defenses leading to significant loss of performance. We then propose a universal resampling scheme that addresses data heterogeneity at a negligible computational cost. We theoretically and experimentally validate our approach, showing that combining resampling with existing robust algorithms is effective against challenging attacks. +",/pdf/aba1621ca222bc044205b7898220b7813e8ca531.pdf,ICLR,2021,"In this paper, we studied robust distributed learning problem under realistic heterogeneous data and proposed a general resampling technique which greatly improves the current robust aggregation rules on heterogeneous data." +7TBP8k7TLFA,0UbuuC0c6Lk,1601310000000.0,1614990000000.0,1251,Universal Approximation Theorem for Equivariant Maps by Group CNNs,"[""~Wataru_Kumagai2"", ""~Akiyoshi_Sannai1""]","[""Wataru Kumagai"", ""Akiyoshi Sannai""]","[""Universal Approximation Theorem"", ""CNN"", ""Deep Learning"", ""Symmetry""]","Group symmetry is inherent in a wide variety of data distributions. 
Data processing that preserves symmetry is described as an equivariant map and often effective in achieving high performance. Convolutional neural networks (CNNs) have been known as models with equivariance and shown to approximate equivariant maps for some specific groups. However, universal approximation theorems for CNNs have been separately derived with individual techniques according to each group and setting. This paper provides a unified method to obtain universal approximation theorems for equivariant maps by CNNs in various settings. As its significant advantage, we can handle non-linear equivariant maps between infinite-dimensional spaces for non-compact groups.",/pdf/22df638445dfc6bcee454327ac9be35a6381bef1.pdf,ICLR,2021,This paper provides a unified method to obtain universal approximation theorems for equivariant maps by CNNs in various settings. +H1xFWgrFPS,B1xMCelFwr,1569440000000.0,1583910000000.0,2144,Explanation by Progressive Exaggeration,"[""sumedha.singla@pitt.edu"", ""kayhan@pitt.edu"", ""cjx880409@gmail.com"", ""kayhan@pitt.edu""]","[""Sumedha Singla"", ""Brian Pollack"", ""Junxiang Chen"", ""Kayhan Batmanghelich""]","[""Explain"", ""deep learning"", ""black box"", ""GAN"", ""counterfactual""]","As machine learning methods see greater adoption and implementation in high stakes applications such as medical image diagnosis, the need for model interpretability and explanation has become more critical. Classical approaches that assess feature importance (eg saliency maps) do not explain how and why a particular region of an image is relevant to the prediction. We propose a method that explains the outcome of a classification black-box by gradually exaggerating the semantic effect of a given class. Given a query input to a classifier, our method produces a progressive set of plausible variations of that query, which gradually change the posterior probability from its original class to its negation. 
These counter-factually generated samples preserve features unrelated to the classification decision, such that a user can employ our method as a ``tuning knob'' to traverse a data manifold while crossing the decision boundary. Our method is model agnostic and only requires the output value and gradient of the predictor with respect to its input.",/pdf/0668079e8510cc0d717c920f2fd8f93c12b1dbec.pdf,ICLR,2020,"A method to explain a classifier, by generating visual perturbation of an image by exaggerating or diminishing the semantic features that the classifier associates with a target label." +B1xDq2EFDH,SylRModyvH,1569440000000.0,1577170000000.0,119,Analytical Moment Regularizer for Training Robust Networks,"[""modar.alfadly@kaust.edu.sa"", ""adel.bibi@kaust.edu.sa"", ""muhammed.kocabas@tue.mpg.de"", ""bernard.ghanem@kaust.edu.sa""]","[""Modar Alfadly"", ""Adel Bibi"", ""Muhammed Kocabas"", ""Bernard Ghanem""]","[""robustness"", ""analytic regularizer"", ""first moment""]","Despite the impressive performance of deep neural networks (DNNs) on numerous learning tasks, they still exhibit uncouth behaviours. One puzzling behaviour is the subtle sensitive reaction of DNNs to various noise attacks. Such a nuisance has strengthened the line of research around developing and training noise-robust networks. In this work, we propose a new training regularizer that aims to minimize the probabilistic expected training loss of a DNN subject to a generic Gaussian input. We provide an efficient and simple approach to approximate such a regularizer for arbitrarily deep networks. This is done by leveraging the analytic expression of the output mean of a shallow neural network, avoiding the need for memory and computation expensive data augmentation. We conduct extensive experiments on LeNet and AlexNet on various datasets including MNIST, CIFAR10, and CIFAR100 to demonstrate the effectiveness of our proposed regularizer. 
In particular, we show that networks that are trained with the proposed regularizer benefit from a boost in robustness against Gaussian noise to an equivalent amount of performing 3-21 folds of noisy data augmentation. Moreover, we empirically show on several architectures and datasets that improving robustness against Gaussian noise, by using the new regularizer, can improve the overall robustness against 6 other types of attacks by two orders of magnitude.",/pdf/49aa146261ded92b6fa112444b84b290b0ab0a3b.pdf,ICLR,2020,An efficient estimate to the Gaussian first moment of DNNs as a regularizer to training robust networks. +H0syOoy3Ash,bL5lJD9WOp,1601310000000.0,1615920000000.0,540,Average-case Acceleration for Bilinear Games and Normal Matrices,"[""~Carles_Domingo-Enrich1"", ""~Fabian_Pedregosa1"", ""~Damien_Scieur3""]","[""Carles Domingo-Enrich"", ""Fabian Pedregosa"", ""Damien Scieur""]","[""Smooth games"", ""First-order Methods"", ""Acceleration"", ""Bilinear games"", ""Average-case Analysis"", ""Orthogonal Polynomials""]","Advances in generative modeling and adversarial learning have given rise to renewed interest in smooth games. However, the absence of symmetry in the matrix of second derivatives poses challenges that are not present in the classical minimization framework. While a rich theory of average-case analysis has been developed for minimization problems, little is known in the context of smooth games. In this work we take a first step towards closing this gap by developing average-case optimal first-order methods for a subset of smooth games. +We make the following three main contributions. First, we show that for zero-sum bilinear games the average-case optimal method is the optimal method for the minimization of the Hamiltonian. Second, we provide an explicit expression for the optimal method corresponding to normal matrices, potentially non-symmetric. 
Finally, we specialize it to matrices with eigenvalues located in a disk and show a provable speed-up compared to worst-case optimal algorithms. We illustrate our findings through benchmarks with a varying degree of mismatch with our assumptions.",/pdf/fe8bbfea3f4bea0de75956043f18ca370ff6f502.pdf,ICLR,2021,"We extend the framework of average-case optimal first-order methods to problems with non-symmetric matrices, which naturally arise in equilibrium finding for games." +HJfQrs0qt7,H1eyVgbgYQ,1538090000000.0,1545360000000.0,74,Convergence Properties of Deep Neural Networks on Separable Data,"[""remi.tachet@microsoft.com"", ""mohammad.pezeshki@umontreal.ca"", ""s.shabanian@gmail.com"", ""aaron.courville@gmail.com"", ""yoshua.umontreal@gmail.com""]","[""Remi Tachet des Combes"", ""Mohammad Pezeshki"", ""Samira Shabanian"", ""Aaron Courville"", ""Yoshua Bengio""]","[""learning dynamics"", ""gradient descent"", ""classification"", ""optimization"", ""cross-entropy"", ""hinge loss"", ""implicit regularization"", ""gradient starvation""]","While a lot of progress has been made in recent years, the dynamics of learning in deep nonlinear neural networks remain to this day largely misunderstood. In this work, we study the case of binary classification and prove various properties of learning in such networks under strong assumptions such as linear separability of the data. Extending existing results from the linear case, we confirm empirical observations by proving that the classification error also follows a sigmoidal shape in nonlinear architectures. We show that given proper initialization, learning expounds parallel independent modes and that certain regions of parameter space might lead to failed training. We also demonstrate that input norm and features' frequency in the dataset lead to distinct convergence speeds which might shed some light on the generalization capabilities of deep neural networks. 
We provide a comparison between the dynamics of learning with cross-entropy and hinge losses, which could prove useful to understand recent progress in the training of generative adversarial networks. Finally, we identify a phenomenon that we baptize gradient starvation where the most frequent features in a dataset prevent the learning of other less frequent but equally informative features.",/pdf/56635a6a8e78f121bc2708be997e89a095ad3fcf.pdf,ICLR,2019,This paper analyzes the learning dynamics of neural networks on classification tasks solved by gradient descent using the cross-entropy and hinge losses. +#NAME?,I9e-h5vBtV,1601310000000.0,1615970000000.0,3026,Disentangling 3D Prototypical Networks for Few-Shot Concept Learning,"[""~Mihir_Prabhudesai1"", ""~Shamit_Lal1"", ""~Darshan_Patil1"", ""~Hsiao-Yu_Tung1"", ""~Adam_W_Harley1"", ""~Katerina_Fragkiadaki1""]","[""Mihir Prabhudesai"", ""Shamit Lal"", ""Darshan Patil"", ""Hsiao-Yu Tung"", ""Adam W Harley"", ""Katerina Fragkiadaki""]","[""Disentanglement"", ""Few Shot Learning"", ""3D Vision"", ""VQA""]","We present neural architectures that disentangle RGB-D images into objects’ shapes and styles and a map of the background scene, and explore their applications for few-shot 3D object detection and few-shot concept classification. Our networks incorporate architectural biases that reflect the image formation process, 3D geometry of the world scene, and shape-style interplay. They are trained end-to-end self-supervised by predicting views in static scenes, alongside a small number of 3D object boxes. Objects and scenes are represented in terms of 3D feature grids in the bottleneck of the network. We show the proposed 3D neural representations are compositional: they can generate novel 3D scene feature maps by mixing object shapes and styles, resizing and adding the resulting object 3D feature maps over background scene feature maps. 
We show object detectors trained on hallucinated 3D neural scenes generalize better to novel environments. We show classifiers for object categories, color, materials, and spatial relationships trained over the disentangled 3D feature sub-spaces generalize better with dramatically fewer exemplars over the current state-of-the-art, and enable a visual question answering system that uses them as its modules to generalize one-shot to novel objects in the scene.",/pdf/b42e4e31403f7d4fdb789fb870cace1f71e6bb86.pdf,ICLR,2021,"We present neural architectures that disentangle RGB-D images into objects’ shapes and styles and a map of the background scene, and explore their applications for few-shot 3D object detection and few-shot concept classification." +B1gX8JrYPr,SylGmUpuvS,1569440000000.0,1577170000000.0,1723,Connecting the Dots Between MLE and RL for Sequence Prediction,"[""bwkevintan@gmail.com"", ""zhitinghu@gmail.com"", ""yangtze2301@gmail.com"", ""rsalakhu@cs.cmu.edu"", ""epxing@cs.cmu.edu""]","[""Bowen Tan"", ""Zhiting Hu"", ""Zichao Yang"", ""Ruslan Salakhutdinov"", ""Eric Xing""]","[""Sequence generation"", ""sequence prediction"", ""reinforcement learning""]","Sequence prediction models can be learned from example sequences with a variety of training algorithms. Maximum likelihood learning is simple and efficient, yet can suffer from compounding error at test time. +Reinforcement learning such as policy gradient addresses the issue but can have prohibitively poor exploration efficiency. A rich set of other algorithms, such as data noising, RAML, and softmax policy gradient, have also been developed from different perspectives. +In this paper, we present a formalism of entropy regularized policy optimization, and show that the apparently distinct algorithms, including MLE, can be reformulated as special instances of the formulation. The difference between them is characterized by the reward function and two weight hyperparameters. 
+The unifying interpretation enables us to systematically compare the algorithms side-by-side, and gain new insights into the trade-offs of the algorithm design. +The new perspective also leads to an improved approach that dynamically interpolates among the family of algorithms, and learns the model in a scheduled way. Experiments on machine translation, text summarization, and game imitation learning demonstrate superiority of the proposed approach.",/pdf/56c20197d5210fad94a8e875149cc7e792c9a359.pdf,ICLR,2020,An entropy regularized policy optimization formalism subsumes a set of sequence prediction learning algorithms. A new interpolation algorithm with improved results on text generation and game imitation learning. +SklgHoRqt7,S1xQBIFUYm,1538090000000.0,1545360000000.0,55,Metric-Optimized Example Weights,"[""senzhao@google.com"", ""mmilanifard@google.com"", ""mayagupta@google.com""]","[""Sen Zhao"", ""Mahdi Milani Fard"", ""Maya Gupta""]",[],"Real-world machine learning applications often have complex test metrics, and may have training and test data that follow different distributions. We propose addressing these issues by using a weighted loss function with a standard convex loss, but with weights on the training examples that are learned to optimize the test metric of interest on the validation set. These metric-optimized example weights can be learned for any test metric, including black box losses and customized metrics for specific applications. 
We illustrate the performance of our proposal with public benchmark datasets and real-world applications with domain shift and custom loss functions that balance multiple objectives, impose fairness policies, and are non-convex and non-decomposable.",/pdf/9700e51dc293e95f089abc0ebd3bb11336877b59.pdf,ICLR,2019, +H1ebTsActm,rkeDcR55tm,1538090000000.0,1550850000000.0,776,Adaptivity of deep ReLU network for learning in Besov and mixed smooth Besov spaces: optimal rate and curse of dimensionality,"[""taiji@mist.i.u-tokyo.ac.jp""]","[""Taiji Suzuki""]","[""deep learning theory"", ""approximation analysis"", ""generalization error analysis"", ""Besov space"", ""minimax optimality""]","Deep learning has shown high performances in various types of tasks from visual recognition to natural language processing, +which indicates superior flexibility and adaptivity of deep learning. +To understand this phenomenon theoretically, we develop a new approximation and estimation error analysis of +deep learning with the ReLU activation for functions in a Besov space and its variant with mixed smoothness. +The Besov space is a considerably general function space including the Holder space and Sobolev space, and especially can capture spatial inhomogeneity of smoothness. Through the analysis in the Besov space, it is shown that deep learning can achieve the minimax optimal rate and outperform any non-adaptive (linear) estimator such as kernel ridge regression, +which shows that deep learning has higher adaptivity to the spatial inhomogeneity of the target function than other estimators such as linear ones. In addition to this, it is shown that deep learning can avoid the curse of dimensionality if the target function is in a mixed smooth Besov space. We also show that the dependency of the convergence rate on the dimensionality is tight due to its minimax optimality. These results support high adaptivity of deep learning and its superior ability as a feature extractor. 
+",/pdf/1c0615bc2c2933b2fc81616e3f0e04dfb900c5b2.pdf,ICLR,2019, +Jacdvfjicf7,CfUvT11mfJ,1601310000000.0,1616140000000.0,627,Interpreting and Boosting Dropout from a Game-Theoretic View,"[""~Hao_Zhang22"", ""~Sen_Li2"", ""~YinChao_Ma1"", ""~Mingjie_Li3"", ""~Yichen_Xie1"", ""~Quanshi_Zhang1""]","[""Hao Zhang"", ""Sen Li"", ""YinChao Ma"", ""Mingjie Li"", ""Yichen Xie"", ""Quanshi Zhang""]","[""Dropout"", ""Interpretability"", ""Interactions""]","This paper aims to understand and improve the utility of the dropout operation from the perspective of game-theoretical interactions. We prove that dropout can suppress the strength of interactions between input variables of deep neural networks (DNNs). The theoretical proof is also verified by various experiments. Furthermore, we find that such interactions were strongly related to the over-fitting problem in deep learning. So, the utility of dropout can be regarded as decreasing interactions to alleviating the significance of over-fitting. Based on this understanding, we propose the interaction loss to further improve the utility of dropout. Experimental results on various DNNs and datasets have shown that the interaction loss can effectively improve the utility of dropout and boost the performance of DNNs.",/pdf/21165b3f3948c92ac8a6a60e5de44f9411235f53.pdf,ICLR,2021,We prove and improve the utility of the dropout operation from a game-theoretic view. +HJxhWa4KDr,SJeIZoOLPB,1569440000000.0,1577170000000.0,391,MMD GAN with Random-Forest Kernels,"[""tao.huang2018@ruc.edu.cn"", ""handarkholme@ruc.edu.cn"", ""jiayushenyang@gmail.com"", ""hanyuan0725@gmail.com""]","[""Tao Huang"", ""Zhen Han"", ""Xu Jia"", ""Hanyuan Hang""]","[""GANs"", ""MMD"", ""kernel"", ""random forest"", ""unbiased gradients""]","In this paper, we propose a novel kind of kernel, random forest kernel, to enhance the empirical performance of MMD GAN. 
Different from common forests with deterministic routings, a probabilistic routing variant is used in our innovated random-forest kernel, which is possible to merge with the CNN frameworks. Our proposed random-forest kernel has the following advantages: From the perspective of random forest, the output of GAN discriminator can be viewed as feature inputs to the forest, where each tree gets access to merely a fraction of the features, and thus the entire forest benefits from ensemble learning. In the aspect of kernel method, random-forest kernel is proved to be characteristic, and therefore suitable for the MMD structure. Besides, being an asymmetric kernel, our random-forest kernel is much more flexible, in terms of capturing the differences between distributions. Sharing the advantages of CNN, kernel method, and ensemble learning, our random-forest kernel based MMD GAN obtains desirable empirical performances on CIFAR-10, CelebA and LSUN bedroom data sets. Furthermore, for the sake of completeness, we also put forward comprehensive theoretical analysis to support our experimental results.",/pdf/448f26dd1c7692fabce2a1f7ff4f56af03d97b01.pdf,ICLR,2020,Equip MMD GANs with a new random-forest kernel. +q_kZm9eHIeD,woCNvwQK2Gj,1601310000000.0,1614990000000.0,1379,Entropic Risk-Sensitive Reinforcement Learning: A Meta Regret Framework with Function Approximation,"[""~Yingjie_Fei1"", ""~Zhuoran_Yang1"", ""~Zhaoran_Wang1""]","[""Yingjie Fei"", ""Zhuoran Yang"", ""Zhaoran Wang""]",[],"We study risk-sensitive reinforcement learning with the entropic risk measure and function approximation. We consider the finite-horizon episodic MDP setting, and propose a meta algorithm based on value iteration. We then derive two algorithms for linear and general function approximation, namely RSVI.L and RSVI.G, respectively, as special instances of the meta algorithm. 
We illustrate that the success of RSVI.L depends crucially on carefully designed feature mapping and regularization that adapt to risk sensitivity. In addition, both RSVI.L and RSVI.G maintain risk-sensitive optimism that facilitates efficient exploration. On the analytic side, we provide regret analysis for the algorithms by developing a meta analytic framework, at the core of which is a risk-sensitive optimism condition. We show that any instance of the meta algorithm that satisfies the condition yields a meta regret bound. We further verify the condition for RSVI.L and RSVI.G under respective function approximation settings to obtain concrete regret bounds that scale sublinearly in the number of episodes. +",/pdf/82089929f026609e0ef417358fa618e145567947.pdf,ICLR,2021, +b7g3_ZMHnT0,Npay6Kbc0vV,1601310000000.0,1616850000000.0,2516,Learning to Deceive Knowledge Graph Augmented Models via Targeted Perturbation,"[""mrigankraman1611@gmail.com"", ""~Aaron_Chan1"", ""~Siddhant_Agarwal1"", ""~PeiFeng_Wang1"", ""~Hansen_Wang1"", ""sukim@adobe.com"", ""~Ryan_Rossi1"", ""~Handong_Zhao3"", ""lipka@adobe.com"", ""~Xiang_Ren1""]","[""Mrigank Raman"", ""Aaron Chan"", ""Siddhant Agarwal"", ""PeiFeng Wang"", ""Hansen Wang"", ""Sungchul Kim"", ""Ryan Rossi"", ""Handong Zhao"", ""Nedim Lipka"", ""Xiang Ren""]","[""neural symbolic reasoning"", ""interpretability"", ""model explanation"", ""faithfulness"", ""knowledge graph"", ""commonsense question answering"", ""recommender system""]","Knowledge graphs (KGs) have helped neural models improve performance on various knowledge-intensive tasks, like question answering and item recommendation. By using attention over the KG, such KG-augmented models can also ""explain"" which KG information was most relevant for making a given prediction. In this paper, we question whether these models are really behaving as we expect. 
We show that, through a reinforcement learning policy (or even simple heuristics), one can produce deceptively perturbed KGs, which maintain the downstream performance of the original KG while significantly deviating from the original KG's semantics and structure. Our findings raise doubts about KG-augmented models' ability to reason about KG information and give sensible explanations.",/pdf/f507111c61d895cf0cf9f23f8fdd018a9ca5717d.pdf,ICLR,2021,KG-augmented models and humans use KG info differently. +SkYibHlRb,HyHibreCW,1509080000000.0,1518730000000.0,247,SQLNet: Generating Structured Queries From Natural Language Without Reinforcement Learning,"[""xuxiaojun1005@gmail.com"", ""liuchang@eecs.berkeley.edu"", ""dawnsong@cs.berkeley.edu""]","[""Xiaojun Xu"", ""Chang Liu"", ""Dawn Song""]",[],"Synthesizing SQL queries from natural language is a long-standing open problem and has been attracting considerable interest recently. Toward solving the problem, the de facto approach is to employ a sequence-to-sequence-style model. Such an approach will necessarily require the SQL queries to be serialized. Since the same SQL query may have multiple equivalent serializations, training a sequence-to-sequence-style model is sensitive to the choice from one of them. This phenomenon is documented as the ""order-matters"" problem. Existing state-of-the-art approaches rely on reinforcement learning to reward the decoder when it generates any of the equivalent serializations. However, we observe that the improvement from reinforcement learning is limited. + +In this paper, we propose a novel approach, i.e., SQLNet, to fundamentally solve this problem by avoiding the sequence-to-sequence structure when the order does not matter. In particular, we employ a sketch-based approach where the sketch contains a dependency graph, so that one prediction can be done by taking into consideration only the previous predictions that it depends on. 
In addition, we propose a sequence-to-set model as well as the column attention mechanism to synthesize the query based on the sketch. By combining all these novel techniques, we show that SQLNet can outperform the prior art by 9% to 13% on the WikiSQL task.",/pdf/a45f20738cf0107f89dddde05702ebac05ff899d.pdf,ICLR,2018, +SJ1fQYlCZ,HyJMXKlCZ,1509100000000.0,1518730000000.0,333,Training with Growing Sets: A Simple Alternative to Curriculum Learning and Self Paced Learning,"[""melike.mermer@izu.edu.tr"", ""mfatih@ce.yildiz.edu.tr""]","[""Melike Nur Mermer"", ""Mehmet Fatih Amasyali""]","[""Neural networks"", ""Curriculum learning"", ""Self paced learning""]","Curriculum learning and Self paced learning are popular topics in the machine learning that suggest to put the training samples in order by considering their difficulty levels. Studies in these topics show that starting with a small training set and adding new samples according to difficulty levels improves the learning performance. In this paper we experimented that we can also obtain good results by adding the samples randomly without a meaningful order. We compared our method with classical training, Curriculum learning, Self paced learning and their reverse ordered versions. Results of the statistical tests show that the proposed method is better than classical method and similar with the others. These results point a new training regime that removes the process of difficulty level determination in Curriculum and Self paced learning and as successful as these methods.",/pdf/990a9fd31a08c96c002369aa3cbf89ec957393d5.pdf,ICLR,2018,We propose that training with growing sets stage-by-stage provides an optimization for neural networks. 
+Syl-_aVtvH,S1xvKsiwwr,1569440000000.0,1577170000000.0,621,Federated User Representation Learning,"[""ducbui@umich.edu"", ""kmalik2@fb.com"", ""jrgoetz@umich.edu"", ""shanemoon@fb.com"", ""honglei@fb.com"", ""anujk@fb.com"", ""kgshin@umich.edu""]","[""Duc Bui"", ""Kshitiz Malik"", ""Jack Goetz"", ""Seungwhan Moon"", ""Honglei Liu"", ""Anuj Kumar"", ""Kang G. Shin""]","[""Machine Learning"", ""Federated Learning"", ""Personalization"", ""User Representation""]","Collaborative personalization, such as through learned user representations (embeddings), can improve the prediction accuracy of neural-network-based models significantly. We propose Federated User Representation Learning (FURL), a simple, scalable, privacy-preserving and resource-efficient way to utilize existing neural personalization techniques in the Federated Learning (FL) setting. FURL divides model parameters into federated and private parameters. Private parameters, such as private user embeddings, are trained locally, but unlike federated parameters, they are not transferred to or averaged on the server. We show theoretically that this parameter split does not affect training for most model personalization approaches. Storing user embeddings locally not only preserves user privacy, but also improves memory locality of personalization compared to on-server training. We evaluate FURL on two datasets, demonstrating a significant improvement in model quality with 8% and 51% performance increases, and approximately the same level of performance as centralized training with only 0% and 4% reductions. 
Furthermore, we show that user embeddings learned in FL and the centralized setting have a very similar structure, indicating that FURL can learn collaboratively through the shared parameters while preserving user privacy.",/pdf/cb37b78a0231af2cd7f44dd536a76366968522f9.pdf,ICLR,2020,"We propose Federated User Representation Learning (FURL), a simple, scalable, privacy-preserving and bandwidth-efficient way to utilize existing neural personalization techniques in the Federated Learning (FL) setting." +BkN_r2lR-,Hy7OB2xRb,1509110000000.0,1519330000000.0,390,Identifying Analogies Across Domains,"[""yedidh@fb.com"", ""wolf@fb.com""]","[""Yedid Hoshen"", ""Lior Wolf""]","[""unsupervised mapping"", ""cross domain mapping""]","Identifying analogies across domains without supervision is a key task for artificial intelligence. Recent advances in cross domain image mapping have concentrated on translating images across domains. Although the progress made is impressive, the visual fidelity many times does not suffice for identifying the matching sample from the other domain. In this paper, we tackle this very task of finding exact analogies between datasets i.e. for every image from domain A find an analogous image in domain B. We present a matching-by-synthesis approach: AN-GAN, and show that it outperforms current techniques. We further show that the cross-domain mapping task can be broken into two parts: domain alignment and learning the mapping function. The tasks can be iteratively solved, and as the alignment is improved, the unsupervised translation function reaches quality comparable to full supervision. 
",/pdf/2085fdc96566184832910099974da2c4a05c78b5.pdf,ICLR,2018,Finding correspondences between domains by performing matching/mapping iterations +HyloPnEKPr,SJg0ajDqHr,1569440000000.0,1577170000000.0,17,Context-aware Attention Model for Coreference Resolution,"[""vermouthtarot@gmail.com"", ""zxy951005@stu.xjtu.edu.cn"", ""majack@stu.xjtu.edu.cn"", ""longyu95@stu.xjtu.edu.cn"", ""wangxuan8888@stu.xjtu.edu.cn"", ""cli@xjtu.edu.cn""]","[""Yufei Li"", ""Xiangyu Zhou"", ""Jie Ma"", ""Yu Long"", ""Xuan Wang"", ""Chen Li""]","[""Coreference resolution"", ""Feature Attention""]","Coreference resolution is an important task for gaining more complete understanding about texts by artificial intelligence. The state-of-the-art end-to-end neural coreference model considers all spans in a document as potential mentions and learns to link an antecedent with each possible mention. However, for the verbatim same mentions, the model tends to get similar or even identical representations based on the features, and this leads to wrongful predictions. In this paper, we propose to improve the end-to-end system by building an attention model to reweigh features around different contexts. 
The proposed model substantially outperforms the state-of-the-art on the English dataset of the CoNLL 2012 Shared Task with 73.45% F1 score on development data and 72.84% F1 score on test data.",/pdf/165ccbb4e4af9a98b694db9a13d96043f227340f.pdf,ICLR,2020,We demonstrate an attention model reweighing features around different contexts to reduce the wrongful predictions between similar or identical texts units +SOVSJZ9PTO7,vB9z28K2DE,1601310000000.0,1614990000000.0,2074,JAKET: Joint Pre-training of Knowledge Graph and Language Understanding,"[""~Donghan_Yu2"", ""~Chenguang_Zhu1"", ""~Yiming_Yang1"", ""~Michael_Zeng1""]","[""Donghan Yu"", ""Chenguang Zhu"", ""Yiming Yang"", ""Michael Zeng""]","[""Pre-training"", ""Knowledge Graph"", ""Language Understanding"", ""Graph Neural Network""]","Knowledge graphs (KGs) contain rich information about world knowledge, entities, and relations. Thus, they can be great supplements to existing pre-trained language models. However, it remains a challenge to efficiently integrate information from KG into language modeling. And the understanding of a knowledge graph requires related context. We propose a novel joint pre-training framework, JAKET, to model both the knowledge graph and language. The knowledge module and language module provide essential information to mutually assist each other: the knowledge module produces embeddings for entities in text while the language module generates context-aware initial embeddings for entities and relations in the graph. Our design enables the pre-trained model to easily adapt to unseen knowledge graphs in new domains. 
Experimental results on several knowledge-aware NLP tasks show that our proposed framework achieves superior performance by effectively leveraging knowledge in language understanding.",/pdf/d85ed75d8cbc7aafae9903ac9785e66956d6e1e0.pdf,ICLR,2021,A joint pre-training framework which models both the knowledge graph and text and can easily adapt to unseen knowledge graphs in new domains during fine-tuning +rryJiPXifr,#NAME?,1601310000000.0,1614990000000.0,1358,Optimization Planning for 3D ConvNets,"[""~Zhaofan_Qiu2"", ""~Ting_Yao1"", ""~Chong-wah_Ngo2"", ""~Tao_Mei3""]","[""Zhaofan Qiu"", ""Ting Yao"", ""Chong-wah Ngo"", ""Tao Mei""]","[""3D ConvNets"", ""Network Training"", ""Video Recognition""]","3D Convolutional Neural Networks (3D ConvNets) have been regarded as a powerful class of models for video recognition. Nevertheless, it is not trivial to optimally learn a 3D ConvNets due to high complexity and various options of the training scheme. The most common hand-tuning process starts from learning 3D ConvNets using short video clips and then is followed by learning long-term temporal dependency using lengthy clips, while gradually decaying the learning rate from high to low as training progresses. The fact that such process comes along with several heuristic settings motivates the study to seek an optimal ``path'' to automate the entire training. In this paper, we decompose the path into a series of training ``states'' and specify the hyper-parameters, e.g., learning rate and the length of input clips, in each state. The estimation of the knee point on the performance-epoch curve triggers the transition from one state to another. We perform dynamic programming over all the candidate states to plan the optimal permutation of states, i.e., optimization path. Furthermore, we devise a new 3D ConvNets with a unique design of dual-head classifier to improve the spatial and temporal discrimination. 
Extensive experiments conducted on seven public video recognition benchmarks demonstrate the advantages of our proposal. With the optimization planning, our 3D ConvNets achieves superior results when comparing to the state-of-the-art video recognition approaches. More remarkably, we obtain the top-1 accuracy of 82.5% and 84.3% on the large-scale Kinetics-400 and Kinetics-600 datasets, respectively.",/pdf/8ceb47dd0dd48f58102f603f443fafba0cb5bb6e.pdf,ICLR,2021,We propose optimization planning mechanism to automate the design of training strategy for 3D ConvNets. +6y3-wzlGHkb,p_pE88kViwm,1601310000000.0,1614990000000.0,2150,Non-robust Features through the Lens of Universal Perturbations,"[""~Sung_Min_Park2"", ""kuoanwei@mit.edu"", ""~Kai_Yuanqing_Xiao1"", ""~Jerry_Li1"", ""~Aleksander_Madry1""]","[""Sung Min Park"", ""Kuo-An Wei"", ""Kai Yuanqing Xiao"", ""Jerry Li"", ""Aleksander Madry""]","[""adversarial examples"", ""robustness"", ""non-robust features""]","Recent work ties adversarial examples to existence of non-robust features: features which are susceptible to small perturbations and believed to be unintelligible to humans, but still useful for prediction. We study universal adversarial perturbations and demonstrate that the above picture is more nuanced. Specifically, even though universal perturbations---similarly to standard adversarial perturbations---do leverage non-robust features, these features tend to be fundamentally different from the ``standard'' ones and, in particular, non-trivially human-aligned. Namely, universal perturbations have more human-aligned locality and spatial invariance properties. However, we also show that these human-aligned non-robust features have much less predictive signal than general non-robust features. 
Our findings thus take a step towards improving our understanding of these previously unintelligible features.",/pdf/b01ee46313ef48630aa6d3cb38400866db43ec8b.pdf,ICLR,2021,"We analyze non-robust features through universal perturbations, and find evidence of weak yet human-aligned non-robust features." +S1x4ghC9tQ,BJef8Ln9Y7,1538090000000.0,1553790000000.0,1067,Temporal Difference Variational Auto-Encoder,"[""karol.gregor@gmail.com"", ""g.papamakarios@ed.ac.uk"", ""fbesse@google.com"", ""lbuesing@google.com"", ""theophane@google.com""]","[""Karol Gregor"", ""George Papamakarios"", ""Frederic Besse"", ""Lars Buesing"", ""Theophane Weber""]","[""generative models"", ""variational auto-encoders"", ""state space models"", ""temporal difference learning""]","To act and plan in complex environments, we posit that agents should have a mental simulator of the world with three characteristics: (a) it should build an abstract state representing the condition of the world; (b) it should form a belief which represents uncertainty on the world; (c) it should go beyond simple step-by-step simulation, and exhibit temporal abstraction. Motivated by the absence of a model satisfying all these requirements, we propose TD-VAE, a generative sequence model that learns representations containing explicit beliefs about states several steps into the future, and that can be rolled out directly without single-step transitions. TD-VAE is trained on pairs of temporally separated time points, using an analogue of temporal difference learning used in reinforcement learning.",/pdf/d053a90cc9fd47dc7cabd4045f47d06afbf2cf49.pdf,ICLR,2019,"Generative model of temporal data, that builds online belief state, operates in latent space, does jumpy predictions and rollouts of states." +S1e-0kBYPB,Sygmyv1FPH,1569440000000.0,1577170000000.0,2015,Can I Trust the Explainer? 
Verifying Post-Hoc Explanatory Methods,"[""ocamburu@gmail.com"", ""eleonora.giunchiglia@cs.ox.ac.uk"", ""jakobfoerster@gmail.com"", ""thomas.lukasiewicz@gmail.com"", ""philblunsom@gmail.com""]","[""Oana-Maria Camburu*"", ""Eleonora Giunchiglia*"", ""Jakob Foerster"", ""Thomas Lukasiewicz"", ""Phil Blunsom""]","[""explainability"", ""neural networks""]","For AI systems to garner widespread public acceptance, we must develop methods capable of explaining the decisions of black-box models such as neural networks. In this work, we identify two issues of current explanatory methods. First, we show that two prevalent perspectives on explanations—feature-additivity and feature-selection—lead to fundamentally different instance-wise explanations. In the literature, explainers from different perspectives are currently being directly compared, despite their distinct explanation goals. The second issue is that current post-hoc explainers have only been thoroughly validated on simple models, such as linear regression, and, when applied to real-world neural networks, explainers are commonly evaluated under the assumption that the learned models behave reasonably. However, neural networks often rely on unreasonable correlations, even when producing correct decisions. We introduce a verification framework for explanatory methods under the feature-selection perspective. Our framework is based on a non-trivial neural network architecture trained on a real-world task, and for which we are able to provide guarantees on its inner workings. We validate the efficacy of our evaluation by showing the failure modes of current explainers. 
We aim for this framework to provide a publicly available, off-the-shelf evaluation when the feature-selection perspective on explanations is needed.",/pdf/c2e5c2f85adeb32269836cef923951ef7b20b0f2.pdf,ICLR,2020,An evaluation framework based on a real-world neural network for post-hoc explanatory methods +rylnK6VtDH,rJgUmraPvS,1569440000000.0,1583910000000.0,684,Multiplicative Interactions and Where to Find Them,"[""sidmj@google.com"", ""lejlot@google.com"", ""jmenick@google.com"", ""schwarzjn@google.com"", ""jwrae@google.com"", ""osindero@google.com"", ""ywteh@google.com"", ""tharley@google.com"", ""razp@google.com""]","[""Siddhant M. Jayakumar"", ""Wojciech M. Czarnecki"", ""Jacob Menick"", ""Jonathan Schwarz"", ""Jack Rae"", ""Simon Osindero"", ""Yee Whye Teh"", ""Tim Harley"", ""Razvan Pascanu""]","[""multiplicative interactions"", ""hypernetworks"", ""attention""]","We explore the role of multiplicative interaction as a unifying framework to describe a range of classical and modern neural network architectural motifs, such as gating, attention layers, hypernetworks, and dynamic convolutions amongst others. +Multiplicative interaction layers as primitive operations have a long-established presence in the literature, though this is often not emphasized and thus under-appreciated. We begin by showing that such layers strictly enrich the representable function classes of neural networks. We conjecture that multiplicative interactions offer a particularly powerful inductive bias when fusing multiple streams of information or when conditional computation is required. We therefore argue that they should be considered in many situations where multiple compute or information paths need to be combined, in place of the simple and oft-used concatenation operation. 
Finally, we back up our claims and demonstrate the potential of multiplicative interactions by applying them in large-scale complex RL and sequence modelling tasks, where their use allows us to deliver state-of-the-art results, and thereby provides new evidence in support of multiplicative interactions playing a more prominent role when designing new neural network architectures.",/pdf/25d97c4a79fac39e47afae9943ee47ffbd93b248.pdf,ICLR,2020,"We explore the role of multiplicative interaction as a unifying framework to describe a range of classical and modern neural network architectural motifs, such as gating, attention layers, hypernetworks, and dynamic convolutions amongst others." +ByeLBj0qFQ,HJlaQEBtKQ,1538090000000.0,1545360000000.0,89,Unsupervised Image to Sequence Translation with Canvas-Drawer Networks,"[""kevinfrans2@gmail.com"", ""chin-yi.cheng@autodesk.com""]","[""Kevin Frans"", ""Chin-Yi Cheng""]","[""image"", ""translation"", ""unsupervised"", ""model-based""]","Encoding images as a series of high-level constructs, such as brush strokes or discrete shapes, can often be key to both human and machine understanding. In many cases, however, data is only available in pixel form. We present a method for generating images directly in a high-level domain (e.g. brush strokes), without the need for real pairwise data. Specifically, we train a ”canvas” network to imitate the mapping of high-level constructs to pixels, followed by a high-level ”drawing” network which is optimized through this mapping towards solving a desired image recreation or translation task. We successfully discover sequential vector representations of symbols, large sketches, and 3D objects, utilizing only pixel data. We display applications of our method in image segmentation, and present several ablation studies comparing various configurations.",/pdf/db0511d4737e4347f68696a53afb6cc3cd10a60b.pdf,ICLR,2019,Recreate images as interpretable high-level sequences without the need for paired data. 
+l3gNU1KStIC,qpXZNck2vk7,1601310000000.0,1614990000000.0,78,Stochastic Inverse Reinforcement Learning ,"[""~Ce_Ju1""]","[""Ce Ju""]","[""Inverse Reinforcement Learning"", ""Stochastic Methods"", ""MCEM""]","The goal of the inverse reinforcement learning (IRL) problem is to recover the reward functions from expert demonstrations. However, the IRL problem like any ill-posed inverse problem suffers the congenital defect that the policy may be optimal for many reward functions, and expert demonstrations may be optimal for many policies. In this work, we generalize the IRL problem to a well-posed expectation optimization problem stochastic inverse reinforcement learning (SIRL) to recover the probability distribution over reward functions. We adopt the Monte Carlo expectation-maximization (MCEM) method to estimate the parameter of the probability distribution as the first solution to the SIRL problem. The solution is succinct, robust, and transferable for a learning task and can generate alternative solutions to the IRL problem. Through our formulation, it is possible to observe the intrinsic property for the IRL problem from a global viewpoint, and our approach achieves a considerable performance on the objectworld. ",/pdf/18e8ae43e19fbfea72ff3fdee571de26cf0343ed.pdf,ICLR,2021,We generalize the IRL problem to a well-posed expectation optimization problem stochastic inverse reinforcement learning (SIRL) problem to recover the probability distribution for reward functions. +r1enqkBtwr,BkgvIs0ODH,1569440000000.0,1577170000000.0,1893,Asymptotic learning curves of kernel methods: empirical data v.s. Teacher-Student paradigm,"[""stefano.spigler@epfl.ch"", ""mario.geiger@epfl.ch"", ""matthieu.wyart@epfl.ch""]","[""Stefano Spigler"", ""Mario Geiger"", ""Matthieu Wyart""]",[],"How many training data are needed to learn a supervised task? 
It is often observed that the generalization error decreases as $n^{-\beta}$ where $n$ is the number of training examples and $\beta$ an exponent that depends on both data and algorithm. In this work we measure $\beta$ when applying kernel methods to real datasets. For MNIST we find $\beta\approx 0.4$ and for CIFAR10 $\beta\approx 0.1$. Remarkably, $\beta$ is the same for regression and classification tasks, and for Gaussian or Laplace kernels. To rationalize the existence of non-trivial exponents that can be independent of the specific kernel used, we introduce the Teacher-Student framework for kernels. In this scheme, a Teacher generates data according to a Gaussian random field, and a Student learns them via kernel regression. With a simplifying assumption --- namely that the data are sampled from a regular lattice --- we derive analytically $\beta$ for translation invariant kernels, using previous results from the kriging literature. Provided that the Student is not too sensitive to high frequencies, $\beta$ depends only on the training data and their dimension. We confirm numerically that these predictions hold when the training points are sampled at random on a hypersphere. Overall, our results quantify how smooth Gaussian data should be to avoid the curse of dimensionality, and indicate that for kernel learning the relevant dimension of the data should be defined in terms of how the distance between nearest data points depends on $n$. 
With this definition one obtains reasonable effective smoothness estimates for MNIST and CIFAR10.",/pdf/90e7b0840786dd9dccd469ae743c6cee9aaa3b19.pdf,ICLR,2020, +Bkgwp3NtDH,rJxNls_7wS,1569440000000.0,1577170000000.0,229,Programmable Neural Network Trojan for Pre-trained Feature Extractor,"[""jiy15@mails.tsinghua.edu.cn"", ""liuzixin18@mails.tsinghua.edu.cn"", ""xinghu@ucsb.edu"", ""wpq14@mails.tsinghua.edu.cn"", ""zyh02@tsinghua.edu.cn""]","[""Yu Ji"", ""Zinxin Liu"", ""Xing Hu"", ""Peiqi Wang"", ""Youhui Zhang""]","[""Neural Network"", ""Trojan"", ""Security""]","Neural network (NN) trojaning attack is an emerging and important attack that can broadly damage the system deployed with NN models. +Different from adversarial attack, it hides malicious functionality in the weight parameters of NN models. +Existing studies have explored NN trojaning attacks in some small datasets for specific domains, with limited numbers of fixed target classes. +In this paper, we propose a more powerful trojaning attack method for large models, which outperforms existing studies in capability, generality, and stealthiness. +First, the attack is programmable that the malicious misclassification target is not fixed and can be generated on demand even after the victim's deployment. +Second, our trojaning attack is not limited in a small domain; one trojaned model on a large-scale dataset can affect applications of different domains that reuses its general features. 
+Third, our trojan shows no biased behavior for different target classes, which makes it more difficult to defend.",/pdf/c47f1f34e83752822955d5d02a8a39ae36937e76.pdf,ICLR,2020,We present a more powerful NN trojaning attack that can support outer-scope targets and dynamic targets +ryxmrpNtvH,r1eslarPPH,1569440000000.0,1577170000000.0,516,Deeper Insights into Weight Sharing in Neural Architecture Search,"[""scottyugochang@gmail.com"", ""quanlu.zhang@microsoft.com"", ""jyjiang97@gmail.com"", ""gdzejlin@gmail.com"", ""yujing.wang@microsoft.com""]","[""Yuge Zhang"", ""Quanlu Zhang"", ""Junyang Jiang"", ""Zejun Lin"", ""Yujing Wang""]","[""Neural Architecture Search"", ""NAS"", ""AutoML"", ""AutoDL"", ""Deep Learning"", ""Machine Learning""]","With the success of deep neural networks, Neural Architecture Search (NAS) as a way of automatic model design has attracted wide attention. As training every child model from scratch is very time-consuming, recent works leverage weight-sharing to speed up the model evaluation procedure. These approaches greatly reduce computation by maintaining a single copy of weights on the super-net and share the weights among every child model. However, weight-sharing has no theoretical guarantee and its impact has not been well studied before. 
In this paper, we conduct comprehensive experiments to reveal the impact of weight-sharing: (1) The best-performing models from different runs or even from consecutive epochs within the same run have significant variance; (2) Even with high variance, we can extract valuable information from training the super-net with shared weights; (3) The interference between child models is a main factor that induces high variance; (4) Properly reducing the degree of weight sharing could effectively reduce variance and improve performance.",/pdf/f1a99f933ec121f635f80fdc0c035f010e22ac3f.pdf,ICLR,2020,A comprehensive study of the impact of weight-sharing in Neural Architecture Search +Bk67W4Yxl,,1478210000000.0,1481760000000.0,90,Improved Architectures for Computer Go,"[""Tristan.Cazenave@dauphine.fr""]","[""Tristan Cazenave""]","[""Games"", ""Supervised Learning"", ""Deep learning""]",AlphaGo trains policy networks with both supervised and reinforcement learning and makes different policy networks play millions of games so as to train a value network. The reinforcement learning part requires a massive amount of computation. We propose to train networks for computer Go so that a given accuracy is reached with far fewer examples. We modify the architecture of the networks in order to train them faster and to have better accuracy in the end.,/pdf/d2cf69a4c366823c582e6c8a7a09404cd0b9371a.pdf,ICLR,2017,Improving training of deep networks for computer Go modifying the layers +QkRbdiiEjM,Rbi97dtyY0y,1601310000000.0,1615880000000.0,1400,AdaGCN: Adaboosting Graph Convolutional Networks into Deep Models,"[""~Ke_Sun3"", ""~Zhanxing_Zhu1"", ""~Zhouchen_Lin1""]","[""Ke Sun"", ""Zhanxing Zhu"", ""Zhouchen Lin""]","[""Graph Neural Networks"", ""AdaBoost""]","The design of deep graph models still remains to be investigated and the crucial part is how to explore and exploit the knowledge from different hops of neighbors in an efficient way. 
In this paper, we propose a novel RNN-like deep graph neural network architecture by incorporating AdaBoost into the computation of network; and the proposed graph convolutional network called AdaGCN~(Adaboosting Graph Convolutional Network) has the ability to efficiently extract knowledge from high-order neighbors of current nodes and then integrates knowledge from different hops of neighbors into the network in an Adaboost way. Different from other graph neural networks that directly stack many graph convolution layers, AdaGCN shares the same base neural network architecture among all ``layers'' and is recursively optimized, which is similar to an RNN. Besides, we also theoretically established the connection between AdaGCN and existing graph convolutional methods, presenting the benefits of our proposal. Finally, extensive experiments demonstrate the consistent state-of-the-art prediction performance on graphs across different label rates and the computational advantage of our approach AdaGCN~\footnote{Code is available at \url{https://github.com/datake/AdaGCN}.}.",/pdf/f3d1211b83f6d62eeef81117c15818464a995abb.pdf,ICLR,2021,We propose a novel RNN-like deep graph neural network architecture by incorporating AdaBoost into the computation of network. +Sk9yuql0Z,BJK1_9gRZ,1509100000000.0,1519860000000.0,352,Mitigating Adversarial Effects Through Randomization,"[""cihangxie306@gmail.com"", ""wjyouch@gmail.com"", ""zhshuai.zhang@gmail.com"", ""zhou.ren@snapchat.com"", ""alan.l.yuille@gmail.com""]","[""Cihang Xie"", ""Jianyu Wang"", ""Zhishuai Zhang"", ""Zhou Ren"", ""Alan Yuille""]","[""adversarial examples""]","Convolutional neural networks have demonstrated high accuracy on various tasks in recent years. However, they are extremely vulnerable to adversarial examples. For example, imperceptible perturbations added to clean images can cause convolutional neural networks to fail. 
In this paper, we propose to utilize randomization at inference time to mitigate adversarial effects. Specifically, we use two randomization operations: random resizing, which resizes the input images to a random size, and random padding, which pads zeros around the input images in a random manner. Extensive experiments demonstrate that the proposed randomization method is very effective at defending against both single-step and iterative attacks. Our method provides the following advantages: 1) no additional training or fine-tuning, 2) very few additional computations, 3) compatible with other adversarial defense methods. By combining the proposed randomization method with an adversarially trained model, it achieves a normalized score of 0.924 (ranked No.2 among 107 defense teams) in the NIPS 2017 adversarial examples defense challenge, which is far better than using adversarial training alone with a normalized score of 0.773 (ranked No.56). The code is publicly available at https://github.com/cihangxie/NIPS2017_adv_challenge_defense.",/pdf/a1d58c113bbe06a60514048d4c9d475705c5d9d7.pdf,ICLR,2018, +SySisz-CW,B1zPoG-A-,1509140000000.0,1518730000000.0,943,On the difference between building and extracting patterns: a causal analysis of deep generative models.,"[""michel.besserve@tuebingen.mpg.de"", ""dominik.janzing@tuebingen.mpg.de"", ""bs@tuebingen.mpg.de""]","[""Michel Besserve"", ""Dominik Janzing"", ""Bernhard Schoelkopf""]","[""GAN"", ""VAE"", ""causality""]","Generative models are important tools to capture and investigate the properties of complex empirical data. Recent developments such as Generative Adversarial Networks (GANs) and Variational Auto-Encoders (VAEs) use two very similar, but \textit{reverse}, deep convolutional architectures, one to generate and one to extract information from data. Does learning the parameters of both architectures obey the same rules? 
We exploit the causality principle of independence of mechanisms to quantify how the weights of successive layers adapt to each other. Using the recently introduced Spectral Independence Criterion, we quantify the dependencies between the kernels of successive convolutional layers and show that those are more independent for the generative process than for information extraction, in line with results from the field of causal inference. In addition, our experiments on generation of human faces suggest that more independence between successive layers of generators results in improved performance of these architectures. +",/pdf/414548aca01670f57637d1d495b984e927b771cd.pdf,ICLR,2018,We use causal inference to characterise the architecture of generative models +S1GkToR5tm,B1e-3baFYQ,1538090000000.0,1550880000000.0,767,Discriminator Rejection Sampling,"[""sazadi@berkeley.edu"", ""catherio@google.com"", ""trevor@eecs.berkeley.edu"", ""goodfellow@google.com"", ""augustusodena@google.com""]","[""Samaneh Azadi"", ""Catherine Olsson"", ""Trevor Darrell"", ""Ian Goodfellow"", ""Augustus Odena""]","[""GANs"", ""rejection sampling""]","We propose a rejection sampling scheme using the discriminator of a GAN to +approximately correct errors in the GAN generator distribution. We show that +under quite strict assumptions, this will allow us to recover the data distribution +exactly. We then examine where those strict assumptions break down and design a +practical algorithm—called Discriminator Rejection Sampling (DRS)—that can be +used on real data-sets. Finally, we demonstrate the efficacy of DRS on a mixture of +Gaussians and on the state of the art SAGAN model. On ImageNet, we train an +improved baseline that increases the best published Inception Score from 52.52 to +62.36 and reduces the Frechet Inception Distance from 18.65 to 14.79. 
We then use +DRS to further improve on this baseline, improving the Inception Score to 76.08 +and the FID to 13.75.",/pdf/01d175e8110f68db11df1f82b37c37d44f54ab3a.pdf,ICLR,2019,We use a GAN discriminator to perform an approximate rejection sampling scheme on the output of the GAN generator. +Hkx6p6EFDr,B1xhc7bdvS,1569440000000.0,1577170000000.0,835,Equivariant Entity-Relationship Networks,"[""drgraham@cs.ubc.ca"", ""siamak@cs.mcgill.ca""]","[""Devon Graham"", ""Siamak Ravanbakhsh""]","[""deep learning"", ""relational model"", ""knowledge graph"", ""exchangeability"", ""equivariance""]","Due to its extensive use in databases, the relational model is ubiquitous in representing big-data. However, recent progress in deep learning with relational data has been focused on (knowledge) graphs. In this paper we propose Equivariant Entity-Relationship Networks, the class of parameter-sharing neural networks derived from the entity-relationship model. We prove that our proposed feed-forward layer is the most expressive linear layer under the given equivariance constraints, and subsumes recently introduced equivariant models for sets, exchangeable tensors, and graphs. The proposed feed-forward layer has linear complexity in the data and can be used for both inductive and transductive reasoning about relational databases, including database embedding, and the prediction of missing records. This provides a principled theoretical foundation for the application of deep learning to one of the most abundant forms of data.",/pdf/21dfe09ba3972eea6ef7048626419f4c8f4c791a.pdf,ICLR,2020,We propose a feed-forward layer that is informed by the ER model of relational data and show that it is the most expressive linear layer possible under the given equivariance constraints. 
+8HhkbjrWLdE,EiOpqBL1dF7,1601310000000.0,1615820000000.0,1937,Separation and Concentration in Deep Networks,"[""~John_Zarka1"", ""~Florentin_Guth1"", ""~St\u00e9phane_Mallat1""]","[""John Zarka"", ""Florentin Guth"", ""St\u00e9phane Mallat""]","[""fisher ratio"", ""neural collapse"", ""mean separation"", ""concentration"", ""variance reduction"", ""deep learning"", ""image classification""]","Numerical experiments demonstrate that deep neural network classifiers progressively separate class distributions around their mean, achieving linear separability on the training set, and increasing the Fisher discriminant ratio. We explain this mechanism with two types of operators. We prove that a rectifier without biases applied to sign-invariant tight frames can separate class means and increase Fisher ratios. On the opposite, a soft-thresholding on tight frames can reduce within-class variabilities while preserving class means. Variance reduction bounds are proved for Gaussian mixture models. For image classification, we show that separation of class means can be achieved with rectified wavelet tight frames that are not learned. It defines a scattering transform. Learning $1 \times 1$ convolutional tight frames along scattering channels and applying a soft-thresholding reduces within-class variabilities. 
The resulting scattering network reaches the classification accuracy of ResNet-18 on CIFAR-10 and ImageNet, with fewer layers and no learned biases.",/pdf/89800c3664ef3d1e88e5560caa77d60409b77113.pdf,ICLR,2021, +Bkx_Dj09tQ,HJlJtXu9KQ,1538090000000.0,1545360000000.0,276,Causal importance of orientation selectivity for generalization in image recognition,"[""i.love.ny517@gmail.com""]","[""Jumpei Ukita""]","[""deep learning"", ""generalization"", ""selectivity"", ""neuroscience""]","Although both our brain and deep neural networks (DNNs) can perform high-level sensory-perception tasks such as image or speech recognition, the inner mechanism of these hierarchical information-processing systems is poorly understood in both neuroscience and machine learning. Recently, Morcos et al. (2018) examined the effect of class-selective units in DNNs, i.e., units with high-level selectivity, on network generalization, concluding that hidden units that are selectively activated by specific input patterns may harm the network's performance. In this study, we revisit their hypothesis, considering units with selectivity for lower-level features, and argue that selective units are not always harmful to the network performance. Specifically, by using DNNs trained for image classification (7-layer CNNs and VGG16 trained on CIFAR-10 and ImageNet, respectively), we analyzed the orientation selectivity of individual units. Orientation selectivity is a low-level selectivity widely studied in visual neuroscience, in which, when images of bars with several orientations are presented to the eye, many neurons in the visual cortex respond selectively to a specific orientation. We found that orientation-selective units exist in both lower and higher layers of these DNNs, as in our brain. In particular, units in the lower layers become more orientation-selective as the generalization performance improves during the course of training of the DNNs. 
Consistently, networks that generalize better are more orientation-selective in the lower layers. We finally reveal that ablating these selective units in the lower layers substantially degrades the generalization performance, at least by disrupting the shift-invariance of the higher layers. These results suggest to the machine-learning community that, contrary to the triviality of units with high-level selectivity, lower-layer units with selectivity for low-level features can be indispensable for generalization, and for neuroscientists, orientation selectivity can play a causally important role in object recognition.",/pdf/743c0c874489a2621766090811f9e6e67d67b568.pdf,ICLR,2019, +By9iRkWA-,BJto0kb0W,1509130000000.0,1518730000000.0,537,Phase Conductor on Multi-layered Attentions for Machine Comprehension,"[""ult.rui.liu@gmail.com"", ""weiwei@cs.cmu.edu"", ""mwg10.thu@gmail.com"", ""mchikina@gmail.com""]","[""Rui Liu"", ""Wei Wei"", ""Weiguang Mao"", ""Maria Chikina""]","[""Attention Model"", ""Machine Comprehension"", ""Question Answering""]","Attention models have been intensively studied to improve NLP tasks such as machine comprehension via both question-aware passage attention model and self-matching attention model. Our research proposes phase conductor (PhaseCond) for attention models in two meaningful ways. First, PhaseCond, an architecture of multi-layered attention models, consists of multiple phases each implementing a stack of attention layers producing passage representations and a stack of inner or outer fusion layers regulating the information flow. Second, we extend and improve the dot-product attention function for PhaseCond by simultaneously encoding multiple question and passage embedding layers from different perspectives. We demonstrate the effectiveness of our proposed model PhaseCond on the SQuAD dataset, showing that our model significantly outperforms both state-of-the-art single-layered and multiple-layered attention models. 
We deepen our results with new findings via both detailed qualitative analysis and visualized examples showing the dynamic changes through multi-layered attention models.",/pdf/c0b188cfecda69999f182dd338223856d83b265d.pdf,ICLR,2018, +Bkxdqj0cFQ,S1xG1J2cKm,1538090000000.0,1545360000000.0,546,Calibration of neural network logit vectors to combat adversarial attacks,"[""og14775@my.bristol.ac.uk""]","[""Oliver Goldstein""]","[""Adversarial attacks"", ""calibration"", ""probability"", ""adversarial defence""]","Adversarial examples remain an issue for contemporary neural networks. This paper draws on Background Check (Perello-Nieto et al., 2016), a technique in model calibration, to assist two-class neural networks in detecting adversarial examples, using the one dimensional difference between logit values as the underlying measure. This method interestingly tends to achieve the highest average recall on image sets that are generated with large perturbation vectors, which is unlike the existing literature on adversarial attacks (Cubuk et al., 2017). The proposed method does not need knowledge of the attack parameters or methods at training time, unlike a great deal of the literature that uses deep learning based methods to detect adversarial examples, such as Metzen et al. (2017), imbuing the proposed method with additional flexibility.",/pdf/9076625201c7930f54c5f98d51c5d00273de4763.pdf,ICLR,2019,This paper uses principles from the field of calibration in machine learning on the logits of a neural network to defend against adversarial attacks +rJedbn0ctQ,S1es565tY7,1538090000000.0,1548790000000.0,1187,Zero-training Sentence Embedding via Orthogonal Basis,"[""ziyi.yang@stanford.edu"", ""chezhu@microsoft.com"", ""wzchen@microsoft.com""]","[""Ziyi Yang"", ""Chenguang Zhu"", ""Weizhu Chen""]","[""Natural Language Processing"", ""Sentence Embeddings""]","We propose a simple and robust training-free approach for building sentence representations. 
Inspired by the Gram-Schmidt Process in geometric theory, we build an orthogonal basis of the subspace spanned by a word and its surrounding context in a sentence. We model the semantic meaning of a word in a sentence based on two aspects. One is its relatedness to the word vector subspace already spanned by its contextual words. The other is its novel semantic meaning which shall be introduced as a new basis vector perpendicular to this existing subspace. Following this motivation, we develop an innovative method based on orthogonal basis to combine pre-trained word embeddings into sentence representation. This approach requires zero training and zero parameters, along with efficient inference performance. We evaluate our approach on 11 downstream NLP tasks. Experimental results show that our model outperforms all existing zero-training alternatives in all the tasks and it is competitive to other approaches relying on either large amounts of labelled data or prolonged training time.",/pdf/314fca9c6ffaeb179665758d42c410074ad84d25.pdf,ICLR,2019,A simple and training-free approach for sentence embeddings with competitive performance compared with sophisticated models requiring either large amount of training data or prolonged training time. +S1pWFzbAW,HkE-YGbAW,1509140000000.0,1535060000000.0,880,Weightless: Lossy Weight Encoding For Deep Neural Network Compression,"[""reagen@fas.harvard.edu"", ""ugupta@g.harvard.edu"", ""rdadolf@seas.harvard.edu"", ""michaelm@eecs.harvard.edu"", ""srush@seas.harvard.edu"", ""gywei@g.harvard.edu"", ""dbrooks@eecs.harvard.edu""]","[""Brandon Reagen"", ""Udit Gupta"", ""Robert Adolf"", ""Michael Mitzenmacher"", ""Alexander Rush"", ""Gu-Yeon Wei"", ""David Brooks""]","[""Deep Neural Network"", ""Compression"", ""Sparsity""]","The large memory requirements of deep neural networks strain the capabilities of many devices, limiting their deployment and adoption. 
Model compression methods effectively reduce the memory requirements of these models, usually through applying transformations such as weight pruning or quantization. In this paper, we present a novel scheme for lossy weight encoding which complements conventional compression techniques. The encoding is based on the Bloomier filter, a probabilistic data structure that can save space at the cost of introducing random errors. Leveraging the ability of neural networks to tolerate these imperfections and by re-training around the errors, the proposed technique, Weightless, can compress DNN weights by up to 496x; with the same model accuracy, this results in up to a 1.51x improvement over the state-of-the-art.",/pdf/0fde46d7ad51df43d6169c57d40d7d75f898a2ba.pdf,ICLR,2018,We propose a new way to compress neural networks using probabilistic data structures. +rkeiQlBFPB,rJeneBgYvr,1569440000000.0,1583910000000.0,2223,Meta-Learning with Warped Gradient Descent,"[""flennerhag@google.com"", ""andreirusu@google.com"", ""razp@google.com"", ""visin@google.com"", ""hujun.yin@manchester.ac.uk"", ""raia@google.com""]","[""Sebastian Flennerhag"", ""Andrei A. Rusu"", ""Razvan Pascanu"", ""Francesco Visin"", ""Hujun Yin"", ""Raia Hadsell""]","[""meta-learning"", ""transfer learning""]","Learning an efficient update rule from data that promotes rapid learning of new tasks from the same distribution remains an open problem in meta-learning. Typically, previous works have approached this issue either by attempting to train a neural network that directly produces updates or by attempting to learn better initialisations or scaling factors for a gradient-based update rule. Both of these approaches pose challenges. On one hand, directly producing an update forgoes a useful inductive bias and can easily lead to non-converging behaviour. 
On the other hand, approaches that try to control a gradient-based update rule typically resort to computing gradients through the learning process to obtain their meta-gradients, leading to methods that cannot scale beyond few-shot task adaptation. In this work, we propose Warped Gradient Descent (WarpGrad), a method that intersects these approaches to mitigate their limitations. WarpGrad meta-learns an efficiently parameterised preconditioning matrix that facilitates gradient descent across the task distribution. Preconditioning arises by interleaving non-linear layers, referred to as warp-layers, between the layers of a task-learner. Warp-layers are meta-learned without backpropagating through the task training process in a manner similar to methods that learn to directly produce updates. WarpGrad is computationally efficient, easy to implement, and can scale to arbitrarily large meta-learning problems. We provide a geometrical interpretation of the approach and evaluate its effectiveness in a variety of settings, including few-shot, standard supervised, continual and reinforcement learning.",/pdf/1b48cb071f86d09d0b46302f6b7643afeeff0dc1.pdf,ICLR,2020,"We propose a novel framework for meta-learning a gradient-based update rule that scales beyond few-shot learning and is applicable to any form of learning, including continual learning." +a-xFK8Ymz5J,sBksTq2sPoa,1601310000000.0,1615930000000.0,1087,DiffWave: A Versatile Diffusion Model for Audio Synthesis,"[""z4kong@eng.ucsd.edu"", ""~Wei_Ping1"", ""~Jiaji_Huang1"", ""kexinzhao@baidu.com"", ""~Bryan_Catanzaro1""]","[""Zhifeng Kong"", ""Wei Ping"", ""Jiaji Huang"", ""Kexin Zhao"", ""Bryan Catanzaro""]","[""diffusion probabilistic models"", ""audio synthesis"", ""speech synthesis"", ""generative models""]","In this work, we propose DiffWave, a versatile diffusion probabilistic model for conditional and unconditional waveform generation. 
The model is non-autoregressive, and converts the white noise signal into structured waveform through a Markov chain with a constant number of steps at synthesis. It is efficiently trained by optimizing a variant of variational bound on the data likelihood. DiffWave produces high-fidelity audios in different waveform generation tasks, including neural vocoding conditioned on mel spectrogram, class-conditional generation, and unconditional generation. We demonstrate that DiffWave matches a strong WaveNet vocoder in terms of speech quality (MOS: 4.44 versus 4.43), while synthesizing orders of magnitude faster. In particular, it significantly outperforms autoregressive and GAN-based waveform models in the challenging unconditional generation task in terms of audio quality and sample diversity from various automatic and human evaluations.",/pdf/d27840fc3a835c4da4a9d13c4227c7a0d8a9b3c5.pdf,ICLR,2021,"DiffWave is a versatile diffusion probabilistic model for waveform generation, which matches the state-of-the-art neural vocoder in terms of quality and can generate abundant realistic voices in time-domain without any conditional information." +Xa3iM4C1nqd,0-2s6-N_GEh,1601310000000.0,1614990000000.0,1933,Transferable Unsupervised Robust Representation Learning,"[""~De-An_Huang1"", ""~Zhiding_Yu1"", ""~Anima_Anandkumar1""]","[""De-An Huang"", ""Zhiding Yu"", ""Anima Anandkumar""]","[""unsupervised representation learning"", ""robustness"", ""transfer learning""]","Robustness is an important, and yet, under-explored aspect of unsupervised representation learning, which has seen a lot of recent developments. In this work, we address this gap by developing a novel framework: Unsupervised Robust Representation Learning (URRL), which combines unsupervised representation learning's pretext task and robust supervised learning (e.g., AugMix). 
Moreover, it is commonly assumed that there needs to be a trade-off between natural accuracy (on clean data) and robust accuracy (on corrupted data). We upend this view and show that URRL improves both the natural accuracy of unsupervised representation learning and its robustness to corruptions and adversarial noise. A further challenge is that the robustness of a representation might not be preserved in the transfer learning process after fine-tuning on downstream tasks. We develop transferable robustness by proposing a task-agnostic similarity regularization during the fine-tuning process. We show that this improves the robustness of the resulting model without the need for any adversarial training or further data augmentation during fine-tuning.",/pdf/fd08732ac2f8670995a82a7c1d58518fb75ccf60.pdf,ICLR,2021, +BJfOXnActQ,SJgrzW05FX,1538090000000.0,1551310000000.0,1374,Learning to Learn with Conditional Class Dependencies,"[""xiang.jiang@dal.ca"", ""mohammad@imagia.com"", ""f.varno@dal.ca"", ""gabriel@imagia.com"", ""nic@imagia.com"", ""stan@cs.dal.ca""]","[""Xiang Jiang"", ""Mohammad Havaei"", ""Farshid Varno"", ""Gabriel Chartrand"", ""Nicolas Chapados"", ""Stan Matwin""]","[""meta-learning"", ""learning to learn"", ""few-shot learning""]","Neural networks can learn to extract statistical properties from data, but they seldom make use of structured information from the label space to help representation learning. Although some label structure can implicitly be obtained when training on huge amounts of data, in a few-shot learning context where little data is available, making explicit use of the label structure can inform the model to reshape the representation space to reflect a global sense of class dependencies. We propose a meta-learning framework, Conditional class-Aware Meta-Learning (CAML), that conditionally transforms feature representations based on a metric space that is trained to capture inter-class dependencies. 
This enables a conditional modulation of the feature representations of the base-learner to impose regularities informed by the label space. Experiments show that the conditional transformation in CAML leads to more disentangled representations and achieves competitive results on the miniImageNet benchmark.",/pdf/9cdd30ccaf1b969ff619df8cdb271cb4d3e2371e.pdf,ICLR,2019,CAML is an instance of MAML with conditional class dependencies. +r1g4E3C9t7,rkeToK69tQ,1538090000000.0,1551760000000.0,1439,Characterizing Audio Adversarial Examples Using Temporal Dependency,"[""lucas110550@sjtu.edu.cn"", ""lxbosky@gmail.com"", ""pin-yu.chen@ibm.com"", ""dawnsong@gmail.com""]","[""Zhuolin Yang"", ""Bo Li"", ""Pin-Yu Chen"", ""Dawn Song""]","[""audio adversarial example"", ""mitigation"", ""detection"", ""machine learning""]","Recent studies have highlighted adversarial examples as a ubiquitous threat to different neural network models and many downstream applications. Nonetheless, as unique data properties have inspired distinct and powerful learning principles, this paper aims to explore their potentials towards mitigating adversarial inputs. In particular, our results reveal the importance of using the temporal dependency in audio data to gain discriminate power against adversarial examples. Tested on the automatic speech recognition (ASR) tasks and three recent audio adversarial attacks, we find that (i) input transformation developed from image adversarial defense provides limited robustness improvement and is subtle to advanced attacks; (ii) temporal dependency can be exploited to gain discriminative power against audio adversarial examples and is resistant to adaptive attacks considered in our experiments. 
Our results not only show promising means of improving the robustness of ASR systems, but also offer novel insights in exploiting domain-specific data properties to mitigate negative effects of adversarial examples.",/pdf/312fc5f40208f39aa6860b54ece6696cd4397424.pdf,ICLR,2019,Adversarial audio discrimination using temporal dependency +fm58XfadSTF,TMdWO8lOpd,1601310000000.0,1614990000000.0,2996,Learning a Max-Margin Classifier for Cross-Domain Sentiment Analysis,"[""~Mohammad_Rostami1"", ""~Aram_Galstyan1""]","[""Mohammad Rostami"", ""Aram Galstyan""]","[""natural language processing"", ""sentiment analysis"", ""cross-domain data representation"", ""distribution alignment""]"," Sentiment analysis is a costly yet necessary task for enterprises to study the opinions of their customers to improve their products and services and to determine optimal marketing strategies. Due to the existence of a wide range of domains across different products and services, cross-domain sentiment analysis methods have received significant attention in recent years. These methods mitigate the domain gap between different applications by training cross-domain generalizable classifiers which help to relax the need for individual data annotation for each domain. Most existing methods focus on learning domain-agnostic representations that are invariant with respect to both the source and the target domains. As a result, a classifier that is trained using annotated data in a source domain, would generalize well in a related target domain. In this work, we introduce a new domain adaptation method which induces large margins between different classes in an embedding space based on the notion of prototypical distribution. This embedding space is trained to be domain-agnostic by matching the data distributions across the domains. Large margins in the source domain help to reduce the effect of ``domain shift'' on the performance of a trained classifier in the target domain. 
Theoretical and empirical analyses are provided to demonstrate that the method is effective. ",/pdf/855c64d55a3f8ccd8976ca89c74dfe29da0ccbaa.pdf,ICLR,2021,This paper introduces a new method for mitigating the domain shift problem in cross-domain sentiment classification by inducing large margins between classes in a source domain. +H1gfFaEYDS,SyeDBn3PDH,1569440000000.0,1583910000000.0,662,Adversarially Robust Representations with Smooth Encoders,"[""taylancemgil@google.com"", ""sumedhg@google.com"", ""dvij@google.com"", ""pushmeet@google.com""]","[""Taylan Cemgil"", ""Sumedh Ghaisas"", ""Krishnamurthy (Dj) Dvijotham"", ""Pushmeet Kohli""]","[""Adversarial Learning"", ""Robust Representations"", ""Variational AutoEncoder"", ""Wasserstein Distance"", ""Variational Inference""]","This paper studies the undesired phenomenon of over-sensitivity of representations learned by deep networks to semantically-irrelevant changes in data. We identify a cause for this shortcoming in the classical Variational Auto-encoder (VAE) objective, the evidence lower bound (ELBO). We show that the ELBO fails to control the behaviour of the encoder out of the support of the empirical data distribution and this behaviour of the VAE can lead to extreme errors in the learned representation. This is a key hurdle in the effective use of representations for data-efficient learning and transfer. To address this problem, we propose to augment the data with specifications that enforce insensitivity of the representation with respect to families of transformations. To incorporate these specifications, we propose a regularization method that is based on a selection mechanism that creates a fictive data point by explicitly perturbing an observed true data point. For certain choices of parameters, our formulation naturally leads to the minimization of the entropy regularized Wasserstein distance between representations. 
We illustrate our approach on standard datasets and experimentally show that significant improvements in the downstream adversarial accuracy can be achieved by learning robust representations completely in an unsupervised manner, without a reference to a particular downstream task and without a costly supervised adversarial training procedure. +",/pdf/028b6b32416c54e7f696c014e41d55866a6752a6.pdf,ICLR,2020,We propose a method for computing adversarially robust representations in an entirely unsupervised way. +HyPpD0g0Z,Byw6PCgRW,1509120000000.0,1518730000000.0,454,Grouping-By-ID: Guarding Against Adversarial Domain Shifts,"[""heinzedeml@stat.math.ethz.ch"", ""meinshausen@stat.math.ethz.ch""]","[""Christina Heinze-Deml"", ""Nicolai Meinshausen""]","[""supervised representation learning"", ""causality"", ""interpretability"", ""transfer learning""]","When training a deep neural network for supervised image classification, one can broadly distinguish between two types of latent features of images that will drive the classification of class Y. Following the notation of Gong et al. (2016), we can divide features broadly into the classes of (i) “core” or “conditionally invariant” features X^ci whose distribution P(X^ci | Y) does not change substantially across domains and (ii) “style” or “orthogonal” features X^orth whose distribution P(X^orth | Y) can change substantially across domains. These latter orthogonal features would generally include features such as position, rotation, image quality or brightness but also more complex ones like hair color or posture for images of persons. We try to guard against future adversarial domain shifts by ideally just using the “conditionally invariant” features for classification. In contrast to previous work, we assume that the domain itself is not observed and hence a latent variable. We can hence not directly see the distributional change of features across different domains. 
+ +We do assume, however, that we can sometimes observe a so-called identifier or ID variable. We might know, for example, that two images show the same person, with ID referring to the identity of the person. In data augmentation, we generate several images from the same original image, with ID referring to the relevant original image. The method requires only a small fraction of images to have an ID variable. + +We provide a causal framework for the problem by adding the ID variable to the model of Gong et al. (2016). However, we are interested in settings where we cannot observe the domain directly and we treat domain as a latent variable. If two or more samples share the same class and identifier, (Y, ID)=(y,i), then we treat those samples as counterfactuals under different style interventions on the orthogonal or style features. Using this grouping-by-ID approach, we regularize the network to provide near constant output across samples that share the same ID by penalizing with an appropriate graph Laplacian. This is shown to substantially improve performance in settings where domains change in terms of image quality, brightness, color changes, and more complex changes such as changes in movement and posture. We show links to questions of interpretability, fairness and transfer learning.",/pdf/d26516496c5aff05a1635f147b6d98412945952d.pdf,ICLR,2018,"We propose counterfactual regularization to guard against adversarial domain shifts arising through shifts in the distribution of latent ""style features"" of images." 
+xYGNO86OWDH,RNP9zKeTut8,1601310000000.0,1615940000000.0,328,Isotropy in the Contextual Embedding Space: Clusters and Manifolds,"[""~Xingyu_Cai1"", ""~Jiaji_Huang1"", ""~Yuchen_Bian1"", ""~Kenneth_Church1""]","[""Xingyu Cai"", ""Jiaji Huang"", ""Yuchen Bian"", ""Kenneth Church""]","[""Contextual embedding space"", ""Isotropy"", ""Clusters"", ""Manifolds""]","The geometric properties of contextual embedding spaces for deep language models such as BERT and ERNIE, have attracted considerable attention in recent years. Investigations on the contextual embeddings demonstrate a strong anisotropic space such that most of the vectors fall within a narrow cone, leading to high cosine similarities. It is surprising that these LMs are as successful as they are, given that most of their embedding vectors are as similar to one another as they are. In this paper, we argue that the isotropy indeed exists in the space, from a different but more constructive perspective. We identify isolated clusters and low dimensional manifolds in the contextual embedding space, and introduce tools to both qualitatively and quantitatively analyze them. We hope the study in this paper could provide insights towards a better understanding of the deep language models.",/pdf/8b00c8e698e9a810bfcee44a4ae5f6c3adeb7266.pdf,ICLR,2021,"This paper reveals isotropy in the clustered contextual embedding space, and found low-dimensional manifolds in there." 
+InGI-IMDL18,rBIIoPB9I0h,1601310000000.0,1614990000000.0,3325,Secure Federated Learning of User Verification Models,"[""~Hossein_Hosseini4"", ""hyunsinp@qti.qualcomm.com"", ""~Sungrack_Yun1"", ""~Christos_Louizos1"", ""jsoriaga@qti.qualcomm.com"", ""mwelling@qti.qualcomm.com""]","[""Hossein Hosseini"", ""Hyunsin Park"", ""Sungrack Yun"", ""Christos Louizos"", ""Joseph Soriaga"", ""Max Welling""]","[""Federated learning"", ""User verification models""]","We consider the problem of training User Verification (UV) models in federated setup, where the conventional loss functions are not applicable due to the constraints that each user has access to the data of only one class and user embeddings cannot be shared with the server or other users. To address this problem, we propose Federated User Verification (FedUV), a framework for private and secure training of UV models. In FedUV, users jointly learn a set of vectors and maximize the correlation of their instance embeddings with a secret user-defined linear combination of those vectors. We show that choosing the linear combinations from the codewords of an error-correcting code allows users to collaboratively train the model without revealing their embedding vectors. We present the experimental results for user verification with voice, face, and handwriting data and show that FedUV is on par with existing approaches, while not sharing the embeddings with other users or the server.",/pdf/43ee547c4d213ec43fd40833eac98d95ad0adf2f.pdf,ICLR,2021,We propose a private and secure method for training user verification models in federated setup. 
+HklXn1BKDH,SJxd3xyFDH,1569440000000.0,1586540000000.0,1945,Learning To Explore Using Active Neural SLAM,"[""chaplot@cs.cmu.edu"", ""dhirajgandhi@fb.com"", ""saurabhg@illinois.edu"", ""abhinavg@cs.cmu.edu"", ""rsalakhu@cs.cmu.edu""]","[""Devendra Singh Chaplot"", ""Dhiraj Gandhi"", ""Saurabh Gupta"", ""Abhinav Gupta"", ""Ruslan Salakhutdinov""]","[""Navigation"", ""Exploration""]","This work presents a modular and hierarchical approach to learn policies for exploring 3D environments, called `Active Neural SLAM'. Our approach leverages the strengths of both classical and learning-based methods, by using analytical path planners with learned SLAM module, and global and local policies. The use of learning provides flexibility with respect to input modalities (in the SLAM module), leverages structural regularities of the world (in global policies), and provides robustness to errors in state estimation (in local policies). Such use of learning within each module retains its benefits, while at the same time, hierarchical decomposition and modular training allow us to sidestep the high sample complexities associated with training end-to-end policies. Our experiments in visually and physically realistic simulated 3D environments demonstrate the effectiveness of our approach over past learning and geometry-based approaches. The proposed model can also be easily transferred to the PointGoal task and was the winning entry of the CVPR 2019 Habitat PointGoal Navigation Challenge.",/pdf/071ce51856763401b2c0898d6fdbdbe3f0800d03.pdf,ICLR,2020,A modular and hierarchical approach to learn policies for exploring 3D environments. 
+BkgWahEFvr,SJxRbmM7vB,1569440000000.0,1583910000000.0,216,Enhancing Transformation-Based Defenses Against Adversarial Attacks with a Distribution Classifier,"[""conniekoukl@gmail.com"", ""leehk@bii.a-star.edu.sg"", ""changec@comp.nus.edu.sg"", ""ngtk@comp.nus.edu.sg""]","[""Connie Kou"", ""Hwee Kuan Lee"", ""Ee-Chien Chang"", ""Teck Khim Ng""]","[""adversarial attack"", ""transformation defenses"", ""distribution classifier""]","Adversarial attacks on convolutional neural networks (CNN) have gained significant attention and there have been active research efforts on defense mechanisms. Stochastic input transformation methods have been proposed, where the idea is to recover the image from adversarial attack by random transformation, and to take the majority vote as consensus among the random samples. However, the transformation improves the accuracy on adversarial images at the expense of the accuracy on clean images. While it is intuitive that the accuracy on clean images would deteriorate, the exact mechanism by which this occurs is unclear. In this paper, we study the distribution of softmax induced by stochastic transformations. We observe that with random transformations on the clean images, although the mass of the softmax distribution could shift to the wrong class, the resulting distribution of softmax could be used to correct the prediction. Furthermore, on the adversarial counterparts, with the image transformation, the resulting shapes of the distribution of softmax are similar to the distributions from the clean images. With these observations, we propose a method to improve existing transformation-based defenses. We train a separate lightweight distribution classifier to recognize distinct features in the distributions of softmax outputs of transformed images. Our empirical studies show that our distribution classifier, by training on distributions obtained from clean images only, outperforms majority voting for both clean and adversarial images. 
Our method is generic and can be integrated with existing transformation-based defenses.",/pdf/14e70a1bd7f64427483ed696934ed0fc639d8466.pdf,ICLR,2020,We enhance existing transformation-based defenses by using a distribution classifier on the distribution of softmax obtained from transformed images. +rJgzzJHtDB,rJeSweh_wB,1569440000000.0,1583910000000.0,1571,"Triple Wins: Boosting Accuracy, Robustness and Efficiency Together by Enabling Input-Adaptive Inference","[""tkhu@tamu.edu"", ""wiwjp619@tamu.edu"", ""htwang@tamu.edu"", ""atlaswang@tamu.edu""]","[""Ting-Kuei Hu"", ""Tianlong Chen"", ""Haotao Wang"", ""Zhangyang Wang""]","[""adversarial robustness"", ""efficient inference""]","Deep networks were recently suggested to face the odds between accuracy (on clean natural images) and robustness (on adversarially perturbed images) (Tsipras et al., 2019). Such a dilemma is shown to be rooted in the inherently higher sample complexity (Schmidt et al., 2018) and/or model capacity (Nakkiran, 2019), for learning a high-accuracy and robust classifier. In view of that, given a classification task, growing the model capacity appears to help draw a win-win between accuracy and robustness, yet at the expense of model size and latency, therefore posing challenges for resource-constrained applications. Is it possible to co-design model accuracy, robustness and efficiency to achieve their triple wins? This paper studies multi-exit networks associated with input-adaptive efficient inference, showing their strong promise in achieving a “sweet point"" in co-optimizing model accuracy, robustness, and efficiency. Our proposed solution, dubbed Robust Dynamic Inference Networks (RDI-Nets), allows for each input (either clean or adversarial) to adaptively choose one of the multiple output layers (early branches or the final one) to output its prediction. 
That multi-loss adaptivity adds new variations and flexibility to adversarial attacks and defenses, on which we present a systematic investigation. We show experimentally that by equipping existing backbones with such robust adaptive inference, the resulting RDI-Nets can achieve better accuracy and robustness, yet with over 30% computational savings, compared to the defended original models. +",/pdf/56607f5a22da68d17a77097c6f3a5785777e7fd4.pdf,ICLR,2020,"Is it possible to co-design model accuracy, robustness and efficiency to achieve their triple wins? Yes!" +ByxduJBtPB,rkg--bAODr,1569440000000.0,1577170000000.0,1809,When Covariate-shifted Data Augmentation Increases Test Error And How to Fix It,"[""xie@cs.stanford.edu"", ""aditir@stanford.edu"", ""fannyang@stanford.edu"", ""jduchi@stanford.edu"", ""pliang@cs.stanford.edu""]","[""Sang Michael Xie*"", ""Aditi Raghunathan*"", ""Fanny Yang"", ""John C. Duchi"", ""Percy Liang""]","[""data augmentation"", ""adversarial training"", ""interpolation"", ""overparameterized""]","Empirically, data augmentation sometimes improves and sometimes hurts test error, even when only adding points with labels from the true conditional distribution that the hypothesis class is expressive enough to fit. In this paper, we provide precise conditions under which data augmentation hurts test accuracy for minimum norm estimators in linear regression. To mitigate the failure modes of augmentation, we introduce X-regularization, which uses unlabeled data to regularize the parameters towards the non-augmented estimate. 
We prove that our new estimator never hurts test error and exhibits significant improvements over adversarial data augmentation on CIFAR-10.",/pdf/71826e1ada7ed5951cbf4f1c2db03779569cd7e8.pdf,ICLR,2020, +HJezF3VYPB,HJl23-RcIS,1569440000000.0,1583910000000.0,70,Federated Adversarial Domain Adaptation,"[""xpeng@bu.edu"", ""zijun.huang@columbia.edu"", ""yizhe.zhu@rutgers.edu"", ""saenko@bu.edu""]","[""Xingchao Peng"", ""Zijun Huang"", ""Yizhe Zhu"", ""Kate Saenko""]","[""Federated Learning"", ""Domain Adaptation"", ""Transfer Learning"", ""Feature Disentanglement""]","Federated learning improves data privacy and efficiency in machine learning performed over networks of distributed devices, such as mobile phones, IoT and wearable devices, etc. Yet models trained with federated learning can still fail to generalize to new devices due to the problem of domain shift. Domain shift occurs when the labeled data collected by source nodes statistically differs from the target node's unlabeled data. In this work, we present a principled approach to the problem of federated domain adaptation, which aims to align the representations learned among the different nodes with the data distribution of the target node. Our approach extends adversarial adaptation techniques to the constraints of the federated setting. In addition, we devise a dynamic attention mechanism and leverage feature disentanglement to enhance knowledge transfer. Empirically, we perform extensive experiments on several image and text classification tasks and show promising results under unsupervised federated domain adaptation setting.",/pdf/0a10e9439edcc36581111096b67db724743508d3.pdf,ICLR,2020,"we present a principled approach to the problem of federated domain adaptation, which aims to align the representations learned among the different nodes with the data distribution of the target node." 
+H1gNHs05FX,ryljFK6COX,1538090000000.0,1545360000000.0,78,Clinical Risk: wavelet reconstruction networks for marked point processes,"[""jeremy.weiss@gmail.com""]","[""Jeremy C. Weiss""]","[""point processes"", ""wavelets"", ""temporal neural networks"", ""Hawkes processes""]","Timestamped sequences of events, pervasive in domains with data logs, e.g., health records, are often modeled as point processes with rate functions over time. Leading classical methods for risk scores such as Cox and Hawkes processes use such data but make strong assumptions about the shape and form of multivariate influences, resulting in time-to-event distributions irreflective of many real world processes. Recent methods in point processes and recurrent neural networks capably model rate functions but may be complex and difficult to interrogate. Our work develops a high-performing, interrogable model. We introduce wavelet reconstruction networks, a multivariate point process with a sparse wavelet reconstruction kernel to model rate functions from marked, timestamped data. We show they achieve improved performance and interrogability over baselines in forecasting complications and scheduled care visits in patients with diabetes.",/pdf/020b6d396edd96e1e21f7958eb6f2e7147080316.pdf,ICLR,2019,"Wavelet reconstructions on relative time, used in absolute-time point process models, improve risk prediction of complications and adherence in diabetes." 
+V6WHleb2nV,_6hCKvpeKna,1601310000000.0,1614990000000.0,597,Data Transfer Approaches to Improve Seq-to-Seq Retrosynthesis,"[""~Katsuhiko_Ishiguro1"", ""ujihara@preferred.jp"", ""rsawada@preferred.jp"", ""~Hirotaka_Akita1"", ""kotera@preferred.jp""]","[""Katsuhiko Ishiguro"", ""Kazuya Ujihara"", ""Ryohto Sawada"", ""Hirotaka Akita"", ""Masaaki Kotera""]","[""retrosynthesis"", ""data transfer"", ""transfer learninig"", ""pre-training"", ""fine-tuning"", ""self-training""]","Retrosynthesis is a problem to infer reactant compounds to synthesize a given +product compound through chemical reactions. Recent studies on retrosynthesis +focus on proposing more sophisticated prediction models, but the dataset to feed +the models also plays an essential role in achieving the best generalizing models. +Generally, a dataset that is best suited for a specific task tends to be small. In +such a case, it is the standard solution to transfer knowledge from a large or +clean dataset in the same domain. In this paper, we conduct a systematic and +intensive examination of data transfer approaches on end-to-end generative models, +in application to retrosynthesis. Experimental results show that typical data transfer +methods can improve test prediction scores of an off-the-shelf Transformer baseline +model. Especially, the pre-training plus fine-tuning approach boosts the accuracy +scores of the baseline, achieving the new state-of-the-art. In addition, we conduct a +manual inspection for the erroneous prediction results. The inspection shows that +the pre-training plus fine-tuning models can generate chemically appropriate or +sensible proposals in almost all cases.",/pdf/67fc2c63d368a370b9f8b06c5d71a2553ced93e0.pdf,ICLR,2021,"Data Transfer improves the retrosynthesis models greatly, achieving new SotA with a simpler model. 
" +H1goBoR9F7,SyxU3JiKKQ,1538090000000.0,1548310000000.0,114,Dynamic Sparse Graph for Efficient Deep Learning,"[""liu_liu@ucsb.edu"", ""leideng@ucsb.edu"", ""huxing@ece.ucsb.edu"", ""maohuazhu@ucsb.edu"", ""liguoqi@mail.tsinghua.edu.cn"", ""yufeiding@cs.ucsb.edu"", ""yuanxie@ucsb.edu""]","[""Liu Liu"", ""Lei Deng"", ""Xing Hu"", ""Maohua Zhu"", ""Guoqi Li"", ""Yufei Ding"", ""Yuan Xie""]","[""Sparsity"", ""compression"", ""training"", ""acceleration""]","We propose to execute deep neural networks (DNNs) with dynamic and sparse graph (DSG) structure for compressive memory and accelerative execution during both training and inference. The great success of DNNs motivates the pursuing of lightweight models for the deployment onto embedded devices. However, most of the previous studies optimize for inference while neglect training or even complicate it. Training is far more intractable, since (i) the neurons dominate the memory cost rather than the weights in inference; (ii) the dynamic activation makes previous sparse acceleration via one-off optimization on fixed weight invalid; (iii) batch normalization (BN) is critical for maintaining accuracy while its activation reorganization damages the sparsity. To address these issues, DSG activates only a small amount of neurons with high selectivity at each iteration via a dimensionreduction search and obtains the BN compatibility via a double-mask selection. Experiments show significant memory saving (1.7-4.5x) and operation reduction (2.3-4.4x) with little accuracy loss on various benchmarks.",/pdf/08d1bc6386b7bc30bc01d298f53d36f0f7fc000e.pdf,ICLR,2019,We construct dynamic sparse graph via dimension-reduction search to reduce compute and memory cost in both DNN training and inference. 
+HJl8_eHYvS,S1lZ63gKDr,1569440000000.0,1583910000000.0,2399,Discriminative Particle Filter Reinforcement Learning for Complex Partial observations,"[""xiao-ma@comp.nus.edu.sg"", ""karkus@comp.nus.edu.sg"", ""dyhsu@comp.nus.edu.sg"", ""leews@comp.nus.edu.sg"", ""nan.ye@uq.edu.au""]","[""Xiao Ma"", ""Peter Karkus"", ""David Hsu"", ""Wee Sun Lee"", ""Nan Ye""]","[""Reinforcement Learning"", ""Partial Observability"", ""Differentiable Particle Filtering""]","Deep reinforcement learning is successful in decision making for sophisticated games, such as Atari, Go, etc. +However, real-world decision making often requires reasoning with partial information extracted from complex visual observations. This paper presents Discriminative Particle Filter Reinforcement Learning (DPFRL), a new reinforcement learning framework for complex partial observations. DPFRL encodes a differentiable particle filter in the neural network policy for explicit reasoning with partial observations over time. The particle filter maintains a belief using learned discriminative update, which is trained end-to-end for decision making. We show that using the discriminative update instead of standard generative models results in significantly improved performance, especially for tasks with complex visual observations, because they circumvent the difficulty of modeling complex observations that are irrelevant to decision making. +In addition, to extract features from the particle belief, we propose a new type of belief feature based on the moment generating function. DPFRL outperforms state-of-the-art POMDP RL models in Flickering Atari Games, an existing POMDP RL benchmark, and in Natural Flickering Atari Games, a new, more challenging POMDP RL benchmark introduced in this paper. 
Further, DPFRL performs well for visual navigation with real-world data in the Habitat environment.",/pdf/932e12dea690c10acc516cab2fd75abb50368fef.pdf,ICLR,2020,"We introduce DPFRL, a framework for reinforcement learning under partial and complex observations with an importance-weighted particle filter" +ELiYxj9JlyW,2vgHP53HaQx,1601310000000.0,1614990000000.0,1208,ME-MOMENTUM: EXTRACTING HARD CONFIDENT EXAMPLES FROM NOISILY LABELED DATA,"[""~Yingbin_Bai1"", ""~Tongliang_Liu1""]","[""Yingbin Bai"", ""Tongliang Liu""]","[""label noise"", ""hard confident examples""]","Examples that are close to the decision boundary—that we term hard examples, are essential to shaping accurate classifiers. Extracting confident examples has been widely studied in the community of learning with noisy labels. However, it remains elusive how to extract hard confident examples from the noisy training data. In this paper, we propose a deep learning paradigm to solve this problem, which is built on the memorization effect of deep neural networks that they would first learn simple patterns, i.e., which are defined by these shared by multiple training examples. To extract hard confident examples that contain non-simple patterns and are entangled with the inaccurately labeled examples, we borrow the idea of momentum from physics. Specifically, we alternately update the confident examples and refine the classifier. Note that the extracted confident examples in the previous round can be exploited to learn a better classifier and that the better classifier will help identify better (and hard) confident examples. We call the approach the “Momentum of Memorization” (Me-Momentum). 
Empirical results on benchmark-simulated and real-world label-noise data illustrate the effectiveness of Me-Momentum for extracting hard confident examples, leading to better classification performance.",/pdf/4f93e2f30d8a0027c7bc4d4b456838e4e9c3c6fa.pdf,ICLR,2021,"In this work, we try to address the label noise problem by extracting hard confident examples." +BJena3VtwS,Hyg4m3kEwH,1569440000000.0,1577170000000.0,241,The Visual Task Adaptation Benchmark,"[""xzhai@google.com"", ""jpuigcerver@google.com"", ""alexander.kolesnikoff@gmail.com"", ""pierrot@google.com"", ""rikel@googel.com"", ""lucic@google.com"", ""josipd@google.com"", ""andresp@google.com"", ""maximneumann@google.com"", ""adosovitskiy@gmail.com"", ""lbeyer@google.com"", ""bachem@google.com"", ""tschannen@google.com"", ""michalski@google.com"", ""obousquet@google.com"", ""sylvaingelly@google.com"", ""neilhoulsby@google.com""]","[""Xiaohua Zhai"", ""Joan Puigcerver"", ""Alexander Kolesnikov"", ""Pierre Ruyssen"", ""Carlos Riquelme"", ""Mario Lucic"", ""Josip Djolonga"", ""Andre Susano Pinto"", ""Maxim Neumann"", ""Alexey Dosovitskiy"", ""Lucas Beyer"", ""Olivier Bachem"", ""Michael Tschannen"", ""Marcin Michalski"", ""Olivier Bousquet"", ""Sylvain Gelly"", ""Neil Houlsby""]","[""representation learning"", ""self-supervised learning"", ""benchmark"", ""large-scale study""]","Representation learning promises to unlock deep learning for the long tail of vision tasks without expansive labelled datasets. Yet, the absence of a unified yardstick to evaluate general visual representations hinders progress. Many sub-fields promise representations, but each has different evaluation protocols that are either too constrained (linear classification), limited in scope (ImageNet, CIFAR, Pascal-VOC), or only loosely related to representation quality (generation). We present the Visual Task Adaptation Benchmark (VTAB): a diverse, realistic, and challenging benchmark to evaluate representations. 
VTAB embodies one principle: good representations adapt to unseen tasks with few examples. We run a large VTAB study of popular algorithms, answering questions like: How effective are ImageNet representation on non-standard datasets? Are generative models competitive? Is self-supervision useful if one already has labels?",/pdf/b08d94d95074cf2bf336fccf622b5aca525b6314.pdf,ICLR,2020,"VTAB is a unified, realistic, and challenging benchmark for general visual representation learning. With it, we evaluate many methods." +kcqSDWySoy,Ea0J2JXBNO_p,1601310000000.0,1614990000000.0,759,Sobolev Training for the Neural Network Solutions of PDEs,"[""~Hwijae_Son1"", ""jangjinw@iam.uni-bonn.de"", ""wjhan@postech.ac.kr"", ""~Hyung_Ju_Hwang1""]","[""Hwijae Son"", ""Jin Woo Jang"", ""Woo Jin Han"", ""Hyung Ju Hwang""]","[""Sobolev Training"", ""Partial Differential Equations"", ""Neural Networks"", ""Convergence""]","Approximating the numerical solutions of partial differential equations (PDEs) using neural networks is a promising application of deep learning. The smooth architecture of a fully connected neural network is appropriate for finding the solutions of PDEs; the corresponding loss function can also be intuitively designed and guarantees the convergence for various kinds of PDEs. However, the rate of convergence has been considered as a weakness of this approach. This paper introduces a novel loss function for the training of neural networks to find the solutions of PDEs, making the training substantially efficient. Inspired by the recent studies that incorporate derivative information for the training of neural networks, we develop a loss function that guides a neural network to reduce the error in the corresponding Sobolev space. Surprisingly, a simple modification of the loss function can make the training process similar to Sobolev Training although solving PDEs with neural networks is not a fully supervised learning task. 
We provide several theoretical justifications for such an approach for the viscous Burgers equation and the kinetic Fokker--Planck equation. We also present several simulation results, which show that compared with the traditional $L^2$ loss function, the proposed loss function guides the neural network to a significantly faster convergence. Moreover, we provide the empirical evidence that shows that the proposed loss function, together with the iterative sampling techniques, performs better in solving high dimensional PDEs.",/pdf/a3889f68f9a0293ad28f5d3c0341700b69077de2.pdf,ICLR,2021,We propose a class of novel loss functions for efficient training when solving PDEs using neural networks. +B1hcZZ-AW,B1qqbZWAZ,1509130000000.0,1519420000000.0,653,N2N learning: Network to Network Compression via Policy Gradient Reinforcement Learning,"[""anubhava@andrew.cmu.edu"", ""nrhineha@cs.cmu.edu"", ""fares.beainy@volvo.com"", ""kkitani@cs.cmu.edu""]","[""Anubhav Ashok"", ""Nicholas Rhinehart"", ""Fares Beainy"", ""Kris M. Kitani""]","[""Deep learning"", ""Neural networks"", ""Model compression""]","While bigger and deeper neural network architectures continue to advance the state-of-the-art for many computer vision tasks, real-world adoption of these networks is impeded by hardware and speed constraints. Conventional model compression methods attempt to address this problem by modifying the architecture manually or using pre-defined heuristics. Since the space of all reduced architectures is very large, modifying the architecture of a deep neural network in this way is a difficult task. In this paper, we tackle this issue by introducing a principled method for learning reduced network architectures in a data-driven way using reinforcement learning. Our approach takes a larger 'teacher' network as input and outputs a compressed 'student' network derived from the 'teacher' network. 
In the first stage of our method, a recurrent policy network aggressively removes layers from the large 'teacher' model. In the second stage, another recurrent policy network carefully reduces the size of each remaining layer. The resulting network is then evaluated to obtain a reward -- a score based on the accuracy and compression of the network. Our approach uses this reward signal with policy gradients to train the policies to find a locally optimal student network. Our experiments show that we can achieve compression rates of more than 10x for models such as ResNet-34 while maintaining similar performance to the input 'teacher' network. We also present a valuable transfer learning result which shows that policies which are pre-trained on smaller 'teacher' networks can be used to rapidly speed up training on larger 'teacher' networks.",/pdf/81be55237357784b2a356e96bf1178594f8e8706.pdf,ICLR,2018,A novel reinforcement learning based approach to compress deep neural networks with knowledge distillation +SyGjQ30qFX,r1gNFzCctm,1538090000000.0,1545360000000.0,1392,TopicGAN: Unsupervised Text Generation from Explainable Latent Topics,"[""king6101@gmail.com"", ""y.v.chen@ieee.org"", ""tlkagkb93901106@gmail.com""]","[""Yau-Shian Wang"", ""Yun-Nung Chen"", ""Hung-Yi Lee""]","[""unsupervised learning"", ""topic model"", ""text generation""]","Learning discrete representations of data and then generating data from the discovered representations have been increasingly studied because the obtained discrete representations can benefit unsupervised learning. However, the performance of learning discrete representations of textual data with deep generative models has not been widely explored. In addition, although generative adversarial networks(GAN) have shown impressing results in many areas such as image generation, for text generation, it is notorious for extremely difficult to train. 
In this work, we propose TopicGAN, a two-step text generative model, which is able to solve those two important problems simultaneously. In the first step, it discovers the latent topics and produces a bag-of-words according to the latent topics. In the second step, it generates text from the produced bag-of-words. In our experiments, we show our model can discover meaningful discrete latent topics of texts in an unsupervised fashion and generate high-quality natural language from the discovered latent topics.",/pdf/9cbfa19fa5a15f522cec935b087b28c27d57eb3b.pdf,ICLR,2019, +B1xGGTEtDH,ByeFLe9LPB,1569440000000.0,1577170000000.0,405,Universal Approximation with Deep Narrow Networks,"[""kidger@maths.ox.ac.uk"", ""tlyons@maths.ox.ac.uk""]","[""Patrick Kidger"", ""Terry Lyons""]","[""deep learning"", ""universal approximation"", ""deep narrow networks""]","The classical Universal Approximation Theorem certifies that the universal approximation property holds for the class of neural networks of arbitrary width. Here we consider the natural `dual' theorem for width-bounded networks of arbitrary depth. Precisely, let $n$ be the number of input neurons, $m$ be the number of output neurons, and let $\rho$ be any nonaffine continuous function, with a continuous nonzero derivative at some point. Then we show that the class of neural networks of arbitrary depth, width $n + m + 2$, and activation function $\rho$, exhibits the universal approximation property with respect to the uniform norm on compact subsets of $\mathbb{R}^n$. This covers every activation function possible to use in practice; in particular this includes polynomial activation functions, making this genuinely different to the classical case. We go on to consider extensions of this result. First we show an analogous result for a certain class of nowhere differentiable activation functions. 
Second we establish an analogous result for noncompact domains, by showing that deep narrow networks with the ReLU activation function exhibit the universal approximation property with respect to the $p$-norm on $\mathbb{R}^n$. Finally we show that width of only $n + m + 1$ suffices for `most' activation functions.",/pdf/aba857fa58e8744b9d3844a3e85f7faf6db48fbf.pdf,ICLR,2020, +o3iritJHLfO,VXoQtyu2jJx,1601310000000.0,1615990000000.0,3537,Bidirectional Variational Inference for Non-Autoregressive Text-to-Speech,"[""~Yoonhyung_Lee2"", ""~Joongbo_Shin1"", ""~Kyomin_Jung1""]","[""Yoonhyung Lee"", ""Joongbo Shin"", ""Kyomin Jung""]","[""text-to-speech"", ""speech synthesis"", ""non-autoregressive"", ""VAE""]","Although early text-to-speech (TTS) models such as Tacotron 2 have succeeded in generating human-like speech, their autoregressive architectures have several limitations: (1) They require a lot of time to generate a mel-spectrogram consisting of hundreds of steps. (2) The autoregressive speech generation shows a lack of robustness due to its error propagation property. In this paper, we propose a novel non-autoregressive TTS model called BVAE-TTS, which eliminates the architectural limitations and generates a mel-spectrogram in parallel. BVAE-TTS adopts a bidirectional-inference variational autoencoder (BVAE) that learns hierarchical latent representations using both bottom-up and top-down paths to increase its expressiveness. To apply BVAE to TTS, we design our model to utilize text information via an attention mechanism. By using attention maps that BVAE-TTS generates, we train a duration predictor so that the model uses the predicted duration of each phoneme at inference. In experiments conducted on LJSpeech dataset, we show that our model generates a mel-spectrogram 27 times faster than Tacotron 2 with similar speech quality. 
Furthermore, our BVAE-TTS outperforms Glow-TTS, which is one of the state-of-the-art non-autoregressive TTS models, in terms of both speech quality and inference speed while having 58% fewer parameters.",/pdf/db03d769745da96b32be80606e358d64a7641d2b.pdf,ICLR,2021,"In this paper, a novel non-autoregressive text-to-speech model based on bidirectional-inference variational autoencoder called BVAE-TTS is proposed." +IjIzIOkK2D6,7h2G1_L_KDc,1601310000000.0,1614990000000.0,2684,Efficient Graph Neural Architecture Search,"[""~Huan_Zhao2"", ""~Lanning_Wei1"", ""~quanming_yao1"", ""hezq@levono.com""]","[""Huan Zhao"", ""Lanning Wei"", ""quanming yao"", ""Zhiqiang He""]","[""graph neural network"", ""neural architecture search"", ""automated machine learning""]","Recently, graph neural networks (GNN) have been demonstrated effective in various graph-based tasks. +To obtain state-of-the-art (SOTA) data-specific GNN architectures, researchers turn to the neural architecture search (NAS) methods. +However, it remains to be a challenging problem to conduct efficient architecture search for GNN. +In this work, we present a novel framework for Efficient GrAph Neural architecture search (EGAN). +By designing a novel and expressive search space, an efficient one-shot NAS method based on stochastic relaxation and natural gradient is proposed. +Further, to enable architecture search in large graphs, a transfer learning paradigm is designed. +Extensive experiments, including node-level and graph-level tasks, are conducted. The results show that the proposed EGAN can obtain SOTA data-specific architectures, and reduce the search cost by two orders of magnitude compared to existing NAS baselines.",/pdf/d27c1fcb0d9fc6b2fd675137cd7ea5b3a0c024e3.pdf,ICLR,2021,"We propose an effective and efficient framework for graph neural architecture search, which is very important for graph-based tasks." 
+Qun8fv4qSby,n7FhMb8aple,1601310000000.0,1616100000000.0,3261,Transient Non-stationarity and Generalisation in Deep Reinforcement Learning,"[""~Maximilian_Igl1"", ""~Gregory_Farquhar1"", ""~Jelena_Luketina1"", ""~Wendelin_Boehmer1"", ""~Shimon_Whiteson1""]","[""Maximilian Igl"", ""Gregory Farquhar"", ""Jelena Luketina"", ""Wendelin Boehmer"", ""Shimon Whiteson""]","[""Reinforcement Learning"", ""Generalization""]","Non-stationarity can arise in Reinforcement Learning (RL) even in stationary environments. For example, most RL algorithms collect new data throughout training, using a non-stationary behaviour policy. Due to the transience of this non-stationarity, it is often not explicitly addressed in deep RL and a single neural network is continually updated. However, we find evidence that neural networks exhibit a memory effect, where these transient non-stationarities can permanently impact the latent representation and adversely affect generalisation performance. Consequently, to improve generalisation of deep RL agents, we propose Iterated Relearning (ITER). ITER augments standard RL training by repeated knowledge transfer of the current policy into a freshly initialised network, which thereby experiences less non-stationarity during training. Experimentally, we show that ITER improves performance on the challenging generalisation benchmarks ProcGen and Multiroom.",/pdf/ea444807010b334cd2b90645f1cfa31bd38f3ef7.pdf,ICLR,2021,We find that transient non-stationarity can worsen generalization in reinforcement learning and propose a method to overcome this effect. +Hke-JhA9Y7,rylSWld5YX,1538090000000.0,1550880000000.0,959,Learning concise representations for regression by evolving networks of trees,"[""lacava@upenn.edu"", ""tilakraj@seas.upenn.edu"", ""surisr@seas.upenn.edu"", ""jhmoore@upenn.edu""]","[""William La Cava"", ""Tilak Raj Singh"", ""James Taggart"", ""Srinivas Suri"", ""Jason H. 
Moore""]","[""regression"", ""stochastic optimization"", ""evolutionary compution"", ""feature engineering""]","We propose and study a method for learning interpretable representations for the task of regression. Features are represented as networks of multi-type expression trees comprised of activation functions common in neural networks in addition to other elementary functions. Differentiable features are trained via gradient descent, and the performance of features in a linear model is used to weight the rate of change among subcomponents of each representation. The search process maintains an archive of representations with accuracy-complexity trade-offs to assist in generalization and interpretation. We compare several stochastic optimization approaches within this framework. We benchmark these variants on 100 open-source regression problems in comparison to state-of-the-art machine learning approaches. Our main finding is that this approach produces the highest average test scores across problems while producing representations that are orders of magnitude smaller than the next best performing method (gradient boosting). We also report a negative result in which attempts to directly optimize the disentanglement of the representation result in more highly correlated features.",/pdf/dbc8b24939c7edf937197c9847a336725d0349c1.pdf,ICLR,2019,Representing the network architecture as a set of syntax trees and optimizing their structure leads to accurate and concise regression models. +ry80wMW0W,r1WRvM-CZ,1509140000000.0,1519400000000.0,865,Hierarchical Subtask Discovery with Non-Negative Matrix Factorization,"[""adam.earle@ymail.com"", ""asaxe@fas.harvard.edu"", ""benjros@gmail.com""]","[""Adam C. Earle"", ""Andrew M. 
Saxe"", ""Benjamin Rosman""]","[""Reinforcement Learning"", ""Hierarchy"", ""Subtask Discovery"", ""Linear Markov Decision Process""]","Hierarchical reinforcement learning methods offer a powerful means of planning flexible behavior in complicated domains. However, learning an appropriate hierarchical decomposition of a domain into subtasks remains a substantial challenge. We present a novel algorithm for subtask discovery, based on the recently introduced multitask linearly-solvable Markov decision process (MLMDP) framework. The MLMDP can perform never-before-seen tasks by representing them as a linear combination of a previously learned basis set of tasks. In this setting, the subtask discovery problem can naturally be posed as finding an optimal low-rank approximation of the set of tasks the agent will face in a domain. We use non-negative matrix factorization to discover this minimal basis set of tasks, and show that the technique learns intuitive decompositions in a variety of domains. Our method has several qualitatively desirable features: it is not limited to learning subtasks with single goal states, instead learning distributed patterns of preferred states; it learns qualitatively different hierarchical decompositions in the same domain depending on the ensemble of tasks the agent will face; and it may be straightforwardly iterated to obtain deeper hierarchical decompositions.",/pdf/8bdc9d8975d51c6e92a49204d636c3c4f8682994.pdf,ICLR,2018,We present a novel algorithm for hierarchical subtask discovery which leverages the multitask linear Markov decision process framework. +SkgC6TNFvr,rJgOfBZdPr,1569440000000.0,1583910000000.0,839,Reinforced active learning for image segmentation,"[""arantxa.casanova-paga@polymtl.ca"", ""pedro@opinheiro.com"", ""negar@elementai.com"", ""chris.j.pal@gmail.com""]","[""Arantxa Casanova"", ""Pedro O. Pinheiro"", ""Negar Rostamzadeh"", ""Christopher J. 
Pal""]","[""semantic segmentation"", ""active learning"", ""reinforcement learning""]","Learning-based approaches for semantic segmentation have two inherent challenges. First, acquiring pixel-wise labels is expensive and time-consuming. Second, realistic segmentation datasets are highly unbalanced: some categories are much more abundant than others, biasing the performance to the most represented ones. In this paper, we are interested in focusing human labelling effort on a small subset of a larger pool of data, minimizing this effort while maximizing performance of a segmentation model on a hold-out set. We present a new active learning strategy for semantic segmentation based on deep reinforcement learning (RL). An agent learns a policy to select a subset of small informative image regions -- opposed to entire images -- to be labeled, from a pool of unlabeled data. The region selection decision is made based on predictions and uncertainties of the segmentation model being trained. Our method proposes a new modification of the deep Q-network (DQN) formulation for active learning, adapting it to the large-scale nature of semantic segmentation problems. We test the proof of concept in CamVid and provide results in the large-scale dataset Cityscapes. On Cityscapes, our deep RL region-based DQN approach requires roughly 30% less additional labeled data than our most competitive baseline to reach the same performance. 
Moreover, we find that our method asks for more labels of under-represented categories compared to the baselines, improving their performance and helping to mitigate class imbalance.",/pdf/cddf43c162f589c4d79c0991849cca940858fcae.pdf,ICLR,2020,Learning a labeling policy with reinforcement learning to reduce labeling effort for the task of semantic segmentation +Hy7EPh10W,HJMEw21AZ,1509050000000.0,1518730000000.0,172,Novelty Detection with GAN,"[""mark.kliger@gmail.com"", ""shacharfl@gmail.com""]","[""Mark Kliger"", ""Shachar Fleishman""]","[""novelty detection"", ""GAN"", ""feature matching"", ""semi-supervised""]","The ability of a classifier to recognize unknown inputs is important for many classification-based systems. We discuss the problem of simultaneous classification and novelty detection, i.e. determining whether an input is from the known set of classes and from which specific class, or from an unknown domain and does not belong to any of the known classes. We propose a method based on the Generative Adversarial Networks (GAN) framework. We show that a multi-class discriminator trained with a generator that generates samples from a mixture of nominal and novel data distributions is the optimal novelty detector. We approximate that generator with a mixture generator trained with the Feature Matching loss and empirically show that the proposed method outperforms conventional methods for novelty detection. Our findings demonstrate a simple, yet powerful new application of the GAN framework for the task of novelty detection.",/pdf/de88bd984c67f9bfd97826eb748b81c0e7d7a296.pdf,ICLR,2018,We propose to solve a problem of simultaneous classification and novelty detection within the GAN framework. 
+Tq_H_EDK-wa,YKFsRmYGgn,1601310000000.0,1614990000000.0,2359,Exploiting structured data for learning contagious diseases under incomplete testing,"[""~Maggie_Makar1"", ""lrwest@mgh.harvard.edu"", ""dhooper@mgh.harvard.edu"", ""~Eric_Horvitz1"", ""eshenoy@mgh.harvard.edu"", ""~John_Guttag2""]","[""Maggie Makar"", ""Lauren West"", ""David Hooper"", ""Eric Horvitz"", ""Erica Shenoy"", ""John Guttag""]","[""infectious diseases"", ""neural networks"", ""healthcare"", ""regularization"", ""structured data""]","One of the ways that machine learning algorithms can help control the spread of an infectious disease is by building models that predict who is likely to get infected whether or not they display any symptoms, making them good candidates for preemptive isolation. In this work we ask: can we build reliable infection prediction models when the observed data is collected under limited, and biased testing that prioritizes testing symptomatic individuals? Our analysis suggests that under favorable conditions, incomplete testing might be sufficient to achieve relatively good out-of-sample prediction error. Favorable conditions occur when untested-infected individuals have sufficiently different characteristics from untested-healthy, and when the infected individuals are ""potent"", meaning they infect a large majority of their neighbors. We develop an algorithm that predicts infections, and show that it outperforms benchmarks on simulated data. We apply our model to data from a large hospital to predict Clostridioides difficile infections; a communicable disease that is characterized by asymptomatic (i.e., untested) carriers. Using a proxy instead of the unobserved untested-infected state, we show that our model outperforms benchmarks in predicting infections. 
",/pdf/43204fa4500decd7423ab8a162c6b0245025fce4.pdf,ICLR,2021,We build models that leverage rich structured data to predict symptomatic and asymptomatic infections of contagious dieases +HJxR7R4FvS,SJlN5eUOPH,1569440000000.0,1583910000000.0,1058,RaCT: Toward Amortized Ranking-Critical Training For Collaborative Filtering ,"[""samuel_lobel@brown.edu"", ""chunyuan.li@microsoft.com"", ""jfgao@microsoft.com"", ""lcarin@duke.edu""]","[""Sam Lobel*"", ""Chunyuan Li*"", ""Jianfeng Gao"", ""Lawrence Carin""]","[""Collaborative Filtering"", ""Recommender Systems"", ""Actor-Critic"", ""Learned Metrics""]","We investigate new methods for training collaborative filtering models based on actor-critic reinforcement learning, to more directly maximize ranking-based objective functions. Specifically, we train a critic network to approximate ranking-based metrics, and then update the actor network to directly optimize against the learned metrics. In contrast to traditional learning-to-rank methods that require re-running the optimization procedure for new lists, our critic-based method amortizes the scoring process with a neural network, and can directly provide the (approximate) ranking scores for new lists. + +We demonstrate the actor-critic's ability to significantly improve the performance of a variety of prediction models, and achieve better or comparable performance to a variety of strong baselines on three large-scale datasets. 
+",/pdf/4cb3e9e73f04059794a59c7df1a752c08b2d9f0b.pdf,ICLR,2020,"We apply the actor-critic methodology from reinforcement learning to collaborative filtering, resulting in improved performance across a variety of latent-variable models" +Byl_ciRcY7,rJg6lJ7tYQ,1538090000000.0,1545360000000.0,549,ON BREIMAN’S DILEMMA IN NEURAL NETWORKS: SUCCESS AND FAILURE OF NORMALIZED MARGINS,"[""yhuangcc@ust.hk"", ""yuany@ust.hk"", ""wzhuai@connect.ust.hk""]","[""Yifei HUANG"", ""Yuan YAO"", ""Weizhi ZHU""]","[""Breiman's Dilemma"", ""Generalization Error"", ""Margin"", ""Spectral normalization""]","A long-standing belief in machine learning holds that enlargement of margins over training data accounts for the resistance of models to overfitting by increasing the robustness. Yet Breiman shows a dilemma (Breiman, 1999) that a uniform improvement on margin distribution \emph{does not} necessarily reduce generalization error. In this paper, we revisit Breiman's dilemma in deep neural networks with recently proposed normalized margins using Lipschitz constant bounded by spectral norm products. With both simplified theory and extensive experiments, Breiman's dilemma is shown to rely on dynamics of normalized margin distributions, which reflect the trade-off between model expression power and data complexity. When the complexity of data is comparable to the model expression power in the sense that training and test data share similar phase transitions in normalized margin dynamics, two efficient ways are derived via classic margin-based generalization bounds to successfully predict the trend of generalization error. On the other hand, over-expressed models that exhibit uniform improvements on training normalized margins may lose such a prediction power and fail to prevent the overfitting. 
+",/pdf/6c3005549b4ac8119f8edeb787a903e3e738bf5e.pdf,ICLR,2019,"Breiman's dilemma is shown in deep learning: improving the margins of over-parameterized models may result in overfitting, and dynamics of normalized margin distributions are proposed to predict generalization error and identify such a dilemma. " +HJlXC3EtwB,S1xrI_FVwH,1569440000000.0,1577170000000.0,258,Learning to Anneal and Prune Proximity Graphs for Similarity Search,"[""minjiaz@microsoft.com"", ""wenhanw@microsoft.com"", ""yuxhe@microsoft.com""]","[""Minjia Zhang"", ""Wenhan Wang"", ""Yuxiong He""]","[""Similarity search"", ""Proximity graph"", ""Learning to prune"", ""Edge heterogeneity"", ""Annealing"", ""Efficiency""]","This paper studies similarity search, which is a crucial enabler of many feature vector--based applications. The problem of similarity search has been extensively studied in the machine learning community. Recent advances of proximity graphs have achieved outstanding performance through exploiting the navigability of the underlying graph structure. In this work, we introduce the annealable proximity graph (APG) method to learn and reshape proximity graphs for efficient and effective similarity search. APG makes proximity graph edges annealable, which can be effectively trained with a stochastic optimization algorithm. APG identifies important edges that best preserve graph navigability and prunes inferior edges without drastically changing graph properties. Experimental results show that APG achieves state-of-the-art results not only by producing proximity graphs with fewer edges but also by speeding up the search time by 20--40\% across different datasets with almost no loss of accuracy. +",/pdf/947274dc37debb51def2ecb39c2be7c01e25592f.pdf,ICLR,2020,Annealable proximity graphs facilitate similarity search by learning to prune inferior edges without drastically changing graph properties. 
+HJdXGy1RW,rJv7zyyA-,1508990000000.0,1518730000000.0,110,CrescendoNet: A Simple Deep Convolutional Neural Network with Ensemble Behavior,"[""xzhang7@clemson.edu"", ""nvishwa@clemson.edu"", ""luofeng@clemson.edu"", ""hongxih@clemson.edu""]","[""Xiang Zhang"", ""Nishant Vishwamitra"", ""Hongxin Hu"", ""Feng Luo""]","[""CNN"", ""ensemble"", ""image recognition""]","We introduce a new deep convolutional neural network, CrescendoNet, by stacking simple building blocks without residual connections. Each Crescendo block contains independent convolution paths with increased depths. The numbers of convolution layers and parameters are only increased linearly in Crescendo blocks. In experiments, CrescendoNet with only 15 layers outperforms almost all networks without residual connections on benchmark datasets, CIFAR10, CIFAR100, and SVHN. Given sufficient amount of data as in SVHN dataset, CrescendoNet with 15 layers and 4.1M parameters can match the performance of DenseNet-BC with 250 layers and 15.3M parameters. CrescendoNet provides a new way to construct high performance deep convolutional neural networks without residual connections. Moreover, through investigating the behavior and performance of subnetworks in CrescendoNet, we note that the high performance of CrescendoNet may come from its implicit ensemble behavior, which differs from the FractalNet that is also a deep convolutional neural network without residual connections. Furthermore, the independence between paths in CrescendoNet allows us to introduce a new path-wise training procedure, which can reduce the memory needed for training.",/pdf/224aa7c2d86ec3fe6b7150c74963832a746ada5c.pdf,ICLR,2018,"We introduce CrescendoNet, a deep CNN architecture by stacking simple building blocks without residual connections." 
+B1X0mzZCW,H1m07GWAW,1509140000000.0,1519420000000.0,812,Fidelity-Weighted Learning,"[""dehghani@uva.nl"", ""amehrjou@tuebingen.mpg.de"", ""sgouws@google.com"", ""kamps@uva.nl"", ""bs@tuebingen.mpg.de""]","[""Mostafa Dehghani"", ""Arash Mehrjou"", ""Stephan Gouws"", ""Jaap Kamps"", ""Bernhard Sch\u00f6lkopf""]","[""fidelity-weighted learning"", ""semisupervised learning"", ""weakly-labeled data"", ""teacher-student""]","Training deep neural networks requires many training samples, but in practice training labels are expensive to obtain and may be of varying quality, as some may be from trusted expert labelers while others might be from heuristics or other sources of weak supervision such as crowd-sourcing. This creates a fundamental quality-versus-quantity trade-off in the learning process. Do we learn from the small amount of high-quality data or the potentially large amount of weakly-labeled data? We argue that if the learner could somehow know and take the label-quality into account when learning the data representation, we could get the best of both worlds. To this end, we propose “fidelity-weighted learning” (FWL), a semi-supervised student-teacher approach for training deep neural networks using weakly-labeled data. FWL modulates the parameter updates to a student network (trained on the task we care about) on a per-sample basis according to the posterior confidence of its label-quality estimated by a teacher (who has access to the high-quality labels). Both student and teacher are learned from the data. 
We evaluate FWL on two tasks in information retrieval and natural language processing where we outperform state-of-the-art alternative semi-supervised methods, indicating that our approach makes better use of strong and weak labels, and leads to better task-dependent data representations.",/pdf/b02079b368ddc201cb4177719f6e58958cde9be6.pdf,ICLR,2018,"We propose Fidelity-weighted Learning, a semi-supervised teacher-student approach for training neural networks using weakly-labeled data." +SJxsV2R5FQ,BkghyT29YX,1538090000000.0,1550850000000.0,1481,Learning sparse relational transition models,"[""victoria.f.xia281@gmail.com"", ""ziw@mit.edu"", ""krallen@mit.edu"", ""tslvr@mit.edu"", ""lpk@csail.mit.edu""]","[""Victoria Xia"", ""Zi Wang"", ""Kelsey Allen"", ""Tom Silver"", ""Leslie Pack Kaelbling""]","[""Deictic reference"", ""relational model"", ""rule-based transition model""]","We present a representation for describing transition models in complex uncertain domains using relational rules. For any action, a rule selects a set of relevant objects and computes a distribution over properties of just those objects in the resulting state given their properties in the previous state. An iterative greedy algorithm is used to construct a set of deictic references that determine which objects are relevant in any given state. Feed-forward neural networks are used to learn the transition distribution on the relevant objects' properties. This strategy is demonstrated to be both more versatile and more sample efficient than learning a monolithic transition model in a simulated domain in which a robot pushes stacks of objects on a cluttered table.",/pdf/86de7d9edeb7dd915d0c2ec57c62bfb09f44b540.pdf,ICLR,2019,A new approach that learns a representation for describing transition models in complex uncertain domains using relational rules. 
+NX1He-aFO_F,InhfCvl-sRI,1601310000000.0,1615830000000.0,3292,Learning Value Functions in Deep Policy Gradients using Residual Variance,"[""~Yannis_Flet-Berliac1"", ""~reda_ouhamma1"", ""~odalric-ambrym_maillard1"", ""~Philippe_Preux1""]","[""Yannis Flet-Berliac"", ""reda ouhamma"", ""odalric-ambrym maillard"", ""Philippe Preux""]",[],"Policy gradient algorithms have proven to be successful in diverse decision making and control tasks. However, these methods suffer from high sample complexity and instability issues. In this paper, we address these challenges by providing a different approach for training the critic in the actor-critic framework. Our work builds on recent studies indicating that traditional actor-critic algorithms do not succeed in fitting the true value function, calling for the need to identify a better objective for the critic. In our method, the critic uses a new state-value (resp. state-action-value) function approximation that learns the value of the states (resp. state-action pairs) relative to their mean value rather than the absolute value as in conventional actor-critic. We prove the theoretical consistency of the new gradient estimator and observe dramatic empirical improvement across a variety of continuous control tasks and algorithms. Furthermore, we validate our method in tasks with sparse rewards, where we provide experimental evidence and theoretical insights.",/pdf/d19c38b4919b1481e2aa3972a928c866f4502b44.pdf,ICLR,2021,We introduce a method to improve the learning of the critic in the actor-critic framework. 
+Sys6GJqxl,,1478260000000.0,1486540000000.0,160,Delving into Transferable Adversarial Examples and Black-box Attacks,"[""resodo.liu@gmail.com"", ""jungyhuk@gmail.com"", ""liuchang@eecs.berkeley.edu"", ""dawnsong@cs.berkeley.edu""]","[""Yanpei Liu"", ""Xinyun Chen"", ""Chang Liu"", ""Dawn Song""]","[""Computer vision"", ""Deep learning"", ""Applications""]","An intriguing property of deep neural networks is the existence of adversarial examples, which can transfer among different architectures. These transferable adversarial examples may severely hinder deep neural network-based applications. Previous works mostly study the transferability using small scale datasets. In this work, we are the first to conduct an extensive study of the transferability over large models and a large scale dataset, and we are also the first to study the transferability of targeted adversarial examples with their target labels. We study both non-targeted and targeted adversarial examples, and show that while transferable non-targeted adversarial examples are easy to find, targeted adversarial examples generated using existing approaches almost never transfer with their target labels. Therefore, we propose novel ensemble-based approaches to generating transferable adversarial examples. Using such approaches, we observe a large proportion of targeted adversarial examples that are able to transfer with their target labels for the first time. We also present some geometric studies to help understanding the transferable adversarial examples. 
Finally, we show that the adversarial examples generated using ensemble-based approaches can successfully attack Clarifai.com, which is a black-box image classification system.",/pdf/978205dd6095246709f97a4ec11279e65df57035.pdf,ICLR,2017, +rknt2Be0-,H1iK2rl0-,1509080000000.0,1519800000000.0,264,Compositional Obverter Communication Learning from Raw Visual Input,"[""mp2893@gatech.edu"", ""angeliki@google.com"", ""nandodefreitas@google.com""]","[""Edward Choi"", ""Angeliki Lazaridou"", ""Nando de Freitas""]","[""compositional language"", ""obverter"", ""multi-agent communication"", ""raw pixel input""]","One of the distinguishing aspects of human language is its compositionality, which allows us to describe complex environments with limited vocabulary. Previously, it has been shown that neural network agents can learn to communicate in a highly structured, possibly compositional language based on disentangled input (e.g. hand-engineered features). Humans, however, do not learn to communicate based on well-summarized features. In this work, we train neural agents to simultaneously develop visual perception from raw image pixels, and learn to communicate with a sequence of discrete symbols. The agents play an image description game where the image contains factors such as colors and shapes. We train the agents using the obverter technique where an agent introspects to generate messages that maximize its own understanding. Through qualitative analysis, visualization and a zero-shot test, we show that the agents can develop, out of raw image pixels, a language with compositional properties, given a proper pressure from the environment.",/pdf/41438a4286902f455443a626f0aacf8fccc42388.pdf,ICLR,2018,We train neural network agents to develop a language with compositional properties from raw pixel input. 
+IfEkus1dpU,IDFZd2M0cFb,1601310000000.0,1614990000000.0,1405,Cut-and-Paste Neural Rendering,"[""~Anand_Bhattad1"", ""~David_Forsyth1""]","[""Anand Bhattad"", ""David Forsyth""]","[""Neural Rendering"", ""Reshading"", ""Relighting"", ""Computational Photography"", ""Image Decomposition""]","Cut-and-paste methods take an object from one image and insert it into another. Doing so often results in unrealistic looking images because the inserted object's shading is inconsistent with the target scene's shading. Existing reshading methods require a geometric and physical model of the inserted object, which is then rendered using environment parameters. Accurately constructing such a model only from a single image is beyond the current understanding of computer vision. + +We describe an alternative procedure -- cut-and-paste neural rendering, to render the inserted fragment's shading field consistent with the target scene. We use a Deep Image Prior (DIP) as a neural renderer trained to render an image with consistent image decomposition inferences. The resulting rendering from DIP should have an albedo consistent with composite albedo; it should have a shading field that, outside the inserted fragment, is the same as the target scene's shading field; +and composite surface normals are consistent with the final rendering's shading field. +The result is a simple procedure that produces convincing and realistic shading. Moreover, our procedure does not require rendered images or image-decomposition from real images in the training or labeled annotations. In fact, our only use of simulated ground truth is our use of a pre-trained normal estimator. Qualitative results are strong, supported by a user study comparing against state-of-the-art image harmonization baseline.",/pdf/70498ca77c495f676e3574c53e9faa8a8db7aaa4.pdf,ICLR,2021,Convincing cut-and-paste neural rendering by consistent image decomposition inferences. 
+SyeMblBtwr,H1lqdJxYwB,1569440000000.0,1577170000000.0,2128,CrossNorm: On Normalization for Off-Policy Reinforcement Learning,"[""aditya@bhatts.org"", ""argus.max@gmail.com"", ""amiranas@cs.uni-freiburg.de"", ""brox@cs.uni-freiburg.de""]","[""Aditya Bhatt"", ""Max Argus"", ""Artemij Amiranashvili"", ""Thomas Brox""]","[""RL"", ""Normalization""]","Off-policy temporal difference (TD) methods are a powerful class of reinforcement learning (RL) algorithms. Intriguingly, deep off-policy TD algorithms are not commonly used in combination with feature normalization techniques, despite positive effects of normalization in other domains. We show that naive application of existing normalization techniques is indeed not effective, but that well-designed normalization improves optimization stability and removes the necessity of target networks. In particular, we introduce a normalization based on a mixture of on- and off-policy transitions, which we call cross-normalization. It can be regarded as an extension of batch normalization that re-centers data for two different distributions, as present in off-policy learning. Applied to DDPG and TD3, cross-normalization improves over the state of the art across a range of MuJoCo benchmark tasks. +",/pdf/5fa2911762edaa91be3539f7e04791a9b4ad7e4d.pdf,ICLR,2020,Use of normalization for deep RL allows for training without target networks and better performance. +ByecAoAqK7,rJxYDSacFm,1538090000000.0,1545360000000.0,918,Zero-shot Dual Machine Translation,"[""lierni@google.com"", ""massi@google.com"", ""cbuck@google.com"", ""thomas.hofmann@inf.ethz.ch""]","[""Lierni Sestorain"", ""Massimiliano Ciaramita"", ""Christian Buck"", ""Thomas Hofmann""]","[""unsupervised"", ""machine translation"", ""dual learning"", ""zero-shot""]","Neural Machine Translation (NMT) systems rely on large amounts of parallel data. This is a major challenge for low-resource languages. 
Building on recent work on unsupervised and semi-supervised methods, we present an approach that combines zero-shot and dual learning. The latter relies on reinforcement learning, to exploit the duality of the machine translation task, and requires only monolingual data for the target language pair. Experiments on the UN corpus show that a zero-shot dual system, trained on English-French and English-Spanish, outperforms by large margins a standard NMT system in zero-shot translation performance on Spanish-French (both directions). We also evaluate on newstest2014. These experiments show that the zero-shot dual method outperforms the LSTM-based unsupervised NMT system proposed in (Lample et al., 2018b), on the en→fr task, while on the fr→en task it outperforms both the LSTM-based and the Transformers-based unsupervised NMT systems.",/pdf/bccab057a76a9d26379a6f111b25349cd0618508.pdf,ICLR,2019,A multilingual NMT model with reinforcement learning (dual learning) aiming to improve zero-shot translation directions. +5USOVm2HkfG,FW2t07bXlGc,1601310000000.0,1614990000000.0,1906,Jointly-Trained State-Action Embedding for Efficient Reinforcement Learning,"[""~Paul_Julian_Pritz1"", ""~Liang_Ma4"", ""~Kin_Leung1""]","[""Paul Julian Pritz"", ""Liang Ma"", ""Kin Leung""]","[""reinforcement learning"", ""embedding"", ""representation learning"", ""state-action embedding""]","While reinforcement learning has achieved considerable successes in recent years, state-of-the-art models are often still limited by the size of state and action spaces. Model-free reinforcement learning approaches use some form of state representations and the latest work has explored embedding techniques for actions, both with the aim of achieving better generalization and applicability. However, these approaches consider only states or actions, ignoring the interaction between them when generating embedded representations. 
In this work, we propose a new approach for jointly learning embeddings for states and actions that combines aspects of model-free and model-based reinforcement learning, which can be applied in both discrete and continuous domains. Specifically, we use a model of the environment to obtain embeddings for states and actions and present a generic architecture that uses these to learn a policy. In this way, the embedded representations obtained via our approach enable better generalization over both states and actions by capturing similarities in the embedding spaces. Evaluations of our approach on several gaming, robotic control, and recommender systems show it significantly outperforms state-of-the-art models in both discrete/continuous domains with large state/action spaces, thus confirming its efficacy and the overall superior performance.",/pdf/2e0b08a08cbb5e955b0104c166b9481e4b586737.pdf,ICLR,2021,"We proposed a new architecture for jointly embedding states/actions and combined this with common RL algorithms, the results of which show it outperforms state-of-the-art approaches in the presence of large state/action spaces." +S1E3Ko09F7,HJeBv1guY7,1538090000000.0,1554870000000.0,484,L-Shapley and C-Shapley: Efficient Model Interpretation for Structured Data,"[""jianbochen@berkeley.edu"", ""lsong@cc.gatech.edu"", ""wainwrig@berkeley.edu"", ""jordan@cs.berkeley.edu""]","[""Jianbo Chen"", ""Le Song"", ""Martin J. Wainwright"", ""Michael I. Jordan""]","[""Model Interpretation"", ""Feature Selection""]","Instancewise feature scoring is a method for model interpretation, which yields, for each test instance, a vector of importance scores associated with features. Methods based on the Shapley score have been proposed as a fair way of computing feature attributions, but incur an exponential complexity in the number of features. 
This combinatorial explosion arises from the definition of Shapley value and prevents these methods from being scalable to large data sets and complex models. We focus on settings in which the data have a graph structure, and the contribution of features to the target variable is well-approximated by a graph-structured factorization. In such settings, we develop two algorithms with linear complexity for instancewise feature importance scoring on black-box models. We establish the relationship of our methods to the Shapley value and a closely related concept known as the Myerson value from cooperative game theory. We demonstrate on both language and image data that our algorithms compare favorably with other methods using both quantitative metrics and human evaluation.",/pdf/c459f1cb577e015cabb6da980b45e942be11aa91.pdf,ICLR,2019,"We develop two linear-complexity algorithms for model-agnostic model interpretation based on the Shapley value, in the settings where the contribution of features to the target is well-approximated by a graph-structured factorization." +FmMKSO4e8JK,nqyttlqGkBf,1601310000000.0,1615880000000.0,2157,Offline Model-Based Optimization via Normalized Maximum Likelihood Estimation,"[""~Justin_Fu1"", ""~Sergey_Levine1""]","[""Justin Fu"", ""Sergey Levine""]","[""model-based optimization"", ""normalized maximum likelihood""]","In this work we consider data-driven optimization problems where one must maximize a function given only queries at a fixed set of points. This problem setting emerges in many domains where function evaluation is a complex and expensive process, such as in the design of materials, vehicles, or neural network architectures. 
Because the available data typically only covers a small manifold of the possible space of inputs, a principal challenge is to be able to construct algorithms that can reason about uncertainty and out-of-distribution values, since a naive optimizer can easily exploit an estimated model to return adversarial inputs. We propose to tackle the MBO problem by leveraging the normalized maximum-likelihood (NML) estimator, which provides a principled approach to handling uncertainty and out-of-distribution inputs. While in the standard formulation NML is intractable, we propose a tractable approximation that allows us to scale our method to high-capacity neural network models. We demonstrate that our method can effectively optimize high-dimensional design problems in a variety of disciplines such as chemistry, biology, and materials engineering.",/pdf/4fc7d9a4093bc9bafddf46785cafda74e0be40ce.pdf,ICLR,2021,"Offline, data-driven optimization using normalized maximum likelihood to produce robust function estimates." +rkZvSe-RZ,HJePSeWCZ,1509130000000.0,1519430000000.0,582,Ensemble Adversarial Training: Attacks and Defenses,"[""tramer@cs.stanford.edu"", ""alexey@kurakin.me"", ""ngp5056@cse.psu.edu"", ""goodfellow@google.com"", ""dabo@cs.stanford.edu"", ""mcdaniel@cse.psu.edu""]","[""Florian Tram\u00e8r"", ""Alexey Kurakin"", ""Nicolas Papernot"", ""Ian Goodfellow"", ""Dan Boneh"", ""Patrick McDaniel""]","[""Adversarial Examples"", ""Adversarial Training"", ""Attacks"", ""Defenses"", ""ImageNet""]","Adversarial examples are perturbed inputs designed to fool machine learning models. Adversarial training injects such examples into training data to increase robustness. To scale this technique to large datasets, perturbations are crafted using fast single-step methods that maximize a linear approximation of the model's loss. 
+We show that this form of adversarial training converges to a degenerate global minimum, wherein small curvature artifacts near the data points obfuscate a linear approximation of the loss. The model thus learns to generate weak perturbations, rather than defend against strong ones. As a result, we find that adversarial training remains vulnerable to black-box attacks, where we transfer perturbations computed on undefended models, as well as to a powerful novel single-step attack that escapes the non-smooth vicinity of the input data via a small random step. +We further introduce Ensemble Adversarial Training, a technique that augments training data with perturbations transferred from other models. On ImageNet, Ensemble Adversarial Training yields models with strong robustness to black-box attacks. In particular, our most robust model won the first round of the NIPS 2017 competition on Defenses against Adversarial Attacks.",/pdf/9e65e3e0b6b3ecee0df85111bfd802a3c9b4e3a1.pdf,ICLR,2018,"Adversarial training with single-step methods overfits, and remains vulnerable to simple black-box and white-box attacks. We show that including adversarial examples from multiple sources helps defend against black-box attacks." +HkxAAvcxx,,1478290000000.0,1478290000000.0,446,Transformation-based Models of Video Sequences,"[""joost@joo.st"", ""akannan@fb.com"", ""ranzato@fb.com"", ""aszlam@fb.com"", ""trandu@fb.com"", ""soumith@fb.com""]","[""Joost van Amersfoort"", ""Anitha Kannan"", ""Marc'Aurelio Ranzato"", ""Arthur Szlam"", ""Du Tran"", ""Soumith Chintala""]","[""Computer vision"", ""Unsupervised Learning""]","In this work we propose a simple unsupervised approach for next frame prediction in video. Instead of directly predicting the pixels in a frame given past frames, we predict the transformations needed for generating the next frame in a sequence, given the transformations of the past frames. This leads to sharper results, while using a smaller prediction model. 
+ +In order to enable a fair comparison between different video frame prediction models, we also propose a new evaluation protocol. We use generated frames as input to a classifier trained with ground truth sequences. This criterion guarantees that models scoring high are those producing sequences which preserve discriminative features, as opposed to merely penalizing any deviation, plausible or not, from the ground truth. Our proposed approach compares favourably against more sophisticated ones on the UCF-101 data set, while also being more efficient in terms of the number of parameters and computational cost.",/pdf/34dcce511da2ad624524464b80b50a9c93086043.pdf,ICLR,2017,Predict next frames of a video sequence by modelling transformations +sCZbhBvqQaU,zdk5G2MSluY,1601310000000.0,1611610000000.0,3741,Robust Reinforcement Learning on State Observations with Learned Optimal Adversary,"[""~Huan_Zhang1"", ""~Hongge_Chen1"", ""~Duane_S_Boning1"", ""~Cho-Jui_Hsieh1""]","[""Huan Zhang"", ""Hongge Chen"", ""Duane S Boning"", ""Cho-Jui Hsieh""]","[""reinforcement learning"", ""robustness"", ""adversarial attacks"", ""adversarial defense""]","We study the robustness of reinforcement learning (RL) with adversarially perturbed state observations, which aligns with the setting of many adversarial attacks to deep reinforcement learning (DRL) and is also important for rolling out real-world RL agent under unpredictable sensing noise. With a fixed agent policy, we demonstrate that an optimal adversary to perturb state observations can be found, which is guaranteed to obtain the worst case agent reward. For DRL settings, this leads to a novel empirical adversarial attack to RL agents via a learned adversary that is much stronger than previous ones. 
To enhance the robustness of an agent, we propose a framework of alternating training with learned adversaries (ATLA), which trains an adversary online together with the agent using policy gradient following the optimal adversarial attack framework. Additionally, inspired by the analysis of state-adversarial Markov decision process (SA-MDP), we show that past states and actions (history) can be useful for learning a robust agent, and we empirically find a LSTM based policy can be more robust under adversaries. Empirical evaluations on a few continuous control environments show that ATLA achieves state-of-the-art performance under strong adversaries. Our code is available at https://github.com/huanzhang12/ATLA_robust_RL.",/pdf/9a0def4f4b70bbb3d4c3157a3ee5e4110bb9363a.pdf,ICLR,2021,"We study the robustness of RL agents under perturbations on states and find that using an ""optimal"" adversary learned online in an alternating training manner can improve the robustness of agent policy." +H1fsUiRcKQ,SJxkQ4M9tX,1538090000000.0,1545360000000.0,206,Fast adversarial training for semi-supervised learning,"[""dongha0718@hanmail.net"", ""pminer32@gmail.com"", ""jae-joon.han@samsung.com"", ""changkyu_choi@samsung.com"", ""ydkim0903@gmail.com""]","[""Dongha Kim"", ""Yongchan Choi"", ""Jae-Joon Han"", ""Changkyu Choi"", ""Yongdai Kim""]","[""Deep learning"", ""Semi-supervised learning"", ""Adversarial training""]","In semi-supervised learning, Bad GAN approach is one of the most attractive method due to the intuitional simplicity and powerful performances. Bad GAN learns a classifier with bad samples distributed on complement of the support of the input data. But Bad GAN needs additional architectures, a generator and a density estimation model, which involves huge computation and memory consumption cost. VAT is another good semi-supervised learning algorithm, which +utilizes unlabeled data to improve the invariance of the classifier with respect to perturbation of inputs. 
In this study, we propose a new method by combining the ideas of Bad GAN and VAT. The proposed method generates bad samples of high-quality by use of the adversarial training used in VAT. We give theoretical explanations why the adversarial training is good at both generating bad samples and semi-supervised learning. An advantage of the proposed method is to achieve the competitive performances with much fewer computations. We demonstrate advantages our method by various experiments with well known benchmark image datasets.",/pdf/ebec2b63b06ea2db76cce42a1ddff4cb0a16b658.pdf,ICLR,2019,We propose a fast and efficient semi-supervised learning method using adversarial training. +SUyxNGzUsH,iizlxLX9sib,1601310000000.0,1614990000000.0,2621,VilNMN: A Neural Module Network approach to Video-Grounded Language Tasks,"[""~Hung_Le2"", ""~Nancy_F._Chen1"", ""~Steven_Hoi2""]","[""Hung Le"", ""Nancy F. Chen"", ""Steven Hoi""]","[""neural modular networks"", ""video-grounded dialogues"", ""dialogue understanding"", ""video understanding"", ""video QA"", ""video-grounded language tasks""]","Neural module networks (NMN) have achieved success in image-grounded tasks such as question answering (QA) on synthetic images. However, very limited work on NMN has been studied in the video-grounded language tasks. These tasks extend the complexity of traditional visual tasks with the additional visual temporal variance. Motivated by recent NMN approaches on image-grounded tasks, we introduce Visio-Linguistic Neural Module Network (VilNMN) to model the information retrieval process in video-grounded language tasks as a pipeline of neural modules. VilNMN first decomposes all language components to explicitly resolves entity references and detect corresponding action-based inputs from the question. Detected entities and actions are used as parameters to instantiate neural module networks and extract visual cues from the video. 
Our experiments show that VilNMN can achieve promising performance on two video-grounded language tasks: video QA and video-grounded dialogues. ",/pdf/b138388b5455b457cfb3e7cd02b6a6f3658ea742.pdf,ICLR,2021,"We propose VilNMN, a novel neural module network approach for video-grounded language tasks, which achieves strong performance in both video QA and video-grounded dialogue tasks. " +HyMxAi05Km,BJg9JIhqYm,1538090000000.0,1545360000000.0,865,Dual Learning: Theoretical Study and Algorithmic Extensions,"[""zhaoz6@rpi.edu"", ""yingce.xia@gmail.com"", ""taoqin@microsoft.com"", ""tyliu@microsoft.com""]","[""Zhibing Zhao"", ""Yingce Xia"", ""Tao Qin"", ""Tie-Yan Liu""]","[""machine translation"", ""dual learning""]","Dual learning has been successfully applied in many machine learning applications, including machine translation, image-to-image transformation, etc. The high-level idea of dual learning is very intuitive: if we map an x from one domain to another and then map it back, we should recover the original x. Although its effectiveness has been empirically verified, theoretical understanding of dual learning is still missing. In this paper, we conduct a theoretical study to understand why and when dual learning can improve a mapping function. Based on the theoretical discoveries, we extend dual learning by introducing more related mappings and propose highly symmetric frameworks, cycle dual learning and multipath dual learning, in both of which we can leverage the feedback signals from additional domains to improve the qualities of the mappings. We prove that both cycle dual learning and multipath dual learning can boost the performance of standard dual learning under mild conditions. 
Experiments on WMT 14 English↔German and MultiUN English↔French translations verify our theoretical findings on dual learning, and the results on the translations among English, French, and Spanish of MultiUN demonstrate the efficacy of cycle dual learning and multipath dual learning.",/pdf/f84de5d29f36270196d323a68d415f4b226d35b0.pdf,ICLR,2019, +rJevYoA9Fm,HklllZicF7,1538090000000.0,1550860000000.0,453,The Singular Values of Convolutional Layers,"[""hsedghi@google.com"", ""vineet@google.com"", ""plong@google.com""]","[""Hanie Sedghi"", ""Vineet Gupta"", ""Philip M. Long""]","[""singular values"", ""operator norm"", ""convolutional layers"", ""regularization""]","We characterize the singular values of the linear transformation associated with a standard 2D multi-channel convolutional layer, enabling their efficient computation. This characterization also leads to an algorithm for projecting a convolutional layer onto an operator-norm ball. We show that this is an effective regularizer; for example, it improves the test error of a deep residual network using batch normalization on CIFAR-10 from 6.2% to 5.3%. ",/pdf/2269d1ab11252ade0a6cc5073aca33cd499e5874.pdf,ICLR,2019,"We characterize the singular values of the linear transformation associated with a standard 2D multi-channel convolutional layer, enabling their efficient computation. " +Skxd6gSYDS,Syg0BZbKvr,1569440000000.0,1583910000000.0,2580,Query-efficient Meta Attack to Deep Neural Networks,"[""dujiawei@u.nus.edu"", ""hu.zhang-1@student.uts.edu.au"", ""joey.tianyi.zhou@gmail.com"", ""yi.yang@uts.edu.au"", ""elefjia@nus.edu.sg""]","[""Jiawei Du"", ""Hu Zhang"", ""Joey Tianyi Zhou"", ""Yi Yang"", ""Jiashi Feng""]","[""Adversarial attack"", ""Meta learning""]","Black-box attack methods aim to infer suitable attack patterns to targeted DNN models by only using output feedback of the models and the corresponding input queries. 
However, due to lack of prior and inefficiency in leveraging the query and feedback information, existing methods are mostly query-intensive for obtaining effective attack patterns. In this work, we propose a meta attack approach that is capable of attacking a targeted model with much fewer queries. Its high query-efficiency stems from effective utilization of meta learning approaches in learning generalizable prior abstraction from the previously observed attack patterns and exploiting such prior to help infer attack patterns from only a few queries and outputs. Extensive experiments on MNIST, CIFAR10 and tiny-Imagenet demonstrate that our meta-attack method can remarkably reduce the number of model queries without sacrificing the attack performance. Besides, the obtained meta attacker is not restricted to a particular model but can be used easily with a fast adaptive ability to attack a variety of models. Our code will be released to the public.",/pdf/accbc6a61ef3e6804b5f7aff94e79cf4918781ef.pdf,ICLR,2020, +SJxeI6EYwS,Syln7l_wDS,1569440000000.0,1577170000000.0,547,Simple and Effective Stochastic Neural Networks,"[""tianyuan.yu@surrey.ac.uk"", ""yongxin.yang@surrey.ac.uk"", ""dali.darren@hotmail.com"", ""t.hospedales@ed.ac.uk"", ""t.xiang@surrey.ac.uk""]","[""Tianyuan Yu"", ""Yongxin Yang"", ""Da Li"", ""Timothy Hospedales"", ""Tao Xiang""]","[""stochastic neural networks"", ""pruning"", ""adversarial defence"", ""label noise""]","Stochastic neural networks (SNNs) are currently topical, with several paradigms being actively investigated including dropout, Bayesian neural networks, variational information bottleneck (VIB) and noise regularized learning. These neural network variants impact several major considerations, including generalization, network compression, and robustness against adversarial attack and label noise. However, many existing networks are complicated and expensive to train, and/or only address one or two of these practical considerations. 
In this paper we propose a simple and effective stochastic neural network (SE-SNN) architecture for discriminative learning by directly modeling activation uncertainty and encouraging high activation variability. Compared to existing SNNs, our SE-SNN is simpler to implement and faster to train, and produces state of the art results on network compression by pruning, adversarial defense and learning with label noise.",/pdf/c8db792ae0818842b887f606a81491ce5ab0fcfc.pdf,ICLR,2020,In this paper we propose a simple and effective stochastic neural network (SE-SNN) architecture for discriminative learning by directly modeling activation uncertainty and encouraging high activation variability. +rkg5fh0ctQ,Bye0wyhqKX,1538090000000.0,1545360000000.0,1290,Transferring SLU Models in Novel Domains,"[""yaohuatang@webank.com"", ""kxmo@connect.ust.hk"", ""fleurxq@outlook.com"", ""carlzzhang@webank.com"", ""qyang@cse.ust.hk""]","[""Yaohua Tang"", ""Kaixiang Mo"", ""Qian Xu"", ""Chao Zhang"", ""Qiang Yang""]","[""transfer learning"", ""semantic representation"", ""spoken language understanding""]","Spoken language understanding (SLU) is a critical component in building dialogue systems. When building models for novel natural language domains, a major challenge is the lack of data in the new domains, no matter whether the data is annotated or not. Recognizing and annotating ``intent'' and ``slot'' of natural languages is a time-consuming process. Therefore, spoken language understanding in low resource domains remains a crucial problem to address. In this paper, we address this problem by proposing a transfer-learning method, whereby a SLU model is transferred to a novel but data-poor domain via a deep neural network framework. We also introduce meta-learning in our work to bridge the semantic relations between seen and unseen data, allowing new intents to be recognized and new slots to be filled with much lower new training effort. 
We show the performance improvement with extensive experimental results for spoken language understanding in low resource domains. We show that our method can also handle novel intent recognition and slot-filling tasks. Our methodology provides a feasible solution for alleviating data shortages in spoken language understanding.",/pdf/a316a75a8c5489a97939618060299442f39d1755.pdf,ICLR,2019,v3 +SkloDjAqYm,SkeUG2mKK7,1538090000000.0,1545650000000.0,295,LeMoNADe: Learned Motif and Neuronal Assembly Detection in calcium imaging videos,"[""elke.kirschbaum@iwr.uni-heidelberg.de"", ""manuel.haussmann@iwr.uni-heidelberg.de"", ""steffen.wolf@iwr.uni-heidelberg.de"", ""hannah.sonntag@mpimf-heidelberg.mpg.de"", ""justus.schneider@physiologie.uni-heidelberg.de"", ""shehab.elzoheiry@physiologie.uni-heidelberg.de"", ""oliver.kann@physiologie.uni-heidelberg.de"", ""daniel.durstewitz@zi-mannheim.de"", ""fred.hamprecht@iwr.uni-heidelberg.de""]","[""Elke Kirschbaum"", ""Manuel Hau\u00dfmann"", ""Steffen Wolf"", ""Hannah Sonntag"", ""Justus Schneider"", ""Shehabeldin Elzoheiry"", ""Oliver Kann"", ""Daniel Durstewitz"", ""Fred A Hamprecht""]","[""VAE"", ""unsupervised learning"", ""neuronal assemblies"", ""calcium imaging analysis""]","Neuronal assemblies, loosely defined as subsets of neurons with reoccurring spatio-temporally coordinated activation patterns, or ""motifs"", are thought to be building blocks of neural representations and information processing. We here propose LeMoNADe, a new exploratory data analysis method that facilitates hunting for motifs in calcium imaging videos, the dominant microscopic functional imaging modality in neurophysiology. Our nonparametric method extracts motifs directly from videos, bypassing the difficult intermediate step of spike extraction. Our technique augments variational autoencoders with a discrete stochastic node, and we show in detail how a differentiable reparametrization and relaxation can be used. 
An evaluation on simulated data, with available ground truth, reveals excellent quantitative performance. In real video data acquired from brain slices, with no ground truth available, LeMoNADe uncovers nontrivial candidate motifs that can help generate hypotheses for more focused biological investigations.",/pdf/f33573282308ee8c48f7bf6ecd16e216cb97bc64.pdf,ICLR,2019,"We present LeMoNADe, an end-to-end learned motif detection method directly operating on calcium imaging videos." +bQNosljkHj,geoxFPzUYcJ,1601310000000.0,1614990000000.0,3197,On the Geometry of Deep Bayesian Active Learning,"[""~Xiaofeng_Cao1"", ""~Ivor_Tsang1""]","[""Xiaofeng Cao"", ""Ivor Tsang""]","[""Bayesian active learning"", ""geometric interpretation"", ""core-set construction"", ""model uncertainty"", ""ellipsoid.""]","We present geometric Bayesian active learning by disagreements (GBALD), a framework that performs BALD on its geometric interpretation interacting with a deep learning model. There are two main components in GBALD: initial acquisitions based on core-set construction and model uncertainty estimation with those initial acquisitions. Our key innovation is to construct the core-set on an ellipsoid, not typical sphere, preventing its updates towards the boundary regions of the distributions. Main improvements over BALD are twofold: relieving sensitivity to uninformative prior and reducing redundant information of model uncertainty. To guarantee the improvements, our generalization analysis proves that, compared to typical Bayesian spherical interpretation, geodesic search with ellipsoid can derive a tighter lower error bound and achieve higher probability to obtain a nearly zero error. 
Experiments on acquisitions with several scenarios demonstrate that, yielding slight perturbations to noisy and repeated samples, GBALD further achieves significant accuracy improvements than BALD, BatchBALD and other baselines.",/pdf/21b9750a6e5f0a437e44a3f5d7426ae0dc6450b0.pdf,ICLR,2021,We present geometric Bayesian active learning by disagreements for active deep learning. +fV4vvs1J5iM,7rpIfRtt6qh,1601310000000.0,1614990000000.0,1594,A Reduction Approach to Constrained Reinforcement Learning,"[""tianchi.ctc@antgroup.com"", ""~Wenjie_Shi1"", ""lihong.glh@antgroup.com"", ""xiaodong.zxd@antgroup.com"", ""jinjie.gujj@antgroup.com""]","[""Tianchi Cai"", ""Wenjie Shi"", ""Lihong Gu"", ""Xiaodong Zeng"", ""Jinjie Gu""]",[],"Many applications of reinforcement learning (RL) optimize a long-term reward subject to risk, safety, budget, diversity or other constraints. Though constrained RL problem has been studied to incorporate various constraints, existing methods either tie to specific families of RL algorithms or require storing infinitely many individual policies found by an RL oracle to approach a feasible solution. In this paper, we present a novel reduction approach for constrained RL problem that ensures convergence when using any off-the-shelf RL algorithm to construct an RL oracle yet requires storing at most constantly many policies. The key idea is to reduce the constrained RL problem to a distance minimization problem, and a novel variant of Frank-Wolfe algorithm is proposed for this task. Throughout the learning process, our method maintains at most constantly many individual policies, where the constant is shown to be worst-case optimal to ensure convergence of any RL oracle. Our method comes with rigorous convergence and complexity analysis, and does not introduce any extra hyper-parameter. Experiments on a grid-world navigation task demonstrate the efficiency of our method. 
+",/pdf/c09f13328176b0e29db4dfbc3195f83b02567838.pdf,ICLR,2021, +pRGF3Jtaie,cKDrv6QgEAP,1601310000000.0,1614990000000.0,2644,ALT-MAS: A Data-Efficient Framework for Active Testing of Machine Learning Algorithms,"[""~Huong_Ha3"", ""~Sunil_Gupta2"", ""~Santu_Rana1"", ""~Svetha_Venkatesh1""]","[""Huong Ha"", ""Sunil Gupta"", ""Santu Rana"", ""Svetha Venkatesh""]","[""active learning"", ""bayesian learning"", ""machine learning testing"", ""information theory""]","Machine learning models are being used extensively in many important areas, but there is no guarantee that a model will always perform well or as its developers intended. Understanding the correctness of a model is crucial to prevent potential failures that may have significant detrimental impact in critical application areas. In this paper, we propose a novel framework to efficiently test a machine learning model using only a small amount of labelled test data. The core idea is to efficiently estimate the metrics of interest for a model-under-test using Bayesian neural network. We develop a methodology to efficiently train the Bayesian neural network from the limited number of labelled data. We also devise an entropy-based sampling strategy to sample the data point such that the proposed framework can give accurate estimations for the metrics of interest. Finally, we conduct an extensive set of experiments to test various machine learning models for different types of metrics. Our experiments with multiple datasets show that given a testing budget, the estimation of the metrics by our method is significantly better compared to existing state-of-the-art approaches.",/pdf/211406a91e16b38acd80801a8eb45d0cade1b0d9.pdf,ICLR,2021,We propose a novel framework for active testing of machine learning models. 
+B1l8BtlCb,r1kUSYe0-,1509100000000.0,1519440000000.0,335,Non-Autoregressive Neural Machine Translation,"[""jiataogu@eee.hku.hk"", ""james.bradbury@salesforce.com"", ""cxiong@salesforce.com"", ""vli@eee.hku.hk"", ""rsocher@salesforce.com""]","[""Jiatao Gu"", ""James Bradbury"", ""Caiming Xiong"", ""Victor O.K. Li"", ""Richard Socher""]","[""machine translation"", ""non-autoregressive"", ""transformer"", ""fertility"", ""nmt""]","Existing approaches to neural machine translation condition each output word on previously generated outputs. We introduce a model that avoids this autoregressive property and produces its outputs in parallel, allowing an order of magnitude lower latency during inference. Through knowledge distillation, the use of input token fertilities as a latent variable, and policy gradient fine-tuning, we achieve this at a cost of as little as 2.0 BLEU points relative to the autoregressive Transformer network used as a teacher. We demonstrate substantial cumulative improvements associated with each of the three aspects of our training strategy, and validate our approach on IWSLT 2016 English–German and two WMT language pairs. By sampling fertilities in parallel at inference time, our non-autoregressive model achieves near-state-of-the-art performance of 29.8 BLEU on WMT 2016 English–Romanian.",/pdf/a5d2f63fd149ecdf6ced09eede7f7556b67bd662.pdf,ICLR,2018,"We introduce the first NMT model with fully parallel decoding, reducing inference latency by 10x." +S1lDV3RcKm,SJejon9cKQ,1538090000000.0,1550980000000.0,1457,MisGAN: Learning from Incomplete Data with Generative Adversarial Networks,"[""cxl@cs.umass.edu"", ""bjiang@sjtu.edu.cn"", ""marlin@cs.umass.edu""]","[""Steven Cheng-Xian Li"", ""Bo Jiang"", ""Benjamin Marlin""]","[""generative models"", ""missing data""]","Generative adversarial networks (GANs) have been shown to provide an effective way to model complex distributions and have obtained impressive results on various challenging tasks. 
However, typical GANs require fully-observed data during training. In this paper, we present a GAN-based framework for learning from complex, high-dimensional incomplete data. The proposed framework learns a complete data generator along with a mask generator that models the missing data distribution. We further demonstrate how to impute missing data by equipping our framework with an adversarially trained imputer. We evaluate the proposed framework using a series of experiments with several types of missing data processes under the missing completely at random assumption.",/pdf/38721add8642a3a1e4a91208b6a3ca6b05b0d371.pdf,ICLR,2019,This paper presents a GAN-based framework for learning the distribution from high-dimensional incomplete data. +GVNGAaY2Dr1,2Y5or74f53,1601310000000.0,1614990000000.0,2134,Multi-Agent Collaboration via Reward Attribution Decomposition,"[""~Tianjun_Zhang1"", ""~Huazhe_Xu1"", ""~Xiaolong_Wang3"", ""~Yi_Wu1"", ""~Kurt_Keutzer1"", ""~Joseph_E._Gonzalez1"", ""~Yuandong_Tian1""]","[""Tianjun Zhang"", ""Huazhe Xu"", ""Xiaolong Wang"", ""Yi Wu"", ""Kurt Keutzer"", ""Joseph E. Gonzalez"", ""Yuandong Tian""]","[""multi-agent reinforcement learning"", ""ad hoc team play""]","Recent advances in multi-agent reinforcement learning (MARL) have achieved super-human performance in games like Quake 3 and Dota 2. Unfortunately, these techniques require orders-of-magnitude more training rounds than humans and don't generalize to new agent configurations even on the same game. In this work, we propose Collaborative Q-learning (CollaQ) that achieves state-of-the-art performance in the StarCraft multi-agent challenge and supports ad hoc team play. We first formulate multi-agent collaboration as a joint optimization on reward assignment and show that each agent has an approximately optimal policy that decomposes into two parts: one part that only relies on the agent's own state, and the other part that is related to states of nearby agents. 
Following this novel finding, CollaQ decomposes the Q-function of each agent into a self term and an interactive term, with a Multi-Agent Reward Attribution (MARA) loss that regularizes the training. CollaQ is evaluated on various StarCraft maps and shows that it outperforms existing state-of-the-art techniques (i.e., QMIX, QTRAN, and VDN) by improving the win rate by 40% with the same number of samples. In the more challenging ad hoc team play setting (i.e., reweight/add/remove units without re-training or finetuning), CollaQ outperforms previous SoTA by over 30%. ",/pdf/f42f37b5fd7244bd562fc5a576017b2ec531761d.pdf,ICLR,2021, +rJxFpp4Fvr,HklwDg-dvS,1569440000000.0,1577170000000.0,826,"Feature-Robustness, Flatness and Generalization Error for Deep Neural Networks","[""henning.petzka@gmail.com"", ""adylova.linara.r@gmail.com"", ""info@michaelkamp.org"", ""cristian.sminchisescu@math.lth.se""]","[""Henning Petzka"", ""Linara Adilova"", ""Michael Kamp"", ""Cristian Sminchisescu""]","[""robustness"", ""flatness"", ""generalization error"", ""loss surface"", ""deep neural networks"", ""feature space""]","The performance of deep neural networks is often attributed to their automated, task-related feature construction. It remains an open question, though, why this leads to solutions with good generalization, even in cases where the number of parameters is larger than the number of samples. Back in the 90s, Hochreiter and Schmidhuber observed that flatness of the loss surface around a local minimum correlates with low generalization error. For several flatness measures, this correlation has been empirically validated. However, it has recently been shown that existing measures of flatness cannot theoretically be related to generalization: if a network uses ReLU activations, the network function can be reparameterized without changing its output in such a way that flatness is changed almost arbitrarily. 
This paper proposes a natural modification of existing flatness measures that results in invariance to reparameterization. The proposed measures imply a robustness of the network to changes in the input and the hidden layers. Connecting this feature robustness to generalization leads to a generalized definition of the representativeness of data. With this, the generalization error of a model trained on representative data can be bounded by its feature robustness which depends on our novel flatness measure.",/pdf/2c54a394eff673040532a1cc1883a9db0d8ad4a6.pdf,ICLR,2020,We introduce a novel measure of flatness at local minima of the loss surface of deep neural networks which is invariant with respect to layer-wise reparameterizations and we connect flatness to feature robustness and generalization. +BVSM0x3EDK6,3rkQRDsvZvG,1601310000000.0,1615950000000.0,1004,Robust and Generalizable Visual Representation Learning via Random Convolutions,"[""~Zhenlin_Xu1"", ""~Deyi_Liu1"", ""~Junlin_Yang1"", ""~Colin_Raffel1"", ""~Marc_Niethammer1""]","[""Zhenlin Xu"", ""Deyi Liu"", ""Junlin Yang"", ""Colin Raffel"", ""Marc Niethammer""]","[""domain generalization"", ""robustness"", ""representation learning"", ""data augmentation""]","While successful for various computer vision tasks, deep neural networks have shown to be vulnerable to texture style shifts and small perturbations to which humans are robust. In this work, we show that the robustness of neural networks can be greatly improved through the use of random convolutions as data augmentation. Random convolutions are approximately shape-preserving and may distort local textures. Intuitively, randomized convolutions create an infinite number of new domains with similar global shapes but random local texture. Therefore, we explore using outputs of multi-scale random convolutions as new images or mixing them with the original images during training. 
When applying a network trained with our approach to unseen domains, our method consistently improves the performance on domain generalization benchmarks and is scalable to ImageNet. In particular, in the challenging scenario of generalizing to the sketch domain in PACS and to ImageNet-Sketch, our method outperforms state-of-the-art methods by a large margin. More interestingly, our method can benefit downstream tasks by providing a more robust pretrained visual representation.",/pdf/10040feb356410f9e8b8ee9678ecb702e0534990.pdf,ICLR,2021,We use random convolutions as data augmentation to train robust visual representations that generalize to new domains. +Siwm2BaNiG,OeT3_FDvL14,1601310000000.0,1614990000000.0,89,Modal Uncertainty Estimation via Discrete Latent Representations,"[""~Di_Qiu1"", ""~Zhanghan_Ke1"", ""~Peng_Su1"", ""~Lok_Ming_Lui2""]","[""Di Qiu"", ""Zhanghan Ke"", ""Peng Su"", ""Lok Ming Lui""]","[""uncertainty estimation"", ""one-to-many mapping"", ""conditional generative model"", ""discrete latent space"", ""medical image segmentation""]","Many important problems in the real world don't have unique solutions. It is thus important for machine learning models to be capable of proposing different plausible solutions with meaningful probability measures. +In this work we propose a novel deep learning based framework, named {\it modal uncertainty estimation} (MUE), to learn the one-to-many mappings between the inputs and outputs, together with faithful uncertainty estimation. +Motivated by the multi-modal posterior collapse problem in current conditional generative models, MUE uses a set of discrete latent variables, each representing a latent mode hypothesis that explains one type of input-output relationship, to generate the one-to-many mappings. Benefiting from the discrete nature of the latent representations, MUE can effectively estimate the conditional probability distribution of the outputs for any input. 
Moreover, MUE is efficient during training since the discrete latent space and its uncertainty estimation are jointly learned. +We also develop the theoretical background of MUE and extensively validate it on both synthetic and realistic tasks. MUE demonstrates (1) significantly more accurate uncertainty estimation than the current state-of-the-art, and (2) its informativeness for practical use. + +",/pdf/33557352a7ba4632d3afc9b06de3be30f2eab419.pdf,ICLR,2021,We use a conditional generative model with discrete latent representation to solve the one-to-many mapping problem with faithful uncertainty estimates. +WGWzwdjm8mS,jUBuvw1KMB,1601310000000.0,1614990000000.0,3135,Early Stopping by Gradient Disparity,"[""~mahsa_forouzesh1"", ""~Patrick_Thiran1""]","[""mahsa forouzesh"", ""Patrick Thiran""]","[""Supervised Representation Learning"", ""Deep Neural Networks"", ""Generalization"", ""Early Stopping""]","Validation-based early-stopping methods are one of the most popular techniques used to avoid over-training deep neural networks. They require to set aside a reliable unbiased validation set, which can be expensive in applications offering limited amounts of data. In this paper, we propose to use \emph{gradient disparity}, which we define as the $\ell_2$ norm distance between the gradient vectors of two batches drawn from the training set. It comes from a probabilistic upper bound on the difference between the classification errors over a given batch, when the network is trained on this batch and when the network is trained on another batch of points sampled from the same dataset. We empirically show that gradient disparity is a very promising early-stopping criterion when data is limited, because it uses all the training samples during training. 
Furthermore, we show in a wide range of experimental settings that gradient disparity is not only strongly related to the usual generalization error between the training and test sets, but that it is also much more informative about the level of label noise. ",/pdf/8ea57e22ed9451cddfaca9de3651bad23c044bd0.pdf,ICLR,2021,"We propose an early stopping metric that does not require a validation set, which is particularly well suited for settings with limited and/or noisy labels." +By5SY2gA-,r1drKhxCZ,1509110000000.0,1518730000000.0,398,Towards Building Affect sensitive Word Distributions,"[""kchawla@adobe.com"", ""skhosla@adobe.com"", ""nchhaya@adobe.com"", ""jaidka@sas.upenn.edu""]","[""Kushal Chawla"", ""Sopan Khosla"", ""Niyati Chhaya"", ""Kokil Jaidka""]","[""Affect lexicon"", ""word embeddings"", ""Word2Vec"", ""GloVe"", ""WordNet"", ""joint learning"", ""sentiment analysis"", ""word similarity"", ""outlier detection"", ""affect prediction""]","Learning word representations from large available corpora relies on the distributional hypothesis that words present in similar contexts tend to have similar meanings. Recent work has shown that word representations learnt in this manner lack sentiment information which, fortunately, can be leveraged using external knowledge. Our work addresses the question: can affect lexica improve the word representations learnt from a corpus? In this work, we propose techniques to incorporate affect lexica, which capture fine-grained information about a word's psycholinguistic and emotional orientation, into the training process of Word2Vec SkipGram, Word2Vec CBOW and GloVe methods using a joint learning approach. We use affect scores from Warriner's affect lexicon to regularize the vector representations learnt from an unlabelled corpus. Our proposed method outperforms previously proposed methods on standard tasks for word similarity detection, outlier detection and sentiment detection. 
We also demonstrate the usefulness of our approach for a new task related to the prediction of formality, frustration and politeness in corporate communication.",/pdf/cae784cc68c7d05dedd1685520626404feab6073.pdf,ICLR,2018,Enriching word embeddings with affect information improves their performance on sentiment prediction tasks. +Hye87grYDH,SJe3XVgtvH,1569440000000.0,1577170000000.0,2212,Sparse Transformer: Concentrated Attention Through Explicit Selection,"[""1701214310@pku.edu.cn"", ""junyang.ljy@alibaba-inc.com"", ""zzy1210@pku.edu.cn"", ""renxc@pku.edu.cn"", ""xusun@pku.edu.cn""]","[""Guangxiang Zhao"", ""Junyang Lin"", ""Zhiyuan Zhang"", ""Xuancheng Ren"", ""Xu Sun""]","[""Attention"", ""Transformer"", ""Machine Translation"", ""Natural Language Processing"", ""Sparse"", ""Sequence to sequence learning""]","Self-attention-based Transformer has demonstrated the state-of-the-art performances in a number of natural language processing tasks. Self attention is able to model long-term dependencies, but it may suffer from the extraction of irrelevant information in the context. To tackle the problem, we propose a novel model called Sparse Transformer. Sparse Transformer is able to improve the concentration of attention on the global context through an explicit selection of the most relevant segments. Extensive experimental results on a series of natural language processing tasks, including neural machine translation, image captioning, and language modeling, all demonstrate the advantages of Sparse Transformer in model performance. + Sparse Transformer reaches the state-of-the-art performances in the IWSLT 2015 English-to-Vietnamese translation and IWSLT 2014 German-to-English translation. In addition, we conduct qualitative analysis to account for Sparse Transformer's superior performance. 
",/pdf/334e12c721e66f3312c962bd610a2b8719c1613e.pdf,ICLR,2020,This work propose Sparse Transformer to improve the concentration of attention on the global context through an explicit selection of the most relevant segments for sequence to sequence learning. +iG_Cg6ONjX,pfqzDE_ozAC,1601310000000.0,1614990000000.0,734,A General Computational Framework to Measure the Expressiveness of Complex Networks using a Tight Upper Bound of Linear Regions,"[""~Yutong_Xie1"", ""gaoxiangchen@pku.edu.cn"", ""~Quanzheng_Li2""]","[""Yutong Xie"", ""Gaoxiang Chen"", ""Quanzheng Li""]",[],"The expressiveness of deep neural network (DNN) is a perspective to understand the surprising performance of DNN. The number of linear regions, i.e. pieces that a piece-wise-linear function represented by a DNN, is generally used to measure the expressiveness. And the upper bound of regions number partitioned by a rectifier network, instead of the number itself, is a more practical measurement of expressiveness of a rectifier DNN. In this work, we propose a new and tighter upper bound of regions number. Inspired by the proof of this upper bound and the framework of matrix computation in \citet{hinz2019framework}, we propose a general computational approach to compute a tight upper bound of regions number for theoretically any network structures (e.g. DNN with all kind of skip connections and residual structures). 
Our experiments show our upper bound is tighter than existing ones, and explain why skip connections and residual structures can improve network performance.",/pdf/a0e372ce8e40ab6276f6e1eaca1686b15f223a17.pdf,ICLR,2021, +S1eQuCVFvB,SyeVm3DuDB,1569440000000.0,1577170000000.0,1205,Machine Truth Serum,"[""tluo6@ucsc.edu"", ""yangliu@ucsc.edu""]","[""Tianyi Luo"", ""Yang Liu""]","[""Ensemble method"", ""Classification"", ""Machine Truth Serum"", ""Minority"", ""Machine Learning""]","Wisdom of the crowd revealed a striking fact that the majority answer from a crowd is often more accurate than any individual expert. We observed the same story in machine learning - ensemble methods leverage this idea to combine multiple learning algorithms to obtain better classification performance. Among many popular examples is the celebrated Random Forest, which applies the majority voting rule in aggregating different decision trees to make the final prediction. Nonetheless, these aggregation rules would fail when the majority is more likely to be wrong. In this paper, we extend the idea proposed in Bayesian Truth Serum that ""a surprisingly more popular answer is more likely the true answer"" to classification problems. The challenge for us is to define or detect when an answer should be considered as being ""surprising"". We present two machine learning aided methods which aim to reveal the truth when it is minority instead of majority who has the true answer. Our experiments over real-world datasets show that better classification performance can be obtained compared to always trusting the majority voting. Our proposed methods also outperform popular ensemble algorithms. Our approach can be generically applied as a subroutine in ensemble methods to replace majority voting rule. 
",/pdf/6be871a331d8e5dead3bf5fb274fc89f43a994d6.pdf,ICLR,2020,This paper proposes two machine learning aided methods HMTS and DMTS to detect when the aggregated minority opinion should be taken as the final prediction instead of majority. +HJenn6VFvB,rJeYGUl_vB,1569440000000.0,1587650000000.0,797,Hamiltonian Generative Networks,"[""petertoth@google.com"", ""danilor@google.com"", ""drewjaegle@google.com"", ""sracaniere@google.com"", ""botev@google.com"", ""irinah@google.com""]","[""Peter Toth"", ""Danilo J. Rezende"", ""Andrew Jaegle"", ""S\u00e9bastien Racani\u00e8re"", ""Aleksandar Botev"", ""Irina Higgins""]","[""Hamiltonian dynamics"", ""normalising flows"", ""generative model"", ""physics""]","The Hamiltonian formalism plays a central role in classical and quantum physics. Hamiltonians are the main tool for modelling the continuous time evolution of systems with conserved quantities, and they come equipped with many useful properties, like time reversibility and smooth interpolation in time. These properties are important for many machine learning problems - from sequence prediction to reinforcement learning and density modelling - but are not typically provided out of the box by standard tools such as recurrent neural networks. In this paper, we introduce the Hamiltonian Generative Network (HGN), the first approach capable of consistently learning Hamiltonian dynamics from high-dimensional observations (such as images) without restrictive domain assumptions. Once trained, we can use HGN to sample new trajectories, perform rollouts both forward and backward in time, and even speed up or slow down the learned dynamics. We demonstrate how a simple modification of the network architecture turns HGN into a powerful normalising flow model, called Neural Hamiltonian Flow (NHF), that uses Hamiltonian dynamics to model expressive densities. 
Hence, we hope that our work serves as a first practical demonstration of the value that the Hamiltonian formalism can bring to machine learning. More results and video evaluations are available at: http://tiny.cc/hgn",/pdf/aff7b5eb43963e39c8330cd2fb8c9054c72286c7.pdf,ICLR,2020,We introduce a class of generative models that reliably learn Hamiltonian dynamics from high-dimensional observations. The learnt Hamiltonian can be applied to sequence modeling or as a normalising flow. +Zu3iPlzCe9J,n1ai354iULZ,1601310000000.0,1614990000000.0,2325,On the Power of Abstention and Data-Driven Decision Making for Adversarial Robustness,"[""~Nina_Balcan1"", ""~Avrim_Blum1"", ""~Dravyansh_Sharma1"", ""~Hongyang_Zhang1""]","[""Nina Balcan"", ""Avrim Blum"", ""Dravyansh Sharma"", ""Hongyang Zhang""]","[""Adversarial Machine Learning"", ""Learning Theory""]","We formally define a feature-space attack where the adversary can perturb datapoints by arbitrary amounts but in restricted directions. By restricting the attack to a small random subspace, our model provides a clean abstraction for non-Lipschitz networks which map small input movements to large feature movements. We prove that classifiers with the ability to abstain are provably more powerful than those that cannot in this setting. Specifically, we show that no matter how well-behaved the natural data is, any classifier that cannot abstain will be defeated by such an adversary. However, by allowing abstention, we give a parameterized algorithm with provably good performance against such an adversary when classes are reasonably well-separated in feature space and the dimension of the feature space is high. We further use a data-driven method to set our algorithm parameters to optimize over the accuracy vs. abstention trade-off with strong theoretical guarantees. 
Our theory has direct applications to the technique of contrastive learning, where we empirically demonstrate the ability of our algorithms to obtain high robust accuracy with only small amounts of abstention in both supervised and self-supervised settings. Our results provide a first formal abstention-based gap, and a first provable optimization for the induced trade-off in an adversarial defense setting.",/pdf/779f836743cf41a52f24369e9ceba10b3a699559.pdf,ICLR,2021,We develop algorithms with provable guarantees for defense against adversarial attacks that utilize abstention and also provably learn parameters to optimize over the accuracy vs. abstention trade-off. +r1xrb3CqtQ,BygjthpqKX,1538090000000.0,1545360000000.0,1166,Latent Domain Transfer: Crossing modalities with Bridging Autoencoders,"[""yittian@cs.stonybrook.edu"", ""jesseengel@google.com""]","[""Yingtao Tian"", ""Jesse Engel""]","[""Generative Model"", ""Latent Space"", ""Domain Transfer""]","Domain transfer is a exciting and challenging branch of machine learning because models must learn to smoothly transfer between domains, preserving local variations and capturing many aspects of variation without labels. +However, most successful applications to date require the two domains to be closely related (ex. image-to-image, video-video), +utilizing similar or shared networks to transform domain specific properties like texture, coloring, and line shapes. +Here, we demonstrate that it is possible to transfer across modalities (ex. image-to-audio) by first abstracting the data with latent generative models and then learning transformations between latent spaces. +We find that a simple variational autoencoder is able to learn a shared latent space to bridge between two generative models in an unsupervised fashion, and even between different types of models (ex. variational autoencoder and a generative adversarial network). 
+We can further impose desired semantic alignment of attributes with a linear classifier in the shared latent space. +The proposed variational autoencoder enables preserving both locality and semantic alignment through the transfer process, as shown in the qualitative and quantitative evaluations. +Finally, the hierarchical structure decouples the cost of training the base generative models and semantic alignments, enabling computationally efficient and data efficient retraining of personalized mapping functions. ",/pdf/501e7547b463f219d662e9973038fd04299ddd1c.pdf,ICLR,2019,Conditional VAE on top of latent spaces of pre-trained generative models that enables transfer between drastically different domains while preserving locality and semantic alignment. +3hGNqpI4WS,JudsnOR50sF,1601310000000.0,1615950000000.0,1529,Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization,"[""~Tatsuya_Matsushima1"", ""~Hiroki_Furuta1"", ""~Yutaka_Matsuo1"", ""~Ofir_Nachum1"", ""~Shixiang_Gu1""]","[""Tatsuya Matsushima"", ""Hiroki Furuta"", ""Yutaka Matsuo"", ""Ofir Nachum"", ""Shixiang Gu""]","[""Reinforcement Learning"", ""deployment-efficiency"", ""offline RL"", ""Model-based RL""]","Most reinforcement learning (RL) algorithms assume online access to the environment, in which one may readily interleave updates to the policy with experience collection using that policy. However, in many real-world applications such as health, education, dialogue agents, and robotics, the cost or potential risk of deploying a new data-collection policy is high, to the point that it can become prohibitive to update the data-collection policy more than a few times during learning. With this view, we propose a novel concept of deployment efficiency, measuring the number of distinct data-collection policies that are used during policy learning. 
We observe that naïvely applying existing model-free offline RL algorithms recursively does not lead to a practical deployment-efficient and sample-efficient algorithm. We propose a novel model-based algorithm, Behavior-Regularized Model-ENsemble (BREMEN), that not only performs better than or comparably to the state-of-the-art dynamic-programming-based and concurrently-proposed model-based offline approaches on existing benchmarks, but can also effectively optimize a policy offline using 10-20 times less data than prior works. Furthermore, the recursive application of BREMEN achieves impressive deployment efficiency while maintaining the same or better sample efficiency, learning successful policies from scratch on simulated robotic environments with only 5-10 deployments, compared to typical values of hundreds to millions in standard RL baselines.",/pdf/b73e1d0a56094306542cc2020c438a4e4942e58f.pdf,ICLR,2021,"We propose a novel method that achieves both high sample-efficiency in offline RL and ""deployment-efficiency"" in online RL." +SkgscaNYPS,H1l_mm0wvB,1569440000000.0,1583910000000.0,720,The asymptotic spectrum of the Hessian of DNN throughout training,"[""arthur.jacot@epfl.ch"", ""franck.gabriel@epfl.ch"", ""clement.hongler@epfl.ch""]","[""Arthur Jacot"", ""Franck Gabriel"", ""Clement Hongler""]","[""theory of deep learning"", ""loss surface"", ""training"", ""fisher information matrix""]","The dynamics of DNNs during gradient descent is described by the so-called Neural Tangent Kernel (NTK). In this article, we show that the NTK allows one to gain precise insight into the Hessian of the cost of DNNs: we obtain a full characterization of the asymptotics of the spectrum of the Hessian, at initialization and during training. ",/pdf/191586952ff738b8955827a8e8f105d8d76372ca.pdf,ICLR,2020,Description of the limiting spectrum of the Hessian of the loss surface of DNNs in the infinite-width limit. 
+HyxCxhRcY7,H1eYYyAcKX,1538090000000.0,1553810000000.0,1125,Deep Anomaly Detection with Outlier Exposure,"[""hendrycks@berkeley.edu"", ""mantas@ttic.edu"", ""tgd@oregonstate.edu""]","[""Dan Hendrycks"", ""Mantas Mazeika"", ""Thomas Dietterich""]","[""confidence"", ""uncertainty"", ""anomaly"", ""robustness""]","It is important to detect anomalous inputs when deploying machine learning systems. The use of larger and more complex inputs in deep learning magnifies the difficulty of distinguishing between anomalous and in-distribution examples. At the same time, diverse image and text data are available in enormous quantities. We propose leveraging these data to improve deep anomaly detection by training anomaly detectors against an auxiliary dataset of outliers, an approach we call Outlier Exposure (OE). This enables anomaly detectors to generalize and detect unseen anomalies. In extensive experiments on natural language processing and small- and large-scale vision tasks, we find that Outlier Exposure significantly improves detection performance. We also observe that cutting-edge generative models trained on CIFAR-10 may assign higher likelihoods to SVHN images than to CIFAR-10 images; we use OE to mitigate this issue. 
We also analyze the flexibility and robustness of Outlier Exposure, and identify characteristics of the auxiliary dataset that improve performance.",/pdf/9fd623c234374dcb0865ce91b049883cc7a7664f.pdf,ICLR,2019,"OE teaches anomaly detectors to learn heuristics for detecting unseen anomalies; experiments are in classification, density estimation, and calibration in NLP and vision settings; we do not tune on test distribution samples, unlike previous work" +Hylyui09tm,Hyeh0Iy9K7,1538090000000.0,1545360000000.0,319,EMI: Exploration with Mutual Information Maximizing State and Action Embeddings,"[""harry2636@mllab.snu.ac.kr"", ""jaekyeom@mllab.snu.ac.kr"", ""yeonwoo@mllab.snu.ac.kr"", ""svlevine@eecs.berkeley.edu"", ""hyunoh@snu.ac.kr""]","[""Hyoungseok Kim"", ""Jaekyeom Kim"", ""Yeonwoo Jeong"", ""Sergey Levine"", ""Hyun Oh Song""]","[""reinforcement learning"", ""exploration"", ""representation learning""]","Policy optimization struggles when the reward feedback signal is very sparse and essentially becomes a random search algorithm until the agent stumbles upon a rewarding or the goal state. Recent works utilize intrinsic motivation to guide the exploration via generative models, predictive forward models, or more ad-hoc measures of surprise. We propose EMI, which is an exploration method that constructs embedding representation of states and actions that does not rely on generative decoding of the full observation but extracts predictive signals that can be used to guide exploration based on forward prediction in the representation space. 
Our experiments show the state of the art performance on challenging locomotion task with continuous control and on image-based exploration tasks with discrete actions on Atari.",/pdf/f068858c552c5c904b199cfea760b4235e596cd8.pdf,ICLR,2019, +SyezSCNYPB,SygXnYL_vB,1569440000000.0,1577170000000.0,1102,Disentangled GANs for Controllable Generation of High-Resolution Images,"[""wn8@rice.edu"", ""tkarras@nvidia.com"", ""garg@cs.toronto.edu"", ""shoubhikdn@gmail.com"", ""anjul.patney@gmail.com"", ""abp4@rice.edu"", ""animakumar@gmail.com""]","[""Weili Nie"", ""Tero Karras"", ""Animesh Garg"", ""Shoubhik Debhath"", ""Anjul Patney"", ""Ankit B. Patel"", ""Anima Anandkumar""]","[""Disentangled GANs"", ""controllable generation"", ""high-resolution image synthesis"", ""semantic manipulation"", ""fine-grained factors""]","Generative adversarial networks (GANs) have achieved great success at generating realistic samples. However, achieving disentangled and controllable generation still remains challenging for GANs, especially in the high-resolution image domain. Motivated by this, we introduce AC-StyleGAN, a combination of AC-GAN and StyleGAN, for demonstrating that the controllable generation of high-resolution images is possible with sufficient supervision. More importantly, only using 5% of the labelled data significantly improves the disentanglement quality. Inspired by the observed separation of fine and coarse styles in StyleGAN, we then extend AC-StyleGAN to a new image-to-image model called FC-StyleGAN for semantic manipulation of fine-grained factors in a high-resolution image. In experiments, we show that FC-StyleGAN performs well in only controlling fine-grained factors, with the use of instance normalization, and also demonstrate its good generalization ability to unseen images. 
Finally, we create two new datasets -- Falcor3D and Isaac3D with higher resolution, more photorealism, and richer variation, as compared to existing disentanglement datasets.",/pdf/33496382d110e6f98705351587547d9698259356.pdf,ICLR,2020,We propose new GAN architectures that enable disentangled and controllable high-resolution image generation as well as new datasets that will serve as benchmarks for the research community. +SkeFl1HKwr,rkxMsBsuDB,1569440000000.0,1588070000000.0,1514,Empirical Studies on the Properties of Linear Regions in Deep Neural Networks,"[""xiao_zhang@hust.edu.cn"", ""drwu@hust.edu.cn""]","[""Xiao Zhang"", ""Dongrui Wu""]","[""deep learning"", ""linear region"", ""optimization""]","A deep neural network (DNN) with piecewise linear activations can partition the input space into numerous small linear regions, where different linear functions are fitted. It is believed that the number of these regions represents the expressivity of a DNN. This paper provides a novel and meticulous perspective to look into DNNs: Instead of just counting the number of the linear regions, we study their local properties, such as the inspheres, the directions of the corresponding hyperplanes, the decision boundaries, and the relevance of the surrounding regions. We empirically observed that different optimization techniques lead to completely different linear regions, even though they result in similar classification accuracies. 
We hope our study can inspire the design of novel optimization techniques, and help discover and analyze the behaviors of DNNs.",/pdf/6d21214c226bff2af178664283a7bbb6d2356b57.pdf,ICLR,2020, +3Aoft6NWFej,sJwJWrww4AV,1601310000000.0,1616050000000.0,1810,PMI-Masking: Principled masking of correlated spans,"[""~Yoav_Levine1"", ""barakl@ai21.com"", ""opherl@ai21.com"", ""~Omri_Abend1"", ""~Kevin_Leyton-Brown1"", ""~Moshe_Tennenholtz1"", ""~Yoav_Shoham1""]","[""Yoav Levine"", ""Barak Lenz"", ""Opher Lieber"", ""Omri Abend"", ""Kevin Leyton-Brown"", ""Moshe Tennenholtz"", ""Yoav Shoham""]","[""Language modeling"", ""BERT"", ""pointwise mutual information""]","Masking tokens uniformly at random constitutes a common flaw in the pretraining of Masked Language Models (MLMs) such as BERT. We show that such uniform masking allows an MLM to minimize its training objective by latching onto shallow local signals, leading to pretraining inefficiency and suboptimal downstream performance. To address this flaw, we propose PMI-Masking, a principled masking strategy based on the concept of Pointwise Mutual Information (PMI), which jointly masks a token n-gram if it exhibits high collocation over the corpus. PMI-Masking motivates, unifies, and improves upon prior more heuristic approaches that attempt to address the drawback of random uniform token masking, such as whole-word masking, entity/phrase masking, and random-span masking. 
Specifically, we show experimentally that PMI-Masking reaches the performance of prior masking approaches in half the training time, and consistently improves performance at the end of pretraining.",/pdf/0ea2c13c0234ad53369c2787936620d2f84ce64d.pdf,ICLR,2021,Joint masking of correlated tokens significantly speeds up and improves BERT's pretraining +#NAME?,v0wAwxueva,1601310000000.0,1612870000000.0,3251,Analyzing the Expressive Power of Graph Neural Networks in a Spectral Perspective,"[""~Muhammet_Balcilar1"", ""guillaume.renton@gmail.com"", ""pierre.heroux@univ-rouen.fr"", ""benoit.gauzere@insa-rouen.fr"", ""~S\u00e9bastien_Adam1"", ""~Paul_Honeine1""]","[""Muhammet Balcilar"", ""Guillaume Renton"", ""Pierre H\u00e9roux"", ""Benoit Ga\u00fcz\u00e8re"", ""S\u00e9bastien Adam"", ""Paul Honeine""]","[""Graph Neural Networks"", ""Spectral Graph Filter"", ""Spectral Analysis""]","In the recent literature of Graph Neural Networks (GNN), the expressive power of models has been studied through their capability to distinguish if two given graphs are isomorphic or not. Since the graph isomorphism problem is NP-intermediate, and Weisfeiler-Lehman (WL) test can give sufficient but not enough evidence in polynomial time, the theoretical power of GNNs is usually evaluated by the equivalence of WL-test order, followed by an empirical analysis of the models on some reference inductive and transductive datasets. However, such analysis does not account the signal processing pipeline, whose capability is generally evaluated in the spectral domain. In this paper, we argue that a spectral analysis of GNNs behavior can provide a complementary point of view to go one step further in the understanding of GNNs. By bridging the gap between the spectral and spatial design of graph convolutions, we theoretically demonstrate some equivalence of the graph convolution process regardless it is designed in the spatial or the spectral domain. 
Using this connection, we managed to re-formulate most of the state-of-the-art graph neural networks into one common framework. This general framework allows us to conduct a spectral analysis of the most popular GNNs, explaining their performance and showing their limits from a spectral point of view. Our theoretical spectral analysis is confirmed by experiments on various graph databases. Furthermore, we demonstrate the necessity of high-pass and/or band-pass filters on a graph dataset, while the majority of GNNs are limited to low-pass filtering and inevitably fail.",/pdf/859c9ee357c81e0b9a1cb989b1e23b8b42d741f1.pdf,ICLR,2021,This paper aims to analyse the expressive power of Graph Neural Networks in the spectral domain. +Bkbc-Vqeg,,1478280000000.0,1478290000000.0,208,Learning Word-Like Units from Joint Audio-Visual Analysis,"[""dharwath@mit.edu"", ""glass@mit.edu""]","[""David Harwath"", ""James R. Glass""]","[""Speech"", ""Computer vision"", ""Deep learning"", ""Multi-modal learning"", ""Unsupervised Learning"", ""Semi-Supervised Learning""]","Given a collection of images and spoken audio captions, we present a method for discovering word-like acoustic units in the continuous speech signal and grounding them to semantically relevant image regions. For example, our model is able to detect spoken instances of the word ``lighthouse'' within an utterance and associate them with image regions containing lighthouses. We do not use any form of conventional automatic speech recognition, nor do we use any text transcriptions or conventional linguistic annotations. 
Our model effectively implements a form of spoken language acquisition, in which the computer learns not only to recognize word categories by sound, but also to enrich the words it learns with semantics by grounding them in images.",/pdf/21daf351a51f71241db0425689ae98e74f491509.pdf,ICLR,2017, +mYNfmvt8oSv,9GVtkxqEh2q,1601310000000.0,1614990000000.0,889,D2RL: Deep Dense Architectures in Reinforcement Learning,"[""~Samarth_Sinha1"", ""~Homanga_Bharadhwaj1"", ""~Aravind_Srinivas1"", ""~Animesh_Garg1""]","[""Samarth Sinha"", ""Homanga Bharadhwaj"", ""Aravind Srinivas"", ""Animesh Garg""]","[""Deep Reinforcement learning"", ""Policy architectures""]","While improvements in deep learning architectures have played a crucial role in improving the state of supervised and unsupervised learning in computer vision and natural language processing, neural network architecture choices for reinforcement learning remain relatively under-explored. We take inspiration from successful architectural choices in computer vision and generative modeling, and investigate the use of deeper networks and dense connections for reinforcement learning on a variety of simulated robotic learning benchmark environments. Our findings reveal that current methods benefit significantly from dense connections and deeper networks, across a suite of manipulation and locomotion tasks, for both proprioceptive and image-based observations. We hope that our results can serve as a strong baseline and further motivate future research into neural network architectures for reinforcement learning. The project website is at this link https://sites.google.com/view/d2rl-anonymous/home",/pdf/a5bb9c4becab991cea22e59d7f654dbc60f44170.pdf,ICLR,2021,Introducing dense architectures in the policy and value function in deep reinforcement learning can significantly improve performance in state and image-based RL. 
+S1WRibb0Z,rkxAibZAb,1509130000000.0,1518730000000.0,716,Expressive power of recurrent neural networks,"[""khrulkov.v@gmail.com"", ""sasha.v.novikov@gmail.com"", ""i.oseledets@skoltech.ru""]","[""Valentin Khrulkov"", ""Alexander Novikov"", ""Ivan Oseledets""]","[""Recurrent Neural Networks"", ""Tensor Train"", ""tensor decompositions"", ""expressive power""]","Deep neural networks are surprisingly efficient at solving practical tasks, +but the theory behind this phenomenon is only starting to catch up with +the practice. Numerous works show that depth is the key to this efficiency. +A certain class of deep convolutional networks – namely those that correspond +to the Hierarchical Tucker (HT) tensor decomposition – has been +proven to have exponentially higher expressive power than shallow networks. +I.e. a shallow network of exponential width is required to realize +the same score function as computed by the deep architecture. In this paper, +we prove the expressive power theorem (an exponential lower bound on +the width of the equivalent shallow network) for a class of recurrent neural +networks – ones that correspond to the Tensor Train (TT) decomposition. +This means that even processing an image patch by patch with an RNN +can be exponentially more efficient than a (shallow) convolutional network +with one hidden layer. Using theoretical results on the relation between +the tensor decompositions we compare expressive powers of the HT- and +TT-Networks. We also implement the recurrent TT-Networks and provide +numerical evidence of their expressivity.",/pdf/19046a3cba1f81a58ab3c0a626f862033e359216.pdf,ICLR,2018,We prove the exponential efficiency of recurrent-type neural networks over shallow networks. 
+2LiGI26kRdt,3qivR8jCk53F,1601310000000.0,1614990000000.0,1143,Progressively Stacking 2.0: A Multi-stage Layerwise Training Method for BERT Training Speedup,"[""~Cheng_Yang3"", ""~Shengnan_Wang2"", ""~Chao_Yang7"", ""~Yuechuan_Li1"", ""~Ru_He1"", ""~Jingqiao_Zhang1""]","[""Cheng Yang"", ""Shengnan Wang"", ""Chao Yang"", ""Yuechuan Li"", ""Ru He"", ""Jingqiao Zhang""]","[""BERT"", ""Training speedup"", ""Multi-stage training"", ""Natural language processing""]","Pre-trained language models, such as BERT, have achieved significant accuracy gain in many natural language processing tasks. Despite its effectiveness, the huge number of parameters makes training a BERT model computationally very challenging. In this paper, we propose an efficient multi-stage layerwise training (MSLT) approach to reduce the training time of BERT. We decompose the whole training process into several stages. The training is started from a small model with only a few encoder layers and we gradually increase the depth of the model by adding new encoder layers. At each stage, we only train the top (near the output layer) few encoder layers which are newly added. The parameters of the other layers which have been trained in the previous stages will not be updated in the current stage. In BERT training, the backward calculation is much more time-consuming than the forward calculation, especially in the distributed training setting in which the backward calculation time further includes the communication time for gradient synchronization. In the proposed training strategy, only top few layers participate backward calculation, while most layers only participate forward calculation. Hence both the computation and communication efficiencies are greatly improved. 
Experimental results show that the proposed method can greatly reduce the training time without significant performance degradation.",/pdf/83ee51601f8d73e503055c624d38fa95fadb1898.pdf,ICLR,2021,This paper proposes a multi-stage layerwise training method to accelerate the training of BERT model. +_TGlfdZOHY3,8I0Dv2L0O91,1601310000000.0,1614990000000.0,1898,"On Episodes, Prototypical Networks, and Few-Shot Learning","[""~Steinar_Laenen1"", ""~Luca_Bertinetto1""]","[""Steinar Laenen"", ""Luca Bertinetto""]","[""few-shot learning"", ""meta-learning"", ""metric learning"", ""deep learning""]","Episodic learning is a popular practice among researchers and practitioners interested in few-shot learning. It consists of organising training in a series of learning problems, each relying on small “support” and “query” sets to mimic the few-shot circumstances encountered during evaluation. +In this paper, we investigate the usefulness of episodic learning in Prototypical Networks, one of the most popular algorithms making use of this practice. +Surprisingly, in our experiments we found that, for Prototypical Networks, it is detrimental to use the episodic learning strategy of separating training samples between support and query set, as it is +a data-inefficient way to exploit training batches. This “non-episodic” version of Prototypical Networks, which corresponds to the classic Neighbourhood Component Analysis, reliably improves over its episodic counterpart in multiple datasets, achieving an accuracy that is competitive with the state-of-the-art, despite being extremely simple.",/pdf/c756f263a2f38fc5dcd3070bfa32faffe2444378.pdf,ICLR,2021,"We analysed the effectiveness of episodic learning in Prototypical Networks and found out that, despite adding complexity and hyper-parameters, it severely affects its performance." 
+ry9tUX_6-,ry5tUQ_T-,1508550000000.0,1518730000000.0,32,Entropy-SGD optimizes the prior of a PAC-Bayes bound: Data-dependent PAC-Bayes priors via differential privacy,"[""gkd22@cam.ac.uk"", ""droy@utstat.toronto.edu""]","[""Gintare Karolina Dziugaite"", ""Daniel M. Roy""]","[""generalization error"", ""neural networks"", ""statistical learning theory"", ""PAC-Bayes theory""]","We show that Entropy-SGD (Chaudhari et al., 2017), when viewed as a learning algorithm, optimizes a PAC-Bayes bound on the risk of a Gibbs (posterior) classifier, i.e., a randomized classifier obtained by a risk-sensitive perturbation of the weights of a learned classifier. Entropy-SGD works by optimizing the bound’s prior, violating the hypothesis of the PAC-Bayes theorem that the prior is chosen independently of the data. Indeed, available implementations of Entropy-SGD rapidly obtain zero training error on random labels and the same holds of the Gibbs posterior. In order to obtain a valid generalization bound, we show that an ε-differentially private prior yields a valid PAC-Bayes bound, a straightforward consequence of results connecting generalization with differential privacy. Using stochastic gradient Langevin dynamics (SGLD) to approximate the well-known exponential release mechanism, we observe that generalization error on MNIST (measured on held out data) falls within the (empirically nonvacuous) bounds computed under the assumption that SGLD produces perfect samples. In particular, Entropy-SGLD can be configured to yield relatively tight generalization bounds and still fit real labels, although these same settings do not obtain state-of-the-art performance.",/pdf/973c38ba45d8720ead2c7d267b2027499a0a40da.pdf,ICLR,2018,"We show that Entropy-SGD optimizes the prior of a PAC-Bayes bound, violating the requirement that the prior be independent of data; we use differential privacy to resolve this and improve generalization." 
+B1xnPsA5KX,B1lLG0DqKQ,1538090000000.0,1545360000000.0,299,Modular Deep Probabilistic Programming,"[""zhenwend@amazon.com"", ""erimeiss@amazon.com"", ""lawrennd@amazon.com""]","[""Zhenwen Dai"", ""Eric Meissner"", ""Neil D. Lawrence""]",[],"Modularity is a key feature of deep learning libraries but has not been fully exploited for probabilistic programming. We propose to improve the modularity of probabilistic programming languages by offering not only plain probabilistic distributions but also sophisticated probabilistic models such as Bayesian non-parametric models as fundamental building blocks. We demonstrate this idea by presenting a modular probabilistic programming language MXFusion, which includes a new type of re-usable building block, called probabilistic modules. A probabilistic module consists of a set of random variables with associated probabilistic distributions and dedicated inference methods. Under the framework of variational inference, the pre-specified inference methods of individual probabilistic modules can be transparently used for inference of the whole probabilistic model. We demonstrate the power and convenience of probabilistic modules in MXFusion with various examples of Gaussian process models, which are evaluated with experiments on real data.",/pdf/2060b15e3a209688624c854d9a0d2c04d4a368f0.pdf,ICLR,2019, +ByS1VpgRZ,r1NyNae0Z,1509110000000.0,1518780000000.0,412,cGANs with Projection Discriminator,"[""miyato@preferred.jp"", ""koyama.masanori@gmail.com""]","[""Takeru Miyato"", ""Masanori Koyama""]","[""Generative Adversarial Networks"", ""GANs"", ""conditional GANs"", ""Generative models"", ""Projection""]","We propose a novel, projection based way to incorporate the conditional information into the discriminator of GANs that respects the role of the conditional information in the underlying probabilistic model. 
+This approach is in contrast with most frameworks of conditional GANs used in application today, which use the conditional information by concatenating the (embedded) conditional vector to the feature vectors. +With this modification, we were able to significantly improve the quality of the class conditional image generation on ILSVRC2012 (ImageNet) dataset from the current state-of-the-art result, and we achieved this with a single pair of a discriminator and a generator. +We were also able to extend the application to super-resolution and succeeded in producing highly discriminative super-resolution images. +This new structure also enabled high quality category transformation based on parametric functional transformation of conditional batch normalization layers in the generator.",/pdf/95ea56456c25b9ccad802c1838e43218dd07cc42.pdf,ICLR,2018,"We propose a novel, projection based way to incorporate the conditional information into the discriminator of GANs that respects the role of the conditional information in the underlying probabilistic model." +ABZSAe9gNeg,kkEp5IF1W6c,1601310000000.0,1614990000000.0,2382,Differentially Private Synthetic Data: Applied Evaluations and Enhancements,"[""~Lucas_Rosenblatt1"", ""~Xiaoyan_Liu1"", ""sapouyan@microsoft.com"", ""eddeleon@microsoft.com"", ""andesai@microsoft.com"", ""joshuaa@microsoft.com""]","[""Lucas Rosenblatt"", ""Xiaoyan Liu"", ""Samira Pouyanfar"", ""Eduardo de Leon"", ""Anuj Desai"", ""Joshua Allen""]","[""privacy"", ""differential privacy"", ""generative adversarial networks"", ""gan"", ""security"", ""synthetic data"", ""evaluation"", ""benchmarking"", ""ensemble""]","Machine learning practitioners frequently seek to leverage the most informative available data, without violating the data owner's privacy, when building predictive models. 
Differentially private data synthesis protects personal details from exposure, and allows for the training of differentially private machine learning models on privately generated datasets. But how can we effectively assess the efficacy of differentially private synthetic data? In this paper, we survey four differentially private generative adversarial networks for data synthesis. We evaluate each of them at scale on five standard tabular datasets, and in two applied industry scenarios. We benchmark with novel metrics from recent literature and other standard machine learning tools. Our results suggest some synthesizers are more applicable for different privacy budgets, and we further demonstrate complicating domain-based tradeoffs in selecting an approach. We offer experimental learning on applied machine learning scenarios with private internal data to researchers and practitioners alike. In addition, we propose QUAIL, a two model hybrid approach to generating synthetic data. We examine QUAIL's tradeoffs, and note circumstances in which it outperforms baseline differentially private supervised learning models under the same budget constraint.",/pdf/98f400b9a620eac2695ab46c54ae30d47cdc53d9.pdf,ICLR,2021,"We present both extensive benchmarking for state-of-the-art differentially private synthesizers and QUAIL, an ensemble-based modeling approach to generating differentially private synthetic data with high utility." 
+px0-N3_KjA,Y_eZcp4Rxqn,1601310000000.0,1614990000000.0,2137,D4RL: Datasets for Deep Data-Driven Reinforcement Learning,"[""~Justin_Fu1"", ""~Aviral_Kumar2"", ""~Ofir_Nachum1"", ""~George_Tucker1"", ""~Sergey_Levine1""]","[""Justin Fu"", ""Aviral Kumar"", ""Ofir Nachum"", ""George Tucker"", ""Sergey Levine""]","[""reinforcement learning"", ""deep learning"", ""benchmarks""]","The offline reinforcement learning (RL) problem, also known as batch RL, refers to the setting where a policy must be learned from a static dataset, without additional online data collection. This setting is compelling as it potentially allows RL methods to take advantage of large, pre-collected datasets, much like how the rise of large datasets has fueled results in supervised learning in recent years. However, existing online RL benchmarks are not tailored towards the offline setting, making progress in offline RL difficult to measure. In this work, we introduce benchmarks specifically designed for the offline setting, guided by key properties of datasets relevant to real-world applications of offline RL. Examples of such properties include: datasets generated via hand-designed controllers and human demonstrators, multi-objective datasets where an agent can perform different tasks in the same environment, and datasets consisting of a mixtures of policies. To facilitate research, we release our benchmark tasks and datasets with a comprehensive evaluation of existing algorithms and an evaluation protocol together with an open-source codebase. We hope that our benchmark will focus research effort on methods that drive improvements not just on simulated tasks, but ultimately on the kinds of real-world problems where offline RL will have the largest impact. ",/pdf/dd471beb05746ae8e58163f595af26abeaeb120b.pdf,ICLR,2021,A benchmark proposal for offline reinforcement learning. 
+j6rILItz4yr,t83Bu3RoTpsD,1601310000000.0,1614990000000.0,2094,ALFA: Adversarial Feature Augmentation for Enhanced Image Recognition,"[""~Tianlong_Chen1"", ""~Yu_Cheng1"", ""~Zhe_Gan1"", ""~Yu_Hu3"", ""~Zhangyang_Wang1"", ""~Jingjing_Liu2""]","[""Tianlong Chen"", ""Yu Cheng"", ""Zhe Gan"", ""Yu Hu"", ""Zhangyang Wang"", ""Jingjing Liu""]","[""Adversarial Training"", ""Image Recognition"", ""Generalization""]","Adversarial training is an effective method to combat adversarial attacks in order to create robust neural networks. By using an auxiliary batch normalization on adversarial examples, it has been shown recently to possess great potential in improving the generalization ability of neural networks for image recognition as well. However, crafting pixel-level adversarial perturbations is computationally expensive. To address this issue, we propose AdversariaL Feature Augmentation (ALFA), which advocates adversarial training on the intermediate layers of feature embeddings. ALFA utilizes both clean and adversarial augmented features jointly to enhance standard trained networks. To eliminate laborious tuning of key parameters such as locations and strength of feature augmentations, we further design a learnable adversarial feature augmentation (L-ALFA) framework to automatically adjust the perturbation magnitude of each perturbed feature. Extensive experiments demonstrate that our proposed ALFA and L-ALFA methods achieve significant and consistent generalization improvement over strong baselines on CIFAR-10, CIFAR-100, and ImageNet benchmarks across different backbone networks for image recognition. ",/pdf/416b6646d333f54ff2e84ffa8a42c26e13028cbc.pdf,ICLR,2021,Utilizing clean and adversarial augmented features to improve the generalization of image recognition +rky3QW9le,,1478260000000.0,1481370000000.0,168,Transformational Sparse Coding,"[""gklezd@cs.washington.edu"", ""rao@cs.washington.edu""]","[""Dimitrios C. Gklezakos"", ""Rajesh P. N. 
Rao""]","[""Unsupervised Learning"", ""Computer vision"", ""Optimization""]"," +A fundamental problem faced by object recognition systems is that +objects and their features can appear in different locations, scales +and orientations. Current deep learning methods attempt to achieve +invariance to local translations via pooling, discarding the locations +of features in the process. Other approaches explicitly learn +transformed versions of the same feature, leading to representations +that quickly explode in size. Instead of discarding the rich and +useful information about feature transformations to achieve +invariance, we argue that models should learn object features +conjointly with their transformations to achieve equivariance. We +propose a new model of unsupervised learning based on sparse coding +that can learn object features jointly with their affine +transformations directly from images. Results based on learning from +natural images indicate that our approach +matches the reconstruction quality of traditional sparse coding but +with significantly fewer degrees of freedom while simultaneously +learning transformations from data. These results open the door to +scaling up unsupervised learning to allow deep feature+transformation +learning in a manner consistent with the ventral+dorsal stream +architecture of the primate visual cortex.",/pdf/18c35a543b180505030a29fc079e8bd18b5d4c81.pdf,ICLR,2017,We extend sparse coding to include general affine transformations. We present a novel technical approach to circumvent inference intractability. +KsN9p5qJN3,riTm-ytKGQ3,1601310000000.0,1614990000000.0,1862,Energy-based Out-of-distribution Detection for Multi-label Classification,"[""~Haoran_Wang5"", ""~Weitang_Liu1"", ""~Alex_Bocchieri1"", ""~Yixuan_Li1""]","[""Haoran Wang"", ""Weitang Liu"", ""Alex Bocchieri"", ""Yixuan Li""]",[],"Out-of-distribution (OOD) detection is essential to prevent anomalous inputs from causing a model to fail during deployment. 
Improved methods for OOD detection in multi-class classification have emerged, while OOD detection methods for multi-label classification remain underexplored and use rudimentary techniques. We propose SumEnergy, a simple and effective method, which estimates the OOD indicator scores by aggregating energy scores from multiple labels. We show that SumEnergy can be mathematically interpreted from a joint likelihood perspective. Our results show consistent improvement over previous methods that are based on the maximum-valued scores, which fail to capture joint information from multiple labels. We demonstrate the effectiveness of our method on three common multi-label classification benchmarks, including MS-COCO, PASCAL-VOC, and NUS-WIDE. We show that SumEnergy reduces the FPR95 by up to 10.05% compared to the previous best baseline, establishing state-of-the-art performance. ",/pdf/1af10ee41e54c2f344de2c57da90b04a2cea5136.pdf,ICLR,2021,"We investigate OOD detection for multi-label classification networks, and propose an energy-based method which is both theoretically meaningful and empirically effective, establishing state-of-the-art performance on common benchmarks. " +ByxY8CNtvr,S1gh9fvOPS,1569440000000.0,1583910000000.0,1146,Improving Neural Language Generation with Spectrum Control,"[""lingxw@cs.ucla.edu"", ""jing.huang@jd.com"", ""kevin.huang3@jd.com"", ""bull@cs.ucla.edu"", ""guangtao.wang@jd.com"", ""qgu@cs.ucla.edu""]","[""Lingxiao Wang"", ""Jing Huang"", ""Kevin Huang"", ""Ziniu Hu"", ""Guangtao Wang"", ""Quanquan Gu""]",[],"Recent Transformer-based models such as Transformer-XL and BERT have achieved huge success on various natural language processing tasks. However, contextualized embeddings at the output layer of these powerful models tend to degenerate and occupy an anisotropic cone in the vector space, which is called the representation degeneration problem. 
In this paper, we propose a novel spectrum control approach to address this degeneration problem. The core idea of our method is to directly guide the spectra training of the output embedding matrix with a slow-decaying singular value prior distribution through a reparameterization framework. We show that our proposed method encourages isotropy of the learned word representations while maintaining the modeling power of these contextual neural models. We further provide a theoretical analysis and insight on the benefit of modeling singular value distribution. We demonstrate that our spectrum control method outperforms the state-of-the-art Transformer-XL model for language modeling, and various Transformer-based models for machine translation, on common benchmark datasets for these tasks.",/pdf/d829997f1a6b658634736640508b66f39b8aa6d7.pdf,ICLR,2020, +D5Wt3FtvCF,Q7EEbDWUzbc,1601310000000.0,1614990000000.0,386,PURE: An Uncertainty-aware Recommendation Framework for Maximizing Expected Posterior Utility of Platform,"[""~Haokun_Chen1"", ""jingmu.lzy@alibaba-inc.com"", ""~Chen_Xu2"", ""~Ziqian_Chen1"", ""jinyang.gjy@alibaba-inc.com"", ""~Bolin_Ding3""]","[""Haokun Chen"", ""Zhaoyang Liu"", ""Chen Xu"", ""Ziqian Chen"", ""Jinyang Gao"", ""Bolin Ding""]","[""commercial recommendation"", ""maximizing platform benefits"", ""uncertainty-aware"", ""influence of display policy"", ""non-convex optimization""]","Commercial recommendation can be regarded as an interactive process between the recommendation platform and its target users. One crucial problem for the platform is how to make full use of its advantages so as to maximize its utility, i.e., the commercial benefits from recommendation. 
In this paper, we propose a novel recommendation framework which effectively utilizes the information of user uncertainty over different item dimensions and explicitly takes into consideration the impact of display policy on user in order to achieve maximal expected posterior utility for the platform. We formulate the problem of deriving optimal policy to achieve maximal expected posterior utility as a constrained non-convex optimization problem and further propose an ADMM-based solution to derive an approximately optimal policy. Extensive experiments are conducted over data collected from a real-world recommendation platform and demonstrate the effectiveness of the proposed framework. Besides, we also adopt the proposed framework to conduct experiments with an intent to reveal how the platform achieves its commercial benefits. The results suggest that the platform should cater to the user's preference for item dimensions that the user prefers, while for item dimensions where the user is with high uncertainty, the platform can achieve more commercial benefits by recommending items with high utilities.",/pdf/770c3c14eb315661cf91be42146e47be5dde7602.pdf,ICLR,2021,"In the paper, we propose a novel recommendation framework to maximize the platform's expected posterior utility, taking into consideration the user uncertainty over different item dimensions and the influence of display policy over user." +nIAxjsniDzg,JJM_Bso_O0,1601310000000.0,1616060000000.0,581,What Matters for On-Policy Deep Actor-Critic Methods? 
A Large-Scale Study,"[""~Marcin_Andrychowicz1"", ""~Anton_Raichuk1"", ""stanczyk@google.com"", ""eorsini@google.com"", ""sertan@google.com"", ""~Rapha\u00ebl_Marinier1"", ""~Leonard_Hussenot1"", ""~Matthieu_Geist1"", ""~Olivier_Pietquin1"", ""~Marcin_Michalski1"", ""~Sylvain_Gelly1"", ""~Olivier_Bachem1""]","[""Marcin Andrychowicz"", ""Anton Raichuk"", ""Piotr Sta\u0144czyk"", ""Manu Orsini"", ""Sertan Girgin"", ""Rapha\u00ebl Marinier"", ""Leonard Hussenot"", ""Matthieu Geist"", ""Olivier Pietquin"", ""Marcin Michalski"", ""Sylvain Gelly"", ""Olivier Bachem""]","[""Reinforcement learning"", ""continuous control""]","In recent years, reinforcement learning (RL) has been successfully applied to many different continuous control tasks. While RL algorithms are often conceptually simple, their state-of-the-art implementations take numerous low- and high-level design decisions that strongly affect the performance of the resulting agents. Those choices are usually not extensively discussed in the literature, leading to discrepancy between published descriptions of algorithms and their implementations. This makes it hard to attribute progress in RL and slows down overall progress [Engstrom'20]. As a step towards filling that gap, we implement >50 such ``""choices"" in a unified on-policy deep actor-critic framework, allowing us to investigate their impact in a large-scale empirical study. We train over 250'000 agents in five continuous control environments of different complexity and provide insights and practical recommendations for the training of on-policy deep actor-critic RL agents.",/pdf/6e07c0b9828c97445a9667bf90fe254fa7de8e3a.pdf,ICLR,2021,We conduct a large-scale empirical study that provides insights and practical recommendations for the training of on-policy deep actor-critic RL agents. 
+rylIAsCqYm,BkldQ6a5Km,1538090000000.0,1560040000000.0,895,A2BCD: Asynchronous Acceleration with Optimal Complexity,"[""roberthannah89@gmail.com"", ""fei.feng@math.ucla.edu"", ""wotaoyin@math.ucla.edu""]","[""Robert Hannah"", ""Fei Feng"", ""Wotao Yin""]","[""asynchronous"", ""optimization"", ""parallel"", ""accelerated"", ""complexity""]"," In this paper, we propose the Asynchronous Accelerated Nonuniform Randomized Block Coordinate Descent algorithm (A2BCD). We prove A2BCD converges linearly to a solution of the convex minimization problem at the same rate as NU_ACDM, so long as the maximum delay is not too large. This is the first asynchronous Nesterov-accelerated algorithm that attains any provable speedup. Moreover, we then prove that these algorithms both have optimal complexity. Asynchronous algorithms complete much faster iterations, and A2BCD has optimal complexity. Hence we observe in experiments that A2BCD is the top-performing coordinate descent algorithm, converging up to 4-5x faster than NU_ACDM on some data sets in terms of wall-clock time. To motivate our theory and proof techniques, we also derive and analyze a continuous-time analog of our algorithm and prove it converges at the same rate.",/pdf/8523a02ea79de46cc5a765f360beaeb9a09d3f98.pdf,ICLR,2019,We prove the first-ever convergence proof of an asynchronous accelerated algorithm that attains a speedup. +9xC2tWEwBD,VTy-Ry9fN_I,1601310000000.0,1614290000000.0,962,"A Panda? 
No, It's a Sloth: Slowdown Attacks on Adaptive Multi-Exit Neural Network Inference","[""~Sanghyun_Hong1"", ""~Yigitcan_Kaya1"", ""modoranu.ionut.vlad@hotmail.com"", ""~Tudor_Dumitras1""]","[""Sanghyun Hong"", ""Yigitcan Kaya"", ""Ionu\u021b-Vlad Modoranu"", ""Tudor Dumitras""]","[""Slowdown attacks"", ""efficient inference"", ""input-adaptive multi-exit neural networks"", ""adversarial examples""]","Recent increases in the computational demands of deep neural networks (DNNs), combined with the observation that most input samples require only simple models, have sparked interest in input-adaptive multi-exit architectures, such as MSDNets or Shallow-Deep Networks. These architectures enable faster inferences and could bring DNNs to low-power devices, e.g., in the Internet of Things (IoT). However, it is unknown if the computational savings provided by this approach are robust against adversarial pressure. In particular, an adversary may aim to slow down adaptive DNNs by increasing their average inference time—a threat analogous to the denial-of-service attacks from the Internet. In this paper, we conduct a systematic evaluation of this threat by experimenting with three generic multi-exit DNNs (based on VGG16, MobileNet, and ResNet56) and a custom multi-exit architecture, on two popular image classification benchmarks (CIFAR-10 and Tiny ImageNet). To this end, we show that adversarial example-crafting techniques can be modified to cause slowdown, and we propose a metric for comparing their impact on different architectures. We show that a slowdown attack reduces the efficacy of multi-exit DNNs by 90–100%, and it amplifies the latency by 1.5–5× in a typical IoT deployment. We also show that it is possible to craft universal, reusable perturbations and that the attack can be effective in realistic black-box scenarios, where the attacker has limited knowledge about the victim. Finally, we show that adversarial training provides limited protection against slowdowns. 
These results suggest that further research is needed for defending multi-exit architectures against this emerging threat. Our code is available at https://github.com/sanghyun-hong/deepsloth. ",/pdf/c7b1e1ec7f160d09cea4ae461b498ee701297eb3.pdf,ICLR,2021,Are the computational savings provided by the input-adaptive 'multi-exit architectures' robust against adversarial perturbations? No. +Wi5KUNlqWty,LjVtpB8jIg9,1601310000000.0,1624510000000.0,1028,How to Find Your Friendly Neighborhood: Graph Attention Design with Self-Supervision,"[""~Dongkwan_Kim1"", ""~Alice_Oh1""]","[""Dongkwan Kim"", ""Alice Oh""]","[""Graph Neural Network"", ""Attention Mechanism"", ""Self-supervised Learning""]","Attention mechanism in graph neural networks is designed to assign larger weights to important neighbor nodes for better representation. However, what graph attention learns is not understood well, particularly when graphs are noisy. In this paper, we propose a self-supervised graph attention network (SuperGAT), an improved graph attention model for noisy graphs. Specifically, we exploit two attention forms compatible with a self-supervised task to predict edges, whose presence and absence contain the inherent information about the importance of the relationships between nodes. By encoding edges, SuperGAT learns more expressive attention in distinguishing mislinked neighbors. We find two graph characteristics influence the effectiveness of attention forms and self-supervision: homophily and average degree. Thus, our recipe provides guidance on which attention design to use when those two graph characteristics are known. 
Our experiment on 17 real-world datasets demonstrates that our recipe generalizes across 15 of them, and our models designed by recipe show improved performance over baselines.",/pdf/9a9b994ea7dc4b59415ee1780753f7061f19eea4.pdf,ICLR,2021,We propose a method that self-supervises graph attention through edges and it should be designed according to the average degree and homophily of graphs. +JCz05AtXO3y,aFr2sxFAphS,1601310000000.0,1614990000000.0,1160,Structural Landmarking and Interaction Modelling: on Resolution Dilemmas in Graph Classification,"[""~Kai_Zhang1"", ""~Yaokang_Zhu1"", ""~Jun_Wang4"", ""~Haibin_Ling1"", ""~Jie_Zhang10"", ""~Hongyuan_Zha1""]","[""Kai Zhang"", ""Yaokang Zhu"", ""Jun Wang"", ""Haibin Ling"", ""Jie Zhang"", ""Hongyuan Zha""]","[""Graph Pooling"", ""Graph Classification"", ""Interaction Preserving Graph Pooling"", ""Structure Landmarking""]","Graph neural networks are a promising architecture for learning and inference with graph-structured data. However, generating informative graph level features has long been a challenge. Current practice of graph-pooling typically summarizes a graph by squeezing it into a single vector. This may lead to significant loss of predictive, interpretable structural information, because properties of a complex system are believed to arise largely from the interaction among its components. In this paper, we analyze the intrinsic difficulty in graph classification under the unified concept of ""resolution dilemmas"" and propose SLIM, an inductive neural network model for Structural Landmarking and Interaction Modelling, to remedy the information loss in graph pooling. We show that, by projecting graphs onto end-to-end optimizable, and well-aligned substructure landmarks (representatives), the resolution dilemmas can be resolved effectively, so that explicit interacting relation between component parts of a graph can be leveraged directly in explaining its complexity and predicting its property. 
Empirical evaluations, in comparison with state-of-the-art, demonstrate promising results of our approach on a number of benchmark datasets for graph classification. +",/pdf/7469d918ea04138141acfd7b4b2e7f8fd3df8a65.pdf,ICLR,2021,A new framework for graph pooling that allows explicit modelling of graph substructures and their interacting relations. +SJegkkrYPS,SygIFs9_vB,1569440000000.0,1577170000000.0,1455,Starfire: Regularization-Free Adversarially-Robust Structured Sparse Training,"[""ngamboa@stanford.edu"", ""kudrolli@stanford.edu"", ""anandd@stanford.edu"", ""perdavan@stanford.edu""]","[""Noah Gamboa"", ""Kais Kudrolli"", ""Anand Dhoot"", ""Ardavan Pedram""]","[""Structured Sparsity"", ""Sparsity"", ""Training"", ""Compression"", ""Adversarial"", ""Regularization"", ""Acceleration""]","This paper studies structured sparse training of CNNs with a gradual pruning technique that leads to fixed, sparse weight matrices after a set number of epochs. We simplify the structure of the enforced sparsity so that it reduces overhead caused by regularization. The proposed training methodology explores several options for structured sparsity. + +We study various tradeoffs with respect to pruning duration, learning-rate configuration, and the total length of training. +We show that our method creates a sparse version of ResNet50 and ResNet50v1.5 on full ImageNet while remaining within a negligible <1% margin of accuracy loss. To make sure that this type of sparse training does not harm the robustness of the network, we also demonstrate how the network behaves in the presence of adversarial attacks. Our results show that with 70% target sparsity, over 75% top-1 accuracy is achievable. ",/pdf/1141600f493b40ede4899a3e551c2e71dc4dbff9.pdf,ICLR,2020,"This paper studies structured sparse training of CNNs that leads to fixed, sparse weight matrices after a set number of epochs." 
+snaT4xewUfX,WDTBa4UPyDwL,1601310000000.0,1614990000000.0,3436,Variational inference for diffusion modulated Cox processes,"[""~Prateek_Jaiswal1"", ""~Harsha_Honnappa1"", ""~Vinayak_Rao1""]","[""Prateek Jaiswal"", ""Harsha Honnappa"", ""Vinayak Rao""]","[""Cox process"", ""variational inference"", ""stochastic differential equation"", ""smoothing posterior density""]","This paper proposes a stochastic variational inference (SVI) method for computing an approximate posterior path measure of a Cox process. These processes are widely used in natural and physical sciences, engineering and operations research, and represent a non-trivial model of a wide array of phenomena. In our work, we model the stochastic intensity as the solution of a diffusion stochastic differential equation (SDE), and our objective is to infer the posterior, or smoothing, measure over the paths given Poisson process realizations. We first derive a system of stochastic partial differential equations (SPDE) for the pathwise smoothing posterior density function, a non-trivial result, since the standard solution of SPDEs typically involves an It\^o stochastic integral, which is not defined pathwise. Next, we propose an SVI approach to approximating the solution of the system. We parametrize the class of approximate smoothing posteriors using a neural network, derive a lower bound on the evidence of the observed point process sample-path, and optimize the lower bound using stochastic gradient descent (SGD). We demonstrate the efficacy of our method on both synthetic and real-world problems, and demonstrate the advantage of the neural network solution over standard numerical solvers.",/pdf/4834cda177ee299eaaa31a33d0391ce75afe87e3.pdf,ICLR,2021,This paper proposes a variational inference method for computing an approximate smoothing posterior path measure of a Cox process with intensity as a solution to a stochastic differential equation. 
+ryF-cQ6T-,ryu-qmpa-,1508880000000.0,1518730000000.0,68,Machine Learning by Two-Dimensional Hierarchical Tensor Networks: A Quantum Information Theoretic Perspective on Deep Architectures,"[""dingliu_thu@126.com"", ""shi-ju.ran@icfo.eu"", ""peter.wittek@icfo.eu"", ""pengcheng12@mails.ucas.ac.cn"", ""raulbzga@gmail.com"", ""gsu@ucas.ac.cn"", ""maciej.lewenstein@icfo.eu""]","[""Ding Liu"", ""Shi-Ju Ran"", ""Peter Wittek"", ""Cheng Peng"", ""Raul Bl\u00e1zquez Garc\u00eda"", ""Gang Su"", ""Maciej Lewenstein""]","[""quantum machine learning"", ""tensor network"", ""quantum information""]","The resemblance between the methods used in studying quantum-many body physics and in machine learning has drawn considerable attention. In particular, tensor networks (TNs) and deep learning architectures bear striking similarities to the extent that TNs can be used for machine learning. Previous results used one-dimensional TNs in image recognition, showing limited scalability and a request of high bond dimension. In this work, we train two-dimensional hierarchical TNs to solve image recognition problems, using a training algorithm derived from the multipartite entanglement renormalization ansatz (MERA). This approach overcomes scalability issues and implies novel mathematical connections among quantum many-body physics, quantum information theory, and machine learning. While keeping the TN unitary in the training phase, TN states can be defined, which optimally encodes each class of the images into a quantum many-body state. We study the quantum features of the TN states, including quantum entanglement and fidelity. We suggest these quantities could be novel properties that characterize the image classes, as well as the machine learning tasks. 
Our work could be further applied to identifying possible quantum properties of certain artificial intelligence methods.",/pdf/2fa1a2e4547bad4d64143bf30857db799e329c7f.pdf,ICLR,2018,"This approach overcomes scalability issues and implies novel mathematical connections among quantum many-body physics, quantum information theory, and machine learning." +rygnfn0qF7,SygEdbAqY7,1538090000000.0,1545360000000.0,1303,Language Model Pre-training for Hierarchical Document Representations,"[""mingweichang@google.com"", ""kristout@google.com"", ""kentonl@google.com"", ""jacobdevlin@google.com""]","[""Ming-Wei Chang"", ""Kristina Toutanova"", ""Kenton Lee"", ""Jacob Devlin""]",[],"Hierarchical neural architectures can efficiently capture long-distance dependencies and have been used for many document-level tasks such as summarization, document segmentation, and fine-grained sentiment analysis. However, effective usage of such a large context can be difficult to learn, especially in the case where there is limited labeled data available. +Building on the recent success of language model pretraining methods for learning flat representations of text, we propose algorithms for pre-training hierarchical document representations from unlabeled data. Unlike prior work, which has focused on pre-training contextual token representations or context-independent sentence/paragraph representations, our hierarchical document representations include fixed-length sentence/paragraph representations which integrate contextual information from the entire document. 
Experiments on document segmentation, document-level question answering, and extractive document summarization demonstrate the effectiveness of the proposed pre-training algorithms.",/pdf/ec77d07b0e7e1520b1b282fa10d4a0c54198d5ad.pdf,ICLR,2019, +rk3pnae0b,By9pnpl0Z,1509120000000.0,1518730000000.0,430,Topic-Based Question Generation,"[""wenpeng.hu@pku.edu.cn"", ""liub@cs.uic.edu"", ""ruiyan@pku.edu.cn"", ""zhaody@pku.edu.cn"", ""jwma@math.pku.edu.cn""]","[""Wenpeng Hu"", ""Bing Liu"", ""Rui Yan"", ""Dongyan Zhao"", ""Jinwen Ma""]",[],"Asking questions is an important ability for a chatbot. This paper focuses on question generation. Although there are existing works on question generation based on a piece of descriptive text, it remains to be a very challenging problem. In the paper, we propose a new question generation problem, which also requires the input of a target topic in addition to a piece of descriptive text. The key reason for proposing the new problem is that in practical applications, we found that useful questions need to be targeted toward some relevant topics. One almost never asks a random question in a conversation. Due to the fact that given a descriptive text, it is often possible to ask many types of questions, generating a question without knowing what it is about is of limited use. To solve the problem, we propose a novel neural network that is able to generate topic-specific questions. One major advantage of this model is that it can be trained directly using a question-answering corpus without requiring any additional annotations like annotating topics in the questions or answers. Experimental results show that our model outperforms the state-of-the-art baseline.",/pdf/c77e09b95d892bdd9df1235bdbb55c56275098dd.pdf,ICLR,2018,We propose a neural network that is able to generate topic-specific questions. 
+6BWY3yDdDi,sBQRFRbxkVF,1601310000000.0,1614990000000.0,2463,A Truly Constant-time Distribution-aware Negative Sampling,"[""~Shabnam_Daghaghi1"", ""~Tharun_Medini1"", ""~Beidi_Chen1"", ""mengnan.zhao@rice.edu"", ""~Anshumali_Shrivastava1""]","[""Shabnam Daghaghi"", ""Tharun Medini"", ""Beidi Chen"", ""Mengnan Zhao"", ""Anshumali Shrivastava""]",[],"Softmax classifiers with a very large number of classes naturally occur in many applications such as natural language processing and information retrieval. The calculation of full-softmax is very expensive from the computational and energy perspective. There have been a variety of sampling approaches to overcome this challenge, popularly known as negative sampling (NS). Ideally, NS should sample negative classes from a distribution that is dependent on the input data, the current parameters, and the correct positive class. Unfortunately, due to the dynamically updated parameters and data samples, there does not exist any sampling scheme that is truly adaptive and also samples the negative classes in constant time every iteration. Therefore, alternative heuristics like random sampling, static frequency-based sampling, or learning-based biased sampling; which primarily trade either the sampling cost or the adaptivity of samples per iteration, are adopted. In this paper, we show a class of distribution where the sampling scheme is truly adaptive and provably generates negative samples in constant time. We demonstrate a negative sampling implementation that is significantly faster, in terms of wall clock time, compared to the most optimized TensorFlow implementations of standard softmax or other sampling approaches on the best available GPUs (V100s).",/pdf/8ffda2ff56edd3ed2f80c0b18c71097a29636dc3.pdf,ICLR,2021,We provide two LSH based hard-negative sampling strategies and an efficient C++implementation that outperforms Tensorflow-GPU on time while retaining precision. 
+Kr7CrZPPPo,yfhAxexMFdk,1601310000000.0,1614990000000.0,2107,Learning a Non-Redundant Collection of Classifiers,"[""~Daniel_Pace1"", ""~Alessandra_Russo1"", ""~Murray_Shanahan1""]","[""Daniel Pace"", ""Alessandra Russo"", ""Murray Shanahan""]",[],"Supervised learning models constructed under the i.i.d. assumption have often been shown to exploit spurious or brittle predictive signals instead of more robust ones present in the training data. Inspired by Quality-Diversity algorithms, in this work we train a collection of classifiers to learn distinct solutions to a classification problem, with the goal of learning to exploit a variety of predictive signals present in the training data. We propose an information-theoretic measure of model diversity based on minimizing an estimate of conditional total correlation of final layer representations across models given the label. We consider datasets with synthetically injected spurious correlations and evaluate our framework's ability to rapidly adapt to a change in distribution that destroys the spurious correlation. We compare our method to a variety of baselines under this evaluation protocol, showing that it is competitive with other approaches while being more successful at isolating distinct signals. We also show that our model is competitive with Invariant Risk Minimization under this evaluation protocol without requiring access to the environment information required by IRM to discriminate between spurious and robust signals.",/pdf/8025e3147433121201f2f72ea9ef85e417a945e0.pdf,ICLR,2021,Learning to isolate distinct predictive signals using an information-theoretic minimal redundancy criterion. +rJeB36NKvB,SygwRklODr,1569440000000.0,1583910000000.0,781,How much Position Information Do Convolutional Neural Networks Encode?,"[""amirul@scs.ryerson.ca"", ""sen.jia@ryerson.ca"", ""bruce@ryerson.ca""]","[""Md Amirul Islam*"", ""Sen Jia*"", ""Neil D. B. 
Bruce""]","[""network understanding"", ""absolute position information""]","In contrast to fully connected networks, Convolutional Neural Networks (CNNs) achieve efficiency by learning weights associated with local filters with a finite spatial extent. An implication of this is that a filter may know what it is looking at, but not where it is positioned in the image. Information concerning absolute position is inherently useful, and it is reasonable to assume that deep CNNs may implicitly learn to encode this information if there is a means to do so. In this paper, we test this hypothesis revealing the surprising degree of absolute position information that is encoded in commonly used neural networks. A comprehensive set of experiments show the validity of this hypothesis and shed light on how and where this information is represented while offering clues to where positional information is derived from in deep CNNs.",/pdf/2267055f8221e283014aba7ef46092ba93ff450f.pdf,ICLR,2020,"Our work shows positional information has been implicitly encoded in a network. This information is important for detecting position-dependent features, e.g. semantic and saliency." +tyd9yxioXgO,#NAME?,1601310000000.0,1614990000000.0,295,Compositional Video Synthesis with Action Graphs,"[""~Amir_Bar1"", ""~Roei_Herzig2"", ""~Xiaolong_Wang3"", ""~Gal_Chechik1"", ""~Trevor_Darrell2"", ""~Amir_Globerson1""]","[""Amir Bar"", ""Roei Herzig"", ""Xiaolong Wang"", ""Gal Chechik"", ""Trevor Darrell"", ""Amir Globerson""]","[""Video Synthesis"", ""Vision and Language"", ""Representation Learning""]","Videos of actions are complex signals, containing rich compositional structure. Current video generation models are limited in their ability to generate such videos. To address this challenge, we introduce a generative model (AG2Vid) that can be conditioned on an Action Graph, a structure that naturally represents the dynamics of actions and interactions between objects. 
Our AG2Vid model disentangles appearance and position features, allowing for more accurate generation. AG2Vid is evaluated on the CATER and Something-Something datasets and outperforms other baselines. Finally, we show how Action Graphs can be used for generating novel compositions of actions. ",/pdf/22e8c3a6742e1e1e5b9492f41d4c51e73c93abe8.pdf,ICLR,2021,"We introduce Action Graphs, a natural and convenient structure representing the dynamics of actions between objects over time. We show we can synthesize goal-oriented videos and generate novel compositions of unseen actions from it on two datasets." +SyehMhC9Y7,S1ewNGRcYQ,1538090000000.0,1545360000000.0,1300,"Deep Imitative Models for Flexible Inference, Planning, and Control","[""nrhineha@cs.cmu.edu"", ""rmcallister@berkeley.edu"", ""svlevine@eecs.berkeley.edu""]","[""Nicholas Rhinehart"", ""Rowan McAllister"", ""Sergey Levine""]","[""imitation learning"", ""forecasting"", ""computer vision""]","Imitation learning provides an appealing framework for autonomous control: in many tasks, demonstrations of preferred behavior can be readily obtained from human experts, removing the need for costly and potentially dangerous online data collection in the real world. However, policies learned with imitation learning have limited flexibility to accommodate varied goals at test time. Model-based reinforcement learning (MBRL) offers considerably more flexibility, since a predictive model learned from data can be used to achieve various goals at test time. However, MBRL suffers from two shortcomings. First, the model does not help to choose desired or safe outcomes -- its dynamics estimate only what is possible, not what is preferred. Second, MBRL typically requires additional online data collection to ensure that the model is accurate in those situations that are actually encountered when attempting to achieve test time goals. Collecting this data with a partially trained model can be dangerous and time-consuming. 
In this paper, we aim to combine the benefits of imitation learning and MBRL, and propose imitative models: probabilistic predictive models able to plan expert-like trajectories to achieve arbitrary goals. We find this method substantially outperforms both direct imitation and MBRL in a simulated autonomous driving task, and can be learned efficiently from a fixed set of expert demonstrations without additional online data collection. We also show our model can flexibly incorporate user-supplied costs at test-time, can plan to sequences of goals, and can even perform well with imprecise goals, including goals on the wrong side of the road.",/pdf/0cdfcf6b1e8643b6983a40439f874c79a5c3d3f2.pdf,ICLR,2019,"Hybrid Vision-Driven Imitation Learning and Model-Based Reinforcement Learning for Planning, Forecasting, and Control" +JHx9ZDCQEA,fVAzvEqf_NO,1601310000000.0,1614990000000.0,2253,PolyRetro: Few-shot Polymer Retrosynthesis via Domain Adaptation,"[""~Binghong_Chen1"", ""~Chengtao_Li1"", ""~Hanjun_Dai1"", ""rrampi790@gmail.com"", ""~Le_Song1""]","[""Binghong Chen"", ""Chengtao Li"", ""Hanjun Dai"", ""Rampi Ramprasad"", ""Le Song""]","[""ML for Chemistry"", ""Polymer Retrosynthesis"", ""Few-show Learning"", ""Domain Adaptation""]","Polymers appear everywhere in our daily lives -- fabrics, plastics, rubbers, etc. -- and we could hardly live without them. To make polymers, chemists develop processes that combine smaller building blocks~(monomers) to form long chains or complex networks~(polymers). These processes are called polymerizations and will usually take lots of human efforts to develop. Although machine learning models for small molecules have generated lots of promising results, the prediction problem for polymerization is new and suffers from the scarcity of polymerization datasets available in the field. 
Furthermore, the problem is made even more challenging by the large size of the polymers and the additional recursive constraints, which are not present in the small molecule problem. In this paper, we make an initial step towards this challenge and propose a learning-based search framework that can automatically identify a sequence of reactions that lead to the polymerization of a target polymer with minimal polymerization data involved. Our method transfers models trained on small molecule datasets for retrosynthesis to check the validity of polymerization reaction. Furthermore, our method also incorporates a template prior learned on a limited amount of polymer data into the framework to adapt the model from small molecule to the polymer domain. We demonstrate that our method is able to propose high-quality polymerization plans for a dataset of 52 real-world polymers, of which a significant portion successfully recovers the currently-in-used polymerization processes in the real world.",/pdf/a1ef376e6c81308e1409577376a810853b012f5c.pdf,ICLR,2021,We propose a novel learning-based search framework for structural constrained optimization problems with application to polymer retrosynthesis. +SkBHr1WRW,H19NSJ-RZ,1509120000000.0,1518730000000.0,479,Ego-CNN: An Ego Network-based Representation of Graphs Detecting Critical Structures,"[""rctzeng@datalab.cs.nthu.edu.tw"", ""shwu@cs.nthu.edu.tw""]","[""Ruo-Chun Tzeng"", ""Shan-Hung Wu""]","[""graph embedding"", ""CNN""]","While existing graph embedding models can generate useful embedding vectors that perform well on graph-related tasks, what valuable information can be jointly learned by a graph embedding model is less discussed. In this paper, we consider the possibility of detecting critical structures by a graph embedding model. 
We propose Ego-CNN to embed graphs; it works in a local-to-global manner, taking advantage of CNNs that gradually expand the detectable local regions on the graph as the network depth increases. Critical structures can be detected if Ego-CNN is combined with a supervised task model. We show that Ego-CNN (1) is competitive with state-of-the-art graph embedding models, (2) works well with CNN visualization techniques to show the detected structures, and (3) is efficient and can incorporate scale-free priors, which commonly occur in social network datasets, to further improve training efficiency.",/pdf/3ed1125fb6b161177afa9693bd98b5cc305d92f2.pdf,ICLR,2018, +2NU7a9AHo-6,aHENhdHCmmM,1601310000000.0,1614990000000.0,3353,AUL is a better optimization metric in PU learning,"[""~Shangchuan_Huang1"", ""~Songtao_Wang1"", ""tolidan@tsinghua.edu.cn"", ""lavender_jlw@126.com""]","[""Shangchuan Huang"", ""Songtao Wang"", ""Dan Li"", ""Liwei Jiang""]",[],"Traditional binary classification models are trained and evaluated with fully labeled data, which is not common in real life. In non-ideal datasets, only a small fraction of positive data are labeled. Training a model from such partially labeled data is called positive-unlabeled (PU) learning. A naive solution to PU learning is treating unlabeled samples as negative. However, using biased data, the trained model may converge to a non-optimal point and its real performance cannot be well estimated. Recent works try to recover the unbiased result by estimating the proportion of positive samples with mixture proportion estimation (MPE) algorithms, but the model performance is still limited and heavy computational cost is introduced (particularly for big datasets). In this work, we theoretically prove that Area Under Lift curve (AUL) is an unbiased metric in the PU learning scenario, and the experimental evaluation on 9 datasets shows that the average absolute error of AUL estimation is only 1/6 of AUC estimation. 
By experiments we also find that, compared with the state-of-the-art AUC-optimization algorithm, the AUL-optimization algorithm can not only significantly reduce the computational cost, but also improve the model performance by up to 10%.",/pdf/e9e994234e519183ca1af53aaf4668297f3b1270.pdf,ICLR,2021, +HygnDhEtvr,HylQ05wjBS,1569440000000.0,1583910000000.0,18,Reinforcement Learning Based Graph-to-Sequence Model for Natural Question Generation,"[""cheny39@rpi.edu"", ""lwu@email.wm.edu"", ""zaki@cs.rpi.edu""]","[""Yu Chen"", ""Lingfei Wu"", ""Mohammed J. Zaki""]","[""deep learning"", ""reinforcement learning"", ""graph neural networks"", ""natural language processing"", ""question generation""]","Natural question generation (QG) aims to generate questions from a passage and an answer. Previous works on QG either (i) ignore the rich structure information hidden in text, (ii) solely rely on cross-entropy loss that leads to issues like exposure bias and inconsistency between train/test measurement, or (iii) fail to fully exploit the answer information. To address these limitations, in this paper, we propose a reinforcement learning (RL) based graph-to-sequence (Graph2Seq) model for QG. Our model consists of a Graph2Seq generator with a novel Bidirectional Gated Graph Neural Network based encoder to embed the passage, and a hybrid evaluator with a mixed objective combining both cross-entropy and RL losses to ensure the generation of syntactically and semantically valid text. We also introduce an effective Deep Alignment Network for incorporating the answer information into the passage at both the word and contextual levels. 
Our model is end-to-end trainable and achieves new state-of-the-art scores, outperforming existing methods by a significant margin on the standard SQuAD benchmark.",/pdf/b848cdc9d93e44c1717b7c9861ca4c9aae15a1f6.pdf,ICLR,2020, +BJlxdCVKDB,B1gTrsDdwH,1569440000000.0,1577170000000.0,1199,MoET: Interpretable and Verifiable Reinforcement Learning via Mixture of Expert Trees,"[""vasic@utexas.edu"", ""aapetrovic@mas.bg.ac.rs"", ""kaiyuanw@google.com"", ""nikolic@matf.bg.ac.rs"", ""rising@google.com"", ""khurshid@ece.utexas.edu""]","[""Marko Vasic"", ""Andrija Petrovic"", ""Kaiyuan Wang"", ""Mladen Nikolic"", ""Rishabh Singh"", ""Sarfraz Khurshid""]","[""explainable machine learning"", ""reinforcement learning""]","Deep Reinforcement Learning (DRL) has led to many recent breakthroughs on complex control tasks, such as defeating the best human player in the game of Go. However, decisions made by the DRL agent are not explainable, hindering its applicability in safety-critical settings. Viper, a recently proposed technique, constructs a decision tree policy by mimicking the DRL agent. Decision trees are interpretable as each action made can be traced back to the decision rule path that lead to it. However, one global decision tree approximating the DRL policy has significant limitations with respect to the geometry of decision boundaries. We propose MoET, a more expressive, yet still interpretable model based on Mixture of Experts, consisting of a gating function that partitions the state space, and multiple decision tree experts that specialize on different partitions. We propose a training procedure to support non-differentiable decision tree experts and integrate it into imitation learning procedure of Viper. We evaluate our algorithm on four OpenAI gym environments, and show that the policy constructed in such a way is more performant and better mimics the DRL agent by lowering mispredictions and increasing the reward. 
We also show that MoET policies are amenable for verification using off-the-shelf automated theorem provers such as Z3.",/pdf/1e44e1d76a1d5d7f42d73a2565551548785009b3.pdf,ICLR,2020,Explainable reinforcement learning model using novel combination of mixture of experts with non-differentiable decision tree experts. +BkfiXiUlg,,1478040000000.0,1478040000000.0,34,Learning Efficient Algorithms with Hierarchical Attentive Memory,"[""marcin@openai.com"", ""kkurach@google.com""]","[""Marcin Andrychowicz"", ""Karol Kurach""]",[],"In this paper, we propose and investigate a novel memory architecture for neural networks called Hierarchical Attentive Memory (HAM). It is based on a binary tree with leaves corresponding to memory cells. This allows HAM to perform memory access in O(log n) complexity, which is a significant improvement over the standard attention mechanism that requires O(n) operations, where n is the size of the memory. + +We show that an LSTM network augmented with HAM can learn algorithms for problems like merging, sorting or binary searching from pure input-output examples. In particular, it learns to sort n numbers in time O(n log n) and generalizes well to input sequences much longer than the ones seen during the training. 
We also show that HAM can be trained to act like classic data structures: a stack, a FIFO queue and a priority queue.",/pdf/1f865d9b000a8b89c7aefd275c22b413e6f84d38.pdf,ICLR,2017,fast attention in O(log n); learned sorting algorithm that generalizes +rJlnxkSYPS,Byg1-8idDr,1569440000000.0,1583910000000.0,1520,Unsupervised Clustering using Pseudo-semi-supervised Learning,"[""divam@cmu.edu"", ""ramjee@microsoft.com"", ""nipun.kwatra@microsoft.com"", ""muthian@microsoft.com""]","[""Divam Gupta"", ""Ramachandran Ramjee"", ""Nipun Kwatra"", ""Muthian Sivathanu""]","[""Unsupervised Learning"", ""Unsupervised Clustering"", ""Deep Learning""]","In this paper, we propose a framework that leverages semi-supervised models to improve unsupervised clustering performance. To leverage semi-supervised models, we first need to automatically generate labels, called pseudo-labels. We find that prior approaches for generating pseudo-labels hurt clustering performance because of their low accuracy. Instead, we use an ensemble of deep networks to construct a similarity graph, from which we extract high accuracy pseudo-labels. The approach of finding high quality pseudo-labels using ensembles and training the semi-supervised model is iterated, yielding continued improvement. We show that our approach outperforms state of the art clustering results for multiple image and text datasets. For example, we achieve 54.6% accuracy for CIFAR-10 and 43.9% for 20news, outperforming state of the art by 8-12% in absolute terms.",/pdf/f8fb057df8c04a26f2a06fcd22571d6f4b49c6d0.pdf,ICLR,2020,Using ensembles and pseudo labels for unsupervised clustering +r1lZ7AEKvB,HylegISOvB,1569440000000.0,1583910000000.0,1028,The Logical Expressiveness of Graph Neural Networks,"[""pbarcelo@gmail.com"", ""egor.kostylev@cs.ox.ac.uk"", ""mikael.monet@imfd.cl"", ""jorge.perez.rojas@gmail.com"", ""juan.reutter@gmail.com"", ""jpsilvapena@gmail.com""]","[""Pablo Barcel\u00f3"", ""Egor V. 
Kostylev"", ""Mikael Monet"", ""Jorge P\u00e9rez"", ""Juan Reutter"", ""Juan Pablo Silva""]","[""Graph Neural Networks"", ""First Order Logic"", ""Expressiveness""]","The ability of graph neural networks (GNNs) for distinguishing nodes in graphs has been recently characterized in terms of the Weisfeiler-Lehman (WL) test for checking graph isomorphism. This characterization, however, does not settle the issue of which Boolean node classifiers (i.e., functions classifying nodes in graphs as true or false) can be expressed by GNNs. We tackle this problem by focusing on Boolean classifiers expressible as formulas in the logic FOC2, a well-studied fragment of first order logic. FOC2 is tightly related to the WL test, and hence to GNNs. We start by studying a popular class of GNNs, which we call AC-GNNs, in which the features of each node in the graph are updated, in successive layers, only in terms of the features of its neighbors. We show that this class of GNNs is too weak to capture all FOC2 classifiers, and provide a syntactic characterization of the largest subclass of FOC2 classifiers that can be captured by AC-GNNs. This subclass coincides with a logic heavily used by the knowledge representation community. We then look at what needs to be added to AC-GNNs for capturing all FOC2 classifiers. We show that it suffices to add readout functions, which allow to update the features of a node not only in terms of its neighbors, but also in terms of a global attribute vector. We call GNNs of this kind ACR-GNNs. 
We experimentally validate our findings showing that, on synthetic data conforming to FOC2 formulas, AC-GNNs struggle to fit the training data while ACR-GNNs can generalize even to graphs of sizes not seen during training.",/pdf/60fc684014522b4eec2e087e6006320b2fe35ec3.pdf,ICLR,2020,"We characterize the expressive power of GNNs in terms of classical logical languages, separating different GNNs and showing connections with standard notions in Knowledge Representation." +DILxQP08O3B,G8C4t645SmU,1601310000000.0,1615860000000.0,624,VTNet: Visual Transformer Network for Object Goal Navigation,"[""~Heming_Du2"", ""~Xin_Yu1"", ""~Liang_Zheng4""]","[""Heming Du"", ""Xin Yu"", ""Liang Zheng""]",[],"Object goal navigation aims to steer an agent towards a target object based on observations of the agent. It is of pivotal importance to design effective visual representations of the observed scene in determining navigation actions. In this paper, we introduce a Visual Transformer Network (VTNet) for learning informative visual representation in navigation. VTNet is a highly effective structure that embodies two key properties for visual representations: First, the relationships among all the object instances in a scene are exploited; Second, the spatial locations of objects and image regions are emphasized so that directional navigation signals can be learned. Furthermore, we also develop a pre-training scheme to associate the visual representations with navigation signals, and thus facilitate navigation policy learning. In a nutshell, VTNet embeds object and region features with their location cues as spatial-aware descriptors and then incorporates all the encoded descriptors through attention operations to achieve informative representation for navigation. Given such visual representations, agents are able to explore the correlations between visual observations and navigation actions. 
For example, an agent would prioritize ``turning right'' over ``turning left'' when the visual representation emphasizes on the right side of activation map. Experiments in the artificial environment AI2-Thor demonstrate that VTNet significantly outperforms state-of-the-art methods in unseen testing environments.",/pdf/e1c5a2f2e9fd64005c3b944fd743140b5c02bc74.pdf,ICLR,2021, +m5Qsh0kBQG,bASil42oKI,1601310000000.0,1617980000000.0,2611,Deep symbolic regression: Recovering mathematical expressions from data via risk-seeking policy gradients,"[""~Brenden_K_Petersen1"", ""landajuelala1@llnl.gov"", ""~Terrell_N._Mundhenk1"", ""santiago10@llnl.gov"", ""~Soo_Kyung_Kim1"", ""kim102@llnl.gov""]","[""Brenden K Petersen"", ""Mikel Landajuela Larma"", ""Terrell N. Mundhenk"", ""Claudio Prata Santiago"", ""Soo Kyung Kim"", ""Joanne Taery Kim""]","[""symbolic regression"", ""reinforcement learning"", ""automated machine learning""]","Discovering the underlying mathematical expressions describing a dataset is a core challenge for artificial intelligence. This is the problem of $\textit{symbolic regression}$. Despite recent advances in training neural networks to solve complex tasks, deep learning approaches to symbolic regression are underexplored. We propose a framework that leverages deep learning for symbolic regression via a simple idea: use a large model to search the space of small models. Specifically, we use a recurrent neural network to emit a distribution over tractable mathematical expressions and employ a novel risk-seeking policy gradient to train the network to generate better-fitting expressions. Our algorithm outperforms several baseline methods (including Eureqa, the gold standard for symbolic regression) in its ability to exactly recover symbolic expressions on a series of benchmark problems, both with and without added noise. 
More broadly, our contributions include a framework that can be applied to optimize hierarchical, variable-length objects under a black-box performance metric, with the ability to incorporate constraints in situ, and a risk-seeking policy gradient formulation that optimizes for best-case performance instead of expected performance.",/pdf/317665469793748a5dbbedaa91f4f31e395d23bf.pdf,ICLR,2021,"A deep learning approach to symbolic regression, in which an autoregressive RNN emits a distribution over expressions that is optimized using a risk-seeking policy gradient." +qYZD-AO1Vn,#NAME?,1601310000000.0,1615220000000.0,3480,Differentiable Trust Region Layers for Deep Reinforcement Learning,"[""~Fabian_Otto1"", ""~Philipp_Becker1"", ""~Vien_Anh_Ngo1"", ""~Hanna_Carolin_Maria_Ziesche1"", ""~Gerhard_Neumann1""]","[""Fabian Otto"", ""Philipp Becker"", ""Vien Anh Ngo"", ""Hanna Carolin Maria Ziesche"", ""Gerhard Neumann""]","[""reinforcement learning"", ""trust region"", ""policy gradient"", ""projection"", ""Wasserstein distance"", ""Kullback-Leibler divergence"", ""Frobenius norm""]","Trust region methods are a popular tool in reinforcement learning as they yield robust policy updates in continuous and discrete action spaces. However, enforcing such trust regions in deep reinforcement learning is difficult. Hence, many approaches, such as Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO), are based on approximations. Due to those approximations, they violate the constraints or fail to find the optimal solution within the trust region. Moreover, they are difficult to implement, often lack sufficient exploration, and have been shown to depend on seemingly unrelated implementation choices. In this work, we propose differentiable neural network layers to enforce trust regions for deep Gaussian policies via closed-form projections. 
Unlike existing methods, those layers formalize trust regions for each state individually and can complement existing reinforcement learning algorithms. We derive trust region projections based on the Kullback-Leibler divergence, the Wasserstein L2 distance, and the Frobenius norm for Gaussian distributions. We empirically demonstrate that those projection layers achieve similar or better results than existing methods while being almost agnostic to specific implementation choices. The code is available at https://git.io/Jthb0. +",/pdf/16d7f0bd8f02047e8ca8dbcf7a7f000f5d685020.pdf,ICLR,2021, +YMsbeG6FqBU,mnOiqWXrDxr,1601310000000.0,1614990000000.0,3679,The Advantage Regret-Matching Actor-Critic,"[""~Audrunas_Gruslys1"", ""~Marc_Lanctot1"", ""~Remi_Munos1"", ""~Finbarr_Timbers1"", ""~Martin_Schmid2"", ""~Julien_Perolet1"", ""~Dustin_Morrill1"", ""~Vinicius_Zambaldi1"", ""~Jean-Baptiste_Lespiau1"", ""~John_Schultz1"", ""~Mohammad_Gheshlaghi_Azar1"", ""~Michael_Bowling1"", ""~Karl_Tuyls1""]","[""Audrunas Gruslys"", ""Marc Lanctot"", ""Remi Munos"", ""Finbarr Timbers"", ""Martin Schmid"", ""Julien Perolet"", ""Dustin Morrill"", ""Vinicius Zambaldi"", ""Jean-Baptiste Lespiau"", ""John Schultz"", ""Mohammad Gheshlaghi Azar"", ""Michael Bowling"", ""Karl Tuyls""]","[""Nash Equilibrium"", ""Games"", ""CFR""]","Regret minimization has played a key role in online learning, equilibrium computation in games, and reinforcement learning (RL). In this paper, we describe a general model-free RL method for no-regret learning based on repeated reconsideration of past behavior: Advantage Regret-Matching Actor-Critic (ARMAC). Rather than saving past state-action data, ARMAC saves a buffer of past policies, replaying through them to reconstruct hindsight assessments of past behavior. These retrospective value estimates are used to predict conditional advantages which, combined with regret matching, produces a new policy. 
In particular, ARMAC learns from sampled trajectories in a centralized training setting, without requiring the application of importance sampling commonly used in Monte Carlo counterfactual regret (CFR) minimization; hence, it does not suffer from excessive variance in large environments. In the single-agent setting, ARMAC shows an interesting form of exploration by keeping past policies intact. In the multiagent setting, ARMAC in self-play approaches Nash equilibria on some partially-observable zero-sum benchmarks. We provide exploitability estimates in the significantly larger game of betting-abstracted no-limit Texas Hold'em.",/pdf/adf52bc535b5558e98adb6b582bc1baa916ddf9a.pdf,ICLR,2021,We introduce ARMAC: generalized Counterfactual Regret Minimization using functional approximations and relying only on outcome sampling. +B1lFa3EFwB,HJxqN8oXDH,1569440000000.0,1577170000000.0,234,Stablizing Adversarial Invariance Induction by Discriminator Matching,"[""iwasawa@weblab.t.u-tokyo.ac.jp"", ""akuzawa-kei@weblab.t.u-tokyo.ac.jp"", ""matsuo@weblab.t.u-tokyo.ac.jp""]","[""Yusuke Iwasawa"", ""Kei Akuzawa"", ""Yutaka Matsuo""]","[""invariance induction"", ""adversarial training"", ""domain generalization""]","Incorporating the desired invariance into representation learning is a key challenge in many situations, e.g., for domain generalization and privacy/fairness constraints. An adversarial invariance induction (AII) shows its power on this purpose, which maximizes the proxy of the conditional entropy between representations and attributes by adversarial training between an attribute discriminator and feature extractor. However, the practical behavior of AII is still unclear as the previous analysis assumes the optimality of the attribute classifier, which is rarely held in practice. 
This paper first analyzes the practical behavior of AII both theoretically and empirically, indicating that AII has theoretical difficulty as it maximizes variational {\em upper} bound of the actual conditional entropy, and AII catastrophically fails to induce invariance even in simple cases as suggested by the above theoretical findings. We then argue that a simple modification to AII can significantly stabilize the adversarial induction framework and achieve better invariant representations. Our modification is based on the property of conditional entropy; it is maximized if and only if the divergence between all pairs of marginal distributions over $z$ between different attributes is minimized. The proposed method, {\em invariance induction by discriminator matching}, modify AII objective to explicitly consider the divergence minimization requirements by defining a proxy of the divergence by using the attribute discriminator. Empirical validations on both the toy dataset and four real-world datasets (related to applications of user anonymization and domain generalization) reveal that the proposed method provides superior performance when inducing invariance for nuisance factors. ",/pdf/1d18f8fa9539eb5c076a21b4e781164ea090e944.pdf,ICLR,2020, +HJgZrsC5t7,Hyl7aX8zFX,1538090000000.0,1545360000000.0,63,Improving On-policy Learning with Statistical Reward Accumulation,"[""dy015@ie.cuhk.edu.hk"", ""yk017@ie.cuhk.edu.hk"", ""dhlin@ie.cuhk.edu.hk"", ""xtang@ie.cuhk.edu.hk"", ""ccloy@ieee.org""]","[""Yubin Deng"", ""Ke Yu"", ""Dahua Lin"", ""Xiaoou Tang"", ""Chen Change Loy""]",[],"Deep reinforcement learning has obtained significant breakthroughs in recent years. Most methods in deep-RL achieve good results via the maximization of the reward signal provided by the environment, typically in the form of discounted cumulative returns. Such reward signals represent the immediate feedback of a particular action performed by an agent. 
However, tasks with sparse reward signals are still challenging to on-policy methods. In this paper, we introduce an effective characterization of past reward statistics (which can be seen as long-term feedback signals) to supplement this immediate reward feedback. In particular, value functions are learned with multi-critics supervision, enabling complex value functions to be more easily approximated in on-policy learning, even when the reward signals are sparse. We also introduce a novel exploration mechanism called ``hot-wiring'' that can give a boost to seemingly trapped agents. We demonstrate the effectiveness of our advantage actor multi-critic (A2MC) method across the discrete domains in Atari games as well as continuous domains in the MuJoCo environments. A video demo is provided at https://youtu.be/zBmpf3Yz8tc and source codes will be made available upon paper acceptance.",/pdf/d85ca12dce21fb1e81d7f63e19b10860228a6153.pdf,ICLR,2019,Improving On-policy Learning with Statistical Reward Accumulation +BJ46w6Ule,,1478050000000.0,1484070000000.0,36,Dynamic Partition Models,"[""goessling@uchicago.edu""]","[""Marc Goessling"", ""Yali Amit""]",[],"We present a new approach for learning compact and intuitive distributed representations with binary encoding. Rather than summing up expert votes as in products of experts, we employ for each variable the opinion of the most reliable expert. Data points are hence explained through a partitioning of the variables into expert supports. The partitions are dynamically adapted based on which experts are active. During the learning phase we adopt a smoothed version of this model that uses separate mixtures for each data dimension. 
In our experiments we achieve accurate reconstructions of high-dimensional data points with at most a dozen experts.",/pdf/b9e974f1ed60a55c828f66e47def806b28b20140.pdf,ICLR,2017,Learning of compact binary representations through partitioning of the variables +S1XolQbRW,r1vtxmZR-,1509140000000.0,1519400000000.0,1118,Model compression via distillation and quantization,"[""antonio.polino1@gmail.com"", ""razp@google.com"", ""d.alistarh@gmail.com""]","[""Antonio Polino"", ""Razvan Pascanu"", ""Dan Alistarh""]","[""quantization"", ""distillation"", ""model compression""]","Deep neural networks (DNNs) continue to make significant advances, solving tasks from image classification to translation or reinforcement learning. One aspect of the field receiving considerable attention is efficiently executing deep models in resource-constrained environments, such as mobile or embedded devices. This paper focuses on this problem, and proposes two new compression methods, which jointly leverage weight quantization and distillation of larger teacher networks into smaller student networks. The first method we propose is called quantized distillation and leverages distillation during the training process, by incorporating distillation loss, expressed with respect to the teacher, into the training of a student network whose weights are quantized to a limited set of levels. The second method, differentiable quantization, optimizes the location of quantization points through stochastic gradient descent, to better fit the behavior of the teacher model. We validate both methods through experiments on convolutional and recurrent architectures. We show that quantized shallow students can reach similar accuracy levels to full-precision teacher models, while providing order of magnitude compression, and inference speedup that is linear in the depth reduction. 
In sum, our results enable DNNs for resource-constrained environments to leverage architecture and accuracy advances developed on more powerful devices. +",/pdf/16d83892a02967fcd588378586a2f113b4d58122.pdf,ICLR,2018,"Obtains state-of-the-art accuracy for quantized, shallow nets by leveraging distillation. " +S14h9sCqYm,Syl7tKYqYX,1538090000000.0,1545360000000.0,570,Weakly-supervised Knowledge Graph Alignment with Adversarial Learning,"[""qumn123@gmail.com"", ""tangjianpku@gmail.com"", ""yoshua.bengio@mila.quebec""]","[""Meng Qu"", ""Jian Tang"", ""Yoshua Bengio""]","[""Knowledge Graph Alignment"", ""Generative Adversarial Network"", ""Weakly Supervised""]","Aligning knowledge graphs from different sources or languages, which aims to align both the entity and relation, is critical to a variety of applications such as knowledge graph construction and question answering. Existing methods of knowledge graph alignment usually rely on a large number of aligned knowledge triplets to train effective models. However, these aligned triplets may not be available or are expensive to obtain for many domains. Therefore, in this paper we study how to design fully-unsupervised methods or weakly-supervised methods, i.e., to align knowledge graphs without or with only a few aligned triplets. We propose an unsupervised framework based on adversarial training, which is able to map the entities and relations in a source knowledge graph to those in a target knowledge graph. This framework can be further seamlessly integrated with existing supervised methods, where only a limited number of aligned triplets are utilized as guidance. Experiments on real-world datasets prove the effectiveness of our proposed approach in both the weakly-supervised and unsupervised settings.",/pdf/ae6b06b831804268b11ea0acb359857b63aef179.pdf,ICLR,2019,This paper studies weakly-supervised knowledge graph alignment with adversarial training frameworks. 
+MBpHUFrcG2x,tFwrE3P8WD,1601310000000.0,1614100000000.0,2275,Projected Latent Markov Chain Monte Carlo: Conditional Sampling of Normalizing Flows,"[""~Chris_Cannella1"", ""~Mohammadreza_Soltani1"", ""~Vahid_Tarokh1""]","[""Chris Cannella"", ""Mohammadreza Soltani"", ""Vahid Tarokh""]","[""Conditional Sampling"", ""Normalizing Flows"", ""Markov Chain Monte Carlo"", ""Missing Data Inference""]","We introduce Projected Latent Markov Chain Monte Carlo (PL-MCMC), a technique for sampling from the exact conditional distributions learned by normalizing flows. As a conditional sampling method, PL-MCMC enables Monte Carlo Expectation Maximization (MC-EM) training of normalizing flows from incomplete data. Through experimental tests applying normalizing flows to missing data tasks for a variety of data sets, we demonstrate the efficacy of PL-MCMC for conditional sampling from normalizing flows.",/pdf/32946e80b74b4bb7d6f25d74cb773ac68b9b4a36.pdf,ICLR,2021,We introduce and demonstrate a novel MCMC technique for sampling from the exact conditional distributions known by normalizing flows. +wQRlSUZ5V7B,VNWRHC2Eo6R,1601310000000.0,1616700000000.0,3058,Capturing Label Characteristics in VAEs,"[""~Tom_Joy1"", ""sebastian.schmon@gmail.com"", ""~Philip_Torr1"", ""~Siddharth_N1"", ""~Tom_Rainforth1""]","[""Tom Joy"", ""Sebastian Schmon"", ""Philip Torr"", ""Siddharth N"", ""Tom Rainforth""]","[""variational autoencoder"", ""representation learning"", ""deep generative models""]","We present a principled approach to incorporating labels in variational autoencoders (VAEs) that captures the rich characteristic information associated with those labels. While prior work has typically conflated these by learning latent variables that directly correspond to label values, we argue this is contrary to the intended effect of supervision in VAEs—capturing rich label characteristics with the latents. 
For example, we may want to capture the characteristics of a face that make it look young, rather than just the age of the person. To this end, we develop a novel VAE model, the characteristic capturing VAE (CCVAE), which “reparameterizes” supervision through auxiliary variables and a concomitant variational objective. Through judicious structuring of mappings between latent and auxiliary variables, we show that the CCVAE can effectively learn meaningful representations of the characteristics of interest across a variety of supervision schemes. In particular, we show that the CCVAE allows for more effective and more general interventions to be performed, such as smooth traversals within the characteristics for a given label, diverse conditional generation, and transferring characteristics across datapoints.",/pdf/f58d5a4d19e174d578190ec9687a1904e52596b6.pdf,ICLR,2021,We present a principled approach to incorporating labels in VAEs that captures the rich characteristic information associated with those labels. +Hk4dFjR5K7,r1lXLv4FKm,1538090000000.0,1547200000000.0,461,ADef: an Iterative Algorithm to Construct Adversarial Deformations,"[""rima.alaifari@sam.math.ethz.ch"", ""alberti@dima.unige.it"", ""tandrig@sam.math.ethz.ch""]","[""Rima Alaifari"", ""Giovanni S. Alberti"", ""Tandri Gauksson""]","[""Adversarial examples"", ""deformations"", ""deep neural networks"", ""computer vision""]","While deep neural networks have proven to be a powerful tool for many recognition and classification tasks, their stability properties are still not well understood. In the past, image classifiers have been shown to be vulnerable to so-called adversarial attacks, which are created by additively perturbing the correctly classified image. In this paper, we propose the ADef algorithm to construct a different kind of adversarial attack created by iteratively applying small deformations to the image, found through a gradient descent step. 
We demonstrate our results on MNIST with convolutional neural networks and on ImageNet with Inception-v3 and ResNet-101.",/pdf/901b877b7dda447ba534f8edb8fc5cf7d28f20ce.pdf,ICLR,2019,"We propose a new, efficient algorithm to construct adversarial examples by means of deformations, rather than additive perturbations." +e60-SyRXtRt,FdOtdGocYj,1601310000000.0,1614990000000.0,1935,GANMEX: Class-Targeted One-vs-One Attributions using GAN-based Model Explainability,"[""~Sheng-Min_Shih1"", ""pinju.tien@gmail.com"", ""~Zohar_Karnin1""]","[""Sheng-Min Shih"", ""Pin-Ju Tien"", ""Zohar Karnin""]","[""Deep Neural Networks"", ""Attribution Methods"", ""Generative Adversarial Networks"", ""Interpretability"", ""Explainability""]","Attribution methods have been shown as promising approaches for identifying key features that led to learned model predictions. While most existing attribution methods rely on a baseline input for performing feature perturbations, limited research has been conducted to address the baseline selection issues. Poor choices of baselines can lead to unfair attributions as well as limited ability of one-vs-one explanations for multi-class classifiers, which means explaining why the input belongs to its original class but not the other specified target class. Achieving one-vs-one explanation is crucial when certain classes are more similar than others, e.g. two bird types among multiple animals. One-vs-one explanations focus on key differentiating features rather than features shared across the original and the target classes. In this paper, we present GANMEX, a novel algorithm applying Generative Adversarial Networks (GAN) by incorporating the to-be-explained classifier as part of the adversarial networks. Our approach effectively selects the baseline as the closest realistic sample belong to the target class, which allows attribution methods to provide true one-vs-one explanations. 
We showed that GANMEX baselines improved the saliency maps visually and led to stronger performance on perturbation-based evaluation metrics over the existing baselines. Attribution results with the existing baselines are known to be insensitive to model randomization, and we demonstrated that GANMEX baselines led to better outcome under the randomization sanity checks.",/pdf/a227d5a31f80d5ad96a466091817f2cd3f518ebf.pdf,ICLR,2021,"We developed GANMEX, a novel approach using GAN for generating class-targeted attribution method baselines and achieving one-vs-one explanations for DNNs." +rkgc06VtwH,r1egg1fdDB,1569440000000.0,1577170000000.0,867,Improving Semantic Parsing with Neural Generator-Reranker Architecture,"[""hinan1@stanford.edu"", ""gtomar@google.com"", ""huapupan@google.com""]","[""Huseyin A. Inan"", ""Gaurav Singh Tomar"", ""Huapu Pan""]","[""Natural Language Processing"", ""Semantic Parsing"", ""Neural Reranking""]","Semantic parsing is the problem of deriving machine interpretable meaning representations from natural language utterances. Neural models with encoder-decoder architectures have recently achieved substantial improvements over traditional methods. Although neural semantic parsers appear to have relatively high recall using large beam sizes, there is room for improvement with respect to one-best precision. In this work, we propose a generator-reranker architecture for semantic parsing. The generator produces a list of potential candidates and the reranker, which consists of a pre-processing step for the candidates followed by a novel critic network, reranks these candidates based on the similarity between each candidate and the input sentence. We show the advantages of this approach along with how it improves the parsing performance through extensive analysis. We experiment our model on three semantic parsing datasets (GEO, ATIS, and OVERNIGHT). The overall architecture achieves the state-of-the-art results in all three datasets. 
",/pdf/d67baf0a62b29a4de4e3ff5bc34af5087755bf15.pdf,ICLR,2020, +HkL7n1-0b,S1nfhyW0W,1509130000000.0,1519130000000.0,517,Wasserstein Auto-Encoders,"[""iliya.tolstikhin@gmail.com"", ""obousquet@gmail.com"", ""sylvain.gelly@gmail.com"", ""bs@tuebingen.mpg.de""]","[""Ilya Tolstikhin"", ""Olivier Bousquet"", ""Sylvain Gelly"", ""Bernhard Schoelkopf""]","[""auto-encoder"", ""generative models"", ""GAN"", ""VAE"", ""unsupervised learning""]","We propose the Wasserstein Auto-Encoder (WAE)---a new algorithm for building a generative model of the data distribution. WAE minimizes a penalized form of the Wasserstein distance between the model distribution and the target distribution, which leads to a different regularizer than the one used by the Variational Auto-Encoder (VAE). +This regularizer encourages the encoded training distribution to match the prior. We compare our algorithm with several other techniques and show that it is a generalization of adversarial auto-encoders (AAE). Our experiments show that WAE shares many of the properties of VAEs (stable training, encoder-decoder architecture, nice latent manifold structure) while generating samples of better quality.",/pdf/8a046910519ed75fbd072e63725d9cf2814c1a6b.pdf,ICLR,2018,"We propose a new auto-encoder based on the Wasserstein distance, which improves on the sampling properties of VAE." +H1l8sz-AW,Hyb-ofZAb,1509140000000.0,1518730000000.0,921,Improving generalization by regularizing in $L^2$ function space,"[""aarrii@seas.upenn.edu""]","[""Ari S Benjamin"", ""Konrad Kording""]","[""natural gradient"", ""generalization"", ""optimization"", ""function space"", ""Hilbert""]","Learning rules for neural networks necessarily include some form of regularization. Most regularization techniques are conceptualized and implemented in the space of parameters. However, it is also possible to regularize in the space of functions. 
Here, we propose to measure networks in an $L^2$ Hilbert space, and test a learning rule that regularizes the distance a network can travel through $L^2$-space each update. This approach is inspired by the slow movement of gradient descent through parameter space as well as by the natural gradient, which can be derived from a regularization term upon functional change. The resulting learning rule, which we call Hilbert-constrained gradient descent (HCGD), is thus closely related to the natural gradient but regularizes a different and more calculable metric over the space of functions. Experiments show that the HCGD is efficient and leads to considerably better generalization. ",/pdf/7a2ce87adfe02b55f52b57024cffd5fc96f827c1.pdf,ICLR,2018,"It's important to consider optimization in function space, not just parameter space. We introduce a learning rule that reduces distance traveled in function space, just like SGD limits distance traveled in parameter space." +pQq3oLH9UmL,8bkf5Rdm0LVj,1601310000000.0,1614990000000.0,2497,Achieving Explainability in a Visual Hard Attention Model through Content Prediction,"[""~Samrudhdhi_Bharatkumar_Rangrej1"", ""~James_J._Clark1""]","[""Samrudhdhi Bharatkumar Rangrej"", ""James J. Clark""]","[""visual hard attention"", ""glimpses"", ""explainability"", ""bayesian optimal experiment design"", ""variational autoencoder""]","A visual hard attention model actively selects and observes a sequence of subregions in an image to make a prediction. Unlike in the deep convolution network, in hard attention it is explainable which regions of the image contributed to the prediction. However, the attention policy used by the model to select these regions is not explainable. The majority of hard attention models determine the attention-worthy regions by first analyzing a complete image. However, it may be the case that the entire image is not available in the beginning but instead sensed gradually through a series of partial observations. 
In this paper, we design an efficient hard attention model for classifying partially observable scenes. The attention policy used by our model is explainable and non-parametric. The model estimates expected information gain (EIG) obtained from attending various regions by predicting their content ahead of time. It compares EIG using Bayesian Optimal Experiment Design and attends to the region with maximum EIG. We train our model with a differentiable objective, optimized using gradient descent, and test it on several datasets. The performance of our model is comparable to or better than the baseline models.",/pdf/31556b3cce3ed53f8cabd92790fcedb951ab8d8a.pdf,ICLR,2021,The hard attention model learns explainable attention policy by predicting the content of the glimpses and using it to find an optimal location to attend. +rkxDoJBYPB,SkeGjCAdDr,1569440000000.0,1583910000000.0,1918,Reinforced Genetic Algorithm Learning for Optimizing Computation Graphs,"[""adipal@google.com"", ""fgimeno@google.com"", ""vinair@google.com"", ""yujiali@google.com"", ""mlubin@google.com"", ""pushmeet@google.com"", ""vinyals@google.com""]","[""Aditya Paliwal"", ""Felix Gimeno"", ""Vinod Nair"", ""Yujia Li"", ""Miles Lubin"", ""Pushmeet Kohli"", ""Oriol Vinyals""]","[""reinforcement learning"", ""learning to optimize"", ""combinatorial optimization"", ""computation graphs"", ""model parallelism"", ""learning for systems""]","We present a deep reinforcement learning approach to minimizing the execution cost of neural network computation graphs in an optimizing compiler. Unlike earlier learning-based works that require training the optimizer on the same graph to be optimized, we propose a learning approach that trains an optimizer offline and then generalizes to previously unseen graphs without further training. This allows our approach to produce high-quality execution decisions on real-world TensorFlow graphs in seconds instead of hours. 
We consider two optimization tasks for computation graphs: minimizing running time and peak memory usage. In comparison to an extensive set of baselines, our approach achieves significant improvements over classical and other learning-based methods on these two tasks. ",/pdf/c0aa0effb0efd296d031e0ece93c6cd7e0f5fcc1.pdf,ICLR,2020,"We use deep RL to learn a policy that directs the search of a genetic algorithm to better optimize the execution cost of computation graphs, and show improved results on real-world TensorFlow graphs." +nCY83KxoehA,3JGPQ4rJxDun,1601310000000.0,1614990000000.0,1096,Automated Concatenation of Embeddings for Structured Prediction,"[""~Xinyu_Wang3"", ""~Yong_Jiang1"", ""~Nguyen_Bach1"", ""~Tao_Wang4"", ""~Zhongqiang_Huang1"", ""~Fei_Huang2"", ""~Kewei_Tu1""]","[""Xinyu Wang"", ""Yong Jiang"", ""Nguyen Bach"", ""Tao Wang"", ""Zhongqiang Huang"", ""Fei Huang"", ""Kewei Tu""]",[],"Pretrained contextualized embeddings are powerful word representations for structured prediction tasks. Recent work found that better word representations can be obtained by concatenating different types of embeddings. However, the selection of embeddings to form the best concatenated representation usually varies depending on the task and the collection of candidate embeddings, and the ever-increasing number of embedding types makes it a more difficult problem. In this paper, we propose Automated Concatenation of Embeddings (ACE) to automate the process of finding better concatenations of embeddings for structured prediction tasks, based on a formulation inspired by recent progress on neural architecture search. Specifically, a controller alternately samples a concatenation of embeddings, according to its current belief of the effectiveness of individual embedding types in consideration for a task, and updates the belief based on a reward. 
We follow strategies in reinforcement learning to optimize the parameters of the controller and compute the reward based on the accuracy of a task model, which is fed with the sampled concatenation as input and trained on a task dataset. Empirical results on 6 tasks and 21 datasets show that our approach outperforms strong baselines and achieves state-of-the-art performance with fine-tuned embeddings in the vast majority of evaluations.",/pdf/da962eb025209acdff429e018647f5bccdbf6d73.pdf,ICLR,2021,"We propose ACE, which automatically searches the best word embedding concatenation as word representation. ACE achieved state-of-the-art results on 6 structured prediction tasks over 19 out of 21 datasets." +2AL06y9cDE-,_QNiF5vuP1T,1601310000000.0,1616110000000.0,2276,Towards Robust Neural Networks via Close-loop Control,"[""~Zhuotong_Chen1"", ""~Qianxiao_Li1"", ""~Zheng_Zhang2""]","[""Zhuotong Chen"", ""Qianxiao Li"", ""Zheng Zhang""]","[""neural network robustness"", ""optimal control"", ""dynamical system""]","Despite their success in massive engineering applications, deep neural networks are vulnerable to various perturbations due to their black-box nature. Recent study has shown that a deep neural network can misclassify the data even if the input data is perturbed by an imperceptible amount. In this paper, we address the robustness issue of neural networks by a novel close-loop control method from the perspective of dynamic systems. Instead of modifying the parameters in a fixed neural network architecture, a close-loop control process is added to generate control signals adaptively for the perturbed or corrupted data. We connect the robustness of neural networks with optimal control using the geometrical information of underlying data to design the control objective. The detailed analysis shows how the embedding manifolds of state trajectory affect error estimation of the proposed method. 
Our approach can simultaneously maintain the performance on clean data and improve the robustness against many types of data perturbations. It can also further improve the performance of robustly trained neural networks against different perturbations. To the best of our knowledge, this is the first work that improves the robustness of neural networks with close-loop control.",/pdf/596019eba6149f7c83bd7dc648809e2100b337d8.pdf,ICLR,2021,We propose a close-loop control framework to improve the robustness of neural networks under various data perturbations. +rygwLgrYPB,HJg-4ceKPB,1569440000000.0,1587950000000.0,2326,Regularizing activations in neural networks via distribution matching with the Wasserstein metric,"[""tjoo@estsoft.com"", ""emppunity@gmail.com"", ""byungkim@hanyang.ac.kr""]","[""Taejong Joo"", ""Donggu Kang"", ""Byunghoon Kim""]","[""regularization"", ""Wasserstein metric"", ""deep learning""]","Regularization and normalization have become indispensable components in training deep neural networks, resulting in faster training and improved generalization performance. We propose the projected error function regularization loss (PER) that encourages activations to follow the standard normal distribution. PER randomly projects activations onto one-dimensional space and computes the regularization loss in the projected space. PER is similar to the Pseudo-Huber loss in the projected space, thus taking advantage of both $L^1$ and $L^2$ regularization losses. Besides, PER can capture the interaction between hidden units by projection vector drawn from a unit sphere. By doing so, PER minimizes the upper bound of the Wasserstein distance of order one between an empirical distribution of activations and the standard normal distribution. To the best of the authors' knowledge, this is the first work to regularize activations via distribution matching in the probability distribution space. 
We evaluate the proposed method on the image classification task and the word-level language modeling task. +",/pdf/4783bd6584a791c3e1de7a2cbb2714228df2a22d.pdf,ICLR,2020, +Z4YatHL7aq,NZ7CnL5Gr70,1601310000000.0,1614990000000.0,221,Semantically-Adaptive Upsampling for Layout-to-Image Translation,"[""~Hao_Tang6"", ""~Nicu_Sebe1""]","[""Hao Tang"", ""Nicu Sebe""]","[""Feature upsampling"", ""semantically-adaptive"", ""layout-to-image translation""]","We propose the Semantically-Adaptive UpSampling (SA-UpSample), a general and highly effective upsampling method for the layout-to-image translation task. SA-UpSample has three advantages: 1) Global view. Unlike traditional upsampling methods (e.g., Nearest-neighbor) that only exploit local neighborhoods, SA-UpSample can aggregate semantic information in a global view. +2) Semantically adaptive. Instead of using a fixed kernel for all locations (e.g., Deconvolution), SA-UpSample enables semantic class-specific upsampling via generating adaptive kernels for different locations. 3) Efficient. Unlike Spatial Attention which uses a fully-connected strategy to connect all the pixels, SA-UpSample only considers the most relevant pixels, introducing little computational overhead. We observe that SA-UpSample achieves consistent and substantial gains on six popular datasets. The source code will be made publicly available.",/pdf/2f477861ed02c3ae6a9efbbbf6072f587693e92a.pdf,ICLR,2021,A novel feature upsampling method for layout-to-image translation method. +mWnfMrd9JLr,Xbn-49gH5Md,1601310000000.0,1614990000000.0,1053,On the Latent Space of Flow-based Models,"[""~Mingtian_Zhang1"", ""~Yitong_Sun1"", ""~Steven_McDonagh1"", ""~Chen_Zhang6""]","[""Mingtian Zhang"", ""Yitong Sun"", ""Steven McDonagh"", ""Chen Zhang""]","[""flow-based mode"", ""generative model"", ""intrinsic dimension"", ""manifold learning""]"," Flow-based generative models typically define a latent space with dimensionality identical to the observational space. 
In many problems, however, the data does not populate the full ambient data-space that they natively reside in, but rather inhabit a lower-dimensional manifold. In such scenarios, flow-based models are unable to represent data structures exactly as their density will always have support off the data manifold, potentially resulting in degradation of model performance. In addition, the requirement for equal latent and data space dimensionality can unnecessarily increase model complexity for contemporary flow models. Towards addressing these problems, we propose to learn a manifold prior that affords benefits to both the tasks of sample generation and representation quality. An auxiliary product of our approach is that we are able to identify the intrinsic dimension of the data distribution.",/pdf/b462aae3c4bdc81047264202e07d6186a4aba5d1.pdf,ICLR,2021,A flow based model for data supported on a manifold. +xJFxgRLx79J,zgPFzn657d,1601310000000.0,1614990000000.0,541,Learning Two-Time-Scale Representations For Large Scale Recommendations,"[""~Xinshi_Chen1"", ""yzhu@fb.com"", ""~Haowen_Xu1"", ""muhanzhang@fb.com"", ""~Liang_Xiong1"", ""~Le_Song1""]","[""Xinshi Chen"", ""Yan Zhu"", ""Haowen Xu"", ""Muhan Zhang"", ""Liang Xiong"", ""Le Song""]","[""Recommendation System"", ""Large-scale Recommendation"", ""User Behavior Modeling"", ""Long-range sequences""]","We propose a surprisingly simple but effective two-time-scale (2TS) model for learning user representations for recommendation. In our approach, we will partition users into two sets, active users with many observed interactions and inactive or new users with few observed interactions, and we will use two RNNs to model them separately. Furthermore, we design a two-stage training method for our model, where, in the first stage, we learn transductive embeddings for users and items, and then, in the second stage, we learn the two RNNs leveraging the transductive embeddings trained in the first stage. 
Through the lens of online learning and stochastic optimization, we provide theoretical analysis that motivates the design of our 2TS model. The 2TS model achieves a nice bias-variance trade-off while being computationally efficient. In large scale datasets, our 2TS model is able to achieve significantly better recommendations than previous state-of-the-art, yet being much more computationally efficient. ",/pdf/53fde94147d5a65197836d2592559c2009cfbd7c.pdf,ICLR,2021, +ryBnUWb0b,r1EhUWW0Z,1509130000000.0,1519410000000.0,682,Predicting Floor-Level for 911 Calls with Neural Networks and Smartphone Sensor Data,"[""waf2107@columbia.edu"", ""hgs@cs.columbia.edu""]","[""William Falcon"", ""Henning Schulzrinne""]","[""Recurrent Neural Networks"", ""RNN"", ""LSTM"", ""Mobile Device"", ""Sensors""]","In cities with tall buildings, emergency responders need an accurate floor level location to find 911 callers quickly. We introduce a system to estimate a victim's floor level via their mobile device's sensor data in a two-step process. First, we train a neural network to determine when a smartphone enters or exits a building via GPS signal changes. Second, we use a barometer equipped smartphone to measure the change in barometric pressure from the entrance of the building to the victim's indoor location. Unlike impractical previous approaches, our system is the first that does not require the use of beacons, prior knowledge of the building infrastructure, or knowledge of user behavior. We demonstrate real-world feasibility through 63 experiments across five different tall buildings throughout New York City where our system predicted the correct floor level with 100% accuracy. +",/pdf/2d8544784d03cedc9e738a8fdfa6717a43104422.pdf,ICLR,2018,We used an LSTM to detect when a smartphone walks into a building. Then we predict the device's floor level using data from sensors aboard the smartphone. 
+S1xSzyrYDB,r1e0bMhdDB,1569440000000.0,1577170000000.0,1579,Cyclic Graph Dynamic Multilayer Perceptron for Periodic Signals,"[""mikiof@mit.edu"", ""erikgest@mit.edu"", ""takayuki_hirano@jsw.co.jp"", ""youcef@mit.edu""]","[""Mikio Furokawa"", ""Erik Gest"", ""Takayuki Hirano"", ""Kamal Youcef-Toumi""]",[],"We propose a feature extraction for periodic signals. Virtually every mechanized transportation vehicle, power generation, industrial machine, and robotic system contains rotating shafts. It is possible to collect data about periodicity by measuring a shaft’s rotation. However, it is difficult to perfectly control the collection timing of the measurements. Imprecise timing creates phase shifts in the resulting data. Although a phase shift does not materially affect the measurement of any given data point collected, it does alter the order in which all of the points are collected. It is difficult for classical methods, like multi-layer perceptron, to identify or quantify these alterations because they depend on the order of the input vectors’ components. This paper proposes a robust method for extracting features from phase shift data by adding a graph structure to each data point and constructing a suitable machine learning architecture for graph data with cyclic permutation. Simulation and experimental results illustrate its effectiveness.",/pdf/cd998a0d545829d56fbf039481e875c86860c51f.pdf,ICLR,2020, +F3s69XzWOia,4irQi08o8Z1,1601310000000.0,1615750000000.0,1126,Coupled Oscillatory Recurrent Neural Network (coRNN): An accurate and (gradient) stable architecture for learning long time dependencies,"[""~T._Konstantin_Rusch1"", ""~Siddhartha_Mishra1""]","[""T. Konstantin Rusch"", ""Siddhartha Mishra""]","[""RNNs"", ""Oscillators"", ""Gradient stability"", ""Long-term dependencies""]","Circuits of biological neurons, such as in the functional parts of the brain can be modeled as networks of coupled oscillators. 
Inspired by the ability of these systems to express a rich set of outputs while keeping (gradients of) state variables bounded, we propose a novel architecture for recurrent neural networks. Our proposed RNN is based on a time-discretization of a system of second-order ordinary differential equations, modeling networks of controlled nonlinear oscillators. We prove precise bounds on the gradients of the hidden states, leading to the mitigation of the exploding and vanishing gradient problem for this RNN. Experiments show that the proposed RNN is comparable in performance to the state of the art on a variety of benchmarks, demonstrating the potential of this architecture to provide stable and accurate RNNs for processing complex sequential data.",/pdf/e69f77961c03aeefb3d2dfcc2c77185ac11781db.pdf,ICLR,2021,"A biologically motivated and discretized ODE based RNN for learning long-term dependencies, with rigorous bounds mitigating the exploding and vanishing gradient problem." +o21sjfFaU1,ap33gDEAl8w,1601310000000.0,1614990000000.0,1463,Learning Robust Models by Countering Spurious Correlations,"[""~Haohan_Wang1"", ""~Zeyi_Huang3"", ""~Eric_Xing1""]","[""Haohan Wang"", ""Zeyi Huang"", ""Eric Xing""]","[""robustness"", ""domain adaptation"", ""spurious correlation"", ""dataset bias""]","Machine learning has demonstrated remarkable prediction accuracy over i.i.d data, but the accuracy often drops when tested with data from another distribution. One reason behind this accuracy drop is the reliance of models on the features that are only associated with the label in the training distribution, but not the test distribution. This problem is usually known as spurious correlation, confounding factors, or dataset bias. In this paper, we formally study the generalization error bound for this setup with the knowledge of how the spurious features are associated with the label. 
We also compare our analysis to the widely-accepted domain adaptation error bound and show that our bound can be tighter, with more assumptions that we consider realistic. Further, our analysis naturally offers a set of solutions for this problem, linked to established solutions in various topics about robustness in general, and these solutions all require some understandings of how the spurious features are associated with the label. Finally, we also briefly discuss a method that does not require such an understanding.",/pdf/9ec7e576017ebb0d7f7a82dd121f1298233f86ae.pdf,ICLR,2021,"We offer a formal generalization error bound of the problem of learning when there are spurious correlated features, with the knowledge of these features. Our bound also leads to discussion of the methods for this problem. " +PrzjugOsDeE,8TYsS3q5c7G,1601310000000.0,1614420000000.0,1395,CcGAN: Continuous Conditional Generative Adversarial Networks for Image Generation,"[""~Xin_Ding2"", ""~Yongwei_Wang1"", ""~Zuheng_Xu1"", ""~William_J_Welch1"", ""~Z._Jane_Wang1""]","[""Xin Ding"", ""Yongwei Wang"", ""Zuheng Xu"", ""William J Welch"", ""Z. Jane Wang""]","[""Conditional generative adversarial networks"", ""image generation"", ""continuous and scalar conditions""]","This work proposes the continuous conditional generative adversarial network (CcGAN), the first generative model for image generation conditional on continuous, scalar conditions (termed regression labels). Existing conditional GANs (cGANs) are mainly designed for categorical conditions (e.g., class labels); conditioning on a continuous label is mathematically distinct and raises two fundamental problems: (P1) Since there may be very few (even zero) real images for some regression labels, minimizing existing empirical versions of cGAN losses (a.k.a. 
empirical cGAN losses) often fails in practice; (P2) Since regression labels are scalar and infinitely many, conventional label input methods (e.g., combining a hidden map of the generator/discriminator with a one-hot encoded label) are not applicable. The proposed CcGAN solves the above problems, respectively, by (S1) reformulating existing empirical cGAN losses to be appropriate for the continuous scenario; and (S2) proposing a novel method to incorporate regression labels into the generator and the discriminator. The reformulation in (S1) leads to two novel empirical discriminator losses, termed the hard vicinal discriminator loss (HVDL) and the soft vicinal discriminator loss (SVDL) respectively, and a novel empirical generator loss. The error bounds of a discriminator trained with HVDL and SVDL are derived under mild assumptions in this work. A new benchmark dataset, RC-49, is also proposed for generative image modeling conditional on regression labels. Our experiments on the Circular 2-D Gaussians, RC-49, and UTKFace datasets show that CcGAN is able to generate diverse, high-quality samples from the image distribution conditional on a given regression label. Moreover, in these experiments, CcGAN substantially outperforms cGAN both visually and quantitatively.",/pdf/e0bbc6b8950c086438fe77cc0c9098bff1128577.pdf,ICLR,2021,"This work proposes the continuous conditional generative adversarial network (CcGAN), the first generative model for image generation conditional on continuous, scalar conditions (termed as regression labels). 
" +H1oyRlYgg,,1478200000000.0,1486670000000.0,76,On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima,"[""keskar.nitish@u.northwestern.edu"", ""dheevatsa.mudigere@intel.com"", ""j-nocedal@northwestern.edu"", ""mikhail.smelyanskiy@intel.com"", ""peter.tang@intel.com""]","[""Nitish Shirish Keskar"", ""Dheevatsa Mudigere"", ""Jorge Nocedal"", ""Mikhail Smelyanskiy"", ""Ping Tak Peter Tang""]","[""Deep learning"", ""Optimization""]","The stochastic gradient descent (SGD) method and its variants are algorithms of choice for many Deep Learning tasks. These methods operate in a small-batch regime wherein a fraction of the training data, say $32$--$512$ data points, is sampled to compute an approximation to the gradient. It has been observed in practice that when using a larger batch there is a degradation in the quality of the model, as measured by its ability to generalize. We investigate the cause for this generalization drop in the large-batch regime and present numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions---and as is well known, sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support a commonly held view that this is due to the inherent noise in the gradient estimation. We discuss several strategies to attempt to help large-batch methods eliminate this generalization gap.",/pdf/bd71807dc65ba0c6a814e403b33cc9666cb17d5b.pdf,ICLR,2017,"We present numerical evidence for the argument that if deep networks are trained using large (mini-)batches, they converge to sharp minimizers, and these minimizers have poor generalization properties. 
" +ct8_a9h1M,B1Jw_rTXPOF,1601310000000.0,1615060000000.0,1270,Contextual Dropout: An Efficient Sample-Dependent Dropout Module,"[""~XINJIE_FAN2"", ""~Shujian_Zhang1"", ""korawat.tanwisuth@utexas.edu"", ""~Xiaoning_Qian2"", ""~Mingyuan_Zhou1""]","[""XINJIE FAN"", ""Shujian Zhang"", ""Korawat Tanwisuth"", ""Xiaoning Qian"", ""Mingyuan Zhou""]","[""Efficient Inference Methods"", ""Probabilistic Methods"", ""Supervised Deep Networks""]","Dropout has been demonstrated as a simple and effective module to not only regularize the training process of deep neural networks, but also provide the uncertainty estimation for prediction. However, the quality of uncertainty estimation is highly dependent on the dropout probabilities. Most current models use the same dropout distributions across all data samples due to its simplicity. Despite the potential gains in the flexibility of modeling uncertainty, sample-dependent dropout, on the other hand, is less explored as it often encounters scalability issues or involves non-trivial model changes. In this paper, we propose contextual dropout with an efficient structural design as a simple and scalable sample-dependent dropout module, which can be applied to a wide range of models at the expense of only slightly increased memory and computational cost. We learn the dropout probabilities with a variational objective, compatible with both Bernoulli dropout and Gaussian dropout. We apply the contextual dropout module to various models with applications to image classification and visual question answering and demonstrate the scalability of the method with large-scale datasets, such as ImageNet and VQA 2.0. 
Our experimental results show that the proposed method outperforms baseline methods in terms of both accuracy and quality of uncertainty estimation.",/pdf/e60af0056960e7c5e82ee3d15ad9c6dd3577788d.pdf,ICLR,2021,"We propose contextual dropout as a scalable sample-dependent dropout method, which makes the dropout probabilities depend on the input covariates of each data sample." +a2gqxKDvYys,KYoFmJ7eq91,1601310000000.0,1615980000000.0,3587,Mind the Gap when Conditioning Amortised Inference in Sequential Latent-Variable Models,"[""~Justin_Bayer1"", ""~Maximilian_Soelch1"", ""~Atanas_Mirchev1"", ""~Baris_Kayalibay1"", ""~Patrick_van_der_Smagt1""]","[""Justin Bayer"", ""Maximilian Soelch"", ""Atanas Mirchev"", ""Baris Kayalibay"", ""Patrick van der Smagt""]","[""variational inference"", ""state-space models"", ""amortized inference"", ""recurrent networks""]","Amortised inference enables scalable learning of sequential latent-variable models (LVMs) with the evidence lower bound (ELBO). In this setting, variational posteriors are often only partially conditioned. While the true posteriors depend, e.g., on the entire sequence of observations, approximate posteriors are only informed by past observations. This mimics the Bayesian filter---a mixture of smoothing posteriors. Yet, we show that the ELBO objective forces partially-conditioned amortised posteriors to approximate products of smoothing posteriors instead. Consequently, the learned generative model is compromised. We demonstrate these theoretical findings in three scenarios: traffic flow, handwritten digits, and aerial vehicle dynamics. Using fully-conditioned approximate posteriors, performance improves in terms of generative modelling and multi-step prediction.",/pdf/3b8adcab7340a271fe4624eabaaa9168eb0c8899.pdf,ICLR,2021,We show how a common model assumption in amortised variational inference with sequential LVMs leads to a suboptimality and how to prevent it. 
+rJY0-Kcll,,1478300000000.0,1488380000000.0,472,Optimization as a Model for Few-Shot Learning,"[""sachinr@twitter.com"", ""hugo@twitter.com"", ""sachinr@princeton.edu""]","[""Sachin Ravi"", ""Hugo Larochelle""]",[],"Though deep neural networks have shown great success in the large data domain, they generally perform poorly on few-shot learning tasks, where a model has to quickly generalize after seeing very few examples from each class. The general belief is that gradient-based optimization in high capacity models requires many iterative steps over many examples to perform well. Here, we propose an LSTM-based meta-learner model to learn the exact optimization algorithm used to train another learner neural network in the few-shot regime. The parametrization of our model allows it to learn appropriate parameter updates specifically for the scenario where a set amount of updates will be made, while also learning a general initialization of the learner network that allows for quick convergence of training. We demonstrate that this meta-learning model is competitive with deep metric-learning techniques for few-shot learning. ",/pdf/191257b538cdf09db8808fe926c1ffb2f51db178.pdf,ICLR,2017,We propose an LSTM-based meta-learner model to learn the exact optimization algorithm used to train another learner neural network in the few-shot regime +8YFhXYe1Ps,nckQcAUHEeg,1601310000000.0,1614990000000.0,3736,Interpretability Through Invertibility: A Deep Convolutional Network With Ideal Counterfactuals And Isosurfaces,"[""~Leon_Sixt1"", ""schuessler@tu-berlin.de"", ""philipp@itp.tu-berlin.de"", ""~Tim_Landgraf1""]","[""Leon Sixt"", ""Martin Schuessler"", ""Philipp Wei\u00df"", ""Tim Landgraf""]","[""Interpretable Machine Learning"", ""Counterfactuals"", ""Computer Vision"", ""Human Evaluation"", ""User Study""]","Current state of the art computer vision applications rely on highly complex models. 
Their interpretability is mostly limited to post-hoc methods which are not guaranteed to be faithful to the model. To elucidate a model’s decision, we present a novel interpretable model based on an invertible deep convolutional network. Our model generates meaningful, faithful, and ideal counterfactuals. Using PCA on the classifier’s input, we can also create “isofactuals”– image interpolations with the same outcome but visually meaningful different features. Counter- and isofactuals can be used to identify positive and negative evidence in an image. This can also be visualized with heatmaps. We evaluate our approach against gradient-based attribution methods, which we find to produce meaningless adversarial perturbations. Using our method, we reveal biases in three different datasets. In a human subject experiment, we test whether non-experts find our method useful to spot spurious correlations learned by a model. Our work is a step towards more trustworthy explanations for computer vision.",/pdf/83628964eb4ac5f2e2665ccd6ea038a08a70ff43.pdf,ICLR,2021,We use invertible neural networks to generate ideal counterfactuals and isofactuals. +w2Z2OwVNeK,RD_a-3vWDQ_,1601310000000.0,1615930000000.0,3201,Plan-Based Relaxed Reward Shaping for Goal-Directed Tasks,"[""~Ingmar_Schubert1"", ""~Ozgur_S_Oguz1"", ""~Marc_Toussaint1""]","[""Ingmar Schubert"", ""Ozgur S Oguz"", ""Marc Toussaint""]","[""reinforcement learning"", ""reward shaping"", ""plan-based reward shaping"", ""robotics"", ""robotic manipulation""]","In high-dimensional state spaces, the usefulness of Reinforcement Learning (RL) is limited by the problem of exploration. This issue has been addressed using potential-based reward shaping (PB-RS) previously. In the present work, we introduce Final-Volume-Preserving Reward Shaping (FV-RS). FV-RS relaxes the strict optimality guarantees of PB-RS to a guarantee of preserved long-term behavior. 
Being less restrictive, FV-RS allows for reward shaping functions that are even better suited for improving the sample efficiency of RL algorithms. In particular, we consider settings in which the agent has access to an approximate plan. Here, we use examples of simulated robotic manipulation tasks to demonstrate that plan-based FV-RS can indeed significantly improve the sample efficiency of RL over plan-based PB-RS.",/pdf/6ab6b9e3a9fe5a364f986aaff177de866990899b.pdf,ICLR,2021,"We introduce Final-Volume-Preserving Reward Shaping, and show in a plan-based setting that it significantly increases the sample efficiency of reinforcement learning." +H1e6ij0cKQ,H1g3pscctm,1538090000000.0,1545360000000.0,662,EFFICIENT SEQUENCE LABELING WITH ACTOR-CRITIC TRAINING,"[""snajafi@ualberta.ca"", ""colin.a.cherry@gmail.com"", ""gkondrak@ualberta.ca""]","[""Saeed Najafi"", ""Colin Cherry"", ""Greg Kondrak""]","[""Structured Prediction"", ""Reinforcement Learning"", ""NLP""]","Neural approaches to sequence labeling often use a Conditional Random Field (CRF) to model their output dependencies, while Recurrent Neural Networks (RNN) are used for the same purpose in other tasks. We set out to establish RNNs as an attractive alternative to CRFs for sequence labeling. To do so, we address one of the RNN’s most prominent shortcomings, the fact that it is not exposed to its own errors with the maximum-likelihood training. We frame the prediction of the output sequence as a sequential decision-making process, where we train the network with an adjusted actor-critic algorithm (AC-RNN). We comprehensively compare this strategy with maximum-likelihood training for both RNNs and CRFs on three structured-output tasks. The proposed AC-RNN efficiently matches the performance of the CRF on NER and CCG tagging, and outperforms it on Machine Transliteration. 
We also show that our training strategy is significantly better than other techniques for addressing RNN’s exposure bias, such as Scheduled Sampling, and Self-Critical policy training. +",/pdf/fdb32ddb4a126c40b5a21a39393cac0d57cd1a88.pdf,ICLR,2019, +L8BElg6Qldb,bPnKU3a6xM,1601310000000.0,1614990000000.0,3319,Nonvacuous Loss Bounds with Fast Rates for Neural Networks via Conditional Information Measures,"[""~Fredrik_Hellstr\u00f6m1"", ""~Giuseppe_Durisi1""]","[""Fredrik Hellstr\u00f6m"", ""Giuseppe Durisi""]",[],"We present a framework to derive bounds on the test loss of randomized learning algorithms for the case of bounded loss functions. This framework leads to bounds that depend on the conditional information density between the output hypothesis and the choice of the training set, given a larger set of data samples from which the training set is formed. Furthermore, the bounds pertain to the average test loss as well as to its tail probability, both for the PAC-Bayesian and the single-draw settings. If the conditional information density is bounded uniformly in the size $n$ of the training set, our bounds decay as $1/n$, which is referred to as a fast rate. This is in contrast with the tail bounds involving conditional information measures available in the literature, which have a less benign $1/\sqrt{n}$ dependence. 
We demonstrate the usefulness of our tail bounds by showing that they lead to estimates of the test loss achievable with several neural network architectures trained on MNIST and Fashion-MNIST that match the state-of-the-art bounds available in the literature.",/pdf/1023fc091bf43fd013ce271114afe091d6a81a2c.pdf,ICLR,2021, +BJlqYlrtPB,S1l-VAgKPH,1569440000000.0,1577170000000.0,2445,Negative Sampling in Variational Autoencoders,"[""csadrian@renyi.hu"", ""bbeatrix1010@gmail.com"", ""daniel@renyi.hu""]","[""Adri\u00e1n Csisz\u00e1rik"", ""Beatrix Benk\u0151"", ""D\u00e1niel Varga""]","[""Variational Autoencoder"", ""generative modelling"", ""out-of-distribution detection""]","We propose negative sampling as an approach to improve the notoriously bad out-of-distribution likelihood estimates of Variational Autoencoder models. Our model pushes latent images of negative samples away from the prior. When the source of negative samples is an auxiliary dataset, such a model can vastly improve on baselines when evaluated on OOD detection tasks. Perhaps more surprisingly, we present a fully unsupervised variant that can also significantly improve detection performance: using the output of the generator as a source of negative samples results in a fully unsupervised model that can be interpreted as adversarially trained. +",/pdf/1923aaa1393f2fec05882c3cf221cf1950bf36fe.pdf,ICLR,2020,Pulling near-manifold examples (utilizing an auxiliary dataset or generated samples) to a secondary prior improves the discriminative power of VAE models regarding out-of-distribution samples. +ryxUkTVYvH,rklyPo8HvB,1569440000000.0,1577170000000.0,302,Towards Controllable and Interpretable Face Completion via Structure-Aware and Frequency-Oriented Attentive GANs,"[""zchen23@ncsu.edu"", ""snie@ncsu.edu"", ""tianfu_wu@ncsu.edu"", ""healey@ncsu.edu""]","[""Zeyuan Chen"", ""Shaoliang Nie"", ""Tianfu Wu"", ""Christopher G. 
Healey""]","[""Face Completion"", ""GANs"", ""Conditional Image Synthesis"", ""Interpretability"", ""Frequency-Oriented Attention""]","Face completion is a challenging conditional image synthesis task. This paper proposes controllable and interpretable high-resolution and fast face completion by learning generative adversarial networks (GANs) progressively from low resolution to high resolution. We present structure-aware and frequency-oriented attentive GANs. The proposed structure-aware component leverages off-the-shelf facial landmark detectors and proposes a simple yet effective method of integrating the detected landmarks in generative learning. It facilitates facial expression transfer together with facial attributes control, and helps regularize the structural consistency in progressive training. The proposed frequency-oriented attentive module (FOAM) encourages GANs to attend to only finer details in the coarse-to-fine progressive training, thus enabling progressive attention to face structures. The learned FOAMs show a strong pattern of switching its attention from low-frequency to high-frequency signals. In experiments, the proposed method is tested on the CelebA-HQ benchmark. Experiment results show that our approach outperforms state-of-the-art face completion methods. 
The proposed method is also fast with mean inference time of 0.54 seconds for images at 1024x1024 resolution (using a Titan Xp GPU).",/pdf/61ad40edfa986c1b72ff38193c15d2ae0ae23e66.pdf,ICLR,2020,Structure-aware and frequency-oriented attentive GANs for high-resolution and fast face completion +Ud3DSz72nYR,qbYC8M0Nzkf,1601310000000.0,1615960000000.0,3005,Contrastive Explanations for Reinforcement Learning via Embedded Self Predictions,"[""~Zhengxian_Lin1"", ""~Kin-Ho_Lam1"", ""~Alan_Fern1""]","[""Zhengxian Lin"", ""Kin-Ho Lam"", ""Alan Fern""]","[""Explainable AI"", ""Deep Reinforcement Learning""]","We investigate a deep reinforcement learning (RL) architecture that supports explaining why a learned agent prefers one action over another. The key idea is to learn action-values that are directly represented via human-understandable properties of expected futures. This is realized via the embedded self-prediction (ESP) model, which learns said properties in terms of human provided features. Action preferences can then be explained by contrasting the future properties predicted for each action. To address cases where there are a large number of features, we develop a novel method for computing minimal sufficient explanations from an ESP. Our case studies in three domains, including a complex strategy game, show that ESP models can be effectively learned and support insightful explanations. ",/pdf/0b44de227203c9a6da82618d99fd47af97f88da6.pdf,ICLR,2021,We introduced the embedded self-prediction (ESP) model for producing meaningful and sound contrastive explanations for RL agents. 
+r1lnigSFDr,S1lYaJbKDB,1569440000000.0,1577170000000.0,2516,Improving the Gating Mechanism of Recurrent Neural Networks,"[""gua@google.com"", ""caglarg@google.com"", ""mwhoffman@google.com"", ""razp@google.com""]","[""Albert Gua"", ""Caglar Gulcehre"", ""Tom le Paine"", ""Razvan Pascanu"", ""Matt Hoffman""]","[""recurrent neural networks"", ""LSTM"", ""GRUs"", ""gating mechanisms"", ""deep learning"", ""reinforcement learning""]","In this work, we revisit the gating mechanisms widely used in various recurrent and feedforward networks such as LSTMs, GRUs, or highway networks. These gates are meant to control information flow, allowing gradients to better propagate back in time for recurrent models. However, to propagate gradients over very long temporal windows, they need to operate close to their saturation regime. We propose two independent and synergistic modifications to the standard gating mechanism that are easy to implement, introduce no additional hyper-parameters, and are aimed at improving learnability of the gates when they are close to saturation. Our proposals are theoretically justified, and we show a generic framework that encompasses other recently proposed gating mechanisms such as chrono-initialization and master gates. We perform systematic analyses and ablation studies on the proposed improvements and evaluate our method on a wide range of applications including synthetic memorization tasks, sequential image classification, language modeling, and reinforcement learning. Empirically, our proposed gating mechanisms robustly increase the performance of recurrent models such as LSTMs, especially on tasks requiring long temporal dependencies.",/pdf/ee6aa8f1d0d1d8ec5841207e08caae0971cc7fa6.pdf,ICLR,2020,Improving the gating mechanisms of recurrent neural networks by addressing the initialization of the biases and the saturation problem of sigmoid. 
+Skx73lBFDS,B1gd4gZYwr,1569440000000.0,1577170000000.0,2530,Combining graph and sequence information to learn protein representations,"[""hassanmohamed@alum.mit.edu"", ""mohamed-konoufo.coulibali.1@ulaval.ca"", ""pelkins@alum.mit.edu"", ""aabdalla@alum.mit.edu""]","[""Hassan Kan\u00e9"", ""Mohamed Coulibali"", ""Pelkins Ajanoh"", ""Ali Abdalla""]","[""NLP"", ""Protein"", ""Representation Learning""]","Computational methods that infer the function of proteins are key to understanding life at the molecular level. In recent years, representation learning has emerged as a powerful paradigm to discover new patterns among entities as varied as images, words, speech, molecules. In typical representation learning, there is only one source of data or one level of abstraction at which the learned representation occurs. However, proteins can be described by their primary, secondary, tertiary, and quaternary structure or even as nodes in protein-protein interaction networks. Given that protein function is an emergent property of all these levels of interactions in this work, we learn joint representations from both amino acid sequence and multilayer networks representing tissue-specific protein-protein interactions. Using these representations, we train machine learning models that outperform existing methods on the task of tissue-specific protein function prediction on 10 out of 13 tissues. 
Furthermore, we outperform existing methods by 19% on average.",/pdf/bce08842ef00a7182222b1dfa6b03b21ed03187a.pdf,ICLR,2020,We learn protein representations by integrating data from physical interaction and amino acid sequence +IVwXaHpiO0,b46RqGUQqnG,1601310000000.0,1614990000000.0,671,SyncTwin: Transparent Treatment Effect Estimation under Temporal Confounding,"[""~Zhaozhi_Qian1"", ""~Yao_Zhang3"", ""~Ioana_Bica1"", ""amw79@medschl.cam.ac.uk"", ""~Mihaela_van_der_Schaar2""]","[""Zhaozhi Qian"", ""Yao Zhang"", ""Ioana Bica"", ""Angela Wood"", ""Mihaela van der Schaar""]","[""treatment effect"", ""interpretability"", ""healthcare"", ""causal inference""]","Estimating causal treatment effects using observational data is a problem with few solutions when the confounder has a temporal structure, e.g. the history of disease progression might impact both treatment decisions and clinical outcomes. For such a challenging problem, it is desirable for the method to be transparent --- the ability to pinpoint a small subset of data points that contributes most to the estimate and to clearly indicate whether the estimate is reliable or not. This paper develops a new method, SyncTwin, to overcome temporal confounding in a transparent way. SyncTwin estimates the treatment effect of a target individual by comparing the outcome with its synthetic twin, which is constructed to closely match the target in the representation of the temporal confounders. SyncTwin achieves transparency by enforcing the synthetic twin to only depend on the weighted combination of few other individuals in the dataset. Moreover, the quality of the synthetic twin can be assessed by a performance metric, which also indicates the reliability of the estimated treatment effect. 
Experiments demonstrate that SyncTwin outperforms the benchmarks in clinical observational studies while still being transparent.",/pdf/7b13cfc6b2018d3c839fa3c4650874cc02e9470c.pdf,ICLR,2021,"We develop SyncTwin, a transparent treatment effect estimation method that deals with confounders with temporal structures and has a broad range of applications in clinical observational studies and beyond." +Skz3Q2CcFX,r1elGLhqKm,1538090000000.0,1545360000000.0,1395,Visualizing and Understanding the Semantics of Embedding Spaces via Algebraic Formulae,"[""piero@uber.com"", ""gnavvy@uber.com"", ""rivulet.zhang@gmail.com""]","[""Piero Molino"", ""Yang Wang"", ""Jiawei Zhang""]","[""visualization"", ""embeddings"", ""representations"", ""t-sne"", ""natural"", ""language"", ""processing"", ""machine"", ""learning"", ""algebra""]","Embeddings are a fundamental component of many modern machine learning and natural language processing models. +Understanding them and visualizing them is essential for gathering insights about the information they capture and the behavior of the models. +State of the art in analyzing embeddings consists in projecting them in two-dimensional planes without any interpretable semantics associated to the axes of the projection, which makes detailed analyses and comparison among multiple sets of embeddings challenging. +In this work, we propose to use explicit axes defined as algebraic formulae over embeddings to project them into a lower dimensional, but semantically meaningful subspace, as a simple yet effective analysis and visualization methodology. +This methodology assigns an interpretable semantics to the measures of variability and the axes of visualizations, allowing for both comparisons among different sets of embeddings and fine-grained inspection of the embedding spaces. 
+We demonstrate the power of the proposed methodology through a series of case studies that make use of visualizations constructed around the underlying methodology and through a user study. The results show how the methodology is effective at providing more profound insights than classical projection methods and how it is widely applicable to many other use cases.",/pdf/5a43f9ad12b9aa0724eee8b48117c5e5e5fd761b.pdf,ICLR,2019,We propose to use explicit vector algebraic formulae projection as an alternative way to visualize embedding spaces specifically tailored for goal-oriented analysis tasks and it outperforms t-SNE in our user study. +HJxcP2EFDS,SyetD-bKSS,1569440000000.0,1577170000000.0,14,Amharic Negation Handling,"[""girma1978@gmail.com""]","[""Girma Neshir""]","[""Negation Handling Algorithm"", ""Amharic Sentiment Analysis"", ""Amharic Sentiment lexicon"", ""char level"", ""word level ngram"", ""machine learning"", ""hybrid""]","User generated content contains opinionated texts not only in dominant languages (like English) but also in less dominant languages (like Amharic). However, negation handling techniques that support sentiment detection have not been developed for such less dominant languages (i.e. Amharic). Negation handling is one of the challenging tasks for sentiment classification. Thus, this work builds negation handling schemes which enhance Amharic Sentiment classification. The proposed Negation Handling framework combines the lexicon based approach and a character ngram based machine learning model. The performance of the framework is evaluated using the annotated Amharic News Comments. The system outperforms the best of all models and the baselines by an accuracy of 98.0. The result is compared with the baselines (without negation handling and word level ngram model).",/pdf/c92b9ae3cb64cc8abef7470c80502fc449f9fc90.pdf,ICLR,2020,This work presents Amharic Negation Handling for efficient Sentiment Classification. 
+I4c4K9vBNny,HP9UGYkeDtH,1601310000000.0,1615810000000.0,1859,Spatial Dependency Networks: Neural Layers for Improved Generative Image Modeling,"[""~\u0110or\u0111e_Miladinovi\u01071"", ""~Aleksandar_Stani\u01071"", ""~Stefan_Bauer1"", ""~J\u00fcrgen_Schmidhuber1"", ""~Joachim_M._Buhmann1""]","[""\u0110or\u0111e Miladinovi\u0107"", ""Aleksandar Stani\u0107"", ""Stefan Bauer"", ""J\u00fcrgen Schmidhuber"", ""Joachim M. Buhmann""]","[""Neural networks"", ""Deep generative models"", ""Image Modeling"", ""Variational Autoencoders""]","How to improve generative modeling by better exploiting spatial regularities and coherence in images? We introduce a novel neural network for building image generators (decoders) and apply it to variational autoencoders (VAEs). In our spatial dependency networks (SDNs), feature maps at each level of a deep neural net are computed in a spatially coherent way, using a sequential gating-based mechanism that distributes contextual information across 2-D space. We show that augmenting the decoder of a hierarchical VAE by spatial dependency layers considerably improves density estimation over baseline convolutional architectures and the state-of-the-art among the models within the same class. Furthermore, we demonstrate that SDN can be applied to large images by synthesizing samples of high quality and coherence. In a vanilla VAE setting, we find that a powerful SDN decoder also improves learning disentangled representations, indicating that neural architectures play an important role in this task. Our results suggest favoring spatial dependency over convolutional layers in various VAE settings. The accompanying source code is given at https://github.com/djordjemila/sdn.",/pdf/07c9cfa0f97d0d062ba56a4bd3f00cbb1488c8a9.pdf,ICLR,2021,"A novel neural network layer for improved generative modeling of images, applied to variational autoencoders." 
+HJx_d34YDB,B1x5JkwUIS,1569440000000.0,1577170000000.0,46,VIDEO AFFECTIVE IMPACT PREDICTION WITH MULTIMODAL FUSION AND LONG-SHORT TEMPORAL CONTEXT,"[""yinzhao.zy@alibaba-inc.com"", ""longjun.clj@alibaba-inc.com"", ""chaoping.tcp@alibaba-inc.com"", ""auzj_alex@mail.scut.edu.cn"", ""weiwu@scut.edu.cn""]","[""Yin Zhao"", ""Longjun Cai"", ""Chaoping Tu"", ""Jie Zhang"", ""Wu Wei""]","[""multi-modal fusion"", ""affective computing"", ""temporal context"", ""residual-based training strategy""]","Predicting the emotional impact of videos using machine learning is a challenging task. Feature extraction, multi-modal fusion and temporal context fusion are crucial stages for predicting valence and arousal values in the emotional impact, but +have not been successfully exploited. In this paper, we proposed a comprehensive framework with innovative designs of model structure and multi-modal fusion strategy. We select the most suitable modalities for valence and arousal tasks respectively and each modal feature is extracted using the modality-specific pre-trained deep model on large generic dataset. Two-time-scale structures, one for the intra-clip and the other for the inter-clip, are proposed to capture the temporal dependency of video content and emotional states. To combine the complementary information from multiple modalities, an effective and efficient residual-based progressive training strategy is proposed. Each modality is step-wisely combined into the +multi-modal model, responsible for completing the missing parts of features. 
With all those above, our proposed prediction framework achieves better performance with a large margin compared to the state-of-the-art.",/pdf/33a143281af746485db625d1ffee095bfa413043.pdf,ICLR,2020, +B1GSBsRcFX,SJlZP5E6OX,1538090000000.0,1545360000000.0,84,Stop memorizing: A data-dependent regularization framework for intrinsic pattern learning,"[""zhu@math.duke.edu"", ""qiang.qiu@duke.edu"", ""wangbao@math.ucla.edu"", ""jianfeng@math.duke.edu"", ""guillermo.sapiro@duke.edu"", ""ingrid@math.duke.edu""]","[""Wei Zhu"", ""Qiang Qiu"", ""Bao Wang"", ""Jianfeng Lu"", ""Guillermo Sapiro"", ""Ingrid Daubechies""]","[""deep neural networks"", ""memorizing"", ""data-dependent regularization""]","Deep neural networks (DNNs) typically have enough capacity to fit random data by brute force even when conventional data-dependent regularizations focusing on the geometry of the features are imposed. We find out that the reason for this is the inconsistency between the enforced geometry and the standard softmax cross entropy loss. To resolve this, we propose a new framework for data-dependent DNN regularization, the Geometrically-Regularized-Self-Validating neural Networks (GRSVNet). During training, the geometry enforced on one batch of features is simultaneously validated on a separate batch using a validation loss consistent with the geometry. We study a particular case of GRSVNet, the Orthogonal-Low-rank Embedding (OLE)-GRSVNet, which is capable of producing highly discriminative features residing in orthogonal low-rank subspaces. Numerical experiments show that OLE-GRSVNet outperforms DNNs with conventional regularization when trained on real data. 
More importantly, unlike conventional DNNs, OLE-GRSVNet refuses to memorize random data or random labels, suggesting it only learns intrinsic patterns by reducing the memorizing capacity of the baseline DNN.",/pdf/fbf3bad55a29c48cd74c3d470ebe5ccaeb147e61.pdf,ICLR,2019,we propose a new framework for data-dependent DNN regularization that can prevent DNNs from overfitting random data or random labels. +S1lKSjRcY7,rklCznKKtX,1538090000000.0,1545360000000.0,105,Improved Gradient Estimators for Stochastic Discrete Variables,"[""eandriyash@dwavesys.com"", ""avahdat@dwavesys.com"", ""wgm@dwavesys.com""]","[""Evgeny Andriyash"", ""Arash Vahdat"", ""Bill Macready""]","[""continuous relaxation"", ""discrete stochastic variables"", ""reparameterization trick"", ""variational inference"", ""discrete optimization"", ""stochastic gradient estimation""]",In many applications we seek to optimize an expectation with respect to a distribution over discrete variables. Estimating gradients of such objectives with respect to the distribution parameters is a challenging problem. We analyze existing solutions including finite-difference (FD) estimators and continuous relaxation (CR) estimators in terms of bias and variance. We show that the commonly used Gumbel-Softmax estimator is biased and propose a simple method to reduce it. We also derive a simpler piece-wise linear continuous relaxation that also possesses reduced bias. We demonstrate empirically that reduced bias leads to a better performance in variational inference and on binary optimization tasks.,/pdf/ae83aaf46d270292953663aef66b0b6342cf97b4.pdf,ICLR,2019,We propose simple ways to reduce bias and complexity of stochastic gradient estimators used for learning distributions over discrete variables. 
+H1eRBoC9FX,B1edS0hKYm,1538090000000.0,1545360000000.0,132,Unsupervised Meta-Learning for Reinforcement Learning,"[""abhigupta@berkeley.edu"", ""eysenbachbe@gmail.com"", ""cbfinn@eecs.berkeley.edu"", ""svlevine@eecs.berkeley.edu""]","[""Abhishek Gupta"", ""Benjamin Eysenbach"", ""Chelsea Finn"", ""Sergey Levine""]","[""Meta-Learning"", ""Reinforcement Learning"", ""Exploration"", ""Unsupervised""]","Meta-learning is a powerful tool that learns how to quickly adapt a model to new tasks. In the context of reinforcement learning, meta-learning algorithms can acquire reinforcement learning procedures to solve new problems more efficiently by meta-learning prior tasks. The performance of meta-learning algorithms critically depends on the tasks available for meta-training: in the same way that supervised learning algorithms generalize best to test points drawn from the same distribution as the training points, meta-learning methods generalize best to tasks from the same distribution as the meta-training tasks. In effect, meta-reinforcement learning offloads the design burden from algorithm design to task design. If we can automate the process of task design as well, we can devise a meta-learning algorithm that is truly automated. In this work, we take a step in this direction, proposing a family of unsupervised meta-learning algorithms for reinforcement learning. We describe a general recipe for unsupervised meta-reinforcement learning, and describe an effective instantiation of this approach based on a recently proposed unsupervised exploration technique and model-agnostic meta-learning. We also discuss practical and conceptual considerations for developing unsupervised meta-learning methods. 
Our experimental results demonstrate that unsupervised meta-reinforcement learning effectively acquires accelerated reinforcement learning procedures without the need for manual task design, significantly exceeds the performance of learning from scratch, and even matches performance of meta-learning methods that use hand-specified task distributions.",/pdf/199cd9c230fbf41131c4c066f8b5a3e41d8ecf1e.pdf,ICLR,2019,Remove the burden of task distribution specification in meta-reinforcement learning by using unsupervised exploration +6htjOqus6C3,wVPeLqzIfl7,1601310000000.0,1614990000000.0,1198,DynamicVAE: Decoupling Reconstruction Error and Disentangled Representation Learning,"[""~Huajie_Shao1"", ""lhh2017@zju.edu.cn"", ""qmyang@zju.edu.cn"", ""~Shuochao_Yao1"", ""~Han_Zhao1"", ""~Tarek_Abdelzaher1""]","[""Huajie Shao"", ""Haohong Lin"", ""Qinmin Yang"", ""Shuochao Yao"", ""Han Zhao"", ""Tarek Abdelzaher""]","[""disentangled representation learning"", ""dynamic learning"", ""Variational Autoencoder"", ""PID controller""]","This paper challenges the common assumption that the weight $\beta$, in $\beta$-VAE, should be larger than $1$ in order to effectively disentangle latent factors. We demonstrate that $\beta$-VAE, with $\beta < 1$, can not only attain good disentanglement but also significantly improve reconstruction accuracy via dynamic control. The paper \textit{removes the inherent trade-off} between reconstruction accuracy and disentanglement for $\beta$-VAE. Existing methods, such as $\beta$-VAE and FactorVAE, assign a large weight to the KL-divergence term in the objective function, leading to high reconstruction errors for the sake of better disentanglement. To mitigate this problem, a ControlVAE has recently been developed that dynamically tunes the KL-divergence weight in an attempt to \textit{control the trade-off} to a more favorable point. 
However, ControlVAE fails to eliminate the conflict between the need for a large $\beta$ (for disentanglement) and the need for a small $\beta$ (for smaller reconstruction error). Instead, we propose DynamicVAE that maintains a different $\beta$ at different stages of training, thereby \textit{decoupling disentanglement and reconstruction accuracy}. In order to evolve the weight, $\beta$, along a trajectory that enables such decoupling, DynamicVAE leverages a modified incremental PI (proportional-integral) controller, a variant of proportional-integral-derivative controller (PID) algorithm, and employs a moving average as well as a hybrid annealing method to evolve the value of KL-divergence smoothly in a tightly controlled fashion. We theoretically prove the stability of the proposed approach. Evaluation results on three benchmark datasets demonstrate that DynamicVAE significantly improves the reconstruction accuracy while achieving disentanglement comparable to the best of existing methods. The results verify that our method can separate disentangled representation learning and reconstruction, removing the inherent tension between the two. ",/pdf/76824fd024e07e4fd1e2ba536e5b9e7f5fb29dc4.pdf,ICLR,2021,The goal of this paper is to decouple disentangling and reconstruction for disentangled representation learning via dynamic control. +Ig-VyQc-MLK,xQEZ0gjDabR,1601310000000.0,1616360000000.0,5,Pruning Neural Networks at Initialization: Why Are We Missing the Mark?,"[""~Jonathan_Frankle1"", ""~Gintare_Karolina_Dziugaite1"", ""~Daniel_Roy1"", ""~Michael_Carbin1""]","[""Jonathan Frankle"", ""Gintare Karolina Dziugaite"", ""Daniel Roy"", ""Michael Carbin""]","[""Pruning"", ""Sparsity"", ""Lottery Ticket"", ""Science""]","Recent work has explored the possibility of pruning neural networks at initialization. We assess proposals for doing so: SNIP (Lee et al., 2019), GraSP (Wang et al., 2020), SynFlow (Tanaka et al., 2020), and magnitude pruning. 
Although these methods surpass the trivial baseline of random pruning, they remain below the accuracy of magnitude pruning after training, and we endeavor to understand why. We show that, unlike pruning after training, randomly shuffling the weights these methods prune within each layer or sampling new initial values preserves or improves accuracy. As such, the per-weight pruning decisions made by these methods can be replaced by a per-layer choice of the fraction of weights to prune. This property suggests broader challenges with the underlying pruning heuristics, the desire to prune at initialization, or both.",/pdf/cca2ba851f4bc02f930894a6b4954aa465868ba6.pdf,ICLR,2021,"Methods for pruning neural nets at initialization perform the same or better when shuffling or reinitializing the weights they prune in each layer, a way in which they differ from SOTA weight-pruning methods after training." +6M4c3WegNtX,xQybqxdTiZj,1601310000000.0,1614990000000.0,3619,Neural Ensemble Search for Uncertainty Estimation and Dataset Shift,"[""~Sheheryar_Zaidi1"", ""~Arber_Zela1"", ""~Thomas_Elsken1"", ""chris.holmes@stats.ox.ac.uk"", ""~Frank_Hutter1"", ""~Yee_Whye_Teh1""]","[""Sheheryar Zaidi"", ""Arber Zela"", ""Thomas Elsken"", ""Chris Holmes"", ""Frank Hutter"", ""Yee Whye Teh""]","[""uncertainty estimation"", ""deep ensemble"", ""dataset shift"", ""robustness"", ""uncertainty calibration""]","Ensembles of neural networks achieve superior performance compared to stand-alone networks not only in terms of predictive performance, but also uncertainty calibration and robustness to dataset shift. Diversity among networks is believed to be key for building strong ensembles, but typical approaches, such as \emph{deep ensembles}, only ensemble different weight vectors of a fixed architecture. Instead, we propose two methods for constructing ensembles to exploit diversity among networks with \emph{varying} architectures. 
We find that the resulting ensembles are indeed more diverse and also exhibit better uncertainty calibration, predictive performance and robustness to dataset shift in comparison with deep ensembles on a variety of classification tasks.",/pdf/364b3f551cf22c6ece1bda178e6c7604ef0dcd05.pdf,ICLR,2021,"We propose methods for constructing ensembles of neural networks with varying architectures, demonstrating that they outperform deep ensembles, in terms of uncertainty calibration, predictive performance and robustness to dataset shift." +#NAME?,hDFFL0Ywqv1,1601310000000.0,1614990000000.0,168,CROSS-SUPERVISED OBJECT DETECTION,"[""~Zitian_Chen1"", ""~Zhiqiang_Shen1"", ""~Jiahui_Yu1"", ""~Erik_Learned-Miller2""]","[""Zitian Chen"", ""Zhiqiang Shen"", ""Jiahui Yu"", ""Erik Learned-Miller""]","[""Object detection"", ""weakly supervised"", ""transfer learning""]","After learning a new object category from image-level annotations (with no object bounding boxes), humans are remarkably good at precisely localizing those objects. However, building good object localizers (i.e., detectors) currently requires expensive instance-level annotations. While some work has been done on learning detectors from weakly labeled samples (with only class labels), these detectors do poorly at localization. In this work, we show how to build better object detectors from weakly labeled images of new categories by leveraging knowledge learned from fully labeled base categories. We call this learning paradigm cross-supervised object detection. While earlier works investigated this paradigm, they did not apply it to realistic complex images (e.g., COCO), and their performance was poor. We propose a unified framework that combines a detection head trained from instance-level annotations and a recognition head learned from image-level annotations, together with a spatial correlation module that bridges the gap between detection and recognition. 
These contributions enable us to better detect novel objects with image-level annotations in complex multi-object scenes such as the COCO dataset.",/pdf/a6fdc33f91f5b937c25cdda47b871c2ddaf46100.pdf,ICLR,2021, We show how to build better object detectors from weakly labeled images of new categories by leveraging knowledge learned from fully labeled base categories. +3UTezOEABr,jN7ezNj8Vnk,1601310000000.0,1614990000000.0,2493,TimeAutoML: Autonomous Representation Learning for Multivariate Irregularly Sampled Time Series,"[""yangjiao@tongji.edu.cn"", ""~Kai_Yang3"", ""shaoyu@tongji.edu.cn"", ""lp@tongji.edu.cn"", ""~Sijia_Liu1"", ""~Dongjin_Song2""]","[""Yang Jiao"", ""Kai Yang"", ""shaoyu dou"", ""pan luo"", ""Sijia Liu"", ""Dongjin Song""]","[""representation learning"", ""AutoML"", ""irregularly sampled time series"", ""anomaly detection"", ""clustering""]","Multivariate time series (MTS) data are becoming increasingly ubiquitous in diverse domains, e.g., IoT systems, health informatics, and 5G networks. To obtain an effective representation of MTS data, it is not only essential to consider unpredictable dynamics and highly variable lengths of these data but also important to address the irregularities in the sampling rates of MTS. Existing parametric approaches rely on manual hyperparameter tuning and may cost a huge amount of labor effort. Therefore, it is desirable to learn the representation automatically and efficiently. To this end, we propose an autonomous representation learning approach for multivariate time series (TimeAutoML) with irregular sampling rates and variable lengths. As opposed to previous works, we first present a representation learning pipeline in which the configuration and hyperparameter optimization +are fully automatic and can be tailored for various tasks, e.g., anomaly detection, clustering, etc. 
Next, a negative sample generation approach and an auxiliary classification task are developed and integrated within TimeAutoML to enhance +its representation capability. Extensive empirical studies on real-world datasets demonstrate that the proposed TimeAutoML outperforms competing approaches on various tasks by a large margin. In fact, it achieves the best anomaly detection +performance among all comparison algorithms on 78 out of all 85 UCR datasets, acquiring up to 20% performance improvement in terms of AUC score.",/pdf/07e9b62da12f86427970c15dfd458283541c23c1.pdf,ICLR,2021,This paper presents an autonomous representation learning approach for multivariate irregularly sampled time series. +rJgDb1SFwB,SylZdsi_wr,1569440000000.0,1577170000000.0,1546,MGP-AttTCN: An Interpretable Machine Learning Model for the Prediction of Sepsis,"[""mrosnati@ethz.ch"", ""fortuin@inf.ethz.ch""]","[""Margherita Rosnati"", ""Vincent Fortuin""]","[""time series analysis"", ""interpretability"", ""Gaussian Processes"", ""attention neural networks""]","With a mortality rate of 5.4 million lives worldwide every year and a healthcare cost of more than 16 billion dollars in the USA alone, sepsis is one of the leading causes of hospital mortality and an increasing concern in the ageing western world. Recently, medical and technological advances have helped re-define the illness criteria of this disease, which is otherwise poorly understood by the medical society. Together with the rise of widely accessible Electronic Health Records, the advances in data mining and complex nonlinear algorithms are a promising avenue for the early detection of sepsis. This work contributes to the research effort in the field of automated sepsis detection with an open-access labelling of the medical MIMIC-III data set. Moreover, we propose MGP-AttTCN: a joint multitask Gaussian Process and attention-based deep learning model to early predict the occurrence of sepsis in an interpretable manner. 
We show that our model outperforms the current state-of-the-art and present evidence that different labelling heuristics lead to discrepancies in task difficulty.",/pdf/94249607d064e4cba7ad1321318616f1cc01228e.pdf,ICLR,2020,We propose MGP-AttTCN: a joint multitask Gaussian Process and attention-based deep learning model to early predict the occurrence of sepsis in an interpretable and robust manner. +SJmAXkgCb,SJM0QyeA-,1509060000000.0,1518730000000.0,200,DNN Feature Map Compression using Learned Representation over GF(2),"[""denis.gudovskiy@us.panasonic.com"", ""alec.hodgkinson@us.panasonic.com"", ""luca.rigazio@us.panasonic.com""]","[""Denis A. Gudovskiy"", ""Alec Hodgkinson"", ""Luca Rigazio""]","[""feature map"", ""representation"", ""compression"", ""quantization"", ""finite-field""]","In this paper, we introduce a method to compress intermediate feature maps of deep neural networks (DNNs) to decrease memory storage and bandwidth requirements during inference. Unlike previous works, the proposed method is based on converting fixed-point activations into vectors over the smallest GF(2) finite field followed by nonlinear dimensionality reduction (NDR) layers embedded into a DNN. Such an end-to-end learned representation finds more compact feature maps by exploiting quantization redundancies within the fixed-point activations along the channel or spatial dimensions. We apply the proposed network architecture to the tasks of ImageNet classification and PASCAL VOC object detection. 
Compared to prior approaches, the conducted experiments show a factor of 2 decrease in memory requirements with minor degradation in accuracy while adding only bitwise computations.",/pdf/fbb63536bc12c7f8bad6966ded70788620640206.pdf,ICLR,2018,Feature map compression method that converts quantized activations into binary vectors followed by nonlinear dimensionality reduction layers embedded into a DNN +HJWzXsKxx,,1478240000000.0,1478380000000.0,123,Training Long Short-Term Memory With Sparsified Stochastic Gradient Descent,"[""maohuazhu@ece.ucsb.edu"", ""mrhu@nvidia.com"", ""jclemons@nvidia.com"", ""skeckler@nvidia.com"", ""yuanxie@ece.ucsb.edu""]","[""Maohua Zhu"", ""Minsoo Rhu"", ""Jason Clemons"", ""Stephen W. Keckler"", ""Yuan Xie""]","[""Optimization"", ""Deep learning""]","Prior work has demonstrated that exploiting sparsity can dramatically improve the energy efficiency and shrink the memory footprint of Convolutional Neural Networks (CNNs). +However, these sparsity-centric optimization techniques might be less effective for Long Short-Term Memory (LSTM) based Recurrent Neural Networks (RNNs), especially for the training phase, because of the significant structural difference between the neurons. To investigate if there is possible sparsity-centric optimization for training LSTM-based RNNs, we studied several applications and observed that there is potential sparsity in the gradients generated in the backward propagation. In this paper, we illustrate why the sparsity exists and propose a simple yet effective thresholding technique to induce further sparsity when training an LSTM-based RNN. Experimental results show that the proposed technique can increase the sparsity of linear gate gradients to higher than 80\% without loss of performance, which makes more than 50\% of multiply-accumulate (MAC) operations redundant in an entire LSTM training process. 
These redundant MAC operations can be eliminated by hardware techniques to improve the energy efficiency and training speed of LSTM-based RNNs.",/pdf/a26baac5dcca7219fca1435ab9342312c2db22be.pdf,ICLR,2017,A simple yet effective technique to induce considerable amount of sparsity in LSTM training +3tFAs5E-Pe,uLEllMkixZyO,1601310000000.0,1615670000000.0,2976,Continuous Wasserstein-2 Barycenter Estimation without Minimax Optimization,"[""~Alexander_Korotin2"", ""lingxiao@mit.edu"", ""~Justin_Solomon1"", ""~Evgeny_Burnaev1""]","[""Alexander Korotin"", ""Lingxiao Li"", ""Justin Solomon"", ""Evgeny Burnaev""]","[""wasserstein-2 barycenters"", ""non-minimax optimization"", ""cycle-consistency regularizer"", ""input convex neural networks"", ""continuous case""]","Wasserstein barycenters provide a geometric notion of the weighted average of probability measures based on optimal transport. In this paper, we present a scalable algorithm to compute Wasserstein-2 barycenters given sample access to the input measures, which are not restricted to being discrete. While past approaches rely on entropic or quadratic regularization, we employ input convex neural networks and cycle-consistency regularization to avoid introducing bias. As a result, our approach does not resort to minimax optimization. We provide theoretical analysis on error bounds as well as empirical evidence of the effectiveness of the proposed approach in low-dimensional qualitative scenarios and high-dimensional quantitative experiments.",/pdf/e0ff5cb89ad8da4cac3b85587213f35c465757fc.pdf,ICLR,2021,We present a new algorithm to compute Wasserstein-2 barycenters of continuous distributions powered by a straightforward optimization procedure without introducing bias or a generative model. 
+Hkf2_sC5FX,r1lYDHPctm,1538090000000.0,1547030000000.0,392,Efficient Lifelong Learning with A-GEM,"[""arslan.chaudhry@eng.ox.ac.uk"", ""ranzato@fb.com"", ""mrf@fb.com"", ""elhoseiny@fb.com""]","[""Arslan Chaudhry"", ""Marc\u2019Aurelio Ranzato"", ""Marcus Rohrbach"", ""Mohamed Elhoseiny""]","[""Lifelong Learning"", ""Continual Learning"", ""Catastrophic Forgetting"", ""Few-shot Transfer""]","In lifelong learning, the learner is presented with a sequence of tasks, incrementally building a data-driven prior which may be leveraged to speed up learning of a new task. In this work, we investigate the efficiency of current lifelong approaches, in terms of sample complexity, computational and memory cost. Towards this end, we first introduce a new and a more realistic evaluation protocol, whereby learners observe each example only once and hyper-parameter selection is done on a small and disjoint set of tasks, which is not used for the actual learning experience and evaluation. Second, we introduce a new metric measuring how quickly a learner acquires a new skill. Third, we propose an improved version of GEM (Lopez-Paz & Ranzato, 2017), dubbed Averaged GEM (A-GEM), which enjoys the same or even better performance as GEM, while being almost as computationally and memory efficient as EWC (Kirkpatrick et al., 2016) and other regularization-based methods. Finally, we show that all algorithms including A-GEM can learn even more quickly if they are provided with task descriptors specifying the classification tasks under consideration. Our experiments on several standard lifelong learning benchmarks demonstrate that A-GEM has the best trade-off between accuracy and efficiency",/pdf/68b34388f383ea8919ffc158ee58a351da811add.pdf,ICLR,2019,An efficient lifelong learning algorithm that provides a better trade-off between accuracy and time/ memory complexity compared to other algorithms. 
+r1HhRfWRZ,SJqPRGWR-,1509140000000.0,1519850000000.0,1065,Learning Awareness Models,"[""bamos@cs.cmu.edu"", ""dinh.laurent@gmail.com"", ""cabi@google.com"", ""tcr@google.com"", ""sergomez@google.com"", ""alimuldal@google.com"", ""etom@google.com"", ""tassa@google.com"", ""nandodefreitas@google.com"", ""mdenil@google.com""]","[""Brandon Amos"", ""Laurent Dinh"", ""Serkan Cabi"", ""Thomas Roth\u00f6rl"", ""Sergio G\u00f3mez Colmenarejo"", ""Alistair Muldal"", ""Tom Erez"", ""Yuval Tassa"", ""Nando de Freitas"", ""Misha Denil""]","[""Awareness"", ""Prediction"", ""Seq2seq"", ""Robots""]","We consider the setting of an agent with a fixed body interacting with an unknown and uncertain external world. We show that models trained to predict proprioceptive information about the agent's body come to represent objects in the external world. In spite of being trained with only internally available signals, these dynamic body models come to represent external objects through the necessity of predicting their effects on the agent's own body. That is, the model learns holistic persistent representations of objects in the world, even though the only training signals are body signals. Our dynamics model is able to successfully predict distributions over 132 sensor readings over 100 steps into the future and we demonstrate that even when the body is no longer in contact with an object, the latent variables of the dynamics model continue to represent its shape. We show that active data collection by maximizing the entropy of predictions about the body---touch sensors, proprioception and vestibular information---leads to learning of dynamic models that show superior performance when used for control. We also collect data from a real robotic hand and show that the same models can be used to answer questions about properties of objects in the real world. 
Videos with qualitative results of our models are available at https://goo.gl/mZuqAV.",/pdf/16f282c414d23413e543c6c0f48bec64ee4eaf56.pdf,ICLR,2018,We train predictive models on proprioceptive information and show they represent properties of external objects. +ot9bYHvuULl,bv2R5KOmIon,1601310000000.0,1614990000000.0,1176,Augmented Sliced Wasserstein Distances,"[""~Xiongjie_Chen1"", ""~Yongxin_Yang1"", ""~Yunpeng_Li1""]","[""Xiongjie Chen"", ""Yongxin Yang"", ""Yunpeng Li""]",[],"While theoretically appealing, the application of the Wasserstein distance to large-scale machine learning problems has been hampered by its prohibitive computational cost. The sliced Wasserstein distance and its variants improve the computational efficiency through random projection, yet they suffer from low projection efficiency because the majority of projections result in trivially small values. In this work, we propose a new family of distance metrics, called augmented sliced Wasserstein distances (ASWDs), constructed by first mapping samples to higher-dimensional hypersurfaces parameterized by neural networks. It is derived from a key observation that (random) linear projections of samples residing on these hypersurfaces would translate to much more flexible nonlinear projections in the original sample space, so they can capture complex structures of the data distribution. We show that the hypersurfaces can be optimized by gradient ascent efficiently. We provide the condition under which the ASWD is a valid metric and show that this can be obtained by an injective neural network architecture. 
Numerical results demonstrate that the ASWD significantly outperforms other Wasserstein variants for both synthetic and real-world problems.",/pdf/e51097fdc2902d47652acdb308d4a203e5c3dc93.pdf,ICLR,2021, +SkYXvCR6W,SJFXDAA6Z,1508990000000.0,1518730000000.0,107,Compact Encoding of Words for Efficient Character-level Convolutional Neural Networks Text Classification,"[""wemerson_marinho@id.uff.br"", ""lmarti@ic.uff.br"", ""nayat@ime.uerj.br""]","[""Wemerson Marinho"", ""Luis Marti"", ""Nayat Sanchez-pi""]","[""Character Level Convolutional Networks"", ""Text Classification"", ""Word Compressing""]","This paper puts forward a new text to tensor representation that relies on information compression techniques to assign shorter codes to the most frequently used characters. This representation is language-independent with no need of pretraining and produces an encoding with no information loss. It provides an adequate description of the morphology of text, as it is able to represent prefixes, declensions, and inflections with similar vectors and are able to represent even unseen words on the training dataset. Similarly, as it is compact yet sparse, is ideal for speed up training times using tensor processing libraries. As part of this paper, we show that this technique is especially effective when coupled with convolutional neural networks (CNNs) for text classification at character-level. We apply two variants of CNN coupled with it. 
Experimental results show that it drastically reduces the number of parameters to be optimized, resulting in competitive classification accuracy values in only a fraction of the time spent by one-hot encoding representations, thus enabling training in commodity hardware.",/pdf/a397a74b15c74247710fcb2b9e66babaa98dbabb.pdf,ICLR,2018,Using compression techniques to encode words enables faster CNN training and dimensionality reduction of the representation +lJuOUWlAC8i,XZLTZ5iwpX_,1601310000000.0,1614990000000.0,2935,Learning Contextualized Knowledge Graph Structures for Commonsense Reasoning,"[""~Jun_Yan5"", ""mt1170736@iitd.ac.in"", ""zhang-ty17@mails.tsinghua.edu.cn"", ""~Ryan_Rossi1"", ""~Handong_Zhao3"", ""sukim@adobe.com"", ""lipka@adobe.com"", ""~Xiang_Ren1""]","[""Jun Yan"", ""Mrigank Raman"", ""Tianyu Zhang"", ""Ryan Rossi"", ""Handong Zhao"", ""Sungchul Kim"", ""Nedim Lipka"", ""Xiang Ren""]",[],"Recently, neural-symbolic architectures have achieved success on commonsense reasoning through effectively encoding relational structures retrieved from external knowledge graphs (KGs) and obtained state-of-the-art results in tasks such as (commonsense) question answering and natural language inference. However, current neural-symbolic reasoning methods rely on quality and contextualized knowledge structures (i.e., fact triples) that can be retrieved at the pre-processing stage and overlook challenges such as dealing with incompleteness of a KG (low coverage), limited expressiveness of its relations, and irrelevant retrieved facts in the reasoning context. +In this paper, we present a novel neural-symbolic approach, named Hybrid Graph Network (HGN), which jointly generates feature representations for new triples (as complement to the existing edges in the KG), determines relevance of the triples to the reasoning context, and learns graph model parameters for encoding the relational information. 
Our method learns a compact graph structure (comprising both retrieved and generated edges) through filtering edges that are unhelpful to the reasoning process. We show marked improvements on three commonsense reasoning benchmarks and demonstrate the superiority of the learned graph structures with user studies.",/pdf/3f662940cb9772aec838615078bbb78e81459b18.pdf,ICLR,2021, +UV9kN3S4uTZ,pO4Utf8AWxK,1601310000000.0,1614990000000.0,985,Dynamic Relational Inference in Multi-Agent Trajectories,"[""~Ruichao_Xiao1"", ""~Manish_Kumar_Singh1"", ""~Rose_Yu1""]","[""Ruichao Xiao"", ""Manish Kumar Singh"", ""Rose Yu""]","[""deep generative model"", ""relational inference"", ""trajectory modeling"", ""multi-agent learning""]","Unsupervised learning of interactions from multi-agent trajectories has broad applications in physics, vision, and robotics. However, existing neural relational inference works are limited to static relations. We consider a more general setting of dynamic relational inference where interactions change over time. We propose DYnamic multi-Agent Relational Inference (DYARI) model, a deep generative model that can reason about dynamic relations. Using a simulated physics system, we study various dynamic relation scenarios, including periodic and additive dynamics. We perform a comprehensive study on the trade-off between dynamic and inference period, the impact of the training scheme, and model architecture on dynamic relational inference accuracy. 
We also showcase an application of our model to infer coordination and competition patterns from real-world multi-agent basketball trajectories.",/pdf/ca4784b744bf737b89b7f60c9c21bcc84ccf67e1.pdf,ICLR,2021,A deep generative model for dynamically inferring hidden relations in multi-agent trajectories +BJlaG0VFDH,HJx5-7B_Pr,1569440000000.0,1577170000000.0,1018,Decoupling Weight Regularization from Batch Size for Model Compression,"[""dslee3@gmail.com"", ""mogndrewk@gmail.com"", ""quddnr145@gmail.com"", ""dragwon.jeon@gmail.com"", ""qkrqotjd91@gmail.com"", ""yji6373@naver.com"", ""gywei@g.harvard.edu""]","[""Dongsoo Lee"", ""Se Jung Kwon"", ""Byeongwook Kim"", ""Yongkweon Jeon"", ""Baeseong Park"", ""Jeongin Yun"", ""Gu-Yeon Wei""]","[""Model compression"", ""Weight Regularization"", ""Batch Size"", ""Gradient Descent""]","Conventionally, compression-aware training performs weight compression for every mini-batch to compute the impact of compression on the loss function. In this paper, in order to study when would be the right time to compress weights during optimization steps, we propose a new hyper-parameter called Non-Regularization period or NR period during which weights are not updated for regularization. We first investigate the influence of NR period on regularization using weight decay and weight random noise insertion. Throughout various experiments, we show that stronger weight regularization demands longer NR period (regardless of batch size) to best utilize regularization effects. From our empirical evidence, we argue that weight regularization for every mini-batch allows small weight updates only and limited regularization effects such that there is a need to search for right NR period and weight regularization strength to enhance model accuracy. Consequently, NR period becomes especially crucial for model compression where large weight updates are necessary to increase compression ratio. 
Using various models, we show that simple weight updates to comply with compression formats along with long NR period is enough to achieve high compression ratio and model accuracy.",/pdf/3c6282f7e9737d0996fe3241a3c7031ea0850dfe.pdf,ICLR,2020,We show that stronger regularization and high model compression ratio can be achieved when weight updates are conducted less frequently. +SkgGCkrKvH,SklDJDJKwS,1569440000000.0,1583910000000.0,2016,Decentralized Deep Learning with Arbitrary Communication Compression,"[""anastasia.koloskova@epfl.ch"", ""tao.lin@epfl.ch"", ""sebastian.stich@epfl.ch"", ""martin.jaggi@epfl.ch""]","[""Anastasia Koloskova*"", ""Tao Lin*"", ""Sebastian U Stich"", ""Martin Jaggi""]",[],"Decentralized training of deep learning models is a key element for enabling data privacy and on-device learning over networks, as well as for efficient scaling to large compute clusters. As current approaches are limited by network bandwidth, we propose the use of communication compression in the decentralized training context. We show that Choco-SGD achieves linear speedup in the number of workers for arbitrary high compression ratios on general non-convex functions, and non-IID training data. We demonstrate the practical performance of the algorithm in two key scenarios: the training of deep learning models (i) over decentralized user devices, connected by a peer-to-peer network and (ii) in a datacenter. ",/pdf/3934e547bc85fc7fb307d0440ebfc85da4efec1d.pdf,ICLR,2020,"We propose Choco-SGD---decentralized SGD with compressed communication---for non-convex objectives and show its strong performance in various deep learning applications (on-device learning, datacenter case)." +r1iuQjxCZ,rk9umseRW,1509110000000.0,1526900000000.0,369,On the importance of single directions for generalization,"[""arimorcos@google.com"", ""barrettdavid@google.com"", ""ncr@google.com"", ""botvinick@google.com""]","[""Ari S. Morcos"", ""David G.T. Barrett"", ""Neil C. 
Rabinowitz"", ""Matthew Botvinick""]","[""generalization"", ""analysis"", ""deep learning"", ""selectivity""]","Despite their ability to memorize large datasets, deep neural networks often achieve good generalization performance. However, the differences between the learned solutions of networks which generalize and those which do not remain unclear. Additionally, the tuning properties of single directions (defined as the activation of a single unit or some linear combination of units in response to some input) have been highlighted, but their importance has not been evaluated. Here, we connect these lines of inquiry to demonstrate that a network’s reliance on single directions is a good predictor of its generalization performance, across networks trained on datasets with different fractions of corrupted labels, across ensembles of networks trained on datasets with unmodified labels, across different hyperparameters, and over the course of training. While dropout only regularizes this quantity up to a point, batch normalization implicitly discourages single direction reliance, in part by decreasing the class selectivity of individual units. Finally, we find that class selectivity is a poor predictor of task importance, suggesting not only that networks which generalize well minimize their dependence on individual units by reducing their selectivity, but also that individually selective units may not be necessary for strong network performance.",/pdf/4c7b6879ab4b04d94e5160447cc5092552013f13.pdf,ICLR,2018,"We find that deep networks which generalize poorly are more reliant on single directions than those that generalize well, and evaluate the impact of dropout and batch normalization, as well as class selectivity on single direction reliance." 
+HJeYalBKvr,HJgNIZbYPH,1569440000000.0,1577170000000.0,2581,Attention over Phrases,"[""cui.wanyun@sufe.edu.cn""]","[""Wanyun Cui""]","[""representation learning"", ""natural language processing"", ""attention""]","How to represent the sentence ``That's the last straw for her''? The answer of the self-attention is a weighted sum of the individual words, i.e. $$semantics=\alpha_1Emb(\text{That})+\alpha_2Emb(\text{'s})+\cdots+\alpha_nEmb(\text{her})$$. But the weighted sum of ``That's'', ``the'', ``last'', ``straw'' can hardly represent the semantics of the phrase. We argue that the phrases play an important role in attention. +If we combine some words into phrases, a more reasonable representation with compositions is +$$semantics=\alpha_1Emb(\text{That's})+\alpha_2Emb(\text{the last straw})+\alpha_3Emb(\text{for})+\alpha_4Emb(\text{her})$$. +While recent studies prefer to use the attention mechanism to represent the natural language, few noticed the word compositions. In this paper, we study the problem of representing such compositional attentions in phrases. We propose a new attention architecture called HyperTransformer. Besides representing the words of the sentence, we introduce hypernodes to represent the candidate phrases in attention. +HyperTransformer has two phases. The first phase is used to attend over all word/phrase pairs, which is similar to the standard Transformer. The second phase is used to represent the inductive bias within each phrase. Specifically, we incorporate the non-linear attention in the second phase. The non-linearity represents the semantic mutations in phrases. The experimental performance has been greatly improved. In WMT16 English-German translation task, the BLEU increases from 20.90 (by Transformer) to 34.61 (by HyperTransformer).",/pdf/e17d21e6827ee934813c066a345f6f2f059ae0ea.pdf,ICLR,2020,We study the problem of representing phrases as atoms in attention. 
+HkzuKpLgg,,1478050000000.0,1483090000000.0,37,Efficient Communications in Training Large Scale Neural Networks,"[""linnan.wang@gatech.edu"", ""wwu12@vols.utk.edu"", ""bosilca@icl.utk.edu"", ""richie@cc.gatech.edu"", ""zlxu@uestc.edu.cn""]","[""Linnan Wang"", ""Wei Wu"", ""George Bosilca"", ""Richard Vuduc"", ""Zenglin Xu""]","[""Applications"", ""Deep learning""]","We consider the problem of how to reduce the cost of communication that is required for the parallel training of a neural network. The state-of-the-art method, Bulk Synchronous Parallel Stochastic Gradient Descent (BSP-SGD), requires many collective communication operations, like broadcasts of parameters or reductions for sub-gradient aggregations, which for large messages quickly dominate overall execution time and limit parallel scalability. To address this problem, we develop a new technique for collective operations, referred to as Linear Pipelining (LP). It is tuned to the message sizes that arise in BSP-SGD, and works effectively on multi-GPU systems. Theoretically, the cost of LP is invariant to P, where P is the number of GPUs, while the cost of more conventional Minimum Spanning Tree (MST) scales like O(log P). LP also demonstrates up to 2x higher bandwidth than Bidirectional Exchange (BE) techniques that are widely adopted by current MPI implementations. 
We apply these collectives to BSP-SGD, showing that the proposed implementations reduce communication bottlenecks in practice while preserving the attractive convergence properties of BSP-SGD.",/pdf/3d5a4f6021bd32220713f2b41fd36d18d83110a2.pdf,ICLR,2017,Tackle the communications in the parallel training of neural networks +BJgEjiRqYX,SkeW1cicKm,1538090000000.0,1545360000000.0,614,A Case for Object Compositionality in Deep Generative Models of Images,"[""sjoerd@idsia.ch"", ""kkurach@gmail.com"", ""sylvain.gelly@gmail.com""]","[""Sjoerd van Steenkiste"", ""Karol Kurach"", ""Sylvain Gelly""]","[""Objects"", ""Compositionality"", ""Generative Models"", ""GAN"", ""Unsupervised Learning""]","Deep generative models seek to recover the process with which the observed data was generated. They may be used to synthesize new samples or to subsequently extract representations. Successful approaches in the domain of images are driven by several core inductive biases. However, a bias to account for the compositional way in which humans structure a visual scene in terms of objects has frequently been overlooked. In this work we propose to structure the generator of a GAN to consider objects and their relations explicitly, and generate images by means of composition. This provides a way to efficiently learn a more accurate generative model of real-world images, and serves as an initial step towards learning corresponding object representations. We evaluate our approach on several multi-object image datasets, and find that the generator learns to identify and disentangle information corresponding to different objects at a representational level. 
A human study reveals that the resulting generative model is better at generating images that are more faithful to the reference distribution.",/pdf/f81142574e181abd9de705e70891a6f48d107e8f.pdf,ICLR,2019,"We propose to structure the generator of a GAN to consider objects and their relations explicitly, and generate images by means of composition" +SkVRTj0cYQ,rJlLsXTqt7,1538090000000.0,1545360000000.0,853,Differentially Private Federated Learning: A Client Level Perspective,"[""geyerr@ethz.ch"", ""tassilo.klein@sap.com"", ""moin.nabi@sap.com""]","[""Robin C. Geyer"", ""Tassilo J. Klein"", ""Moin Nabi""]","[""Machine Learning"", ""Federated Learning"", ""Privacy"", ""Security"", ""Differential Privacy""]","Federated learning is a recent advance in privacy protection. +In this context, a trusted curator aggregates parameters optimized in decentralized fashion by multiple clients. The resulting model is then distributed back to all clients, ultimately converging to a joint representative model without explicitly having to share the data. +However, the protocol is vulnerable to differential attacks, which could originate from any party contributing during federated optimization. In such an attack, a client's contribution during training and information about their data set is revealed through analyzing the distributed model. +We tackle this problem and propose an algorithm for client sided differential privacy preserving federated optimization. The aim is to hide clients' contributions during training, balancing the trade-off between privacy loss and model performance. +Empirical studies suggest that given a sufficiently large number of participating clients, our proposed procedure can maintain client-level differential privacy at only a minor cost in model performance. ",/pdf/b0820366c53400ca4be650c144b551219c0b5fe2.pdf,ICLR,2019,Ensuring that models learned in federated fashion do not reveal a client's participation. 
+XZzriKGEj0_,USLdMRVPDN,1601310000000.0,1614990000000.0,1064,Learning What Not to Model: Gaussian Process Regression with Negative Constraints,"[""~Gaurav_Shrivastava1"", ""~Harsh_Shrivastava1"", ""~Abhinav_Shrivastava2""]","[""Gaurav Shrivastava"", ""Harsh Shrivastava"", ""Abhinav Shrivastava""]","[""Gaussian Process"", ""Gaussian Process Regression""]","Gaussian Process (GP) regression fits a curve on a set of datapairs, with each pair consisting of an input point '$\mathbf{x}$' and its corresponding target regression value '$y(\mathbf{x})$' (a positive datapair). But, what if for an input point '$\bar{\mathbf{x}}$', we want to constrain the GP to avoid a target regression value '$\bar{y}(\bar{\mathbf{x}})$' (a negative datapair)? This requirement can often appear in real-world navigation tasks, where an agent would want to avoid obstacles, like furniture items in a room when planning a trajectory to navigate. In this work, we propose to incorporate such negative constraints in a GP regression framework. Our approach, 'GP-NC' or Gaussian Process with Negative Constraints, fits over the positive datapairs while avoiding the negative datapairs. Specifically, our key idea is to model the negative datapairs using small blobs of Gaussian distribution and maximize its KL divergence from the GP. We jointly optimize the GP-NC for both the positive and negative datapairs. 
We empirically demonstrate that our GP-NC framework performs better than the traditional GP learning and that our framework does not affect the scalability of Gaussian Process regression and helps the model converge faster as the size of the data increases.",/pdf/231094d70f0ad58be0b246f7933d53b364f5ac2d.pdf,ICLR,2021, +trj4iYJpIvy,4d60yXZoWDf,1601310000000.0,1614990000000.0,2624,Approximation Algorithms for Sparse Principal Component Analysis,"[""~Agniva_Chowdhury1"", ""~Petros_Drineas1"", ""~David_Woodruff2"", ""~Samson_Zhou1""]","[""Agniva Chowdhury"", ""Petros Drineas"", ""David Woodruff"", ""Samson Zhou""]","[""Sparse PCA"", ""Principal component analysis"", ""Randomized linear algebra"", ""Singular value decomposition""]","Principal component analysis (PCA) is a widely used dimension reduction technique in machine learning and multivariate statistics. To improve the interpretability of PCA, various approaches to obtain sparse principal direction loadings have been proposed, which are termed Sparse Principal Component Analysis (SPCA). In this paper, we present three provably accurate, polynomial time, approximation algorithms for the SPCA problem, without imposing any restrictive assumptions on the input covariance matrix. The first algorithm is based on randomized matrix multiplication; the second algorithm is based on a novel deterministic thresholding scheme; and the third algorithm is based on a semidefinite programming relaxation of SPCA. All algorithms come with provable guarantees and run in low-degree polynomial time. Our empirical evaluations confirm our theoretical findings.",/pdf/d435a7b3177042c2b73d699657b87ceac60b39b9.pdf,ICLR,2021,"We present three provably accurate approximation algorithms for the Sparse Principal Component Analysis (SPCA) problem, without imposing any restrictive assumptions on the input covariance matrix." 
+HylloR4YDr,rkxg7gKOPH,1569440000000.0,1577170000000.0,1306,Learning Latent Representations for Inverse Dynamics using Generalized Experiences,"[""amavalan@eng.ucsd.edu"", ""sicung@ucsd.edu""]","[""Aditi Mavalankar"", ""Sicun Gao""]","[""deep reinforcement learning"", ""continuous control"", ""inverse dynamics model""]","Many practical robot locomotion tasks require agents to use control policies that can be parameterized by goals. Popular deep reinforcement learning approaches in this direction involve learning goal-conditioned policies or value functions, or Inverse Dynamics Models (IDMs). IDMs map an agent’s current state and desired goal to the required actions. We show that the key to achieving good performance with IDMs lies in learning the information shared between equivalent experiences, so that they can be generalized to unseen scenarios. We design a training process that guides the learning of latent representations to encode this shared information. Using a limited number of environment interactions, our agent is able to efficiently navigate to arbitrary points in the goal space. We demonstrate the effectiveness of our approach in high-dimensional locomotion environments such as the Mujoco Ant, PyBullet Humanoid, and PyBullet Minitaur. We provide quantitative and qualitative results to show that our method clearly outperforms competing baseline approaches.",/pdf/53ea301f34105f09f9c3f551f771a2943cd571d7.pdf,ICLR,2020,"We show that the key to achieving good performance with IDMs lies in learning latent representations to encode the information shared between equivalent experiences, so that they can be generalized to unseen scenarios." 
+B1xv9pEKDS,Hyg6ACTwvr,1569440000000.0,1577170000000.0,710,LightPAFF: A Two-Stage Distillation Framework for Pre-training and Fine-tuning,"[""kt.song@njust.edu.cn"", ""sigmeta@pku.edu.cn"", ""xuta@microsoft.com"", ""taoqin@microsoft.com"", ""lujf@njust.edu.cn"", ""liuhz@pku.edu.cn"", ""tyliu@microsoft.com""]","[""Kaitao Song"", ""Hao Sun"", ""Xu Tan"", ""Tao Qin"", ""Jianfeng Lu"", ""Hongzhi Liu"", ""Tie-Yan Liu""]","[""Knowledge Distillation"", ""Pre-training"", ""Fine-tuning"", ""BERT"", ""GPT-2"", ""MASS""]","While pre-training and fine-tuning, e.g., BERT~\citep{devlin2018bert}, GPT-2~\citep{radford2019language}, have achieved great success in language understanding and generation tasks, the pre-trained models are usually too big for online deployment in terms of both memory cost and inference speed, which hinders them from practical online usage. In this paper, we propose LightPAFF, a Lightweight Pre-training And Fine-tuning Framework that leverages two-stage knowledge distillation to transfer knowledge from a big teacher model to a lightweight student model in both pre-training and fine-tuning stages. In this way the lightweight model can achieve similar accuracy as the big teacher model, but with much fewer parameters and thus faster online inference speed. LightPAFF can support different pre-training methods (such as BERT, GPT-2 and MASS~\citep{song2019mass}) and be applied to many downstream tasks. 
Experiments on three language understanding tasks, three language modeling tasks and three sequence to sequence generation tasks demonstrate that while achieving similar accuracy with the big BERT, GPT-2 and MASS models, LightPAFF reduces the model size by nearly 5x and improves online inference speed by 5x-7x.",/pdf/5d7238be8a10723ca003760ee05bcb02d9c86e93.pdf,ICLR,2020, +k9EHBqXDEOX,GnlZU8Pfu7s,1601310000000.0,1614990000000.0,1017,Asynchronous Advantage Actor Critic: Non-asymptotic Analysis and Linear Speedup,"[""shenh5@rpi.edu"", ""~Kaiqing_Zhang3"", ""~Mingyi_Hong1"", ""~Tianyi_Chen1""]","[""Han Shen"", ""Kaiqing Zhang"", ""Mingyi Hong"", ""Tianyi Chen""]",[],"Asynchronous and parallel implementation of standard reinforcement learning (RL) algorithms is a key enabler of the tremendous success of modern RL. +Among many asynchronous RL algorithms, arguably the most popular and effective one is the asynchronous advantage actor-critic (A3C) algorithm. +Although A3C is becoming the workhorse of RL, its theoretical properties are still not well-understood, including the non-asymptotic analysis and the performance gain of parallelism (a.k.a. speedup). +This paper revisits the A3C algorithm with TD(0) for the critic update, termed A3C-TD(0), with provable convergence guarantees. +With linear value function approximation for the TD update, the convergence of A3C-TD(0) is established under both i.i.d. and Markovian sampling. Under i.i.d. sampling, A3C-TD(0) obtains sample complexity of $\mathcal{O}(\epsilon^{-2.5}/N)$ per worker to achieve $\epsilon$ accuracy, where $N$ is the number of workers. Compared to the best-known sample complexity of $\mathcal{O}(\epsilon^{-2.5})$ for two-timescale AC, A3C-TD(0) achieves \emph{linear speedup}, which justifies the advantage of parallelism and asynchrony in AC algorithms theoretically for the first time. 
+Numerical tests on synthetically generated instances and OpenAI Gym environments have been provided to verify our theoretical analysis. ",/pdf/515aef776f89b4ce75cf432f75f79adce42390d9.pdf,ICLR,2021, +Du7s5ukNKz,CD2l4b0G2N5,1601310000000.0,1614990000000.0,1366,Policy Learning Using Weak Supervision,"[""~Jingkang_Wang1"", ""~Hongyi_Guo1"", ""~Zhaowei_Zhu1"", ""~Yang_Liu3""]","[""Jingkang Wang"", ""Hongyi Guo"", ""Zhaowei Zhu"", ""Yang Liu""]","[""Weak Supervision"", ""Policy Learning"", ""Correlated Agreement""]","Most existing policy learning solutions require the learning agents to receive high-quality supervision signals, e.g., rewards in reinforcement learning (RL) or high-quality expert demonstrations in behavior cloning (BC). These quality supervisions are either infeasible or prohibitively expensive to obtain in practice. We aim for a unified framework that leverages the weak supervisions to perform policy learning efficiently. To handle this problem, we treat the ``weak supervisions'' as imperfect information coming from a \emph{peer agent}, and evaluate the learning agent's policy based on a ``correlated agreement'' with the peer agent's policy (instead of simple agreements). Our way of leveraging peer agent's information offers us a family of solutions that learn effectively from weak supervisions with theoretical guarantees. Extensive evaluations on tasks including RL with noisy reward, BC with weak demonstrations, and standard policy co-training (RL + BC) show that the proposed approach leads to substantial improvements, especially when the complexity or the noise of the learning environments grows. ",/pdf/ee5c56f4a598020e7a9cfb8282565e68508488ca.pdf,ICLR,2021,"We introduce a novel framework to handle the common ""weakly supervision"" problem in policy learning based on a correlated agreement." 
+HygYqs0qKX,BJgYVAI5KX,1538090000000.0,1545360000000.0,554,Conscious Inference for Object Detection,"[""zhoujh09@gmail.com"", ""nikolaos.karianakis@microsoft.com"", ""yingwu@eecs.northwestern.edu"", ""ganghua@gmail.com""]","[""Jiahuan Zhou"", ""Nikolaos Karianakis"", ""Ying Wu"", ""Gang Hua""]","[""consciousness"", ""conscious inference"", ""object detection"", ""object pose estimation""]","Current Convolutional Neural Network (CNN)-based object detection models adopt strictly feedforward inference to predict the final detection results. However, the widely used one-way inference is agnostic to the global image context and the interplay between input image and task semantics. In this work, we present a general technique to improve off-the-shelf CNN-based object detection models in the inference stage without re-training, architecture modification or ground-truth requirements. We propose an iterative, bottom-up and top-down inference mechanism, which is named conscious inference, as it is inspired by prevalent models for human consciousness with top-down guidance and temporal persistence. While the downstream pass accumulates category-specific evidence over time, it subsequently affects the proposal calculation and the final detection. Feature activations are updated in line with no additional memory cost. 
Our approach advances the state of the art using popular detection models (Faster-RCNN, YOLOv2, YOLOv3) on 2D object detection and 6D object pose estimation.",/pdf/0f74b0628cab25f337fa905145d5566ad614179f.pdf,ICLR,2019, +Q1aiM7sCi1,c-eLVpphreP,1601310000000.0,1614990000000.0,890,Fuzzy c-Means Clustering for Persistence Diagrams,"[""~Thomas_Davies1"", ""jack.aspinall@materials.ox.ac.uk"", ""~Bryan_Wilder1"", ""long.tran-thanh@warwick.ac.uk""]","[""Thomas Davies"", ""Jack Aspinall"", ""Bryan Wilder"", ""Long Tran-Thanh""]","[""Topological data analysis"", ""fuzzy clustering""]","Persistence diagrams concisely represent the topology of a point cloud whilst having strong theoretical guarantees. Most current approaches to integrating topological information into machine learning implicitly map persistence diagrams to a Hilbert space, resulting in deformation of the underlying metric structure whilst also generally requiring prior knowledge about the true topology of the space. In this paper we give an algorithm for Fuzzy c-Means (FCM) clustering directly on the space of persistence diagrams, enabling unsupervised learning that automatically captures the topological structure of data, with no prior knowledge or additional processing of persistence diagrams. We prove the same convergence guarantees as traditional FCM clustering: every convergent subsequence of iterates tends to a local minimum or saddle point. 
We end by presenting experiments where our fuzzy topological clustering algorithm allows for unsupervised top-$k$ candidate selection in settings where (i) the properties of persistence diagrams make them the natural choice over geometric equivalents, and (ii) the probabilistic membership values let us rank candidates in settings where verifying candidate suitability is expensive: lattice structure classification in materials science and pre-trained model selection in machine learning.",/pdf/f39b11858b9f1b4e4e984179ce05c1625eeb1384.pdf,ICLR,2021,"We develop fuzzy clustering for the space of persistence diagrams, with experiments on lattice structures and decision boundaries." +rkgOLb-0W,r1JuIbW0-,1509130000000.0,1519020000000.0,679,Neural Language Modeling by Jointly Learning Syntax and Lexicon,"[""yikang.shn@gmail.com"", ""lin.zhouhan@gmail.com"", ""chin-wei.huang@umontreal.ca"", ""aaron.courville@gmail.com""]","[""Yikang Shen"", ""Zhouhan Lin"", ""Chin-wei Huang"", ""Aaron Courville""]","[""Language model"", ""unsupervised parsing""]","We propose a neural language model capable of unsupervised syntactic structure induction. The model leverages the structure information to form better semantic representations and better language modeling. Standard recurrent neural networks are limited by their structure and fail to efficiently use syntactic information. On the other hand, tree-structured recursive networks usually require additional structural supervision at the cost of human expert annotation. In this paper, We propose a novel neural language model, called the Parsing-Reading-Predict Networks (PRPN), that can simultaneously induce the syntactic structure from unannotated sentences and leverage the inferred structure to learn a better language model. In our model, the gradient can be directly back-propagated from the language model loss into the neural parsing network. 
Experiments show that the proposed model can discover the underlying syntactic structure and achieve state-of-the-art performance on word/character-level language model tasks.",/pdf/bbdc41bc5d2c333055d978fe42d3fc95a3628d1e.pdf,ICLR,2018,"In this paper, we propose a novel neural language model, called the Parsing-Reading-Predict Networks (PRPN), that can simultaneously induce the syntactic structure from unannotated sentences and leverage the inferred structure to learn a better language model." +wiSgdeJ29ee,iihq59GcJzd,1601310000000.0,1614990000000.0,2555,Fine-Tuning Offline Reinforcement Learning with Model-Based Policy Optimization,"[""~Adam_Villaflor1"", ""~John_Dolan1"", ""~Jeff_Schneider1""]","[""Adam Villaflor"", ""John Dolan"", ""Jeff Schneider""]","[""Offline Reinforcement Learning"", ""Model-Based Reinforcement Learning"", ""Off-policy Reinforcement Learning"", ""uncertainty estimation""]","In offline reinforcement learning (RL), we attempt to learn a control policy from a fixed dataset of environment interactions. This setting has the potential benefit of allowing us to learn effective policies without needing to collect additional interactive data, which can be expensive or dangerous in real-world systems. However, traditional off-policy RL methods tend to perform poorly in this setting due to the distributional shift between the fixed data set and the learned policy. In particular, they tend to extrapolate optimistically and overestimate the action-values outside of the dataset distribution. Recently, two major avenues have been explored to address this issue. The first is behavior-regularized methods that penalize actions that deviate from the demonstrated action distribution. The second is uncertainty-aware model-based (MB) methods that discourage state-actions where the dynamics are uncertain. In this work, we propose an algorithmic framework that consists of two stages. 
In the first stage, we train a policy using behavior-regularized model-free RL on the offline dataset. Then, a second stage where we fine-tune the policy using our novel Model-Based Behavior-Regularized Policy Optimization (MB2PO) algorithm. We demonstrate that for certain tasks and dataset distributions our conservative model-based fine-tuning can greatly increase performance and allow the agent to generalize and outperform the demonstrated behavior. We evaluate our method on a variety of the Gym-MuJoCo tasks in the D4RL benchmark and demonstrate that our method is competitive and in some cases superior to the state of the art for most of the evaluated tasks.",/pdf/f55d7696304a1c3b97065934b32f21f128a7a8f7.pdf,ICLR,2021,We present an offline RL approach that leverages both uncertainty-aware models and behavior-regularized model-free RL to achieve state of the art results on the MuJoCo tasks in the D4RL benchmark. +Ysuv-WOFeKR,cN6uH2XQa8v,1601310000000.0,1616010000000.0,2407,Parrot: Data-Driven Behavioral Priors for Reinforcement Learning,"[""~Avi_Singh1"", ""~Huihan_Liu1"", ""~Gaoyue_Zhou1"", ""~Albert_Yu1"", ""~Nicholas_Rhinehart1"", ""~Sergey_Levine1""]","[""Avi Singh"", ""Huihan Liu"", ""Gaoyue Zhou"", ""Albert Yu"", ""Nicholas Rhinehart"", ""Sergey Levine""]","[""reinforcement learning"", ""imitation learning""]","Reinforcement learning provides a general framework for flexible decision making and control, but requires extensive data collection for each new task that an agent needs to learn. In other machine learning fields, such as natural language processing or computer vision, pre-training on large, previously collected datasets to bootstrap learning for new tasks has emerged as a powerful paradigm to reduce data requirements when learning a new task. In this paper, we ask the following question: how can we enable similarly useful pre-training for RL agents? 
We propose a method for pre-training behavioral priors that can capture complex input-output relationships observed in successful trials from a wide range of previously seen tasks, and we show how this learned prior can be used for rapidly learning new tasks without impeding the RL agent's ability to try out novel behaviors. We demonstrate the effectiveness of our approach in challenging robotic manipulation domains involving image observations and sparse reward functions, where our method outperforms prior works by a substantial margin. Additional materials can be found on our project website: https://sites.google.com/view/parrot-rl",/pdf/7ace651a0c4f40194198a9ab3ea9aefadb191c77.pdf,ICLR,2021,"We propose a method for pre-training a prior for reinforcement learning using data from a diverse range of tasks, and use this prior to speed up learning of new tasks. " +fGiKxvF-eub,fF4YHS7C6a7,1601310000000.0,1614990000000.0,1438,Oblivious Sketching-based Central Path Method for Solving Linear Programming Problems,"[""~Zhao_Song3"", ""~Zheng_Yu1""]","[""Zhao Song"", ""Zheng Yu""]","[""optimization"", ""sketching"", ""linear programming"", ""central path method"", ""running time complexity""]","In this work, we propose a sketching-based central path method for solving linear programs, whose running time matches the state-of-the-art results [Cohen, Lee, Song STOC 19; Lee, Song, Zhang COLT 19]. Our method opens up the iterations of the central path method and deploys an ""iterate and sketch"" approach towards the problem by introducing a new coordinate-wise embedding technique, which may be of independent interest. Among previous methods, the work [Cohen, Lee, Song STOC 19] enjoys feasibility while being non-oblivious, and [Lee, Song, Zhang COLT 19] is oblivious but infeasible, and relies on $\mathit{dense}$ sketching matrices such as subsampled randomized Hadamard/Fourier transform matrices. 
Our method enjoys the benefits of being both oblivious and feasible, and can use $\mathit{sparse}$ sketching matrix [Nelson, Nguyen FOCS 13] to speed up the online matrix-vector multiplication. Our framework for solving LP naturally generalizes to a broader class of convex optimization problems including empirical risk minimization.",/pdf/7e5939114af2f70c782b3fab5a66616f5599443d.pdf,ICLR,2021,"We propose a sketching-based central path method for solving linear programs, which has the same running time as the state of art algorithms and enjoys the advantages of being ""oblivious"" and ""feasible""." +SXoheAR0Gz,Dge7ETpOxb,1601310000000.0,1614990000000.0,650,Fast Partial Fourier Transform,"[""~Yong-chan_Park1"", ""~Jun-Gi_Jang1"", ""~U_Kang1""]","[""Yong-chan Park"", ""Jun-Gi Jang"", ""U Kang""]","[""Fourier transform"", ""time series"", ""signal processing"", ""anomaly detection"", ""machine learning""]","Given a time-series vector, how can we efficiently compute a specified part of Fourier coefficients? Fast Fourier transform (FFT) is a widely used algorithm that computes the discrete Fourier transform in many machine learning applications. Despite the pervasive use, FFT algorithms do not provide a fine-tuning option for the user to specify one’s demand, that is, the output size (the number of Fourier coefficients to be computed) is algorithmically determined by the input size. Such a lack of flexibility is often followed by just discarding the unused coefficients because many applications do not require the whole spectrum of the frequency domain, resulting in an inefficiency due to the extra computation. +In this paper, we propose a fast Partial Fourier Transform (PFT), an efficient algorithm for computing only a part of Fourier coefficients. PFT approximates a part of twiddle factors (trigonometric constants) using polynomials, thereby reducing the computational complexity due to the mixture of many twiddle factors. 
We derive the asymptotic time complexity of PFT with respect to input and output sizes, as well as its numerical accuracy. Experimental results show that PFT outperforms the current state-of-the-art algorithms, with an order of magnitude of speedup for sufficiently small output sizes without sacrificing accuracy.",/pdf/87f2c4e64fdf1568711d30f70d7f28b1fd8420ff.pdf,ICLR,2021,"We propose a fast Partial Fourier Transform (PFT), an efficient algorithm for computing only a part of Fourier coefficients. " +AcH9xD24Hd,IfyGeHnRfNW,1601310000000.0,1614990000000.0,3055,Learning the Step-size Policy for the Limited-Memory Broyden-Fletcher-Goldfarb-Shanno Algorithm,"[""~Lucas_N._Egidio1"", ""anders.g.hansson@liu.se"", ""~Bo_Wahlberg1""]","[""Lucas N. Egidio"", ""Anders Hansson"", ""Bo Wahlberg""]","[""Unconstrained optimization"", ""Step-size policy"", ""L-BFGS"", ""Learned optimizers""]","We consider the problem of how to learn a step-size policy for the Limited-Memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm. This is a limited computational memory quasi-Newton method widely used for deterministic unconstrained optimization but currently avoided in large-scale problems for requiring step sizes to be provided at each iteration. Existing methodologies for the step size selection for L-BFGS use heuristic tuning of design parameters and massive re-evaluations of the objective function and gradient to find appropriate step-lengths. We propose a neural network architecture with local information of the current iterate as the input. The step-length policy is learned from data of similar optimization problems, avoids additional evaluations of the objective function, and guarantees that the output step remains inside a pre-defined interval. The corresponding training procedure is formulated as a stochastic optimization problem using the backpropagation through time algorithm. 
The performance of the proposed method is evaluated on the training of classifiers for the MNIST database for handwritten digits and for CIFAR-10. The results show that the proposed algorithm outperforms heuristically tuned optimizers such as ADAM, RMSprop, L-BFGS with a backtracking line search and L-BFGS with a constant step size. The numerical results also show that a learned policy can be used as a warm-start to train new policies for different problems after a few additional training steps, highlighting its potential use in multiple large-scale optimization problems.",/pdf/5c468ab23138819aaafe51287d7fcabd150d5c4e.pdf,ICLR,2021,A framework to automatically learn a policy from data that generates step sizes for the Limited-Memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm and performs better than heuristically tuned ADAM and RMSprop in tests on MNIST dataset. +naSAkn2Xo46,F1Kui_zrl0H,1601310000000.0,1614990000000.0,3311,Factored Action Spaces in Deep Reinforcement Learning,"[""~Thomas_PIERROT1"", ""~Valentin_Mac\u00e91"", ""jb.sevestre@instadeep.com"", ""l.monier@instadeep.com"", ""~Alexandre_Laterre1"", ""~Nicolas_Perrin1"", ""kb@instadeep.com"", ""~Olivier_Sigaud1""]","[""Thomas PIERROT"", ""Valentin Mac\u00e9"", ""Jean-Baptiste Sevestre"", ""Louis Monier"", ""Alexandre Laterre"", ""Nicolas Perrin"", ""Karim Beguir"", ""Olivier Sigaud""]","[""Deep Reinforcement Learning"", ""Large action spaces"", ""Parameterized action spaces"", ""Multi-Agent"", ""Continuous Control""]","Very large action spaces constitute a critical challenge for deep Reinforcement Learning (RL) algorithms. An existing approach consists in splitting the action space into smaller components and choosing either independently or sequentially actions in each dimension. This approach led to astonishing results for the StarCraft and Dota 2 games, however it remains underexploited and understudied. 
In this paper, we name this approach Factored Actions Reinforcement Learning (FARL) and study both its theoretical impact and practical use. Notably, we provide a theoretical analysis of FARL on the Proximal Policy Optimization (PPO) and Soft Actor Critic (SAC) algorithms and evaluate these agents in different classes of problems. We show that FARL is a very versatile and efficient approach to combinatorial and continuous control problems.",/pdf/bbaa64c47169d91cd8ec29389c59a5bdb14cd4a2.pdf,ICLR,2021,We propose a theoretical study as well as practical tips and applications of action spaces factorization in deep Reinforcement Learning. +rk8wKk-R-,HyVDYybAZ,1509120000000.0,1518730000000.0,501,Convolutional Sequence Modeling Revisited,"[""shaojieb@cs.cmu.edu"", ""zkolter@cs.cmu.edu"", ""vkoltun@gmail.com""]","[""Shaojie Bai"", ""J. Zico Kolter"", ""Vladlen Koltun""]","[""Temporal Convolutional Network"", ""Sequence Modeling"", ""Deep Learning""]","This paper revisits the problem of sequence modeling using convolutional +architectures. Although both convolutional and recurrent architectures have a +long history in sequence prediction, the current ""default"" mindset in much of +the deep learning community is that generic sequence modeling is best handled +using recurrent networks. The goal of this paper is to question this assumption. +Specifically, we consider a simple generic temporal convolution network (TCN), +which adopts features from modern ConvNet architectures such as a dilations and +residual connections. We show that on a variety of sequence modeling tasks, +including many frequently used as benchmarks for evaluating recurrent networks, +the TCN outperforms baseline RNN methods (LSTMs, GRUs, and vanilla RNNs) and +sometimes even highly specialized approaches. 
We further show that the +potential ""infinite memory"" advantage that RNNs have over TCNs is largely +absent in practice: TCNs indeed exhibit longer effective history sizes than their +recurrent counterparts. As a whole, we argue that it may be time to (re)consider +ConvNets as the default ""go to"" architecture for sequence modeling.",/pdf/4f72202970a15efa844bae2729d7d0a4f2ed503a.pdf,ICLR,2018,We argue that convolutional networks should be considered the default starting point for sequence modeling tasks. +iEcqwosBEgx,dFPkH7lIqlD,1601310000000.0,1614990000000.0,1214,Novel Policy Seeking with Constrained Optimization,"[""~Hao_Sun3"", ""~Zhenghao_Peng1"", ""~Bo_Dai2"", ""guoj@pcl.ac.cn"", ""~Dahua_Lin1"", ""~Bolei_Zhou5""]","[""Hao Sun"", ""Zhenghao Peng"", ""Bo Dai"", ""Jian Guo"", ""Dahua Lin"", ""Bolei Zhou""]","[""Novel Policy Seeking"", ""Reinforcement Learning"", ""Constrained Optimization""]","We address the problem of seeking novel policies in reinforcement learning tasks. Instead of following the multi-objective framework commonly used in existing methods, we propose to rethink the problem under a novel perspective of constrained optimization. We at first introduce a new metric to evaluate the difference between policies, and then design two practical novel policy seeking methods following the new perspective, namely the Constrained Task Novel Bisector (CTNB), and the Interior Policy Differentiation (IPD), corresponding to the feasible direction method and the interior point method commonly known in the constrained optimization literature. 
Experimental comparisons on the MuJuCo control suite show our methods can achieve substantial improvements over previous novelty-seeking methods in terms of both the novelty of policies and their performances in the primal task.",/pdf/4005ceb2ae6b55f6d6b67ea5027929352e3b024f.pdf,ICLR,2021,We address the problem of seeking novel policies in reinforcement learning tasks with constrained optimization to generate well-performed diverse policies. +B1GOWV5eg,,1478280000000.0,1488520000000.0,206,Learning to Repeat: Fine Grained Action Repetition for Deep Reinforcement Learning,"[""ssahil08@gmail.com"", ""aravindsrinivas@gmail.com"", ""ravi@cse.iitm.ac.in""]","[""Sahil Sharma"", ""Aravind S. Lakshminarayanan"", ""Balaraman Ravindran""]","[""Deep learning"", ""Reinforcement Learning""]","Reinforcement Learning algorithms can learn complex behavioral patterns for sequential decision making tasks wherein an agent interacts with an environment and acquires feedback in the form of rewards sampled from it. Traditionally, such algorithms make decisions, i.e., select actions to execute, at every single time step of the agent-environment interactions. In this paper, we propose a novel framework, Fine Grained Action Repetition (FiGAR), which enables the agent to decide the action as well as the time scale of repeating it. +FiGAR can be used for improving any Deep Reinforcement Learning algorithm which maintains an explicit policy estimate by enabling temporal abstractions in the action space and implicitly enabling planning through sequences of repetitive macro-actions. +We empirically demonstrate the efficacy of our framework by showing performance improvements on top of three policy search algorithms in different domains: Asynchronous Advantage Actor Critic in the Atari 2600 domain, Trust Region Policy Optimization in Mujoco domain and Deep Deterministic Policy Gradients in the TORCS car racing domain. 
+",/pdf/0c28b4c7497552fcfcfd483485ab478635ad765d.pdf,ICLR,2017,Framework for temporal abstractions in policy space by learning to repeat actions +UuchYL8wSZo,p2vMZPkSghd,1601310000000.0,1615850000000.0,949,Learning Generalizable Visual Representations via Interactive Gameplay,"[""~Luca_Weihs1"", ""~Aniruddha_Kembhavi1"", ""~Kiana_Ehsani1"", ""~Sarah_M_Pratt1"", ""winsonh@allenai.org"", ""alvaroh@allenai.org"", ""~Eric_Kolve1"", ""dustins@allenai.org"", ""~Roozbeh_Mottaghi1"", ""~Ali_Farhadi3""]","[""Luca Weihs"", ""Aniruddha Kembhavi"", ""Kiana Ehsani"", ""Sarah M Pratt"", ""Winson Han"", ""Alvaro Herrasti"", ""Eric Kolve"", ""Dustin Schwenk"", ""Roozbeh Mottaghi"", ""Ali Farhadi""]","[""representation learning"", ""deep reinforcement learning"", ""computer vision""]","A growing body of research suggests that embodied gameplay, prevalent not just in human cultures but across a variety of animal species including turtles and ravens, is critical in developing the neural flexibility for creative problem solving, decision making, and socialization. Comparatively little is known regarding the impact of embodied gameplay upon artificial agents. While recent work has produced agents proficient in abstract games, these environments are far removed the real world and thus these agents can provide little insight into the advantages of embodied play. Hiding games, such as hide-and-seek, played universally, provide a rich ground for studying the impact of embodied gameplay on representation learning in the context of perspective taking, secret keeping, and false belief understanding. Here we are the first to show that embodied adversarial reinforcement learning agents playing Cache, a variant of hide-and-seek, in a high fidelity, interactive, environment, learn generalizable representations of their observations encoding information such as object permanence, free space, and containment. 
Moving closer to biologically motivated learning strategies, our agents' representations, enhanced by intentionality and memory, are developed through interaction and play. These results serve as a model for studying how facets of vision develop through interaction, provide an experimental framework for assessing what is learned by artificial agents, and demonstrates the value of moving from large, static, datasets towards experiential, interactive, representation learning.",/pdf/9136e8ff7bc8b8e82f73830b816f60fe258ec628.pdf,ICLR,2021,We show the representation learned through interaction and gameplay generalizes better compared to passive and static representation learning methods. +B1al7jg0b,H1pgmsxCW,1509110000000.0,1519080000000.0,368,Overcoming Catastrophic Interference using Conceptor-Aided Backpropagation,"[""x.he@jacobs-university.de"", ""h.jaeger@jacobs-university.de""]","[""Xu He"", ""Herbert Jaeger""]","[""Catastrophic Interference"", ""Conceptor"", ""Backpropagation"", ""Continual Learning"", ""Lifelong Learning""]","Catastrophic interference has been a major roadblock in the research of continual learning. Here we propose a variant of the back-propagation algorithm, ""Conceptor-Aided Backprop"" (CAB), in which gradients are shielded by conceptors against degradation of previously learned tasks. Conceptors have their origin in reservoir computing, where they have been previously shown to overcome catastrophic forgetting. CAB extends these results to deep feedforward networks. On the disjoint and permuted MNIST tasks, CAB outperforms two other methods for coping with catastrophic interference that have recently been proposed.",/pdf/8c8c2a92ef45fdd233c1050d816b2281a088b5fa.pdf,ICLR,2018,"We propose a variant of the backpropagation algorithm, in which gradients are shielded by conceptors against degradation of previously learned tasks." 
+r1l-e3Cqtm,SJg7cxA9F7,1538090000000.0,1545360000000.0,1051,Deep Probabilistic Video Compression,"[""jun.han.gr@dartmouth.edu"", ""sal.lombardo@disneyresearch.com"", ""christopher.schroers@disneyresearch.com"", ""stephan.mandt@gmail.com""]","[""Jun Han"", ""Salvator Lombardo"", ""Christopher Schroers"", ""Stephan Mandt""]","[""variational inference"", ""video compression"", ""deep generative models""]","We propose a variational inference approach to deep probabilistic video compression. Our model uses advances in variational autoencoders (VAEs) for sequential data and combines it with recent work on neural image compression. The approach jointly learns to transform the original video into a lower-dimensional representation as well as to entropy code this representation according to a temporally-conditioned probabilistic model. We split the latent space into local (per frame) and global (per segment) variables, and show that training the VAE to utilize both representations leads to an improved rate-distortion performance. Evaluation on small videos from public data sets with varying complexity and diversity show that our model yields competitive results when trained on generic video content. 
Extreme compression performance is achieved for videos with specialized content if the model is trained on similar videos.",/pdf/5aec3594d42c20abaaa50a88b19baf85e9c68467.pdf,ICLR,2019,Deep Probabilistic Video Compression Via Sequential Variational Autoencoders +BJxh2j0qYm,HJgCOaw9t7,1538090000000.0,1551160000000.0,748,Dynamic Channel Pruning: Feature Boosting and Suppression,"[""xt.gao@siat.ac.cn"", ""yaz21@cam.ac.uk"", ""lukaszd.mail@gmail.com"", ""robert.mullins@cl.cam.ac.uk"", ""czxu@um.edu.mo""]","[""Xitong Gao"", ""Yiren Zhao"", ""\u0141ukasz Dudziak"", ""Robert Mullins"", ""Cheng-zhong Xu""]","[""dynamic network"", ""faster CNNs"", ""channel pruning""]","Making deep convolutional neural networks more accurate typically comes at the cost of increased computational and memory resources. In this paper, we reduce this cost by exploiting the fact that the importance of features computed by convolutional layers is highly input-dependent, and propose feature boosting and suppression (FBS), a new method to predictively amplify salient convolutional channels and skip unimportant ones at run-time. FBS introduces small auxiliary connections to existing convolutional layers. In contrast to channel pruning methods which permanently remove channels, it preserves the full network structures and accelerates convolution by dynamically skipping unimportant input and output channels. FBS-augmented networks are trained with conventional stochastic gradient descent, making it readily available for many state-of-the-art CNNs. We compare FBS to a range of existing channel pruning and dynamic execution schemes and demonstrate large improvements on ImageNet classification. 
Experiments show that FBS can respectively provide 5× and 2× savings in compute on VGG-16 and ResNet-18, both with less than 0.6% top-5 accuracy loss.",/pdf/49336570b19e8f9fe58503dd974c4f8373b56ae6.pdf,ICLR,2019,We make convolutional layers run faster by dynamically boosting and suppressing channels in feature computation. +HJG7m2AcF7,BJxZ1T69Fm,1538090000000.0,1545360000000.0,1343,Context Mover's Distance & Barycenters: Optimal transport of contexts for building representations,"[""sidak.singh@epfl.ch"", ""andreas.hug@epfl.ch"", ""aymeric.dieuleveut@epfl.ch"", ""martin.jaggi@epfl.ch""]","[""Sidak Pal Singh"", ""Andreas Hug"", ""Aymeric Dieuleveut"", ""Martin Jaggi""]","[""representation learning"", ""wasserstein distance"", ""wasserstein barycenter"", ""entailment""]","We propose a unified framework for building unsupervised representations of entities and their compositions, by viewing each entity as a histogram (or distribution) over its contexts. This enables us to take advantage of optimal transport and construct representations that effectively harness the geometry of the underlying space containing the contexts. Our method captures uncertainty via modelling the entities as distributions and simultaneously provides interpretability with the optimal transport map, hence giving a novel perspective for building rich and powerful feature representations. As a guiding example, we formulate unsupervised representations for text, and demonstrate it on tasks such as sentence similarity and word entailment detection. Empirical results show strong advantages gained through the proposed framework. This approach can potentially be used for any unsupervised or supervised problem (on text or other modalities) with a co-occurrence structure, such as any sequence data. 
The key tools at the core of this framework are Wasserstein distances and Wasserstein barycenters.",/pdf/faf133fa1be48ff76596e0af8e0207985dad5ddf.pdf,ICLR,2019, +y4-e1K23GLC,A5Qa7C2Ssn1,1601310000000.0,1614990000000.0,245,A law of robustness for two-layers neural networks,"[""~Sebastien_Bubeck1"", ""~Yuanzhi_Li1"", ""~Dheeraj_Mysore_Nagaraj1""]","[""Sebastien Bubeck"", ""Yuanzhi Li"", ""Dheeraj Mysore Nagaraj""]","[""neural networks"", ""approximation theory"", ""robust machine learning""]","We initiate the study of the inherent tradeoffs between the size of a neural network and its robustness, as measured by its Lipschitz constant. We make a precise conjecture that, for any Lipschitz activation function and for most datasets, any two-layers neural network with $k$ neurons that perfectly fit the data must have its Lipschitz constant larger (up to a constant) than $\sqrt{n/k}$ where $n$ is the number of datapoints. In particular, this conjecture implies that overparametrization is necessary for robustness, since it means that one needs roughly one neuron per datapoint to ensure a $O(1)$-Lipschitz network, while mere data fitting of $d$-dimensional data requires only one neuron per $d$ datapoints. We prove a weaker version of this conjecture when the Lipschitz constant is replaced by an upper bound on it based on the spectral norm of the weight matrix. We also prove the conjecture in the high-dimensional regime $n \approx d$ (which we also refer to as the undercomplete case, since only $k \leq d$ is relevant here). Finally we prove the conjecture for polynomial activation functions of degree $p$ when $n \approx d^p$. We complement these findings with experimental evidence supporting the conjecture.",/pdf/54a01a5235df9e0c5f22a4fcefb16799177231ea.pdf,ICLR,2021,"We conjecture a precise tradeoff between the size of a neural network and its robustness, and prove variants of the conjecture." 
+Kao09W-oe8,BEEUvy7tluZ,1601310000000.0,1614990000000.0,1318,Channel-Directed Gradients for Optimization of Convolutional Neural Networks,"[""~Dong_Lao1"", ""~Peihao_Zhu1"", ""~Peter_Wonka2"", ""~Ganesh_Sundaramoorthi1""]","[""Dong Lao"", ""Peihao Zhu"", ""Peter Wonka"", ""Ganesh Sundaramoorthi""]","[""stochastic optimization"", ""Riemannian geometry"", ""Riemannian gradient flows"", ""convolutional neural nets""]","We introduce optimization methods for convolutional neural networks that can be used to improve existing gradient-based optimization in terms of generalization error. The method requires only simple processing of existing stochastic gradients, can be used in conjunction with any optimizer, and has only a linear overhead (in the number of parameters) compared to computation of the stochastic gradient. The method works by computing the gradient of the loss function with respect to output-channel directed re-weighted L2 or Sobolev metrics, which has the effect of smoothing components of the gradient across a certain direction of the parameter tensor. We show that defining the gradients along the output channel direction leads to a performance boost, while other directions can be detrimental. We present the continuum theory of such gradients, its discretization, and application to deep networks. Experiments on benchmark datasets, several networks, and baseline optimizers show that optimizers can be improved in generalization error by simply computing the stochastic gradient with respect to output-channel directed metrics.",/pdf/37bd54b6cec84bd03e4daeff55af13d9bb492387.pdf,ICLR,2021, +butEPeLARP_,_iMlFkoQXoW,1601310000000.0,1614990000000.0,2862,Predicting the impact of dataset composition on model performance,"[""~Tatsunori_Hashimoto1""]","[""Tatsunori Hashimoto""]","[""Experimental design"", ""generalization"", ""data collection""]"," Real-world machine learning systems are often are trained using a mix of data sources with varying cost and quality. 
Understanding how the size and composition of a training dataset affect model performance is critical for advancing our understanding of generalization, as well as designing more effective data collection policies. We show that there is a simple, accurate way to predict the loss incurred by a model based on data size and composition. Our work expands recent observations of log-linear generalization error and uses this to cast model performance prediction as a learning problem. Using the theory of optimal experimental design, we derive a simple rational function approximation to generalization error that can be fitted using a few model training runs. Our approach achieves nearly exact ($r^2>.93$) predictions of model performance under substantial extrapolation in two different standard supervised learning tasks and is accurate ($r^2 > .83$) on more challenging machine translation and question answering tasks where baselines achieve worse-than-random performance.",/pdf/83ecd8dcd1ae89ffa1a92679fee7ac2a9990aada.pdf,ICLR,2021, +H-AAaJ9v_lE,eRv2kjF8PIx,1601310000000.0,1614990000000.0,3512,Legendre Deep Neural Network (LDNN) and its application for approximation of nonlinear Volterra–Fredholm–Hammerstein integral equations,"[""kparand@sbu.ac.ir"", ""~Zeinab_Hajimohammadi1"", ""~Ali_Ghodsi1""]","[""Kourosh Parand"", ""Zeinab Hajimohammadi"", ""Ali Ghodsi""]","[""Deep neural network"", ""Volterra\u2013Fredholm\u2013Hammerstein integral equations"", ""Legendre orthogonal polynomials"", ""Gaussian quadrature method"", ""Collocation method""]","Various phenomena in biology, physics, and engineering are modeled by differential equations. These differential equations including partial differential equations and ordinary differential equations can be converted and represented as integral equations. 
In particular, Volterra–Fredholm–Hammerstein integral equations are the main type of these integral equations and researchers are interested in investigating and solving these equations. In this paper, we propose Legendre Deep Neural Network (LDNN) for solving nonlinear Volterra–Fredholm–Hammerstein integral equations (V-F-H-IEs). LDNN utilizes Legendre orthogonal polynomials as activation functions of the Deep structure. We present how LDNN can be used to solve nonlinear V-F-H-IEs. We show using the Gaussian quadrature collocation method in combination with LDNN results in a novel numerical solution for nonlinear V-F-H-IEs. Several examples are given to verify the performance and accuracy of LDNN.",/pdf/5f0fb090d796ab6d496251b5ba46a3170d551f31.pdf,ICLR,2021,we propose the Legendre Deep Neural Network (LDNN) for solving nonlinear Volterra–Fredholm–Hammerstein integral equations (V-F-H-IEs) +BkVVOi0cFX,B1lGa6X9K7,1538090000000.0,1545360000000.0,348,Denoise while Aggregating: Collaborative Learning in Open-Domain Question Answering,"[""jihaozhe@gmail.com"", ""mrlyk423@gmail.com"", ""liuzy@tsinghua.edu.cn"", ""sms@tsinghua.edu.cn""]","[""Haozhe Ji"", ""Yankai Lin"", ""Zhiyuan Liu"", ""Maosong Sun""]","[""natural language processing"", ""open-domain question answering"", ""semi-supervised learning""]","The open-domain question answering (OpenQA) task aims to extract answers that match specific questions from a distantly supervised corpus. Unlike supervised reading comprehension (RC) datasets where questions are designed for particular paragraphs, background sentences in OpenQA datasets are more prone to noise. We observe that most existing OpenQA approaches are vulnerable to noise since they simply regard those sentences that contain the answer span as ground truths and ignore the plausible correlation between the sentences and the question. 
To address this deficiency, we introduce a unified and collaborative model that leverages alignment information from query-sentence pairs in a small-scale supervised RC dataset and aggregates relevant evidence from distantly supervised corpus to answer open-domain questions. We evaluate our model on several real-world OpenQA datasets, and experimental results show that our collaborative learning methods outperform the existing baselines significantly.",/pdf/331a04b9b54096dd4c9904025a9a3ba2a01c8a18.pdf,ICLR,2019,We propose denoising strategies to leverage information from supervised RC datasets to handle the noise issue in the open-domain QA task. +LNtTXJ9XXr,u5Wn2MkkRrS,1601310000000.0,1614990000000.0,2196,Adversarial Masking: Towards Understanding Robustness Trade-off for Generalization,"[""~Minhao_Cheng1"", ""~Zhe_Gan1"", ""~Yu_Cheng1"", ""~Shuohang_Wang1"", ""~Cho-Jui_Hsieh1"", ""~Jingjing_Liu2""]","[""Minhao Cheng"", ""Zhe Gan"", ""Yu Cheng"", ""Shuohang Wang"", ""Cho-Jui Hsieh"", ""Jingjing Liu""]","[""Adversarial Machine Learning"", ""Adversarial Robustness"", ""Adversarial Training"", ""Generalization""]","Adversarial training is a commonly used technique to improve model robustness against adversarial examples. Despite its success as a defense mechanism, adversarial training often fails to generalize well to unperturbed test data. While previous work assumes it is caused by the discrepancy between robust and non-robust features, in this paper, we introduce \emph{Adversarial Masking}, a new hypothesis that this trade-off is caused by different feature maskings applied. Specifically, the rescaling operation in the batch normalization layer, when combined together with ReLU activation, serves as a feature masking layer to select different features for model training. By carefully manipulating different maskings, a well-balanced trade-off can be achieved between model performance on unperturbed and perturbed data. 
Built upon this hypothesis, we further propose Robust Masking (RobMask), which constructs unique masking for every specific attack perturbation by learning a set of primary adversarial feature maskings. By incorporating different feature maps after the masking, we can distill better features to help model generalization. Sufficiently, adversarial training can be treated as an effective regularizer to achieve better generalization. Experiments on multiple benchmarks demonstrate that RobMask achieves significant improvement on clean test accuracy compared to strong state-of-the-art baselines.",/pdf/5eafd989547bf5e640fc45d558c400011626ef20.pdf,ICLR,2021,"We introduce a new hypothesis to understand the trade-off between robustness and natural accuracy, and further propose a new method to achieve better generalization using adversarial examples." +rJbPBt9lg,,1478300000000.0,1485140000000.0,486,Neural Code Completion,"[""xinw@eecs.berkeley.edu"", ""liuchang@eecs.berkeley.edu"", ""ricshin@berkeley.edu"", ""jegonzal@berkeley.edu"", ""dawnsong@cs.berkeley.edu""]","[""Chang Liu"", ""Xin Wang"", ""Richard Shin"", ""Joseph E. Gonzalez"", ""Dawn Song""]","[""Deep learning"", ""Applications""]","Code completion, an essential part of modern software development, yet can be challenging for dynamically typed programming languages. In this paper we explore the use of neural network techniques to automatically learn code completion from a large corpus of dynamically typed JavaScript code. We show different neural networks that leverage not only token level information but also structural information, and evaluate their performance on different prediction tasks. We demonstrate that our models can outperform the state-of-the-art approach, which is based on decision tree techniques, on both next non-terminal and next terminal prediction tasks by 3.8 points and 0.5 points respectively. 
We believe that neural network techniques can play a transformative role in helping software developers manage the growing complexity of software systems, and we see this work as a first step in that direction.",/pdf/89c0d2ea8865acc14e38cd87c2557eb4fa39e68f.pdf,ICLR,2017, +B1i7ezW0-,B1cmgf-0-,1509130000000.0,1518730000000.0,763,Semi-Supervised Learning via New Deep Network Inversion,"[""randallbalestriero@gmail.com"", ""roger.dyni@gmail.com"", ""herve.glotin@univ-tln.fr"", ""richb@rice.edu""]","[""Balestriero R."", ""Roger V."", ""Glotin H."", ""Baraniuk R.""]","[""inversion scheme"", ""deep neural networks"", ""semi-supervised learning"", ""MNIST"", ""SVHN"", ""CIFAR10""]","We exploit a recently derived inversion scheme for arbitrary deep neural networks to develop a new semi-supervised learning framework that applies to a wide range of systems and problems. +The approach reaches current state-of-the-art methods on MNIST and provides reasonable performances on SVHN and CIFAR10. Through the introduced method, residual networks are for the first time applied to semi-supervised tasks. Experiments with one-dimensional signals highlight the generality of the method. Importantly, our approach is simple, efficient, and requires no change in the deep network architecture.",/pdf/ae09092f86f953ee8182bce26e5654ca8df216d6.pdf,ICLR,2018,We exploit an inversion scheme for arbitrary deep neural networks to develop a new semi-supervised learning framework applicable to many topologies. +rygkk305YQ,B1g1HvhqKQ,1538090000000.0,1545890000000.0,948,Hierarchical Generative Modeling for Controllable Speech Synthesis,"[""wnhsu@mit.edu"", ""ngyuzh@google.com"", ""ronw@google.com"", ""heigazen@google.com"", ""yonghui@google.com"", ""logpie@gmail.com"", ""yuancao@google.com"", ""jiaye@google.com"", ""zhifengc@google.com"", ""jonathanasdf@google.com"", ""drpng@google.com"", ""rpang@google.com""]","[""Wei-Ning Hsu"", ""Yu Zhang"", ""Ron J. 
Weiss"", ""Heiga Zen"", ""Yonghui Wu"", ""Yuxuan Wang"", ""Yuan Cao"", ""Ye Jia"", ""Zhifeng Chen"", ""Jonathan Shen"", ""Patrick Nguyen"", ""Ruoming Pang""]","[""speech synthesis"", ""representation learning"", ""deep generative model"", ""sequence-to-sequence model""]","This paper proposes a neural end-to-end text-to-speech (TTS) model which can control latent attributes in the generated speech that are rarely annotated in the training data, such as speaking style, accent, background noise, and recording conditions. The model is formulated as a conditional generative model with two levels of hierarchical latent variables. The first level is a categorical variable, which represents attribute groups (e.g. clean/noisy) and provides interpretability. The second level, conditioned on the first, is a multivariate Gaussian variable, which characterizes specific attribute configurations (e.g. noise level, speaking rate) and enables disentangled fine-grained control over these attributes. This amounts to using a Gaussian mixture model (GMM) for the latent distribution. Extensive evaluation demonstrates its ability to control the aforementioned attributes. In particular, it is capable of consistently synthesizing high-quality clean speech regardless of the quality of the training data for the target speaker.",/pdf/dc17849f3867fe07f78c874d2397170d7d3a6bd2.pdf,ICLR,2019,"Building a TTS model with Gaussian Mixture VAEs enables fine-grained control of speaking style, noise condition, and more." 
+FsLTUzZlsgT,ddK4IXJVRt,1601310000000.0,1614990000000.0,457,Learning Curves for Analysis of Deep Networks,"[""~Derek_Hoiem1"", ""~Tanmay_Gupta1"", ""~Zhizhong_Li1"", ""~Michal_M_Shlapentokh-Rothman1""]","[""Derek Hoiem"", ""Tanmay Gupta"", ""Zhizhong Li"", ""Michal M Shlapentokh-Rothman""]","[""learning curve"", ""deep network"", ""analysis"", ""asymptotic error"", ""learning efficiency"", ""power law""]","A learning curve models a classifier’s test error as a function of the number of training samples. Prior works show that learning curves can be used to select model parameters and extrapolate performance. We investigate how to use learning curves to analyze the impact of design choices, such as pre-training, architecture, and data augmentation. We propose a method to robustly estimate learning curves, abstract their parameters into error and data-reliance, and evaluate the effectiveness of different parameterizations. We also provide several interesting observations based on learning curves for a variety of image classification models.",/pdf/633b75d19885b4b8bd139fd6319fea4b4c22fab3.pdf,ICLR,2021,We revisit learning curves as a tool for analyzing the impact of deep network design on performance. +BJlrZyrKDB,BJl9z9j_PH,1569440000000.0,1577170000000.0,1541,Statistically Consistent Saliency Estimation,"[""barut@gwu.edu"", ""shine_lsy@gwu.edu""]","[""Emre Barut"", ""Shunyan Luo""]","[""Deep Learning Interpretation"", ""Saliency Estimation"", ""High Dimensional Statistics""]","The use of deep learning for a wide range of data problems has increased the need for understanding and diagnosing these models, and deep learning interpretation techniques have become an essential tool for data analysts. Although numerous model interpretation methods have been proposed in recent years, most of these procedures are based on heuristics with little or no theoretical guarantees. 
In this work, we propose a statistical framework for saliency estimation for black box computer vision models. We build a model-agnostic estimation procedure that is statistically consistent and passes the saliency checks of Adebayo et al. (2018). Our method requires solving a linear program, whose solution can be efficiently computed in polynomial time. Through our theoretical analysis, we establish an upper bound on the number of model evaluations needed to recover the region of importance with high probability, and build a new perturbation scheme for estimation of local gradients that is shown to be more efficient than the commonly used random perturbation schemes. Validity of the new method is demonstrated through sensitivity analysis. +",/pdf/4b4bc0ed97974320249076b8864fea082e9f0570.pdf,ICLR,2020,We propose a statistical framework and a theoretically consistent procedure for saliency estimation. +cFpWC6ZMtmj,96Z4wp83pu,1601310000000.0,1614990000000.0,2929,Explainability for fair machine learning,"[""~Tom_Begley1"", ""~Tobias_Schwedes1"", ""~Christopher_Frye1"", ""~Ilya_Feige1""]","[""Tom Begley"", ""Tobias Schwedes"", ""Christopher Frye"", ""Ilya Feige""]","[""explainability"", ""fairness"", ""Shapley""]","As the decisions made or influenced by machine learning models increasingly impact our lives, it is crucial to detect, understand, and mitigate unfairness. But even simply determining what ``unfairness'' should mean in a given context is non-trivial: there are many competing definitions, and choosing between them often requires a deep understanding of the underlying task. It is thus tempting to use model explainability to gain insights into model fairness, however existing explainability tools do not reliably indicate whether a model is indeed fair. In this work we present a new approach to explaining fairness in machine learning, based on the Shapley value paradigm. 
Our fairness explanations attribute a model's overall unfairness to individual input features, even in cases where the model does not operate on sensitive attributes directly. Moreover, motivated by the linearity of Shapley explainability, we propose a meta algorithm for applying existing training-time fairness interventions, wherein one trains a perturbation to the original model, rather than a new model entirely. By explaining the original model, the perturbation, and the fair-corrected model, we gain insight into the accuracy-fairness trade-off that is being made by the intervention. We further show that this meta algorithm enjoys both flexibility and stability benefits with no loss in performance.",/pdf/1f78e15a15871a353ec28b27d984ec20c5124c6b.pdf,ICLR,2021,Explainability methods for understanding fairness in machine learning models. +Sklqvo0qt7,rygIyeScKQ,1538090000000.0,1545360000000.0,286,A Priori Estimates of the Generalization Error for Two-layer Neural Networks,"[""leiwu@pku.edu.cn"", ""chaom@princeton.edu"", ""weinan@math.princeton.edu""]","[""Lei Wu"", ""Chao Ma"", ""Weinan E""]","[""Over-parameterization"", ""A priori estimates"", ""Path norm"", ""Neural networks"", ""Generalization error"", ""Approximation error""]","New estimates for the generalization error are established for a nonlinear regression problem using a two-layer neural network model. These new estimates are a priori in nature in the sense that the bounds depend only on some norms of the underlying functions to be fitted, not the parameters in the model. In contrast, most existing results for neural networks are a posteriori in nature in the sense that the bounds depend on some norms of the model parameters. The error rates are comparable to that of the Monte Carlo method in terms of the size of the dataset. Moreover, these bounds are equally effective in the over-parametrized regime when the network size is much larger than the size of the dataset. 
",/pdf/650a6a2592059c697270c0570ecd48a2bae2a75b.pdf,ICLR,2019, +SJefPkSFPr,BJx69qa_wB,1569440000000.0,1577170000000.0,1759,Regulatory Focus: Promotion and Prevention Inclinations in Policy Search,"[""leilansen@gmail.com"", ""lizz@sensetime.com"", ""lixiaoyang@nbu.edu.cn"", ""qiucong@sensetime.com"", ""dhlin@ie.cuhk.edu.hk""]","[""Lanxin Lei"", ""Zhizhong Li"", ""Xiaoyang Li"", ""Cong Qiu"", ""Dahua Lin""]","[""Reinforcement Learning"", ""Regulatory Focus"", ""Promotion and Prevention"", ""Exploration""]","The estimation of advantage is crucial for a number of reinforcement learning algorithms, as it directly influences the choices of future paths. In this work, we propose a family of estimates based on the order statistics over the path ensemble, which allows one to flexibly drive the learning process in a promotion focus or prevention focus. On top of this formulation, we systematically study the impacts of different regulatory focuses. Our findings reveal that regulatory focus, when chosen appropriately, can result in significant benefits. In particular, for the environments with sparse rewards, promotion focus would lead to more efficient exploration of the policy space; while for those where individual actions can have critical impacts, prevention focus is preferable. On various benchmarks, including MuJoCo continuous control, Terrain locomotion, Atari games, and sparse-reward environments, the proposed schemes consistently demonstrate improvement over mainstream methods, not only accelerating the learning process but also obtaining substantial performance gains.",/pdf/16c27e3567d6b0906779799f61c66e9ecf9a9050.pdf,ICLR,2020,We implemented and tested the regulatory fit theory from psychology in RL using order statistics over path ensembles. 
+3LujMJM9EMp,A0VNzuxW5we,1601310000000.0,1614990000000.0,2730,DEMI: Discriminative Estimator of Mutual Information ,"[""~Ruizhi_Liao3"", ""~Daniel_Moyer3"", ""~Polina_Golland1"", ""~William_M_Wells1""]","[""Ruizhi Liao"", ""Daniel Moyer"", ""Polina Golland"", ""William M Wells""]","[""Mutual information estimation"", ""discriminative classification""]","Estimating mutual information between continuous random variables is often intractable and extremely challenging for high-dimensional data. Recent progress has leveraged neural networks to optimize variational lower bounds on mutual information. Although showing promise for this difficult problem, the variational methods have been theoretically and empirically proven to have serious statistical limitations: 1) many methods struggle to produce accurate estimates when the underlying mutual information is either low or high; 2) the resulting estimators may suffer from high variance. Our approach is based on training a classifier that provides the probability that a data sample pair is drawn from the joint distribution rather than from the product of its marginal distributions. Moreover, we establish a direct connection between mutual information and the average log odds estimate produced by the classifier on a test set, leading to a simple and accurate estimator of mutual information. We show theoretically that our method and other variational approaches are equivalent when they achieve their optimum, while our method sidesteps the variational bound. Empirical results demonstrate high accuracy of our approach and the advantages of our estimator in the context of representation learning. 
+",/pdf/0d854745f9fa56b3894a3411216323ea986c104a.pdf,ICLR,2021, +SJl2ps0qKQ,HkeCV5pqF7,1538090000000.0,1545360000000.0,842,Learning to Decompose Compound Questions with Reinforcement Learning,"[""capriceyhh@zju.edu.cn"", ""wanghanwh@zju.edu.cn"", ""guoshuang@zju.edu.cn"", ""lantau.zw@alibaba-inc.com"", ""huajunsir@zju.edu.cn""]","[""Haihong Yang"", ""Han Wang"", ""Shuang Guo"", ""Wei Zhang"", ""Huajun Chen""]","[""Compound Question Decomposition"", ""Reinforcement Learning"", ""Knowledge-Based Question Answering"", ""Learning-to-decompose""]","As for knowledge-based question answering, a fundamental problem is to relax the assumption of answerable questions from simple questions to compound questions. Traditional approaches firstly detect topic entity mentioned in questions, then traverse the knowledge graph to find relations as a multi-hop path to answers, while we propose a novel approach to leverage simple-question answerers to answer compound questions. Our model consists of two parts: (i) a novel learning-to-decompose agent that learns a policy to decompose a compound question into simple questions and (ii) three independent simple-question answerers that classify the corresponding relations for each simple question. Experiments demonstrate that our model learns complex rules of compositionality as stochastic policy, which benefits simple neural networks to achieve state-of-the-art results on WebQuestions and MetaQA. We analyze the interpretable decomposition process as well as generated partitions.",/pdf/1c9f6ef4b3d02a397e1b8ee17c6f62b7917fe696.pdf,ICLR,2019,We propose a learning-to-decompose agent that helps simple-question answerers to answer compound question over knowledge graph. +ryxY73AcK7,BJxFw76qF7,1538090000000.0,1545360000000.0,1380,Sorting out Lipschitz function approximation,"[""cem.anil@mail.utoronto.ca"", ""jlucas@cs.toronto.edu"", ""rgrosse@cs.toronto.edu""]","[""Cem Anil"", ""James Lucas"", ""Roger B. 
Grosse""]","[""deep learning"", ""lipschitz neural networks"", ""generalization"", ""universal approximation"", ""adversarial examples"", ""generative models"", ""optimal transport"", ""adversarial robustness""]","Training neural networks subject to a Lipschitz constraint is useful for generalization bounds, provable adversarial robustness, interpretable gradients, and Wasserstein distance estimation. By the composition property of Lipschitz functions, it suffices to ensure that each individual affine transformation or nonlinear activation function is 1-Lipschitz. The challenge is to do this while maintaining the expressive power. We identify a necessary property for such an architecture: each of the layers must preserve the gradient norm during backpropagation. Based on this, we propose to combine a gradient norm preserving activation function, GroupSort, with norm-constrained weight matrices. We show that norm-constrained GroupSort architectures are universal Lipschitz function approximators. Empirically, we show that norm-constrained GroupSort networks achieve tighter estimates of Wasserstein distance than their ReLU counterparts and can achieve provable adversarial robustness guarantees with little cost to accuracy.",/pdf/89c2904bb18c3154dafd463e693c4859fb50d340.pdf,ICLR,2019,We identify pathologies in existing activation functions when learning neural networks with Lipschitz constraints and use these insights to design neural networks which are universal Lipschitz function approximators. +SkYbF1slg,,1478320000000.0,1486660000000.0,549,An Information-Theoretic Framework for Fast and Robust Unsupervised Learning via Neural Population Infomax,"[""whuang21@jhmi.edu"", ""kzhang4@jhmi.edu""]","[""Wentao Huang"", ""Kechen Zhang""]","[""Unsupervised Learning"", ""Theory"", ""Deep learning""]","A framework is presented for unsupervised learning of representations based on infomax principle for large-scale neural populations. 
We use an asymptotic approximation to the Shannon's mutual information for a large neural population to demonstrate that a good initial approximation to the global information-theoretic optimum can be obtained by a hierarchical infomax method. Starting from the initial solution, an efficient algorithm based on gradient descent of the final objective function is proposed to learn representations from the input datasets, and the method works for complete, overcomplete, and undercomplete bases. As confirmed by numerical experiments, our method is robust and highly efficient for extracting salient features from input datasets. Compared with the main existing methods, our algorithm has a distinct advantage in both the training speed and the robustness of unsupervised representation learning. Furthermore, the proposed method is easily extended to the supervised or unsupervised model for training deep structure networks.",/pdf/94058ba880bcc05cf75a967aec77aa1434b7a76d.pdf,ICLR,2017,We present a novel information-theoretic framework for fast and robust unsupervised Learning via information maximization for neural population coding. +DGIXvEAJVd,3adqNlBaZX46,1601310000000.0,1614990000000.0,2826,Learning Chess Blindfolded,"[""~Shubham_Toshniwal1"", ""~Sam_Wiseman1"", ""~Karen_Livescu1"", ""~Kevin_Gimpel1""]","[""Shubham Toshniwal"", ""Sam Wiseman"", ""Karen Livescu"", ""Kevin Gimpel""]","[""Chess"", ""Transformers"", ""Language Modeling"", ""World State""]","Transformer language models have made tremendous strides in natural language understanding. However, the complexity of natural language makes it challenging to ascertain how accurately these models are tracking the world state underlying the text. Motivated by this issue, we consider the task of language modeling for the game of chess. Unlike natural language, chess notations describe a simple, constrained, and deterministic domain. 
Moreover, we observe that chess notation itself allows for directly probing the world state, without requiring any additional probing-related machinery. Additionally, we have access to a vast number of chess games coupled with the exact state at every move, allowing us to measure the impact of various ways of including grounding during language model training. Overall, we find that with enough training data, transformer language models can learn to track pieces and predict legal moves when trained solely from move sequences. However, in adverse circumstances (small training sets or prediction following long move histories), providing access to board state information during training can yield consistent improvements.",/pdf/7c4d5c84cef6aa8052699ea3db2a4d34689899e9.pdf,ICLR,2021,Language modeling for Chess with Transformers +SkxYOiCqKX,S1gX6RY5YQ,1538090000000.0,1545360000000.0,372,Pixel Chem: A Representation for Predicting Material Properties with Neural Network,"[""115010269@link.cuhk.edu.cn"", ""115010252@link.cuhk.edu.cn"", ""116010125@link.cuhk.edu.cn"", ""115010250@link.cuhk.edu.cn"", ""115010111@link.cuhk.edu.cn"", ""115010194@link.cuhk.edu.cn"", ""zhuxi@cuhk.edu.cn""]","[""Shuqian Ye"", ""Yanheng Xu"", ""Jiechun Liang"", ""Hao Xu"", ""Shuhong Cai"", ""Shixin Liu"", ""Xi Zhu""]","[""material property prediction"", ""neural network"", ""material structure representation"", ""chemistry""]","In this work we developed a new representation of the chemical information for the machine learning models, with benefits from both the real space (R-space) and energy space (K-space). Different from the previous symmetric matrix presentations, the charge transfer channel based on Pauling’s electronegativity is derived from the dependence on real space distance and orbitals for the hetero atomic structures. 
This representation can work for the bulk materials as well as the low dimensional nano materials, and can map the R-space and K-space into the pixel space (P-space) by training and testing 130k structures. P-space can well reproduce the R-space quantities within error 0.53. This new asymmetric matrix representation doubles the information storage of the previous symmetric representations. This work provides a new dimension for the computational chemistry towards the machine learning architecture. ",/pdf/333c0e6ad1657a0746e828443a1aa0ead9134251.pdf,ICLR,2019,"Proposed a unified, physics based representation of material structures to predict various properties with neural network." +sAX7Z7uIJ_Y,85p-ys7xfih,1601310000000.0,1614990000000.0,3116,Calibrated Adversarial Refinement for Stochastic Semantic Segmentation,"[""~Elias_Kassapis1"", ""~Georgi_Dikov1"", ""~Deepak_Gupta2"", ""~Cedric_Nugteren1""]","[""Elias Kassapis"", ""Georgi Dikov"", ""Deepak Gupta"", ""Cedric Nugteren""]","[""stochastic semantic segmentation"", ""conditional generative models"", ""adversarial training"", ""calibration"", ""uncertainty""]","Ambiguities in images or unsystematic annotation can lead to multiple valid solutions in semantic segmentation. To learn a distribution over predictions, recent work has explored the use of probabilistic networks. However, these do not necessarily capture the empirical distribution accurately. In this work, we aim to learn a calibrated multimodal predictive distribution, where the empirical frequency of the sampled predictions closely reflects that of the corresponding labels in the training set. To this end, we propose a novel two-stage, cascaded strategy for calibrated adversarial refinement. In the first stage, we explicitly model the data with a categorical likelihood. In the second, we train an adversarial network to sample from it an arbitrary number of coherent predictions. 
The model can be used independently or integrated into any black-box segmentation framework to facilitate learning of calibrated stochastic mappings. We demonstrate the utility and versatility of the approach by attaining state-of-the-art results on the multigrader LIDC dataset and a modified Cityscapes dataset. In addition, we use a toy regression dataset to show that our framework is not confined to semantic segmentation, and the core design can be adapted to other tasks requiring learning a calibrated predictive distribution.",/pdf/b1bdb0206f69f90685fd408c18777686459b0ad0.pdf,ICLR,2021,"We propose a framework for learning a calibrated, multimodal predictive distribution by combining a segmentation network predicting pixelwise probabilities with an adversarial network using these probabilities to sample coherent segmentation maps." +S1c2cvqee,,1478290000000.0,1487880000000.0,419,Designing Neural Network Architectures using Reinforcement Learning,"[""bowen@mit.edu"", ""otkrist@mit.edu"", ""naik@mit.edu"", ""raskar@mit.edu""]","[""Bowen Baker"", ""Otkrist Gupta"", ""Nikhil Naik"", ""Ramesh Raskar""]","[""Deep learning"", ""Reinforcement Learning""]","At present, designing convolutional neural network (CNN) architectures requires both human expertise and labor. New architectures are handcrafted by careful experimentation or modified from a handful of existing networks. We introduce MetaQNN, a meta-modeling algorithm based on reinforcement learning to automatically generate high-performing CNN architectures for a given learning task. The learning agent is trained to sequentially choose CNN layers using $Q$-learning with an $\epsilon$-greedy exploration strategy and experience replay. The agent explores a large but finite space of possible architectures and iteratively discovers designs with improved performance on the learning task. 
On image classification benchmarks, the agent-designed networks (consisting of only standard convolution, pooling, and fully-connected layers) beat existing networks designed with the same layer types and are competitive against the state-of-the-art methods that use more complex layer types. We also outperform existing meta-modeling approaches for network design on image classification tasks.",/pdf/509c06f595581160f6875ff7e705c6fc623b21ea.pdf,ICLR,2017,A Q-learning algorithm for automatically generating neural nets +B1suU-bAW,Bk9d8-WAb,1509130000000.0,1518730000000.0,680,Learning Covariate-Specific Embeddings with Tensor Decompositions,"[""kjtian@stanford.edu"", ""tengz@stanford.edu"", ""jamesz@stanford.edu""]","[""Kevin Tian"", ""Teng Zhang"", ""James Zou""]","[""Word embedding"", ""tensor decomposition""]","Word embedding is a useful approach to capture co-occurrence structures in a large corpus of text. In addition to the text data itself, we often have additional covariates associated with individual documents in the corpus---e.g. the demographic of the author, time and venue of publication, etc.---and we would like the embedding to naturally capture the information of the covariates. In this paper, we propose a new tensor decomposition model for word embeddings with covariates. Our model jointly learns a \emph{base} embedding for all the words as well as a weighted diagonal transformation to model how each covariate modifies the base embedding. To obtain the specific embedding for a particular author or venue, for example, we can then simply multiply the base embedding by the transformation matrix associated with that time or venue. The main advantages of our approach are data efficiency and interpretability of the covariate transformation matrix. 
Our experiments demonstrate that our joint model learns substantially better embeddings conditioned on each covariate compared to the standard approach of learning a separate embedding for each covariate using only the relevant subset of data. Furthermore, our model encourages the embeddings to be ``topic-aligned'' in the sense that the dimensions have specific independent meanings. This allows our covariate-specific embeddings to be compared by topic, enabling downstream differential analysis. We empirically evaluate the benefits of our algorithm on several datasets, and demonstrate how it can be used to address many natural questions about the effects of covariates.",/pdf/e3c572fcde3b81904429fd30686d913ef86bd5a2.pdf,ICLR,2018,"Using the same embedding across covariates doesn't make sense, we show that a tensor decomposition algorithm learns sparse covariate-specific embeddings and naturally separable topics jointly and data-efficiently." +N6JECD-PI5w,x9fecZMOLpV,1601310000000.0,1616000000000.0,2701,FairFil: Contrastive Neural Debiasing Method for Pretrained Text Encoders,"[""~Pengyu_Cheng1"", ""~Weituo_Hao1"", ""~Siyang_Yuan1"", ""~Shijing_Si1"", ""~Lawrence_Carin2""]","[""Pengyu Cheng"", ""Weituo Hao"", ""Siyang Yuan"", ""Shijing Si"", ""Lawrence Carin""]","[""Fairness"", ""Contrastive Learning"", ""Mutual Information"", ""Pretrained Text Encoders""]","Pretrained text encoders, such as BERT, have been applied increasingly in various natural language processing (NLP) tasks, and have recently demonstrated significant performance gains. However, recent studies have demonstrated the existence of social bias in these pretrained NLP models. Although prior works have made progress on word-level debiasing, improved sentence-level fairness of pretrained encoders still lacks exploration. 
In this paper, we proposed the first neural debiasing method for a pretrained sentence encoder, which transforms the pretrained encoder outputs into debiased representations via a fair filter (FairFil) network. To learn the FairFil, we introduce a contrastive learning framework that not only minimizes the correlation between filtered embeddings and bias words but also preserves rich semantic information of the original sentences. On real-world datasets, our FairFil effectively reduces the bias degree of pretrained text encoders, while continuously showing desirable performance on downstream tasks. Moreover, our post hoc method does not require any retraining of the text encoders, further enlarging FairFil's application space.",/pdf/c3b19ced57b7827c059693736c4217a27b682d92.pdf,ICLR,2021,A debiasing method for large-scale pretrained text encoders via contrastive learning. +heFdS9_tkzc,UQ4Oie8QWPn,1601310000000.0,1614990000000.0,114,Distantly Supervised Relation Extraction in Federated Settings,"[""~Dianbo_Sui1"", ""~Yubo_Chen1"", ""~Kang_Liu1"", ""~Jun_Zhao4""]","[""Dianbo Sui"", ""Yubo Chen"", ""Kang Liu"", ""Jun Zhao""]","[""Distant Supervision"", ""Relation Extraction"", ""Federated Learning""]","Distant supervision is widely used in relation extraction in order to create a large-scale training dataset by aligning a knowledge base with unstructured text. Most existing studies in this field have assumed there is a great deal of centralized unstructured text. However, in practice, text may be distributed on different platforms and cannot be centralized due to privacy restrictions. Therefore, it is worthwhile to investigate distant supervision in the federated learning paradigm, which decouples the training of the model from the need for direct access to the raw text. However, overcoming label noise of distant supervision becomes more difficult in federated settings, because the sentences containing the same entity pair scatter around different platforms. 
In this paper, we propose a federated denoising framework to suppress label noise in federated settings. The core of this framework is a multiple instance learning based denoising method that is able to select reliable sentences via cross-platform collaboration. Various experimental results on New York Times dataset and miRNA gene regulation relation dataset demonstrate the effectiveness of the proposed method.",/pdf/78902f7b30d4c43ddee5736c0af0434bd9a4b142.pdf,ICLR,2021,We propose a federated denoising framework to suppress distantly supervised label noise in federated settings. +r1espiA9YQ,HJgwv_s5Y7,1538090000000.0,1545360000000.0,832,Towards More Theoretically-Grounded Particle Optimization Sampling for Deep Learning,"[""15300180019@fudan.edu.cn"", ""rz68@duke.edu"", ""cchangyou@gmail.com""]","[""Jianyi Zhang"", ""Ruiyi Zhang"", ""Changyou Chen""]",[],"Many deep-learning based methods such as Bayesian deep learning (DL) and deep reinforcement learning (RL) have heavily relied on the ability of a model being able to efficiently explore via Bayesian sampling. Particle-optimization sampling (POS) is a recently developed technique to generate high-quality samples from a target distribution by iteratively updating a set of interactive particles, with a representative algorithm the Stein variational gradient descent (SVGD). Though obtaining significant empirical success, the {\em non-asymptotic} convergence behavior of SVGD remains unknown. In this paper, we generalize POS to a stochasticity setting by injecting random noise in particle updates, called stochastic particle-optimization sampling (SPOS). Notably, for the first time, we develop {\em non-asymptotic convergence theory} for the SPOS framework, characterizing convergence of a sample approximation w.r.t.\! the number of particles and iterations under both convex- and nonconvex-energy-function settings. 
Interestingly, we provide theoretical understanding of a pitfall of SVGD that can be avoided in the proposed SPOS framework, {\it i.e.}, particles tend to collapse to a local mode in SVGD under some particular conditions. Our theory is based on the analysis of nonlinear stochastic differential equations, which serves as an extension and a complementary development to the asymptotic convergence theory for SVGD such as (Liu, 2017). With such theoretical guarantees, SPOS can be safely and effectively applied on both Bayesian DL and deep RL tasks. Extensive results demonstrate the effectiveness of our proposed framework.",/pdf/eb066b0226d0086042ae84aae27762215c75b663.pdf,ICLR,2019, +Syx4wnEtvH,rJxWUpe-QS,1569440000000.0,1583910000000.0,1,Large Batch Optimization for Deep Learning: Training BERT in 76 minutes,"[""youyang@cs.berkeley.edu"", ""jingli@google.com"", ""sashank@google.com"", ""jhseu@google.com"", ""sanjivk@google.com"", ""bsrinadh@google.com"", ""xiaodansong@google.com"", ""demmel@berkeley.edu"", ""keutzer@berkeley.edu"", ""chohsieh@cs.ucla.edu""]","[""Yang You"", ""Jing Li"", ""Sashank Reddi"", ""Jonathan Hseu"", ""Sanjiv Kumar"", ""Srinadh Bhojanapalli"", ""Xiaodan Song"", ""James Demmel"", ""Kurt Keutzer"", ""Cho-Jui Hsieh""]","[""large-batch optimization"", ""distributed training"", ""fast optimizer""]","Training large deep neural networks on massive datasets is computationally very challenging. There has been recent surge in interest in using large batch stochastic optimization methods to tackle this issue. The most prominent algorithm in this line of research is LARS, which by employing layerwise adaptive learning rates trains ResNet on ImageNet in a few minutes. However, LARS performs poorly for attention models like BERT, indicating that its performance gains are not consistent across tasks. In this paper, we first study a principled layerwise adaptation strategy to accelerate training of deep neural networks using large mini-batches. 
Using this strategy, we develop a new layerwise adaptive large batch optimization technique called LAMB; we then provide convergence analysis of LAMB as well as LARS, showing convergence to a stationary point in general nonconvex settings. Our empirical results demonstrate the superior performance of LAMB across various tasks such as BERT and ResNet-50 training with very little hyperparameter tuning. In particular, for BERT training, our optimizer enables use of very large batch sizes of 32868 without any degradation of performance. By increasing the batch size to the memory limit of a TPUv3 Pod, BERT training time can be reduced from 3 days to just 76 minutes.",/pdf/857c7f5fd13b287d2aa9ecea6712812c11f75f37.pdf,ICLR,2020,A fast optimizer for general applications and large-batch training. +rkgfdeBYvH,Byxfo2gKvr,1569440000000.0,1583910000000.0,2390,Effect of Activation Functions on the Training of Overparametrized Neural Nets,"[""abhishekpanigrahi034@gmail.com"", ""ashetty1995@gmail.com"", ""navingo@microsoft.com""]","[""Abhishek Panigrahi"", ""Abhishek Shetty"", ""Navin Goyal""]","[""activation functions"", ""deep learning theory"", ""neural networks""]","It is well-known that overparametrized neural networks trained using gradient based methods quickly achieve small training error with appropriate hyperparameter settings. Recent papers have proved this statement theoretically for highly overparametrized networks under reasonable assumptions. These results either assume that the activation function is ReLU or they depend on the minimum eigenvalue of a certain Gram matrix. In the latter case, existing works only prove that this minimum eigenvalue is non-zero and do not provide quantitative bounds which require that this eigenvalue be large. Empirically, a number of alternative activation functions have been proposed which tend to perform better than ReLU at least in some settings but no clear understanding has emerged. 
This state of affairs underscores the importance of theoretically understanding the impact of activation functions on training. In the present paper, we provide theoretical results about the effect of activation function on the training of highly overparametrized 2-layer neural networks. A crucial property that governs the performance of an activation is whether or not it is smooth: +• For non-smooth activations such as ReLU, SELU, ELU, which are not smooth because there is a point where either the first order or second order derivative is discontinuous, all eigenvalues of the associated Gram matrix are large under minimal assumptions on the data. +• For smooth activations such as tanh, swish, polynomial, which have derivatives of all orders at all points, the situation is more complex: if the subspace spanned by the data has small dimension then the minimum eigenvalue of the Gram matrix can be small leading to slow training. But if the dimension is large and the data satisfies another mild condition, then the eigenvalues are large. If we allow deep networks, then the small data dimension is not a limitation provided that the depth is sufficient. +We discuss a number of extensions and applications of these results.",/pdf/fbbea6ca741f5700e790389e48dc9245d16dbca9.pdf,ICLR,2020,We provide theoretical results about the effect of activation function on the training of highly overparametrized 2-layer neural networks +H1eSS3CcKX,BygdWi6qKX,1538090000000.0,1556990000000.0,1541,Stochastic Optimization of Sorting Networks via Continuous Relaxations,"[""adityag@cs.stanford.edu"", ""ejwang@cs.stanford.edu"", ""azweig@cs.stanford.edu"", ""ermon@cs.stanford.edu""]","[""Aditya Grover"", ""Eric Wang"", ""Aaron Zweig"", ""Stefano Ermon""]","[""continuous relaxations"", ""sorting"", ""permutation"", ""stochastic computation graphs"", ""Plackett-Luce""]","Sorting input objects is an important step in many machine learning pipelines. 
However, the sorting operator is non-differentiable with respect to its inputs, which prohibits end-to-end gradient-based optimization. In this work, we propose NeuralSort, a general-purpose continuous relaxation of the output of the sorting operator from permutation matrices to the set of unimodal row-stochastic matrices, where every row sums to one and has a distinct argmax. This relaxation permits straight-through optimization of any computational graph that involves a sorting operation. Further, we use this relaxation to enable gradient-based stochastic optimization over the combinatorially large space of permutations by deriving a reparameterized gradient estimator for the Plackett-Luce family of distributions over permutations. We demonstrate the usefulness of our framework on three tasks that require learning semantic orderings of high-dimensional objects, including a fully differentiable, parameterized extension of the k-nearest neighbors algorithm",/pdf/7aa82203d422eb95334b27e8c000baf7a2c5698e.pdf,ICLR,2019,"We provide a continuous relaxation to the sorting operator, enabling end-to-end, gradient-based stochastic optimization." +SyZipzbCb,Hk9FaMZCW,1509140000000.0,1519430000000.0,1023,Distributed Distributional Deterministic Policy Gradients,"[""gabrielbm@google.com"", ""mwhoffman@google.com"", ""budden@google.com"", ""wdabney@google.com"", ""horgan@google.com"", ""dhruvat@google.com"", ""alimuldal@google.com"", ""heess@google.com"", ""countzero@google.com""]","[""Gabriel Barth-Maron"", ""Matthew W. Hoffman"", ""David Budden"", ""Will Dabney"", ""Dan Horgan"", ""Dhruva TB"", ""Alistair Muldal"", ""Nicolas Heess"", ""Timothy Lillicrap""]","[""policy gradient"", ""continuous control"", ""actor critic"", ""reinforcement learning""]","This work adopts the very successful distributional perspective on reinforcement learning and adapts it to the continuous control setting. 
We combine this within a distributed framework for off-policy learning in order to develop what we call the Distributed Distributional Deep Deterministic Policy Gradient algorithm, D4PG. We also combine this technique with a number of additional, simple improvements such as the use of N-step returns and prioritized experience replay. Experimentally we examine the contribution of each of these individual components, and show how they interact, as well as their combined contributions. Our results show that across a wide variety of simple control tasks, difficult manipulation tasks, and a set of hard obstacle-based locomotion tasks the D4PG algorithm achieves state of the art performance.",/pdf/805ee795ebd35cb137af65559f2f43ec63153831.pdf,ICLR,2018,"We develop an agent that we call the Distributional Deterministic Deep Policy Gradient algorithm, which achieves state of the art performance on a number of challenging continuous control problems." +5qK0RActG1x,nqxF-cWqtOr,1601310000000.0,1614990000000.0,1036,Democratizing Evaluation of Deep Model Interpretability through Consensus,"[""~Xuhong_Li3"", ""~Haoyi_Xiong1"", ""~Siyu_Huang2"", ""jishilei@baidu.com"", ""~Yanjie_Fu2"", ""~Dejing_Dou1""]","[""Xuhong Li"", ""Haoyi Xiong"", ""Siyu Huang"", ""Shilei Ji"", ""Yanjie Fu"", ""Dejing Dou""]","[""interpretability evaluation"", ""deep model interpretability""]","Deep learning interpretability tools, such as (Bau et al., 2017; Ribeiro et al., 2016; Smilkov et al., 2017), have been proposed to explain and visualize the ways that deep neural networks make predictions. The success of these methods highly relies on human subjective interpretations, i.e., the ground truth of interpretations, such as feature importance ranking or locations of visual objects, when evaluating the interpretability of the deep models on a specific task. 
For tasks where the ground truth of interpretations is not available, we propose a novel framework, Consensus, incorporating an ensemble of deep models as the committee for interpretability evaluation. Given any task/dataset, Consensus first obtains the interpretation results using existing tools, e.g., LIME (Ribeiro et al., 2016), for every model in the committee, then aggregates the results from the entire committee and approximates the “ground truth” of interpretations through voting. With such approximated ground truth, Consensus evaluates the interpretability of a model through matching its interpretation result and the approximated one, and ranks the matching scores together with committee members, so as to pursue the absolute and relative interpretability evaluation results. We carry out extensive experiments to validate Consensus on various datasets. The results show that Consensus can precisely identify the interpretability for a wide range of models on ubiquitous datasets for which the ground truth is not available. Robustness analyses further demonstrate the advantage of the proposed framework to reach the consensus of interpretations through simple voting and evaluate the interpretability of deep models. 
Through the proposed Consensus framework, the interpretability evaluation has been democratized without the need of ground truth as criterion.",/pdf/fe23bd9bb28af4f4f1cb4eda6c72e805d8016ca5.pdf,ICLR,2021, +rJx0Q6EFPB,SJeo87-DvB,1569440000000.0,1577170000000.0,469,TinyBERT: Distilling BERT for Natural Language Understanding,"[""jiaoxiaoqi@hust.edu.cn"", ""yinyichun@huawei.com"", ""shang.lifeng@huawei.com"", ""jiang.xin@huawei.com"", ""chen.xiao2@huawei.com"", ""lynn.lilinlin@huawei.com"", ""wangfang@hust.edu.cn"", ""qun.liu@huawei.com""]","[""Xiaoqi Jiao"", ""Yichun Yin"", ""Lifeng Shang"", ""Xin Jiang"", ""Xiao Chen"", ""Linlin Li"", ""Fang Wang"", ""Qun Liu""]","[""BERT Compression"", ""Transformer Distillation"", ""TinyBERT""]","Language model pre-training, such as BERT, has significantly improved the performances of many natural language processing tasks. However, the pre-trained language models are usually computationally expensive and memory intensive, so it is difficult to effectively execute them on resource-restricted devices. To accelerate inference and reduce model size while maintaining accuracy, we firstly propose a novel Transformer distillation method that is specially designed for knowledge distillation (KD) of the Transformer-based models. By leveraging this new KD method, the plenty of knowledge encoded in a large “teacher” BERT can be well transferred to a small “student” TinyBERT. Moreover, we introduce a new two-stage learning framework for TinyBERT, which performs Transformer distillation at both the pre-training and task-specific learning stages. This framework ensures that TinyBERT can capture the general domain as well as the task-specific knowledge in BERT. TinyBERT is empirically effective and achieves comparable results with BERT on GLUE benchmark, while being 7.5x smaller and 9.4x faster on inference. 
TinyBERT is also significantly better than state-of-the-art baselines on BERT distillation, with only ∼28% parameters and ∼31% inference time of them. +",/pdf/78cd235085142ae08db1f2413dd4ff31e2157f79.pdf,ICLR,2020, +SPhswbiXpJQ,dtc49x6xwde,1601310000000.0,1614990000000.0,625,Deep Data Flow Analysis,"[""cummins@fb.com"", ""zfisches@student.ethz.ch"", ""~Tal_Ben-Nun1"", ""torsten.hoefler@inf.ethz.ch"", ""hleather@fb.com"", ""~Michael_O'Boyle1""]","[""Chris Cummins"", ""Zacharias Fisches"", ""Tal Ben-Nun"", ""Torsten Hoefler"", ""Hugh Leather"", ""Michael O'Boyle""]","[""program representations"", ""program analysis"", ""compilers"", ""graph neural networks""]","Compiler architects increasingly look to machine learning when building heuristics for compiler optimization. The promise of automatic heuristic design, freeing the compiler engineer from the complex interactions of program, architecture, and other optimizations, is alluring. However, most machine learning methods cannot replicate even the simplest of the abstract interpretations of data flow analysis that are critical to making good optimization decisions. This must change for machine learning to become the dominant technology in compiler heuristics. + +To this end, we propose ProGraML - Program Graphs for Machine Learning - a language-independent, portable representation of whole-program semantics for deep learning. To benchmark current and future learning techniques for compiler analyses we introduce an open dataset of 461k Intermediate Representation (IR) files for LLVM, covering five source programming languages, and 15.4M corresponding data flow results. 
We formulate data flow analysis as an MPNN and show that, using ProGraML, standard analyses can be learned, yielding improved performance on downstream compiler optimization tasks.",/pdf/e2082c40dac624fffdbbaa64d310ca805e54726d.pdf,ICLR,2021,A graph representation for programs that enables more powerful reasoning about program semantics and improved performance on downstream compiler tasks. +ByxpMd9lx,,1478290000000.0,1488590000000.0,454,Transfer Learning for Sequence Tagging with Hierarchical Recurrent Networks,"[""zhiliny@cs.cmu.edu"", ""rsalakhu@cs.cmu.edu"", ""wcohen@cs.cmu.edu""]","[""Zhilin Yang"", ""Ruslan Salakhutdinov"", ""William W. Cohen""]","[""Natural language processing"", ""Deep learning"", ""Transfer Learning""]","Recent papers have shown that neural networks obtain state-of-the-art performance on several different sequence tagging tasks. One appealing property of such systems is their generality, as excellent performance can be achieved with a unified architecture and without task-specific feature engineering. However, it is unclear if such systems can be used for tasks without large amounts of training data. In this paper we explore the problem of transfer learning for neural sequence taggers, where a source task with plentiful annotations (e.g., POS tagging on Penn Treebank) is used to improve performance on a target task with fewer available annotations (e.g., POS tagging for microblogs). We examine the effects of transfer learning for deep hierarchical recurrent networks across domains, applications, and languages, and show that significant improvement can often be obtained. 
These improvements lead to improvements over the current state-of-the-art on several well-studied tasks.",/pdf/1c0e900eb5f205621124bbe3819b1791e4a5da85.pdf,ICLR,2017, +BkxFi2VYvS,SygtMd-Zvr,1569440000000.0,1577170000000.0,159,Semi-supervised Semantic Segmentation using Auxiliary Network,"[""qoososola520.ee06g@nctu.edu.tw"", ""hmhang@nctu.edu.tw""]","[""Wei-Hsu Chen"", ""Hsueh-Ming Hang""]","[""deep learning"", ""semi-supervised segmentation"", ""semantic segmentation"", ""CNN""]","Recently, the convolutional neural networks (CNNs) have shown great success on semantic segmentation task. However, for practical applications such as autonomous driving, the popular supervised learning method faces two challenges: the demand of low computational complexity and the need of huge training dataset accompanied by ground truth. Our focus in this paper is semi-supervised learning. We wish to use both labeled and unlabeled data in the training process. A highly efficient semantic segmentation network is our platform, which achieves high segmentation accuracy at low model size and high inference speed. We propose a semi-supervised learning approach to improve segmentation accuracy by including extra images without labels. While most existing semi-supervised learning methods are designed based on the adversarial learning techniques, we present a new and different approach, which trains an auxiliary CNN network that validates labels (ground-truth) on the unlabeled images. Therefore, in the supervised training phase, both the segmentation network and the auxiliary network are trained using labeled images. Then, in the unsupervised training phase, the unlabeled images are segmented and a subset of image pixels are picked up by the auxiliary network; and then they are used as ground truth to train the segmentation network. Thus, at the end, all dataset images can be used for retraining the segmentation network to improve the segmentation results. 
We use Cityscapes and CamVid datasets to verify the effectiveness of our semi-supervised scheme, and our experimental results show that it can improve the mean IoU for about 1.2% to 2.9% on the challenging Cityscapes dataset.",/pdf/6328ed3363a7714598aae2b277d6f994a793239a.pdf,ICLR,2020,We design a two-branch semi-supervised segmentation system consisting of a segmentation network and an auxiliary CNN network that validates labels (ground-truth) on the unlabeled images +2kImxCmYBic,YUYlmOLXJBr,1601310000000.0,1614990000000.0,73,Numeric Encoding Options with Automunge,"[""~Nicholas_Teague1""]","[""Nicholas Teague""]","[""tabular"", ""feature engineering"", ""preprocessing""]","Mainstream practice in machine learning with tabular data may take for granted that any feature engineering beyond scaling for numeric sets is superfluous in context of deep neural networks. This paper will offer arguments for potential benefits of extended encodings of numeric streams in deep learning by way of a survey of options for numeric transformations as available in the Automunge open source python library platform for tabular data pipelines, where transformations may be applied to distinct columns in “family tree” sets with generations and branches of derivations. Automunge transformation options include normalization, binning, noise injection, derivatives, and more. The aggregation of these methods into family tree sets of transformations are demonstrated for use to present numeric features to machine learning in multiple configurations of varying information content, as may be applied to encode numeric sets of unknown interpretation. 
Experiments demonstrate the realization of a novel generalized solution to data augmentation by noise injection for tabular learning, as may materially benefit model performance in applications with underserved training data.",/pdf/eb7936556eca10490d3a1a95954a5cee32ffb93d.pdf,ICLR,2021,A survey of tabular data numeric feature set encodings as available in the Automunge library. +Hke1gySFvB,B1eM4xo_vH,1569440000000.0,1577170000000.0,1490,Enhancing Language Emergence through Empathy,"[""mos@vs.uni-kassel.de""]","[""Marie Ossenkopf""]","[""multi-agent deep reinforcement learning"", ""emergent communication"", ""auxiliary tasks""]","The emergence of language in multi-agent settings is a promising research direction to ground natural language in simulated agents. If AI would be able to understand the meaning of language through its using it, it could also transfer it to other situations flexibly. That is seen as an important step towards achieving general AI. The scope of emergent communication is so far, however, still limited. It is necessary to enhance the learning possibilities for skills associated with communication to increase the emergable complexity. We took an example from human language acquisition and the importance of the empathic connection in this process. We propose an approach to introduce the notion of empathy to multi-agent deep reinforcement learning. We extend existing approaches on referential games with an auxiliary task for the speaker to predict the listener's mind change improving the learning time. Our experiments show the high potential of this architectural element by doubling the learning speed of the test setup. ",/pdf/3a54ce0ab2b727b28df5a619acc1cf9471941527.pdf,ICLR,2020,An auxiliary prediction task can speed up learning in language emergence setups. 
+H1gax6VtDB,SyesbrSIDr,1569440000000.0,1583910000000.0,356,Contrastive Learning of Structured World Models,"[""t.n.kipf@uva.nl"", ""e.e.vanderpol@uva.nl"", ""m.welling@uva.nl""]","[""Thomas Kipf"", ""Elise van der Pol"", ""Max Welling""]","[""state representation learning"", ""graph neural networks"", ""model-based reinforcement learning"", ""relational learning"", ""object discovery""]","A structured understanding of our world in terms of objects, relations, and hierarchies is an important component of human cognition. Learning such a structured world model from raw sensory data remains a challenge. As a step towards this goal, we introduce Contrastively-trained Structured World Models (C-SWMs). C-SWMs utilize a contrastive approach for representation learning in environments with compositional structure. We structure each state embedding as a set of object representations and their relations, modeled by a graph neural network. This allows objects to be discovered from raw pixel observations without direct supervision as part of the learning process. We evaluate C-SWMs on compositional environments involving multiple interacting objects that can be manipulated independently by an agent, simple Atari games, and a multi-object physics simulation. Our experiments demonstrate that C-SWMs can overcome limitations of models based on pixel reconstruction and outperform typical representatives of this model class in highly structured environments, while learning interpretable object-based representations.",/pdf/9358897ed5bc9ee514546d3e76b158e0bdb4da56.pdf,ICLR,2020,Contrastively-trained Structured World Models (C-SWMs) learn object-oriented state representations and a relational model of an environment from raw pixel input. 
+rJlDnoA5Y7,ryxs9IFcF7,1538090000000.0,1550850000000.0,716,Von Mises-Fisher Loss for Training Sequence to Sequence Models with Continuous Outputs,"[""sachink@cs.cmu.edu"", ""ytsvetko@cs.cmu.edu""]","[""Sachin Kumar"", ""Yulia Tsvetkov""]","[""Language Generation"", ""Regression"", ""Word Embeddings"", ""Machine Translation""]","The Softmax function is used in the final layer of nearly all existing sequence-to-sequence models for language generation. However, it is usually the slowest layer to compute which limits the vocabulary size to a subset of most frequent types; and it has a large memory footprint. We propose a general technique for replacing the softmax layer with a continuous embedding layer. Our primary innovations are a novel probabilistic loss, and a training and inference procedure in which we generate a probability distribution over pre-trained word embeddings, instead of a multinomial distribution over the vocabulary obtained via softmax. We evaluate this new class of sequence-to-sequence models with continuous outputs on the task of neural machine translation. We show that our models obtain upto 2.5x speed-up in training time while performing on par with the state-of-the-art models in terms of translation quality. These models are capable of handling very large vocabularies without compromising on translation quality. 
They also produce more meaningful errors than in the softmax-based models, as these errors typically lie in a subspace of the vector space of the reference translations.",/pdf/34788a5d9ba135093f6f227df5acd7ec40d0692e.pdf,ICLR,2019,Language generation using seq2seq models which produce word embeddings instead of a softmax based distribution over the vocabulary at each step enabling much faster training while maintaining generation quality +SyuWNMZ0W,r1yWEf-CZ,1509140000000.0,1518730000000.0,815,Directing Generative Networks with Weighted Maximum Mean Discrepancy,"[""momod@utexas.edu"", ""guywcole@utexas.edu"", ""sinead.williamson@mccombs.utexas.edu""]","[""Maurice Diesendruck"", ""Guy W. Cole"", ""Sinead Williamson""]","[""generative networks"", ""two sample tests"", ""bias correction"", ""maximum mean discrepancy""]","The maximum mean discrepancy (MMD) between two probability measures P +and Q is a metric that is zero if and only if all moments of the two measures +are equal, making it an appealing statistic for two-sample tests. Given i.i.d. samples +from P and Q, Gretton et al. (2012) show that we can construct an unbiased +estimator for the square of the MMD between the two distributions. If P is a +distribution of interest and Q is the distribution implied by a generative neural +network with stochastic inputs, we can use this estimator to train our neural network. +However, in practice we do not always have i.i.d. samples from our target +of interest. Data sets often exhibit biases—for example, under-representation of +certain demographics—and if we ignore this fact our machine learning algorithms +will propagate these biases. Alternatively, it may be useful to assume our data has +been gathered via a biased sample selection mechanism in order to manipulate +properties of the estimating distribution Q. 
+In this paper, we construct an estimator for the MMD between P and Q when we +only have access to P via some biased sample selection mechanism, and suggest +methods for estimating this sample selection mechanism when it is not already +known. We show that this estimator can be used to train generative neural networks +on a biased data sample, to give a simulator that reverses the effect of that +bias.",/pdf/a91ee9986cc0b931e097218a355a2722c503e7d3.pdf,ICLR,2018,"We propose an estimator for the maximum mean discrepancy, appropriate when a target distribution is only accessible via a biased sample selection procedure, and show that it can be used in a generative network to correct for this bias." +S1l_ZlrFvS,Byez3egKwr,1569440000000.0,1577170000000.0,2142,Why do These Match? Explaining the Behavior of Image Similarity Models,"[""bplumme2@illinois.edu"", ""mvasile2@illinois.edu"", ""vpetsiuk@bu.edu"", ""saenko@bu.edu"", ""daf@illinois.edu""]","[""Bryan A. Plummer"", ""Mariya I. Vasileva"", ""Vitali Petsiuk"", ""Kate Saenko"", ""David Forsyth""]","[""explainable artificial intelligence"", ""image similarity"", ""artificial intelligence for fashion""]","Explaining a deep learning model can help users understand its behavior and allow researchers to discern its shortcomings. Recent work has primarily focused on explaining models for tasks like image classification or visual question answering. In this paper, we introduce an explanation approach for image similarity models, where a model's output is a score measuring the similarity of two inputs rather than a classification. In this task, an explanation depends on both of the input images, so standard methods do not apply. We propose an explanation method that pairs a saliency map identifying important image regions with an attribute that best explains the match. 
We find that our explanations provide additional information not typically captured by saliency maps alone, and can also improve performance on the classic task of attribute recognition. Our approach's ability to generalize is demonstrated on two datasets from diverse domains, Polyvore Outfits and Animals with Attributes 2.",/pdf/f8d4af1b2924ec0b70c86ec72934d2f2141666a4.pdf,ICLR,2020,A black box approach for explaining the predictions of an image similarity model. +BJluxREKDB,B1l6wdXdDS,1569440000000.0,1583910000000.0,933,Learning Heuristics for Quantified Boolean Formulas through Reinforcement Learning,"[""gilled@berkeley.edu"", ""mrabe@google.com"", ""sshesia@eecs.berkeley.edu"", ""eal@eecs.berkeley.edu""]","[""Gil Lederman"", ""Markus Rabe"", ""Sanjit Seshia"", ""Edward A. Lee""]","[""Logic"", ""QBF"", ""Logical Reasoning"", ""SAT"", ""Graph"", ""Reinforcement Learning"", ""GNN""]","We demonstrate how to learn efficient heuristics for automated reasoning algorithms for quantified Boolean formulas through deep reinforcement learning. We focus on a backtracking search algorithm, which can already solve formulas of impressive size - up to hundreds of thousands of variables. The main challenge is to find a representation of these formulas that lends itself to making predictions in a scalable way. For a family of challenging problems, we learned a heuristic that solves significantly more formulas compared to the existing handwritten heuristics.",/pdf/e8010fd156b78b981729127692c4ba3543c87ab7.pdf,ICLR,2020,"We use RL to automatically learn branching heuristic within a state of the art QBF solver, on industrial problems." 
+Skxuk1rFwB,H1ebcT5Owr,1569440000000.0,1583910000000.0,1473,Towards Stable and Efficient Training of Verifiably Robust Neural Networks,"[""huan@huan-zhang.com"", ""chenhg@mit.edu"", ""xiaocw@umich.edu"", ""sgowal@google.com"", ""stanforth@google.com"", ""lbo@illinois.edu"", ""boning@mtl.mit.edu"", ""chohsieh@cs.ucla.edu""]","[""Huan Zhang"", ""Hongge Chen"", ""Chaowei Xiao"", ""Sven Gowal"", ""Robert Stanforth"", ""Bo Li"", ""Duane Boning"", ""Cho-Jui Hsieh""]","[""Robust Neural Networks"", ""Verifiable Training"", ""Certified Adversarial Defense""]","Training neural networks with verifiable robustness guarantees is challenging. Several existing approaches utilize linear relaxation based neural network output bounds under perturbation, but they can slow down training by a factor of hundreds depending on the underlying network architectures. Meanwhile, interval bound propagation (IBP) based training is efficient and significantly outperforms linear relaxation based methods on many tasks, yet it may suffer from stability issues since the bounds are much looser especially at the beginning of training. In this paper, we propose a new certified adversarial training method, CROWN-IBP, by combining the fast IBP bounds in a forward bounding pass and a tight linear relaxation based bound, CROWN, in a backward bounding pass. CROWN-IBP is computationally efficient and consistently outperforms IBP baselines on training verifiably robust neural networks. We conduct large scale experiments on MNIST and CIFAR datasets, and outperform all previous linear relaxation and bound propagation based certified defenses in L_inf robustness. +Notably, we achieve 7.02% verified test error on MNIST at epsilon=0.3, and 66.94% on CIFAR-10 with epsilon=8/255.",/pdf/06192b543d3903c59809e74c3c4d084d66eb4f77.pdf,ICLR,2020,"We propose a new certified adversarial training method, CROWN-IBP, that achieves state-of-the-art robustness for L_inf norm adversarial perturbations." 
+Hk0wHx-RW,HyCvrgWCZ,1509130000000.0,1521760000000.0,583,Learning Sparse Latent Representations with the Deep Copula Information Bottleneck,"[""aleksander.wieczorek@unibas.ch"", ""mario.wieser@unibas.ch"", ""d.murezzan@unibas.ch"", ""volker.roth@unibas.ch""]","[""Aleksander Wieczorek*"", ""Mario Wieser*"", ""Damian Murezzan"", ""Volker Roth""]","[""Information Bottleneck"", ""Deep Information Bottleneck"", ""Deep Variational Information Bottleneck"", ""Variational Autoencoder"", ""Sparsity"", ""Disentanglement"", ""Interpretability"", ""Copula"", ""Mutual Information""]","Deep latent variable models are powerful tools for representation learning. In this paper, we adopt the deep information bottleneck model, identify its shortcomings and propose a model that circumvents them. To this end, we apply a copula transformation which, by restoring the invariance properties of the information bottleneck method, leads to disentanglement of the features in the latent space. Building on that, we show how this transformation translates to sparsity of the latent space in the new model. We evaluate our method on artificial and real data.",/pdf/e228022883bc9e1cd8449e06aacd6d08afab8696.pdf,ICLR,2018,We apply the copula transformation to the Deep Information Bottleneck which leads to restored invariance properties and a disentangled latent space with superior predictive capabilities. +Bylp62EKDH,r1ez4_z4PB,1569440000000.0,1577170000000.0,245,Extreme Triplet Learning: Effectively Optimizing Easy Positives and Hard Negatives,"[""xuanhong@gwu.edu"", ""pless@gwu.edu""]","[""Hong Xuan"", ""Robert Pless""]","[""Triplet Learning"", ""Easy Positive"", ""Hard Negatives""]","The Triplet Loss approach to Distance Metric Learning is defined by the strategy to select triplets and the loss function through which those triplets are optimized. 
During optimization, two especially important cases are easy positive and hard negative mining, which consider the closest examples of the same and different classes. We characterize how triplets behave during optimization as a function of these similarities, and highlight that these important cases have technical problems where standard gradient descent behaves poorly, pulling the negative example closer and/or pushing the positive example farther away. We derive an updated loss function that fixes these problems and shows improvements to the state of the art for CUB, CAR, SOP, In-Shop Clothes datasets.",/pdf/0a70f58ac341b0433995a28cdd211b7b61b3d686.pdf,ICLR,2020, +SJgCEpVtvr,SkeiXdVwwB,1569440000000.0,1577170000000.0,505,DYNAMIC SELF-TRAINING FRAMEWORK FOR GRAPH CONVOLUTIONAL NETWORKS,"[""15300180085@fudan.edu.cn"", ""17210980007@fudan.edu.cn"", ""huangzf@fudan.edu.cn""]","[""Ziang Zhou"", ""Shenzhong Zhang"", ""Zengfeng Huang""]","[""self-training"", ""semi-supervised learning"", ""graph convolutional networks""]","Graph neural networks (GNN) such as GCN, GAT, MoNet have achieved state-of-the-art results on semi-supervised learning on graphs. However, when the number of labeled nodes is very small, the performances of GNNs downgrade dramatically. Self-training has proved to be effective for resolving this issue; however, the performance of self-trained GCN is still inferior to that of G2G and DGI for many settings. Moreover, additional model complexity makes it more difficult to tune the hyper-parameters and do model selection. We argue that the power of self-training is still not fully explored for the node classification task. In this paper, we propose a unified end-to-end self-training framework called \emph{Dynamic Self-training}, which generalizes and simplifies prior work. 
A simple instantiation of the framework based on GCN is provided and empirical results show that our framework outperforms all previous methods including GNNs, embedding based method and self-trained GCNs by a noticeable margin. Moreover, compared with standard self-training, hyper-parameter tuning for our framework is easier.",/pdf/15d64d7eae0a0667de0c3bfa5c4821f40f2cf63b.pdf,ICLR,2020,Propose a novel self-training framework which performs well in few-label cases combined with GCN. +zElset1Klrp,bgt1GjeST7Q,1601310000000.0,1615860000000.0,2083,Fuzzy Tiling Activations: A Simple Approach to Learning Sparse Representations Online,"[""~Yangchen_Pan2"", ""kdbanman@ualberta.ca"", ""~Martha_White1""]","[""Yangchen Pan"", ""Kirby Banman"", ""Martha White""]","[""Reinforcement learning"", ""natural sparsity"", ""sparse representation"", ""fuzzy tiling activation function""]","Recent work has shown that sparse representations---where only a small percentage of units are active---can significantly reduce interference. Those works, however, relied on relatively complex regularization or meta-learning approaches, that have only been used offline in a pre-training phase. In this work, we pursue a direction that achieves sparsity by design, rather than by learning. Specifically, we design an activation function that produces sparse representations deterministically by construction, and so is more amenable to online training. The idea relies on the simple approach of binning, but overcomes the two key limitations of binning: zero gradients for the flat regions almost everywhere, and lost precision---reduced discrimination---due to coarse aggregation. We introduce a Fuzzy Tiling Activation (FTA) that provides non-negligible gradients and produces overlap between bins that improves discrimination. We first show that FTA is robust under covariate shift in a synthetic online supervised learning problem, where we can vary the level of correlation and drift. 
Then we move to the deep reinforcement learning setting and investigate both value-based and policy gradient algorithms that use neural networks with FTAs, in classic discrete control and Mujoco continuous control environments. We show that algorithms equipped with FTAs are able to learn a stable policy faster without needing target networks on most domains. ",/pdf/37f746394b25e91e61274da8ebafc304ae3d32a0.pdf,ICLR,2021,A simple and efficient way to learn sparse feature online in deep learning setting. +BydjJte0-,SJDs1Yx0-,1509100000000.0,1518730000000.0,328,Towards Reverse-Engineering Black-Box Neural Networks,"[""joon@mpi-inf.mpg.de"", ""maxaug@mpi-inf.mpg.de"", ""mfritz@mpi-inf.mpg.de"", ""schiele@mpi-inf.mpg.de""]","[""Seong Joon Oh"", ""Max Augustin"", ""Mario Fritz"", ""Bernt Schiele""]","[""black box"", ""security"", ""privacy"", ""attack"", ""metamodel"", ""adversarial example"", ""reverse-engineering"", ""machine learning""]","Many deployed learned models are black boxes: given input, returns output. Internal information about the model, such as the architecture, optimisation procedure, or training data, is not disclosed explicitly as it might contain proprietary information or make the system more vulnerable. This work shows that such attributes of neural networks can be exposed from a sequence of queries. This has multiple implications. On the one hand, our work exposes the vulnerability of black-box neural networks to different types of attacks -- we show that the revealed internal information helps generate more effective adversarial examples against the black box model. On the other hand, this technique can be used for better protection of private content from automatic recognition models using adversarial examples. 
Our paper suggests that it is actually hard to draw a line between white box and black box models.",/pdf/2e15b6e2ccbc3ede52b310b7f87bd2e0c6f3c93e.pdf,ICLR,2018,"Querying a black-box neural network reveals a lot of information about it; we propose novel ""metamodels"" for effectively extracting information from a black box." +HJMC_iA5tm,BJeic0ucFm,1538090000000.0,1546730000000.0,402,Learning a SAT Solver from Single-Bit Supervision,"[""dselsam@cs.stanford.edu"", ""mlamm@cs.stanford.edu"", ""buenz@cs.stanford.edu"", ""pliang@cs.stanford.edu"", ""leonardo@microsoft.com"", ""dill@cs.stanford.edu""]","[""Daniel Selsam"", ""Matthew Lamm"", ""Benedikt B\\\""{u}nz"", ""Percy Liang"", ""Leonardo de Moura"", ""David L. Dill""]","[""sat"", ""search"", ""graph neural network"", ""theorem proving"", ""proof""]","We present NeuroSAT, a message passing neural network that learns to solve SAT problems after only being trained as a classifier to predict satisfiability. Although it is not competitive with state-of-the-art SAT solvers, NeuroSAT can solve problems that are substantially larger and more difficult than it ever saw during training by simply running for more iterations. Moreover, NeuroSAT generalizes to novel distributions; after training only on random SAT problems, at test time it can solve SAT problems encoding graph coloring, clique detection, dominating set, and vertex cover problems, all on a range of distributions over small random graphs.",/pdf/868d9f7b066d73e8b692a14f5148c48d16376781.pdf,ICLR,2019,"We train a graph network to predict boolean satisfiability and show that it learns to search for solutions, and that the solutions it finds can be decoded from its activations." 
+HyH9lbZAW,H1N5eWb0b,1509130000000.0,1519420000000.0,644,Variational Message Passing with Structured Inference Networks,"[""wlin2018@cs.ubc.ca"", ""nicolas.hubacher@outlook.com"", ""emtiyaz@gmail.com""]","[""Wu Lin"", ""Nicolas Hubacher"", ""Mohammad Emtiyaz Khan""]","[""Variational Inference"", ""Variational Message Passing"", ""Variational Auto-Encoder"", ""Graphical Models"", ""Structured Models"", ""Natural Gradients""]","Recent efforts on combining deep models with probabilistic graphical models are promising in providing flexible models that are also easy to interpret. We propose a variational message-passing algorithm for variational inference in such models. We make three contributions. First, we propose structured inference networks that incorporate the structure of the graphical model in the inference network of variational auto-encoders (VAE). Second, we establish conditions under which such inference networks enable fast amortized inference similar to VAE. Finally, we derive a variational message passing algorithm to perform efficient natural-gradient inference while retaining the efficiency of the amortized inference. By simultaneously enabling structured, amortized, and natural-gradient inference for deep structured models, our method simplifies and generalizes existing methods.",/pdf/0fe1975274211e9c40a394f067a2f2f89414cbc9.pdf,ICLR,2018,We propose a variational message-passing algorithm for models that contain both the deep model and probabilistic graphical model. +JkfYjnOEo6M,Dpz9BtZbUew,1601310000000.0,1616080000000.0,2982,Group Equivariant Stand-Alone Self-Attention For Vision,"[""~David_W._Romero1"", ""~Jean-Baptiste_Cordonnier2""]","[""David W. Romero"", ""Jean-Baptiste Cordonnier""]","[""group equivariant transformers"", ""group equivariant self-attention"", ""group equivariance"", ""self-attention"", ""transformers""]","We provide a general self-attention formulation to impose group equivariance to arbitrary symmetry groups. 
This is achieved by defining positional encodings that are invariant to the action of the group considered. Since the group acts on the positional encoding directly, group equivariant self-attention networks (GSA-Nets) are steerable by nature. Our experiments on vision benchmarks demonstrate consistent improvements of GSA-Nets over non-equivariant self-attention networks.",/pdf/d8bac9d42bd7732afa503ae4fe5f83e1ace88bb2.pdf,ICLR,2021,We provide a general self-attention formulation to impose group equivariance to arbitrary symmetry groups. +SyeUMRNYDr,Bylzqp4uwH,1569440000000.0,1577170000000.0,1001,Generating Dialogue Responses From A Semantic Latent Space,"[""wjko@outlook.com"", ""avik.r@samsung.com"", ""yilin.shen@samsung.com"", ""hongxia.jin@samsung.com""]","[""Wei-Jen Ko"", ""Avik Ray"", ""Yilin Shen"", ""Hongxia Jin""]","[""dialog"", ""chatbot"", ""open domain conversation"", ""CCA""]","Generic responses are a known issue for open-domain dialog generation. Most current approaches model this one-to-many task as a one-to-one task, hence being unable to integrate information from multiple semantically similar valid responses of a prompt. We propose a novel dialog generation model that learns a semantic latent space, on which representations of semantically related sentences are close to each other. This latent space is learned by maximizing correlation between the features extracted from prompt and responses. Learning the pair relationship between the prompts and responses as a regression task on the latent space, instead of classification on the vocabulary using MLE loss, enables our model to view semantically related responses collectively. An additional autoencoder is trained, for recovering the full sentence from the latent space. 
Experimental results show that our proposed model eliminates the generic response problem, while achieving comparable or better coherence compared to baselines.",/pdf/144ff15048a98cb46c85b7337073a631353a4188.pdf,ICLR,2020,"A novel dialog generation model that learns on a utterance level semantic latent space. The model could learn from semantically similar sentences collectively, thus eliminates the generic response problem." +SklcFsAcKX,B1lMvfjqY7,1538090000000.0,1545360000000.0,472,Deep Denoising: Rate-Optimal Recovery of Structured Signals with a Deep Prior,"[""rh43@rice.edu"", ""wen.huang@xmu.edu.cn"", ""p.hand@northeastern.edu"", ""vlad@helm.ai""]","[""Reinhard Heckel"", ""Wen Huang"", ""Paul Hand"", ""Vladislav Voroninski""]","[""non-convex optimization"", ""denoising"", ""generative neural network""]","Deep neural networks provide state-of-the-art performance for image denoising, where the goal is to recover a near noise-free image from a noisy image. +The underlying principle is that neural networks trained on large datasets have empirically been shown to be able to generate natural images well from a low-dimensional latent representation of the image. +Given such a generator network, or prior, a noisy image can be denoised by finding the closest image in the range of the prior. +However, there is little theory to justify this success, let alone to predict the denoising performance as a function of the networks parameters. +In this paper we consider the problem of denoising an image from additive Gaussian noise, assuming the image is well described by a deep neural network with ReLu activations functions, mapping a k-dimensional latent space to an n-dimensional image. +We state and analyze a simple gradient-descent-like iterative algorithm that minimizes a non-convex loss function, and provably removes a fraction of (1 - O(k/n)) of the noise energy. 
+We also demonstrate in numerical experiments that this denoising performance is, indeed, achieved by generative priors learned from data.",/pdf/d2fd29b79dc9adb972640fd9342f951b11ceea23.pdf,ICLR,2019,"By analyzing an algorithms minimizing a non-convex loss, we show that all but a small fraction of noise can be removed from an image using a deep neural network based generative prior." +Bye8hREtvB,BJxd2uYuDS,1569440000000.0,1577170000000.0,1357,Natural Image Manipulation for Autoregressive Models Using Fisher Scores,"[""wilson1.yan@berkeley.edu"", ""jonathanho@berkeley.edu"", ""pabbeel@cs.berkeley.edu""]","[""Wilson Yan"", ""Jonathan Ho"", ""Pieter Abbeel""]","[""fisher score"", ""generative models"", ""image interpolation""]","Deep autoregressive models are one of the most powerful models that exist today which achieve state-of-the-art bits per dim. However, they lie at a strict disadvantage when it comes to controlled sample generation compared to latent variable models. Latent variable models such as VAEs and normalizing flows allow meaningful semantic manipulations in latent space, which autoregressive models do not have. 
In this paper, we propose using Fisher scores as a method to extract embeddings from an autoregressive model to use for interpolation and show that our method provides more meaningful sample manipulation compared to alternate embeddings such as network activations.",/pdf/b628a7e74672a349f7beebed75ca8a33165bb386.pdf,ICLR,2020,We develop a novel method to perform image interpolation and semantic manipulation using autoregressive models through fisher scores +SJg1lxrYwS,BkggJ3yKvr,1569440000000.0,1577170000000.0,2084,PatchFormer: A neural architecture for self-supervised representation learning on images,"[""aravind@cs.berkeley.edu"", ""pabbeel@cs.berkeley.edu""]","[""Aravind Srinivas"", ""Pieter Abbeel""]","[""Unsupervised Learning"", ""Representation Learning"", ""Transformers""]","Learning rich representations from predictive learning without labels has been a longstanding challenge in the field of machine learning. Generative pre-training has so far not been as successful as contrastive methods in modeling representations of raw images. In this paper, we propose a neural architecture for self-supervised representation learning on raw images called the PatchFormer which learns to model spatial dependencies across patches in a raw image. Our method learns to model the conditional probability distribution of missing patches given the context of surrounding patches. We evaluate the utility of the learned representations by fine-tuning the pre-trained model on low data-regime classification tasks. Specifically, we benchmark our model on semi-supervised ImageNet classification which has become a popular benchmark recently for semi-supervised and self-supervised learning methods. 
Our model is able to achieve 30.3% and 65.5% top-1 accuracies when trained only using 1% and 10% of the labels on ImageNet showing the promise for generative pre-training methods.",/pdf/2ee479c1129ef1e86a8208c3e6278502f80c24da.pdf,ICLR,2020,Decoding pixels can still work for representation learning on images +r1g6ogrtDr,SJesa1bKDB,1569440000000.0,1583910000000.0,2517,Co-Attentive Equivariant Neural Networks: Focusing Equivariance On Transformations Co-Occurring in Data,"[""d.w.romeroguzman@vu.nl"", ""m.hoogendoorn@vu.nl""]","[""David W. Romero"", ""Mark Hoogendoorn""]","[""Equivariant Neural Networks"", ""Attention Mechanisms"", ""Deep Learning""]","Equivariance is a nice property to have as it produces much more parameter efficient neural architectures and preserves the structure of the input through the feature mapping. Even though some combinations of transformations might never appear (e.g. an upright face with a horizontal nose), current equivariant architectures consider the set of all possible transformations in a transformation group when learning feature representations. Contrarily, the human visual system is able to attend to the set of relevant transformations occurring in the environment and utilizes this information to assist and improve object recognition. Based on this observation, we modify conventional equivariant feature mappings such that they are able to attend to the set of co-occurring transformations in data and generalize this notion to act on groups consisting of multiple symmetries. We show that our proposed co-attentive equivariant neural networks consistently outperform conventional rotation equivariant and rotation & reflection equivariant neural networks on rotated MNIST and CIFAR-10.",/pdf/f00c9f3e1e78418c18bf1a3eabc5cca7896ef918.pdf,ICLR,2020,We utilize attention to restrict equivariant neural networks to the set or co-occurring transformations in data. 
+Sk36NgFeg,,1478190000000.0,1478190000000.0,74,Filling in the details: Perceiving from low fidelity visual input,"[""fwick@cs.umb.edu"", ""mwick@cs.umass.edu"", ""mpomplun@gmail.com""]","[""Farahnaz A. Wick"", ""Michael L. Wick"", ""Marc Pomplun""]","[""Deep learning"", ""Computer vision"", ""Semi-Supervised Learning""]","Humans perceive their surroundings in great detail even though most of our visual field is reduced to low-fidelity color-deprived (e.g., dichromatic) input by the retina. In contrast, most deep learning architectures deploy computational resources homogeneously to every part of the visual input. Is such a prodigal deployment of resources necessary? In this paper, we present a framework for investigating the extent to which connectionist architectures can perceive an image in full detail even when presented with low acuity, distorted input. Our goal is to initiate investigations that will be fruitful both for engineering better networks and also for eventually testing hypotheses on the neural mechanisms responsible for our own visual system's ability to perceive missing information. We find that networks can compensate for low acuity input by learning global feature functions that allow the network to fill in some of the missing details. For example, the networks accurately perceive shape and color in the periphery, even when 75\% of the input is achromatic and low resolution. On the other hand, the network is prone to similar mistakes as humans; for example, when presented with a fully grayscale landscape image, it perceives the sky as blue when the sky is actually a red sunset. 
",/pdf/ce19097845e8f83133c971d7189be79450c96ff7.pdf,ICLR,2017,Using generative models to create images from impoverished input similar to those received by our visual cortex +HylRk2A5FQ,HkehXldcYQ,1538090000000.0,1545360000000.0,1032,Graph Learning Network: A Structure Learning Algorithm,"[""darwin.pilco@ic.unicamp.br"", ""adin@ic.unicamp.br""]","[""Darwin Danilo Saire Pilco"", ""Ad\u00edn Ram\u00edrez Rivera""]","[""graph prediction"", ""graph structure learning"", ""graph neural network""]","Graph prediction methods that work closely with the structure of the data, e.g., graph generation, commonly ignore the content of its nodes. On the other hand, the solutions that consider the node’s information, e.g., classification, ignore the structure of the whole. And some methods exist in between, e.g., link prediction, but predict the structure piece-wise instead of considering the graph as a whole. We hypothesize that by jointly predicting the structure of the graph and its nodes’ features, we can improve both tasks. We propose the Graph Learning Network (GLN), a simple yet effective process to learn node embeddings and structure prediction functions. Our model uses graph convolutions to propose expected node features, and predict the best structure based on them. We repeat these steps sequentially to enhance the prediction and the embeddings. In contrast to existing generation methods that rely only on the structure of the data, we use the feature on the nodes to predict better relations, similar to what link prediction methods do. However, we propose an holistic approach to process the whole graph for our predictions. Our experiments show that our method predicts consistent structures across a set of problems, while creating meaningful node embeddings.",/pdf/13426a6ef41774e9acb65816704749f9d5b8f62e.pdf,ICLR,2019,"Methods for simultaneous prediction of nodes' feature embeddings and adjacency matrix, and how to learn this process." 
+SkA-IE06W,ryTWU4ApW,1508950000000.0,1519250000000.0,88,When is a Convolutional Filter Easy to Learn?,"[""ssdu@cs.cmu.edu"", ""jasonlee@marshall.usc.edu"", ""yuandong@fb.com""]","[""Simon S. Du"", ""Jason D. Lee"", ""Yuandong Tian""]","[""deep learning"", ""convolutional neural network"", ""non-convex optimization"", ""convergence analysis""]","We analyze the convergence of (stochastic) gradient descent algorithm for learning a convolutional filter with Rectified Linear Unit (ReLU) activation function. Our analysis does not rely on any specific form of the input distribution and our proofs only use the definition of ReLU, in contrast with previous works that are restricted to standard Gaussian input. We show that (stochastic) gradient descent with random initialization can learn the convolutional filter in polynomial time and the convergence rate depends on the smoothness of the input distribution and the closeness of patches. To the best of our knowledge, this is the first recovery guarantee of gradient-based algorithms for convolutional filter on non-Gaussian input distributions. Our theory also justifies the two-stage learning rate strategy in deep neural networks. While our focus is theoretical, we also present experiments that justify our theoretical findings.",/pdf/c63c78376fdd954b0da713e86662d008dbab85df.pdf,ICLR,2018,We prove randomly initialized (stochastic) gradient descent learns a convolutional filter in polynomial time. 
+xzqLpqRzxLq,pS9P9FGRajH,1601310000000.0,1614560000000.0,587,IEPT: Instance-Level and Episode-Level Pretext Tasks for Few-Shot Learning,"[""~Manli_Zhang1"", ""~Jianhong_Zhang1"", ""~Zhiwu_Lu1"", ""~Tao_Xiang1"", ""~Mingyu_Ding1"", ""~Songfang_Huang1""]","[""Manli Zhang"", ""Jianhong Zhang"", ""Zhiwu Lu"", ""Tao Xiang"", ""Mingyu Ding"", ""Songfang Huang""]","[""few-shot learning"", ""self-supervised learning"", ""episode-level pretext task""]","The need of collecting large quantities of labeled training data for each new task has limited the usefulness of deep neural networks. Given data from a set of source tasks, this limitation can be overcome using two transfer learning approaches: few-shot learning (FSL) and self-supervised learning (SSL). The former aims to learn `how to learn' by designing learning episodes using source tasks to simulate the challenge of solving the target new task with few labeled samples. In contrast, the latter exploits an annotation-free pretext task across all source tasks in order to learn generalizable feature representations. In this work, we propose a novel Instance-level and Episode-level Pretext Task (IEPT) framework that seamlessly integrates SSL into FSL. Specifically, given an FSL episode, we first apply geometric transformations to each instance to generate extended episodes. At the instance-level, transformation recognition is performed as per standard SSL. Importantly, at the episode-level, two SSL-FSL hybrid learning objectives are devised: (1) The consistency across the predictions of an FSL classifier from different extended episodes is maximized as an episode-level pretext task. (2) The features extracted from each instance across different episodes are integrated to construct a single FSL classifier for meta-learning. Extensive experiments show that our proposed model (i.e., FSL with IEPT) achieves the new state-of-the-art. 
",/pdf/a68102247933495b5b77811b3b5299cf97a108f4.pdf,ICLR,2021,This paper proposes a novel Instance-level and Episode-level Pretext Task (IEPT) framework that seamlessly integrates SSL into FSL. +rylDfnCqF7,HyxuP0h9F7,1538090000000.0,1548620000000.0,1274,Lagging Inference Networks and Posterior Collapse in Variational Autoencoders,"[""junxianh@cs.cmu.edu"", ""dspokoyn@cs.cmu.edu"", ""gneubig@cs.cmu.edu"", ""tberg@cs.cmu.edu""]","[""Junxian He"", ""Daniel Spokoyny"", ""Graham Neubig"", ""Taylor Berg-Kirkpatrick""]","[""variational autoencoders"", ""posterior collapse"", ""generative models""]","The variational autoencoder (VAE) is a popular combination of deep latent variable model and accompanying variational learning technique. By using a neural inference network to approximate the model's posterior on latent variables, VAEs efficiently parameterize a lower bound on marginal data likelihood that can be optimized directly via gradient methods. In practice, however, VAE training often results in a degenerate local optimum known as ""posterior collapse"" where the model learns to ignore the latent variable and the approximate posterior mimics the prior. In this paper, we investigate posterior collapse from the perspective of training dynamics. We find that during the initial stages of training the inference network fails to approximate the model's true posterior, which is a moving target. As a result, the model is encouraged to ignore the latent encoding and posterior collapse occurs. Based on this observation, we propose an extremely simple modification to VAE training to reduce inference lag: depending on the model's current mutual information between latent variable and observation, we aggressively optimize the inference network before performing each model update. Despite introducing neither new model components nor significant complexity over basic VAE, our approach is able to avoid the problem of collapse that has plagued a large amount of previous work. 
Empirically, our approach outperforms strong autoregressive baselines on text and image benchmarks in terms of held-out likelihood, and is competitive with more complex techniques for avoiding collapse while being substantially faster.",/pdf/47f79f4015dbabc7f2eab6e432cddf975cf1c486.pdf,ICLR,2019,"To address posterior collapse in VAEs, we propose a novel yet simple training procedure that aggressively optimizes inference network with more updates. This new training procedure mitigates posterior collapse and leads to a better VAE model. " +Skj8Kag0Z,SJsIKTgCZ,1509120000000.0,1519440000000.0,422,Stabilizing Adversarial Nets with Prediction Methods,"[""jaiabhay@cs.umd.edu"", ""sohilas@umd.edu"", ""xuzh@cs.umd.edu"", ""djacobs@umiacs.umd.edu"", ""tomg@cs.umd.edu""]","[""Abhay Yadav"", ""Sohil Shah"", ""Zheng Xu"", ""David Jacobs"", ""Tom Goldstein""]","[""adversarial networks"", ""optimization""]","Adversarial neural networks solve many important problems in data science, but are notoriously difficult to train. These difficulties come from the fact that optimal weights for adversarial nets correspond to saddle points, and not minimizers, of the loss function. The alternating stochastic gradient methods typically used for such problems do not reliably converge to saddle points, and when convergence does happen it is often highly sensitive to learning rates. We propose a simple modification of stochastic gradient descent that stabilizes adversarial networks. We show, both in theory and practice, that the proposed method reliably converges to saddle points. This makes adversarial networks less likely to ""collapse,"" and enables faster training with larger learning rates.",/pdf/c9b6d6d85b5ff2738c615cf5fa704565691742b2.pdf,ICLR,2018,"We present a simple modification to the alternating SGD method, called a prediction step, that improves the stability of adversarial networks." 
+ryT4pvqll,,1478290000000.0,1488560000000.0,442,Improving Policy Gradient by Exploring Under-appreciated Rewards,"[""ofirnachum@google.com"", ""mnorouzi@google.com"", ""schuurmans@google.com""]","[""Ofir Nachum"", ""Mohammad Norouzi"", ""Dale Schuurmans""]","[""Reinforcement Learning""]","This paper presents a novel form of policy gradient for model-free reinforcement learning (RL) with improved exploration properties. Current policy-based methods use entropy regularization to encourage undirected exploration of the reward landscape, which is ineffective in high dimensional spaces with sparse rewards. We propose a more directed exploration strategy that promotes exploration of under-appreciated reward regions. An action sequence is considered under-appreciated if its log-probability under the current policy under-estimates its resulting reward. The proposed exploration strategy is easy to implement, requiring only small modifications to the standard REINFORCE algorithm. We evaluate the approach on a set of algorithmic tasks that have long challenged RL methods. We find that our approach reduces hyper-parameter sensitivity and demonstrates significant improvements over baseline methods. Notably, the approach is able to solve a benchmark multi-digit addition task. To our knowledge, this is the first time that a pure RL method has solved addition using only reward feedback.",/pdf/5d0a48343bdd01c02831d88857c60f4b9940ad61.pdf,ICLR,2017,We present a novel form of policy gradient for model-free reinforcement learning with improved exploration properties. 
+qOCdZn3lQIJ,lBLcifsBK8GV,1601310000000.0,1614990000000.0,2108,Compressing gradients in distributed SGD by exploiting their temporal correlation,"[""~Tharindu_Adikari1"", ""~Stark_Draper1""]","[""Tharindu Adikari"", ""Stark Draper""]","[""distributed optimization"", ""gradient compression"", ""error-feedback""]","We propose SignXOR, a novel compression scheme that exploits temporal correlation of gradients for the purpose of gradient compression. Sign-based schemes such as Scaled-sign and SignSGD (Bernstein et al., 2018; Karimireddy et al., 2019) compress gradients by storing only the sign of gradient entries. These methods, however, ignore temporal correlations between gradients. The equality or non-equality of signs of gradients in two consecutive iterations can be represented by a binary vector, which can be further compressed depending on its entropy. By implementing a rate-distortion encoder we increase the temporal correlation of gradients, lowering entropy and improving compression. We achieve theoretical convergence of SignXOR by employing the two-way error-feedback approach introduced by Zheng et al. (2019). Zheng et al. (2019) show that two-way compression with error-feedback achieves the same asymptotic convergence rate as SGD, although convergence is slower by a constant factor. We strengthen their analysis to show that the rate of convergence of two-way compression with error-feedback asymptotically is the same as that of SGD. As a corollary we prove that two-way SignXOR compression with error-feedback achieves the same asymptotic rate of convergence as SGD. We numerically evaluate our proposed method on the CIFAR-100 and ImageNet datasets and show that SignXOR requires less than 50% of communication traffic compared to sending sign of gradients. 
To the best of our knowledge we are the first to present a gradient compression scheme that exploits temporal correlation of gradients.",/pdf/b7214eb609f3fafe32d2605a7c9b727c56b65e0f.pdf,ICLR,2021,A novel compression scheme that exploits temporal correlation of gradients for the purpose of gradient compression. +A2gNouoXE7,MF-R-tWxfdk,1601310000000.0,1616020000000.0,2212,Filtered Inner Product Projection for Crosslingual Embedding Alignment,"[""~Vin_Sachidananda1"", ""~Ziyi_Yang1"", ""~Chenguang_Zhu1""]","[""Vin Sachidananda"", ""Ziyi Yang"", ""Chenguang Zhu""]","[""multilingual representations"", ""word embeddings"", ""natural language processing""]","Due to widespread interest in machine translation and transfer learning, there are numerous algorithms for mapping multiple embeddings to a shared representation space. Recently, these algorithms have been studied in the setting of bilingual lexicon induction where one seeks to align the embeddings of a source and a target language such that translated word pairs lie close to one another in a common representation space. In this paper, we propose a method, Filtered Inner Product Projection (FIPP), for mapping embeddings to a common representation space. As semantic shifts are pervasive across languages and domains, FIPP first identifies the common geometric structure in both embeddings and then, only on the common structure, aligns the Gram matrices of these embeddings. FIPP is applicable even when the source and target embeddings are of differing dimensionalities. Additionally, FIPP provides computational benefits in ease of implementation and is faster to compute than current approaches. Following the baselines in Glavas et al. 2019, we evaluate FIPP both in the context of bilingual lexicon induction and downstream language tasks. We show that FIPP outperforms existing methods on the XLING BLI dataset for most language pairs while also providing robust performance across downstream tasks. 
",/pdf/2638533f1265729a64b0eaf1d56f4ca348fd18dc.pdf,ICLR,2021, +H1lC8o0cKX,HkgoK2A_Km,1538090000000.0,1545360000000.0,223,Unsupervised Emergence of Spatial Structure from Sensorimotor Prediction,"[""alban.laflaquiere@gmail.com"", ""mgarciaortiz@softbankrobotics.com""]","[""Alban Laflaqui\u00e8re"", ""Michael Garcia Ortiz""]","[""spatial perception"", ""grounding"", ""sensorimotor prediction"", ""unsupervised learning"", ""representation learning""]","Despite its omnipresence in robotics application, the nature of spatial knowledge and the mechanisms that underlie its emergence in autonomous agents are still poorly understood. Recent theoretical work suggests that the concept of space can be grounded by capturing invariants induced by the structure of space in an agent's raw sensorimotor experience. Moreover, it is hypothesized that capturing these invariants is beneficial for a naive agent trying to predict its sensorimotor experience. Under certain exploratory conditions, spatial representations should thus emerge as a byproduct of learning to predict. +We propose a simple sensorimotor predictive scheme, apply it to different agents and types of exploration, and evaluate the pertinence of this hypothesis. We show that a naive agent can capture the topology and metric regularity of its spatial configuration without any a priori knowledge, nor extraneous supervision.",/pdf/7ae2c17f8b7e8fcf02f6d4a7131c607011263b9d.pdf,ICLR,2019,A practical evaluation of hypotheses previously laid out about the unsupervised emergence of spatial representations from sensorimotor prediction. +Bk0MRI5lg,,1478290000000.0,1481830000000.0,343,Bridging Nonlinearities and Stochastic Regularizers with Gaussian Error Linear Units,"[""dan@ttic.edu"", ""kgimpel@ttic.edu""]","[""Dan Hendrycks"", ""Kevin Gimpel""]",[],"We propose the Gaussian Error Linear Unit (GELU), a high-performing neural network activation function. 
The GELU nonlinearity is the expected transformation of a stochastic regularizer which randomly applies the identity or zero map to a neuron's input. This stochastic regularizer is comparable to nonlinearities aided by dropout, but it removes the need for a traditional nonlinearity. The connection between the GELU and the stochastic regularizer suggests a new probabilistic understanding of nonlinearities. We perform an empirical evaluation of the GELU nonlinearity against the ReLU and ELU activations and find performance improvements across all tasks.",/pdf/a171eb69077774da944c6b59bfc7ea95b2286029.pdf,ICLR,2017,A Competitor of ReLUs and ELUs with a Probabilistic Underpinning +HJxpDiC5tX,r1gKk2Eqt7,1538090000000.0,1545360000000.0,304, Large-Scale Visual Speech Recognition,"[""shillingford@google.com"", ""assael@google.com"", ""mwhoffman@google.com"", ""tpaine@google.com"", ""cianh@google.com"", ""utsavprabhu@google.com"", ""hankliao@google.com"", ""hasim@google.com"", ""kanishkarao@google.com"", ""lorrayne@google.com"", ""mariecharlotte@google.com"", ""coppin@google.com"", ""benl@google.com"", ""andrewsenior@google.com"", ""nandodefreitas@google.com""]","[""Brendan Shillingford"", ""Yannis Assael"", ""Matthew W. Hoffman"", ""Thomas Paine"", ""C\u00edan Hughes"", ""Utsav Prabhu"", ""Hank Liao"", ""Hasim Sak"", ""Kanishka Rao"", ""Lorrayne Bennett"", ""Marie Mulville"", ""Ben Coppin"", ""Ben Laurie"", ""Andrew Senior"", ""Nando de Freitas""]","[""visual speech recognition"", ""speech recognition"", ""lipreading""]","This work presents a scalable solution to continuous visual speech recognition. To achieve this, we constructed the largest existing visual speech recognition dataset, consisting of pairs of text and video clips of faces speaking (3,886 hours of video). 
In tandem, we designed and trained an integrated lipreading system, consisting of a video processing pipeline that maps raw video to stable videos of lips and sequences of phonemes, a scalable deep neural network that maps the lip videos to sequences of phoneme distributions, and a production-level speech decoder that outputs sequences of words. The proposed system achieves a word error rate (WER) of 40.9% as measured on a held-out set. In comparison, professional lipreaders achieve either 86.4% or 92.9% WER on the same dataset when having access to additional types of contextual information. Our approach significantly improves on previous lipreading approaches, including variants of LipNet and of Watch, Attend, and Spell (WAS), which are only capable of 89.8% and 76.8% WER respectively.",/pdf/f526b81ad9456833a22429fc7a26e51bce13d838.pdf,ICLR,2019,This work presents a scalable solution to continuous visual speech recognition. +Ma0S4RcfpR_,7TD08TFdj7o,1601310000000.0,1614990000000.0,2927,A Representational Model of Grid Cells' Path Integration Based on Matrix Lie Algebras,"[""~Ruiqi_Gao2"", ""~Jianwen_Xie1"", ""~Xue-Xin_Wei2"", ""~Song-Chun_Zhu1"", ""~Ying_Nian_Wu1""]","[""Ruiqi Gao"", ""Jianwen Xie"", ""Xue-Xin Wei"", ""Song-Chun Zhu"", ""Ying Nian Wu""]","[""grid cells"", ""path integration"", ""representational model"", ""Lie algebras"", ""error correction""]","The grid cells in the mammalian medial entorhinal cortex exhibit striking hexagon firing patterns when the agent navigates in the open field. It is hypothesized that the grid cells are involved in path integration so that the agent is aware of its self-position by accumulating its self-motion. 
Assuming the grid cells form a vector representation of self-position, we elucidate a minimally simple recurrent model for grid cells' path integration based on two coupled matrix Lie algebras that underlie two coupled rotation systems that mirror the agent's self-motion: (1) When the agent moves along a certain direction, the vector is rotated by a generator matrix. (2) When the agent changes direction, the generator matrix is rotated by another generator matrix. Our experiments show that our model learns hexagonal grid response patterns that resemble the firing patterns observed from the grid cells in the brain. Furthermore, the learned model is capable of near exact path integration, and it is also capable of error correction. Our model is novel and simple, with explicit geometric and algebraic structures. ",/pdf/f4982876ebf51ab1565cd2c829afdc4f927dabf8.pdf,ICLR,2021,We elucidate a minimally simple recurrent model for grid cells' path integration based on two coupled matrix Lie algebras that underlie two coupled rotation systems that mirror the agent's self-motion. +rkgoyn09KQ,Skl37UhqKm,1538090000000.0,1550920000000.0,1016,textTOvec: DEEP CONTEXTUALIZED NEURAL AUTOREGRESSIVE TOPIC MODELS OF LANGUAGE WITH DISTRIBUTED COMPOSITIONAL PRIOR,"[""pankaj_gupta96@yahoo.com"", ""yatinchaudhary91@gmail.com"", ""fbuettner.phys@gmail.com"", ""hinrich@hotmail.com""]","[""Pankaj Gupta"", ""Yatin Chaudhary"", ""Florian Buettner"", ""Hinrich Schuetze""]","[""neural topic model"", ""natural language processing"", ""text representation"", ""language modeling"", ""information retrieval"", ""deep learning""]","We address two challenges of probabilistic topic modelling in order to better estimate +the probability of a word in a given context, i.e., P(word|context) : (1) No +Language Structure in Context: Probabilistic topic models ignore word order by +summarizing a given context as a “bag-of-word” and consequently the semantics +of words in the context is lost. 
In this work, we incorporate language structure +by combining a neural autoregressive topic model (TM) with a LSTM based language +model (LSTM-LM) in a single probabilistic framework. The LSTM-LM +learns a vector-space representation of each word by accounting for word order +in local collocation patterns, while the TM simultaneously learns a latent representation +from the entire document. In addition, the LSTM-LM models complex +characteristics of language (e.g., syntax and semantics), while the TM discovers +the underlying thematic structure in a collection of documents. We unite two complementary +paradigms of learning the meaning of word occurrences by combining +a topic model and a language model in a unified probabilistic framework, named +as ctx-DocNADE. (2) Limited Context and/or Smaller training corpus of documents: +In settings with a small number of word occurrences (i.e., lack of context) +in short text or data sparsity in a corpus of few documents, the application of TMs +is challenging. We address this challenge by incorporating external knowledge +into neural autoregressive topic models via a language modelling approach: we +use word embeddings as input of a LSTM-LM with the aim to improve the word-topic +mapping on a smaller and/or short-text corpus. The proposed DocNADE +extension is named as ctx-DocNADEe. 
+ +We present novel neural autoregressive topic model variants coupled with neural +language models and embeddings priors that consistently outperform state-of-the-art +generative topic models in terms of generalization (perplexity), interpretability +(topic coherence) and applicability (retrieval and classification) over 6 long-text +and 8 short-text datasets from diverse domains.",/pdf/a2aa6240a32565a3038c6c2e5a4b658295f40de1.pdf,ICLR,2019,Unified neural model of topic and language modeling to introduce language structure in topic models for contextualized topic vectors +H1lWzpNKvr,Bklm_DKLPr,1569440000000.0,1577170000000.0,402,Efficient Multivariate Bandit Algorithm with Path Planning,"[""keyunie@google.com"", ""zezzhang@ebay.com"", ""teyuan@ebay.com"", ""rsong@ebay.com"", ""pmburke10@gmail.com""]","[""Keyu Nie"", ""Zezhong Zhang"", ""Ted Tao Yuan"", ""Rong Song"", ""Pauline Berry Burke""]","[""Multivariate Multi-armed Bandit"", ""Monte Carlo Tree Search"", ""Thompson Sampling"", ""Path Planning""]","In this paper, we solve the arms exponential exploding issues in multivariate Multi-Armed Bandit (Multivariate-MAB) problem when the arm dimension hierarchy is considered. We propose a framework called path planning (TS-PP) which utilizes decision graph/trees to model arm reward success rate with m-way dimension interaction, and adopts Thompson sampling (TS) for heuristic search of arm selection. Naturally, it is quite straightforward to combat the curse of dimensionality using a serial processes that operates sequentially by focusing on one dimension per each process. For our best acknowledge, we are the first to solve Multivariate-MAB problem using graph path planning strategy and deploying alike Monte-Carlo tree search ideas. Our proposed method utilizing tree models has advantages comparing with traditional models such as general linear regression. 
Simulation studies validate our claim by achieving faster convergence speed, better efficient optimal arm allocation and lower cumulative regret.",/pdf/be64bc50c4f3f505c9c1951ee64a73a2416db854.pdf,ICLR,2020,A novel way utilizing tree models to solve multivariate Multi-Armed Bandit problem. +tkAtoZkcUnm,czLLig6juhE,1601310000000.0,1616020000000.0,1944,Neural Thompson Sampling,"[""~Weitong_ZHANG1"", ""~Dongruo_Zhou1"", ""~Lihong_Li1"", ""~Quanquan_Gu1""]","[""Weitong ZHANG"", ""Dongruo Zhou"", ""Lihong Li"", ""Quanquan Gu""]","[""Deep Learning"", ""Contextual Bandits"", ""Thompson sampling""]","Thompson Sampling (TS) is one of the most effective algorithms for solving contextual multi-armed bandit problems. In this paper, we propose a new algorithm, called Neural Thompson Sampling, which adapts deep neural networks for both exploration and exploitation. At the core of our algorithm is a novel posterior distribution of the reward, where its mean is the neural network approximator, and its variance is built upon the neural tangent features of the corresponding neural network. We prove that, provided the underlying reward function is bounded, the proposed algorithm is guaranteed to achieve a cumulative regret of $O(T^{1/2})$, which matches the regret of other contextual bandit algorithms in terms of total round number $T$. Experimental comparisons with other benchmark bandit algorithms on various data sets corroborate our theory.",/pdf/d0c2efae754e3efbba6032b8b7d232a28b2bf5bc.pdf,ICLR,2021,"We propose NeuralTS, a provable neural work-based Thompson sampling algorithm for stochastic contextual bandits." 
+30SS5VjvhrZ,ridQwl2NL3L,1601310000000.0,1614990000000.0,1142,Bayesian Neural Networks with Variance Propagation for Uncertainty Evaluation,"[""~Yuki_Mae1"", ""~Wataru_Kumagai2"", ""~Takafumi_Kanamori1""]","[""Yuki Mae"", ""Wataru Kumagai"", ""Takafumi Kanamori""]","[""uncertainty evaluation"", ""sampling-free method"", ""variance propagation"", ""LSTM"", ""out-of-distribution""]","Uncertainty evaluation is a core technique when deep neural networks (DNNs) are used in real-world problems. In practical applications, we often encounter unexpected samples that have not seen in the training process. Not only achieving the high-prediction accuracy but also detecting uncertain data is significant for safety-critical systems. In statistics and machine learning, Bayesian inference has been exploited for uncertainty evaluation. The Bayesian neural networks (BNNs) have recently attracted considerable attention in this context, as the DNN trained using dropout is interpreted as a Bayesian method. Based on this interpretation, several methods to calculate the Bayes predictive distribution for DNNs have been developed. Though the Monte-Carlo method called MC dropout is a popular method for uncertainty evaluation, it requires a number of repeated feed-forward calculations of DNNs with randomly sampled weight parameters. To overcome the computational issue, we propose a sampling-free method to evaluate uncertainty. Our method converts a neural network trained using the dropout to the corresponding Bayesian neural network with variance propagation. Our method is available not only to feed-forward NNs but also to recurrent NNs including LSTM. We report the computational efficiency and statistical reliability of our method in numerical experiments of the language modeling using RNNs, and the out-of-distribution detection with DNNs. 
",/pdf/8550394ef17bba748e8df32952ecfc6322200492.pdf,ICLR,2021,We developed a sampling-free method for uncertainty evaluation by converting DNNs/RNNs trained using dropout to the Bayesian neural networks with variance propagation. +BylfTySYvB,HJlnp7kKwr,1569440000000.0,1577170000000.0,1981,GATO: Gates Are Not the Only Option,"[""goldstein@nyu.edu"", ""xh1007@nyu.edu"", ""rajeshr@cims.nyu.edu""]","[""Mark Goldstein*"", ""Xintian Han*"", ""Rajesh Ranganath""]","[""Sequence Models"", ""Vanishing Gradients"", ""Recurrent neural networks"", ""Long-term dependence""]","Recurrent Neural Networks (RNNs) facilitate prediction and generation of structured temporal data such as text and sound. However, training RNNs is hard. Vanishing gradients cause difficulties for learning long-range dependencies. Hidden states can explode for long sequences and send unbounded gradients to model parameters, even when hidden-to-hidden Jacobians are bounded. Models like the LSTM and GRU use gates to bound their hidden state, but most choices of gating functions lead to saturating gradients that contribute to, instead of alleviate, vanishing gradients. Moreover, performance of these models is not robust across random initializations. In this work, we specify desiderata for sequence models. We develop one model that satisfies them and that is capable of learning long-term dependencies, called GATO. GATO is constructed so that part of its hidden state does not have vanishing gradients, regardless of sequence length. We study GATO on copying and arithmetic tasks with long dependencies and on modeling intensive care unit and language data. Training GATO is more stable across random seeds and learning rates than GRUs and LSTMs. 
GATO solves these tasks using an order of magnitude fewer parameters.",/pdf/64eb5f863b1e70bb312a34c7067733aac3a4af7f.pdf,ICLR,2020,"Recurrent neural networks can avoid vanishing gradients by not using all of their hidden state in recurrences, together with a residual structure." +rJxvD3VKvr,H1laOycTEH,1569440000000.0,1577170000000.0,8,Wide Neural Networks are Interpolating Kernel Methods: Impact of Initialization on Generalization,"[""manuel.nonnenmacher@de.bosch.com"", ""david.reeb@de.bosch.com"", ""ingo.steinwart@mathematik.uni-stuttgart.de""]","[""Manuel Nonnenmacher"", ""David Reeb"", ""Ingo Steinwart""]","[""overparametrization"", ""generalization"", ""initialization"", ""gradient descent"", ""kernel methods"", ""deep learning theory""]","The recently developed link between strongly overparametrized neural networks (NNs) and kernel methods has opened a new way to understand puzzling features of NNs, such as their convergence and generalization behaviors. In this paper, we make the bias of initialization on strongly overparametrized NNs under gradient descent explicit. We prove that fully-connected wide ReLU-NNs trained with squared loss are essentially a sum of two parts: The first is the minimum complexity solution of an interpolating kernel method, while the second contributes to the test error only and depends heavily on the initialization. This decomposition has two consequences: (a) the second part becomes negligible in the regime of small initialization variance, which allows us to transfer generalization bounds from minimum complexity interpolating kernel methods to NNs; (b) in the opposite regime, the test error of wide NNs increases significantly with the initialization variance, while still interpolating the training data perfectly. 
Our work shows that -- contrary to common belief -- the initialization scheme has a strong effect on generalization performance, providing a novel criterion to identify good initialization strategies.",/pdf/4f1a3c98e7b9e0d087ee57fdab9bb37aec3f102f.pdf,ICLR,2020,We show that the generalization behavior of wide neural networks depends strongly on their initialization. +Bkl1uWb0Z,HkARwZZ0W,1509130000000.0,1518730000000.0,691,Inducing Grammars with and for Neural Machine Translation,"[""ketranmanh@gmail.com"", ""ybisk@yonatanbisk.com""]","[""Ke Tran"", ""Yonatan Bisk""]","[""structured attention"", ""neural machine translation"", ""grammar induction""]","Previous work has demonstrated the benefits of incorporating additional linguistic annotations such as syntactic trees into neural machine translation. However the cost of obtaining those syntactic annotations is expensive for many languages and the quality of unsupervised learning linguistic structures is too poor to be helpful. In this work, we aim to improve neural machine translation via source side dependency syntax but without explicit annotation. We propose a set of models that learn to induce dependency trees on the source side and learn to use that information on the target side. 
Importantly, we also show that our dependency trees capture important syntactic features of language and improve translation quality on two language pairs En-De and En-Ru.",/pdf/49c0ad913733a7171065db27c434a8cd07361668.pdf,ICLR,2018,improve NMT with latent trees +HJe-oRVtPB,S1xDcxYdPr,1569440000000.0,1577170000000.0,1309,STABILITY AND CONVERGENCE THEORY FOR LEARNING RESNET: A FULL CHARACTERIZATION,"[""huishuai.zhang@microsoft.com"", ""yuda3@mail2.sysu.edu.cn"", ""v-minyi@microsoft.com"", ""wche@microsoft.com"", ""tie-yan.liu@microsoft.com""]","[""Huishuai Zhang"", ""Da Yu"", ""Mingyang Yi"", ""Wei Chen"", ""Tie-yan Liu""]","[""ResNet"", ""stability"", ""convergence theory"", ""over-parameterization""]","ResNet structure has achieved great success since its debut. In this paper, we study the stability of learning ResNet. Specifically, we consider the ResNet block $h_l = \phi(h_{l-1}+\tau\cdot g(h_{l-1}))$ where $\phi(\cdot)$ is ReLU activation and $\tau$ is a scalar. We show that for standard initialization used in practice, $\tau =1/\Omega(\sqrt{L})$ is a sharp value in characterizing the stability of forward/backward process of ResNet, where $L$ is the number of residual blocks. Specifically, stability is guaranteed for $\tau\le 1/\Omega(\sqrt{L})$ while conversely forward process explodes when $\tau>L^{-\frac{1}{2}+c}$ for a positive constant $c$. Moreover, if ResNet is properly over-parameterized, we show for $\tau \le 1/\tilde{\Omega}(\sqrt{L})$ gradient descent is guaranteed to find the global minima \footnote{We use $\tilde{\Omega}(\cdot)$ to hide logarithmic factor.}, which significantly enlarges the range of $\tau\le 1/\tilde{\Omega}(L)$ that admits global convergence in previous work. We also demonstrate that the over-parameterization requirement of ResNet only weakly depends on the depth, which corroborates the advantage of ResNet over vanilla feedforward network. 
Empirically, with $\tau\le1/\sqrt{L}$, deep ResNet can be easily trained even without normalization layer. Moreover, adding $\tau=1/\sqrt{L}$ can also improve the performance of ResNet with normalization layer.",/pdf/fe8ea761dbe96d8595808de8aed51982c91d1ee9.pdf,ICLR,2020,"We characterize the stability and convergence of gradient descent learning ResNet, unveiling the theorectical and practical importance of tau =1/sqrt(L) in the residual block." +VNJUTmR-CaZ,kdgGjVnsngF,1601310000000.0,1614990000000.0,2958,Learning to Solve Multi-Robot Task Allocation with a Covariant-Attention based Neural Architecture,"[""~Steve_Paul1"", ""~Payam_Ghassemi2"", ""~Souma_Chowdhury1""]","[""Steve Paul"", ""Payam Ghassemi"", ""Souma Chowdhury""]","[""Graph neural network"", ""Attention mechanism"", ""Reinforcement learning"", ""Multi-robotic task allocation""]","This paper presents a new graph neural network architecture over which reinforcement learning can be performed to yield online policies for an important class of multi-robot task allocation (MRTA) problems, one that involves tasks with deadlines, and robots with ferry range and payload constraints and multi-tour capability. While drawing motivation from recent graph learning methods that learn to solve combinatorial optimization problems of the mTSP/VRP type, this paper seeks to provide better convergence and generalizability specifically for MRTA problems. The proposed neural architecture, called Covariant Attention-based Model or CAM, includes three main components: 1) an encoder: a covariant compositional node-based embedding is used to represent each task as a learnable feature vector in manner that preserves the local structure of the task graph while being invariant to the ordering of graph nodes; 2) context: a vector representation of the mission time and state of the concerned robot and its peers; and 2) a decoder: builds upon the attention mechanism to facilitate a sequential output. 
In order to train the CAM model, a policy-gradient method based on REINFORCE is used. While the new architecture can solve the broad class of MRTA problems stated above, to demonstrate real-world applicability we use a multi-unmanned aerial vehicle or multi-UAV-based flood response problem for evaluation purposes. For comparison, the well-known attention-based approach (designed to solve mTSP/VRP problems) is extended and applied to the MRTA problem, as a baseline. The results show that the proposed CAM method is not only superior to the baseline AM method in terms of the cost function (over training and unseen test scenarios), but also provide significantly faster convergence and yields learnt policies that can be executed within 2.4ms/robot, thereby allowing real-time application.",/pdf/cb28fa753983936767172d8ffb0303bd7fa7ae28.pdf,ICLR,2021, +HyMRUiC9YX,B1egmPLKt7,1538090000000.0,1545360000000.0,221,Exploring and Enhancing the Transferability of Adversarial Examples,"[""leiwu@pku.edu.cn"", ""zhanxing.zhu@pku.edu.cn"", ""chengtai@pku.edu.cn""]","[""Lei Wu"", ""Zhanxing Zhu"", ""Cheng Tai""]","[""Deep learning"", ""Adversarial example"", ""Transferability"", ""Smoothed gradient""]"," State-of-the-art deep neural networks are vulnerable to adversarial examples, formed by applying small but malicious perturbations to the original inputs. Moreover, the perturbations can \textit{transfer across models}: adversarial examples generated for a specific model will often mislead other unseen models. Consequently the adversary can leverage it to attack deployed systems without any query, which severely hinders the application of deep learning, especially in the safety-critical areas. In this work, we empirically study how two classes of factors those might influence the transferability of adversarial examples. One is about model-specific factors, including network architecture, model capacity and test accuracy. 
The other is the local smoothness of loss surface for constructing adversarial examples. Inspired by these understandings on the transferability of adversarial examples, we then propose a simple but effective strategy to enhance the transferability, whose effectiveness is confirmed by a variety of experiments on both CIFAR-10 and ImageNet datasets.",/pdf/68bc7c478b44556282255089d05bb096362bb06c.pdf,ICLR,2019, +SylU3jC5Y7,BkeorX6qYm,1538090000000.0,1545360000000.0,714,ADAPTIVE NETWORK SPARSIFICATION VIA DEPENDENT VARIATIONAL BETA-BERNOULLI DROPOUT,"[""juho.lee@stats.ox.ac.uk"", ""shkim@aitrics.com"", ""jaehong.yoon@kaist.ac.kr"", ""haebeom.lee@kaist.ac.kr"", ""eunhoy@kaist.ac.kr"", ""sjhwang82@kaist.ac.kr""]","[""Juho Lee"", ""Saehoon Kim"", ""Jaehong Yoon"", ""Hae Beom Lee"", ""Eunho Yang"", ""Sung Ju Hwang""]","[""Bayesian deep learning"", ""network pruning""]","While variational dropout approaches have been shown to be effective for network sparsification, they are still suboptimal in the sense that they set the dropout rate for each neuron without consideration of the input data. With such input-independent dropout, each neuron is evolved to be generic across inputs, which makes it difficult to sparsify networks without accuracy loss. To overcome this limitation, we propose adaptive variational dropout whose probabilities are drawn from sparsity-inducing beta-Bernoulli prior. It allows each neuron to be evolved either to be generic or specific for certain inputs, or dropped altogether. Such input-adaptive sparsity-inducing dropout allows the resulting network to tolerate larger degree of sparsity without losing its expressive power by removing redundancies among features. 
We validate our dependent variational beta-Bernoulli dropout on multiple public datasets, on which it obtains significantly more compact networks than baseline methods, with consistent accuracy improvements over the base networks.",/pdf/656996b71871b00e53fafbf0d23a17519fb3098d.pdf,ICLR,2019,We propose a novel Bayesian network sparsification method that adaptively prunes networks according to inputs. +HJSCGD9ex,,1478290000000.0,1481000000000.0,372,Beyond Bilingual: Multi-sense Word Embeddings using Multilingual Context,"[""upadhya3@illinois.edu"", ""kwchang@virginia.edu"", ""jamesz@stanford.edu"", ""taddy@microsoft.com"", ""adum@microsoft.com""]","[""Shyam Upadhyay"", ""Kai-Wei Chang"", ""James Zou"", ""Matt Taddy"", ""Adam Kalai""]","[""Natural language processing""]","Word embeddings, which represent a word as a point in a vector space, have become ubiquitous to several NLP tasks. A recent line of work uses bilingual (two languages) corpora to learn a different vector for each sense of a word, by exploiting crosslingual signals to aid sense identification. We present a multi-view Bayesian non-parametric algorithm which improves multi-sense word embeddings by (a) using multilingual (i.e., more than two languages) corpora to significantly improve sense embeddings beyond what one achieves with bilingual information, and (b) uses a principled approach to learn a variable number of senses per word, in a data-driven manner. Ours is the first approach with the ability to leverage multilingual corpora efficiently for multi-sense representation learning. Experiments show that multilingual training significantly improves performance over monolingual and bilingual training, by allowing us to combine different parallel corpora to leverage multilingual context. 
Multilingual training yields comparable performance to a state of the art monolingual model trained on five times more training data.",/pdf/29fc6c264a518a2fcb19fae533526ac659dc7e90.pdf,ICLR,2017,Using multilingual context for learning multi-sense embeddings helps. +rkxjnjA5KQ,HJgOOBTcFm,1538090000000.0,1545360000000.0,739,Transfer Learning for Related Reinforcement Learning Tasks via Image-to-Image Translation,"[""gamrianshani@gmail.com"", ""yoav.goldberg@gmail.com""]","[""Shani Gamrian"", ""Yoav Goldberg""]","[""Transfer Learning"", ""Reinforcement Learning"", ""Generative Adversarial Networks"", ""Video Games""]","Deep Reinforcement Learning has managed to achieve state-of-the-art results in learning control policies directly from raw pixels. However, despite its remarkable success, it fails to generalize, a fundamental component required in a stable Artificial Intelligence system. Using the Atari game Breakout, we demonstrate the difficulty of a trained agent in adjusting to simple modifications in the raw image, ones that a human could adapt to trivially. In transfer learning, the goal is to use the knowledge gained from the source task to make the training of the target task faster and better. We show that using various forms of fine-tuning, a common method for transfer learning, is not effective for adapting to such small visual changes. In fact, it is often easier to re-train the agent from scratch than to fine-tune a trained agent. We suggest that in some cases transfer learning can be improved by adding a dedicated component whose goal is to learn to visually map between the known domain and the new one. Concretely, we use Unaligned Generative Adversarial Networks (GANs) to create a mapping function to translate images in the target task to corresponding images in the source task. These mapping functions allow us to transform between various variations of the Breakout game, as well as between different levels of a Nintendo game, Road Fighter. 
We show that learning this mapping is substantially more efficient than re-training. A visualization of a trained agent playing Breakout and Road Fighter, with and without the GAN transfer, can be seen in \url{https://streamable.com/msgtm} and \url{https://streamable.com/5e2ka}.",/pdf/d50ed8a84bb43b45388cf01d9374d1cf92b25e2c.pdf,ICLR,2019,"We propose a method of transferring knowledge between related RL tasks using visual mappings, and demonstrate its effectiveness on visual variants of the Atari Breakout game and different levels of Road Fighter, a Nintendo car driving game." +S1gNc3NtvB,HygivGKAIH,1569440000000.0,1577170000000.0,111,Learning Algorithmic Solutions to Symbolic Planning Tasks with a Neural Computer,"[""daniel@robot-learning.de"", ""rueckert@rob.uni-luebeck.de"", ""mail@jan-peters.net""]","[""Daniel Tanneberg"", ""Elmar Rueckert"", ""Jan Peters""]",[],"A key feature of intelligent behavior is the ability to learn abstract strategies that transfer to unfamiliar problems. Therefore, we present a novel architecture, based on memory-augmented networks, that is inspired by the von Neumann and Harvard architectures of modern computers. This architecture enables the learning of abstract algorithmic solutions via Evolution Strategies in a reinforcement learning setting. Applied to Sokoban, sliding block puzzle and robotic manipulation tasks, we show that the architecture can learn algorithmic solutions with strong generalization and abstraction: scaling to arbitrary task configurations and complexities, and being independent of both the data representation and the task domain.",/pdf/555ab1e683223122e4550d507915c6704faddd46.pdf,ICLR,2020,A novel neural computer architecture that learns transferable abstract strategies to symbolic planning tasks as algorithmic solutions with evolution strategies. 
+ehJqJQk9cw,huAKEi8cnkJ,1601310000000.0,1616060000000.0,1878,Personalized Federated Learning with First Order Model Optimization,"[""~Michael_Zhang4"", ""~Karan_Sapra2"", ""~Sanja_Fidler1"", ""~Serena_Yeung1"", ""~Jose_M._Alvarez2""]","[""Michael Zhang"", ""Karan Sapra"", ""Sanja Fidler"", ""Serena Yeung"", ""Jose M. Alvarez""]","[""Federated learning"", ""personalized learning""]","While federated learning traditionally aims to train a single global model across decentralized local datasets, one model may not always be ideal for all participating clients. Here we propose an alternative, where each client only federates with other relevant clients to obtain a stronger model per client-specific objectives. To achieve this personalization, rather than computing a single model average with constant weights for the entire federation as in traditional FL, we efficiently calculate optimal weighted model combinations for each client, based on figuring out how much a client can benefit from another's model. We do not assume knowledge of any underlying data distributions or client similarities, and allow each client to optimize for arbitrary target distributions of interest, enabling greater flexibility for personalization. We evaluate and characterize our method on a variety of federated settings, datasets, and degrees of local data heterogeneity. Our method outperforms existing alternatives, while also enabling new features for personalized FL such as transfer outside of local data distributions.",/pdf/f4b5b0c02b8e08918168f163c5f057cd709802b9.pdf,ICLR,2021,"We propose a new federated learning framework that efficiently computes a personalized weighted combination of available models for each client, outperforming existing work for personalized federated learning." 
+Gc4MQq-JIgj,QilAf1ZDDw,1601310000000.0,1614990000000.0,2990,Reconnaissance for reinforcement learning with safety constraints,"[""~Shin-ichi_Maeda2"", ""~Hayato_Watahiki1"", ""~Yi_Ouyang1"", ""okada@preferred.jp"", ""~Masanori_Koyama1""]","[""Shin-ichi Maeda"", ""Hayato Watahiki"", ""Yi Ouyang"", ""Shintarou Okada"", ""Masanori Koyama""]","[""Reinforcement Learning"", ""Safety constraints"", ""Constrained Markov Decision Process""]","Practical reinforcement learning problems are often formulated as constrained Markov decision process (CMDP) problems, in which the agent has to maximize the expected return while satisfying a set of prescribed safety constraints. In this study, we consider a situation in which the agent has access to the generative model which provides us with a next state sample for any given state-action pair, and propose a model to solve a CMDP problem by decomposing the CMDP into a pair of MDPs; \textit{reconnaissance} MDP (R-MDP) and \textit{planning} MDP (P-MDP). In R-MDP, we train threat function, the Q-function analogue of danger that can determine whether a given state-action pair is safe or not. In P-MDP, we train a reward-seeking policy while using a fixed threat function to determine the safeness of each action. With the help of generative model, we can efficiently train the threat function by preferentially sampling rare dangerous events. Once the threat function for a baseline policy is computed, we can solve other CMDP problems with different reward and different danger-constraint without the need to re-train the model. We also present an efficient approximation method for the threat function that can greatly reduce the difficulty of solving R-MDP. 
We will demonstrate the efficacy of our method over classical approaches in benchmark dataset and complex collision-free navigation tasks.",/pdf/2a28f091dc0e2dfb6b2c0f6e289582c6e72f542e.pdf,ICLR,2021,We propose a safe RL algorithm that conducts safety-assessment and reward optimization in two separate phases by using a simulator to efficiently train a danger analogue of Q-function +u8X280hw1Mt,n6Sdtbnji8s,1601310000000.0,1614990000000.0,253,EqCo: Equivalent Rules for Self-supervised Contrastive Learning,"[""~Benjin_Zhu1"", ""~Junqiang_Huang1"", ""~Zeming_Li2"", ""~Xiangyu_Zhang1"", ""~Jian_Sun4""]","[""Benjin Zhu"", ""Junqiang Huang"", ""Zeming Li"", ""Xiangyu Zhang"", ""Jian Sun""]",[],"In this paper, we propose a method, named EqCo (Equivalent Rules for Contrastive Learning), to make self-supervised learning irrelevant to the number of negative samples in the contrastive learning framework. Inspired by the InfoMax principle, we point that the margin term in contrastive loss needs to be adaptively scaled according to the number of negative pairs in order to keep steady mutual information bound and gradient magnitude. EqCo bridges the performance gap among a wide range of negative sample sizes, so that for the first time, we can use only a few negative pairs (e.g. 16 per query) to perform self-supervised contrastive training on large-scale vision datasets like ImageNet, while with almost no accuracy drop. This is quite a contrast to the widely used large batch training or memory bank mechanism in current practices. 
Equipped with EqCo, our simplified MoCo (SiMo) achieves comparable accuracy with MoCo v2 on ImageNet (linear evaluation protocol) while only involves 16 negative pairs per query instead of 65536, suggesting that large quantities of negative samples might not be a critical factor in contrastive learning frameworks.",/pdf/13d95e0522001abb6f4e81e7f18c20a3cf4b51a8.pdf,ICLR,2021, +n4IMHNb8_f,7HxRji4VSu,1601310000000.0,1614990000000.0,391,Differentiable Spatial Planning using Transformers,"[""~Devendra_Singh_Chaplot2"", ""~Deepak_Pathak1"", ""~Jitendra_Malik2""]","[""Devendra Singh Chaplot"", ""Deepak Pathak"", ""Jitendra Malik""]","[""Planning"", ""Spatial planning"", ""Path planning"", ""Navigation"", ""Manipulation"", ""Robotics""]","We consider the problem of spatial path planning. In contrast to the classical solutions which optimize a new plan from scratch and assume access to the full map with ground truth obstacle locations, we learn a planner from the data in a differentiable manner that allows us to leverage statistical regularities from past data. We propose Spatial Planning Transformers (SPT), which given an obstacle map learns to generate actions by planning over long-range spatial dependencies, unlike prior data-driven planners that propagate information locally via convolutional structure in an iterative manner. In the setting where the ground truth map is not known to the agent, we leverage pre-trained SPTs to in an end-to-end framework that has the structure of mapper and planner built into it which allows seamless generalization to out-of-distribution maps and goals. 
SPTs outperform prior state-of-the-art across all the setups for both manipulation and navigation tasks, leading to an absolute improvement of 7-19%.",/pdf/70171cff87632105fa7654230cccf371da3ced77.pdf,ICLR,2021,A differentiable spatial planning model designed for long-distance spatial reasoning using Transformers which allows end-to-end mapping and planning without access to ground-truth maps. +AWOSz_mMAPx,TAEGolbL3h,1601310000000.0,1616800000000.0,1497,Local Convergence Analysis of Gradient Descent Ascent with Finite Timescale Separation,"[""~Tanner_Fiez1"", ""~Lillian_J_Ratliff1""]","[""Tanner Fiez"", ""Lillian J Ratliff""]","[""game theory"", ""continuous games"", ""generative adversarial networks"", ""theory"", ""gradient descent-ascent"", ""equilibrium"", ""convergence""]","We study the role that a finite timescale separation parameter $\tau$ has on gradient descent-ascent in non-convex, non-concave zero-sum games where the learning rate of player 1 is denoted by $\gamma_1$ and the learning rate of player 2 is defined to be $\gamma_2=\tau\gamma_1$. We provide a non-asymptotic construction of the finite timescale separation parameter $\tau^{\ast}$ such that gradient descent-ascent locally converges to $x^{\ast}$ for all $\tau \in (\tau^{\ast}, \infty)$ if and only if it is a strict local minmax equilibrium. Moreover, we provide explicit local convergence rates given the finite timescale separation. The convergence results we present are complemented by a non-convergence result: given a critical point $x^{\ast}$ that is not a strict local minmax equilibrium, we present a non-asymptotic construction of a finite timescale separation $\tau_{0}$ such that gradient descent-ascent with timescale separation $\tau\in (\tau_0, \infty)$ does not converge to $x^{\ast}$. 
Finally, we extend the results to gradient penalty regularization methods for generative adversarial networks and empirically demonstrate on CIFAR-10 and CelebA the significant impact timescale separation has on training performance. ",/pdf/f5f6da5642532f0e8370b55e9965ca3ee0dcc3c1.pdf,ICLR,2021,We show that there exists a range of finite learning ratios which we construct such that gradient descent-ascent converges to a critical point if and only if it is a strict local minmax equilibrium +dKwmCtp6YI,2quazUJ3D4i,1601310000000.0,1614990000000.0,3761,Representation and Bias in Multilingual NLP: Insights from Controlled Experiments on Conditional Language Modeling,"[""~Ada_Wan1""]","[""Ada Wan""]","[""multilinguality"", ""science for NLP"", ""fundamental science in the era of AI/DL"", ""representation learning for language"", ""conditional language modeling"", ""Transformer"", ""Double Descent"", ""non-monotonicity"", ""fairness"", ""meta evaluation"", ""visualization or interpretation of learned representations""]","Inspired by the phenomenon of performance disparity between languages in machine translation, we investigate whether and to what extent languages are equally hard to ""conditional-language-model"". Our goal is to improve our understanding and expectation of the relationship between language, data representation, size, and performance. We study one-to-one, bilingual conditional language modeling through a series of systematically controlled experiments with the Transformer and the 6 languages from the United Nations Parallel Corpus. We examine character, byte, and word models in 30 language directions and 5 data sizes, and observe indications suggesting a script bias on the character level, a length bias on the byte level, and a word bias that gives rise to a hierarchy in performance across languages. 
We also identify two types of sample-wise non-monotonicity --- while word-based representations are prone to exhibit Double Descent, length can induce unstable performance across the size range studied in a novel meta phenomenon which we term ""erraticity"". By eliminating statistically significant performance disparity on the character and byte levels by normalizing length and vocabulary in the data, we show that, in the context of computing with the Transformer, there is no complexity intrinsic to languages other than that related to their statistical attributes and that performance disparity is not a necessary condition but a byproduct of word segmentation. Our application of statistical comparisons as a fairness measure also serves as a novel rigorous method for the intrinsic evaluation of languages, resolving a decades-long debate on language complexity. While these quantitative biases leading to disparity are mitigable through a shallower network, we find room for a human bias to be reflected upon. We hope our work helps open up new directions in the area of language and computing that would be fairer and more flexible and foster a new transdisciplinary perspective for DL-inspired scientific progress.",/pdf/c6b1ca1bd5165e52741b5ced23cd3296c823f630.pdf,ICLR,2021,"We study the relationship between language, representation, size, and performance in Transformer conditional language models, and find, among other things, that statistically significant performance disparity can be a byproduct of word segmentation. 
" +BJg_roAcK7,Bkgsfj-FKm,1538090000000.0,1549390000000.0,98,INVASE: Instance-wise Variable Selection using Neural Networks,"[""jsyoon0823@gmail.com"", ""james.jordon@wolfson.ox.ac.uk"", ""mihaela.vanderschaar@eng.ox.ac.uk""]","[""Jinsung Yoon"", ""James Jordon"", ""Mihaela van der Schaar""]","[""Instance-wise feature selection"", ""interpretability"", ""actor-critic methodology""]","The advent of big data brings with it data with more and more dimensions and thus a growing need to be able to efficiently select which features to use for a variety of problems. While global feature selection has been a well-studied problem for quite some time, only recently has the paradigm of instance-wise feature selection been developed. In this paper, we propose a new instance-wise feature selection method, which we term INVASE. INVASE consists of 3 neural networks, a selector network, a predictor network and a baseline network which are used to train the selector network using the actor-critic methodology. Using this methodology, INVASE is capable of flexibly discovering feature subsets of a different size for each instance, which is a key limitation of existing state-of-the-art methods. 
We demonstrate through a mixture of synthetic and real data experiments that INVASE significantly outperforms state-of-the-art benchmarks.",/pdf/00c05f84ed352de8f9bc1d8ab379496d4afe5008.pdf,ICLR,2019, +B1epooR5FX,r1xFjyacYm,1538090000000.0,1545360000000.0,664,Predicted Variables in Programming,"[""victor.carbune@gmail.com"", ""thierryc@google.com"", ""shurick@google.com"", ""deselaers@google.com"", ""nikhilsarda@google.com"", ""jyagnik@google.com""]","[""Victor Carbune"", ""Thierry Coppey"", ""Alexander Daryin"", ""Thomas Deselaers"", ""Nikhil Sarda"", ""Jay Yagnik""]","[""predicted variables"", ""machine learning"", ""programming"", ""computing systems"", ""reinforcement learning""]","We present Predicted Variables, an approach to making machine learning (ML) a first class citizen in programming languages. +There is a growing divide in approaches to building systems: using human experts (e.g. programming) on the one hand, and using behavior learned from data (e.g. ML) on the other hand. PVars aim to make using ML in programming easier by hybridizing the two. We leverage the existing concept of variables and create a new type, a predicted variable. PVars are akin to native variables with one important distinction: PVars determine their value using ML when evaluated. We describe PVars and their interface, how they can be used in programming, and demonstrate the feasibility of our approach on three algorithmic problems: binary search, QuickSort, and caches. +We show experimentally that PVars are able to improve over the commonly used heuristics and lead to a better performance than the original algorithms. +As opposed to previous work applying ML to algorithmic problems, PVars have the advantage that they can be used within the existing frameworks and do not require the existing domain knowledge to be replaced. PVars allow for a seamless integration of ML into existing systems and algorithms. 
+Our PVars implementation currently relies on standard Reinforcement Learning (RL) methods. To learn faster, PVars use the heuristic function, which they are replacing, as an initial function. We show that PVars quickly pick up the behavior of the initial function and then improve performance beyond that without ever performing substantially worse -- allowing for a safe deployment in critical applications.",/pdf/5b01bfae84014a02afcf507715e6de9bbf62a58a.pdf,ICLR,2019,"We present Predicted Variables, an approach to making machine learning a first class citizen in programming languages." +rkePU0VYDr,HklI--DuPH,1569440000000.0,1577170000000.0,1141,A Perturbation Analysis of Input Transformations for Adversarial Attacks,"[""ady@uchicago.edu"", ""skr@uchicago.edu""]","[""Adam Dziedzic"", ""Sanjay Krishnan""]","[""adversarial examples"", ""defenses"", ""stochastic channels"", ""deterministic channels"", ""input transformations"", ""compression"", ""noise"", ""convolutional neural networks""]","The existence of adversarial examples, or intentional mis-predictions constructed from small changes to correctly predicted examples, is one of the most significant challenges in neural network research today. Ironically, many new defenses are based on a simple observation - the adversarial inputs themselves are not robust and small perturbations to the attacking input often recover the desired prediction. While the intuition is somewhat clear, a detailed understanding of this phenomenon is missing from the research literature. This paper presents a comprehensive experimental analysis of when and why perturbation defenses work and potential mechanisms that could explain their effectiveness (or ineffectiveness) in different settings.",/pdf/5c77fccb170657ac648bde801b6347871a527b3b.pdf,ICLR,2020,We identify a family of defense techniques and show that both deterministic lossy compression and randomized perturbations to the input lead to similar gains in robustness. 
+H1zeHnA9KX,H1lrLeRqFm,1538090000000.0,1551230000000.0,1511,Representing Formal Languages: A Comparison Between Finite Automata and Recurrent Neural Networks ,"[""jjm7@rice.edu"", ""ameesh@rice.edu"", ""averma@rice.edu"", ""richb@rice.edu"", ""swarat@rice.edu"", ""abp4@rice.edu""]","[""Joshua J. Michalenko"", ""Ameesh Shah"", ""Abhinav Verma"", ""Richard G. Baraniuk"", ""Swarat Chaudhuri"", ""Ankit B. Patel""]","[""Language recognition"", ""Recurrent Neural Networks"", ""Representation Learning"", ""deterministic finite automaton"", ""automaton""]","We investigate the internal representations that a recurrent neural network (RNN) uses while learning to recognize a regular formal language. Specifically, we train an RNN on positive and negative examples from a regular language, and ask if there is a simple decoding function that maps states of this RNN to states of the minimal deterministic finite automaton (MDFA) for the language. Our experiments show that such a decoding function indeed exists, and that it maps states of the RNN not to MDFA states, but to states of an {\em abstraction} obtained by clustering small sets of MDFA states into ``superstates''. A qualitative analysis reveals that the abstraction often has a simple interpretation. Overall, the results suggest a strong structural relationship between internal representations used by RNNs and finite automata, and explain the well-known ability of RNNs to recognize formal grammatical structure. +",/pdf/68f7fcf247d9252606cf85dbbecb61c23299995f.pdf,ICLR,2019,Finite Automata Can be Linearly decoded from Language-Recognizing RNNs using low coarseness abstraction functions and high accuracy decoders. 
+23ZjUGpjcc,vN1AcwhYCEK,1601310000000.0,1616060000000.0,2224,Scalable Transfer Learning with Expert Models,"[""~Joan_Puigcerver1"", ""~Carlos_Riquelme_Ruiz1"", ""basilm@google.com"", ""~Cedric_Renggli1"", ""~Andr\u00e9_Susano_Pinto1"", ""~Sylvain_Gelly1"", ""~Daniel_Keysers2"", ""~Neil_Houlsby1""]","[""Joan Puigcerver"", ""Carlos Riquelme Ruiz"", ""Basil Mustafa"", ""Cedric Renggli"", ""Andr\u00e9 Susano Pinto"", ""Sylvain Gelly"", ""Daniel Keysers"", ""Neil Houlsby""]","[""Transfer Learning"", ""Expert Models"", ""Few Shot""]","Transfer of pre-trained representations can improve sample efficiency and reduce computational requirements for new tasks. However, representations used for transfer are usually generic, and are not tailored to a particular distribution of downstream tasks. We explore the use of expert representations for transfer with a simple, yet effective, strategy. We train a diverse set of experts by exploiting existing label structures, and use cheap-to-compute performance proxies to select the relevant expert for each target task. This strategy scales the process of transferring to new tasks, since it does not revisit the pre-training data during transfer. Accordingly, it requires little extra compute per target task, and results in a speed-up of 2-3 orders of magnitude compared to competing approaches. Further, we provide an adapter-based architecture able to compress many experts into a single model. 
We evaluate our approach on two different data sources and demonstrate that it outperforms baselines on over 20 diverse vision tasks in both cases.",/pdf/659e2338755eb562f4d6d679d55eb83e71fa5007.pdf,ICLR,2021, +J_pvI6ap5Mn,Cx-ckr_XSWo,1601310000000.0,1614990000000.0,862,Transfer Learning of Graph Neural Networks with Ego-graph Information Maximization,"[""~Qi_Zhu7"", ""~Yidan_Xu1"", ""~Haonan_Wang1"", ""~Chao_Zhang9"", ""~Jiawei_Han1"", ""~Carl_Yang1""]","[""Qi Zhu"", ""Yidan Xu"", ""Haonan Wang"", ""Chao Zhang"", ""Jiawei Han"", ""Carl Yang""]","[""Transfer learning"", ""graph neural networks""]","Graph neural networks (GNNs) have been shown with superior performance in various applications, but training dedicated GNNs can be costly for large-scale graphs. Some recent work started to study the pre-training of GNNs. However, none of them provide theoretical insights into the design of their frameworks, or clear requirements and guarantees towards the transferability of GNNs. In this work, we establish a theoretically grounded and practically useful framework for the transfer learning of GNNs. Firstly, we propose a novel view towards the essential graph information and advocate the capturing of it as the goal of transferable GNN training, which motivates the design of EGI (ego-graph information maximization) to analytically achieve this goal. Secondly, we specify the requirement of structure-respecting node features as the GNN input, and conduct a rigorous analysis of GNN transferability based on the difference between the local graph Laplacians of the source and target graphs. Finally, we conduct controlled synthetic experiments to directly justify our theoretical conclusions. 
Extensive experiments on real-world networks towards role identification show consistent results in the rigorously analyzed setting of direct-transferring (freezing parameters), while those towards large-scale relation prediction show promising results in the more generalized and practical setting of transferring with fine-tuning.",/pdf/211081b2e097e517c34c8d8dafd442b12c0c81b5.pdf,ICLR,2021,We establish a theoretically grounded and practically useful framework for the transfer learning of GNNs with experiments for both theoretical and practical scenarios. +rylb3eBtwr,SygDzebFDr,1569440000000.0,1583910000000.0,2527,Robust Subspace Recovery Layer for Unsupervised Anomaly Detection,"[""laixx313@umn.edu"", ""dzou@umn.edu"", ""lerman@umn.edu""]","[""Chieh-Hsin Lai"", ""Dongmian Zou"", ""Gilad Lerman""]","[""robust subspace recovery"", ""unsupervised anomaly detection"", ""outliers"", ""latent space"", ""autoencoder""]","We propose a neural network for unsupervised anomaly detection with a novel robust subspace recovery layer (RSR layer). This layer seeks to extract the underlying subspace from a latent representation of the given data and removes outliers that lie away from this subspace. It is used within an autoencoder. The encoder maps the data into a latent space, from which the RSR layer extracts the subspace. The decoder then smoothly maps back the underlying subspace to a ``manifold"" close to the original inliers. Inliers and outliers are distinguished according to the distances between the original and mapped positions (small for inliers and large for outliers). Extensive numerical experiments with both image and document datasets demonstrate state-of-the-art precision and recall. ",/pdf/4016b3ae0e7a6bfe6bea03eb07e7ba31b98508b6.pdf,ICLR,2020,This work proposes an autoencoder with a novel robust subspace recovery layer for unsupervised anomaly detection and demonstrates state-of-the-art results on various datasets. 
+#NAME?,jGQU7TFpzE7,1601310000000.0,1614990000000.0,133,Can Kernel Transfer Operators Help Flow based Generative Models?,"[""~Zhichun_Huang1"", ""~Rudrasis_Chakraborty1"", ""~Xingjian_Zhen1"", ""~Vikas_Singh1""]","[""Zhichun Huang"", ""Rudrasis Chakraborty"", ""Xingjian Zhen"", ""Vikas Singh""]",[],"Flow-based generative models refer to deep generative models with +tractable likelihoods, and offer several attractive properties including +efficient density estimation and sampling. Despite many advantages, +current formulations (e.g., normalizing flow) often have an expensive memory/runtime footprint, which hinders their use in a number of applications. +In this paper, we consider the setting where we have access to an autoencoder, which is +suitably effective for the dataset of interest. Under some mild conditions, +we show that we can calculate a mapping to a RKHS which subsequently enables deploying +mature ideas from the kernel methods literature for flow-based generative models. Specifically, we can explicitly map the RKHS distribution (i.e., +approximate the flow) to match or align with +a template/well-characterized distribution, via kernel transfer operators. This leads to a direct and resource efficient approximation avoiding iterative optimization. We empirically show that this simple idea yields competitive results on popular datasets such as CelebA, +as well as promising results on a public 3D brain imaging dataset where the sample sizes are much smaller. 
",/pdf/6577eeb79743ec57777c1223249a3e0e553bb4ee.pdf,ICLR,2021,(almost) training-free generative model +Db4yerZTYkz,fHqoJ7pmoiG,1601310000000.0,1615860000000.0,194,Shape-Texture Debiased Neural Network Training,"[""~Yingwei_Li4"", ""~Qihang_Yu1"", ""~Mingxing_Tan3"", ""~Jieru_Mei2"", ""~Peng_Tang1"", ""~Wei_Shen2"", ""~Alan_Yuille1"", ""~cihang_xie1""]","[""Yingwei Li"", ""Qihang Yu"", ""Mingxing Tan"", ""Jieru Mei"", ""Peng Tang"", ""Wei Shen"", ""Alan Yuille"", ""cihang xie""]","[""data augmentation"", ""representation learning"", ""debiased training""]","Shape and texture are two prominent and complementary cues for recognizing objects. Nonetheless, Convolutional Neural Networks are often biased towards either texture or shape, depending on the training dataset. Our ablation shows that such bias degenerates model performance. Motivated by this observation, we develop a simple algorithm for shape-texture debiased learning. To prevent models from exclusively attending on a single cue in representation learning, we augment training data with images with conflicting shape and texture information (eg, an image of chimpanzee shape but with lemon texture) and, most importantly, provide the corresponding supervisions from shape and texture simultaneously. + +Experiments show that our method successfully improves model performance on several image recognition benchmarks and adversarial robustness. For example, by training on ImageNet, it helps ResNet-152 achieve substantial improvements on ImageNet (+1.2%), ImageNet-A (+5.2%), ImageNet-C (+8.3%) and Stylized-ImageNet (+11.1%), and on defending against FGSM adversarial attacker on ImageNet (+14.4%). Our method also claims to be compatible with other advanced data augmentation strategies, eg, Mixup, and CutMix. 
The code is available here: https://github.com/LiYingwei/ShapeTextureDebiasedTraining.",/pdf/95feebd9ddd0cc554ef18bf3b5cfce3f76bb9dd6.pdf,ICLR,2021,Training CNNs to acquire a debiased shape-texture representation improves image recognition. +rJG8asRqKX,HygfP7BFKm,1538090000000.0,1545360000000.0,806,A Deep Learning Approach for Dynamic Survival Analysis with Competing Risks,"[""chl8856@gmail.com"", ""mihaela@ee.ucla.edu""]","[""Changhee Lee"", ""Mihaela van der Schaar""]","[""dynamic survival analysis"", ""survival analysis"", ""longitudinal measurements"", ""competing risks""]","Currently available survival analysis methods are limited in their ability to deal with complex, heterogeneous, and longitudinal data such as that available in primary care records, or in their ability to deal with multiple competing risks. This paper develops a novel deep learning architecture that flexibly incorporates the available longitudinal data comprising various repeated measurements (rather than only the last available measurements) in order to issue dynamically updated survival predictions for one or multiple competing risk(s). Unlike existing works in the survival analysis on the basis of longitudinal data, the proposed method learns the time-to-event distributions without specifying underlying stochastic assumptions of the longitudinal or the time-to-event processes. Thus, our method is able to learn associations between the longitudinal data and the various associated risks in a fully data-driven fashion. We demonstrate the power of our method by applying it to real-world longitudinal datasets and show a drastic improvement over state-of-the-art methods in discriminative performance. 
Furthermore, our analysis of the variable importance and dynamic survival predictions will yield a better understanding of the predicted risks which will result in more effective health care.",/pdf/0a2096bd44c86cbe7725ab0f21b2580a72570730.pdf,ICLR,2019, +Bkx29TVFPr,rygpLmCDvB,1569440000000.0,1577170000000.0,721,An implicit function learning approach for parametric modal regression,"[""pan6@ualberta.ca"", ""whitem@ualberta.ca"", ""farahmand@vectorinstitute.ai""]","[""Yangchen Pan"", ""Martha White"", ""Amir-massoud Farahmand""]","[""regression"", ""modal regression"", ""implicit function theorem"", ""multivalue function""]","For multi-valued functions---such as when the conditional distribution on targets given the inputs is multi-modal---standard regression approaches are not always desirable because they provide the conditional mean. Modal regression approaches aim to instead find the conditional mode, but are restricted to nonparametric approaches. Such approaches can be difficult to scale, and make it difficult to benefit from parametric function approximation, like neural networks, which can learn complex relationships between inputs and targets. In this work, we propose a parametric modal regression algorithm, by using the implicit function theorem to develop an objective for learning a joint parameterized function over inputs and targets. We empirically demonstrate on several synthetic problems that our method (i) can learn multi-valued functions and produce the conditional modes, (ii) scales well to high-dimensional inputs and (iii) is even more effective for certain unimodal problems, particularly for high frequency data where the joint function over inputs and targets can better capture the complex relationship between them. We conclude by showing that our method provides small improvements on two regression datasets that have asymmetric distributions over the targets. 
",/pdf/9f564c062bbaafd9996290ddc42e916deb322bce.pdf,ICLR,2020,We introduce a simple and novel modal regression algorithm which is easy to scale to large problems. +BkDB51WR-,rkES9yWAZ,1509120000000.0,1518730000000.0,508,Learning temporal evolution of probability distribution with Recurrent Neural Network,"[""kyeo@us.ibm.com"", ""igor.melnyk@ibm.com"", ""nnguyen@us.ibm.com"", ""eunkyung.lee@us.ibm.com""]","[""Kyongmin Yeo"", ""Igor Melnyk"", ""Nam Nguyen"", ""Eun Kyung Lee""]","[""predictive distribution estimation"", ""probabilistic RNN"", ""uncertainty in time series prediction""]","We propose to tackle a time series regression problem by computing temporal evolution of a probability density function to provide a probabilistic forecast. A Recurrent Neural Network (RNN) based model is employed to learn a nonlinear operator for temporal evolution of a probability density function. We use a softmax layer for a numerical discretization of a smooth probability density functions, which transforms a function approximation problem to a classification task. Explicit and implicit regularization strategies are introduced to impose a smoothness condition on the estimated probability distribution. A Monte Carlo procedure to compute the temporal evolution of the distribution for a multiple-step forecast is presented. 
The evaluation of the proposed algorithm on three synthetic and two real data sets shows advantage over the compared baselines.",/pdf/c318b430cf6264e0d7771c2b6a1c6292ef7953c6.pdf,ICLR,2018,Proposed RNN-based algorithm to estimate predictive distribution in one- and multi-step forecasts in time series prediction problems +HJeu43ActQ,r1guzLFcK7,1538090000000.0,1566940000000.0,1464,NOODL: Provable Online Dictionary Learning and Sparse Coding,"[""rambh002@umn.edu"", ""lixx1661@umn.edu"", ""jdhaupt@umn.edu""]","[""Sirisha Rambhatla"", ""Xingguo Li"", ""Jarvis Haupt""]","[""dictionary learning"", ""provable dictionary learning"", ""online dictionary learning"", ""sparse coding"", ""support recovery"", ""iterative hard thresholding"", ""matrix factorization"", ""neural architectures"", ""neural networks"", ""noodl""]","We consider the dictionary learning problem, where the aim is to model the given data as a linear combination of a few columns of a matrix known as a dictionary, where the sparse weights forming the linear combination are known as coefficients. Since the dictionary and coefficients, parameterizing the linear model are unknown, the corresponding optimization is inherently non-convex. This was a major challenge until recently, when provable algorithms for dictionary learning were proposed. Yet, these provide guarantees only on the recovery of the dictionary, without explicit recovery guarantees on the coefficients. Moreover, any estimation error in the dictionary adversely impacts the ability to successfully localize and estimate the coefficients. This potentially limits the utility of existing provable dictionary learning methods in applications where coefficient recovery is of interest. To this end, we develop NOODL: a simple Neurally plausible alternating Optimization-based Online Dictionary Learning algorithm, which recovers both the dictionary and coefficients exactly at a geometric rate, when initialized appropriately. 
Our algorithm, NOODL, is also scalable and amenable for large scale distributed implementations in neural architectures, by which we mean that it only involves simple linear and non-linear operations. Finally, we corroborate these theoretical results via experimental evaluation of the proposed algorithm with the current state-of-the-art techniques.",/pdf/17f3ae6a519c4d562612d8654adf43f7d1ad07c4.pdf,ICLR,2019,We present a provable algorithm for exactly recovering both factors of the dictionary learning model. +rkVOXhAqY7,HJerTRnqtX,1538090000000.0,1545360000000.0,1375,The Conditional Entropy Bottleneck,"[""iansf@google.com""]","[""Ian Fischer""]","[""representation learning"", ""information theory"", ""uncertainty"", ""out-of-distribution detection"", ""adversarial example robustness"", ""generalization"", ""objective function""]","We present a new family of objective functions, which we term the Conditional Entropy Bottleneck (CEB). These objectives are motivated by the Minimum Necessary Information (MNI) criterion. We demonstrate the application of CEB to classification tasks. We show that CEB gives: well-calibrated predictions; strong detection of challenging out-of-distribution examples and powerful whitebox adversarial examples; and substantial robustness to those adversaries. Finally, we report that CEB fails to learn from information-free datasets, providing a possible resolution to the problem of generalization observed in Zhang et al. (2016).",/pdf/1269d9d4357927908ebb723d892d7d0a79f171ec.pdf,ICLR,2019,The Conditional Entropy Bottleneck is an information-theoretic objective function for learning optimal representations. 
+BJlxm30cKm,SJelTc6cFm,1538090000000.0,1548970000000.0,1326,An Empirical Study of Example Forgetting during Deep Neural Network Learning,"[""mariya.k.toneva@gmail.com"", ""alsordon@microsoft.com"", ""retachet@microsoft.com"", ""adtrisch@microsoft.com"", ""yoshua.bengio@mila.quebec"", ""geoff.gordon@microsoft.com""]","[""Mariya Toneva*"", ""Alessandro Sordoni*"", ""Remi Tachet des Combes*"", ""Adam Trischler"", ""Yoshua Bengio"", ""Geoffrey J. Gordon""]","[""catastrophic forgetting"", ""sample weighting"", ""deep generalization""]","Inspired by the phenomenon of catastrophic forgetting, we investigate the learning dynamics of neural networks as they train on single classification tasks. Our goal is to understand whether a related phenomenon occurs when data does not undergo a clear distributional shift. We define a ``forgetting event'' to have occurred when an individual training example transitions from being classified correctly to incorrectly over the course of learning. Across several benchmark data sets, we find that: (i) certain examples are forgotten with high frequency, and some not at all; (ii) a data set's (un)forgettable examples generalize across neural architectures; and (iii) based on forgetting dynamics, a significant fraction of examples can be omitted from the training data set while still maintaining state-of-the-art generalization performance.",/pdf/e6f24f0a844c3f9322d34994c197b5051f9603ad.pdf,ICLR,2019,We show that catastrophic forgetting occurs within what is considered to be a single task and find that examples that are not prone to forgetting can be removed from the training set without loss of generalization. 
+Hyffti0ctQ,S1eUDhAtKX,1538090000000.0,1545360000000.0,425,PRUNING WITH HINTS: AN EFFICIENT FRAMEWORK FOR MODEL ACCELERATION,"[""weigao1996@outlook.com"", ""wei-y15@mails.tsinghua.edu.cn"", ""liquanquan@sensetime.com"", ""qinghongwei@sensetime.com"", ""wanli.ouyang@sydney.edu.cn"", ""yanjunjie@outlook.com""]","[""Wei Gao"", ""Yi Wei"", ""Quanquan Li"", ""Hongwei Qin"", ""Wanli Ouyang"", ""Junjie Yan""]","[""model acceleration"", ""mimic"", ""knowledge distillation"", ""channel pruning""]","In this paper, we propose an efficient framework to accelerate convolutional neural networks. We utilize two types of acceleration methods: pruning and hints. Pruning can reduce model size by removing channels of layers. Hints can improve the performance of student model by transferring knowledge from teacher model. We demonstrate that pruning and hints are complementary to each other. On one hand, hints can benefit pruning by maintaining similar feature representations. On the other hand, the model pruned from teacher networks is a good initialization for student model, which increases the transferability between two networks. Our approach performs pruning stage and hints stage iteratively to further improve the +performance. Furthermore, we propose an algorithm to reconstruct the parameters of hints layer and make the pruned model more suitable for hints. Experiments were conducted on various tasks including classification and pose estimation. Results on CIFAR-10, ImageNet and COCO demonstrate the generalization and superiority of our framework.",/pdf/0b8f76189d402de4ecf73ec2373a5df2ab4e931c.pdf,ICLR,2019,This work aims to boost all existing pruning and mimic methods. 
+qHXkE-8c1sQ,VCdfO7Y5Mj,1601310000000.0,1614990000000.0,780,Information distance for neural network functions,"[""~Xiao_Zhang9"", ""~Dejing_Dou1"", ""wuji_ee@mail.tsinghua.edu.cn""]","[""Xiao Zhang"", ""Dejing Dou"", ""Ji Wu""]",[],"We provide a practical distance measure in the space of functions parameterized by neural networks. It is based on the classical information distance, and we propose to replace the uncomputable Kolmogorov complexity with information measured by codelength of prequential coding. We also provide a method for directly estimating the expectation of such codelength with limited examples. Empirically, we show that information distance is invariant with respect to different parameterization of the neural networks. We also verify that information distance can faithfully reflect similarities of neural network functions. Finally, we applied information distance to investigate the relationship between neural network models, and demonstrate the connection between information distance and multiple characteristics and behaviors of neural networks.",/pdf/be072a562eb55c31f2cce4d12b5fd2344c239654.pdf,ICLR,2021,An approximated information distance for measuring the distance (or similarity) between neural network functions. +SkeRTsAcYm,ryg5G0OqYm,1538090000000.0,1549300000000.0,850,Phase-Aware Speech Enhancement with Deep Complex U-Net,"[""kekepa15@snu.ac.kr"", ""blue378@snu.ac.kr"", ""jaesung.huh@navercorp.com"", ""adrian.kim@navercorp.com"", ""jungwoo.ha@navercorp.com"", ""kglee@snu.ac.kr""]","[""Hyeong-Seok Choi"", ""Jang-Hyun Kim"", ""Jaesung Huh"", ""Adrian Kim"", ""Jung-Woo Ha"", ""Kyogu Lee""]","[""speech enhancement"", ""deep learning"", ""complex neural networks"", ""phase estimation""]","Most deep learning-based models for speech enhancement have mainly focused on estimating the magnitude of spectrogram while reusing the phase from noisy speech for reconstruction. This is due to the difficulty of estimating the phase of clean speech. 
To improve speech enhancement performance, we tackle the phase estimation problem in three ways. First, we propose Deep Complex U-Net, an advanced U-Net structured model incorporating well-defined complex-valued building blocks to deal with complex-valued spectrograms. Second, we propose a polar coordinate-wise complex-valued masking method to reflect the distribution of complex ideal ratio masks. Third, we define a novel loss function, weighted source-to-distortion ratio (wSDR) loss, which is designed to directly correlate with a quantitative evaluation measure. Our model was evaluated on a mixture of the Voice Bank corpus and DEMAND database, which has been widely used by many deep learning models for speech enhancement. Ablation experiments were conducted on the mixed dataset showing that all three proposed approaches are empirically valid. Experimental results show that the proposed method achieves state-of-the-art performance in all metrics, outperforming previous approaches by a large margin.",/pdf/1037a035894c085a0b8f9aaa69c8ddec2f2a2587.pdf,ICLR,2019,This paper proposes a novel complex masking method for speech enhancement along with a loss function for efficient phase estimation. +BJlgNh0qKQ,rJeyvH6cF7,1538090000000.0,1550690000000.0,1419,Differentiable Perturb-and-Parse: Semi-Supervised Parsing with a Structured Variational Autoencoder,"[""c.f.corro@uva.nl"", ""i.a.titov@uva.nl""]","[""Caio Corro"", ""Ivan Titov""]","[""differentiable dynamic programming"", ""variational auto-encoder"", ""dependency parsing"", ""semi-supervised learning""]","Human annotation for syntactic parsing is expensive, and large resources are available only for a fraction of languages. A question we ask is whether one can leverage abundant unlabeled texts to improve syntactic parsers, beyond just using the texts to obtain more generalisable lexical features (i.e. beyond word embeddings). 
To this end, we propose a novel latent-variable generative model for semi-supervised syntactic dependency parsing. As exact inference is intractable, we introduce a differentiable relaxation to obtain approximate samples and compute gradients with respect to the parser parameters. Our method (Differentiable Perturb-and-Parse) relies on differentiable dynamic programming over stochastically perturbed edge scores. We demonstrate effectiveness of our approach with experiments on English, French and Swedish.",/pdf/54c60102be78774b7e25c856c7b505cc5d61f63b.pdf,ICLR,2019,Differentiable dynamic programming over perturbed input weights with application to semi-supervised VAE +HyxGB2AcY7,BkgJH2p5FX,1538090000000.0,1551760000000.0,1520,Contingency-Aware Exploration in Reinforcement Learning,"[""jwook@umich.edu"", ""guoyijie@umich.edu"", ""marcin.lukasz.moczulski@gmail.com"", ""junhyuk@umich.edu"", ""neal@nealwu.com"", ""mnorouzi@google.com"", ""honglak@eecs.umich.edu""]","[""Jongwook Choi"", ""Yijie Guo"", ""Marcin Moczulski"", ""Junhyuk Oh"", ""Neal Wu"", ""Mohammad Norouzi"", ""Honglak Lee""]","[""Reinforcement Learning"", ""Exploration"", ""Contingency-Awareness""]","This paper investigates whether learning contingency-awareness and controllable aspects of an environment can lead to better exploration in reinforcement learning. To investigate this question, we consider an instantiation of this hypothesis evaluated on the Arcade Learning Environment (ALE). In this study, we develop an attentive dynamics model (ADM) that discovers controllable elements of the observations, which are often associated with the location of the character in Atari games. The ADM is trained in a self-supervised fashion to predict the actions taken by the agent. The learned contingency information is used as a part of the state representation for exploration purposes. 
We demonstrate that combining actor-critic algorithm with count-based exploration using our representation achieves impressive results on a set of notoriously challenging Atari games due to sparse rewards. For example, we report a state-of-the-art score of >11,000 points on Montezuma's Revenge without using expert demonstrations, explicit high-level information (e.g., RAM states), or supervisory data. Our experiments confirm that contingency-awareness is indeed an extremely powerful concept for tackling exploration problems in reinforcement learning and opens up interesting research questions for further investigations.",/pdf/ecaed193fa8602001719bd9ce2255ca938dd3f44.pdf,ICLR,2019,We investigate contingency-awareness and controllable aspects in exploration and achieve state-of-the-art performance on Montezuma's Revenge without expert demonstrations. +SJgf6Z-0W,B10ZpbWCZ,1509130000000.0,1518730000000.0,728,Predicting Multiple Actions for Stochastic Continuous Control,"[""sanjeev.kumar@in.tum.de"", ""christian.rupprecht@in.tum.de"", ""tombari@in.tum.de"", ""hager@cs.tum.edu""]","[""Sanjeev Kumar"", ""Christian Rupprecht"", ""Federico Tombari"", ""Gregory D. Hager""]","[""Reinforcement Learning"", ""DDPG"", ""Multiple Action Prediction""]","We introduce a new approach to estimate continuous actions using actor-critic algorithms for reinforcement learning problems. Policy gradient methods usually predict one continuous action estimate or parameters of a presumed distribution (most commonly Gaussian) for any given state which might not be optimal as it may not capture the complete description of the target distribution. Our approach instead predicts M actions with the policy network (actor) and then uniformly sample one action during training as well as testing at each state. This allows the agent to learn a simple stochastic policy that has an easy to compute expected return. 
In all experiments, this facilitates better exploration of the state space during training and converges to a better policy. ",/pdf/e5d7fab7bcad1c25ec961bdd9c26b79d91a62440.pdf,ICLR,2018,"We introduce a novel reinforcement learning algorithm, that predicts multiple actions and samples from them." +Skdvd2xAZ,r1wDdheAW,1509110000000.0,1519400000000.0,394,A Scalable Laplace Approximation for Neural Networks,"[""j.ritter@cs.ucl.ac.uk"", ""botevmg@gmail.com"", ""d.barber@cs.ucl.ac.uk""]","[""Hippolyt Ritter"", ""Aleksandar Botev"", ""David Barber""]","[""deep learning"", ""neural networks"", ""laplace approximation"", ""bayesian deep learning""]","We leverage recent insights from second-order optimisation for neural networks to construct a Kronecker factored Laplace approximation to the posterior over the weights of a trained network. Our approximation requires no modification of the training procedure, enabling practitioners to estimate the uncertainty of their models currently used in production without having to retrain them. We extensively compare our method to using Dropout and a diagonal Laplace approximation for estimating the uncertainty of a network. We demonstrate that our Kronecker factored method leads to better uncertainty estimates on out-of-distribution data and is more robust to simple adversarial attacks. Our approach only requires calculating two square curvature factor matrices for each layer. Their size is equal to the respective square of the input and output size of the layer, making the method efficient both computationally and in terms of memory usage. We illustrate its scalability by applying it to a state-of-the-art convolutional network architecture.",/pdf/91eec41e2f124ba0e1e7cfc863aba1bc368c9959.pdf,ICLR,2018,We construct a Kronecker factored Laplace approximation for neural networks that leads to an efficient matrix normal distribution over the weights. 
+Mh1Abj33qI,BAYR22pSBI8,1601310000000.0,1614990000000.0,1344,Data-driven Learning of Geometric Scattering Networks,"[""~Alexander_Tong1"", ""frederik.wenkel@umontreal.ca"", ""kincaid.macdonald@yale.edu"", ""~Smita_Krishnaswamy1"", ""~Guy_Wolf1""]","[""Alexander Tong"", ""Frederik Wenkel"", ""Kincaid Macdonald"", ""Smita Krishnaswamy"", ""Guy Wolf""]","[""Graph Neural Networks"", ""GNNs"", ""Geometric Scattering"", ""Radial Basis Network"", ""Graph Signal Processing"", ""Wavelet""]","Many popular graph neural network (GNN) architectures, which are often considered as the current state of the art, rely on encoding graph structure via smoothness or similarity between neighbors. While this approach performs well on a surprising number of standard benchmarks, the efficacy of such models does not translate consistently to more complex domains, such as graph data in the biochemistry domain. We argue that these more complex domains require priors that encourage learning of longer range features rather than oversmoothed signals of standard GNN architectures. Here, we propose an alternative GNN architecture, based on a relaxation of recently proposed geometric scattering transforms, which consists of a cascade of graph wavelet filters. Our learned geometric scattering (LEGS) architecture adaptively tunes these wavelets and their scales to encourage band-pass features to emerge in learned representations. This results in a simplified GNN with significantly fewer learned parameters compared to competing methods. We demonstrate the predictive performance of our method on several biochemistry graph classification benchmarks, as well as the descriptive quality of its learned features in biochemical graph data exploration tasks. 
Our results show that the proposed LEGS network matches or outperforms popular GNNs, as well as the original geometric scattering construction, while retaining certain mathematical properties of its handcrafted (nonlearned) design.",/pdf/4917c6641eb72fc196e1e2402ceabedf2f2e8ec2.pdf,ICLR,2021,We introduce learnable geometric scattering showing theoretical and empirical benefits in graph classification particularly in the biochemical domain. +BJh6Ztuxl,,1478160000000.0,1486460000000.0,60,Fine-grained Analysis of Sentence Embeddings Using Auxiliary Prediction Tasks,"[""yossiadidrum@gmail.com"", ""einatke@il.ibm.com"", ""belinkov@mit.edu"", ""oferl@il.ibm.com"", ""yoav.goldberg@gmail.com""]","[""Yossi Adi"", ""Einat Kermany"", ""Yonatan Belinkov"", ""Ofer Lavi"", ""Yoav Goldberg""]","[""Natural language processing"", ""Deep learning""]","There is a lot of research interest in encoding variable length sentences into fixed +length vectors, in a way that preserves the sentence meanings. Two common +methods include representations based on averaging word vectors, and representations based on the hidden states of recurrent neural networks such as LSTMs. +The sentence vectors are used as features for subsequent machine learning tasks +or for pre-training in the context of deep learning. However, not much is known +about the properties that are encoded in these sentence representations and about +the language information they capture. +We propose a framework that facilitates better understanding of the encoded representations. We define prediction tasks around isolated aspects of sentence structure (namely sentence length, word content, and word order), and score representations by the ability to train a classifier to solve each prediction task when +using the representation as input. We demonstrate the potential contribution of the +approach by analyzing different sentence representation mechanisms. 
The analysis sheds light on the relative strengths of different sentence embedding methods with respect to these low level prediction tasks, and on the effect of the encoded +vector’s dimensionality on the resulting representations.",/pdf/20c136f446b546356902d50e2364756217c88326.pdf,ICLR,2017,A method for analyzing sentence embeddings on a fine-grained level using auxiliary prediction tasks +r1xN5oA5tm,rJgcj98tFX,1538090000000.0,1545360000000.0,527,Phrase-Based Attentions,"[""xuanphi001@e.ntu.edu.sg"", ""srjoty@ntu.edu.sg""]","[""Phi Xuan Nguyen"", ""Shafiq Joty""]","[""neural machine translation"", ""natural language processing"", ""attention"", ""transformer"", ""seq2seq"", ""phrase-based"", ""phrase"", ""n-gram""]","Most state-of-the-art neural machine translation systems, despite being different +in architectural skeletons (e.g., recurrence, convolutional), share an indispensable +feature: the Attention. However, most existing attention methods are token-based +and ignore the importance of phrasal alignments, the key ingredient for the success +of phrase-based statistical machine translation. In this paper, we propose +novel phrase-based attention methods to model n-grams of tokens as attention +entities. We incorporate our phrase-based attentions into the recently proposed +Transformer network, and demonstrate that our approach yields improvements of +1.3 BLEU for English-to-German and 0.5 BLEU for German-to-English translation +tasks, and 1.75 and 1.35 BLEU points in English-to-Russian and Russian-to-English translation tasks +on WMT newstest2014 using WMT’16 training data. +",/pdf/3e2a1f64439a3a8e26f1daf7c2c554d8376a1c29.pdf,ICLR,2019,"Phrase-based attention mechanisms to assign attention on phrases, achieving token-to-phrase, phrase-to-token, phrase-to-phrase attention alignments, in addition to existing token-to-token attentions." 
+rJeU_1SFvr,Hyly9eCuwB,1569440000000.0,1577170000000.0,1804,LOGAN: Latent Optimisation for Generative Adversarial Networks,"[""yanwu@google.com"", ""jeffdonahue@google.com"", ""dbalduzzi@google.com"", ""simonyan@google.com"", ""countzero@google.com""]","[""Yan Wu"", ""Jeff Donahue"", ""David Balduzzi"", ""Karen Simonyan"", ""Timothy Lillicrap""]","[""GAN"", ""adversarial training"", ""generative model"", ""game theory""]","Training generative adversarial networks requires balancing of delicate adversarial dynamics. Even with careful tuning, training may diverge or end up in a bad equilibrium with dropped modes. In this work, we introduce a new form of latent optimisation inspired by the CS-GAN and show that it improves adversarial dynamics by enhancing interactions between the discriminator and the generator. We develop supporting theoretical analysis from the perspectives of differentiable games and stochastic approximation. Our experiments demonstrate that latent optimisation can significantly improve GAN training, obtaining state-of-the-art performance for the ImageNet (128 x 128) dataset. Our model achieves an Inception Score (IS) of 148 and an Frechet Inception Distance (FID) of 3.4, an improvement of 17% and 32% in IS and FID respectively, compared with the baseline BigGAN-deep model with the same architecture and number of parameters.",/pdf/067003bb3a14491d779c8e6211d86588718072d7.pdf,ICLR,2020,Latent optimisation improves adversarial training dynamics. We present both theoretical analysis and state-of-the-art image generation with ImageNet 128x128. +By0ANxbRW,H1jANgZRZ,1509130000000.0,1518730000000.0,577,DNN Model Compression Under Accuracy Constraints,"[""khoram@wisc.edu"", ""jli@ece.wisc.edu""]","[""Soroosh Khoram"", ""Jing Li""]","[""DNN Compression"", ""Weigh-sharing"", ""Model Compression""]","The growing interest to implement Deep Neural Networks (DNNs) on resource-bound hardware has motivated innovation of compression algorithms. 
Using these algorithms, DNN model sizes can be substantially reduced, with little to no accuracy degradation. This is achieved by either eliminating components from the model, or penalizing complexity during training. While both approaches demonstrate considerable compressions, the former often ignores the loss function during compression while the latter produces unpredictable compressions. In this paper, we propose a technique that directly minimizes both the model complexity and the changes in the loss function. In this technique, we formulate compression as a constrained optimization problem, and then present a solution for it. We will show that using this technique, we can achieve competitive results.",/pdf/2dffd9a96b38af32dd4556e9166a0b4bb73917fd.pdf,ICLR,2018,Compressing trained DNN models by minimizing their complexity while constraining their loss. +BygrtoC9Km,Hkl5NRL9Y7,1538090000000.0,1545360000000.0,442,Meta-Learning with Individualized Feature Space for Few-Shot Classification,"[""chunrui.han@vipl.ict.ac.cn"", ""sgshan@ict.ac.cn"", ""kanmeina@ict.ac.cn"", ""shuzhe.wu@vipl.ict.ac.cn"", ""xlchen@ict.ac.cn""]","[""Chunrui Han"", ""Shiguang Shan"", ""Meina Kan"", ""Shuzhe Wu"", ""Xilin Chen""]","[""few-shot classification"", ""meta-learning"", ""individualized feature space""]","Meta-learning provides a promising learning framework to address few-shot classification tasks. In existing meta-learning methods, the meta-learner is designed to learn about model optimization, parameter initialization, or similarity metric. Differently, in this paper, we propose to learn how to create an individualized feature embedding specific to a given query image for better classifying, i.e., given a query image, a specific feature embedding tailored for its characteristics is created accordingly, leading to an individualized feature space in which the query image can be more accurately classified.  
Specifically, we introduce a kernel generator as meta-learner to learn to construct feature embedding for query images. The kernel generator acquires meta-knowledge of generating adequate convolutional kernels for different query images during training, which can generalize to unseen categories without fine-tuning. In two standard few-shot classification data sets, i.e. Omniglot, and \emph{mini}ImageNet, our method shows highly competitive performance. ",/pdf/e40c7c2a4a67dc58a7b19e7b1c848d6ddabedbe1.pdf,ICLR,2019, +SkxzSgStPS,ByxUfuxYvr,1569440000000.0,1577170000000.0,2276,Exploration via Flow-Based Intrinsic Rewards,"[""hellochick@gapp.nthu.edu.tw"", ""ymmoy999@gapp.nthu.edu.tw"", ""romulus@gapp.nthu.edu.tw"", ""cylee@gapp.nthu.edu.tw""]","[""Hsuan-Kung Yang"", ""Po-Han Chiang"", ""Min-Fong Hong"", ""Chun-Yi Lee""]","[""reinforcement learning"", ""exploration"", ""curiosity"", ""optical flow"", ""intrinsic rewards""]","Exploration bonuses derived from the novelty of observations in an environment have become a popular approach to motivate exploration for reinforcement learning (RL) agents in the past few years. Recent methods such as curiosity-driven exploration usually estimate the novelty of new observations by the prediction errors of their system dynamics models. In this paper, we introduce the concept of optical flow estimation from the field of computer vision to the RL domain and utilize the errors from optical flow estimation to evaluate the novelty of new observations. We introduce a flow-based intrinsic curiosity module (FICM) capable of learning the motion features and understanding the observations in a more comprehensive and efficient fashion. We evaluate our method and compare it with a number of baselines on several benchmark environments, including Atari games, Super Mario Bros., and ViZDoom. 
Our results show that the proposed method is superior to the baselines in certain environments, especially for those featuring sophisticated moving patterns or with high-dimensional observation spaces.",/pdf/68041685527637e028c9025779e0dcf2222fd5a3.pdf,ICLR,2020, +rJx4p3NYDB,rkghtU8QPH,1569440000000.0,1583910000000.0,223,Lazy-CFR: fast and near-optimal regret minimization for extensive games with imperfect information,"[""vofhqn@gmail.com"", ""rtz19970824@gmail.com"", ""lijialia16@mails.tsinghua.edu.cn"", ""sproblvem@gmail.com"", ""dcszj@mail.tsinghua.edu.cn""]","[""Yichi Zhou"", ""Tongzheng Ren"", ""Jialian Li"", ""Dong Yan"", ""Jun Zhu""]",[],"Counterfactual regret minimization (CFR) methods are effective for solving two-player zero-sum extensive games with imperfect information with state-of-the-art results. However, the vanilla CFR has to traverse the whole game tree in each round, which is time-consuming in large-scale games. In this paper, we present Lazy-CFR, a CFR algorithm that adopts a lazy update strategy to avoid traversing the whole game tree in each round. We prove that the regret of Lazy-CFR is almost the same as the regret of the vanilla CFR and only needs to visit a small portion of the game tree. Thus, Lazy-CFR is provably faster than CFR. 
Empirical results consistently show that Lazy-CFR is significantly faster than the vanilla CFR.",/pdf/b827827e363ab68758822586e70c803513976da2.pdf,ICLR,2020, +SyG4RiR5Ym,H1g51a6qYQ,1538090000000.0,1545360000000.0,887,Neural Distribution Learning for generalized time-to-event prediction,"[""egil.martinsson@gmail.com"", ""adrian.kim@navercorp.com"", ""jaesung.huh@navercorp.com"", ""jchoo@korea.ac.kr"", ""jungwoo.ha@navercorp.com""]","[""Egil Martinsson"", ""Adrian Kim"", ""Jaesung Huh"", ""Jaegul Choo"", ""Jung-Woo Ha""]","[""Deep Learning"", ""Survival Analysis"", ""Event prediction"", ""Time Series"", ""Probabilistic Programming"", ""Density Networks""]","Predicting the time to the next event is an important task in various domains. +However, due to censoring and irregularly sampled sequences, time-to-event prediction has resulted in limited success only for particular tasks, architectures and data. Using recent advances in probabilistic programming and density networks, we make the case for a generalized parametric survival approach, sequentially predicting a distribution over the time to the next event. +Unlike previous work, the proposed method can use asynchronously sampled features for censored, discrete, and multivariate data. +Furthermore, it achieves good performance and near perfect calibration for probabilistic predictions without using rigid network-architectures, multitask approaches, complex learning schemes or non-trivial adaptations of cox-models. +We firmly establish that this can be achieved in the standard neural network framework by simply switching out the output layer and loss function.",/pdf/de3044062bda980272d477826dc45ef5b02cfaf1.pdf,ICLR,2019,We present a general solution to event prediction that has been there all along; Discrete Time Parametric Survival Analysis. 
+Sk2Im59ex,,1478300000000.0,1489320000000.0,533,Unsupervised Cross-Domain Image Generation,"[""yaniv@fb.com"", ""adampolyak@fb.com"", ""wolf@fb.com""]","[""Yaniv Taigman"", ""Adam Polyak"", ""Lior Wolf""]","[""Computer vision"", ""Deep learning"", ""Unsupervised Learning"", ""Transfer Learning""]","We study the problem of transferring a sample in one domain to an analog sample in another domain. Given two related domains, S and T, we would like to learn a generative function G that maps an input sample from S to the domain T, such that the output of a given representation function f, which accepts inputs in either domains, would remain unchanged. Other than f, the training data is unsupervised and consist of a set of samples from each domain, without any mapping between them. The Domain Transfer Network (DTN) we present employs a compound loss function that includes a multiclass GAN loss, an f preserving component, and a regularizing component that encourages G to map samples from T to themselves. We apply our method to visual domains including digits and face images and demonstrate its ability to generate convincing novel images of previously unseen entities, while preserving their identity.",/pdf/2b3b65044995627de4b782921d5b17dbe9086eb0.pdf,ICLR,2017, +HkxCzeHFDB,HJeNt7xtwH,1569440000000.0,1583910000000.0,2193,Functional Regularisation for Continual Learning with Gaussian Processes,"[""mtitsias@google.com"", ""schwarzjn@google.com"", ""alexmatthews@google.com"", ""razp@google.com"", ""ywteh@google.com""]","[""Michalis K. Titsias"", ""Jonathan Schwarz"", ""Alexander G. de G. Matthews"", ""Razvan Pascanu"", ""Yee Whye Teh""]","[""Continual Learning"", ""Gaussian Processes"", ""Lifelong learning"", ""Incremental Learning""]","We introduce a framework for Continual Learning (CL) based on Bayesian inference over the function space rather than the parameters of a deep neural network. 
This method, referred to as functional regularisation for Continual Learning, avoids forgetting a previous task by constructing and memorising an approximate posterior belief over the underlying task-specific function. To achieve this we rely on a Gaussian process obtained by treating the weights of the last layer of a neural network as random and Gaussian distributed. Then, the training algorithm sequentially encounters tasks and constructs posterior beliefs over the task-specific functions by using inducing point sparse Gaussian process methods. At each step a new task is first learnt and then a summary is constructed consisting of (i) inducing inputs – a fixed-size subset of the task inputs selected such that it optimally represents the task – and (ii) a posterior distribution over the function values at these inputs. This summary then regularises learning of future tasks, through Kullback-Leibler regularisation terms. Our method thus unites approaches focused on (pseudo-)rehearsal with those derived from a sequential Bayesian inference perspective in a principled way, leading to strong results on accepted benchmarks.",/pdf/71078be0f0f1886bbfc8f485c61b5fd434452161.pdf,ICLR,2020,Using inducing point sparse Gaussian process methods to overcome catastrophic forgetting in neural networks. +r1dHXnH6-,SkDH73HT-,1508390000000.0,1519240000000.0,16,Natural Language Inference over Interaction Space,"[""yichen.gong@nyu.edu"", ""heng.luo@hobot.cc"", ""jian.zhang@hobot.cc""]","[""Yichen Gong"", ""Heng Luo"", ""Jian Zhang""]","[""natural language inference"", ""attention"", ""SoTA"", ""natural language understanding""]","Natural Language Inference (NLI) task requires an agent to determine the logical relationship between a natural language premise and a natural language hypothesis. 
We introduce Interactive Inference Network (IIN), a novel class of neural network architectures that is able to achieve high-level understanding of the sentence pair by hierarchically extracting semantic features from interaction space. We show that an interaction tensor (attention weight) contains semantic information to solve natural language inference, and a denser interaction tensor contains richer semantic information. One instance of such architecture, Densely Interactive Inference Network (DIIN), demonstrates the state-of-the-art performance on large scale NLI corpora and large-scale NLI alike corpus. It's noteworthy that DIIN achieves a greater than 20% error reduction on the challenging Multi-Genre NLI (MultiNLI) dataset with respect to the strongest published system.",/pdf/be3d097594e41fedc69ab61eed03aa8d3b7122c1.pdf,ICLR,2018,show multi-channel attention weight contains semantic feature to solve natural language inference task. +RmB-88r9dL,W0dj04I17qT,1601310000000.0,1615710000000.0,2143,VCNet and Functional Targeted Regularization For Learning Causal Effects of Continuous Treatments,"[""lizhen@uchicago.edu"", ""~Mao_Ye11"", ""~qiang_liu4"", ""nicolae@galton.uchicago.edu""]","[""Lizhen Nie"", ""Mao Ye"", ""qiang liu"", ""Dan Nicolae""]","[""causal inference"", ""continuous treatment effect"", ""doubly robustness""]","Motivated by the rising abundance of observational data with continuous treatments, we investigate the problem of estimating the average dose-response curve (ADRF). Available parametric methods are limited in their model space, and previous attempts in leveraging neural network to enhance model expressiveness relied on partitioning continuous treatment into blocks and using separate heads for each block; this however produces in practice discontinuous ADRFs. Therefore, the question of how to adapt the structure and training of neural network to estimate ADRFs remains open. This paper makes two important contributions. 
First, we propose a novel varying coefficient neural network (VCNet) that improves model expressiveness while preserving continuity of the estimated ADRF. Second, to improve finite sample performance, we generalize targeted regularization to obtain a doubly robust estimator of the whole ADRF curve.",/pdf/5c2b12078d2981db4bb6e855b2055a732e936fca.pdf,ICLR,2021,We propose a varying coefficient network and a functional targeted regularization for estimating continuous treatment. +S1di0sfgl,,1477780000000.0,1488960000000.0,12,Hierarchical Multiscale Recurrent Neural Networks,"[""junyoung.chung@umontreal.ca"", ""sungjin.ahn@umontreal.ca"", ""yoshua.bengio@umontreal.ca""]","[""Junyoung Chung"", ""Sungjin Ahn"", ""Yoshua Bengio""]","[""Natural language processing"", ""Deep learning""]","Learning both hierarchical and temporal representation has been among the long- standing challenges of recurrent neural networks. Multiscale recurrent neural networks have been considered as a promising approach to resolve this issue, yet there has been a lack of empirical evidence showing that this type of models can actually capture the temporal dependencies by discovering the latent hierarchical structure of the sequence. In this paper, we propose a novel multiscale approach, called the hierarchical multiscale recurrent neural network, that can capture the latent hierarchical structure in the sequence by encoding the temporal dependencies with different timescales using a novel update mechanism. We show some evidence that the proposed model can discover underlying hierarchical structure in the sequences without using explicit boundary information. We evaluate our proposed model on character-level language modelling and handwriting sequence generation.",/pdf/7b450081bcf1cf199061ff7348f79ed6d403ecd7.pdf,ICLR,2017,Propose a recurrent neural network architecture that can discover the underlying hierarchical structure in the temporal data. 
+rJe1DTNYPH,HygnWmcDwB,1569440000000.0,1577170000000.0,581,Towards Disentangling Non-Robust and Robust Components in Performance Metric,"[""shiyujun1016@gmail.com"", ""bliao@tencent.com"", ""gycchen@tencent.com"", ""nk12csly@mail.nankai.edu.cn"", ""cmm@nankai.edu.cn"", ""elefjia@nus.edu.sg""]","[""Yujun Shi"", ""Benben Liao"", ""Guangyong Chen"", ""Yun Liu"", ""Ming-ming Cheng"", ""Jiashi Feng""]","[""adversarial examples"", ""robust machine learning""]","The vulnerability to slight input perturbations is a worrying yet intriguing property of deep neural networks (DNNs). Though some efforts have been devoted to investigating the reason behind such adversarial behavior, the relation between standard accuracy and adversarial behavior of DNNs is still little understood. In this work, we reveal such relation by first introducing a metric characterizing the standard performance of DNNs. Then we theoretically show this metric can be disentangled into an information-theoretic non-robust component that is related to adversarial behavior, and a robust component. Then, we show by experiments that DNNs under standard training rely heavily on optimizing the non-robust component in achieving decent performance. We also demonstrate current state-of-the-art adversarial training algorithms indeed try to robustify DNNs by preventing them from using the non-robust component to distinguish samples from different categories. Based on our findings, we take a step forward and point out the possible direction of simultaneously achieving decent standard generalization and adversarial robustness. 
It is hoped that our theory can further inspire the community to make more interesting discoveries about the relation between standard accuracy and adversarial robustness of DNNs.",/pdf/b3b82f5276e1b8a4cd60c431e65cfc4c565e5fc9.pdf,ICLR,2020,We show the relation between standard performance and adversarial robustness by disentangling the non-robust and robust components in a proposed performance metric. +Syx6bz-Ab,Skka-zZAb,1509130000000.0,1518730000000.0,782,Seq2SQL: Generating Structured Queries From Natural Language Using Reinforcement Learning ,"[""victor@victorzhong.com"", ""cxiong@salesforce.com"", ""richard@socher.org""]","[""Victor Zhong"", ""Caiming Xiong"", ""Richard Socher""]","[""deep learning"", ""reinforcement learning"", ""dataset"", ""natural language processing"", ""natural language interface"", ""sql""]","Relational databases store a significant amount of the world's data. However, accessing this data currently requires users to understand a query language such as SQL. We propose Seq2SQL, a deep neural network for translating natural language questions to corresponding SQL queries. Our model uses rewards from in the loop query execution over the database to learn a policy to generate the query, which contains unordered parts that are less suitable for optimization via cross entropy loss. Moreover, Seq2SQL leverages the structure of SQL to prune the space of generated queries and significantly simplify the generation problem. In addition to the model, we release WikiSQL, a dataset of 80654 hand-annotated examples of questions and SQL queries distributed across 24241 tables from Wikipedia that is an order of magnitude larger than comparable datasets. 
By applying policy based reinforcement learning with a query execution environment to WikiSQL, Seq2SQL outperforms a state-of-the-art semantic parser, improving execution accuracy from 35.9% to 59.4% and logical form accuracy from 23.4% to 48.3%.",/pdf/812cc7a01924d3b8e76947c9145c9bf9fce9a45c.pdf,ICLR,2018,"We introduce Seq2SQL, which translates questions to SQL queries using rewards from online query execution, and WikiSQL, a SQL table/question/query dataset orders of magnitude larger than existing datasets." +xHKVVHGDOEk,XgPfSN8Wq6I,1601310000000.0,1615940000000.0,1227,Influence Functions in Deep Learning Are Fragile,"[""~Samyadeep_Basu1"", ""~Phil_Pope1"", ""~Soheil_Feizi2""]","[""Samyadeep Basu"", ""Phil Pope"", ""Soheil Feizi""]","[""Influence Functions"", ""Interpretability""]","Influence functions approximate the effect of training samples in test-time predictions and have a wide variety of applications in machine learning interpretability and uncertainty estimation. A commonly-used (first-order) influence function can be implemented efficiently as a post-hoc method requiring access only to the gradients and Hessian of the model. For linear models, influence functions are well-defined due to the convexity of the underlying loss function and are generally accurate even across difficult settings where model changes are fairly large such as estimating group influences. Influence functions, however, are not well-understood in the context of deep learning with non-convex loss functions. In this paper, we provide a comprehensive and large-scale empirical study of successes and failures of influence functions in neural network models trained on datasets such as Iris, MNIST, CIFAR-10 and ImageNet. Through our extensive experiments, we show that the network architecture, its depth and width, as well as the extent of model parameterization and regularization techniques have strong effects in the accuracy of influence functions. 
In particular, we find that (i) influence estimates are fairly accurate for shallow networks, while for deeper networks the estimates are often erroneous; (ii) for certain network architectures and datasets, training with weight-decay regularization is important to get high-quality influence estimates; and (iii) the accuracy of influence estimates can vary significantly depending on the examined test points. These results suggest that in general influence functions in deep learning are fragile and call for developing improved influence estimation methods to mitigate these issues in non-convex setups.",/pdf/5f4c0d6ade226a87db8c80975d4487e28ec11d9e.pdf,ICLR,2021,End-to-end investigation of the behaviour of influence functions in deep learning +HJx8HANFDH,H1xTm3Ldwr,1569440000000.0,1583910000000.0,1113,Four Things Everyone Should Know to Improve Batch Normalization,"[""ceciliasummers07@gmail.com"", ""mjd@cs.auckland.ac.nz""]","[""Cecilia Summers"", ""Michael J. Dinneen""]","[""batch normalization""]","A key component of most neural network architectures is the use of normalization layers, such as Batch Normalization. Despite its common use and large utility in optimizing deep architectures, it has been challenging both to generically improve upon Batch Normalization and to understand the circumstances that lend themselves to other enhancements. In this paper, we identify four improvements to the generic form of Batch Normalization and the circumstances under which they work, yielding performance gains across all batch sizes while requiring no additional computation during training. These contributions include proposing a method for reasoning about the current example in inference normalization statistics, fixing a training vs. 
inference discrepancy; recognizing and validating the powerful regularization effect of Ghost Batch Normalization for small and medium batch sizes; examining the effect of weight decay regularization on the scaling and shifting parameters γ and β; and identifying a new normalization algorithm for very small batch sizes by combining the strengths of Batch and Group Normalization. We validate our results empirically on six datasets: CIFAR-100, SVHN, Caltech-256, Oxford Flowers-102, CUB-2011, and ImageNet.",/pdf/87f243a3aaed725e0ae736b85445ae61f09b7cb1.pdf,ICLR,2020,Four things that improve batch normalization across all batch sizes +KjeUNkU2d26,hzjvWo4XtLh,1601310000000.0,1614990000000.0,104,Rethinking Content and Style: Exploring Bias for Unsupervised Disentanglement,"[""~Xuanchi_Ren1"", ""~Tao_Yang9"", ""~Wenjun_Zeng3"", ""~Yuwang_Wang3""]","[""Xuanchi Ren"", ""Tao Yang"", ""Wenjun Zeng"", ""Yuwang Wang""]","[""Unsupervised Disentanglement"", ""Content and Style Disentanglement"", ""Inductive Bias"", ""Representation Learning""]","Content and style (C-S) disentanglement intends to decompose the underlying explanatory factors of objects into two independent latent spaces. Aiming for unsupervised disentanglement, we introduce an inductive bias to our formulation by assigning different and independent roles to content and style when approximating the real data distributions. The content embeddings of individual images are forced to share a common distribution. The style embeddings encoding instance-specific features are used to customize the shared distribution. The experiments on several popular datasets demonstrate that our method achieves the state-of-the-art disentanglement compared to other unsupervised approaches and comparable or even better results than supervised methods. 
Furthermore, as a new application of C-S disentanglement, we propose to generate multi-view images from a single view image for 3D reconstruction.",/pdf/1c869c212adf7fdb167c2af8d24c5014674c78df.pdf,ICLR,2021,"Aiming for unsupervised disentanglement, we introduce an inductive bias by assigning different and independent roles to content and style when approximating the real data distributions. " +SkfNU2e0Z,H1-E83x0Z,1509110000000.0,1518730000000.0,391,Statestream: A toolbox to explore layerwise-parallel deep neural networks,"[""volker.fischer@de.bosch.com""]","[""Volker Fischer""]","[""model-parallel"", ""parallelization"", ""software platform""]","Building deep neural networks to control autonomous agents which have to interact in real-time with the physical world, such as robots or automotive vehicles, requires a seamless integration of time into a network’s architecture. The central question of this work is, how the temporal nature of reality should be reflected in the execution of a deep neural network and its components. Most artificial deep neural networks are partitioned into a directed graph of connected modules or layers and the layers themselves consist of elemental building blocks, such as single units. For most deep neural networks, all units of a layer are processed synchronously and in parallel, but layers themselves are processed in a sequential manner. In contrast, all elements of a biological neural network are processed in parallel. In this paper, we define a class of networks between these two extreme cases. These networks are executed in a streaming or synchronous layerwise-parallel manner, unlocking the layers of such networks for parallel processing. Compared to the standard layerwise-sequential deep networks, these new layerwise-parallel networks show a fundamentally different temporal behavior and flow of information, especially for networks with skip or recurrent connections. 
We argue that layerwise-parallel deep networks are better suited for future challenges of deep neural network design, such as large functional modularized and/or recurrent architectures as well as networks allocating different network capacities dependent on current stimulus and/or task complexity. We layout basic properties and discuss major challenges for layerwise-parallel networks. Additionally, we provide a toolbox to design, train, evaluate, and online-interact with layerwise-parallel networks.",/pdf/bdaddd8d4654234c4a3d112de09a655ee0733636.pdf,ICLR,2018,"We define a concept of layerwise model-parallel deep neural networks, for which layers operate in parallel, and provide a toolbox to design, train, evaluate, and on-line interact with these networks." +SJzR2iRcK7,HyeUGb-5YX,1538090000000.0,1546470000000.0,760,Multi-class classification without multi-class labels,"[""yenchang.hsu@gatech.edu"", ""zhaoyang.lv@gatech.edu"", ""joel.schlosser@gtri.gatech.edu"", ""phillip.odom@gtri.gatech.edu"", ""zkira@gatech.edu""]","[""Yen-Chang Hsu"", ""Zhaoyang Lv"", ""Joel Schlosser"", ""Phillip Odom"", ""Zsolt Kira""]","[""classification"", ""unsupervised learning"", ""semi-supervised learning"", ""problem reduction"", ""weak supervision"", ""cross-task"", ""learning"", ""deep learning"", ""neural network""]","This work presents a new strategy for multi-class classification that requires no class-specific labels, but instead leverages pairwise similarity between examples, which is a weaker form of annotation. The proposed method, meta classification learning, optimizes a binary classifier for pairwise similarity prediction and through this process learns a multi-class classifier as a submodule. We formulate this approach, present a probabilistic graphical model for it, and derive a surprisingly simple loss function that can be used to learn neural network-based models. 
We then demonstrate that this same framework generalizes to the supervised, unsupervised cross-task, and semi-supervised settings. Our method is evaluated against state of the art in all three learning paradigms and shows a superior or comparable accuracy, providing evidence that learning multi-class classification without multi-class labels is a viable learning option.",/pdf/65e173d93cea9bceab67e6c8caee82e4c764084d.pdf,ICLR,2019, +HyET6tYex,,1478230000000.0,1484720000000.0,114,Universality in halting time,"[""leventsagun@gmail.com"", ""tom.trogdon@gmail.com"", ""yann@cs.nyu.edu""]","[""Levent Sagun"", ""Thomas Trogdon"", ""Yann LeCun""]","[""Optimization""]","The authors present empirical distributions for the halting time (measured by the number of iterations to reach a given accuracy) of optimization algorithms applied to two random systems: spin glasses and deep learning. Given an algorithm, which we take to be both the optimization routine and the form of the random landscape, the fluctuations of the halting time follow a distribution that remains unchanged even when the input is changed drastically. We observe two main classes, a Gumbel-like distribution that appears in Google searches, human decision times, QR factorization and spin glasses, and a Gaussian-like distribution that appears in conjugate gradient method, deep network with MNIST input data and deep network with random input data. This empirical evidence suggests presence of a class of distributions for which the halting time is independent of the underlying distribution under some conditions.",/pdf/9ba24ff08dfbca6e21b3bde7fba80f966453c21e.pdf,ICLR,2017,Normalized halting time distributions are independent of the input data distribution. 
+SyeLGlHtPS,rygBQGgYDr,1569440000000.0,1577170000000.0,2174,"Learning vector representation of local content and matrix representation of local motion, with implications for V1","[""ruiqigao@ucla.edu"", ""jianwen@ucla.edu"", ""huangsiyuan@ucla.edu"", ""3160104704@zju.edu.cn"", ""sczhu@stat.ucla.edu"", ""ywu@stat.ucla.edu""]","[""Ruiqi Gao"", ""Jianwen Xie"", ""Siyuan Huang"", ""Yufan Ren"", ""Song-Chun Zhu"", ""Ying Nian Wu""]","[""Representation learning"", ""V1"", ""neuroscience""]","This paper proposes a representational model for image pair such as consecutive video frames that are related by local pixel displacements, in the hope that the model may shed light on motion perception in primary visual cortex (V1). The model couples the following two components. (1) The vector representations of local contents of images. (2) The matrix representations of local pixel displacements caused by the relative motions between the agent and the objects in the 3D scene. When the image frame undergoes changes due to local pixel displacements, the vectors are multiplied by the matrices that represent the local displacements. Our experiments show that our model can learn to infer local motions. Moreover, the model can learn Gabor-like filter pairs of quadrature phases.",/pdf/8feff0a09d4f565eb30de6a5382ae79788be4e15.pdf,ICLR,2020, +zFM0Uo_GnYE,zsvrHceiPP,1601310000000.0,1614990000000.0,3592,On the Importance of Looking at the Manifold,"[""~Nil_Adell_Mill1"", ""~Jannis_Born1"", ""npark@us.ibm.com"", ""~James_Hedrick1"", ""mrm@zurich.ibm.com"", ""~Matteo_Manica1""]","[""Nil Adell Mill"", ""Jannis Born"", ""Nathaniel Park"", ""James Hedrick"", ""Mar\u00eda Rodr\u00edguez Mart\u00ednez"", ""Matteo Manica""]","[""Topological Learning"", ""GNN"", ""VAE""]","Data rarely lies on uniquely Euclidean spaces. 
Even data typically represented in regular domains, such as images, can have a higher level of relational information, either between data samples or even relations within samples, e.g., how the objects in an image are linked. With this perspective our data points can be enriched by explicitly accounting for this connectivity and analyzing them as a graph. Herein, we analyze various approaches for unsupervised representation learning and investigate the importance of considering topological information and its impact when learning representations. We explore a spectrum of models, ranging from uniquely learning representations based on the isolated features of the nodes (focusing on Variational Autoencoders), to uniquely learning representations based on the topology (using node2vec) passing through models that integrate both node features and topological information in a hybrid fashion. For the latter we use Graph Neural Networks, precisely Deep Graph Infomax (DGI), and an extension of the typical formulation of the VAE where the topological structure is accounted for via an explicit regularization of the loss (Graph-Regularized VAEs, introduced in this work). To extensively investigate these methodologies, we consider a wide variety of data types: synthetic data point clouds, MNIST, citation networks, and chemical reactions. We show that each of the representations learned by these models may have critical importance for further downstream tasks, and that accounting for the topological features can greatly improve the modeling capabilities for certain problems. We further provide a framework to analyze these, and future models under different scenarios and types of data.",/pdf/853de54c6bb4f30bbec31d0002795f927a1ab5f5.pdf,ICLR,2021,A study on the importance of the topology in representation learning using implicit and explicit graph structure information. 
+o7YTArVXdEW,Fm7QhA7axdy,1601310000000.0,1614990000000.0,3268,AC-VAE: Learning Semantic Representation with VAE for Adaptive Clustering,"[""~Xingyu_Xie2"", ""~Minjuan_Zhu1"", ""~Yan_Wang13"", ""~Lei_Zhang25""]","[""Xingyu Xie"", ""Minjuan Zhu"", ""Yan Wang"", ""Lei Zhang""]","[""Unsupervised Representation Learning"", ""Neighbor Clustering"", ""Variational Autoencoder"", ""Unsupervised Classification""]","Unsupervised representation learning is essential in the field of machine learning, and accurate neighbor clusters of representation show great potential to support unsupervised image classification. This paper proposes a VAE (Variational Autoencoder) based network and a clustering method to achieve adaptive neighbor clustering to support the self-supervised classification. The proposed network encodes the image into the representation with boundary information, and the proposed cluster method takes advantage of the boundary information to deliver adaptive neighbor cluster results. Experimental evaluations show that the proposed method outperforms state-of-the-art representation learning methods in terms of neighbor clustering accuracy. Particularly, AC-VAE achieves 95\% and 82\% accuracy on CIFAR10 dataset when the average neighbor cluster sizes are 10 and 100. Furthermore, the neighbor cluster results are found converge within the clustering range ($\alpha\leq2$), and the converged neighbor clusters are used to support the self-supervised classification. The proposed method delivers classification results that are competitive with the state-of-the-art and reduces the super parameter $k$ in KNN (K-nearest neighbor), which is often used in self-supervised classification.",/pdf/90d844e0df148c0ff7e2b38052bb2042f9f0450a.pdf,ICLR,2021,This paper proposes a VAE based network and a z-score based clustering method to achieve adaptive neighbor clustering for supporting unsupervised classification. 
+0-uUGPbIjD,4x83nyBKIVi,1601310000000.0,1616040000000.0,702,Human-Level Performance in No-Press Diplomacy via Equilibrium Search,"[""~Jonathan_Gray2"", ""~Adam_Lerer1"", ""~Anton_Bakhtin1"", ""~Noam_Brown2""]","[""Jonathan Gray"", ""Adam Lerer"", ""Anton Bakhtin"", ""Noam Brown""]","[""multi-agent systems"", ""regret minimization"", ""no-regret learning"", ""game theory"", ""reinforcement learning""]","Prior AI breakthroughs in complex games have focused on either the purely adversarial or purely cooperative settings. In contrast, Diplomacy is a game of shifting alliances that involves both cooperation and competition. For this reason, Diplomacy has proven to be a formidable research challenge. In this paper we describe an agent for the no-press variant of Diplomacy that combines supervised learning on human data with one-step lookahead search via regret minimization. Regret minimization techniques have been behind previous AI successes in adversarial games, most notably poker, but have not previously been shown to be successful in large-scale games involving cooperation. 
We show that our agent greatly exceeds the performance of past no-press Diplomacy bots, is unexploitable by expert humans, and ranks in the top 2% of human players when playing anonymous games on a popular Diplomacy website.",/pdf/59e1f6ceb25265194013bc67945a7001ef36ca84.pdf,ICLR,2021,We present an agent that approximates a one-step equilibrium in no-press Diplomacy using no-regret learning and show that it exceeds human-level performance +-6vS_4Kfz0,RKxdou99RmJ,1601310000000.0,1615850000000.0,3688,Optimizing Memory Placement using Evolutionary Graph Reinforcement Learning,"[""~Shauharda_Khadka1"", ""estelle.aflalo@intel.com"", ""mattias.marder@intel.com"", ""avrech@campus.technion.ac.il"", ""~Santiago_Miret1"", ""~Shie_Mannor2"", ""~Tamir_Hazan1"", ""~Hanlin_Tang1"", ""~Somdeb_Majumdar1""]","[""Shauharda Khadka"", ""Estelle Aflalo"", ""Mattias Mardar"", ""Avrech Ben-David"", ""Santiago Miret"", ""Shie Mannor"", ""Tamir Hazan"", ""Hanlin Tang"", ""Somdeb Majumdar""]","[""Reinforcement Learning"", ""Memory Mapping"", ""Device Placement"", ""Evolutionary Algorithms""]","For deep neural network accelerators, memory movement is both energetically expensive and can bound computation. Therefore, optimal mapping of tensors to memory hierarchies is critical to performance. The growing complexity of neural networks calls for automated memory mapping instead of manual heuristic approaches; yet the search space of neural network computational graphs have previously been prohibitively large. We introduce Evolutionary Graph Reinforcement Learning (EGRL), a method designed for large search spaces, that combines graph neural networks, reinforcement learning, and evolutionary search. A set of fast, stateless policies guide the evolutionary search to improve its sample-efficiency. We train and validate our approach directly on the Intel NNP-I chip for inference. 
EGRL outperforms policy-gradient, evolutionary search and dynamic programming baselines on BERT, ResNet-101 and ResNet-50. We additionally achieve 28-78% speed-up compared to the native NNP-I compiler on all three workloads. ",/pdf/04f5290361838504c7aedc9907d8058042fe455e.pdf,ICLR,2021,"We combine evolutionary and gradient-based reinforcement learning to tackle the large search spaces needed to map tensors to memory, yielding up to 78% speedup on BERT and ResNet on a deep learning inference chip." +H1lPUiRcYQ,SyedgfOPFX,1538090000000.0,1545360000000.0,181,Computing committor functions for the study of rare events using deep learning with importance sampling,"[""liqix@ihpc.a-star.edu.sg"", ""linbo94@u.nus.edu"", ""matrw@nus.edu.sg""]","[""Qianxiao Li"", ""Bo Lin"", ""Weiqing Ren""]","[""committor function"", ""rare event"", ""deep learning"", ""importance sampling""]","The committor function is a central object of study in understanding transitions between metastable states in complex systems. However, computing the committor function for realistic systems at low temperatures is a challenging task, due to the curse of dimensionality and the scarcity of transition data. In this paper, we introduce a computational approach that overcomes these issues and achieves good performance on complex benchmark problems with rough energy landscapes. The new approach combines deep learning, importance sampling and feature engineering techniques. 
This establishes an alternative practical method for studying rare transition events among metastable states of complex, high dimensional systems.",/pdf/9fca4516f81f910ed4ed0ddc57bfdfb8aade4be9.pdf,ICLR,2019,Computing committor functions for rare events +SJlgTJHKwB,H1xVrQkKDS,1569440000000.0,1577170000000.0,1977,Continual Learning with Delayed Feedback,"[""pranavan@u.nus.edu"", ""tsim@comp.nus.edu.sg""]","[""THEIVENDIRAM PRANAVAN"", ""TERENCE SIM""]",[],"Most of the artificial neural networks are using the benefit of labeled datasets whereas in human brain, the learning is often unsupervised. The feedback or a label for a given input or a sensory stimuli is not often available instantly. After some time when brain gets the feedback, it updates its knowledge. That's how brain learns. Moreover, there is no training or testing phase. Human learns continually. This work proposes a model-agnostic continual learning framework which can be used with neural networks as well as decision trees to incorporate continual learning. Specifically, this work investigates how delayed feedback can be handled. In addition, a way to update the Machine Learning models with unlabeled data is proposed. Promising results are received from the experiments done on neural networks and decision trees. ",/pdf/d75cec22097c903d811598b9a8fb1044b0f2208a.pdf,ICLR,2020, +Hkg0olStDr,HJegkgWFPB,1569440000000.0,1577170000000.0,2520,Multi-Step Decentralized Domain Adaptation,"[""akhilmathurs@gmail.com"", ""sgan@inf.ethz.ch"", ""anton.isopoussu@gmail.com"", ""fahim.kawsar@gmail.com"", ""nadia.berthouze@gmail.com"", ""nicholasd.lane@gmail.com""]","[""Akhil Mathur"", ""Shaoduo Gan"", ""Anton Isopoussu"", ""Fahim Kawsar"", ""Nadia Berthouze"", ""Nicholas D. Lane""]","[""domain adaptation"", ""decentralization""]","Despite the recent breakthroughs in unsupervised domain adaptation (uDA), no prior work has studied the challenges of applying these methods in practical machine learning scenarios. 
In this paper, we highlight two significant bottlenecks for uDA, namely excessive centralization and poor support for distributed domain datasets. Our proposed framework, MDDA, is powered by a novel collaborator selection algorithm and an effective distributed adversarial training method, and allows for uDA methods to work in a decentralized and privacy-preserving way. +",/pdf/9e4b2c75e86a97daca93b88055bb6be6b7574c44.pdf,ICLR,2020,"A novel method for decentralized and distributed domain adaptation, as a way to make these methods more practical in real ML systems." +BylE1205Fm,SJgV3JL5Km,1538090000000.0,1547560000000.0,980,Emerging Disentanglement in Auto-Encoder Based Unsupervised Image Content Transfer,"[""theoripress@gmail.com"", ""tomer22g@gmail.com"", ""sagiebenaim@gmail.com"", ""wolf@fb.com""]","[""Ori Press"", ""Tomer Galanti"", ""Sagie Benaim"", ""Lior Wolf""]","[""Image-to-image Translation"", ""Disentanglement"", ""Autoencoders"", ""Faces""]","We study the problem of learning to map, in an unsupervised way, between domains $A$ and $B$, such that the samples $\vb \in B$ contain all the information that exists in samples $\va\in A$ and some additional information. For example, ignoring occlusions, $B$ can be people with glasses, $A$ people without, and the glasses, would be the added information. When mapping a sample $\va$ from the first domain to the other domain, the missing information is replicated from an independent reference sample $\vb\in B$. Thus, in the above example, we can create, for every person without glasses a version with the glasses observed in any face image. + +Our solution employs a single two-pathway encoder and a single decoder for both domains. The common part of the two domains and the separate part are encoded as two vectors, and the separate part is fixed at zero for domain $A$. The loss terms are minimal and involve reconstruction losses for the two domains and a domain confusion term. 
Our analysis shows that under mild assumptions, this architecture, which is much simpler than the literature guided-translation methods, is enough to ensure disentanglement between the two domains. We present convincing results in a few visual domains, such as no-glasses to glasses, adding facial hair based on a reference image, etc.",/pdf/ea305d5777dc2a4201ed7e386cd0a37a9bc5c586.pdf,ICLR,2019,An image to image translation method which adds to one image the content of another thereby creating a new image. +rJejta4KDS,S1xJiN6vPS,1569440000000.0,1577170000000.0,683,SELF-KNOWLEDGE DISTILLATION ADVERSARIAL ATTACK,"[""maxrumi@163.com"", ""shanicky4ever@gmail.com"", ""tico_tools@163.com"", ""zqdong@stu.xidian.edu.cn"", ""zhenhua_duan@126.com""]","[""Ma Xiaoxiong[1]"", ""Wang Renzhi[1]"", ""Tian Cong"", ""Dong Zeqian"", ""Duan Zhenhua""]","[""Adversarial Examples"", ""Transferability"", ""black-box targeted attack"", ""Distillation""]","Neural networks show great vulnerability under the threat of adversarial examples. + By adding small perturbation to a clean image, neural networks with high classification accuracy can be completely fooled. + One intriguing property of the adversarial examples is transferability. This property allows adversarial examples to transfer to networks of unknown structure, which is harmful even to the physical world. + The current way of generating adversarial examples is mainly divided into optimization based and gradient based methods. + Liu et al. (2017) conjecture that gradient based methods can hardly produce transferable targeted adversarial examples in black-box-attack. + However, in this paper, we use a simple technique to improve the transferability and success rate of targeted attacks with gradient based methods. + We prove that gradient based methods can also generate transferable adversarial examples in targeted attacks. 
+ Specifically, we use knowledge distillation for gradient based methods, and show that the transferability can be improved by effectively utilizing different classes of information. + Unlike the usual applications of knowledge distillation, we did not train a student network to generate adversarial examples. + We take advantage of the fact that knowledge distillation can soften the target and obtain higher information, and combine the soft target and hard target of the same network as the loss function. + Our method is generally applicable to most gradient based attack methods.",/pdf/ceee93a891dfb4de093cc2d2a3765b1f84b6b525.pdf,ICLR,2020, +rkx1b64Fvr,SyeeNW8IDS,1569440000000.0,1577170000000.0,361,A New Multi-input Model with the Attention Mechanism for Text Classification,"[""qiujunhao@csu.edu.cn"", ""shirh@csu.edu.cn"", ""lifangfang@csu.edu.cn"", ""shijinjing@csu.edu.cn"", ""0909123117@csu.edu.cn""]","[""Junhao Qiu"", ""Ronghua Shi"", ""Fangfang Li (the corresponding author)"", ""Jinjing Shi"", ""Wangmin Liao""]","[""Natural Language Processing"", ""Text Classification"", ""Densent"", ""Multi-input Model"", ""Attention Mechanism""]","Recently, deep learning has made extraordinary achievements in text classification. However, most of present models, especially convolutional neural network (CNN), do not extract long-range associations, global representations, and hierarchical features well due to their relatively shallow and simple structures. This causes a negative effect on text classification. Moreover, we find that there are many express methods of texts. It is appropriate to design the multi-input model to improve the classification effect. But most of models of text classification only use words or characters and do not use the multi-input model. Inspired by the above points and Densenet (Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. 
In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708, 2017.), we propose a new text classification model, which uses words, characters, and labels as input. The model, which is a deep CNN with a novel attention mechanism, can effectively leverage the input information and solve the above issues of the shallow model. We conduct experiments on six large text classification datasets. Our model achieves the state of the art results on all datasets compared to multiple baseline models.",/pdf/546e7056149308ffedc64a4ef2c883fb6b94a2be.pdf,ICLR,2020,"We propose a new multi-input model with a novel attention mechanism, can effectively solve the issues of the shallow text classification model such as doing not extract long-range associations, global representations, and hierarchical features." +HJex0o05F7,rJl_NF85tQ,1538090000000.0,1545360000000.0,864,UaiNets: From Unsupervised to Active Deep Anomaly Detection,"[""tiago.pimentel@kunumi.com"", ""marianne@kunumi.com"", ""juliano@kunumi.com"", ""adrianov@dcc.ufmg.br"", ""nivio@dcc.ufmg.br""]","[""Tiago Pimentel"", ""Marianne Monteiro"", ""Juliano Viana"", ""Adriano Veloso"", ""Nivio Ziviani""]","[""Anomaly Detection"", ""Active Learning"", ""Unsupervised Learning""]","This work presents a method for active anomaly detection which can be built upon existing deep learning solutions for unsupervised anomaly detection. We show that a prior needs to be assumed on what the anomalies are, in order to have performance guarantees in unsupervised anomaly detection. We argue that active anomaly detection has, in practice, the same cost of unsupervised anomaly detection but with the possibility of much better results. 
To solve this problem, we present a new layer that can be attached to any deep learning model designed for unsupervised anomaly detection to transform it into an active method, presenting results on both synthetic and real anomaly detection datasets.",/pdf/91efbbf714bf8ccbab311c0420be4e6475ae0617.pdf,ICLR,2019,A method for active anomaly detection. We present a new layer that can be attached to any deep learning model designed for unsupervised anomaly detection to transform it into an active method. +S1gLBgBtDH,B1e2u_gtPr,1569440000000.0,1577170000000.0,2286,SLM Lab: A Comprehensive Benchmark and Modular Software Framework for Reproducible Deep Reinforcement Learning,"[""kengzwl@gmail.com"", ""lhgraesser@gmail.com"", ""mcvitkov@caltech.edu""]","[""Wah Loon Keng"", ""Laura Graesser"", ""Milan Cvitkovic""]","[""reinforcement learning"", ""machine learning"", ""benchmark"", ""reproducibility"", ""software"", ""framework"", ""implementation issues"", ""parallelization"", ""software platforms""]","We introduce SLM Lab, a software framework for reproducible reinforcement learning (RL) research. SLM Lab implements a number of popular RL algorithms, provides synchronous and asynchronous parallel experiment execution, hyperparameter search, and result analysis. RL algorithms in SLM Lab are implemented in a modular way such that differences in algorithm performance can be confidently ascribed to differences between algorithms, not between implementations. In this work we present the design choices behind SLM Lab and use it to produce a comprehensive single-codebase RL algorithm benchmark. 
In addition, as a consequence of SLM Lab's modular design, we introduce and evaluate a discrete-action variant of the Soft Actor-Critic algorithm (Haarnoja et al., 2018) and a hybrid synchronous/asynchronous training method for RL agents.",/pdf/20663ef9611853595106b962f590b6427cc38377.pdf,ICLR,2020,We introduce a new software framework (SLM Lab) for reinforcement learning and use it to produce a massive performance benchmark of RL algorithms. +Ti87Pv5Oc8,ZfqcDEzPgLJ,1601310000000.0,1612760000000.0,2513,Meta-Learning with Neural Tangent Kernels,"[""~Yufan_Zhou1"", ""~Zhenyi_Wang1"", ""jxian@buffalo.edu"", ""~Changyou_Chen1"", ""~Jinhui_Xu1""]","[""Yufan Zhou"", ""Zhenyi Wang"", ""Jiayi Xian"", ""Changyou Chen"", ""Jinhui Xu""]","[""meta-learning"", ""neural tangent kernel""]","Model Agnostic Meta-Learning (MAML) has emerged as a standard framework for meta-learning, where a meta-model is learned with the ability of fast adapting to new tasks. However, as a double-looped optimization problem, MAML needs to differentiate through the whole inner-loop optimization path for every outer-loop training step, which may lead to both computational inefficiency and sub-optimal solutions. In this paper, we generalize MAML to allow meta-learning to be defined in function spaces, and propose the first meta-learning paradigm in the Reproducing Kernel Hilbert Space (RKHS) induced by the meta-model's Neural Tangent Kernel (NTK). Within this paradigm, we introduce two meta-learning algorithms in the RKHS, which no longer need a sub-optimal iterative inner-loop adaptation as in the MAML framework. We achieve this goal by 1) replacing the adaptation with a fast-adaptive regularizer in the RKHS; and 2) solving the adaptation analytically based on the NTK theory. Extensive experimental studies demonstrate advantages of our paradigm in both efficiency and quality of solutions compared to related meta-learning algorithms. 
Another interesting feature of our proposed methods is that they are demonstrated to be more robust to adversarial attacks and out-of-distribution adaptation than popular baselines, as demonstrated in our experiments.",/pdf/07382947621a75697286cffb9d20483d2fd8337e.pdf,ICLR,2021,First work to define meta learning in RKHS induced by Neural Tangent Kernel +bMzj6hXL2VJ,yAYglLWksYU,1601310000000.0,1614990000000.0,3280,Ordering-Based Causal Discovery with Reinforcement Learning,"[""~Xiaoqiang_Wang2"", ""~Yali_Du1"", ""~Shengyu_Zhu1"", ""keljxjtu@xjtu.edu.cn"", ""~Zhitang_Chen1"", ""~Jianye_HAO1"", ""~Jun_Wang2""]","[""Xiaoqiang Wang"", ""Yali Du"", ""Shengyu Zhu"", ""Liangjun Ke"", ""Zhitang Chen"", ""Jianye HAO"", ""Jun Wang""]","[""Causal Discovery"", ""Reinforcement Learning"", ""Ordering Search""]","It is a long-standing question to discover causal relations among a set of variables in many empirical sciences. Recently, Reinforcement Learning (RL) has achieved promising results in causal discovery. However, searching the space of directed graphs directly and enforcing acyclicity by implicit penalties tend to be inefficient and restrict the method to the small problems. In this work, we alternatively consider searching an ordering by RL from the variable ordering space that is much smaller than that of directed graphs, which also helps avoid dealing with acyclicity. Specifically, we formulate the ordering search problem as a Markov decision process, and then use different reward designs to optimize the ordering generating model. A generated ordering is then processed using variable selection methods to obtain the final directed acyclic graph. In contrast to other causal discovery methods, our method can also utilize a pretrained model to accelerate training. 
We conduct experiments on both synthetic and real-world datasets, and show that the proposed method outperforms other baselines on important metrics even on large graph tasks.",/pdf/b7efa795a05282e5518ae2aa297d967030d899b9.pdf,ICLR,2021, +H1eadi0cFQ,Hke7FFKctX,1538090000000.0,1545360000000.0,395,Escaping Flat Areas via Function-Preserving Structural Network Modifications,"[""yannic.kilcher@inf.ethz.ch"", ""garybecigneul06@gmail.com"", ""thomas.hofmann@inf.ethz.ch""]","[""Yannic Kilcher"", ""Gary B\u00e9cigneul"", ""Thomas Hofmann""]","[""deep learning"", ""cnn"", ""structural modification"", ""optimization"", ""saddle point""]","Hierarchically embedding smaller networks in larger networks, e.g.~by increasing the number of hidden units, has been studied since the 1990s. The main interest was in understanding possible redundancies in the parameterization, as well as in studying how such embeddings affect critical points. We take these results as a point of departure to devise a novel strategy for escaping from flat regions of the error surface and to address the slow-down of gradient-based methods experienced in plateaus of saddle points. The idea is to expand the dimensionality of a network in a way that guarantees the existence of new escape directions. We call this operation the opening of a tunnel. One may then continue with the larger network either temporarily, i.e.~closing the tunnel later, or permanently, i.e.~iteratively growing the network, whenever needed. We develop our method for fully-connected as well as convolutional layers. Moreover, we present a practical version of our algorithm that requires no network structure modification and can be deployed as plug-and-play into any current deep learning framework. Experimentally, our method shows significant speed-ups.",/pdf/1c704fa0c25e9106af3f1a343e7884d234b1e240.pdf,ICLR,2019,"If optimization gets stuck in a saddle, we add a filter to a CNN in a specific way in order to escape the saddle." 
+Syx5eT4KDS,SJxUd4VLDS,1569440000000.0,1577170000000.0,349,Discrete InfoMax Codes for Meta-Learning,"[""einet89@gmail.com"", ""dandelin.kim@kakaocorp.com"", ""seungjin.choi.mlg@gmail.com""]","[""Yoonho Lee"", ""Wonjae Kim"", ""Seungjin Choi""]","[""meta-learning"", ""generalization"", ""discrete representations""]","This paper analyzes how generalization works in meta-learning. Our core contribution is an information-theoretic generalization bound for meta-learning, which identifies the expressivity of the task-specific learner as the key factor that makes generalization to new datasets difficult. Taking inspiration from our bound, we present Discrete InfoMax Codes (DIMCO), a novel meta-learning model that trains a stochastic encoder to output discrete codes. Experiments show that DIMCO requires less memory and less time for similar performance to previous metric learning methods and that our method generalizes particularly well in a challenging small-data setting.",/pdf/f1dc262f5d5ef140d31ca9e63ecdc097f109a4c9.pdf,ICLR,2020,"We derive a generalization bound for meta-learning, and propose a meta-learning model that generalizes well" +HyecJGP5ge,,1478290000000.0,1484600000000.0,365,NEUROGENESIS-INSPIRED DICTIONARY LEARNING: ONLINE MODEL ADAPTION IN A CHANGING WORLD,"[""sahilgar@usc.edu"", ""rish@us.ibm.com"", ""gcecchi@us.ibm.com"", ""aclozano@us.ibm.com""]","[""Sahil Garg"", ""Irina Rish"", ""Guillermo Cecchi"", ""Aurelie Lozano""]","[""Unsupervised Learning"", ""Computer vision"", ""Transfer Learning"", ""Optimization"", ""Applications""]","In this paper, we focus on online representation learning in non-stationary environments which may require continuous adaptation of model’s architecture. 
We propose a novel online dictionary-learning (sparse-coding) framework which incorporates the addition and deletion of hidden units (dictionary elements), and is inspired by the adult neurogenesis phenomenon in the dentate gyrus of the hippocampus, known to be associated with improved cognitive function and adaptation to new environments. In the online learning setting, where new input instances arrive sequentially in batches, the “neuronal birth” is implemented by adding new units with random initial weights (random dictionary elements); the number of new units is determined by the current performance (representation error) of the dictionary, higher error causing an increase in the birth rate. “Neuronal death” is implemented by imposing l1/l2-regularization (group sparsity) on the dictionary within the block-coordinate descent optimization at each iteration of our online alternating minimization scheme, which iterates between the code and dictionary updates. Finally, hidden unit connectivity adaptation is facilitated by introducing sparsity in dictionary elements. Our empirical evaluation on several real-life datasets (images and language) as well as on synthetic data demonstrates that the proposed approach can considerably outperform the state-of-art fixed-size (nonadaptive) online sparse coding of Mairal et al. (2009) in the presence of nonstationary data. Moreover, we identify certain properties of the data (e.g., sparse inputs with nearly non-overlapping supports) and of the model (e.g., dictionary sparsity) associated with such improvements.",/pdf/fb6912d6259de45abff8efb092d6c30cc681125c.pdf,ICLR,2017,"An online dictionary learning incorporates dynamic model adaptation, adding/deleting its elements in response to nonstationary data." 
+ByhthReRb,Sy2thAlA-,1509120000000.0,1518730000000.0,466,A Neural Method for Goal-Oriented Dialog Systems to interact with Named Entities,"[""rjana@umich.edu"", ""jatinganhotra@us.ibm.com"", ""xiaoxiao.guo@ibm.com"", ""yum@us.ibm.com"", ""baveja@umich.edu""]","[""Janarthanan Rajendran"", ""Jatin Ganhotra"", ""Xiaoxiao Guo"", ""Mo Yu"", ""Satinder Singh""]","[""Named Entities"", ""Neural methods"", ""Goal oriented dialog""]","Many goal-oriented dialog tasks, especially ones in which the dialog system has to interact with external knowledge sources such as databases, have to handle a large number of Named Entities (NEs). There are at least two challenges in handling NEs using neural methods in such settings: individual NEs may occur only rarely making it hard to learn good representations of them, and many of the Out Of Vocabulary words that occur during test time may be NEs. Thus, the need to interact well with these NEs has emerged as a serious challenge to building neural methods for goal-oriented dialog tasks. In this paper, we propose a new neural method for this problem, and present empirical evaluations on a structured Question answering task and three related goal-oriented dialog tasks that show that our proposed method can be effective in interacting with NEs in these settings.",/pdf/afccb8cf8ffbd2f9242fe0a67559cf72fe106b41.pdf,ICLR,2018, +H1lS8oA5YQ,B1g7wjGUtm,1538090000000.0,1545360000000.0,173,Feature Attribution As Feature Selection,"[""satohara@ar.sanken.osaka-u.ac.jp"", ""k1keno@ar.sanken.osaka-u.ac.jp"", ""tasuku_soma@mist.i.u-tokyo.ac.jp"", ""takanori.maehara@riken.jp""]","[""Satoshi Hara"", ""Koichi Ikeno"", ""Tasuku Soma"", ""Takanori Maehara""]","[""feature attribution"", ""feature selection""]","Feature attribution methods identify ""relevant"" features as an explanation of a complex machine learning model. 
Several feature attribution methods have been proposed; however, only a few studies have attempted to define the ""relevance"" of each feature mathematically. In this study, we formalize the feature attribution problem as a feature selection problem. In our proposed formalization, there arise two possible definitions of relevance. We name the feature attribution problems based on these two relevances as Exclusive Feature Selection (EFS) and Inclusive Feature Selection (IFS). We show that several existing feature attribution methods can be interpreted as approximation algorithms for EFS and IFS. Moreover, through exhaustive experiments, we show that IFS is better suited as the formalization for the feature attribution problem than EFS.",/pdf/a0fca5765aa3122dd8d4019975f34354234c18d5.pdf,ICLR,2019, +H1lxVyStPH,B1lL35ndDr,1569440000000.0,1583910000000.0,1640,Generalized Convolutional Forest Networks for Domain Generalization and Visual Recognition,"[""jongbin.ryu@gmail.com"", ""kwongitack@gmail.com"", ""mhyang@ucmerced.edu"", ""jlim@hanyang.ac.kr""]","[""Jongbin Ryu"", ""Gitaek Kwon"", ""Ming-Hsuan Yang"", ""Jongwoo Lim""]",[],"When constructing random forests, it is of prime importance to ensure high accuracy and low correlation of individual tree classifiers for good performance. Nevertheless, it is typically difficult for existing random forest methods to strike a good balance between these conflicting factors. In this work, we propose a generalized convolutional forest networks to learn a feature space to maximize the strength of individual tree classifiers while minimizing the respective correlation. The feature space is iteratively constructed by a probabilistic triplet sampling method based on the distribution obtained from the splits of the random forest. The sampling process is designed to pull the data of the same label together for higher strength and push away the data frequently falling to the same leaf nodes. 
We perform extensive experiments on five image classification and two domain generalization datasets with ResNet-50 and DenseNet-161 backbone networks. Experimental results show that the proposed algorithm performs favorably against state-of-the-art methods.",/pdf/73ebc401417350ff74adfdce192ce60b700f9131.pdf,ICLR,2020, +Byx9p2EtDH,SkgbzAs7Pr,1569440000000.0,1577170000000.0,236,MULTIPOLAR: Multi-Source Policy Aggregation for Transfer Reinforcement Learning between Diverse Environmental Dynamics,"[""m.barekatain@tum.de"", ""ryo.yonetani@sinicx.com"", ""masashi.hamaya@sinicx.com""]","[""Mohammadamin Barekatain"", ""Ryo Yonetani"", ""Masashi Hamaya""]","[""reinforcement learning"", ""transfer learning"", ""policy aggregation"", ""residual policy learning""]","Transfer reinforcement learning (RL) aims at improving learning efficiency of an agent by exploiting knowledge from other source agents trained on relevant tasks. However, it remains challenging to transfer knowledge between different environmental dynamics without having access to the source environments. In this work, we explore a new challenge in transfer RL, where only a set of source policies collected under unknown diverse dynamics is available for learning a target task efficiently. To address this problem, the proposed approach, MULTI-source POLicy AggRegation (MULTIPOLAR), comprises two key techniques. We learn to aggregate the actions provided by the source policies adaptively to maximize the target task performance. Meanwhile, we learn an auxiliary network that predicts residuals around the aggregated actions, which ensures the target policy's expressiveness even when some of the source policies perform poorly. 
We demonstrated the effectiveness of MULTIPOLAR through an extensive experimental evaluation across six simulated environments ranging from classic control problems to challenging robotics simulations, under both continuous and discrete action spaces.",/pdf/12f36cd088194283da579be18d2ae84dfd64b851.pdf,ICLR,2020,"We propose MULTIPOLAR, a transfer RL method that leverages a set of source policies collected under unknown diverse environmental dynamics to efficiently learn a target policy in another dynamics." +HkeMYJHYvS,B1eFVNAuvS,1569440000000.0,1577170000000.0,1833,High-Frequency guided Curriculum Learning for Class-specific Object Boundary Detection,"[""vsr.veera@gmail.com"", ""deepak.mittal@verisk.com"", ""abhishek.goel@verisk.com"", ""maneesh.singh@verisk.com""]","[""VSR Veeravasarapu"", ""Deepak Mittal"", ""Abhishek Goel"", ""Maneesh Singh""]","[""Computer Vision"", ""Object Contour Detection"", ""Curriculum Learning"", ""Wavelets"", ""Aerial Imagery""]","This work addresses class-specific object boundary extraction, i.e., retrieving boundary pixels that belong to a class of objects in the given image. Although recent ConvNet-based approaches demonstrate impressive results, we notice that they produce several false-alarms and misdetections when used in real-world applications. We hypothesize that although boundary detection is simple at some pixels that are rooted in identifiable high-frequency locations, other pixels pose a higher level of difficulties, for instance, region pixels with an appearance similar to the boundaries; or boundary pixels with insignificant edge strengths. Therefore, the training process needs to account for different levels of learning complexity in different regions to overcome false alarms. In this work, we devise a curriculum-learning-based training process for object boundary detection. 
This multi-stage training process first trains the network at simpler pixels (with sufficient edge strengths) and then at harder pixels in the later stages of the curriculum. We also propose a novel system for object boundary detection that relies on a fully convolutional neural network (FCN) and wavelet decomposition of image frequencies. This system uses high-frequency bands from the wavelet pyramid and augments them to conv features from different layers of FCN. Our ablation studies with contourMNIST dataset, a simulated digit contours from MNIST, demonstrate that this explicit high-frequency augmentation helps the model to converge faster. Our model trained by the proposed curriculum scheme outperforms a state-of-the-art object boundary detection method by a significant margin on a challenging aerial image dataset. +",/pdf/1392ab62dba3286d645f05ff6c47d2d174807612.pdf,ICLR,2020,This work proposes a novel ConvNet architecture and a two-stage training scheme for class-specific object boundary estimation with improved performance levels. +H1gx3kSKPS,HkeB8e1YwH,1569440000000.0,1577170000000.0,1939,Stein Bridging: Enabling Mutual Reinforcement between Explicit and Implicit Generative Models,"[""echo740@sjtu.edu.cn"", ""rui.gao@mccombs.utexas.edu"", ""zha@cc.gatech.edu""]","[""Qitian Wu"", ""Rui Gao"", ""Hongyuan Zha""]","[""generative models"", ""generative adversarial networks"", ""energy models""]","Deep generative models are generally categorized into explicit models and implicit models. The former assumes an explicit density form whose normalizing constant is often unknown; while the latter, including generative adversarial networks (GANs), generates samples using a push-forward mapping. In spite of substantial recent advances demonstrating the power of the two classes of generative models in many applications, both of them, when used alone, suffer from respective limitations and drawbacks. 
To mitigate these issues, we propose Stein Bridging, a novel joint training framework that connects an explicit density estimator and an implicit sample generator with Stein discrepancy. We show that the Stein Bridge induces new regularization schemes for both explicit and implicit models. Convergence analysis and extensive experiments demonstrate that the Stein Bridging i) improves the stability and sample quality of the GAN training, and ii) facilitates the density estimator to seek more modes in data and alleviate the mode-collapse issue. Additionally, we discuss several applications of Stein Bridging and useful tricks in practical implementation used in our experiments.",/pdf/4ca485f0a2961d11ec56665827c5ed07dd0924c6.pdf,ICLR,2020, +HJekvT4twr,rkxY9QqPvB,1569440000000.0,1577170000000.0,582,RGTI:Response generation via templates integration for End to End dialog,"[""zhangyuxin960625@gmail.com"", ""anchor3l31@gmail.com""]","[""Yuxin Zhang"", ""Songyan Liu""]","[""End-to-end dialogue systems"", ""transformer"", ""pointer-generate network""]","End-to-end models have achieved considerable success in task-oriented dialogue area, but suffer from the challenges of (a) poor semantic control, and (b) little interaction with auxiliary information. In this paper, we propose a novel yet simple end-to-end model for response generation via mixed templates, which can address above challenges. +In our model, we retrieval candidate responses which contain abundant syntactic and sequence information by dialogue semantic information related to dialogue history. Then, we exploit candidate response attention to get templates which should be mentioned in response. Our model can integrate multi template information to guide the decoder module how to generate response better. We show that our proposed model learns useful templates information, which improves the performance of ""how to say"" and ""what to say"" in response generation. 
Experiments on the large-scale Multiwoz dataset demonstrate the effectiveness of our proposed model, which attain the state-of-the-art performance.",/pdf/465146d56a8702e3998e676334db193173121e99.pdf,ICLR,2020,A new simple but efficient model for end-to-end dialogue +NTP9OdaT6nm,m2iybEfUkbg,1601310000000.0,1614990000000.0,2291,Formal Language Constrained Markov Decision Processes,"[""~Eleanor_Quint1"", ""~Dong_Xu1"", ""~Samuel_W_Flint1"", ""~Stephen_D_Scott1"", ""~Matthew_Dwyer1""]","[""Eleanor Quint"", ""Dong Xu"", ""Samuel W Flint"", ""Stephen D Scott"", ""Matthew Dwyer""]","[""safe reinforcement learning"", ""formal languages"", ""constrained Markov decision process"", ""safety gym"", ""safety""]","In order to satisfy safety conditions, an agent may be constrained from acting freely. A safe controller can be designed a priori if an environment is well understood, but not when learning is employed. In particular, reinforcement learned (RL) controllers require exploration, which can be hazardous in safety critical situations. We study the benefits of giving structure to the constraints of a constrained Markov decision process by specifying them in formal languages as a step towards using safety methods from software engineering and controller synthesis. We instantiate these constraints as finite automata to efficiently recognise constraint violations. Constraint states are then used to augment the underlying MDP state and to learn a dense cost function, easing the problem of quickly learning joint MDP/constraint dynamics. 
We empirically evaluate the effect of these methods on training a variety of RL algorithms over several constraints specified in Safety Gym, MuJoCo, and Atari environments.",/pdf/b7207b0628a0fcd319c603230ccf9af4e1863532.pdf,ICLR,2021,Specify safety constraints with formal languages to learn constraint structure representation and densely shape the CMDP cost function +Hkemdj09YQ,rJeAQn-GKm,1538090000000.0,1545360000000.0,339,Rectified Gradient: Layer-wise Thresholding for Sharp and Coherent Attribution Maps,"[""1202kbs@gmail.com"", ""sjh@satreci.com"", ""cjy@si-analytics.ai"", ""jmkoo@si-analytics.ai"", ""jsh@satreci.com"", ""tgjeon@si-analytics.ai""]","[""Beomsu Kim"", ""Junghoon Seo"", ""Jeongyeol Choe"", ""Jamyoung Koo"", ""Seunghyeon Jeon"", ""Taegyun Jeon""]","[""Interpretability"", ""Attribution Method"", ""Attribution Map""]","Saliency map, or the gradient of the score function with respect to the input, is the most basic means of interpreting deep neural network decisions. However, saliency maps are often visually noisy. Although several hypotheses were proposed to account for this phenomenon, there is no work that provides a rigorous analysis of noisy saliency maps. This may be a problem as numerous advanced attribution methods were proposed under the assumption that the existing hypotheses are true. In this paper, we identify the cause of noisy saliency maps. Then, we propose Rectified Gradient, a simple method that significantly improves saliency maps by alleviating that cause. Experiments showed effectiveness of our method and its superiority to other attribution methods. Codes and examples for the experiments will be released in public.",/pdf/5858ad7a13d3cf00d39b19c4343c14f46d13258e.pdf,ICLR,2019,We propose a new attribution method that removes noise from saliency maps through layer-wise thresholding during backpropagation. 
+ryT9R3Yxe,,1478250000000.0,1478280000000.0,145,Generative Paragraph Vector,"[""zhangruqing@software.ict.ac.cn"", ""guojiafeng@ict.ac.cn"", ""lanyanyan@ict.ac.cn"", ""junxu@ict.ac.cn"", ""cxq@ict.ac.cn""]","[""Ruqing Zhang"", ""Jiafeng Guo"", ""Yanyan Lan"", ""Jun Xu"", ""Xueqi Cheng""]","[""Natural language processing"", ""Deep learning"", ""Unsupervised Learning"", ""Supervised Learning""]","The recently introduced Paragraph Vector is an efficient method for learning high-quality distributed representations for pieces of texts. However, an inherent limitation of Paragraph Vector is lack of ability to infer distributed representations for texts outside of the training set. To tackle this problem, we introduce a Generative Paragraph Vector, which can be viewed as a probabilistic extension of the Distributed Bag of Words version of Paragraph Vector with a complete generative process. With the ability to infer the distributed representations for unseen texts, we can further incorporate text labels into the model and turn it into a supervised version, namely Supervised Generative Paragraph Vector. In this way, we can leverage the labels paired with the texts to guide the representation learning, and employ the learned model for prediction tasks directly. Experiments on five text classification benchmark collections show that both model architectures can yield superior classification performance over the state-of-the-art counterparts. +",/pdf/e86e72bc2d8a038d7ca66dba38c2bf74a24391f5.pdf,ICLR,2017,"With a complete generative process, our models are able to infer vector representations as well as labels over unseen texts." 
+UiLl8yjh57,20hxPB3goT,1601310000000.0,1614990000000.0,3801,Deep Reinforcement Learning For Wireless Scheduling with Multiclass Services,"[""~Apostolos_Avranas1"", ""marios.kountouris@eurecom.fr"", ""~Philippe_Ciblat1""]","[""Apostolos Avranas"", ""Marios Kountouris"", ""Philippe Ciblat""]",[],"In this paper, we investigate the problem of scheduling and resource allocation over a time varying set of clients with heterogeneous demands. This problem appears when service providers need to serve traffic generated by users with different classes of requirements. We thus have to allocate bandwidth resources over time to efficiently satisfy these demands within a limited time horizon. This is a highly intricate problem and solutions may involve tools stemming from diverse fields like combinatorics and optimization. Recent work has successfully proposed Deep Reinforcement Learning (DRL) solutions, although not yet for heterogeneous user traffic. We propose a deep deterministic policy gradient algorithm combining state of the art techniques, namely Distributional RL and Deep Sets, to train a model for heterogeneous traffic scheduling. We test on diverse number scenarios with different time dependence dynamics, users’ requirements, and resources available, demonstrating consistent results. We evaluate the algorithm on a wireless communication setting and show significant gains against state-of-the-art conventional algorithms from combinatorics and optimization (e.g. 
Knapsack, Integer Linear Programming, Frank-Wolfe).",/pdf/1d5d468bf24db6b38a10e257994bcdb446c5b9a2.pdf,ICLR,2021, +H1lfwAVFwr,Skxdy8PuwS,1569440000000.0,1577170000000.0,1168,CAPACITY-LIMITED REINFORCEMENT LEARNING: APPLICATIONS IN DEEP ACTOR-CRITIC METHODS FOR CONTINUOUS CONTROL,"[""mallot@rpi.edu"", ""mdriemer@us.ibm.com"", ""miao.liu1@ibm.com"", ""tklinger@us.ibm.com"", ""gtesauro@us.ibm.com"", ""simsc3@rpi.edu""]","[""Tyler James Malloy"", ""Matthew Riemer"", ""Miao Liu"", ""Tim Klinger"", ""Gerald Tesauro"", ""Chris R. Sims""]","[""Reinforcement Learning"", ""Generalization"", ""Information Theory"", ""Rate-Distortion Theory""]","Biological and artificial agents must learn to act optimally in spite of a limited capacity for processing, storing, and attending to information. We formalize this type of bounded rationality in terms of an information-theoretic constraint on the complexity of policies that agents seek to learn. We present the Capacity-Limited Reinforcement Learning (CLRL) objective which defines an optimal policy subject to an information capacity constraint. This objective is optimized by drawing from methods used in rate distortion theory and information theory, and applied to the reinforcement learning setting. Using this objective we implement a novel Capacity-Limited Actor-Critic (CLAC) algorithm and situate it within a broader family of RL algorithms such as the Soft Actor Critic (SAC) and discuss their similarities and differences. Our experiments show that compared to alternative approaches, CLAC offers improvements in generalization between training and modified test environments. 
This is achieved in the CLAC model while displaying high sample efficiency and minimal requirements for hyper-parameter tuning.",/pdf/f9faaeb99481bc0e85926249d884fcab6876ccc7.pdf,ICLR,2020,Applying a limit to the amount of information used to represent policies affords some improvements in generalization in Reinforcement Learning +B1lC62EKwr,SylmZ1mNwH,1569440000000.0,1577170000000.0,247,Evidence-Aware Entropy Decomposition For Active Deep Learning,"[""ws7586@rit.edu"", ""xujiang.zhao@utdallas.edu"", ""feng.chen@utdallas.edu"", ""qi.yu@rit.edu""]","[""Weishi Shi"", ""Xujiang Zhao"", ""Feng Chen"", ""Qi Yu""]","[""active learning"", ""entropy decomposition"", ""uncertainty""]","We present a novel multi-source uncertainty prediction approach that enables deep learning (DL) models to be actively trained with much less labeled data. By leveraging the second-order uncertainty representation provided by subjective logic (SL), we conduct evidence-based theoretical analysis and formally decompose the predicted entropy over multiple classes into two distinct sources of uncertainty: vacuity and dissonance, caused by lack of evidence and conflict of strong evidence, respectively. The evidence based entropy decomposition provides deeper insights on the nature of uncertainty, which can help effectively explore a large and high-dimensional unlabeled data space. We develop a novel loss function that augments DL based evidence prediction with uncertainty anchor sample identification through kernel density estimation (KDE). The accurately estimated multiple sources of uncertainty are systematically integrated and dynamically balanced using a data sampling function for label-efficient active deep learning (ADL). Experiments conducted over both synthetic and real data and comparison with competitive AL methods demonstrate the effectiveness of the proposed ADL model. 
",/pdf/dcb28ba9b0358b5e0df7b93594c812dc09f3d021.pdf,ICLR,2020,An evidence-aware entropy decomposition approach for active deep learning using multiple sources of uncertainty +SyrGJYlRZ,ry4zkYe0W,1509100000000.0,1519150000000.0,325,YellowFin and the Art of Momentum Tuning,"[""zjian@cs.stanford.edu"", ""ioannis@iro.umontreal.ca"", ""chrismre@cs.stanford.edu""]","[""Jian Zhang"", ""Ioannis Mitliagkas"", ""Christopher Re""]","[""adaptive optimizer"", ""momentum"", ""hyperparameter tuning""]","Hyperparameter tuning is one of the most time-consuming workloads in deep learning. State-of-the-art optimizers, such as AdaGrad, RMSProp and Adam, reduce this labor by adaptively tuning an individual learning rate for each variable. Recently researchers have shown renewed interest in simpler methods like momentum SGD as they may yield better results. Motivated by this trend, we ask: can simple adaptive methods, based on SGD perform as well or better? We revisit the momentum SGD algorithm and show that hand-tuning a single learning rate and momentum makes it competitive with Adam. We then analyze its robustness to learning rate misspecification and objective curvature variation. Based on these insights, we design YellowFin, an automatic tuner for momentum and learning rate in SGD. YellowFin optionally uses a negative-feedback loop to compensate for the momentum dynamics in asynchronous settings on the fly. We empirically show YellowFin can converge in fewer iterations than Adam on ResNets and LSTMs for image recognition, language modeling and constituency parsing, with a speedup of up to $3.28$x in synchronous and up to $2.69$x in asynchronous settings.",/pdf/fd684e788a3596499fffe9516d7c57837010df4e.pdf,ICLR,2018,YellowFin is an SGD based optimizer with both momentum and learning rate adaptivity. 
+Bke96sC5tm,BylyDcTcYm,1538090000000.0,1545360000000.0,831,SOLAR: Deep Structured Representations for Model-Based Reinforcement Learning,"[""marvin@cs.berkeley.edu"", ""svikram@cs.ucsd.edu"", ""smithlaura@berkeley.edu"", ""pabbeel@cs.berkeley.edu"", ""mattjj@google.com"", ""svlevine@cs.berkeley.edu""]","[""Marvin Zhang*"", ""Sharad Vikram*"", ""Laura Smith"", ""Pieter Abbeel"", ""Matthew Johnson"", ""Sergey Levine""]","[""model-based reinforcement learning"", ""structured representation learning"", ""robotics""]","Model-based reinforcement learning (RL) methods can be broadly categorized as global model methods, which depend on learning models that provide sensible predictions in a wide range of states, or local model methods, which iteratively refit simple models that are used for policy improvement. While predicting future states that will result from the current actions is difficult, local model methods only attempt to understand system dynamics in the neighborhood of the current policy, making it possible to produce local improvements without ever learning to predict accurately far into the future. The main idea in this paper is that we can learn representations that make it easy to retrospectively infer simple dynamics given the data from the current policy, thus enabling local models to be used for policy learning in complex systems. We evaluate our approach against other model-based and model-free RL methods on a suite of robotics tasks, including manipulation tasks on a real Sawyer robotic arm directly from camera images.",/pdf/248d2e42dfccc308fcbbd2e7187196b844a9af10.pdf,ICLR,2019, +rkQsMCJCb,BJzof0J0Z,1509050000000.0,1518730000000.0,187,Generative Adversarial Networks using Adaptive Convolution,"[""nmnguyen@ualberta.ca"", ""nray1@ualberta.ca""]","[""Nhat M. 
Nguyen"", ""Nilanjan Ray""]","[""Generative Adversarial Networks"", ""Unsupervised Learning"", ""GANs""]","Most existing GANs architectures that generate images use transposed convolution or resize-convolution as their upsampling algorithm from lower to higher resolution feature maps in the generator. We argue that this kind of fixed operation is problematic for GANs to model objects that have very different visual appearances. We propose a novel adaptive convolution method that learns the upsampling algorithm based on the local context at each location to address this problem. We modify a baseline GANs architecture by replacing normal convolutions with adaptive convolutions in the generator. Experiments on CIFAR-10 dataset show that our modified models improve the baseline model by a large margin. Furthermore, our models achieve state-of-the-art performance on CIFAR-10 and STL-10 datasets in the unsupervised setting.",/pdf/8040a0125d10dc2976df2465fe66a49790c37911.pdf,ICLR,2018,We replace normal convolutions with adaptive convolutions to improve GANs generator. +ryGvcoA5YX,rylfAb3ttX,1538090000000.0,1551840000000.0,542,Overcoming Catastrophic Forgetting for Continual Learning via Model Adaptation,"[""wenpeng.hu@pku.edu.cn"", ""scene@pku.edu.cn"", ""liub@uic.edu"", ""chongyangtao@pku.edu.cn"", ""tttzw@pku.edu.cn"", ""jwma@math.pku.edu.cn"", ""zhaody@pku.edu.cn"", ""ruiyan@pku.edu.cn""]","[""Wenpeng Hu"", ""Zhou Lin"", ""Bing Liu"", ""Chongyang Tao"", ""Zhengwei Tao"", ""Jinwen Ma"", ""Dongyan Zhao"", ""Rui Yan""]","[""overcoming forgetting"", ""model adaptation"", ""continual learning""]","Learning multiple tasks sequentially is important for the development of AI and lifelong learning systems. However, standard neural network architectures suffer from catastrophic forgetting which makes it difficult for them to learn a sequence of tasks. Several continual learning methods have been proposed to address the problem. 
In this paper, we propose a very different approach, called Parameter Generation and Model Adaptation (PGMA), to dealing with the problem. The proposed approach learns to build a model, called the solver, with two sets of parameters. The first set is shared by all tasks learned so far and the second set is dynamically generated to adapt the solver to suit each test example in order to classify it. Extensive experiments have been carried out to demonstrate the effectiveness of the proposed approach.",/pdf/b9aa0b3c43546e6d382decf7fd1b76acb3584250.pdf,ICLR,2019, +lDjgALS4qs8,jSTnPYQ7VE,1601310000000.0,1614990000000.0,1513,To Understand Representation of Layer-aware Sequence Encoders as Multi-order-graph,"[""~Sufeng_Duan1"", ""~hai_zhao1"", ""~Rui_Wang10""]","[""Sufeng Duan"", ""hai zhao"", ""Rui Wang""]","[""multigraph"", ""Transformer"", ""natural language process""]","In this paper, we propose a unified explanation of representation for layer-aware neural sequence encoders, which regards the representation as a revisited multigraph called multi-order-graph (MoG), so that model encoding can be viewed as a processing to capture all subgraphs in MoG. The relationship reflected by Multi-order-graph, called $n$-order dependency, can present what existing simple directed graph explanation cannot present. Our proposed MoG explanation allows to precisely observe every step of the generation of representation, put diverse relationship such as syntax into a unifiedly depicted framework. Based on the proposed MoG explanation, we further propose a graph-based self-attention network empowered Graph-Transformer by enhancing the ability of capturing subgraph information over the current models. Graph-Transformer accommodates different subgraphs into different groups, which allows model to focus on salient subgraphs. 
Results of experiments on neural machine translation tasks show that the MoG-inspired model can yield effective performance improvement.",/pdf/4401daf2fdf9d014c026662a4792629956257651.pdf,ICLR,2021,This paper proposes a unified explanation of representation for layer-aware neural sequence encoders. +H1YynweCb,Syu1hvxRZ,1509090000000.0,1518730000000.0,298,Kronecker Recurrent Units,"[""cijo.jose@idiap.ch"", ""moustaphacisse@fb.com"", ""francois.fleuret@idiap.ch""]","[""Cijo Jose"", ""Moustapha Cisse"", ""Francois Fleuret""]","[""Recurrent neural network"", ""Vanishing and exploding gradients"", ""Parameter efficiency"", ""Kronecker matrices"", ""Soft unitary constraint""]","Our work addresses two important issues with recurrent neural networks: (1) they are over-parameterized, and (2) the recurrent weight matrix is ill-conditioned. The former increases the sample complexity of learning and the training time. The latter causes the vanishing and exploding gradient problem. We present a flexible recurrent neural network model called Kronecker Recurrent Units (KRU). KRU achieves parameter efficiency in RNNs through a Kronecker factored recurrent matrix. It overcomes the ill-conditioning of the recurrent matrix by enforcing soft unitary constraints on the factors. Thanks to the small dimensionality of the factors, maintaining these constraints is computationally efficient. Our experimental results on seven standard data-sets reveal that KRU can reduce the number of parameters by three orders of magnitude in the recurrent weight matrix compared to the existing recurrent models, without trading the statistical performance. 
These results in particular show that while there are advantages in having a high dimensional recurrent space, the capacity of the recurrent part of the model can be dramatically reduced.",/pdf/6840ec92b489a3f4342e7cb4c71c633428162fbd.pdf,ICLR,2018,Our work presents a Kronecker factorization of recurrent weight matrices for parameter efficient and well conditioned recurrent neural networks. +whNntrHtB8D,rT9UAOUwCs9,1601310000000.0,1614990000000.0,283,Gradient Based Memory Editing for Task-Free Continual Learning,"[""~Xisen_Jin3"", ""~Junyi_Du1"", ""~Xiang_Ren1""]","[""Xisen Jin"", ""Junyi Du"", ""Xiang Ren""]","[""Continual learning"", ""task-free continual learning""]","Prior work on continual learning often operates in a “task-aware” manner, by assuming that the task boundaries and identities of the data examples are known at all times. In practice, however, it is rarely the case that such information is exposed to the methods (i.e., thus called “task-free”)–a setting that is relatively underexplored. Recent attempts at task-free continual learning build on previous memory replay methods and focus on developing memory construction and replay strategies such that model performance over previously seen examples can be best retained. In this paper, looking from a complementary angle, we propose a novel approach to “edit” memory examples so that the edited memory can better retain past performance when they are replayed. We use gradient updates to edit memory examples so that they are more likely to be “forgotten” in the future. 
Experiments on five benchmark datasets show the proposed method can be seamlessly combined with baselines to significantly improve the performance.",/pdf/3b6d74506aa3a89bf1d3abdc210e25407fa680f8.pdf,ICLR,2021,We propose a task-free memory-based continual learning algorithm that edits stored examples over time +eNdiU_DbM9,uYM8bLaJVYp,1601310000000.0,1615860000000.0,94,Uncertainty Sets for Image Classifiers using Conformal Prediction,"[""~Anastasios_Nikolas_Angelopoulos1"", ""stephenbates@eecs.berkeley.edu"", ""~Michael_Jordan1"", ""~Jitendra_Malik2""]","[""Anastasios Nikolas Angelopoulos"", ""Stephen Bates"", ""Michael Jordan"", ""Jitendra Malik""]","[""classification"", ""predictive uncertainty"", ""conformal inference"", ""computer vision"", ""imagenet""]","Convolutional image classifiers can achieve high predictive accuracy, but quantifying their uncertainty remains an unresolved challenge, hindering their deployment in consequential settings. Existing uncertainty quantification techniques, such as Platt scaling, attempt to calibrate the network’s probability estimates, but they do not have formal guarantees. We present an algorithm that modifies any classifier to output a predictive set containing the true label with a user-specified probability, such as 90%. The algorithm is simple and fast like Platt scaling, but provides a formal finite-sample coverage guarantee for every model and dataset. Our method modifies an existing conformal prediction algorithm to give more stable predictive sets by regularizing the small scores of unlikely classes after Platt scaling. 
In experiments on both Imagenet and Imagenet-V2 with ResNet-152 and other classifiers, our scheme outperforms existing approaches, achieving coverage with sets that are often factors of 5 to 10 smaller than a stand-alone Platt scaling baseline.",/pdf/54ecc59706032f693269ac3a32a22051e5b97bbd.pdf,ICLR,2021,"We quantify uncertainty for image classifiers using prediction sets, with detailed experiments on Imagenet Val and V2." +JdCUjf9xvlc,zdORpDc1gt,1601310000000.0,1614990000000.0,3577,Fourier Representations for Black-Box Optimization over Categorical Variables,"[""~Hamid_Dadkhahi1"", ""jriosal@us.ibm.com"", ""~Karthikeyan_Shanmugam1"", ""~Payel_Das1""]","[""Hamid Dadkhahi"", ""Jesus Rios"", ""Karthikeyan Shanmugam"", ""Payel Das""]",[],"Optimization of real-world black-box functions defined over purely categorical variables is an active area of research. In particular, optimization and design of biological sequences with specific functional or structural properties have a profound impact in medicine, materials science, and biotechnology. Standalone acquisition methods, such as simulated annealing (SA) and Monte Carlo tree search (MCTS), are typically used for such optimization problems. In order to improve the performance and sample efficiency of such acquisition methods, we propose to use existing acquisition methods in conjunction with a surrogate model for the black-box evaluations over purely categorical variables. To this end, we present two different representations, a group-theoretic Fourier expansion and an abridged one-hot encoded Boolean Fourier expansion. To learn such models, characters of each representation are considered as experts and their respective coefficients are updated via an exponential weight update rule each time the black box is evaluated. 
Numerical experiments over synthetic benchmarks as well as real-world RNA sequence optimization and design problems demonstrate the representational power of the proposed methods, which achieve competitive or superior performance compared to state-of-the-art counterparts, while improving the computational cost and/or sample efficiency substantially.",/pdf/027fd8efe95961ea81d77c0d88202d403512e7ca.pdf,ICLR,2021,We propose novel Fourier representations as surrogate models for black box optimization over categorical variables and show its performance improvement over existing baselines when combined with state of the art acquisition functions. +T6AxtOaWydQ,EX32ELJC0WG,1601310000000.0,1616050000000.0,782,$i$-Mix: A Domain-Agnostic Strategy for Contrastive Representation Learning,"[""~Kibok_Lee1"", ""yianz@umich.edu"", ""~Kihyuk_Sohn1"", ""~Chun-Liang_Li1"", ""~Jinwoo_Shin1"", ""~Honglak_Lee2""]","[""Kibok Lee"", ""Yian Zhu"", ""Kihyuk Sohn"", ""Chun-Liang Li"", ""Jinwoo Shin"", ""Honglak Lee""]","[""self-supervised learning"", ""unsupervised representation learning"", ""contrastive representation learning"", ""data augmentation"", ""MixUp""]","Contrastive representation learning has shown to be effective to learn representations from unlabeled data. However, much progress has been made in vision domains relying on data augmentations carefully designed using domain knowledge. In this work, we propose i-Mix, a simple yet effective domain-agnostic regularization strategy for improving contrastive representation learning. We cast contrastive learning as training a non-parametric classifier by assigning a unique virtual class to each data in a batch. Then, data instances are mixed in both the input and virtual label spaces, providing more augmented data during training. In experiments, we demonstrate that i-Mix consistently improves the quality of learned representations across domains, including image, speech, and tabular data. 
Furthermore, we confirm its regularization effect via extensive ablation studies across model and dataset sizes. The code is available at https://github.com/kibok90/imix.",/pdf/c7fd4a731db5f1505f19ca2c4439421df41a8c7b.pdf,ICLR,2021,"We propose i-Mix, a simple yet effective domain-agnostic regularization strategy for improving contrastive representation learning." +SkfhIo0qtQ,SylfCCH9Y7,1538090000000.0,1545360000000.0,210,Volumetric Convolution: Automatic Representation Learning in Unit Ball,"[""sameera.ramasinghe@anu.edu.au"", ""salman.khan@anu.edu.au"", ""nick.barnes@data61.csiro.au""]","[""Sameera Ramasinghe"", ""Salman Khan"", ""Nick Barnes""]","[""convolution"", ""unit sphere"", ""3D object recognition""]","Convolution is an efficient technique to obtain abstract feature representations using hierarchical layers in deep networks. Although performing convolution in Euclidean geometries is fairly straightforward, its extension to other topological spaces---such as a sphere S^2 or a unit ball B^3---entails unique challenges. In this work, we propose a novel `""volumetric convolution"" operation that can effectively convolve arbitrary functions in B^3. We develop a theoretical framework for ""volumetric convolution"" based on Zernike polynomials and efficiently implement it as a differentiable and an easily pluggable layer for deep networks. Furthermore, our formulation leads to derivation of a novel formula to measure the symmetry of a function in B^3 around an arbitrary axis, that is useful in 3D shape analysis tasks. 
We demonstrate the efficacy of the proposed volumetric convolution operation on a possible use case, i.e., a 3D object recognition task.",/pdf/46e500ec2f270769806e95c6e0ab09064a5c3c3c.pdf,ICLR,2019,A novel convolution operator for automatic representation learning inside unit ball +XPZIaotutsD,m9dVaxlq5M,1601310000000.0,1616030000000.0,3690,DEBERTA: DECODING-ENHANCED BERT WITH DISENTANGLED ATTENTION,"[""~Pengcheng_He2"", ""~Xiaodong_Liu1"", ""~Jianfeng_Gao1"", ""~Weizhu_Chen1""]","[""Pengcheng He"", ""Xiaodong Liu"", ""Jianfeng Gao"", ""Weizhu Chen""]","[""Transformer"", ""Attention"", ""Natural Language Processing"", ""Language Model Pre-training"", ""Position Encoding""]","Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions, respectively. Second, an enhanced mask decoder is used to incorporate absolute positions in the decoding layer to predict the masked tokens in model pre-training. In addition, a new virtual adversarial training method is used for fine-tuning to improve models’ + generalization. We show that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understanding (NLU) and natural language generation (NLG) downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 
90.7%) and RACE by +3.6% (83.2% vs. 86.8%). Notably, we scale up DeBERTa by training a larger version that consists of 48 Transformer layers with 1.5 billion parameters. The significant performance boost makes the single DeBERTa model surpass the human performance on the SuperGLUE benchmark (Wang et al., 2019a) for the first time in terms of macro-average score (89.9 versus 89.8), and the ensemble DeBERTa model sits atop the SuperGLUE leaderboard as of January 6, 2021, outperforming the human baseline by a decent margin (90.3 versus +89.8). The pre-trained DeBERTa models and the source code were released at: https://github.com/microsoft/DeBERTa. +",/pdf/283448c4c3318a56c7bb21743019e9938f252538.pdf,ICLR,2021,A new model architecture DeBERTa is proposed that improves the BERT and RoBERTa models using disentangled attention and enhanced mask decoder. +Byg-An4tPr,rkxkcnrNPB,1569440000000.0,1577170000000.0,254,Differential Privacy in Adversarial Learning with Provable Robustness,"[""phan@njit.edu"", ""mythai@cise.ufl.edu"", ""rjin1@kent.edu"", ""hh255@njit.edu"", ""dou@cs.uoregon.edu""]","[""NhatHai Phan"", ""My T. Thai"", ""Ruoming Jin"", ""Han Hu"", ""Dejing Dou""]","[""differential privacy"", ""adversarial learning"", ""robustness bound"", ""adversarial example""]","In this paper, we aim to develop a novel mechanism to preserve differential privacy (DP) in adversarial learning for deep neural networks, with provable robustness to adversarial examples. We leverage the sequential composition theory in DP, to establish a new connection between DP preservation and provable robustness. To address the trade-off among model utility, privacy loss, and robustness, we design an original, differentially private, adversarial objective function, based on the post-processing property in DP, to tighten the sensitivity of our model. 
An end-to-end theoretical analysis and thorough evaluations show that our mechanism notably improves the robustness of DP deep neural networks.",/pdf/dd1b46e08d8208651196a85ece98519b8355a203.pdf,ICLR,2020,Preserving Differential Privacy in Adversarial Learning with Provable Robustness to Adversarial Examples +ryGfnoC5KQ,BJlypba9F7,1538090000000.0,1550870000000.0,691,Kernel RNN Learning (KeRNL),"[""christopher_roth@utexas.edu"", ""ingmar@openai.com"", ""fiete@mit.edu""]","[""Christopher Roth"", ""Ingmar Kanitscheider"", ""Ila Fiete""]","[""RNNs"", ""Biologically plausible learning rules"", ""Algorithm"", ""Neural Networks"", ""Supervised Learning""]","We describe Kernel RNN Learning (KeRNL), a reduced-rank, temporal eligibility trace-based approximation to backpropagation through time (BPTT) for training recurrent neural networks (RNNs) that gives competitive performance to BPTT on long time-dependence tasks. The approximation replaces a rank-4 gradient learning tensor, which describes how past hidden unit activations affect the current state, by a simple reduced-rank product of a sensitivity weight and a temporal eligibility trace. In this structured approximation motivated by node perturbation, the sensitivity weights and eligibility kernel time scales are themselves learned by applying perturbations. The rule represents another step toward biologically plausible or neurally inspired ML, with lower complexity in terms of relaxed architectural requirements (no symmetric return weights), a smaller memory demand (no unfolding and storage of states over time), and a shorter feedback time. 
",/pdf/4ea360f981a40f9874d4c4e517414d724678f4af.pdf,ICLR,2019,A biologically plausible learning rule for training recurrent neural networks +H1g0piA9tQ,HJgyvUoqYm,1538090000000.0,1545360000000.0,849,Evaluation Methodology for Attacks Against Confidence Thresholding Models,"[""goodfellow@google.com"", ""yaoqin@google.com"", ""dberth@google.com""]","[""Ian Goodfellow"", ""Yao Qin"", ""David Berthelot""]","[""adversarial examples""]","Current machine learning algorithms can be easily fooled by adversarial examples. One possible solution path is to make models that use confidence thresholding to avoid making mistakes. Such models refuse to make a prediction when they are not confident of their answer. We propose to evaluate such models in terms of tradeoff curves with the goal of high success rate on clean examples and low failure rate on adversarial examples. Existing untargeted attacks developed for models that do not use confidence thresholding tend to underestimate such models' vulnerability. We propose the MaxConfidence family of attacks, which are optimal in a variety of theoretical settings, including one realistic setting: attacks against linear models. Experiments show the attack attains good results in practice. We show that simple defenses are able to perform well on MNIST but not on CIFAR, contributing further to previous calls that MNIST should be retired as a benchmarking dataset for adversarial robustness research. 
We release code for these evaluations as part of the cleverhans (Papernot et al 2018) library (ICLR reviewers should be careful not to look at who contributed these features to cleverhans to avoid de-anonymizing this submission).",/pdf/58ee81bbb9ab33424599a4a7ac6ff9642b54721a.pdf,ICLR,2019,We present metrics and an optimal attack for evaluating models that defend against adversarial examples using confidence thresholding +BkgZxpVFvH,rkgdkbABvS,1569440000000.0,1577170000000.0,327,LSTOD: Latent Spatial-Temporal Origin-Destination prediction model and its applications in ride-sharing platforms,"[""zhoufan@mail.shufe.edu.cn"", ""zhou@bios.unc.edu"", ""zhuhongtu@didiglobal.com""]","[""Fan Zhou"", ""Haibo Zhou"", ""Hongtu Zhu""]","[""Origin-Destination Flow"", ""Spatial Adjacent Convolution Network"", ""Periodically Shift Attention Mechanism""]","Origin-Destination (OD) flow data is an important instrument in transportation studies. Precise prediction of customer demands from each original location to a destination given a series of previous snapshots helps ride-sharing platforms to better understand their market mechanism. However, most existing prediction methods ignore the network structure of OD flow data and fail to utilize the topological dependencies among related OD pairs. In this paper, we propose a latent spatial-temporal origin-destination (LSTOD) model, with a novel convolutional neural network (CNN) filter to learn the spatial features of OD pairs from a graph perspective and an attention structure to capture their long-term periodicity. Experiments on a real customer request dataset with available OD information from a ride-sharing platform demonstrate the advantage of LSTOD in achieving at least 6.5% improvement in prediction accuracy over the second best model. ",/pdf/8da4d09a569302f8170bfb6d7ae3f91aff217445.pdf,ICLR,2020,We propose a purely convolutional CNN model with attention mechanism to predict spatial-temporal origin-destination flows. 
+Wga_hrCa3P3,3X0NZQqn_y,1601310000000.0,1616650000000.0,491,Contrastive Learning with Adversarial Perturbations for Conditional Text Generation,"[""~Seanie_Lee1"", ""~Dong_Bok_Lee1"", ""~Sung_Ju_Hwang1""]","[""Seanie Lee"", ""Dong Bok Lee"", ""Sung Ju Hwang""]","[""conditional text generation"", ""contrastive learning""]","Recently, sequence-to-sequence (seq2seq) models with the Transformer architecture have achieved remarkable performance on various conditional text generation tasks, such as machine translation. However, most of them are trained with teacher forcing with the ground truth label given at each time step, without being exposed to incorrectly generated tokens during training, which hurts its generalization to unseen inputs, that is known as the ""exposure bias"" problem. In this work, we propose to solve the conditional text generation problem by contrasting positive pairs with negative pairs, such that the model is exposed to various valid or incorrect perturbations of the inputs, for improved generalization. However, training the model with naïve contrastive learning framework using random non-target sequences as negative examples is suboptimal, since they are easily distinguishable from the correct output, especially so with models pretrained with large text corpora. Also, generating positive examples requires domain-specific augmentation heuristics which may not generalize over diverse domains. To tackle this problem, we propose a principled method to generate positive and negative samples for contrastive learning of seq2seq models. Specifically, we generate negative examples by adding small perturbations to the input sequence to minimize its conditional likelihood, and positive examples by adding large perturbations while enforcing it to have a high conditional likelihood. Such `""hard'' positive and negative pairs generated using our method guides the model to better distinguish correct outputs from incorrect ones. 
We empirically show that our proposed method significantly improves the generalization of the seq2seq on three text generation tasks --- machine translation, text summarization, and question generation.",/pdf/a9b3656c6f165fb3975db9f4187eae140eca3593.pdf,ICLR,2021,We propose a contrastive learning with adversarial perturbation to tackle the exposure bias problem. +SyglyANFDr,SylJa7f_vr,1569440000000.0,1577170000000.0,879,SGD with Hardness Weighted Sampling for Distributionally Robust Deep Learning,"[""lucas.fidon@kcl.ac.uk"", ""sebastien.ourselin@kcl.ac.uk"", ""tom.vercauteren@kcl.ac.uk""]","[""Lucas Fidon"", ""Sebastien Ourselin"", ""Tom Vercauteren""]","[""distributionally robust optimization"", ""distributionally robust deep learning"", ""over-parameterized deep neural networks"", ""deep neural networks"", ""AI safety"", ""hard example mining""]","Distributionally Robust Optimization (DRO) has been proposed as an alternative to Empirical Risk Minimization (ERM) in order to account for potential biases in the training data distribution. However, its use in deep learning has been severely restricted due to the relative inefficiency of the optimizers available for DRO compared to the widespread Stochastic Gradient Descent (SGD) based optimizers for deep learning with ERM. In this work, we demonstrate that SGD with hardness weighted sampling is a principled and efficient optimization method for DRO in machine learning and is particularly suited in the context of deep learning. Similar to a hard example mining strategy in essence and in practice, the proposed algorithm is straightforward to implement and computationally as efficient as SGD-based optimizers used for deep learning. It only requires adding a softmax layer and maintaining a history of the loss values for each training example to compute adaptive sampling probabilities. 
In contrast to typical ad hoc hard mining approaches, and exploiting recent theoretical results in deep learning optimization, we prove the convergence of our DRO algorithm for over-parameterized deep learning networks with ReLU activation and finite number of layers and parameters. Preliminary results demonstrate the feasibility and usefulness of our approach.",/pdf/606d6bade0185bfa838098ad646418e247c6cd69.pdf,ICLR,2020,An SGD-based method for training deep neural networks with distributionally robust optimization +SJeQEp4YDH,HklsWtMvDH,1569440000000.0,1589000000000.0,480,GAT: Generative Adversarial Training for Adversarial Example Detection and Robust Classification,"[""xy4cm@virginia.edu"", ""skolouri@hrl.com"", ""gustavo@virginia.edu""]","[""Xuwang Yin"", ""Soheil Kolouri"", ""Gustavo K Rohde""]","[""adversarial example detection"", ""adversarial examples classification"", ""robust optimization"", ""ML security"", ""generative modeling"", ""generative classification""]","The vulnerabilities of deep neural networks against adversarial examples have become a significant concern for deploying these models in sensitive domains. Devising a definitive defense against such attacks is proven to be challenging, and the methods relying on detecting adversarial samples are only valid when the attacker is oblivious to the detection mechanism. In this paper we present an adversarial example detection method that provides performance guarantee to norm constrained adversaries. The method is based on the idea of training adversarial robust subspace detectors using generative adversarial training (GAT). The novel GAT objective presents a saddle point problem similar to that of GANs; it has the same convergence property, and consequently supports the learning of class conditional distributions. 
We demonstrate that the saddle point problem could be reasonably solved by PGD attack, and further use the learned class conditional generative models to define generative detection/classification models that are both robust and more interpretable. We provide comprehensive evaluations of the above methods, and demonstrate their competitive performances and compelling properties on adversarial detection and robust classification problems.",/pdf/7dea0414ad25ca0d21744619b376b9d45dd8743e.pdf,ICLR,2020,We propose an objective that could be used for training adversarial example detection and robust classification systems. +r1lUOzWCW,By2SuzZRb,1509140000000.0,1521650000000.0,871,Demystifying MMD GANs,"[""mikbinkowski@gmail.com"", ""dsuth@cs.ubc.ca"", ""michael.n.arbel@gmail.com"", ""arthur.gretton@gmail.com""]","[""Miko\u0142aj Bi\u0144kowski"", ""Danica J. Sutherland"", ""Michael Arbel"", ""Arthur Gretton""]","[""gans"", ""mmd"", ""ipms"", ""wgan"", ""gradient penalty"", ""unbiased gradients""]","We investigate the training and performance of generative adversarial networks using the Maximum Mean Discrepancy (MMD) as critic, termed MMD GANs. As our main theoretical contribution, we clarify the situation with bias in GAN loss functions raised by recent work: we show that gradient estimators used in the optimization process for both MMD GANs and Wasserstein GANs are unbiased, but learning a discriminator based on samples leads to biased gradients for the generator parameters. We also discuss the issue of kernel choice for the MMD critic, and characterize the kernel corresponding to the energy distance used for the Cramér GAN critic. Being an integral probability metric, the MMD benefits from training strategies recently developed for Wasserstein GANs. In experiments, the MMD GAN is able to employ a smaller critic network than the Wasserstein GAN, resulting in a simpler and faster-training algorithm with matching performance. 
We also propose an improved measure of GAN convergence, the Kernel Inception Distance, and show how to use it to dynamically adapt learning rates during GAN training.",/pdf/5308a4739abf6c4d149c09c21a4c52e29538f914.pdf,ICLR,2018,Explain bias situation with MMD GANs; MMD GANs work with smaller critic networks than WGAN-GPs; new GAN evaluation metric. +Skl1HCNKDr,r1goXOUuvH,1569440000000.0,1577170000000.0,1096,Learning Generative Models using Denoising Density Estimators,"[""siavash.bigdeli@csem.ch"", ""geng@cs.umd.edu"", ""tiziano.portenier@vision.ee.ethz.ch"", ""andrea.dunbar@csem.ch"", ""zwicker@cs.umd.edu""]","[""Siavash Bigdeli"", ""Geng Lin"", ""Tiziano Portenier"", ""Andrea Dunbar"", ""Matthias Zwicker""]","[""generative probabilistic models"", ""denoising autoencoders"", ""neural density estimation""]","Learning generative probabilistic models that can estimate the continuous density given a set of samples, and that can sample from that density is one of the fundamental challenges in unsupervised machine learning. In this paper we introduce a new approach to obtain such models based on what we call denoising density estimators (DDEs). A DDE is a scalar function, parameterized by a neural network, that is efficiently trained to represent a kernel density estimator of the data. In addition, we show how to leverage DDEs to develop a novel approach to obtain generative models that sample from given densities. We prove that our algorithms to obtain both DDEs and generative models are guaranteed to converge to the correct solutions. Advantages of our approach include that we do not require specific network architectures like in normalizing flows, ODE solvers as in continuous normalizing flows, nor do we require adversarial training as in generative adversarial networks (GANs). Finally, we provide experimental results that demonstrate practical applications of our technique. 
+",/pdf/05ecd61d335401252aa317f2d4fe4972637a70e3.pdf,ICLR,2020,"A novel approach to train generative models including density estimation; different from normalizing and continuous flows, VAEs, or autoregressive models." +B1p461b0W,rkoE6kWRW,1509130000000.0,1518730000000.0,526,Deep Learning is Robust to Massive Label Noise,"[""drolnick@mit.edu"", ""av443@cornell.edu"", ""sjb344@cornell.edu"", ""shanir@csail.mit.edu""]","[""David Rolnick"", ""Andreas Veit"", ""Serge Belongie"", ""Nir Shavit""]","[""label noise"", ""weakly supervised learning"", ""robustness of neural networks"", ""deep learning"", ""large datasets""]","Deep neural networks trained on large supervised datasets have led to impressive results in recent years. However, since well-annotated datasets can be prohibitively expensive and time-consuming to collect, recent work has explored the use of larger but noisy datasets that can be more easily obtained. In this paper, we investigate the behavior of deep neural networks on training sets with massively noisy labels. We show on multiple datasets such as MNIST, CIFAR-10 and ImageNet that successful learning is possible even with an essentially arbitrary amount of noise. For example, on MNIST we find that accuracy of above 90 percent is still attainable even when the dataset has been diluted with 100 noisy examples for each clean example. Such behavior holds across multiple patterns of label noise, even when noisy labels are biased towards confusing classes. Further, we show how the required dataset size for successful training increases with higher label noise. Finally, we present simple actionable techniques for improving learning in the regime of high label noise.",/pdf/cc32910daa6421501516c5147058865e62bd999a.pdf,ICLR,2018,We show that deep neural networks are able to learn from data that has been diluted by an arbitrary amount of noise. 
+SJa1Nk10b,HyhyEkyC-,1508990000000.0,1518730000000.0,111,Anytime Neural Network: a Versatile Trade-off Between Computation and Accuracy,"[""hanzhang@cs.cmu.edu"", ""dedey@microsoft.com"", ""hebert@ri.cmu.edu"", ""dbagnell@ri.cmu.edu""]","[""Hanzhang Hu"", ""Debadeepta Dey"", ""Martial Hebert"", ""J. Andrew Bagnell""]","[""anytime"", ""neural network"", ""adaptive prediction"", ""budgeted prediction""]","We present an approach for anytime predictions in deep neural networks (DNNs). For each test sample, an anytime predictor produces a coarse result quickly, and then continues to refine it until the test-time computational budget is depleted. Such predictors can address the growing computational problem of DNNs by automatically adjusting to varying test-time budgets. In this work, we study a \emph{general} augmentation to feed-forward networks to form anytime neural networks (ANNs) via auxiliary predictions and losses. Specifically, we point out a blind-spot in recent studies in such ANNs: the importance of high final accuracy. In fact, we show on multiple recognition data-sets and architectures that by having near-optimal final predictions in small anytime models, we can effectively double the speed of large ones to reach corresponding accuracy level. We achieve such speed-up with simple weighting of anytime losses that oscillate during training. We also assemble a sequence of exponentially deepening ANNs, to achieve both theoretically and practically near-optimal anytime results at any budget, at the cost of a constant fraction of additional consumed budget.",/pdf/011ca597b2f91c674b738341ac82d2b102c6ac29.pdf,ICLR,2018,"By focusing more on the final predictions in anytime predictors (such as the very recent Multi-Scale-DenseNets), we make small anytime models to outperform large ones that don't have such focus. 
" +QM4_h99pjCE,#NAME?,1601310000000.0,1614990000000.0,3223,Decentralized Deterministic Multi-Agent Reinforcement Learning,"[""antoine.grosnit@polytechnique.edu"", ""desmond.cai@gmail.com"", ""~Laura_Wynter1""]","[""Antoine Grosnit"", ""Desmond Cai"", ""Laura Wynter""]","[""multiagent reinforcement learning"", ""MARL"", ""decentralized actor-critic algorithm""]","Recent work in multi-agent reinforcement learning (MARL) by [Zhang, ICML 2018] provided the first decentralized actor-critic algorithm to offer convergence guarantees. In that work, policies are stochastic and are defined on finite action spaces. We extend those results to develop a provably-convergent decentralized actor-critic algorithm for learning deterministic policies on continuous action spaces. Deterministic policies are important in many real-world settings. To handle the lack of exploration inherent in deterministic policies we provide results for the off-policy setting as well as the on-policy setting. We provide the main ingredients needed for this problem: the expression of a local deterministic policy gradient, a decentralized deterministic actor-critic algorithm, and convergence guarantees when the value functions are approximated linearly. This work enables decentralized MARL in high-dimensional action spaces and paves the way for more widespread application of MARL.",/pdf/dcaee4b93ff5a38c3d6871819ef8f9589288a27f.pdf,ICLR,2021,We provide a provably-convergent decentralized actor-critic algorithm for learning deterministic reinforcement learning policies on continuous action spaces. 
+SJldu6EtDS,SyxPKb2vDS,1569440000000.0,1577170000000.0,637,Wasserstein Adversarial Regularization (WAR) on label noise,"[""bharath-bhushan.damodaran@irisa.fr"", ""kilian.fatras@irisa.fr"", ""sylvain.lobry@wur.nl"", ""remi.flamary@unice.fr"", ""devis.tuia@wur.nl"", ""ncourty@irisa.fr""]","[""Bharath Damodaran"", ""Kilian Fatras"", ""Sylvain Lobry"", ""R\u00e9mi Flamary"", ""Devis Tuia"", ""Nicolas Courty""]","[""Label Noise"", ""Adversarial regularization"", ""Wasserstein""]","Noisy labels often occur in vision datasets, especially when they are obtained from crowdsourcing or Web scraping. We propose a new regularization method, which enables learning robust classifiers in presence of noisy data. To achieve this goal, we propose a new adversarial regularization scheme based on the Wasserstein distance. Using this distance allows taking into account specific relations between classes by leveraging the geometric properties of the labels space. Our Wasserstein Adversarial Regularization (WAR) encodes a selective regularization, which promotes smoothness of the classifier between some classes, while preserving sufficient complexity of the decision boundary between others. 
We first discuss how and why adversarial regularization can be used in the context of label noise and then show the effectiveness of our method on five datasets corrupted with noisy labels: in both benchmarks and real datasets, WAR outperforms the state-of-the-art +competitors.",/pdf/977f77169faf514e2d36da10593a55f72bc3378a.pdf,ICLR,2020,We present a novel method for handling label noise through an adversarial regularization incorporating a Wasserstein distance +OCm0rwa1lx1,8DOc41KCUeT,1601310000000.0,1614990000000.0,3287,Addressing Some Limitations of Transformers with Feedback Memory,"[""~Angela_Fan2"", ""thibautlav@fb.com"", ""~Edouard_Grave1"", ""~Armand_Joulin1"", ""~Sainbayar_Sukhbaatar1""]","[""Angela Fan"", ""Thibaut Lavril"", ""Edouard Grave"", ""Armand Joulin"", ""Sainbayar Sukhbaatar""]","[""Feedback"", ""Memory"", ""Transformers""]","Transformers have been successfully applied to sequential tasks despite being feedforward networks. Unlike recurrent neural networks, Transformers use attention to capture temporal relations while processing input tokens in parallel. While this parallelization makes them computationally efficient, it restricts the model from fully exploiting the sequential nature of the input. The representation at a given layer can only access representations from lower layers, rather than the higher level representations already available. In this work, we propose the Feedback Transformer architecture that exposes all previous representations to all future representations, meaning the lowest representation of the current timestep is formed from the highest-level abstract representation of the past. 
We demonstrate on a variety of benchmarks in language modeling, machine translation, and reinforcement learning that the increased representation capacity can create small, shallow models with much stronger performance than comparable Transformers.",/pdf/7bcd00e7e6fb6ac4638fb5f348ea2992f7d70681.pdf,ICLR,2021,Transformers have shortcomings - limited memory and limited state update - but Feedback Memory is a straightforward way to resolve these. +rJg46kHYwH,S1gwfVkYPH,1569440000000.0,1577170000000.0,1985,Adaptive Generation of Unrestricted Adversarial Inputs,"[""isaac.dunn@cs.ox.ac.uk"", ""hadrien.pouget@cs.ox.ac.uk"", ""tom.melham@cs.ox.ac.uk"", ""kroening@cs.ox.ac.uk""]","[""Isaac Dunn"", ""Hadrien Pouget"", ""Tom Melham"", ""Daniel Kroening""]","[""Adversarial Examples"", ""Adversarial Robustness"", ""Generative Adversarial Networks"", ""Image Classification""]","Neural networks are vulnerable to adversarially-constructed perturbations of their inputs. Most research so far has considered perturbations of a fixed magnitude under some $l_p$ norm. Although studying these attacks is valuable, there has been increasing interest in the construction of—and robustness to—unrestricted attacks, which are not constrained to a small and rather artificial subset of all possible adversarial inputs. We introduce a novel algorithm for generating such unrestricted adversarial inputs which, unlike prior work, is adaptive: it is able to tune its attacks to the classifier being targeted. It also offers a 400–2,000× speedup over the existing state of the art. We demonstrate our approach by generating unrestricted adversarial inputs that fool classifiers robust to perturbation-based attacks. 
We also show that, by virtue of being adaptive and unrestricted, our attack is able to bypass adversarial training against it.",/pdf/478d88342cc84a2a15248b6b8b62095735ea7137.pdf,ICLR,2020,Training GANs to generate unrestricted adversarial examples +iaO86DUuKi,hV18Wxgg2cg,1601310000000.0,1616010000000.0,883,Conservative Safety Critics for Exploration,"[""~Homanga_Bharadhwaj1"", ""~Aviral_Kumar2"", ""~Nicholas_Rhinehart1"", ""~Sergey_Levine1"", ""~Florian_Shkurti1"", ""~Animesh_Garg1""]","[""Homanga Bharadhwaj"", ""Aviral Kumar"", ""Nicholas Rhinehart"", ""Sergey Levine"", ""Florian Shkurti"", ""Animesh Garg""]","[""Safe exploration"", ""Reinforcement Learning""]","Safe exploration presents a major challenge in reinforcement learning (RL): when active data collection requires deploying partially trained policies, we must ensure that these policies avoid catastrophically unsafe regions, while still enabling trial and error learning. In this paper, we target the problem of safe exploration in RL, by learning a conservative safety estimate of environment states through a critic, and provably upper bound the likelihood of catastrophic failures at every training iteration. We theoretically characterize the tradeoff between safety and policy improvement, show that the safety constraints are satisfied with high probability during training, derive provable convergence guarantees for our approach which is no worse asymptotically than standard RL, and empirically demonstrate the efficacy of the proposed approach on a suite of challenging navigation, manipulation, and locomotion tasks. Our results demonstrate that the proposed approach can achieve competitive task performance, while incurring significantly lower catastrophic failure rates during training as compared to prior methods. 
Videos are at this URL https://sites.google.com/view/conservative-safety-critics/",/pdf/31cfa17ce6b5a4dd1c7e3bf3ce4c025642d3199e.pdf,ICLR,2021,Safe exploration in reinforcement learning can be achieved by constraining policy learning with conservative safety estimates of the environment. +KxUlUb26-P3,kNDxd0ywTy,1601310000000.0,1614990000000.0,1347,PABI: A Unified PAC-Bayesian Informativeness Measure for Incidental Supervision Signals,"[""~Hangfeng_He3"", ""myz@seas.upenn.edu"", ""qning@amazon.com"", ""~Dan_Roth3""]","[""Hangfeng He"", ""Mingyuan Zhang"", ""Qiang Ning"", ""Dan Roth""]","[""informativeness measure"", ""incidental supervision"", ""natural language processing""]","Real-world applications often require making use of {\em a range of incidental supervision signals}. However, we currently lack a principled way to measure the benefit an incidental training dataset can bring, and the common practice of using indirect, weaker signals is through exhaustive experiments with various models and hyper-parameters. This paper studies whether we can, {\em in a single framework, quantify the benefit of various types of incidental signals for one's target task without going through combinatorial experiments}. We propose PABI, a unified informativeness measure motivated by PAC-Bayesian theory, characterizing the reduction in uncertainty that indirect, weak signals provide. We demonstrate PABI's use in quantifying various types of incidental signals including partial labels, noisy labels, constraints, cross-domain signals, and combinations of these. 
Experiments with various setups on two natural language processing (NLP) tasks, named entity recognition (NER) and question answering (QA), show that PABI correlates well with learning performance, providing a promising way to determine, ahead of learning, which supervision signals would be beneficial.",/pdf/372d209a33d11701f6e69ef4dcd894d5dd78f4f9.pdf,ICLR,2021,A unified informativeness measure to foreshadow the benefits of incidental supervision signals in natural language processing +rygBVTVFPB,HJlozRfPPS,1569440000000.0,1577170000000.0,485,Learning to Discretize: Solving 1D Scalar Conservation Laws via Deep Reinforcement Learning,"[""wang.yufei@pku.edu.cn"", ""zjshen@pku.edu.cn"", ""zlong@pku.edu.cn"", ""dongbin@math.pku.edu.cn""]","[""Yufei Wang*"", ""Ziju Shen*"", ""Zichao Long"", ""Bin Dong""]","[""Numerical Methods"", ""Conservation Laws"", ""Reinforcement Learning""]","Conservation laws are considered to be fundamental laws of nature. It has broad application in many fields including physics, chemistry, biology, geology, and engineering. Solving the differential equations associated with conservation laws is a major branch in computational mathematics. Recent success of machine learning, especially deep learning, in areas such as computer vision and natural language processing, has attracted a lot of attention from the community of computational mathematics and inspired many intriguing works in combining machine learning with traditional methods. In this paper, we are the first to explore the possibility and benefit of solving nonlinear conservation laws using deep reinforcement learning. As a proof of concept, we focus on 1-dimensional scalar conservation laws. We deploy the machinery of deep reinforcement learning to train a policy network that can decide on how the numerical solutions should be approximated in a sequential and spatial-temporal adaptive manner. 
We will show that the problem of solving conservation laws can be naturally viewed as a sequential decision making process and the numerical schemes learned in such a way can easily enforce long-term accuracy. +Furthermore, the learned policy network is carefully designed to determine a good local discrete approximation based on the current state of the solution, which essentially makes the proposed method a meta-learning approach. +In other words, the proposed method is capable of learning how to discretize for a given situation mimicking human experts. Finally, we will provide details on how the policy network is trained, how well it performs compared with some state-of-the-art numerical solvers such as WENO schemes, and how well it generalizes. Our code is released anonymously at \url{https://github.com/qwerlanksdf/L2D}.",/pdf/400c18d81359004c254bad433bc535d8fdbe2e73.pdf,ICLR,2020,"We observe that numerical PDE solvers can be regarded as Markov Decision Processes, and propose to use Reinforcement Learning to solve 1D scalar Conservation Laws" +t4EWDRLHwcZ,feyXPIOnnKF,1601310000000.0,1614990000000.0,2389,Graph Learning via Spectral Densification,"[""~Zhuo_Feng3"", ""~Yongyu_Wang1"", ""~Zhiqiang_Zhao1""]","[""Zhuo Feng"", ""Yongyu Wang"", ""Zhiqiang Zhao""]","[""Spectral Graph Theory"", ""Undirected Graphical models"", ""Gaussian Markov Random Fields""]","Graph learning plays an important role in many data mining and machine learning tasks, such as manifold learning, data representation and analysis, dimensionality reduction, data clustering, and visualization, etc. For the first time, we present a highly-scalable spectral graph densification approach (GRASPEL) for graph learning from data. By limiting the precision matrix to be a graph-Laplacian-like matrix in graphical Lasso, our approach aims to learn ultra-sparse undirected graphs from potentially high-dimensional input data. 
A very unique property of the graphs learned by GRASPEL is that the spectral embedding (or approximate effective-resistance) distances on the graph will encode the similarities between the original input data points. By interleaving the latest high-performance nearly-linear +time spectral methods, ultrasparse yet spectrally-robust graphs can be learned by identifying and including the most spectrally-critical edges into the graph. Compared with prior state-of-the-art graph learning approaches, GRASPEL is more scalable and allows substantially improving computing efficiency and solution quality of a variety of data mining and +machine learning applications, such as manifold learning, spectral clustering (SC), and dimensionality reduction.",/pdf/7db97865241a53715e50c1b12944d8c350d260b7.pdf,ICLR,2021,A highly-efficient graph learning approach exploiting high-performance spectral graph algorithms +F-mvpFpn_0q,V5mlUf8f29,1601310000000.0,1615220000000.0,515,Rapid Task-Solving in Novel Environments,"[""~Samuel_Ritter1"", ""~Ryan_Faulkner2"", ""~Laurent_Sartran1"", ""~Adam_Santoro1"", ""~Matthew_Botvinick1"", ""~David_Raposo1""]","[""Samuel Ritter"", ""Ryan Faulkner"", ""Laurent Sartran"", ""Adam Santoro"", ""Matthew Botvinick"", ""David Raposo""]","[""deep reinforcement learning"", ""meta learning"", ""deep learning"", ""exploration"", ""planning""]","We propose the challenge of rapid task-solving in novel environments (RTS), wherein an agent must solve a series of tasks as rapidly as possible in an unfamiliar environment. An effective RTS agent must balance between exploring the unfamiliar environment and solving its current task, all while building a model of the new environment over which it can plan when faced with later tasks. While modern deep RL agents exhibit some of these abilities in isolation, none are suitable for the full RTS challenge. 
To enable progress toward RTS, we introduce two challenge domains: (1) a minimal RTS challenge called the Memory&Planning Game and (2) One-Shot StreetLearn Navigation, which introduces scale and complexity from real-world data. We demonstrate that state-of-the-art deep RL agents fail at RTS in both domains, and that this failure is due to an inability to plan over gathered knowledge. We develop Episodic Planning Networks (EPNs) and show that deep-RL agents with EPNs excel at RTS, outperforming the nearest baseline by factors of 2-3 and learning to navigate held-out StreetLearn maps within a single episode. We show that EPNs learn to execute a value iteration-like planning algorithm and that they generalize to situations beyond their training experience.",/pdf/2a2a34541a2b4e34e92e1050f5935a08cca0163b.pdf,ICLR,2021,"Our agents meta-learn to explore, build models on-the-fly, and plan, enabling them to rapidly solve sequences of tasks in unfamiliar environments." +rJq_YBqxx,,1478280000000.0,1482370000000.0,261,Deep Character-Level Neural Machine Translation By Learning Morphology,"[""sword.york@gmail.com"", ""zhzhang@math.pku.edu.cn""]","[""Shenjian Zhao"", ""Zhihua Zhang""]","[""Natural language processing"", ""Deep learning""]","Neural machine translation aims at building a single large neural network that can be trained to maximize translation performance. The encoder-decoder architecture with an attention mechanism achieves a translation performance comparable to the existing state-of-the-art phrase-based systems. However, the use of large vocabulary becomes the bottleneck in both training and improving the performance. In this paper, we propose a novel architecture which learns morphology by using two recurrent networks and a hierarchical decoder which translates at character level. This gives rise to a deep character-level model consisting of six recurrent networks. Such a deep model has two major advantages. 
It avoids the large vocabulary issue radically; at the same time, it is more efficient in training than word-based models. Our model obtains a higher BLEU score than the bpe-based model after training for one epoch on En-Fr and En-Cs translation tasks. Further analyses show that our model is able to learn morphology. + +",/pdf/d4495be4c260cf351432b34cb5232967aeabe5c4.pdf,ICLR,2017,"We devise a character-level neural machine translation built on six recurrent networks, and obtain a BLEU score comparable to the state-of-the-art NMT on En-Fr and Cs-En translation tasks. " +S1xoy3CcYX,H1lHEBj9Ym,1538090000000.0,1545360000000.0,1013,Adversarial Examples Are a Natural Consequence of Test Error in Noise,"[""nicf@google.com"", ""gilmer@google.com"", ""cubuk@google.com""]","[""Nicolas Ford"", ""Justin Gilmer"", ""Ekin D. Cubuk""]","[""Adversarial examples"", ""generalization""]"," Over the last few years, the phenomenon of adversarial examples --- maliciously constructed inputs that fool trained machine learning models --- has captured the attention of the research community, especially when the adversary is restricted to making small modifications of a correctly handled input. At the same time, less surprisingly, image classifiers lack human-level performance on randomly corrupted images, such as images with additive Gaussian noise. In this work, we show that these are two manifestations of the same underlying phenomenon. We establish this connection in several ways. First, we find that adversarial examples exist at the same distance scales we would expect from a linear model with the same performance on corrupted images. Next, we show that Gaussian data augmentation during training improves robustness to small adversarial perturbations and that adversarial training improves robustness to several types of image corruptions. 
Finally, we present a model-independent upper bound on the distance from a corrupted image to its nearest error given test performance and show that in practice we already come close to achieving the bound, so that improving robustness further for the corrupted image distribution requires significantly reducing test error. All of this suggests that improving adversarial robustness should go hand in hand with improving performance in the presence of more general and realistic image corruptions. This yields a computationally tractable evaluation metric for defenses to consider: test error in noisy image distributions.",/pdf/9960667397256afcc527f8bb3c019f018d62cb05.pdf,ICLR,2019,Small adversarial perturbations should be expected given observed error rates of models outside the natural data distribution. +5B8YAz6W3eX,R1jbaxE6_e,1601310000000.0,1614990000000.0,88,Apollo: An Adaptive Parameter-wised Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization,"[""~Xuezhe_Ma1""]","[""Xuezhe Ma""]","[""Optimization"", ""Stochastic Optimization"", ""Nonconvex"", ""Quasi-Newton"", ""Neural Network"", ""Deep Learning""]","In this paper, we introduce Apollo, a quasi-Newton method for nonconvex stochastic optimization, which dynamically incorporates the curvature of the loss function by approximating the Hessian via a diagonal matrix. Algorithmically, Apollo requires only first-order gradients and updates the approximation of the Hessian diagonally such that it satisfies the weak secant relation. To handle nonconvexity, we replace the Hessian with its absolute value, the computation of which is also efficient under our diagonal approximation, yielding an optimization algorithm with linear complexity for both time and memory. 
Experimentally, through three tasks on vision and language we show that Apollo achieves significant improvements over other stochastic optimization methods, including SGD and variants of Adam, in terms of both convergence speed and generalization performance.",/pdf/856f79ae886cd9660bc7b5e3ea4f6794e339f82e.pdf,ICLR,2021,An Adaptive Parameter-wised Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization +ctgsGEmWjDY,r8fYfFiKKIi,1601310000000.0,1614990000000.0,556,On The Adversarial Robustness of 3D Point Cloud Classification,"[""~Jiachen_Sun1"", ""kamako@umich.edu"", ""yulongc@umich.edu"", ""~Qi_Alfred_Chen1"", ""~Zhuoqing_Mao1""]","[""Jiachen Sun"", ""Karl Koenig"", ""Yulong Cao"", ""Qi Alfred Chen"", ""Zhuoqing Mao""]","[""Adversarial Machine Learning"", ""Point Cloud Classification"", ""Adversarial Training""]","3D point clouds play pivotal roles in various safety-critical fields, such as autonomous driving, which desires the corresponding deep neural networks to be robust to adversarial perturbations. Though a few defenses against adversarial point cloud classification have been proposed, it remains unknown whether they can provide real robustness. To this end, we perform the first security analysis of state-of-the-art defenses and design adaptive attacks on them. Our 100% adaptive attack success rates demonstrate that current defense designs are still vulnerable. Since adversarial training (AT) is believed to be the most effective defense, we present the first in-depth study showing how AT behaves in point cloud classification and identify that the required symmetric function (pooling operation) is paramount to the model's robustness under AT. Through our systematic analysis, we find that the default used fixed pooling operations (e.g., MAX pooling) generally weaken AT's performance in point cloud classification. Still, sorting-based parametric pooling operations can significantly improve the models' robustness. 
Based on the above insights, we further propose DeepSym, a deep symmetric pooling operation, to architecturally advance the adversarial robustness under AT to 47.01% without sacrificing nominal accuracy, outperforming the original design and a strong baseline by 28.5% ($\sim 2.6 \times$) and 6.5%, respectively, in PointNet. ",/pdf/079efa186d8c5910771ccd134093b0eee8be3666.pdf,ICLR,2021,"In this work, we first design adaptive attacks to break the state-of-the-art defenses against adversarial point cloud classification and further improve the adversarial training performance of point cloud classification models by a large margin." +PvVbsAmxdlZ,umGQatPCnzQ,1601310000000.0,1614990000000.0,2721,Causal Inference Q-Network: Toward Resilient Reinforcement Learning,"[""~Chao-Han_Huck_Yang1"", ""ih2320@columbia.edu"", ""~Yi_Ouyang1"", ""~Pin-Yu_Chen1""]","[""Chao-Han Huck Yang"", ""Danny I-Te Hung"", ""Yi Ouyang"", ""Pin-Yu Chen""]","[""Deep Reinforcement Learning"", ""Causal Inference"", ""Robust Reinforcement Learning"", ""Adversarial Robustness""]","Deep reinforcement learning (DRL) has demonstrated impressive performance in various gaming simulators and real-world applications. In practice, however, a DRL agent may receive faulty observation by abrupt interferences such as black-out, frozen-screen, and adversarial perturbation. How to design a resilient DRL algorithm against these rare but mission-critical and safety-crucial scenarios is an important yet challenging task. In this paper, we consider a resilient DRL framework with observational interferences. Under this framework, we discuss the importance of the causal relation and propose a causal inference based DRL algorithm called causal inference Q-network (CIQ). We evaluate the performance of CIQ in several benchmark DRL environments with different types of interferences. 
Our experimental results show that the proposed CIQ method could achieve higher performance and more resilience against observational interferences.",/pdf/fb3f6e301a9b5d162d26242ad28646ac78706517.pdf,ICLR,2021,We propose a causal inference based DRL algorithm called causal inference Q-network (CIQ) under interferences toward resilient learning. +HyNmRiCqtm,H1g5ihTcYm,1538090000000.0,1545360000000.0,881,CDeepEx: Contrastive Deep Explanations,"[""sfegh001@ucr.edu"", ""cshelton@cs.ucr.edu"", ""pazzani@ucr.edu"", ""ktang012@ucr.edu""]","[""Amir Feghahati"", ""Christian R. Shelton"", ""Michael J. Pazzani"", ""Kevin Tang""]","[""Deep learning"", ""Explanation"", ""Network interpretation"", ""Contrastive explanation""]","We propose a method which can visually explain the classification decision of deep neural networks (DNNs). There are many proposed methods in machine learning and computer vision seeking to clarify the decision of machine learning black boxes, specifically DNNs. All of these methods try to gain insight into why the network ""chose class A"" as an answer. Humans, when searching for explanations, ask two types of questions. The first question is, ""Why did you choose this answer?"" The second question asks, ""Why did you not choose answer B over A?"" The previously proposed methods are either not able to provide the latter directly or efficiently. + +We introduce a method capable of answering the second question both directly and efficiently. In this work, we limit the inputs to be images. In general, the proposed method generates explanations in the input space of any model capable of efficient evaluation and gradient evaluation. 
We provide results, showing the superiority of this approach for gaining insight into the inner representation of machine learning models.",/pdf/661169393f0e8e4146d8aaa8d4c4de7b4c33d40f.pdf,ICLR,2019,"A method to answer ""why not class B?"" for explaining deep networks" +rJgUfTEYvH,BygUKT98vH,1569440000000.0,1583910000000.0,413,VideoFlow: A Conditional Flow-Based Model for Stochastic Video Generation,"[""manojkumarsivaraj334@gmail.com"", ""mb2@uiuc.edu"", ""dumitru@google.com"", ""cbfinn@eecs.berkeley.edu"", ""slevine@google.com"", ""laurentdinh@google.com"", ""d.p.kingma@uva.nl""]","[""Manoj Kumar"", ""Mohammad Babaeizadeh"", ""Dumitru Erhan"", ""Chelsea Finn"", ""Sergey Levine"", ""Laurent Dinh"", ""Durk Kingma""]","[""Video generation"", ""flow-based generative models"", ""stochastic video prediction""]","Generative models that can model and predict sequences of future events can, in principle, learn to capture complex real-world phenomena, such as physical interactions. However, a central challenge in video prediction is that the future is highly uncertain: a sequence of past observations of events can imply many possible futures. Although a number of recent works have studied probabilistic models that can represent uncertain futures, such models are either extremely expensive computationally as in the case of pixel-level autoregressive models, or do not directly optimize the likelihood of the data. To our knowledge, our work is the first to propose multi-frame video prediction with normalizing flows, which allows for direct optimization of the data likelihood, and produces high-quality stochastic predictions. 
We describe an approach for modeling the latent space dynamics, and demonstrate that flow-based generative models offer a viable and competitive approach to generative modeling of video.",/pdf/f0048f33cb4788f547460b93e23a9fa5253a59ac.pdf,ICLR,2020,We demonstrate that flow-based generative models offer a viable and competitive approach to generative modeling of video. +rylrI1HtPr,B1euaI6dPH,1569440000000.0,1577170000000.0,1728,Pixel Co-Occurence Based Loss Metrics for Super Resolution Texture Recovery,"[""yingda.wang@unsw.edu.au"", ""p.swietojanski@unsw.edu.au"", ""ryan.armstrong@unsw.edu.au"", ""peyman@unsw.edu.au""]","[""Ying Da Wang"", ""Pawel Swietojanski"", ""Ryan T Armstrong"", ""Peyman Mostaghimi""]","[""Super Resolution Generative Adversarial Networks"", ""Perceptual Loss Functions""]","Single Image Super Resolution (SISR) has significantly improved with Convolutional Neural Networks (CNNs) and Generative Adversarial Networks (GANs), often achieving order of magnitude better pixelwise accuracies (distortions) and state-of-the-art perceptual accuracy. Due to the stochastic nature of GAN reconstruction and the ill-posed nature of the problem, perceptual accuracy tends to correlate inversely with pixelwise accuracy which is especially detrimental to SISR, where preservation of original content is an objective. GAN stochastics can be guided by intermediate loss functions such as the VGG featurewise loss, but these features are typically derived from biased pre-trained networks. Similarly, measurements of perceptual quality such as the human Mean Opinion Score (MOS) and no-reference measures have issues with pre-trained bias. The spatial relationships between pixel values can be measured without bias using the Grey Level Co-occurence Matrix (GLCM), which was found to match the cardinality and comparative value of the MOS while reducing subjectivity and automating the analytical process. 
In this work, the GLCM is also directly used as a loss function to guide the generation of perceptually accurate images based on spatial collocation of pixel values. We compare GLCM based loss against scenarios where (1) no intermediate guiding loss function, and (2) the VGG feature function are used. Experimental validation is carried out on X-ray images of rock samples, characterised by a significant number of high frequency texture features. We find GLCM-based loss to result in images with higher pixelwise accuracy and better perceptual scores.",/pdf/d611e52ebd5e2adfb9a62ba2b859d54ea1a15313.pdf,ICLR,2020,We introduce an unbiased perceptual loss function and metric and show that it improves recovery of texture during super resolution +BkgOM1rKvr,Hklt3M3OPH,1569440000000.0,1577170000000.0,1586,The Surprising Behavior Of Graph Neural Networks,"[""vivek.kothari@cs.ox.ac.uk"", ""eu.tong@cs.ox.ac.uk"", ""nicholas.lane@cs.ox.ac.uk""]","[""Vivek Kothari"", ""Catherine Tong"", ""Nicholas Lane""]","[""Graph Neural Networks"", ""Graph Toplogy"", ""Noise"", ""Attributed Networks""]","We highlight a lack of understanding of the behaviour of Graph Neural Networks (GNNs) in various topological contexts. We present 4 experimental studies which counter-intuitively demonstrate that the performance of GNNs is weakly dependent on the topology, sensitive to structural noise and the modality (attributes or edges) of information, and degraded by strong coupling between nodal attributes and structure. We draw on the empirical results to recommend reporting of topological context in GNN evaluation and propose a simple (attribute-structure) decoupling method to improve GNN performance.",/pdf/d91a1e8fb64fcff2ea4b881d928c2ba8c50fd7f6.pdf,ICLR,2020,The paper presents a set of experiments which highlight the gap in our intuitive understanding of Graph Neural Networks. 
+3teh9zI0j4L,dvf513PjnM,1601310000000.0,1614990000000.0,1195,Quantifying Exposure Bias for Open-ended Language Generation,"[""~Tianxing_He1"", ""~Jingzhao_Zhang2"", ""~Zhiming_Zhou2"", ""~James_R._Glass1""]","[""Tianxing He"", ""Jingzhao Zhang"", ""Zhiming Zhou"", ""James R. Glass""]","[""exposure bias"", ""natural language generation"", ""autoregressive""]","The exposure bias problem refers to the incrementally distorted generation induced by the training-generation discrepancy, in teacher-forcing training for auto-regressive neural network language models (LM). It has been regarded as a central problem for LMs trained for open-ended language generation. Although a lot of algorithms have been proposed to avoid teacher forcing and therefore alleviate exposure bias, there is little work showing how serious the exposure bias problem actually is. In this work, we propose novel metrics to quantify the impact of exposure bias in the generation of MLE-trained LMs. Our key intuition is that if we feed ground-truth data prefixes (instead of prefixes generated by the model itself) into the model and ask it to continue the generation, the performance should become much better because the training-generation discrepancy in the prefix is removed. We conduct both automatic and human evaluation in our experiments, and our observations are two-fold: (1) We confirm that the prefix discrepancy indeed induces some level of performance loss. (2) However, the induced distortion seems to be limited, and is not incremental during the generation, which contradicts the claim of exposure bias.",/pdf/fc58f2aeb44d9fc0c9eded8bad47828fa46450e4.pdf,ICLR,2021,"We design metrics to quantify the impact of the exposure bias problem, but find it to be only a minor problem for open-ended language generation." 
+9vCLOXwprc,glXY2eFLTlu,1601310000000.0,1614990000000.0,1109,Iterated graph neural network system,"[""~Hanju_Li1""]","[""Hanju Li""]",[],"We present Iterated Graph Neural Network System (IGNNS), a new framework of Graph Neural Networks (GNNs), which can deal with undirected and directed graphs in a unified way. The core component of IGNNS is the Iterated Function System (IFS), which is an important research field in fractal geometry. The key idea of IGNNS is to use a pair of affine transformations to characterize the process of message passing between graph nodes and assign an adjoint probability vector to them to form an IFS layer with probability. After embedding in the latent space, the node features are sent to the IFS layer for iteration to obtain the high-level representation of graph nodes. We also analyze the geometric properties of IGNNS from the perspective of dynamical systems. We prove that if the IFS induced by IGNNS is contractive, then the fractal representation of graph nodes converges to the fractal set of IFS in Hausdorff distance and the ergodic representation of that converges to a constant matrix in Frobenius norm. We have carried out a series of semi-supervised node classification experiments on citation network datasets such as CiteSeer, Cora and PubMed. The experimental results show that the performance of our method is obviously better than the related methods.",/pdf/db91605be08c98a99f63edddc8963e64cb2749d1.pdf,ICLR,2021, +Skl6k209Ym,rkgYGO6tY7,1538090000000.0,1545360000000.0,1029,Alignment Based Mathching Networks for One-Shot Classification and Open-Set Recognition,"[""pareshmg@csail.mit.edu"", ""tommi@csail.mit.edu""]","[""Paresh Malalur"", ""Tommi Jaakkola""]",[],"Deep learning for object classification relies heavily on convolutional models. While effective, CNNs are rarely interpretable after the fact. 
An attention mechanism can be used to highlight the area of the image that the model focuses on thus offering a narrow view into the mechanism of classification. We expand on this idea by forcing the method to explicitly align images to be classified to reference images representing the classes. The mechanism of alignment is learned and therefore does not require that the reference objects are anything like those being classified. Beyond explanation, our exemplar based cross-alignment method enables classification with only a single example per category (one-shot). Our model cuts the 5-way, 1-shot error rate in Omniglot from 2.1\% to 1.4\% and in MiniImageNet from 53.5\% to 46.5\% while simultaneously providing point-wise alignment information providing some understanding on what the network is capturing. This method of alignment also enables the recognition of an unsupported class (open-set) in the one-shot setting while maintaining an F1-score of above 0.5 for Omniglot even with 19 other distracting classes while baselines completely fail to separate the open-set class in the one-shot setting.",/pdf/b2aff749b17fb4801b3c2c8e8546b79f74c3b1d7.pdf,ICLR,2019, +HyxjwgbRZ,S1ysDe-CZ,1509130000000.0,1518730000000.0,601,Convergence rate of sign stochastic gradient descent for non-convex functions,"[""bernstein@caltech.edu"", ""kazizzad@uci.edu"", ""yuxiangw@cs.cmu.edu"", ""animakumar@gmail.com""]","[""Jeremy Bernstein"", ""Kamyar Azizzadenesheli"", ""Yu-Xiang Wang"", ""Anima Anandkumar""]","[""sign"", ""stochastic"", ""gradient"", ""non-convex"", ""optimization"", ""gradient"", ""quantization"", ""convergence"", ""rate""]","The sign stochastic gradient descent method (signSGD) utilizes only the sign of the stochastic gradient in its updates. Since signSGD carries out one-bit quantization of the gradients, it is extremely practical for distributed optimization where gradients need to be aggregated from different processors. 
For the first time, we establish convergence rates for signSGD on general non-convex functions under transparent conditions. We show that the rate of signSGD to reach first-order critical points matches that of SGD in terms of number of stochastic gradient calls, up to roughly a linear factor in the dimension. We carry out simple experiments to explore the behaviour of sign gradient descent (without the stochasticity) close to saddle points and show that it often helps completely avoid them without using either stochasticity or curvature information.",/pdf/4016cac4da9cf8fd5a11d2ca7e28de01a6d4b192.pdf,ICLR,2018,"We prove a non-convex convergence rate for the sign stochastic gradient method. The algorithm has links to algorithms like Adam and Rprop, as well as gradient quantisation schemes used in distributed machine learning." +SygkSkSFDB,ryevMxTOvr,1569440000000.0,1577170000000.0,1677,On the expected running time of nonconvex optimization with early stopping,"[""thomasflynn918@gmail.com"", ""kyu@bnl.gov"", ""amalik@bnl.gov"", ""sjyoo@bnl.gov"", ""dimperio@bnl.gov""]","[""Thomas Flynn"", ""Kwang Min Yu"", ""Abid Malik"", ""Shinjae Yoo"", ""Nicholas D'Imperio""]","[""non-convex"", ""stopping times"", ""statistics"", ""gradient descent"", ""early stopping""]","This work examines the convergence of stochastic gradient algorithms that use early stopping based on a validation function, wherein optimization ends when the magnitude of a validation function gradient drops below a threshold. We derive conditions that guarantee this stopping rule is well-defined and analyze the expected number of iterations and gradient evaluations needed to meet this criterion. The guarantee accounts for the distance between the training and validation sets, measured with the Wasserstein distance. We develop the approach for stochastic gradient descent (SGD), allowing for biased update directions subject to a Lyapunov condition. 
We apply the approach to obtain new bounds on the expected running time of several algorithms, including Decentralized SGD (DSGD), a variant of decentralized SGD, known as \textit{Stacked SGD}, and the stochastic variance reduced gradient (SVRG) algorithm. Finally, we consider the generalization properties of the iterate returned by early stopping.",/pdf/2b0abd7b1c162b99f9e3cd41e53759885ba710d4.pdf,ICLR,2020,How to bound the expected number of iterations before gradient descent finds a stationary point +B1Yy1BxCZ,Sk_J1Bx0Z,1509080000000.0,1519420000000.0,245,"Don't Decay the Learning Rate, Increase the Batch Size","[""slsmith@google.com"", ""pikinder@google.com"", ""chrisying@google.com"", ""qvl@google.com""]","[""Samuel L. Smith"", ""Pieter-Jan Kindermans"", ""Chris Ying"", ""Quoc V. Le""]","[""batch size"", ""learning rate"", ""simulated annealing"", ""large batch training"", ""scaling rules"", ""stochastic gradient descent"", ""sgd"", ""imagenet"", ""optimization""]","It is common practice to decay the learning rate. Here we show one can usually obtain the same learning curve on both training and test sets by instead increasing the batch size during training. This procedure is successful for stochastic gradient descent (SGD), SGD with momentum, Nesterov momentum, and Adam. It reaches equivalent test accuracies after the same number of training epochs, but with fewer parameter updates, leading to greater parallelism and shorter training times. We can further reduce the number of parameter updates by increasing the learning rate $\epsilon$ and scaling the batch size $B \propto \epsilon$. Finally, one can increase the momentum coefficient $m$ and scale $B \propto 1/(1-m)$, although this tends to slightly reduce the test accuracy. Crucially, our techniques allow us to repurpose existing training schedules for large batch training with no hyper-parameter tuning. 
We train ResNet-50 on ImageNet to 76.1% validation accuracy in under 30 minutes.",/pdf/fe7f95b0f64080e97b0c748b2bb48f9ff4cf2783.pdf,ICLR,2018,Decaying the learning rate and increasing the batch size during training are equivalent. +SJgwf04KPr,rygrs6Euvr,1569440000000.0,1577170000000.0,1003,Confidence-Calibrated Adversarial Training: Towards Robust Models Generalizing Beyond the Attack Used During Training,"[""david.stutz@mpi-inf.mpg.de"", ""schiele@mpi-inf.mpg.de"", ""matthias.hein@uni-tuebingen.de""]","[""David Stutz"", ""Matthias Hein"", ""Bernt Schiele""]","[""Adversarial Training"", ""Adversarial Examples"", ""Adversarial Robustness"", ""Confidence Calibration""]","Adversarial training is the standard to train models robust against adversarial examples. However, especially for complex datasets, adversarial training incurs a significant loss in accuracy and is known to generalize poorly to stronger attacks, e.g., larger perturbations or other threat models. In this paper, we introduce confidence-calibrated adversarial training (CCAT) where the key idea is to enforce that the confidence on adversarial examples decays with their distance to the attacked examples. We show that CCAT preserves better the accuracy of normal training while robustness against adversarial examples is achieved via confidence thresholding. Most importantly, in strong contrast to adversarial training, the robustness of CCAT generalizes to larger perturbations and other threat models, not encountered during training. We also discuss our extensive work to design strong adaptive attacks against CCAT and standard adversarial training which is of independent interest. We present experimental results on MNIST, SVHN and Cifar10.",/pdf/deb84c76ab6b22589eaae3003cc49e72e0997f04.pdf,ICLR,2020,This paper introduces confidence-calibrated adversarial training to generalize adversarial robustness to attacks not used during training. 
+HyljY04YDB,SJlR48_Owr,1569440000000.0,1577170000000.0,1259,Towards Interpretable Molecular Graph Representation Learning,"[""emmanuel@invivoai.com"", ""dominique@invivoai.com"", ""julien@invivoai.com"", ""prudencio@invivoai.com""]","[""Emmanuel Noutahi"", ""Dominique Beani"", ""Julien Horwood"", ""Prudencio Tossou""]","[""molecular graphs"", ""graph pooling"", ""hierarchical"", ""GNN"", ""Laplacian"", ""drug discovery""]","Recent work in graph neural networks (GNNs) has led to improvements in molecular activity and property prediction tasks. Unfortunately, GNNs often fail to capture the relative importance of interactions between molecular substructures, in part due to the absence of efficient intermediate pooling steps. To address these issues, we propose LaPool (Laplacian Pooling), a novel, data-driven, and interpretable hierarchical graph pooling method that takes into account both node features and graph structure to improve molecular understanding. +We benchmark LaPool and show that it not only outperforms recent GNNs on molecular graph understanding and prediction tasks but also remains highly competitive on other graph types. We then demonstrate the improved interpretability achieved with LaPool using both qualitative and quantitative assessments, highlighting its potential applications in drug discovery.",/pdf/c8d8cf457b96da1ab365507e83f7e6c0f2fccb79.pdf,ICLR,2020,We propose a new Laplacian-based hierarchical graph pooling layers that not only outperforms existing GNNs on several graph benchmarks but is also more interpretable. +HyeX7aVKvr,r1gIIaRUwS,1569440000000.0,1577170000000.0,443,Zero-shot task adaptation by homoiconic meta-mapping,"[""lampinen@stanford.edu"", ""jlmcc@stanford.edu""]","[""Andrew K. Lampinen"", ""James L. McClelland""]","[""Meta-mapping"", ""zero-shot"", ""task adaptation"", ""task representation"", ""meta-learning""]","How can deep learning systems flexibly reuse their knowledge? 
Toward this goal, we propose a new class of challenges, and a class of architectures that can solve them. The challenges are meta-mappings, which involve systematically transforming task behaviors to adapt to new tasks zero-shot. We suggest that the key to achieving these challenges is representing the task being performed in such a way that this task representation is itself transformable. We therefore draw inspiration from functional programming and recent work in meta-learning to propose a class of Homoiconic Meta-Mapping (HoMM) approaches that represent data points and tasks in a shared latent space, and learn to infer transformations of that space. HoMM approaches can be applied to any type of machine learning task, including supervised learning and reinforcement learning. We demonstrate the utility of this perspective by exhibiting zero-shot remapping of behavior to adapt to new tasks.",/pdf/10df39af3e95c3aa28896470a356aa7df96723cd.pdf,ICLR,2020,We propose an approach to performing novel tasks zero-shot based on adapting task representations +Sk7KsfW0-,r1TOsfbRb,1509140000000.0,1519540000000.0,944,Lifelong Learning with Dynamically Expandable Networks,"[""mmvc98@unist.ac.kr"", ""eunhoy@kaist.ac.kr"", ""jtlee@unist.ac.kr"", ""sjhwang82@kaist.ac.kr""]","[""Jaehong Yoon"", ""Eunho Yang"", ""Jeongtae Lee"", ""Sung Ju Hwang""]","[""Transfer learning"", ""Lifelong learning"", ""Selective retraining"", ""Dynamic network expansion""]","We propose a novel deep network architecture for lifelong learning which we refer to as Dynamically Expandable Network (DEN), that can dynamically decide its network capacity as it trains on a sequence of tasks, to learn a compact overlapping knowledge sharing structure among tasks. 
DEN is efficiently trained in an online manner by performing selective retraining, dynamically expands network capacity upon arrival of each task with only the necessary number of units, and effectively prevents semantic drift by splitting/duplicating units and timestamping them. We validate DEN on multiple public datasets in lifelong learning scenarios, on which it not only significantly outperforms existing lifelong learning methods for deep networks, but also achieves the same level of performance as the batch model with substantially fewer parameters. ",/pdf/90ff6015e299f9cadc9f8490c564d7def37f4c92.pdf,ICLR,2018,We propose a novel deep network architecture that can dynamically decide its network capacity as it trains on a lifelong learning scenario. +u2YNJPcQlwq,j4T4n2ZI_PB,1601310000000.0,1616830000000.0,2685,Efficient Empowerment Estimation for Unsupervised Stabilization,"[""~Ruihan_Zhao1"", ""~Kevin_Lu2"", ""~Pieter_Abbeel2"", ""~Stas_Tiomkin1""]","[""Ruihan Zhao"", ""Kevin Lu"", ""Pieter Abbeel"", ""Stas Tiomkin""]","[""unsupervised stabilization"", ""representation of dynamical systems"", ""neural networks"", ""empowerment"", ""intrinsic motivation""]","Intrinsically motivated artificial agents learn advantageous behavior without externally-provided rewards. Previously, it was shown that maximizing mutual information between agent actuators and future states, known as the empowerment principle, enables unsupervised stabilization of dynamical systems at upright positions, which is a prototypical intrinsically motivated behavior for upright standing and walking. This follows from the coincidence between the objective of stabilization and the objective of empowerment. Unfortunately, sample-based estimation of this kind of mutual information is challenging. 
Recently, various variational lower bounds (VLBs) on empowerment have been proposed as solutions; however, they are often biased, unstable in training, and have high sample complexity. In this work, we propose an alternative solution based on a trainable representation of a dynamical system as a Gaussian channel, which allows us to efficiently calculate an unbiased estimator of empowerment by convex optimization. We demonstrate our solution for sample-based unsupervised stabilization on different dynamical control systems and show the advantages of our method by comparing it to the existing VLB approaches. Specifically, we show that our method has a lower sample complexity, is more stable in training, possesses the essential properties of the empowerment function, and allows estimation of empowerment from images. Consequently, our method opens a path to wider and easier adoption of empowerment for various applications.",/pdf/59dc834b878ff1144857f1787ea553243043395c.pdf,ICLR,2021,"We propose an efficient estimation of empowerment which is demonstrated on unsupervised stabilization of dynamical systems, and compared to the existing relevant methods." +B1MAJhR5YX,BJxrqUs9KX,1538090000000.0,1545360000000.0,1035,Empirical Bounds on Linear Regions of Deep Rectifier Networks,"[""tserra@gmail.com"", ""srikumar.ramalingam@gmail.com""]","[""Thiago Serra"", ""Srikumar Ramalingam""]","[""linear regions"", ""approximate model counting"", ""mixed-integer linear programming""]","One form of characterizing the expressiveness of a piecewise linear neural network is by the number of linear regions, or pieces, of the function modeled. We have observed substantial progress in this topic through lower and upper bounds on the maximum number of linear regions and a counting procedure. However, these bounds only account for the dimensions of the network and the exact counting may take a prohibitive amount of time, therefore making it infeasible to benchmark the expressiveness of networks. 
In this work, we approximate the number of linear regions of specific rectifier networks with an algorithm for probabilistic lower bounds of mixed-integer linear sets. In addition, we present a tighter upper bound that leverages network coefficients. We test both on trained networks. The algorithm for probabilistic lower bounds is several orders of magnitude faster than exact counting and the values reach similar orders of magnitude, hence making our approach a viable method to compare the expressiveness of such networks. The refined upper bound is particularly stronger on networks with narrow layers. ",/pdf/2cd4fcee595e00b15e892dc060154b0fe7e231e1.pdf,ICLR,2019,"We provide improved upper bounds for the number of linear regions used in network expressivity, and a highly efficient algorithm (w.r.t. exact counting) to obtain probabilistic lower bounds on the actual number of linear regions." +ByMHvs0cFQ,rJl2swNFKm,1538090000000.0,1547050000000.0,261,Quaternion Recurrent Neural Networks,"[""titouan.parcollet@alumni.univ-avignon.fr"", ""mirco.ravanelli@gmail.com"", ""mohamed.morchid@univ-avignon.fr"", ""georges.linares@univ-avignon.fr"", ""chiheb.trabelsi@polymtl.ca"", ""rdemori@cs.mcgill.ca"", ""yoshua.bengio@mila.quebec""]","[""Titouan Parcollet"", ""Mirco Ravanelli"", ""Mohamed Morchid"", ""Georges Linar\u00e8s"", ""Chiheb Trabelsi"", ""Renato De Mori"", ""Yoshua Bengio""]","[""Quaternion recurrent neural networks"", ""quaternion numbers"", ""recurrent neural networks"", ""speech recognition""]","Recurrent neural networks (RNNs) are powerful architectures to model sequential data, due to their capability to learn short and long-term dependencies between the basic elements of a sequence. Nonetheless, popular tasks such as speech or image recognition, involve multi-dimensional input features that are characterized by strong internal dependencies between the dimensions of the input vector. 
We propose a novel quaternion recurrent neural network (QRNN), along with a quaternion long-short term memory neural network (QLSTM), that take into account both the external relations and these internal structural dependencies with the quaternion algebra. Similarly to capsules, quaternions allow the QRNN to code internal dependencies by composing and processing multidimensional features as single entities, while the recurrent operation reveals correlations between the elements composing the sequence. We show that both QRNN and QLSTM achieve better performances than RNN and LSTM in a realistic application of automatic speech recognition. Finally, we show that QRNN and QLSTM reduce by a maximum factor of 3.3x the number of free parameters needed, compared to real-valued RNNs and LSTMs to reach better results, leading to a more compact representation of the relevant information.",/pdf/61bfe1883871c0199c6c47572ed9c8f6de9ebe3c.pdf,ICLR,2019, +BJxt60VtPr,S1e0hzqdPB,1569440000000.0,1583910000000.0,1401,Learning from Unlabelled Videos Using Contrastive Predictive Neural 3D Mapping,"[""aharley@cmu.edu"", ""kowshika@cmu.edu"", ""fangyul@cmu.edu"", ""zhouxian@cmu.edu"", ""htung@cs.cmu.edu"", ""katef@cs.cmu.edu""]","[""Adam W. Harley"", ""Shrinidhi K. Lakshmikanth"", ""Fangyu Li"", ""Xian Zhou"", ""Hsiao-Yu Fish Tung"", ""Katerina Fragkiadaki""]","[""3D feature learning"", ""unsupervised learning"", ""inverse graphics"", ""object discovery""]","Predictive coding theories suggest that the brain learns by predicting observations at various levels of abstraction. One of the most basic prediction tasks is view prediction: how would a given scene look from an alternative viewpoint? Humans excel at this task. Our ability to imagine and fill in missing information is tightly coupled with perception: we feel as if we see the world in 3 dimensions, while in fact, information from only the front surface of the world hits our retinas. 
This paper explores the role of view prediction in the development of 3D visual recognition. We propose neural 3D mapping networks, which take as input 2.5D (color and depth) video streams captured by a moving camera, and lift them to stable 3D feature maps of the scene, by disentangling the scene content from the motion of the camera. The model also projects its 3D feature maps to novel viewpoints, to predict and match against target views. We propose contrastive prediction losses to replace the standard color regression loss, and show that this leads to better performance on complex photorealistic data. We show that the proposed model learns visual representations useful for (1) semi-supervised learning of 3D object detectors, and (2) unsupervised learning of 3D moving object detectors, by estimating the motion of the inferred 3D feature maps in videos of dynamic scenes. To the best of our knowledge, this is the first work that empirically shows view prediction to be a scalable self-supervised task beneficial to 3D object detection. ",/pdf/abeec25f410ea4e54959aa68bc7ed0aa36858274.pdf,ICLR,2020,"We show that with the right loss and architecture, view-predictive learning improves 3D object detection" +iVaPuvROtMm,iWNEktylbBJ,1601310000000.0,1614990000000.0,290,Learning Stochastic Behaviour from Aggregate Data,"[""~Shaojun_Ma1"", ""sliu459@gatech.edu"", ""~Hongyuan_Zha1"", ""~Hao-Min_Zhou1""]","[""Shaojun Ma"", ""Shu Liu"", ""Hongyuan Zha"", ""Hao-Min Zhou""]","[""Fokker Planck Equation"", ""weak form"", ""Wasserstein GAN""]","Learning nonlinear dynamics from aggregate data is a challenging problem since the full trajectory of each individual is not available, namely, the individual observed at one time point may not be observed at next time point, or the identity of individual is unavailable. This is in sharp contrast to learning dynamics with trajectory data, on which the majority of existing methods are based. 
We propose a novel method using the weak form of Fokker Planck Equation (FPE) to describe density evolution of data in a sampling form, which is then combined with Wasserstein generative adversarial network (WGAN) in the training process. In such a sample-based framework we are able to study nonlinear dynamics from aggregate data without solving the partial differential equation (PDE). The model can also handle high dimensional cases with the help of deep neural networks. We demonstrate our approach in the context of a series of synthetic and real-world data sets.",/pdf/22a94b3c5203fabcefe6e2680a44b7f6da453944.pdf,ICLR,2021,Develop a weak form of Fokker Planck Equation and WGAN model to recover hidden dynamics of aggregate data. +r1gPoCEKvH,HkeFSMtdwS,1569440000000.0,1577170000000.0,1322,SINGLE PATH ONE-SHOT NEURAL ARCHITECTURE SEARCH WITH UNIFORM SAMPLING,"[""guozichao@megvii.com"", ""zhangxiangyu@megvii.com"", ""muhy17@mails.tsinghua.edu.cn"", ""hengwen@megvii.com"", ""zliubq@connect.ust.hk"", ""weiyichen@megvii.com"", ""sunjian@megvii.com""]","[""Zichao Guo"", ""Xiangyu Zhang"", ""Haoyuan Mu"", ""Wen Heng"", ""Zechun Liu"", ""Yichen Wei"", ""Jian Sun""]","[""Neural Architecture Search"", ""Single Path""]","We revisit the one-shot Neural Architecture Search (NAS) paradigm and analyze its advantages over existing NAS approaches. The existing one-shot method (Bender et al., 2018), however, is hard to train and not yet effective on large scale datasets like ImageNet. This work proposes a Single Path One-Shot model to address the challenge in the training. Our central idea is to construct a simplified supernet, where all architectures are single paths so that the weight co-adaption problem is alleviated. Training is performed by uniform path sampling. All architectures (and their weights) are trained fully and equally. +Comprehensive experiments verify that our approach is flexible and effective. It is easy to train and fast to search. 
It effortlessly supports complex search spaces (e.g., building blocks, channel, mixed-precision quantization) and different search constraints (e.g., FLOPs, latency). It is thus convenient to use for various needs. It achieves state-of-the-art performance on the large dataset ImageNet.",/pdf/aff7b030f7d5ab7277d47e8b110454df2444af0f.pdf,ICLR,2020, +vsU0efpivw,occg9gh4aDK,1601310000000.0,1615990000000.0,1113,Shapley Explanation Networks,"[""~Rui_Wang1"", ""~Xiaoqian_Wang1"", ""~David_I._Inouye1""]","[""Rui Wang"", ""Xiaoqian Wang"", ""David I. Inouye""]","[""Shapley values"", ""Feature Attribution"", ""Interpretable Machine Learning""]","Shapley values have become one of the most popular feature attribution explanation methods. However, most prior work has focused on post-hoc Shapley explanations, which can be computationally demanding due to its exponential time complexity and preclude model regularization based on Shapley explanations during training. Thus, we propose to incorporate Shapley values themselves as latent representations in deep models thereby making Shapley explanations first-class citizens in the modeling paradigm. This intrinsic explanation approach enables layer-wise explanations, explanation regularization of the model during training, and fast explanation computation at test time. We define the Shapley transform that transforms the input into a Shapley representation given a specific function. We operationalize the Shapley transform as a neural network module and construct both shallow and deep networks, called ShapNets, by composing Shapley modules. We prove that our Shallow ShapNets compute the exact Shapley values and our Deep ShapNets maintain the missingness and accuracy properties of Shapley values. We demonstrate on synthetic and real-world datasets that our ShapNets enable layer-wise Shapley explanations, novel Shapley regularizations during training, and fast computation while maintaining reasonable performance. 
Code is available at https://github.com/inouye-lab/ShapleyExplanationNetworks.",/pdf/12770a255a18bc6c25ce69bdda082fd0d7a8cc87.pdf,ICLR,2021,"To enable new capabilities, we propose to use Shapley values as inter-layer representations in deep neural networks rather than as post-hoc explanations." +SJlDDnVKwS,ryxjfgYTNS,1569440000000.0,1577170000000.0,7,Improving Evolutionary Strategies with Generative Neural Networks,"[""l.faury@criteo.com"", ""c.calauzenes@criteo.com"", ""olivier.fercoq@telecom-paris.fr""]","[""Louis Faury"", ""Cl\u00e9ment Calauz\u00e8nes"", ""Olivier Fercoq""]","[""black-box optimization"", ""evolutionary strategies"", ""generative neural networks""]","Evolutionary Strategies (ES) are a popular family of black-box zeroth-order optimization algorithms which rely on search distributions to efficiently optimize a large variety of objective functions. This paper investigates the potential benefits of using highly flexible search distributions in ES algorithms, in contrast to standard ones (typically Gaussians). We model such distributions with Generative Neural Networks (GNNs) and introduce a new ES algorithm that leverages their expressiveness to accelerate the stochastic search. Because it acts as a plug-in, our approach allows to augment virtually any standard ES algorithm with flexible search distributions. We demonstrate the empirical advantages of this method on a diversity of objective functions.",/pdf/65dd73535587f2f88f2c4ddfc534fb983359cf73.pdf,ICLR,2020,We propose a new algorithm leveraging the expressiveness of Generative Neural Networks to improve Evolutionary Strategies algorithms. 
+HJKkY35le,,1478310000000.0,1488560000000.0,541,Mode Regularized Generative Adversarial Networks,"[""tong.che@umontreal.ca"", ""csyli@comp.polyu.edu.hk"", ""ap.jacob@umontreal.ca"", ""yoshua.bengio@umontreal.ca"", ""cswjli@comp.polyu.edu.hk""]","[""Tong Che"", ""Yanran Li"", ""Athul Jacob"", ""Yoshua Bengio"", ""Wenjie Li""]","[""Deep learning"", ""Unsupervised Learning""]","Although Generative Adversarial Networks achieve state-of-the-art results on a +variety of generative tasks, they are regarded as highly unstable and prone to miss +modes. We argue that these bad behaviors of GANs are due to the very particular +functional shape of the trained discriminators in high dimensional spaces, which +can easily make training stuck or push probability mass in the wrong direction, +towards that of higher concentration than that of the data generating distribution. +We introduce several ways of regularizing the objective, which can dramatically +stabilize the training of GAN models. We also show that our regularizers can help +the fair distribution of probability mass across the modes of the data generating +distribution during the early phases of training, thus providing a unified solution +to the missing modes problem.",/pdf/dbc3a1a8eb32546aa4ea026dc4503018939cfc34.pdf,ICLR,2017, +HJxyZkBKDr,Bkg3JwjdPr,1569440000000.0,1583910000000.0,1527,NAS-Bench-201: Extending the Scope of Reproducible Neural Architecture Search,"[""xuanyi.dxy@gmail.com"", ""yi.yang@uts.edu.au""]","[""Xuanyi Dong"", ""Yi Yang""]","[""Neural Architecture Search"", ""AutoML"", ""Benchmark""]","Neural architecture search (NAS) has achieved breakthrough success in a great number of applications in the past few years. +It could be time to take a step back and analyze the good and bad aspects in the field of NAS. A variety of algorithms search architectures under different search space. 
These searched architectures are trained using different setups, e.g., hyper-parameters, data augmentation, regularization. This raises a comparability problem when comparing the performance of various NAS algorithms. NAS-Bench-101 has shown success to alleviate this problem. In this work, we propose an extension to NAS-Bench-101: NAS-Bench-201 with a different search space, results on multiple datasets, and more diagnostic information. NAS-Bench-201 has a fixed search space and provides a unified benchmark for almost any up-to-date NAS algorithms. The design of our search space is inspired by the one used in the most popular cell-based searching algorithms, where a cell is represented as a directed acyclic graph. Each edge here is associated with an operation selected from a predefined operation set. For it to be applicable for all NAS algorithms, the search space defined in NAS-Bench-201 includes all possible architectures generated by 4 nodes and 5 associated operation options, which results in 15,625 neural cell candidates in total. The training log using the same setup and the performance for each architecture candidate are provided for three datasets. This allows researchers to avoid unnecessary repetitive training for selected architecture and focus solely on the search algorithm itself. The training time saved for every architecture also largely improves the efficiency of most NAS algorithms and presents a more computational cost friendly NAS community for a broader range of researchers. We provide additional diagnostic information such as fine-grained loss and accuracy, which can give inspirations to new designs of NAS algorithms. In further support of the proposed NAS-Bench-201, we have analyzed it from many aspects and benchmarked 10 recent NAS algorithms, which verify its applicability.",/pdf/73272ec47892eefdbb8e7caf4780b7d0a2d6ef71.pdf,ICLR,2020,A NAS benchmark applicable to almost any NAS algorithms. 
+HHSEKOnPvaO,SPZQX7wW93A,1601310000000.0,1614560000000.0,536,Graph-Based Continual Learning,"[""~Binh_Tang1"", ""~David_S._Matteson1""]","[""Binh Tang"", ""David S. Matteson""]",[],"Despite significant advances, continual learning models still suffer from catastrophic forgetting when exposed to incrementally available data from non-stationary distributions. Rehearsal approaches alleviate the problem by maintaining and replaying a small episodic memory of previous samples, often implemented as an array of independent memory slots. In this work, we propose to augment such an array with a learnable random graph that captures pairwise similarities between its samples, and use it not only to learn new tasks but also to guard against forgetting. Empirical results on several benchmark datasets show that our model consistently outperforms recently proposed baselines for task-free continual learning.",/pdf/39a91d5348ba8489817ac3ee4a93637e12b23c4b.pdf,ICLR,2021, +HkuGJ3kCb,rk_zJnkC-,1509040000000.0,1518800000000.0,166,All-but-the-Top: Simple and Effective Postprocessing for Word Representations,"[""jiaqimu2@illinois.edu"", ""pramodv@illinois.edu""]","[""Jiaqi Mu"", ""Pramod Viswanath""]",[],"Real-valued word representations have transformed NLP applications; popular examples are word2vec and GloVe, recognized for their ability to capture linguistic regularities. In this paper, we demonstrate a {\em very simple}, and yet counter-intuitive, postprocessing technique -- eliminate the common mean vector and a few top dominating directions from the word vectors -- that renders off-the-shelf representations {\em even stronger}. 
The postprocessing is empirically validated on a variety of lexical-level intrinsic tasks (word similarity, concept categorization, word analogy) and sentence-level tasks (semantic textual similarity and text classification) on multiple datasets and with a variety of representation methods and hyperparameter choices in multiple languages; in each case, the processed representations are consistently better than the original ones. ",/pdf/810661909ffeb22f3eb0167d8d60da875c67a2be.pdf,ICLR,2018, +Im43P9kuaeP,5G3b-93QPn,1601310000000.0,1614990000000.0,658,Certified Watermarks for Neural Networks,"[""~Arpit_Amit_Bansal1"", ""~Ping-yeh_Chiang1"", ""~Michael_Curry2"", ""~Hossein_Souri1"", ""~Rama_Chellappa1"", ""~John_P_Dickerson1"", ""~Rajiv_Jain1"", ""~Tom_Goldstein1""]","[""Arpit Amit Bansal"", ""Ping-yeh Chiang"", ""Michael Curry"", ""Hossein Souri"", ""Rama Chellappa"", ""John P Dickerson"", ""Rajiv Jain"", ""Tom Goldstein""]","[""certified defense"", ""watermarking"", ""backdoor attack""]","Watermarking is a commonly used strategy to protect creators' rights to digital images, videos and audio. Recently, watermarking methods have been extended to deep learning models -- in principle, the watermark should be preserved when an adversary tries to copy the model. However, in practice, watermarks can often be removed by an intelligent adversary. Several papers have proposed watermarking methods that claim to be empirically resistant to different types of removal attacks, but these new techniques often fail in the face of new or better-tuned adversaries. In this paper, we propose the first certifiable watermarking method. Using the randomized smoothing technique proposed in Chiang et al., we show that our watermark is guaranteed to be unremovable unless the model parameters are changed by more than a certain $\ell_2$ threshold. 
In addition to being certifiable, our watermark is also empirically more robust compared to previous watermarking methods.",/pdf/7ccb1fcac9aa5cf20a3356589c284c72d18fdfe8.pdf,ICLR,2021,"We propose the first certifiable watermark for neural networks, which is also empirically more robust." +Bk_zTU5eg,,1478290000000.0,1480810000000.0,330,Inefficiency of stochastic gradient descent with larger mini-batches (and more learners),"[""onkar.bhardwaj@gmail.com"", ""gcong@us.ibm.com""]","[""Onkar Bhardwaj"", ""Guojing Cong""]","[""Deep learning"", ""Optimization""]","Stochastic Gradient Descent (SGD) and its variants are the most important optimization algorithms used in large scale machine learning. Mini-batch version of stochastic gradient is often used in practice for taking advantage of hardware parallelism. In this work, we analyze the effect of mini-batch size over SGD convergence for the case of general non-convex objective functions. Building on the past analyses, we justify mathematically that there can often be a large difference between the convergence guarantees provided by small and large mini-batches (given each instance processes equal number of training samples), while providing experimental evidence for the same. Going further to distributed settings, we show that an analogous effect holds with popular Asynchronous Gradient Descent (\asgd): there can be a large difference between convergence guarantees with increasing number of learners given that the cumulative number of training samples processed remains the same. 
Thus there is an inherent (and similar) inefficiency introduced in the convergence behavior when we attempt to take advantage of parallelism, either by increasing mini-batch size or by increasing the number of learners.",/pdf/6310c06d61332f49aa903e88e073a64c8fd66636.pdf,ICLR,2017,We theoretically justify that increasing mini-batch size or increasing the number of learners can lead to slower SGD/ASGD convergence +Sygg3JHtwB,SkxQLe1KvS,1569440000000.0,1577170000000.0,1938,Step Size Optimization,"[""ngs0726@gmail.com"", ""dmhyeon@postech.ac.kr"", ""hwanjoyu@postech.ac.kr""]","[""Gyoung S. Na"", ""Dongmin Hyeon"", ""Hwanjo Yu""]","[""Deep Learning"", ""Step Size Adaptation"", ""Nonconvex Optimization""]","This paper proposes a new approach for step size adaptation in gradient methods. The proposed method called step size optimization (SSO) formulates the step size adaptation as an optimization problem which minimizes the loss function with respect to the step size for the given model parameters and gradients. Then, the step size is optimized based on alternating direction method of multipliers (ADMM). SSO does not require the second-order information or any probabilistic models for adapting the step size, so it is efficient and easy to implement. Furthermore, we also introduce stochastic SSO for stochastic learning environments. In the experiments, we integrated SSO to vanilla SGD and Adam, and they outperformed state-of-the-art adaptive gradient methods including RMSProp, Adam, L4-Adam, and AdaBound on extensive benchmark datasets.",/pdf/70b268c295747e06120638d5ce4d6c3ff77e67ab.pdf,ICLR,2020,We propose an efficient and effective step size adaptation method for the gradient methods. +rJRhzzKxl,,1478200000000.0,1482240000000.0,80,Knowledge Adaptation: Teaching to Adapt,"[""sebastian.ruder@insight-centre.org"", ""parsa@aylien.com"", ""john.breslin@insight-centre.org""]","[""Sebastian Ruder"", ""Parsa Ghaffari"", ""John G. 
Breslin""]","[""Natural language processing"", ""Deep learning"", ""Transfer Learning"", ""Unsupervised Learning""]","Domain adaptation is crucial in many real-world applications where the distribution of the training data differs from the distribution of the test data. Previous Deep Learning-based approaches to domain adaptation need to be trained jointly on source and target domain data and are therefore unappealing in scenarios where models need to be adapted to a large number of domains or where a domain is evolving, e.g. spam detection where attackers continuously change their tactics. + +To fill this gap, we propose Knowledge Adaptation, an extension of Knowledge Distillation (Bucilua et al., 2006; Hinton et al., 2015) to the domain adaptation scenario. We show how a student model achieves state-of-the-art results on unsupervised domain adaptation from multiple sources on a standard sentiment analysis benchmark by taking into account the domain-specific expertise of multiple teachers and the similarities between their domains. + +When learning from a single teacher, using domain similarity to gauge trustworthiness is inadequate. To this end, we propose a simple metric that correlates well with the teacher's accuracy in the target domain. We demonstrate that incorporating high-confidence examples selected by this metric enables the student model to achieve state-of-the-art performance in the single-source scenario.",/pdf/01f3b4e4b56ca020139f8e2d033e09b03166865f.pdf,ICLR,2017,We propose a teacher-student framework for domain adaptation together with a novel confidence measure that achieves state-of-the-art results on single-source and multi-source adaptation on a standard sentiment analysis benchmark. 
+BkgGJlBFPS,SkxROFJYwB,1569440000000.0,1577170000000.0,2054,Unsupervised Hierarchical Graph Representation Learning with Variational Bayes,"[""shashanka.ubaru@ibm.com"", ""chenjie@us.ibm.com""]","[""Shashanka Ubaru"", ""Jie Chen""]","[""Hierarchical Graph Representation"", ""Unsupervised Graph Learning"", ""Variational Bayes"", ""Graph classification""]","Hierarchical graph representation learning is an emerging subject owing to the increasingly popular adoption of graph neural networks in machine learning and applications. Loosely speaking, work under this umbrella falls into two categories: (a) use a predefined graph hierarchy to perform pooling; and (b) learn the hierarchy for a given graph through differentiable parameterization of the coarsening process. These approaches are supervised; a predictive task with ground-truth labels is used to drive the learning. In this work, we propose an unsupervised approach, \textsc{BayesPool}, with the use of variational Bayes. It produces graph representations given a predefined hierarchy. Rather than relying on labels, the training signal comes from the evidence lower bound of encoding a graph and decoding the subsequent one in the hierarchy. Node features are treated latent in this variational machinery, so that they are produced as a byproduct and are used in downstream tasks. We demonstrate a comprehensive set of experiments to show the usefulness of the learned representation in the context of graph classification.",/pdf/10565c2d8c02e553e387fc44d4c2904fca443840.pdf,ICLR,2020,Bayespool: An unsupervised hierarchical graph representation learning method based on Variational Bayes. 
+ryHlUtqge,,1478300000000.0,1488670000000.0,492,Generalizing Skills with Semi-Supervised Reinforcement Learning,"[""cbfinn@eecs.berkeley.edu"", ""tianhe.yu@berkeley.edu"", ""justinfu@eecs.berkeley.edu"", ""pabbeel@eecs.berkeley.edu"", ""svlevine@eecs.berkeley.edu""]","[""Chelsea Finn"", ""Tianhe Yu"", ""Justin Fu"", ""Pieter Abbeel"", ""Sergey Levine""]","[""Reinforcement Learning""]","Deep reinforcement learning (RL) can acquire complex behaviors from low-level inputs, such as images. However, real-world applications of such methods require generalizing to the vast variability of the real world. Deep networks are known to achieve remarkable generalization when provided with massive amounts of labeled data, but can we provide this breadth of experience to an RL agent, such as a robot? The robot might continuously learn as it explores the world around it, even while it is deployed and performing useful tasks. However, this learning requires access to a reward function, to tell the agent whether it is succeeding or failing at its task. Such reward functions are often hard to measure in the real world, especially in domains such as robotics and dialog systems, where the reward could depend on the unknown positions of objects or the emotional state of the user. On the other hand, it is often quite practical to provide the agent with reward functions in a limited set of situations, such as when a human supervisor is present, or in a controlled laboratory setting. Can we make use of this limited supervision, and still benefit from the breadth of experience an agent might collect in the unstructured real world? In this paper, we formalize this problem setting as semi-supervised reinforcement learning (SSRL), where the reward function can only be evaluated in a set of “labeled” MDPs, and the agent must generalize its behavior to the wide range of states it might encounter in a set of “unlabeled” MDPs, by using experience from both settings. 
Our proposed method infers the task objective in the unlabeled MDPs through an algorithm that resembles inverse RL, using the agent’s own prior experience in the labeled MDPs as a kind of demonstration of optimal behavior. We evaluate our method on challenging, continuous control tasks that require control directly from images, and show that our approach can improve the generalization of a learned deep neural network policy by using experience for which no reward function is available. We also show that our method outperforms direct supervised learning of the reward.",/pdf/96383516c545c926b2bc3bbd36554fe4bd0cb3b6.pdf,ICLR,2017,"We propose an algorithm for generalizing a deep neural network policy using ""unlabeled"" experience collected in MDPs where rewards are not available." +SJlt6oA9Fm,BJe55Pm5YX,1538090000000.0,1545360000000.0,822,Selective Convolutional Units: Improving CNNs via Channel Selectivity,"[""jongheonj@kaist.ac.kr"", ""jinwoos@kaist.ac.kr""]","[""Jongheon Jeong"", ""Jinwoo Shin""]","[""convolutional neural networks"", ""channel-selectivity"", ""channel re-wiring"", ""bottleneck architectures"", ""deep learning""]","Bottleneck structures with identity (e.g., residual) connection are now emerging popular paradigms for designing deep convolutional neural networks (CNN), for processing large-scale features efficiently. In this paper, we focus on the information-preserving nature of identity connection and utilize this to enable a convolutional layer to have a new functionality of channel-selectivity, i.e., re-distributing its computations to important channels. In particular, we propose Selective Convolutional Unit (SCU), a widely-applicable architectural unit that improves parameter efficiency of various modern CNNs with bottlenecks. During training, SCU gradually learns the channel-selectivity on-the-fly via the alternative usage of (a) pruning unimportant channels, and (b) rewiring the pruned parameters to important channels. 
The rewired parameters emphasize the target channel in a way that selectively enlarges the convolutional kernels corresponding to it. Our experimental results demonstrate that the SCU-based models without any postprocessing generally achieve both model compression and accuracy improvement compared to the baselines, consistently for all tested architectures.",/pdf/dcda87724459e99b88356eb1657ee7ba9e65ff78.pdf,ICLR,2019,"We propose a new module that improves any ResNet-like architectures by enforcing ""channel selective"" behavior to convolutional layers" +6lH8nkwKRXV,Q4tk5Oe5ZGN,1601310000000.0,1614990000000.0,641,Graph Structural Aggregation for Explainable Learning,"[""~Alexis_Galland1"", ""~marc_lelarge1""]","[""Alexis Galland"", ""marc lelarge""]","[""graph"", ""deep"", ""learning""]","Graph neural networks have proven to be very efficient to solve several tasks in graphs such as node classification or link prediction. These algorithms that operate by propagating information from vertices to their neighbors allow one to build node embeddings that contain local information. In order to use graph neural networks for graph classification, node embeddings must be aggregated to obtain a graph representation able to discriminate among different graphs (of possibly various sizes). Moreover, in analogy to neural networks for image classification, there is a need for explainability regarding the features that are selected in the graph classification process. To this end, we introduce StructAgg, a simple yet effective aggregation process based on the identification of structural roles for nodes in graphs that we use to create an end-to-end model. Through extensive experiments we show that this architecture can compete with state-of-the-art methods. We show how this aggregation step allows us to cluster together nodes that have comparable structural roles and how these roles provide explainability to this neural network model. 
+",/pdf/d7b07a34fa212b5aef3f47b3f1e14721a2c6aedd.pdf,ICLR,2021,An aggregation process to detect structural roles to bring explainability to a graph classification task. +HyGTuv9eg,,1478290000000.0,1486680000000.0,409,Incorporating long-range consistency in CNN-based texture generation,"[""guillaume.berger@umontreal.ca"", ""memisevr@iro.umontreal.ca""]","[""Guillaume Berger"", ""Roland Memisevic""]","[""Computer vision"", ""Deep learning""]","Gatys et al. (2015) showed that pair-wise products of features in a convolutional network are a very effective representation of image textures. We propose a simple modification to that representation which makes it possible to incorporate long-range structure into image generation, and to render images that satisfy various symmetry constraints. We show how this can greatly improve rendering of regular textures and of images that contain other kinds of symmetric structure. We also present applications to inpainting and season transfer.",/pdf/6fe8e450d48b17c195a7b9212e55f47001c57fbf.pdf,ICLR,2017,We propose a simple extension to the Gatys et al. algorithm which makes it possible to incorporate long-range structure into texture generation. +SkgEaj05t7,Hkxzpt9FYX,1538090000000.0,1555500000000.0,792,On the Relation Between the Sharpest Directions of DNN Loss and the SGD Step Length,"[""staszek.jastrzebski@gmail.com"", ""zakenton@gmail.com"", ""ballasn@fb.com"", ""asja.fischer@gmail.com"", ""yoshua.umontreal@gmail.com"", ""a.storkey@ed.ac.uk""]","[""Stanis\u0142aw Jastrz\u0119bski"", ""Zachary Kenton"", ""Nicolas Ballas"", ""Asja Fischer"", ""Yoshua Bengio"", ""Amos Storkey""]","[""optimization"", ""generalization"", ""theory of deep learning"", ""SGD"", ""hessian""]","The training of deep neural networks with Stochastic Gradient Descent (SGD) with a large learning rate or a small batch-size typically ends in flat regions of the weight space, as indicated by small eigenvalues of the Hessian of the training loss. 
This was found to correlate with a good final generalization performance. In this paper we extend previous work by investigating the curvature of the loss surface along the whole training trajectory, rather than only at the endpoint. We find that initially SGD visits increasingly sharp regions, reaching a maximum sharpness determined by both the learning rate and the batch-size of SGD. At this peak value SGD starts to fail to minimize the loss along directions in the loss surface corresponding to the largest curvature (sharpest directions). To further investigate the effect of these dynamics in the training process, we study a variant of SGD using a reduced learning rate along the sharpest directions which we show can improve training speed while finding both sharper and better generalizing solution, compared to vanilla SGD. Overall, our results show that the SGD dynamics in the subspace of the sharpest directions influence the regions that SGD steers to (where larger learning rate or smaller batch size result in wider regions visited), the overall training speed, and the generalization ability of the final model.",/pdf/9e4a1664119555f99638d384771856a4a7758ed1.pdf,ICLR,2019,"SGD is steered early on in training towards a region in which its step is too large compared to curvature, which impacts the rest of training. " +jlVNBPEDynH,4OTso14fjNr,1601310000000.0,1614990000000.0,1992,Neuro-algorithmic Policies for Discrete Planning,"[""~Marin_Vlastelica_Pogan\u010di\u01071"", ""~Michal_Rolinek2"", ""~Georg_Martius1""]","[""Marin Vlastelica Pogan\u010di\u0107"", ""Michal Rolinek"", ""Georg Martius""]","[""planning"", ""reinforcement learning"", ""combinatorial optimization"", ""control"", ""imitation learning""]","Although model-based and model-free approaches to learning the control of systems have achieved impressive results on standard benchmarks, generalization to variations in the task are still unsatisfactory. 
Recent results suggest that generalization of standard architectures improves only after obtaining exhaustive amounts of data. We give evidence that the generalization capabilities are in many cases bottlenecked by the inability to generalize on the combinatorial aspects. Further, we show that for a certain subclass of the MDP framework, this can be alleviated by neuro-algorithmic architectures. + +Many control problems require long-term planning that is hard to solve generically with neural networks alone. We introduce a neuro-algorithmic policy architecture consisting of a neural network and an embedded time-dependent shortest path solver. These policies can be trained end-to-end by blackbox differentiation. We show that this type of architecture generalizes well to unseen variations in the environment already after seeing a few examples. + + +https://sites.google.com/view/neuro-algorithmic",/pdf/4035dfe4e9d3057d428a8a964ac9688562fcf684.pdf,ICLR,2021,We introduce neuro-algorithmic policies embedding shortest path solvers for discrete planning. +HJfxbhR9KQ,HkxDoqa5YQ,1538090000000.0,1545360000000.0,1140,Mimicking actions is a good strategy for beginners: Fast Reinforcement Learning with Expert Action Sequences,"[""tharun.medini@rice.edu"", ""anshumali@rice.edu""]","[""Tharun Medini"", ""Anshumali Shrivastava""]","[""Reinforcement Learning"", ""Imitation Learning"", ""Atari"", ""A3C"", ""GA3C""]","Imitation Learning is the task of mimicking the behavior of an expert player in a Reinforcement Learning (RL) Environment to enhance the training of a fresh agent (called novice) beginning from scratch. Most of the Reinforcement Learning environments are stochastic in nature, i.e., the state sequences that an agent may encounter usually follow a Markov Decision Process (MDP). This makes the task of mimicking difficult as it is very unlikely that a new agent may encounter same or similar state sequences as an expert. 
Prior research in Imitation Learning proposes various ways to learn a mapping between the states encountered and the respective actions taken by the expert while mostly being agnostic to the order in which these were performed. Most of these methods need considerable number of states-action pairs to achieve good results. We propose a simple alternative to Imitation Learning by appending the novice’s action space with the frequent short action sequences that the expert has taken. This simple modification, surprisingly improves the exploration and significantly outperforms alternative approaches like Dataset Aggregation. We experiment with several popular Atari games and show significant and consistent growth in the score that the new agents achieve using just a few expert action sequences.",/pdf/888c9ff135a94729e2061a5d2fbfc1d025acab8f.pdf,ICLR,2019,Appending most frequent action pairs from an expert player to a novice RL agent's action space improves the scores by huge margin. +T6RYeudzf1,HZBpow94i1p,1601310000000.0,1614990000000.0,982,TextSETTR: Label-Free Text Style Extraction and Tunable Targeted Restyling,"[""~Parker_Riley1"", ""~Noah_Constant1"", ""~Mandy_Guo2"", ""~Girish_Kumar1"", ""~David_Uthus1"", ""~Zarana_Parekh2""]","[""Parker Riley"", ""Noah Constant"", ""Mandy Guo"", ""Girish Kumar"", ""David Uthus"", ""Zarana Parekh""]","[""style transfer"", ""text style"", ""text generation"", ""generative models"", ""conditional generation""]","We present a novel approach to the challenging problem of label-free text style transfer. Unlike previous approaches that use parallel or non-parallel labeled data, our technique removes the need for labels entirely, relying instead on the implicit connection in style between adjacent sentences in unlabeled text. We show that T5 (Raffel et al., 2020), a strong pretrained text-to-text model, can be adapted to extract a style vector from arbitrary text and use this vector to condition the decoder to perform style transfer. 
As the resulting learned style vector space encodes many facets of textual style, we recast transfers as ""targeted restyling"" vector operations that adjust specific attributes of the input text while preserving others. When trained over unlabeled Amazon reviews data, our resulting TextSETTR model is competitive on sentiment transfer, even when given only four exemplars of each class. Furthermore, we demonstrate that a single model trained on unlabeled Common Crawl data is capable of transferring along multiple dimensions including dialect, emotiveness, formality, politeness, and sentiment. ",/pdf/0b4185928968deff46d9e5ca8884452237322e9b.pdf,ICLR,2021,"We present a technique for training a style transfer model in the complete absence of labels, and show the resulting model can control many different style attributes at test time (sentiment, dialect, formality, etc.)." +AZ4vmLoJft,4ElWPp54rzK,1601310000000.0,1614990000000.0,897,(Updated submission 11/20/2020) MISIM: A Novel Code Similarity System,"[""yefangke@gatech.edu"", ""~Shengtian_Zhou1"", ""anand.venkat@intel.com"", ""~Ryan_Marcus1"", ""~Nesime_Tatbul1"", ""jesmin.jahan.tithi@intel.com"", ""niranjan.hasabnis@intel.com"", ""paul.petersen@intel.com"", ""timothy.g.mattson@intel.com"", ""~Tim_Kraska1"", ""~Pradeep_Dubey1"", ""~Vivek_Sarkar2"", ""~Justin_Gottschlich1""]","[""Fangke Ye"", ""Shengtian Zhou"", ""Anand Venkat"", ""Ryan Marcus"", ""Nesime Tatbul"", ""Jesmin Jahan Tithi"", ""Niranjan Hasabnis"", ""Paul Petersen"", ""Timothy G Mattson"", ""Tim Kraska"", ""Pradeep Dubey"", ""Vivek Sarkar"", ""Justin Gottschlich""]","[""Machine Programming"", ""Machine Learning"", ""Code Similarity"", ""Code Representation""]","Semantic code similarity systems are integral to a range of applications from code recommendation to automated software defect correction. Yet, these systems still lack the maturity in accuracy for general and reliable wide-scale usage. 
To help address this, we present Machine Inferred Code Similarity (MISIM), a novel end-to-end code similarity system that consists of two core components. First, MISIM uses a novel context-aware semantic structure (CASS), which is designed to aid in lifting semantic meaning from code syntax. We compare CASS with the abstract syntax tree (AST) and show CASS is more accurate than AST by up to 1.67x. Second, MISIM provides a neural-based code similarity scoring algorithm, which can be implemented with various neural network architectures with learned parameters. We compare MISIM to four state-of-the-art systems: (i) Aroma, (ii) code2seq, (iii) code2vec, and (iv) Neural Code Comprehension. In our experimental evaluation across 328,155 programs (over 18 million lines of code), MISIM has 1.5x to 43.4x better accuracy across all four systems. +",/pdf/4e99ed3893b89259f1fd9768dafbe7bddb314ddf.pdf,ICLR,2021,We present a new state-of-the-art code similarity system that includes a novel code structure and a flexible neural back-end to learn the code similarity algorithm for different code corpi. +rJgvf3RcFQ,HklvwfC5YQ,1538090000000.0,1545360000000.0,1271,On Inductive Biases in Deep Reinforcement Learning,"[""mtthss@google.com"", ""hado@google.com"", ""modayil@google.com"", ""davidsilver@google.com""]","[""Matteo Hessel"", ""Hado van Hasselt"", ""Joseph Modayil"", ""David Silver""]",[],"Many deep reinforcement learning algorithms contain inductive biases that sculpt the agent's objective and its interface to the environment. These inductive biases can take many forms, including domain knowledge and pretuned hyper-parameters. In general, there is a trade-off between generality and performance when we use such biases. Stronger biases can lead to faster learning, but weaker biases can potentially lead to more general algorithms that work on a wider class of problems. 
+This trade-off is relevant because these inductive biases are not free; substantial effort may be required to obtain relevant domain knowledge or to tune hyper-parameters effectively. In this paper, we re-examine several domain-specific components that modify the agent's objective and environmental interface. We investigated whether the performance deteriorates when all these fixed components are replaced with adaptive solutions from the literature. In our experiments, performance sometimes decreased with the adaptive components, as one might expect when comparing to components crafted for the domain, but sometimes the adaptive components performed better. We then investigated the main benefit of having fewer domain-specific components, by comparing the learning performance of the two systems on a different set of continuous control problems, without additional tuning of either system. As hypothesized, the system with adaptive components performed better on many of the tasks.",/pdf/cb4e289ebf704d6b7d066a967dc219cc0a67a7f9.pdf,ICLR,2019, +imnG4Ap9dAd,hus0bDjiJ0p,1601310000000.0,1614990000000.0,1175,News-Driven Stock Prediction Using Noisy Equity State Representation,"[""~Xiao_Liu14"", ""hhy63@bit.edu.cn"", ""yue.zhang@wias.org.cn""]","[""Xiao Liu"", ""Heyan Huang"", ""Yue Zhang""]","[""news-driven stock prediction"", ""equity state representation"", ""recurrent state transition""]","News-driven stock prediction investigates the correlation between news events and stock price movements. +Previous work has considered effective ways for representing news events and their sequences, but rarely exploited the representation of underlying equity states. +We address this issue by making use of a recurrent neural network to represent an equity state transition sequence, integrating news representation using contextualized embeddings as inputs to the state transition mechanism. 
+Thanks to the separation of news and equity representations, our model can accommodate additional input factors. +We design a novel random noise factor for modeling influencing factors beyond news events, and a future event factor to address the delay of news information (e.g., insider trading). +Results show that the proposed model outperforms strong baselines in the literature.",/pdf/8edfea7f8cb03fcf3205c123f11639241f554caf.pdf,ICLR,2021,"We investigate explicit modeling of equity state representations in news-driven stock prediction by using an LSTM state to model the fundamentals, adding news impact and noise impact by using attention and noise sampling, respectively." +S1dIzvclg,,1478290000000.0,1487720000000.0,368,A recurrent neural network without chaos,"[""tlaurent@lmu.edu"", ""james.vonbrecht@csulb.edu""]","[""Thomas Laurent"", ""James von Brecht""]",[],"We introduce an exceptionally simple gated recurrent neural network (RNN) that achieves performance comparable to well-known gated architectures, such as LSTMs and GRUs, on the word-level language modeling task. We prove that our model has simple, predictable and non-chaotic dynamics. 
This stands in stark contrast to more standard gated architectures, whose underlying dynamical systems exhibit chaotic behavior.",/pdf/90ed26d8d9ccbd87ae3e0019930730bbbdb93556.pdf,ICLR,2017, +Rld-9OxQ6HU,NaAqp-ehJ0h-,1601310000000.0,1614990000000.0,3049,MC-LSTM: Mass-conserving LSTM,"[""~Pieter-Jan_Hoedt1"", ""~Frederik_Kratzert1"", ""~Daniel_Klotz1"", ""halmich@ml.jku.at"", ""~Markus_Holzleitner1"", ""gsnearing@google.com"", ""~Sepp_Hochreiter1"", ""~G\u00fcnter_Klambauer1""]","[""Pieter-Jan Hoedt"", ""Frederik Kratzert"", ""Daniel Klotz"", ""Christina Halmich"", ""Markus Holzleitner"", ""Grey Nearing"", ""Sepp Hochreiter"", ""G\u00fcnter Klambauer""]","[""LSTM"", ""RNN"", ""mass-conservation"", ""neural arithmetic units"", ""inductive bias"", ""hydrology""]","The success of Convolutional Neural Networks (CNNs) in computer vision is mainly driven by their strong inductive bias, which is strong enough to allow CNNs to solve vision-related tasks with random weights, meaning without learning. Similarly, Long Short-Term Memory (LSTM) has a strong inductive bias towards storing information over time. However, many real-world systems are governed by conservation laws, which lead to the redistribution of particular quantities —e.g. in physical and economical systems. Our novel Mass-Conserving LSTM (MC-LSTM) adheres to these conservation laws by extending the inductive bias of LSTM to model the redistribution of those stored quantities. MC-LSTMs set a new state-of-the-art for neural arithmetic units at learning arithmetic operations, such as addition tasks, which have a strong conservation law, as the sum is constant overtime. Further, MC-LSTM is applied to traffic forecasting, modeling a pendulum, and a large benchmark dataset in hydrology, where it sets a new state-of-the-art for predicting peak flows. 
In the hydrology example, we show that MC-LSTM states correlate with real world processes and are therefore interpretable.",/pdf/429ce339b6db0b897232eaf0dc6443bbb123c59c.pdf,ICLR,2021,We present a mass-conserving variant of LSTM that excels as neural arithmetic units and at flood forecasting +ryza73R9tQ,rJl4eu4qtQ,1538090000000.0,1545360000000.0,1402,Machine Translation With Weakly Paired Bilingual Documents,"[""wulijun3@mail2.sysu.edu.cn"", ""teslazhu@mail.ustc.edu.cn"", ""di_he@pku.edu.cn"", ""feiga@microsoft.com"", ""xuta@microsoft.com"", ""taoqin@microsoft.com"", ""tyliu@microsoft.com""]","[""Lijun Wu"", ""Jinhua Zhu"", ""Di He"", ""Fei Gao"", ""Xu Tan"", ""Tao Qin"", ""Tie-Yan Liu""]","[""Natural Language Processing"", ""Machine Translation"", ""Unsupervised Learning""]","Neural machine translation, which achieves near human-level performance in some languages, strongly relies on the availability of large amounts of parallel sentences, which hinders its applicability to low-resource language pairs. Recent works explore the possibility of unsupervised machine translation with monolingual data only, leading to much lower accuracy compared with the supervised one. Observing that weakly paired bilingual documents are much easier to collect than bilingual sentences, e.g., from Wikipedia, news websites or books, in this paper, we investigate the training of translation models with weakly paired bilingual documents. Our approach contains two components/steps. First, we provide a simple approach to mine implicitly bilingual sentence pairs from document pairs which can then be used as supervised signals for training. Second, we leverage the topic consistency of two weakly paired documents and learn the sentence-to-sentence translation by constraining the word distribution-level alignments. 
We evaluate our proposed method on weakly paired documents from Wikipedia on four tasks, the widely used WMT16 German$\leftrightarrow$English and WMT13 Spanish$\leftrightarrow$English tasks, and obtain $24.1$/$30.3$ and $28.0$/$27.6$ BLEU points separately, outperforming +state-of-the-art unsupervised results by more than 5 BLEU points and reducing the gap between unsupervised translation and supervised translation up to 50\%. ",/pdf/e78738172a26113eaf3f8d4f6fa24649c105d430.pdf,ICLR,2019, +ByldlhAqYQ,rkxTMWt9Fm,1538090000000.0,1550810000000.0,1089,Transfer Learning for Sequences via Learning to Collocate,"[""cui.wanyun@sufe.edu.cn"", ""simonzgy@outlook.com"", ""shen54@illinois.edu"", ""tedjiangfdu@gmail.com"", ""weiwang1@fudan.edu.cn""]","[""Wanyun Cui"", ""Guangyu Zheng"", ""Zhiqiang Shen"", ""Sihang Jiang"", ""Wei Wang""]","[""transfer learning"", ""recurrent neural network"", ""attention"", ""natural language processing""]","Transfer learning aims to solve the data sparsity for a specific domain by applying information of another domain. Given a sequence (e.g. a natural language sentence), the transfer learning, usually enabled by recurrent neural network (RNN), represent the sequential information transfer. RNN uses a chain of repeating cells to model the sequence data. However, previous studies of neural network based transfer learning simply transfer the information across the whole layers, which are unfeasible for seq2seq and sequence labeling. Meanwhile, such layer-wise transfer learning mechanisms also lose the fine-grained cell-level information from the source domain. + +In this paper, we proposed the aligned recurrent transfer, ART, to achieve cell-level information transfer. ART is in a recurrent manner that different cells share the same parameters. Besides transferring the corresponding information at the same position, ART transfers information from all collocated words in the source domain. 
This strategy enables ART to capture the word collocation across domains in a more flexible way. We conducted extensive experiments on both sequence labeling tasks (POS tagging, NER) and sentence classification (sentiment analysis). ART outperforms the state-of-the-arts over all experiments. +",/pdf/911a87ed4e22b26e2748ee0758b8c4f2eb816d39.pdf,ICLR,2019,Transfer learning for sequence via learning to align cell-level information across domains. +rJxug2R9Km,SJxeuxT-YX,1538090000000.0,1545360000000.0,1093,Meta-Learning for Contextual Bandit Exploration,"[""amr@cs.umd.edu"", ""hal@umiacs.umd.edu""]","[""Amr Sharaf"", ""Hal Daum\u00e9 III""]","[""meta-learning"", ""bandits"", ""exploration"", ""imitation learning""]","We describe MÊLÉE, a meta-learning algorithm for learning a good exploration policy in the interactive contextual bandit setting. Here, an algorithm must take actions based on contexts, and learn based only on a reward signal from the action taken, thereby generating an exploration/exploitation trade-off. MÊLÉE addresses this trade-off by learning a good exploration strategy based on offline synthetic tasks, on which it can simulate the contextual bandit setting. Based on these simulations, MÊLÉE uses an imitation learning strategy to learn a good exploration policy that can then be applied to true contextual bandit tasks at test time. We compare MÊLÉE to seven strong baseline contextual bandit algorithms on a set of three hundred real-world datasets, on which it outperforms alternatives in most settings, especially when differences in rewards are large. Finally, we demonstrate the importance of having a rich feature representation for learning how to explore. +",/pdf/dc2298039804e0454caff65a4df83e29da6cd652.pdf,ICLR,2019,"We present a meta-learning algorithm, MÊLÉE, for learning a good exploration function in the interactive contextual bandit setting." 
+BJlkgaNKvr,Hyx9Q3nBvH,1569440000000.0,1577170000000.0,324,Towards Understanding the Regularization of Adversarial Robustness on Neural Networks,"[""wen.yuxin@mail.scut.edu.cn"", ""lishuai918@gmail.com"", ""kuijia@scut.edu.cn""]","[""Yuxin Wen"", ""Shuai Li"", ""Kui Jia""]","[""Adversarial robustness"", ""Statistical Learning"", ""Regularization""]"," The problem of adversarial examples has shown that modern Neural Network (NN) models could be rather fragile. Among the most promising techniques to solve the problem, one is to require the model to be {\it $\epsilon$-adversarially robust} (AR); that is, to require the model not to change predicted labels when any given input examples are perturbed within a certain range. However, it is widely observed that such methods would lead to standard performance degradation, i.e., the degradation on natural examples. In this work, we study the degradation through the regularization perspective. We identify quantities from generalization analysis of NNs; with the identified quantities we empirically find that AR is achieved by regularizing/biasing NNs towards less confident solutions by making the changes in the feature space (induced by changes in the instance space) of most layers smoother uniformly in all directions; so to a certain extent, it prevents sudden change in prediction w.r.t. perturbations. However, the end result of such smoothing concentrates samples around decision boundaries, resulting in less confident solutions, and leads to worse standard performance. Our studies suggest that one might consider ways that build AR into NNs in a gentler way to avoid the problematic regularization. +",/pdf/5109cf6af39021a094ead15a16ac9d8e3172da47.pdf,ICLR,2020,We study the accuracy degradation in adversarial training through regularization perspective and find that such training induces diffident NNs that concentrate prediction around decision boundary which leads to worse standard performance. 
+SkgvvCVtDS,Skx0f_wdPH,1569440000000.0,1577170000000.0,1177,DeepSimplex: Reinforcement Learning of Pivot Rules Improves the Efficiency of Simplex Algorithm in Solving Linear Programming Problems,"[""vs478@cornell.edu"", ""onur.tavaslioglu@bcm.edu"", ""ankit.patel@bcm.edu"", ""andrew.schaefer@rice.edu""]","[""Varun Suriyanarayana"", ""Onur Tavaslioglu"", ""Ankit B. Patel"", ""Andrew J. Schaefer""]","[""Simplex Algorithm"", ""Pivoting Rules"", ""Reinforcement Learning"", ""Combinatorial Optimization"", ""Supervised Learning"", ""Travelling Salesman Problem""]","Linear Programs (LPs) are a fundamental class of optimization problems with a wide variety of applications. Fast algorithms for solving LPs are the workhorse of many combinatorial optimization algorithms, especially those involving integer programming. One popular method to solve LPs is the simplex method which, at each iteration, traverses the surface of the polyhedron of feasible solutions. At each vertex of the polyhedron, one of several heuristics chooses the next neighboring vertex, and these vary in accuracy and computational cost. We use deep value-based reinforcement learning to learn a pivoting strategy that at each iteration chooses between two of the most popular pivot rules -- Dantzig and steepest edge. +Because the latter is typically more accurate and computationally costly than the former, we assign a higher wall time-based cost to steepest edge iterations than Dantzig iterations. We optimize this weighted cost on a neural net architecture designed for the simplex algorithm. We obtain between 20% to 50% reduction in the gap between weighted iterations of the individual pivoting rules, and the best possible omniscient policies for LP relaxations of randomly generated instances of five-city Traveling Salesman Problem. 
",/pdf/2451828ee7353798dee80bc210cb39b0c54e9b44.pdf,ICLR,2020,"Learning pivoting rules of the simplex algorithm for solving linear programs to improve the solution times, demonstrated on linear approximations of travelling salesman problem." +1FvkSpWosOl,vQChj59Z9mj,1601310000000.0,1616000000000.0,3278,Is Attention Better Than Matrix Decomposition?,"[""~Zhengyang_Geng1"", ""~Meng-Hao_Guo1"", ""~Hongxu_Chen2"", ""~Xia_Li3"", ""~Ke_Wei1"", ""~Zhouchen_Lin1""]","[""Zhengyang Geng"", ""Meng-Hao Guo"", ""Hongxu Chen"", ""Xia Li"", ""Ke Wei"", ""Zhouchen Lin""]","[""attention models"", ""matrix decomposition"", ""computer vision""]","As an essential ingredient of modern deep learning, attention mechanism, especially self-attention, plays a vital role in the global correlation discovery. However, is hand-crafted attention irreplaceable when modeling the global context? Our intriguing finding is that self-attention is not better than the matrix decomposition~(MD) model developed 20 years ago regarding the performance and computational cost for encoding the long-distance dependencies. We model the global context issue as a low-rank completion problem and show that its optimization algorithms can help design global information blocks. This paper then proposes a series of Hamburgers, in which we employ the optimization algorithms for solving MDs to factorize the input representations into sub-matrices and reconstruct a low-rank embedding. Hamburgers with different MDs can perform favorably against the popular global context module self-attention when carefully coping with gradients back-propagated through MDs. Comprehensive experiments are conducted in the vision tasks where it is crucial to learn the global context, including semantic segmentation and image generation, demonstrating significant improvements over self-attention and its variants. 
Code is available at https://github.com/Gsunshine/Enjoy-Hamburger.",/pdf/1cb5acc6fe475a215dd1192beec6158b8a4da5dc.pdf,ICLR,2021, +rkg-TJBFPB,HJeNiQ1YDB,1569440000000.0,1588260000000.0,1978,RIDE: Rewarding Impact-Driven Exploration for Procedurally-Generated Environments,"[""raileanu@cs.nyu.edu"", ""tim.rocktaeschel@gmail.com""]","[""Roberta Raileanu"", ""Tim Rockt\u00e4schel""]","[""reinforcement learning"", ""exploration"", ""curiosity""]","Exploration in sparse reward environments remains one of the key challenges of model-free reinforcement learning. Instead of solely relying on extrinsic rewards provided by the environment, many state-of-the-art methods use intrinsic rewards to encourage exploration. However, we show that existing methods fall short in procedurally-generated environments where an agent is unlikely to visit a state more than once. We propose a novel type of intrinsic reward which encourages the agent to take actions that lead to significant changes in its learned state representation. We evaluate our method on multiple challenging procedurally-generated tasks in MiniGrid, as well as on tasks with high-dimensional observations used in prior work. Our experiments demonstrate that this approach is more sample efficient than existing exploration methods, particularly for procedurally-generated MiniGrid environments. Furthermore, we analyze the learned behavior as well as the intrinsic reward received by our agent. In contrast to previous approaches, our intrinsic reward does not diminish during the course of training and it rewards the agent substantially more for interacting with objects that it can control.",/pdf/68bc2411f16aa93f68d856590f15a91041d772d2.pdf,ICLR,2020,Reward agents for taking actions that lead to changes in the environment state. +BJEOOsCqKm,BkxwqAtcF7,1538090000000.0,1545360000000.0,370,Psychophysical vs. 
 learnt texture representations in novelty detection,"[""m.grunwald@htwg-konstanz.de"", ""matthias.hermann@htwg-konstanz.de"", ""f.freiberg@htwg-konstanz.de"", ""mfanz@htwg-konstanz.de""]","[""Michael Grunwald"", ""Matthias Hermann"", ""Fabian Freiberg"", ""Matthias O. Franz""]","[""novelty detection"", ""learnt texture representation"", ""one-class neural network"", ""human-vision-inspired anomaly detection""]","Parametric texture models have been applied successfully to synthesize artificial images. Psychophysical studies show that under defined conditions observers are unable to differentiate between model-generated and original natural textures. In industrial applications the reverse case is of interest: a texture analysis system should decide if human observers are able to discriminate between a reference and a novel texture. For example, in case of inspecting decorative surfaces the detection of visible texture anomalies without any prior knowledge is required. Here, we implemented a human-vision-inspired novelty detection approach. Assuming that the features used for texture synthesis are important for human texture perception, we compare psychophysical as well as learnt texture representations based on activations of a pretrained CNN in a novelty detection scenario. Additionally, we introduce a novel objective function to train one-class neural networks for novelty detection and compare the results to standard one-class SVM approaches. Our experiments clearly show the differences between human-vision-inspired texture representations and learnt features in detecting visual anomalies. Based on a digital print inspection scenario we show that psychophysical texture representations are able to outperform CNN-encoded features.",/pdf/d3b8ba69ac8f3cb3e17b5f592e57e42c05bfd029.pdf,ICLR,2019,Comparison of psychophysical and CNN-encoded texture representations in a one-class neural network novelty detection application. 
+wG5XIGi6nrt,KivAb2mzsyb,1601310000000.0,1614990000000.0,1600,Learning Private Representations with Focal Entropy,"[""~Tassilo_Klein1"", ""~Moin_Nabi1""]","[""Tassilo Klein"", ""Moin Nabi""]",[],"How can we learn a representation with good predictive power while preserving user privacy? +We present an adversarial representation learning method to sanitize sensitive content from the representation in an adversarial fashion. +Specifically, we propose focal entropy - a variant of entropy embedded in an adversarial representation learning setting to leverage privacy sanitization. Focal entropy enforces maximum uncertainty in terms of confusion on the subset of privacy-related similar classes, separated from the dissimilar ones. As such, our proposed sanitization method yields deep sanitization of private features yet is conceptually simple and empirically powerful. We showcase feasibility in terms of classification of facial attributes and identity on the CelebA dataset as well as CIFAR-100. The results suggest that private components can be removed reliably. ",/pdf/efd01481c7e84f17c8eae45b98f4f3fc81d84efc.pdf,ICLR,2021,We propose a variant of entropy embedded in an adversarial representation learning setting to leverage privacy sanitization in a semantic-aware fashion. +S19eAF9ee,,1478300000000.0,1481780000000.0,524,Structured Sequence Modeling with Graph Convolutional Recurrent Networks,"[""youngjoo.seo@epfl.ch"", ""michael.defferrard@epfl.ch"", ""pierre.vandergheynst@epfl.ch"", ""xavier.bresson@gmail.com""]","[""Youngjoo Seo"", ""Micha\u00ebl Defferrard"", ""Pierre Vandergheynst"", ""Xavier Bresson""]","[""Structured prediction""]","This paper introduces Graph Convolutional Recurrent Network (GCRN), a deep learning model able to predict structured sequences of data. Precisely, GCRN is a generalization of classical recurrent neural networks (RNN) to data structured by any arbitrary graph. 
Such structured sequences can be series of frames in videos, spatio-temporal measurements on a network of sensors, or random walks on a vocabulary graph for natural language modeling. The proposed model combines convolutional neural networks (CNN) on graphs to identify spatial structures and RNN to find dynamic patterns. We study two possible architectures of GCRN, and apply the models to two practical problems: predicting moving MNIST data, and modeling natural language with the Penn Treebank dataset. Experiments show that exploiting simultaneously graph spatial and dynamic information about data can improve both precision and learning speed.",/pdf/f9c5c3d23f1cd33b4ab933c1fc3ec2db5d976103.pdf,ICLR,2017,This paper introduces a neural network to model graph-structured sequences +BJgEd6NYPH,Hkgq1xhPvB,1569440000000.0,1577170000000.0,629,Ellipsoidal Trust Region Methods for Neural Network Training,"[""ladolphs@inf.ethz.ch"", ""jonas.kohler@inf.ethz.ch"", ""aurelien.lucchi@inf.ethz.ch""]","[""Leonard Adolphs"", ""Jonas Kohler"", ""Aurelien Lucchi""]","[""non-convex"", ""optimization"", ""neural networks"", ""trust-region""]","We investigate the use of ellipsoidal trust region constraints for second-order optimization of neural networks. This approach can be seen as a higher-order counterpart of adaptive gradient methods, which we here show to be interpretable as first-order trust region methods with ellipsoidal constraints. In particular, we show that the preconditioning matrix used in RMSProp and Adam satisfies the necessary conditions for provable convergence of second-order trust region methods with standard worst-case complexities. Furthermore, we run experiments across different neural architectures and datasets to find that the ellipsoidal constraints constantly outperform their spherical counterpart both in terms of number of backpropagations and asymptotic loss value. 
Finally, we find comparable performance to state-of-the-art first-order methods in terms of backpropagations, but further advances in hardware are needed to render Newton methods competitive in terms of time.",/pdf/61896ad3a7458495979b13e87415796b386ca80e.pdf,ICLR,2020,We propose a generalization of adaptive gradient methods to second-order algorithms. +DEa4JdMWRHp,DC9nSbj9cKw,1601310000000.0,1611610000000.0,1672,Interpretable Models for Granger Causality Using Self-explaining Neural Networks,"[""~Ri\u010dards_Marcinkevi\u010ds1"", ""~Julia_E_Vogt1""]","[""Ri\u010dards Marcinkevi\u010ds"", ""Julia E Vogt""]","[""time series"", ""Granger causality"", ""interpretability"", ""inference"", ""neural networks""]","Exploratory analysis of time series data can yield a better understanding of complex dynamical systems. Granger causality is a practical framework for analysing interactions in sequential data, applied in a wide range of domains. In this paper, we propose a novel framework for inferring multivariate Granger causality under nonlinear dynamics based on an extension of self-explaining neural networks. This framework is more interpretable than other neural-network-based techniques for inferring Granger causality, since in addition to relational inference, it also allows detecting signs of Granger-causal effects and inspecting their variability over time. In comprehensive experiments on simulated data, we show that our framework performs on par with several powerful baseline methods at inferring Granger causality and that it achieves better performance at inferring interaction signs. The results suggest that our framework is a viable and more interpretable alternative to sparse-input neural networks for inferring Granger causality.",/pdf/9adb4d4523a376d9afeece07cffb8e319f8cc687.pdf,ICLR,2021,We propose an interpretable framework for inferring Granger causality based on self-explaining neural networks. 
+rkeNfp4tPr,r1x65m5IwB,1569440000000.0,1583910000000.0,408,Escaping Saddle Points Faster with Stochastic Momentum,"[""jimwang@gatech.edu"", ""cl3385@gatech.edu"", ""prof@gatech.edu""]","[""Jun-Kun Wang"", ""Chi-Heng Lin"", ""Jacob Abernethy""]","[""SGD"", ""momentum"", ""escaping saddle point""]","Stochastic gradient descent (SGD) with stochastic momentum is popular in nonconvex stochastic optimization and particularly for the training of deep neural networks. In standard SGD, parameters are updated by improving along the path of the gradient at the current iterate on a batch of examples, where the addition of a ``momentum'' term biases the update in the direction of the previous change in parameters. In non-stochastic convex optimization one can show that a momentum adjustment provably reduces convergence time in many settings, yet such results have been elusive in the stochastic and non-convex settings. At the same time, a widely-observed empirical phenomenon is that in training deep networks stochastic momentum appears to significantly improve convergence time, variants of it have flourished in the development of other popular update methods, e.g. ADAM, AMSGrad, etc. Yet theoretical justification for the use of stochastic momentum has remained a significant open question. In this paper we propose an answer: stochastic momentum improves deep network training because it modifies SGD to escape saddle points faster and, consequently, to more quickly find a second order stationary point. Our theoretical results also shed light on the related question of how to choose the ideal momentum parameter--our analysis suggests that $\beta \in [0,1)$ should be large (close to 1), which comports with empirical findings. 
We also provide experimental findings that further validate these conclusions.",/pdf/3778e4dec93b5402c07b28d95240b23c60396bce.pdf,ICLR,2020,Higher momentum parameter $\beta$ helps for escaping saddle points faster +rklMnyBtPB,S1ew3xJYwr,1569440000000.0,1577170000000.0,1944,Adversarial Robustness Against the Union of Multiple Perturbation Models,"[""pratyush.maini@gmail.com"", ""ericwong@cs.cmu.edu"", ""zkolter@cs.cmu.edu""]","[""Pratyush Maini"", ""Eric Wong"", ""Zico Kolter""]","[""adversarial"", ""robustness"", ""multiple perturbation"", ""MNIST"", ""CIFAR10""]","Owing to the susceptibility of deep learning systems to adversarial attacks, there has been a great deal of work in developing (both empirically and certifiably) robust classifiers, but the vast majority has defended against single types of attacks. Recent work has looked at defending against multiple attacks, specifically on the MNIST dataset, yet this approach used a relatively complex architecture, claiming that standard adversarial training can not apply because it ""overfits"" to a particular norm. In this work, we show that it is indeed possible to adversarially train a robust model against a union of norm-bounded attacks, by using a natural generalization of the standard PGD-based procedure for adversarial training to multiple threat models. 
With this approach, we are able to train standard architectures which are robust against l_inf, l_2, and l_1 attacks, outperforming past approaches on the MNIST dataset and providing the first CIFAR10 network trained to be simultaneously robust against (l_inf, l_2, l_1) threat models, which achieves adversarial accuracy rates of (47.6%, 64.3%, 53.4%) for (l_inf, l_2, l_1) perturbations with epsilon radius = (0.03,0.5,12).",/pdf/cf2808a44fdfabec066720081b9e3e428c42140d.pdf,ICLR,2020,"We develop a generalization of the standard PGD-based procedure to train architectures which are robust against multiple perturbation models, outperforming past approaches on the MNIST and CIFAR10 datasets." +ByED-X-0W,SkcLW7ZCW,1509140000000.0,1540800000000.0,1130,Parametric Information Bottleneck to Optimize Stochastic Neural Networks,"[""thanhnguyen2792@gmail.com"", ""jaesik@unist.ac.kr""]","[""Thanh T. Nguyen"", ""Jaesik Choi""]","[""Information Bottleneck"", ""Deep Neural Networks""]","In this paper, we present a layer-wise learning of stochastic neural networks (SNNs) in an information-theoretic perspective. In each layer of an SNN, the compression and the relevance are defined to quantify the amount of information that the layer contains about the input space and the target space, respectively. We jointly optimize the compression and the relevance of all parameters in an SNN to better exploit the neural network's representation. Previously, the Information Bottleneck (IB) framework (\cite{Tishby99}) extracts relevant information for a target variable. Here, we propose Parametric Information Bottleneck (PIB) for a neural network by utilizing (only) its model parameters explicitly to approximate the compression and the relevance. 
We show that, as compared to the maximum likelihood estimate (MLE) principle, PIBs : (i) improve the generalization of neural networks in classification tasks, (ii) push the representation of neural networks closer to the optimal information-theoretical representation in a faster manner. ",/pdf/6fe8c9470934d85135d8faa73a46e9e71285176a.pdf,ICLR,2018,Learning a better neural networks' representation with Information Bottleneck principle +HylDpoActX,rylxt_pctX,1538090000000.0,1545360000000.0,809,N-Ary Quantization for CNN Model Compression and Inference Acceleration,"[""guenther.schindler@ziti.uni-heidelberg.de"", ""roth@tugraz.at"", ""pernkopf@tugraz.at"", ""holger.froening@ziti.uni-heidelberg.de""]","[""G\u00fcnther Schindler"", ""Wolfgang Roth"", ""Franz Pernkopf"", ""Holger Fr\u00f6ning""]","[""low-resource deep neural networks"", ""quantized weights"", ""weight-clustering"", ""resource efficient neural networks""]","The tremendous memory and computational complexity of Convolutional Neural Networks (CNNs) prevents the inference deployment on resource-constrained systems. As a result, recent research focused on CNN optimization techniques, in particular quantization, which allows weights and activations of layers to be represented with just a few bits while achieving impressive prediction performance. However, aggressive quantization techniques still fail to achieve full-precision prediction performance on state-of-the-art CNN architectures on large-scale classification tasks. In this work we propose a method for weight and activation quantization that is scalable in terms of quantization levels (n-ary representations) and easy to compute while maintaining the performance close to full-precision CNNs. Our weight quantization scheme is based on trainable scaling factors and a nested-means clustering strategy which is robust to weight updates and therefore exhibits good convergence properties. 
The flexibility of nested-means clustering enables exploration of various n-ary weight representations with the potential of high parameter compression. For activations, we propose a linear quantization strategy that takes the statistical properties of batch normalization into account. We demonstrate the effectiveness of our approach using state-of-the-art models on ImageNet.",/pdf/44dac03ae98944e274670d33b826c952f9f8a1c9.pdf,ICLR,2019,We propose a quantization scheme for weights and activations of deep neural networks. This reduces the memory footprint substantially and accelerates inference. +krz7T0xU9Z_,yi8KH1VrtoZ,1601310000000.0,1614780000000.0,3072,The inductive bias of ReLU networks on orthogonally separable data,"[""~Mary_Phuong1"", ""~Christoph_H_Lampert1""]","[""Mary Phuong"", ""Christoph H Lampert""]","[""inductive bias"", ""implicit bias"", ""gradient descent"", ""ReLU networks"", ""max-margin"", ""extremal sector""]","We study the inductive bias of two-layer ReLU networks trained by gradient flow. We identify a class of easy-to-learn (`orthogonally separable') datasets, and characterise the solution that ReLU networks trained on such datasets converge to. Irrespective of network width, the solution turns out to be a combination of two max-margin classifiers: one corresponding to the positive data subset and one corresponding to the negative data subset. +The proof is based on the recently introduced concept of extremal sectors, for which we prove a number of properties in the context of orthogonal separability. In particular, we prove stationarity of activation patterns from some time $T$ onwards, which enables a reduction of the ReLU network to an ensemble of linear subnetworks. +",/pdf/a68e4ef7c465175fddb6ba540763c62f8708c9e3.pdf,ICLR,2021,We characterise the function learnt by two-layer ReLU nets trained on orthogonally separable data. 
+bQtejwuIqB,GgYAjt57GDw,1601310000000.0,1614990000000.0,375,"With False Friends Like These, Who Can Have Self-Knowledge?","[""~Lue_Tao1"", ""~Songcan_Chen1""]","[""Lue Tao"", ""Songcan Chen""]","[""Robustness"", ""Adversarial Risk"", ""Neural Networks"", ""Machine Learning Security""]","Adversarial examples arise from excessive sensitivity of a model. Commonly studied adversarial examples are malicious inputs, crafted by an adversary from correctly classified examples, to induce misclassification. This paper studies an intriguing, yet far overlooked consequence of the excessive sensitivity, that is, a misclassified example can be easily perturbed to help the model to produce correct output. Such perturbed examples look harmless, but actually can be maliciously utilized by a false friend to make the model self-satisfied. Thus we name them hypocritical examples. With false friends like these, a poorly performed model could behave like a state-of-the-art one. Once a deployer trusts the hypocritical performance and uses the ""well-performed"" model in real-world applications, potential security concerns appear even in benign environments. In this paper, we formalize the hypocritical risk for the first time and propose a defense method specialized for hypocritical examples by minimizing the tradeoff between natural risk and an upper bound of hypocritical risk. Moreover, our theoretical analysis reveals connections between adversarial risk and hypocritical risk. Extensive experiments verify the theoretical results and the effectiveness of our proposed methods.",/pdf/a5829dd51dc29c9441bd1979f9fb1a8dc004a99a.pdf,ICLR,2021,Model performance could be hypocritically improved by false friends: we formalize this new realistic risk and analyze its relation with natural risk and adversarial risk. 
+JAlqRs9duhz,bD1oL_jpk4R,1601310000000.0,1614990000000.0,1675,Straight to the Gradient: Learning to Use Novel Tokens for Neural Text Generation,"[""~Xiang_Lin2"", ""~SIMENG_HAN1"", ""~Shafiq_Joty1""]","[""Xiang Lin"", ""SIMENG HAN"", ""Shafiq Joty""]","[""text generation"", ""text degeneration"", ""language model"", ""summarization"", ""image captioning""]","Advanced large-scale neural language models have led to significant success in many natural language generation tasks. However, the most commonly used training objective, Maximum Likelihood Estimation (MLE), has been shown to be problematic, where the trained model prefers using dull and repetitive phrases. In this work, we introduce ScaleGrad, a modification straight to the gradient of the loss function, to remedy the degeneration issues of the standard MLE objective. By directly maneuvering the gradient information, ScaleGrad makes the model learn to use novel tokens during training. Empirical results show the effectiveness of our method not only in open-ended generation, but also in directed generation. With the simplicity in architecture, our method can serve as a general training objective that is applicable to most of the neural text generation tasks.",/pdf/ea2a1cfe7da1dd4d2597fc847df0824cb9c17d7f.pdf,ICLR,2021,We proposed a simple modification to MLE based on gradient analysis and achieved significant improvement on token-level degeneration in different tasks. +BJx3_0VKPB,ryxanyddPB,1569440000000.0,1577170000000.0,1225,On the Unintended Social Bias of Training Language Generation Models with News Articles,"[""omar.florez@aggiemail.usu.edu""]","[""Omar U. Florez""]","[""Fair AI"", ""latent representations"", ""sequence to sequence""]","There are concerns that neural language models may preserve some of the stereotypes of the underlying societies that generate the large corpora needed to train these models. 
For example, gender bias is a significant problem when generating text, and its unintended memorization could impact the user experience of many applications (e.g., the smart-compose feature in Gmail). + +In this paper, we introduce a novel architecture that decouples the representation learning of a neural model from its memory management role. This architecture allows us to update a memory module with an equal ratio across gender types addressing biased correlations directly in the latent space. We experimentally show that our approach can mitigate the gender bias amplification in the automatic generation of articles news while providing similar perplexity values when extending the Sequence2Sequence architecture.",/pdf/7c088642d698bf7fc1a365d44de2915202241dc0.pdf,ICLR,2020,we introduce a novel architecture that allows us to update a memory module with an equal ratio across gender types addressing biased correlations directly in the latent space. +Byg9A24tvB,BkejR0xSPS,1569440000000.0,1583910000000.0,275,Rethinking Softmax Cross-Entropy Loss for Adversarial Robustness,"[""pty17@mails.tsinghua.edu.cn"", ""kunxu.thu@gmail.com"", ""dyp17@mails.tsinghua.edu.cn"", ""duchao0726@gmail.com"", ""ningchen@mail.tsinghua.edu.cn"", ""dcszj@mail.tsinghua.edu.cn""]","[""Tianyu Pang"", ""Kun Xu"", ""Yinpeng Dong"", ""Chao Du"", ""Ning Chen"", ""Jun Zhu""]","[""Trustworthy Machine Learning"", ""Adversarial Robustness"", ""Training Objective"", ""Sample Density""]","Previous work shows that adversarially robust generalization requires larger sample complexity, and the same dataset, e.g., CIFAR-10, which enables good standard accuracy may not suffice to train robust models. Since collecting new training data could be costly, we focus on better utilizing the given data by inducing the regions with high sample density in the feature space, which could lead to locally sufficient samples for robust learning. 
We first formally show that the softmax cross-entropy (SCE) loss and its variants convey inappropriate supervisory signals, which encourage the learned feature points to spread over the space sparsely in training. This inspires us to propose the Max-Mahalanobis center (MMC) loss to explicitly induce dense feature regions in order to benefit robustness. Namely, the MMC loss encourages the model to concentrate on learning ordered and compact representations, which gather around the preset optimal centers for different classes. We empirically demonstrate that applying the MMC loss can significantly improve robustness even under strong adaptive attacks, while keeping state-of-the-art accuracy on clean inputs with little extra computation compared to the SCE loss.",/pdf/93ff9f719c322e9483f6d72f5d1e02a5cd43f1b4.pdf,ICLR,2020,Applying the softmax function in training leads to indirect and unexpected supervision on features. We propose a new training objective to explicitly induce dense feature regions for locally sufficient samples to benefit adversarial robustness. +1dm_j4ciZp,TOLuhAdm8Cz,1601310000000.0,1614990000000.0,724,How much progress have we made in neural network training? A New Evaluation Protocol for Benchmarking Optimizers,"[""~Yuanhao_Xiong1"", ""~Xuanqing_Liu1"", ""~Li-Cheng_Lan1"", ""~Yang_You1"", ""~Si_Si1"", ""~Cho-Jui_Hsieh1""]","[""Yuanhao Xiong"", ""Xuanqing Liu"", ""Li-Cheng Lan"", ""Yang You"", ""Si Si"", ""Cho-Jui Hsieh""]","[""deep learning"", ""optimization"", ""benchmarking""]","Many optimizers have been proposed for training deep neural networks, and they often have multiple hyperparameters, which make it tricky to benchmark their performance. 
In this work, we propose a new benchmarking protocol to evaluate both end-to-end efficiency (training a model from scratch without knowing the best hyperparameter) and data-addition training efficiency (the previously selected hyperparameters are used for periodically re-training the model with newly collected data). For end-to-end efficiency, unlike previous work that assumes random hyperparameter tuning, which over-emphasizes the tuning time, we propose to evaluate with a bandit hyperparameter tuning strategy. A human study is conducted to show our evaluation protocol matches human tuning behavior better than the random search. For data-addition training, we propose a new protocol for assessing the hyperparameter sensitivity to data shift. We then apply the proposed benchmarking framework to 7 optimizers and various tasks, including computer vision, natural language processing, reinforcement learning, and graph mining. Our results show that there is no clear winner across all the tasks. +",/pdf/41949bb057785b4b6c93879f2d53f7aa946f79f7.pdf,ICLR,2021,We propose a new benchmarking framework to evaluate various optimizers. +S1xqRTNtDr,HkgHOCZdDB,1569440000000.0,1577170000000.0,866,Learning a Behavioral Repertoire from Demonstrations,"[""noju@itu.edu"", ""migonzalez@unal.edu.co"", ""dcarbarc@unal.edu.co"", ""jean-baptiste.mouret@inria.fr"", ""sebr@itu.dk""]","[""Niels Justesen"", ""Miguel Gonz\u00e1lez Duque"", ""Daniel Cabarcas Jaramillo"", ""Jean-Baptiste Mouret"", ""Sebastian Risi""]","[""Behavioral Repertoires"", ""Imitation Learning"", ""Deep Learning"", ""Adaptation"", ""StarCraft 2""]","Imitation Learning (IL) is a machine learning approach to learn a policy from a set of demonstrations. IL can be useful to kick-start learning before applying reinforcement learning (RL) but it can also be useful on its own, e.g. to learn to imitate human players in video games. 
However, a major limitation of current IL approaches is that they learn only a single ``""average"" policy based on a dataset that possibly contains demonstrations of numerous different types of behaviors. In this paper, we present a new approach called Behavioral Repertoire Imitation Learning (BRIL) that instead learns a repertoire of behaviors from a set of demonstrations by augmenting the state-action pairs with behavioral descriptions. The outcome of this approach is a single neural network policy conditioned on a behavior description that can be precisely modulated. We apply this approach to train a policy on 7,777 human demonstrations for the build-order planning task in StarCraft II. Dimensionality reduction techniques are applied to construct a low-dimensional behavioral space from the high-dimensional army unit composition of each demonstration. The results demonstrate that the learned policy can be effectively manipulated to express distinct behaviors. Additionally, by applying the UCB1 algorithm, the policy can adapt its behavior -in-between games- to reach a performance beyond that of the traditional IL baseline approach.",/pdf/0ef3206a531452869fde3dda234d09a4517fcc22.pdf,ICLR,2020,BRIL allows a single neural network to learn a repertoire of behaviors from a set of demonstrations that can be precisely modulated. 
+S1gKkpNKwH,rJgjGuYBvB,1569440000000.0,1577170000000.0,309,Reinforcement Learning with Chromatic Networks,"[""xingyousong@google.com"", ""kchoro@google.com"", ""jh3764@columbia.edu"", ""yt2541@columbia.edu"", ""wg2279@columbia.edu"", ""pacchiano@berkeley.edu"", ""stamas@google.com"", ""jaindeepali@google.com"", ""yxyang@google.com""]","[""Xingyou Song"", ""Krzysztof Choromanski"", ""Jack Parker-Holder"", ""Yunhao Tang"", ""Wenbo Gao"", ""Aldo Pacchiano"", ""Tamas Sarlos"", ""Deepali Jain"", ""Yuxiang Yang""]","[""reinforcement"", ""learning"", ""chromatic"", ""networks"", ""partitioning"", ""efficient"", ""neural"", ""architecture"", ""search"", ""weight"", ""sharing"", ""compactification""]","We present a neural architecture search algorithm to construct compact reinforcement learning (RL) policies, by combining ENAS and ES in a highly scalable and intuitive way. By defining the combinatorial search space of NAS to be the set of different edge-partitionings (colorings) into same-weight classes, we represent compact architectures via efficient learned edge-partitionings. For several RL tasks, we manage to learn colorings translating to effective policies parameterized by as few as 17 weight parameters, providing >90 % compression over vanilla policies and 6x compression over state-of-the-art compact policies based on Toeplitz matrices, while still maintaining good reward. We believe that our work is one of the first attempts to propose a rigorous approach to training structured neural network architectures for RL problems that are of interest especially in mobile robotics with limited storage and computational resources.",/pdf/1b37a5f08839f737c3a33a4409efd6c907e3c0ec.pdf,ICLR,2020,"We show that ENAS with ES-optimization in RL is highly scalable, and use it to compactify neural network policies by weight sharing." 
+agyFqcmgl6y,1nk1nZBNe90,1601310000000.0,1614990000000.0,1663,Disentangled Generative Causal Representation Learning,"[""~Xinwei_Shen1"", ""~Furui_Liu1"", ""~Hanze_Dong1"", ""~Qing_LIAN3"", ""~Zhitang_Chen1"", ""~Tong_Zhang2""]","[""Xinwei Shen"", ""Furui Liu"", ""Hanze Dong"", ""Qing LIAN"", ""Zhitang Chen"", ""Tong Zhang""]","[""disentanglement"", ""causality"", ""representation learning"", ""generative model""]","This paper proposes a Disentangled gEnerative cAusal Representation (DEAR) learning method. Unlike existing disentanglement methods that enforce independence of the latent variables, we consider the general case where the underlying factors of interests can be causally correlated. We show that previous methods with independent priors fail to disentangle causally related factors. Motivated by this finding, we propose a new disentangled learning method called DEAR that enables causal controllable generation and causal representation learning. The key ingredient of this new formulation is to use a structural causal model (SCM) as the prior for a bidirectional generative model. A generator is then trained jointly with an encoder using a suitable GAN loss. Theoretical justification on the proposed formulation is provided, which guarantees disentangled causal representation learning under appropriate conditions. 
We conduct extensive experiments on both synthesized and real datasets to demonstrate the effectiveness of DEAR in causal controllable generation, and the benefits of the learned representations for downstream tasks in terms of sample efficiency and distributional robustness.",/pdf/96e9a7d313660be4f4b9caee3c911b84796aa64a.pdf,ICLR,2021, +H1laeJrKDB,HJxbs8juvH,1569440000000.0,1583910000000.0,1524,Controlling generative models with continuous factors of variations,"[""antoine.plumerault@cea.fr"", ""herve.le-borgne@cea.fr"", ""celine.hudelot@centralesupelec.fr""]","[""Antoine Plumerault"", ""Herv\u00e9 Le Borgne"", ""C\u00e9line Hudelot""]","[""Generative models"", ""factor of variation"", ""GAN"", ""beta-VAE"", ""interpretable representation"", ""interpretability""]","Recent deep generative models can provide photo-realistic images as well as visual or textual content embeddings useful to address various tasks of computer vision and natural language processing. Their usefulness is nevertheless often limited by the lack of control over the generative process or the poor understanding of the learned representation. To overcome these major issues, very recent works have shown the interest of studying the semantics of the latent space of generative models. In this paper, we propose to advance on the interpretability of the latent space of generative models by introducing a new method to find meaningful directions in the latent space of any generative model along which we can move to control precisely specific properties of the generated image like position or scale of the object in the image. Our method is weakly supervised and particularly well suited for the search of directions encoding simple transformations of the generated image, such as translation, zoom or color variations. 
We demonstrate the effectiveness of our method qualitatively and quantitatively, both for GANs and variational auto-encoders.",/pdf/4cac3ebd54d9bc4856abd2a90f247a83bff51ea1.pdf,ICLR,2020,A model to control the generation of images with GAN and beta-VAE with regard to scale and position of the objects +OCRKCul3eKN,ngR6PSs-RM,1601310000000.0,1614990000000.0,3585,Addressing Extrapolation Error in Deep Offline Reinforcement Learning,"[""~Caglar_Gulcehre1"", ""~Sergio_G\u00f3mez_Colmenarejo1"", ""~ziyu_wang1"", ""~Jakub_Sygnowski1"", ""~Thomas_Paine1"", ""~Konrad_Zolna1"", ""~Yutian_Chen1"", ""~Matthew_Hoffman1"", ""~Razvan_Pascanu1"", ""~Nando_de_Freitas1""]","[""Caglar Gulcehre"", ""Sergio G\u00f3mez Colmenarejo"", ""ziyu wang"", ""Jakub Sygnowski"", ""Thomas Paine"", ""Konrad Zolna"", ""Yutian Chen"", ""Matthew Hoffman"", ""Razvan Pascanu"", ""Nando de Freitas""]","[""Addressing Extrapolation Error in Deep Offline Reinforcement Learning""]","Reinforcement learning (RL) encompasses both online and offline regimes. Unlike its online counterpart, offline RL agents are trained using logged-data only, without interaction with the environment. Therefore, offline RL is a promising direction for real-world applications, such as healthcare, where repeated interaction with environments is prohibitive. However, since offline RL losses often involve evaluating state-action pairs not well-covered by training data, they can suffer due to the errors introduced when the function approximator attempts to extrapolate those pairs' value. These errors can be compounded by bootstrapping when the function approximator overestimates, leading the value function to *grow unbounded*, thereby crippling learning. In this paper, we introduce a three-part solution to combat extrapolation errors: (i) behavior value estimation, (ii) ranking regularization, and (iii) reparametrization of the value function. 
We provide ample empirical evidence on the effectiveness of our method, showing state of the art performance on the RL Unplugged (RLU) ATARI dataset. Furthermore, we introduce new datasets for bsuite as well as partially observable DeepMind Lab environments, on which our method outperforms state of the art offline RL algorithms. +",/pdf/37d266175a632196cbe79266e09fc8389d86f1b2.pdf,ICLR,2021,We are proposing methods to address extrapolation error in deep offline reinforcement learning. +B1ldb6NKDr,HJlVljPUvH,1569440000000.0,1577170000000.0,380,Multi-Agent Hierarchical Reinforcement Learning for Humanoid Navigation,"[""gberseth@gmail.com"", ""m.brandon.haworth@gmail.com"", ""sm2062@cs.rutgers.edu"", ""mubbasir.kapadia@gmail.com"", ""pfaloutsos@gmail.com""]","[""Glen Berseth"", ""Brandon haworth"", ""Seonghyeon Moon"", ""Mubbasir Kapadia"", ""Petros Faloutsos""]","[""Multi-Agent Reinforcement Learning"", ""Reinforcement Learning"", ""Hierarchical Reinforcement Learning""]","Multi-agent reinforcement learning is a particularly challenging problem. Current +methods have made progress on cooperative and competitive environments with +particle-based agents. Little progress has been made on solutions that could op- +erate in the real world with interaction, dynamics, and humanoid robots. In this +work, we make a significant step in multi-agent models on simulated humanoid +robot navigation by combining Multi-Agent Reinforcement Learning (MARL) +with Hierarchical Reinforcement Learning (HRL). We build on top of founda- +tional prior work in learning low-level physical controllers for locomotion and +add a layer to learn decentralized policies for multi-agent goal-directed collision +avoidance systems. A video of our results on a multi-agent pursuit environment +can be seen here +",/pdf/2f518ada45637a068b015e4b044c60a2cb71a07e.pdf,ICLR,2020,Improving MARL by sharing task agnostic sub policies. 
+HkG3e205K7,HyxmiuYYY7,1538090000000.0,1545860000000.0,1114,Doubly Reparameterized Gradient Estimators for Monte Carlo Objectives,"[""gjt@google.com"", ""jdl404@nyu.edu"", ""shanegu@google.com"", ""cmaddis@google.com""]","[""George Tucker"", ""Dieterich Lawson"", ""Shixiang Gu"", ""Chris J. Maddison""]","[""variational autoencoder"", ""reparameterization trick"", ""IWAE"", ""VAE"", ""RWS"", ""JVI""]","Deep latent variable models have become a popular model choice due to the scalable learning algorithms introduced by (Kingma & Welling 2013, Rezende et al. 2014). These approaches maximize a variational lower bound on the intractable log likelihood of the observed data. Burda et al. (2015) introduced a multi-sample variational bound, IWAE, that is at least as tight as the standard variational lower bound and becomes increasingly tight as the number of samples increases. Counterintuitively, the typical inference network gradient estimator for the IWAE bound performs poorly as the number of samples increases (Rainforth et al. 2018, Le et al. 2018). Roeder et a. (2017) propose an improved gradient estimator, however, are unable to show it is unbiased. We show that it is in fact biased and that the bias can be estimated efficiently with a second application of the reparameterization trick. The doubly reparameterized gradient (DReG) estimator does not suffer as the number of samples increases, resolving the previously raised issues. The same idea can be used to improve many recently introduced training techniques for latent variable models. In particular, we show that this estimator reduces the variance of the IWAE gradient, the reweighted wake-sleep update (RWS) (Bornschein & Bengio 2014), and the jackknife variational inference (JVI) gradient (Nowozin 2018). 
Finally, we show that this computationally efficient, drop-in estimator translates to improved performance for all three objectives on several modeling tasks.",/pdf/19726e292215ba5feaede515a28e98368ebfc974.pdf,ICLR,2019,Doubly reparameterized gradient estimators provide unbiased variance reduction which leads to improved performance. +yZBuYjD8Gd,6CHJ3n9I15d,1601310000000.0,1614990000000.0,675,Are all negatives created equal in contrastive instance discrimination?,"[""tcai2718@gmail.com"", ""~Jonathan_Frankle1"", ""~David_J._Schwab1"", ""~Ari_S._Morcos1""]","[""Tiffany Cai"", ""Jonathan Frankle"", ""David J. Schwab"", ""Ari S. Morcos""]","[""self-supervised learning"", ""contrastive learning"", ""contrastive instance discrimination"", ""negatives"", ""understanding self-supervised learning"", ""ssl""]","Self-supervised learning has recently begun to rival supervised learning on computer vision tasks. Many of the recent approaches have been based on contrastive instance discrimination (CID), in which the network is trained to recognize two augmented versions of the same instance (a query and positive while discriminating against a pool of other instances (negatives). Using MoCo v2 as our testbed, we divided negatives by their difficulty for a given query and studied which difficulty ranges were most important for learning useful representations. We found that a small minority of negatives--just the hardest 5%--were both necessary and sufficient for the downstream task to reach full accuracy. Conversely, the easiest 95% of negatives were unnecessary and insufficient. Moreover, we found that the very hardest 0.1% of negatives were not only unnecessary but also detrimental. Finally, we studied the properties of negatives that affect their hardness, and found that hard negatives were more semantically similar to the query, and that some negatives were more consistently easy or hard than we would expect by chance. 
Together, our results indicate that negatives play heterogeneous roles and CID may benefit from more intelligent negative treatment.",/pdf/0f9b44218b3d61a7d37a3e4dd1ec7ba08fff3e84.pdf,ICLR,2021,"We study the relative importance of negatives in contrastive instance discrimination, finding that only the hardest 5% of negatives are necessary and sufficient for good performance, and these negatives are more semantically similar to the query." +9GsFOUyUPi,#NAME?,1601310000000.0,1616020000000.0,1729,Progressive Skeletonization: Trimming more fat from a network at initialization,"[""~Pau_de_Jorge1"", ""~Amartya_Sanyal1"", ""~Harkirat_Behl1"", ""~Philip_Torr1"", ""~Gr\u00e9gory_Rogez1"", ""~Puneet_K._Dokania1""]","[""Pau de Jorge"", ""Amartya Sanyal"", ""Harkirat Behl"", ""Philip Torr"", ""Gr\u00e9gory Rogez"", ""Puneet K. Dokania""]","[""Pruning"", ""Pruning at initialization"", ""Sparsity""]","Recent studies have shown that skeletonization (pruning parameters) of networks at initialization provides all the practical benefits of sparsity both at inference and training time, while only marginally degrading their performance. However, we observe that beyond a certain level of sparsity (approx 95%), these approaches fail to preserve the network performance, and to our surprise, in many cases perform even worse than trivial random pruning. To this end, we propose an objective to find a skeletonized network with maximum foresight connection sensitivity (FORCE) whereby the trainability, in terms of connection sensitivity, of a pruned network is taken into consideration. We then propose two approximate procedures to maximize our objective (1) Iterative SNIP: allows parameters that were unimportant at earlier stages of skeletonization to become important at later stages; and (2) FORCE: iterative process that allows exploration by allowing already pruned parameters to resurrect at later stages of skeletonization. 
Empirical analysis on a large suite of experiments show that our approach, while providing at least as good performance as other recent approaches on moderate pruning levels, provide remarkably improved performance on high pruning levels (could remove up to 99.5% parameters while keeping the networks trainable).",/pdf/c774cc8cddefe7ca854f3b051602f706c05b7239.pdf,ICLR,2021,"We find performance of current methods for pruning at initialization plummets at high sparsity levels, we study the possible reasons and present a more robust method overall." +r1x_DaVKwH,S1lvx-jvPr,1569440000000.0,1577170000000.0,601,Is Deep Reinforcement Learning Really Superhuman on Atari? Leveling the playing field,"[""marin.toromanoff@mines-paristech.fr"", ""emilie.wirbel@valeo.com"", ""fabien.moutarde@mines-paristech.fr""]","[""Marin Toromanoff"", ""Emilie Wirbel"", ""Fabien Moutarde""]","[""Reinforcement Learning"", ""Deep Learning"", ""Atari benchmark"", ""Reproducibility""]","Consistent and reproducible evaluation of Deep Reinforcement Learning (DRL) is not straightforward. In the Arcade Learning Environment (ALE), small changes in environment parameters such as stochasticity or the maximum allowed play time can lead to very different performance. In this work, we discuss the difficulties of comparing different agents trained on ALE. In order to take a step further towards reproducible and comparable DRL, we introduce SABER, a Standardized Atari BEnchmark for general Reinforcement learning algorithms. Our methodology extends previous recommendations and contains a complete set of environment parameters as well as train and test procedures. We then use SABER to evaluate the current state of the art, Rainbow. Furthermore, we introduce a human world records baseline, and argue that previous claims of expert or superhuman performance of DRL might not be accurate. 
Finally, we propose Rainbow-IQN by extending Rainbow with Implicit Quantile Networks (IQN) leading to new state-of-the-art performance. Source code is available for reproducibility.",/pdf/bdc62ef7a94c10a6b17f2e1d0fb6ea550313a9ff.pdf,ICLR,2020,Introducing a Standardized Atari BEnchmark for general Reinforcement learning algorithms (SABER) and highlight the remaining gap between RL agents and best human players. +rygxdA4YPS,SkeaXjvOPB,1569440000000.0,1577170000000.0,1198,AdaScale SGD: A Scale-Invariant Algorithm for Distributed Training,"[""tbjohns@apple.com"", ""pulkit_agrawal@apple.com"", ""jaygu@apple.com"", ""guestrin@apple.com""]","[""Tyler B. Johnson"", ""Pulkit Agrawal"", ""Haijie Gu"", ""Carlos Guestrin""]","[""Large-batch SGD"", ""large-scale learning"", ""distributed training""]","When using distributed training to speed up stochastic gradient descent, learning rates must adapt to new scales in order to maintain training effectiveness. Re-tuning these parameters is resource intensive, while fixed scaling rules often degrade model quality. We propose AdaScale SGD, a practical and principled algorithm that is approximately scale invariant. By continually adapting to the gradient’s variance, AdaScale often trains at a wide range of scales with nearly identical results. We describe this invariance formally through AdaScale’s convergence bounds. As the batch size increases, the bounds maintain final objective values, while smoothly transitioning away from linear speed-ups. In empirical comparisons, AdaScale trains well beyond the batch size limits of popular “linear learning rate scaling” rules. This includes large-scale training without model degradation for machine translation, image classification, object detection, and speech recognition tasks. The algorithm introduces negligible computational overhead and no tuning parameters, making AdaScale an attractive choice for large-scale training. 
+",/pdf/cd61994f894b01bfa39c76d7e2ff31f4271a6da4.pdf,ICLR,2020,"A practical and principled algorithm for distributed SGD, which simplifies the process of scaling up training" +BJGjOi09t7,SylKn7ccKQ,1538090000000.0,1545360000000.0,385,A Variational Autoencoder for Probabilistic Non-Negative Matrix Factorisation,"[""ses2g14@ecs.soton.ac.uk"", ""apb@ecs.soton.ac.uk"", ""mn@ecs.soton.ac.uk""]","[""Steven Squires"", ""Adam Prugel-Bennett"", ""Mahesan Niranjan""]","[""Non-negative matrix factorisation"", ""Variational autoencoder"", ""Probabilistic""]","We introduce and demonstrate the variational autoencoder (VAE) for probabilistic non-negative matrix factorisation (PAE-NMF). We design a network which can perform non-negative matrix factorisation (NMF) and add in aspects of a VAE to make the coefficients of the latent space probabilistic. By restricting the weights in the final layer of the network to be non-negative and using the non-negative Weibull distribution we produce a probabilistic form of NMF which allows us to generate new data and find a probability distribution that effectively links the latent and input variables. We demonstrate the effectiveness of PAE-NMF on three heterogeneous datasets: images, financial time series and genomic.",/pdf/fbb739d61b4b03aa36c9892bd691cecc79f5e3f2.pdf,ICLR,2019, +BJ6oOfqge,,1478270000000.0,1488390000000.0,176,Temporal Ensembling for Semi-Supervised Learning,"[""slaine@nvidia.com"", ""taila@nvidia.com""]","[""Samuli Laine"", ""Timo Aila""]",[],"In this paper, we present a simple and efficient method for training deep neural networks in a semi-supervised setting where only a small portion of training data is labeled. We introduce self-ensembling, where we form a consensus prediction of the unknown labels using the outputs of the network-in-training on different epochs, and most importantly, under different regularization and input augmentation conditions. 
This ensemble prediction can be expected to be a better predictor for the unknown labels than the output of the network at the most recent training epoch, and can thus be used as a target for training. Using our method, we set new records for two standard semi-supervised learning benchmarks, reducing the (non-augmented) classification error rate from 18.44% to 7.05% in SVHN with 500 labels and from 18.63% to 16.55% in CIFAR-10 with 4000 labels, and further to 5.12% and 12.16% by enabling the standard augmentations. We additionally obtain a clear improvement in CIFAR-100 classification accuracy by using random images from the Tiny Images dataset as unlabeled extra inputs during training. Finally, we demonstrate good tolerance to incorrect labels. +",/pdf/f1be5d5e303e6b6f9d99528910a14f327f62c2c7.pdf,ICLR,2017, +HklJdaNYPH,SJxLctiPPH,1569440000000.0,1577170000000.0,617,Augmenting Self-attention with Persistent Memory,"[""sainbar@fb.com"", ""egrave@fb.com"", ""guismay@fb.com"", ""rvj@fb.com"", ""ajoulin@fb.com""]","[""Sainbayar Sukhbaatar"", ""Edouard Grave"", ""Guillaume Lample"", ""Herve Jegou"", ""Armand Joulin""]","[""transformer"", ""language modeling"", ""self-attention""]","Transformer networks have led to important progress in language modeling and machine translation. These models include two consecutive modules, a feed-forward layer and a self-attention layer. The latter allows the network to capture long term dependencies and is often regarded as the key ingredient in the success of Transformers. Building upon this intuition, we propose a new model that solely consists of attention layers. More precisely, we augment the self-attention layers with persistent memory vectors that play a similar role as the feed-forward layer. Thanks to these vectors, we can remove the feed-forward layer without degrading the performance of a transformer. 
Our evaluation shows the benefits brought by our model on standard character and word level language modeling benchmarks.",/pdf/34427eadb75e7c720d721ee129aa7ad8ff5d6809.pdf,ICLR,2020,A novel attention layer that combines self-attention and feed-forward sublayers of Transformer networks. +SJLlmG-AZ,rkrgmz-0b,1509140000000.0,1519420000000.0,799,Understanding image motion with group representations ,"[""ajaegle@upenn.edu"", ""stephi@seas.upenn.edu"", ""daphnei@seas.upenn.edu"", ""kostas@seas.upenn.edu""]","[""Andrew Jaegle"", ""Stephen Phillips"", ""Daphne Ippolito"", ""Kostas Daniilidis""]","[""vision"", ""motion"", ""recurrent neural networks"", ""self-supervised learning"", ""unsupervised learning"", ""group theory""]","Motion is an important signal for agents in dynamic environments, but learning to represent motion from unlabeled video is a difficult and underconstrained problem. We propose a model of motion based on elementary group properties of transformations and use it to train a representation of image motion. While most methods of estimating motion are based on pixel-level constraints, we use these group properties to constrain the abstract representation of motion itself. We demonstrate that a deep neural network trained using this method captures motion in both synthetic 2D sequences and real-world sequences of vehicle motion, without requiring any labels. Networks trained to respect these constraints implicitly identify the image characteristic of motion in different sequence types. In the context of vehicle motion, this method extracts information useful for localization, tracking, and odometry. 
Our results demonstrate that this representation is useful for learning motion in the general setting where explicit labels are difficult to obtain.",/pdf/39bd1e217d2e87a3863064fe89f5860df73c2f55.pdf,ICLR,2018,We propose a method of using group properties to learn a representation of motion without labels and demonstrate the use of this method for representing 2D and 3D motion. +BJxhijAcY7,r1l5M7q9KX,1538090000000.0,1550860000000.0,658,signSGD with Majority Vote is Communication Efficient and Fault Tolerant,"[""bernstein@caltech.edu"", ""jiaweizhao.zjw@qq.com"", ""kazizzad@uci.edu"", ""anima@caltech.edu""]","[""Jeremy Bernstein"", ""Jiawei Zhao"", ""Kamyar Azizzadenesheli"", ""Anima Anandkumar""]","[""large-scale learning"", ""distributed systems"", ""communication efficiency"", ""convergence rate analysis"", ""robust optimisation""]","Training neural networks on large datasets can be accelerated by distributing the workload over a network of machines. As datasets grow ever larger, networks of hundreds or thousands of machines become economically viable. The time cost of communicating gradients limits the effectiveness of using such large machine counts, as may the increased chance of network faults. We explore a particularly simple algorithm for robust, communication-efficient learning---signSGD. Workers transmit only the sign of their gradient vector to a server, and the overall update is decided by a majority vote. This algorithm uses 32x less communication per iteration than full-precision, distributed SGD. Under natural conditions verified by experiment, we prove that signSGD converges in the large and mini-batch settings, establishing convergence for a parameter regime of Adam as a byproduct. Aggregating sign gradients by majority vote means that no individual worker has too much power. We prove that unlike SGD, majority vote is robust when up to 50% of workers behave adversarially. 
The class of adversaries we consider includes as special cases those that invert or randomise their gradient estimate. On the practical side, we built our distributed training system in PyTorch. Benchmarking against the state of the art collective communications library (NCCL), our framework---with the parameter server housed entirely on one machine---led to a 25% reduction in time for training resnet50 on ImageNet when using 15 AWS p3.2xlarge machines.",/pdf/6933f7682283061fe8e760f1b4d92c465e999e64.pdf,ICLR,2019,"Workers send gradient signs to the server, and the update is decided by majority vote. We show that this algorithm is convergent, communication efficient and fault tolerant, both in theory and in practice." +BW5PuV4V-rL,_xYFToQ-m-w,1601310000000.0,1614990000000.0,3386,Gradient-based training of Gaussian Mixture Models for High-Dimensional Streaming Data,"[""~Alexander_Gepperth1"", ""~Benedikt_Pf\u00fclb1""]","[""Alexander Gepperth"", ""Benedikt Pf\u00fclb""]","[""Gaussian Mixture Models"", ""Stochastic Gradient Descent"", ""Unsupervised Representation Learning"", ""Continual Learning""]","We present an approach for efficiently training Gaussian Mixture Models by SGD on non-stationary, high-dimensional streaming data. +Our training scheme does not require data-driven parameter initialization (e.g., k-means) and has the ability to process high-dimensional samples without numerical problems. +Furthermore, the approach allows mini-batch sizes as low as 1, typical for streaming-data settings, and it is possible to react and adapt to changes in data statistics (concept drift/shift) without catastrophic forgetting. +Major problems in such streaming-data settings are undesirable local optima during early training phases and numerical instabilities due to high data dimensionalities. 
+We introduce an adaptive annealing procedure to address the first problem, +whereas numerical instabilities are eliminated by using an exponential-free approximation to the standard GMM log-likelihood. +Experiments on a variety of visual and non-visual benchmarks show that our SGD approach can be trained completely without, for instance, k-means based centroid initialization, and compares favorably to sEM, an online variant of EM.",/pdf/463e7963cd6c52bb2a0d699be8681fb635534b40.pdf,ICLR,2021,"We present a method to train Gaussian Mixture Models by SGD, which requires no prior k-means initialization as EM does, and is thus feasible for streaming data. " +LmUJqB1Cz8,7TtGJbWJ7rk,1601310000000.0,1616060000000.0,3169,Winning the L2RPN Challenge: Power Grid Management via Semi-Markov Afterstate Actor-Critic,"[""~Deunsol_Yoon1"", ""~Sunghoon_Hong2"", ""~Byung-Jun_Lee1"", ""~Kee-Eung_Kim4""]","[""Deunsol Yoon"", ""Sunghoon Hong"", ""Byung-Jun Lee"", ""Kee-Eung Kim""]","[""power grid management"", ""deep reinforcement learning"", ""graph neural network""]","Safe and reliable electricity transmission in power grids is crucial for modern society. It is thus quite natural that there has been a growing interest in the automatic management of power grids, exemplified by the Learning to Run a Power Network Challenge (L2RPN), modeling the problem as a reinforcement learning (RL) task. However, it is highly challenging to manage a real-world scale power grid, mostly due to the massive scale of its state and action space. In this paper, we present an off-policy actor-critic approach that effectively tackles the unique challenges in power grid management by RL, adopting the hierarchical policy together with the afterstate representation. 
Our agent ranked first in the latest challenge (L2RPN WCCI 2020), being able to avoid disastrous situations while maintaining the highest level of operational efficiency in every test scenario. This paper provides a formal description of the algorithmic aspect of our approach, as well as further experimental studies on diverse power grids.",/pdf/19ea8e1953e4da5a00beede7da663e04f6717020.pdf,ICLR,2021,"We present an off-policy actor-critic approach that effectively tackles the unique challenges in power grid management by reinforcement learning, adopting the hierarchical policy together with the afterstate representation. " +tH6_VWZjoq,pySuivHybg7,1601310000000.0,1611610000000.0,2635,Local Search Algorithms for Rank-Constrained Convex Optimization,"[""~Kyriakos_Axiotis1"", ""sviri@verizonmedia.com""]","[""Kyriakos Axiotis"", ""Maxim Sviridenko""]","[""low rank"", ""rank-constrained convex optimization"", ""matrix completion""]","We propose greedy and local search algorithms for rank-constrained convex optimization, namely solving $\underset{\mathrm{rank}(A)\leq r^*}{\min}\, R(A)$ given a convex function $R:\mathbb{R}^{m\times n}\rightarrow \mathbb{R}$ and a parameter $r^*$. These algorithms consist of repeating two steps: (a) adding a new rank-1 matrix to $A$ and (b) enforcing the rank constraint on $A$. We refine and improve the theoretical analysis of Shalev-Shwartz et al. (2011), and show that if the rank-restricted condition number of $R$ is $\kappa$, a solution $A$ with rank $O(r^*\cdot \min\{\kappa \log \frac{R(\mathbf{0})-R(A^*)}{\epsilon}, \kappa^2\})$ and $R(A) \leq R(A^*) + \epsilon$ can be recovered, where $A^*$ is the optimal solution. This significantly generalizes associated results on sparse convex optimization, as well as rank-constrained convex optimization for smooth functions. We then introduce new practical variants of these algorithms that have superior runtime and recover better solutions in practice. 
We demonstrate the versatility of these methods on a wide range of applications involving matrix completion and robust principal component analysis. +",/pdf/458331518335ba1ae4617033b3d271418ec81093.pdf,ICLR,2021,Efficient greedy and local search algorithms for optimizing a convex objective under a rank constraint. +S1m6h21Cb,Bkma23y0-,1509050000000.0,1518730000000.0,174,The Cramer Distance as a Solution to Biased Wasserstein Gradients,"[""bellemare@google.com"", ""danihelka@google.com"", ""shakir@google.com"", ""balajiln@google.com"", ""shoyer@google.com"", ""munos@google.com""]","[""Marc G. Bellemare"", ""Ivo Danihelka"", ""Will Dabney"", ""Shakir Mohamed"", ""Balaji Lakshminarayanan"", ""Stephan Hoyer"", ""Remi Munos""]","[""Probability metrics"", ""Wasserstein metric"", ""stochastic gradient descent"", ""GANs""]","The Wasserstein probability metric has received much attention from the machine learning community. Unlike the Kullback-Leibler divergence, which strictly measures change in probability, the Wasserstein metric reflects the underlying geometry between outcomes. The value of being sensitive to this geometry has been demonstrated, among others, in ordinal regression and generative modelling, and most recently in reinforcement learning. In this paper we describe three natural properties of probability divergences that we believe reflect requirements from machine learning: sum invariance, scale sensitivity, and unbiased sample gradients. The Wasserstein metric possesses the first two properties but, unlike the Kullback-Leibler divergence, does not possess the third. We provide empirical evidence suggesting this is a serious issue in practice. Leveraging insights from probabilistic forecasting we propose an alternative to the Wasserstein metric, the Cramér distance. We show that the Cramér distance possesses all three desired properties, combining the best of the Wasserstein and Kullback-Leibler divergences. 
We give empirical results on a number of domains comparing these three divergences. To illustrate the practical relevance of the Cramér distance we design a new algorithm, the Cramér Generative Adversarial Network (GAN), and show that it has a number of desirable properties over the related Wasserstein GAN. +",/pdf/1d816bb74648025fb262531c64921bebbfe3a382.pdf,ICLR,2018,"The Wasserstein distance is hard to minimize with stochastic gradient descent, while the Cramer distance can be optimized easily and works just as well." +SyxTZ1HYwB,SJxWT6s_vS,1569440000000.0,1577170000000.0,1559,TWO-STEP UNCERTAINTY NETWORK FOR TASKDRIVEN SENSOR PLACEMENT,"[""yangyang@knights.ucf.edu"", ""yangzhang@knights.ucf.edu"", ""foroosh@cs.ucf.edu"", ""pang@creol.ucf.edu""]","[""Yangyang Sun"", ""Yang Zhang"", ""Hassan Foroosh"", ""Shuo Pang""]","[""Uncertainty Estimation"", ""Sensor Placement"", ""Sequential Control"", ""Adaptive Sensing""]","Optimal sensor placement achieves the minimal cost of sensors while obtaining the prespecified objectives. In this work, we propose a framework for sensor placement to maximize the information gain called Two-step Uncertainty Network(TUN). TUN encodes an arbitrary number of measurements, models the conditional distribution of high dimensional data, and estimates the task-specific information gain at un-observed locations. Experiments on the synthetic data show that TUN outperforms the random sampling strategy and Gaussian Process-based strategy consistently.",/pdf/35d6c95ff9c33c710fb5efb2bf3d81e9ae0603a6.pdf,ICLR,2020,Strategy of sensor placement to maximize the information gain with generative neural network. 
+3GYfIYvNNhL,1_gKnUBEycd,1601310000000.0,1614990000000.0,1252,Characterizing Structural Regularities of Labeled Data in Overparameterized Models,"[""~Ziheng_Jiang1"", ""~Chiyuan_Zhang1"", ""~Kunal_Talwar1"", ""~Michael_Curtis_Mozer1""]","[""Ziheng Jiang"", ""Chiyuan Zhang"", ""Kunal Talwar"", ""Michael Curtis Mozer""]",[],"Humans are accustomed to environments that contain both regularities and exceptions. For example, at most gas stations, one pays prior to pumping, but the occasional rural station does not accept payment in advance. +Likewise, deep neural networks can generalize across instances that share common patterns or structures yet have the capacity to memorize rare or irregular forms. We analyze how individual instances are treated by a model via a consistency score. The score characterizes the expected accuracy for a held-out instance given training sets of varying size sampled from the data distribution. We obtain empirical estimates of this score for individual instances in multiple data-sets, and we show that the score identifies out-of-distribution and mislabeled examples at one end of the continuum and strongly regular examples at the other end. We identify computationally inexpensive proxies to the consistency score using statistics collected during training. We apply the score toward understanding the dynamics of representation learning and to filter outliers during training, and we discuss other potential applications including curriculum learning and active data collection.",/pdf/5f7f242ac474b8e806de4483ef6e237f31ccc305.pdf,ICLR,2021, +BJG__i0qF7,rklIFAt9Ym,1538090000000.0,1545360000000.0,369,Learning to encode spatial relations from natural language,"[""tiago.mpramalho@gmail.com"", ""tkocisky@google.com"", ""fbesse@google.com"", ""aeslami@google.com"", ""melisgl@google.com"", ""fviola@google.com"", ""pblunsom@google.com"", ""kmh@google.com""]","[""Tiago Ramalho"", ""Tomas Kocisky\u200e"", ""Frederic Besse"", ""S. M. 
Ali Eslami"", ""Gabor Melis"", ""Fabio Viola"", ""Phil Blunsom"", ""Karl Moritz Hermann""]","[""generative model"", ""grounded language"", ""scene understanding"", ""natural language""]","Natural language processing has made significant inroads into learning the semantics of words through distributional approaches, however representations learnt via these methods fail to capture certain kinds of information implicit in the real world. In particular, spatial relations are encoded in a way that is inconsistent with human spatial reasoning and lacking invariance to viewpoint changes. We present a system capable of capturing the semantics of spatial relations such as behind, left of, etc from natural language. Our key contributions are a novel multi-modal objective based on generating images of scenes from their textual descriptions, and a new dataset on which to train it. We demonstrate that internal representations are robust to meaning preserving transformations of descriptions (paraphrase invariance), while viewpoint invariance is an emergent property of the system.",/pdf/d32d75aba7bac2c452d66f92b85bd271973eb086.pdf,ICLR,2019,We introduce a system capable of capturing the semantics of spatial relations by grounding representation learning in vision. 
+QKbS9KXkE_y,AX6FmMd8gOC,1601310000000.0,1614990000000.0,3646,Data-efficient Hindsight Off-policy Option Learning,"[""~Markus_Wulfmeier1"", ""~Dushyant_Rao1"", ""~Roland_Hafner1"", ""~Thomas_Lampe1"", ""~Abbas_Abdolmaleki3"", ""thertweck@google.com"", ""~Michael_Neunert1"", ""~Dhruva_Tirumala1"", ""~Noah_Yamamoto_Siegel1"", ""~Nicolas_Heess1"", ""~Martin_Riedmiller1""]","[""Markus Wulfmeier"", ""Dushyant Rao"", ""Roland Hafner"", ""Thomas Lampe"", ""Abbas Abdolmaleki"", ""Tim Hertweck"", ""Michael Neunert"", ""Dhruva Tirumala"", ""Noah Yamamoto Siegel"", ""Nicolas Heess"", ""Martin Riedmiller""]","[""Hierarchical Reinforcement Learning"", ""Off-Policy"", ""Abstractions"", ""Data-Efficiency""]","Hierarchical approaches for reinforcement learning aim to improve data efficiency and accelerate learning by incorporating different abstractions. We introduce Hindsight Off-policy Options (HO2), an efficient off-policy option learning algorithm, and isolate the impact of action and temporal abstraction in the option framework by comparing flat policies, mixture policies without temporal abstraction, and finally option policies; all with comparable policy optimization. When aiming for data efficiency, we demonstrate the importance of off-policy optimization, as even flat policies trained off-policy can outperform on-policy option methods. In addition, off-policy training and backpropagation through a dynamic programming inference procedure -- through time and through the policy components for every time-step -- enable us to train all components' parameters independently of the data-generating behavior policy. We continue to illustrate challenges in off-policy option learning and the related importance of trust-region constraints. Experimentally, we demonstrate that HO2 outperforms existing option learning methods and that both action and temporal abstraction provide strong benefits in particular in more demanding simulated robot manipulation tasks from raw pixel inputs. 
Finally, we develop an intuitive extension to encourage temporal abstraction and investigate differences in its impact between learning from scratch and using pre-trained options. ",/pdf/9c326f6583bfbc5ba0236f55057ba2f320273c24.pdf,ICLR,2021,"We develop an efficient off-policy option learning method, isolate the impact of action and temporal abstraction, demonstrate the importance and challenges of off-policy learning and solve challenging tasks from raw pixels." +B1hYRMbCW,B1bORM-Rb,1509140000000.0,1519380000000.0,1066,On the regularization of Wasserstein GANs,"[""henning.petzka@iais.fraunhofer.de"", ""asja.fischer@gmail.com"", ""lukovnik@cs.uni-bonn.de""]","[""Henning Petzka"", ""Asja Fischer"", ""Denis Lukovnikov""]",[],"Since their invention, generative adversarial networks (GANs) have become a popular approach for learning to model a distribution of real (unlabeled) data. Convergence problems during training are overcome by Wasserstein GANs which minimize the distance between the model and the empirical distribution in terms of a different metric, but thereby introduce a Lipschitz constraint into the optimization problem. A simple way to enforce the Lipschitz constraint on the class of functions, which can be modeled by the neural network, is weight clipping. Augmenting the loss by a regularization term that penalizes the deviation of the gradient norm of the critic (as a function of the network's input) from one, was proposed as an alternative that improves training. We present theoretical arguments why using a weaker regularization term enforcing the Lipschitz constraint is preferable. 
These arguments are supported by experimental results on several data sets.",/pdf/f907f04706a14a4797db32215e0ad6323e1da6e0.pdf,ICLR,2018,A new regularization term can improve your training of wasserstein gans +S1eOHo09KX,r1ggdEZCuX,1538090000000.0,1550460000000.0,99,Opportunistic Learning: Budgeted Cost-Sensitive Learning from Data Streams,"[""mkachuee@cs.ucla.edu"", ""orpgol@cs.ucla.edu"", ""kimmo@cs.ucla.edu"", ""sajad.darabi@cs.ucla.edu"", ""majid@cs.ucla.edu""]","[""Mohammad Kachuee"", ""Orpaz Goldstein"", ""Kimmo K\u00e4rkk\u00e4inen"", ""Sajad Darabi"", ""Majid Sarrafzadeh""]","[""Cost-Aware Learning"", ""Feature Acquisition"", ""Reinforcement Learning"", ""Stream Learning"", ""Deep Q-Learning""]","In many real-world learning scenarios, features are only acquirable at a cost constrained under a budget. In this paper, we propose a novel approach for cost-sensitive feature acquisition at the prediction-time. The suggested method acquires features incrementally based on a context-aware feature-value function. We formulate the problem in the reinforcement learning paradigm, and introduce a reward function based on the utility of each feature. Specifically, MC dropout sampling is used to measure expected variations of the model uncertainty which is used as a feature-value function. Furthermore, we suggest sharing representations between the class predictor and value function estimator networks. The suggested approach is completely online and is readily applicable to stream learning setups. The solution is evaluated on three different datasets including the well-known MNIST dataset as a benchmark as well as two cost-sensitive datasets: Yahoo Learning to Rank and a dataset in the medical domain for diabetes classification. According to the results, the proposed method is able to efficiently acquire features and make accurate predictions. 
",/pdf/eb9189687175bb57690f0a8fa707f245189c0d77.pdf,ICLR,2019,An online algorithm for cost-aware feature acquisition and prediction +HyRVBzap-,S1C4rfpTZ,1508870000000.0,1519550000000.0,67,Cascade Adversarial Machine Learning Regularized with a Unified Embedding,"[""taesik.na@gatech.edu"", ""jonghwan.ko@gatech.edu"", ""saibal.mukhopadhyay@ece.gatech.edu""]","[""Taesik Na"", ""Jong Hwan Ko"", ""Saibal Mukhopadhyay""]","[""adversarial machine learning"", ""embedding"", ""regularization"", ""adversarial attack""]","Injecting adversarial examples during training, known as adversarial training, can improve robustness against one-step attacks, but not for unknown iterative attacks. To address this challenge, we first show iteratively generated adversarial images easily transfer between networks trained with the same strategy. Inspired by this observation, we propose cascade adversarial training, which transfers the knowledge of the end results of adversarial training. We train a network from scratch by injecting iteratively generated adversarial images crafted from already defended networks in addition to one-step adversarial images from the network being trained. We also propose to utilize embedding space for both classification and low-level (pixel-level) similarity learning to ignore unknown pixel level perturbation. During training, we inject adversarial images without replacing their corresponding clean images and penalize the distance between the two embeddings (clean and adversarial). Experimental results show that cascade adversarial training together with our proposed low-level similarity learning efficiently enhances the robustness against iterative attacks, but at the expense of decreased robustness against one-step attacks. 
We show that combining those two techniques can also improve robustness under the worst case black box attack scenario.",/pdf/cfd7384d620080eb45f4aa3140a32672c81253d5.pdf,ICLR,2018,Cascade adversarial training + low level similarity learning improve robustness against both white box and black box attacks. +rJguRyBYvr,ryxtFPytwB,1569440000000.0,1577170000000.0,2029,Improved Detection of Adversarial Attacks via Penetration Distortion Maximization,"[""shairoz@cs.technion.ac.il"", ""elidan@google.com"", ""elyaniv@google.com""]","[""Shai Rozenberg"", ""Gal Elidan"", ""Ran El-Yaniv""]","[""Adversarial Examples"", ""Adversarial Attacks"", ""Adversarial Defense"", ""White-Box threat models""]","This paper is concerned with the defense of deep models against adversarial +attacks. We develop an adversarial detection method, which is inspired by the +certificate defense approach, and captures the idea of separating class clusters in the +embedding space so as to increase the margin. The resulting defense is intuitive, +effective, scalable and can be integrated into any given neural classification model. +Our method demonstrates state-of-the-art detection performance under all threat +models.",/pdf/9cf4ef94dc8e0162f4999ac930dc7e6bcf94915e.pdf,ICLR,2020,Adversarial detection method based on separating class clusters in the embedding space. +Syfz6sC9tQ,H1gfLV9qtm,1538090000000.0,1545360000000.0,783,Generative Feature Matching Networks,"[""cicerons@us.ibm.com"", ""inkit.padhi@ibm.com"", ""pdognin@us.ibm.com"", ""mroueh@us.ibm.com""]","[""Cicero Nogueira dos Santos"", ""Inkit Padhi"", ""Pierre Dognin"", ""Youssef Mroueh""]","[""Generative Deep Neural Networks"", ""Feature Matching"", ""Maximum Mean Discrepancy"", ""Generative Adversarial Networks""]","We propose a non-adversarial feature matching-based approach to train generative models. 
Our approach, Generative Feature Matching Networks (GFMN), leverages pretrained neural networks such as autoencoders and ConvNet classifiers to perform feature extraction. We perform an extensive number of experiments with different challenging datasets, including ImageNet. Our experimental results demonstrate that, due to the expressiveness of the features from pretrained ImageNet classifiers, even by just matching first order statistics, our approach can achieve state-of-the-art results for challenging benchmarks such as CIFAR10 and STL10.",/pdf/684ef9e9a7e6198def65b188ddc7e58ee91981c8.pdf,ICLR,2019,A new non-adversarial feature matching-based approach to train generative models that achieves state-of-the-art results. +Bke4KsA5FX,Byxo9EP5YQ,1538090000000.0,1550840000000.0,437,Generative Code Modeling with Graphs,"[""mabrocks@microsoft.com"", ""miallama@microsoft.com"", ""algaunt@microsoft.com"", ""polozov@microsoft.com""]","[""Marc Brockschmidt"", ""Miltiadis Allamanis"", ""Alexander L. Gaunt"", ""Oleksandr Polozov""]","[""Generative Model"", ""Source Code"", ""Graph Learning""]","Generative models for source code are an interesting structured prediction problem, requiring reasoning about both hard syntactic and semantic constraints as well as about natural, likely programs. We present a novel model for this problem that uses a graph to represent the intermediate state of the generated output. Our model generates code by interleaving grammar-driven expansion steps with graph augmentation and neural message passing steps. 
An experimental evaluation shows that our new model can generate semantically meaningful expressions, outperforming a range of strong baselines.",/pdf/4eb23f9e8372ae16ee1c20b1d3a80268e1b501f0.pdf,ICLR,2019,Representing programs as graphs including semantics helps when generating programs +ohz3OEhVcs,jNRJ6RmM-0Z,1601310000000.0,1614990000000.0,732,Graph Autoencoders with Deconvolutional Networks,"[""~Jia_Li4"", ""jwyu@se.cuhk.edu.hk"", ""~Da-Cheng_Juan1"", ""~HAN_Zhichao1"", ""arjung@google.com"", ""~Hong_Cheng1"", ""~Andrew_Tomkins2""]","[""Jia Li"", ""Jianwei Yu"", ""Da-Cheng Juan"", ""HAN Zhichao"", ""Arjun Gopalan"", ""Hong Cheng"", ""Andrew Tomkins""]","[""graph autoencoders"", ""graph deconvolutional networks""]","Recent studies have indicated that Graph Convolutional Networks (GCNs) act as a $\textit{low pass}$ filter in spectral domain and encode smoothed node representations. In this paper, we consider their opposite, namely Graph Deconvolutional Networks (GDNs) that reconstruct graph signals from smoothed node representations. We motivate the design of Graph Deconvolutional Networks via a combination of inverse filters in spectral domain and de-noising layers in wavelet domain, as the inverse operation results in a $\textit{high pass}$ filter and may amplify the noise. Based on the proposed GDN, we further propose a graph autoencoder framework that first encodes smoothed graph representations with GCN and then decodes accurate graph signals with GDN. 
We demonstrate the effectiveness of the proposed method on several tasks including unsupervised graph-level representation, social recommendation and graph generation.",/pdf/922581d7032f7f06d9f934717eb97d594762822c.pdf,ICLR,2021, +zI38PZQHWKj,PEKGzjDjfTq,1601310000000.0,1614990000000.0,390,Feature-Robust Optimal Transport for High-Dimensional Data,"[""mathis.petrovich@gmail.com"", ""cs.chaoliang@zju.edu.cn"", ""~Ryoma_Sato1"", ""~Yanbin_Liu1"", ""~Yao-Hung_Hubert_Tsai1"", ""~Linchao_Zhu1"", ""~Yi_Yang4"", ""~Ruslan_Salakhutdinov1"", ""~Makoto_Yamada3""]","[""Mathis Petrovich"", ""Chao Liang"", ""Ryoma Sato"", ""Yanbin Liu"", ""Yao-Hung Hubert Tsai"", ""Linchao Zhu"", ""Yi Yang"", ""Ruslan Salakhutdinov"", ""Makoto Yamada""]","[""Optimal Transport"", ""feature selection"", ""semantic correspondence""]","Optimal transport is a machine learning problem with applications including distribution comparison, feature selection, and generative adversarial networks. In this paper, we propose feature-robust optimal transport (FROT) for high-dimensional data, which solves high-dimensional OT problems using feature selection to avoid the curse of dimensionality. Specifically, we find a transport plan with discriminative features. To this end, we formulate the FROT problem as a min--max optimization problem. We then propose a convex formulation of the FROT problem and solve it using a Frank--Wolfe-based optimization algorithm, whereby the subproblem can be efficiently solved using the Sinkhorn algorithm. Since FROT finds the transport plan from selected features, it is robust to noise features. To show the effectiveness of FROT, we propose using the FROT algorithm for the layer selection problem in deep neural networks for semantic correspondence. By conducting synthetic and benchmark experiments, we demonstrate that the proposed method can find a strong correspondence by determining important layers. 
We show that the FROT algorithm achieves state-of-the-art performance in real-world semantic correspondence datasets.",/pdf/e3cd4ae10b6a27e4d902914fcb6f609f1c2f3de6.pdf,ICLR,2021,We propose an optimal transport method for high-dimensional data and applied it to semantic correspondence problems. +rylSzl-R-,rJ1rfxZAZ,1509130000000.0,1519560000000.0,559,On Unifying Deep Generative Models,"[""zhitinghu@gmail.com"", ""yangtze2301@gmail.com"", ""rsalakhu@cs.cmu.edu"", ""epxing@cs.cmu.edu""]","[""Zhiting Hu"", ""Zichao Yang"", ""Ruslan Salakhutdinov"", ""Eric P. Xing""]","[""deep generative models"", ""generative adversarial networks"", ""variational autoencoders"", ""variational inference""]","Deep generative models have achieved impressive success in recent years. Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), as powerful frameworks for deep generative model learning, have largely been considered as two distinct paradigms and received extensive independent studies respectively. This paper aims to establish formal connections between GANs and VAEs through a new formulation of them. We interpret sample generation in GANs as performing posterior inference, and show that GANs and VAEs involve minimizing KL divergences of respective posterior and inference distributions with opposite directions, extending the two learning phases of classic wake-sleep algorithm, respectively. The unified view provides a powerful tool to analyze a diverse set of existing model variants, and enables to transfer techniques across research lines in a principled way. For example, we apply the importance weighting method in VAE literatures for improved GAN learning, and enhance VAEs with an adversarial mechanism that leverages generated samples. Experiments show generality and effectiveness of the transfered techniques. 
",/pdf/fac1dcc1ed1f2dbc8a5857311ef0397d72a5b89d.pdf,ICLR,2018,A unified statistical view of the broad class of deep generative models +rJx7wlSYvB,rkgRBixFDH,1569440000000.0,1577170000000.0,2354,Differentiable Bayesian Neural Network Inference for Data Streams,"[""namuk.park@yonsei.ac.kr"", ""taekyu.lee@yonsei.ac.kr"", ""songkuk@yonsei.ac.kr""]","[""Namuk Park"", ""Taekyu Lee"", ""Songkuk Kim""]","[""Bayesian neural network"", ""approximate predictive inference"", ""data stream"", ""histogram""]","While deep neural networks (NNs) do not provide the confidence of its prediction, Bayesian neural network (BNN) can estimate the uncertainty of the prediction. However, BNNs have not been widely used in practice due to the computational cost of predictive inference. This prohibitive computational cost is a hindrance especially when processing stream data with low-latency. To address this problem, we propose a novel model which approximate BNNs for data streams. Instead of generating separate prediction for each data sample independently, this model estimates the increments of prediction for a new data sample from the previous predictions. The computational cost of this model is almost the same as that of non-Bayesian deep NNs. Experiments including semantic segmentation on real-world data show that this model performs significantly faster than BNNs, estimating uncertainty comparable to the results of BNNs. 
+",/pdf/6418ed64b2ea906d18bbde58011c614cef0d4794.pdf,ICLR,2020,We propose approximate Bayesian neural network that predicts results for data stream as fast as deterministic deep neural network does +BkM27IxR-,B1bnXUe0W,1509090000000.0,1518730000000.0,272,Learning to Optimize Neural Nets,"[""ke.li@eecs.berkeley.edu"", ""jitendram@google.com""]","[""Ke Li"", ""Jitendra Malik""]","[""Learning to learn"", ""meta-learning"", ""reinforcement learning"", ""optimization""]","Learning to Optimize is a recently proposed framework for learning optimization algorithms using reinforcement learning. In this paper, we explore learning an optimization algorithm for training shallow neural nets. Such high-dimensional stochastic optimization problems present interesting challenges for existing reinforcement learning algorithms. We develop an extension that is suited to learning optimization algorithms in this setting and demonstrate that the learned optimization algorithm consistently outperforms other known optimization algorithms even on unseen tasks and is robust to changes in stochasticity of gradients and the neural net architecture. More specifically, we show that an optimization algorithm trained with the proposed method on the problem of training a neural net on MNIST generalizes to the problems of training neural nets on the Toronto Faces Dataset, CIFAR-10 and CIFAR-100. ",/pdf/5d1212e6c1f455da823b960470352fc6badd5f9e.pdf,ICLR,2018,We learn an optimization algorithm that generalizes to unseen tasks +Bklzkh0qFm,S1xxzkCcFX,1538090000000.0,1545360000000.0,964,Relational Graph Attention Networks,"[""dan.busbridge@gmail.com"", ""danesherbs@gmail.com"", ""p.cavallo85@gmail.com"", ""nils.hammerla@babylonhealth.com""]","[""Dan Busbridge"", ""Dane Sherburn"", ""Pietro Cavallo"", ""Nils Y. 
Hammerla""]","[""RGCN"", ""attention"", ""graph convolutional networks"", ""semi-supervised learning"", ""graph classification"", ""molecules""]","We investigate Relational Graph Attention Networks, a class of models that extends non-relational graph attention mechanisms to incorporate relational information, opening up these methods to a wider variety of problems. A thorough evaluation of these models is performed, and comparisons are made against established benchmarks. To provide a meaningful comparison, we retrain Relational Graph Convolutional Networks, the spectral counterpart of Relational Graph Attention Networks, and evaluate them under the same conditions. We find that Relational Graph Attention Networks perform worse than anticipated, although some configurations are marginally beneficial for modelling molecular properties. We provide insights as to why this may be, and suggest both modifications to evaluation strategies, as well as directions to investigate for future work.",/pdf/a26a2303cc1e28f1b4936c7954825c9ef7ee5e72.pdf,ICLR,2019,We propose a new model for relational graphs and evaluate it on relational transductive and inductive tasks. +rJfMusFll,,1478240000000.0,1486760000000.0,124,Batch Policy Gradient Methods for Improving Neural Conversation Models,"[""kandasamy@cmu.edu"", ""yorambac@gmail.com"", ""ryoto@microsoft.com"", ""dtarlow@microsoft.com"", ""dacart@microsoft.com""]","[""Kirthevasan Kandasamy"", ""Yoram Bachrach"", ""Ryota Tomioka"", ""Daniel Tarlow"", ""David Carter""]","[""Natural language processing"", ""Reinforcement Learning""]","We study reinforcement learning of chat-bots with recurrent neural network +architectures when the rewards are noisy and expensive to +obtain. For instance, a chat-bot used in automated customer service support can +be scored by quality assurance agents, but this process can be expensive, time consuming +and noisy. 
+Previous reinforcement learning work for natural language uses on-policy updates +and/or is designed for on-line learning settings. +We demonstrate empirically that such strategies are not appropriate for this setting +and develop an off-policy batch policy gradient method (\bpg). +We demonstrate the efficacy of our method via a series of +synthetic experiments and an Amazon Mechanical Turk experiment on +a restaurant recommendations dataset. + +",/pdf/19a87595022232faa9fb28ec94bb6886410ecbca.pdf,ICLR,2017, +5UY7aZ_h37,6Z6qIwZya3x,1601310000000.0,1614990000000.0,1572,Transferring Inductive Biases through Knowledge Distillation,"[""~Samira_Abnar1"", ""~Mostafa_Dehghani1"", ""~Willem_H._Zuidema1""]","[""Samira Abnar"", ""Mostafa Dehghani"", ""Willem H. Zuidema""]","[""Knowledge Distillation"", ""Inductive Biases"", ""Analyzing and Understanding Neural Networks"", ""Recurrent Inductive Bias""]","Having the right inductive biases can be crucial in many tasks or scenarios where data or computing resources are a limiting factor, or where training data is not perfectly representative of the conditions at test time. However, defining, designing, and efficiently adapting inductive biases is not necessarily straightforward. Inductive biases of a model affect its generalisation behaviour and influence the solution it converges to from different aspects. In this paper, we investigate the power of knowledge distillation in transferring the effects of inductive biases of a teacher model to a student model, when they have different architectures. +We consider different families of models: LSTMs vs. Transformers and CNNs vs. MLPs, in the context of tasks and scenarios with linguistics and vision applications, where having the right inductive biases is critical. We train our models in different setups: no knowledge distillation, self-distillation, and distillation using a teacher with a better inductive bias for the task at hand. 
We show that in the later setup, compared to no distillation and self-distillation, we can not only improve the performance of the students, but also the solutions they converge become similar to their teachers with respect to a wide range of properties, including different task-specific performance metrics, per sample behaviour of the models, representational similarity and how the representational space of the models evolve during training, performance on out-of-distribution datasets, confidence calibration, and finally whether the converged solutions fall within the same basins of attractions.",/pdf/cf7588c6cd6761ce3863ffe7a487febfa94edb65.pdf,ICLR,2021,We study the effect of inductive biases on the solutions the models converge to and investigate to what extent the effect of inductive biases is transferred through knowledge distillation. +ajOrOhQOsYx,IwbVjOMmQ7G,1601310000000.0,1611610000000.0,2140,A Wigner-Eckart Theorem for Group Equivariant Convolution Kernels,"[""~Leon_Lang1"", ""~Maurice_Weiler1""]","[""Leon Lang"", ""Maurice Weiler""]","[""Group Equivariant Convolution"", ""Steerable Kernel"", ""Quantum Mechanics"", ""Wigner-Eckart Theorem"", ""Representation Theory"", ""Harmonic Analysis"", ""Peter-Weyl Theorem""]","Group equivariant convolutional networks (GCNNs) endow classical convolutional networks with additional symmetry priors, which can lead to a considerably improved performance. Recent advances in the theoretical description of GCNNs revealed that such models can generally be understood as performing convolutions with $G$-steerable kernels, that is, kernels that satisfy an equivariance constraint themselves. While the $G$-steerability constraint has been derived, it has to date only been solved for specific use cases - a general characterization of $G$-steerable kernel spaces is still missing. This work provides such a characterization for the practically relevant case of $G$ being any compact group. 
Our investigation is motivated by a striking analogy between the constraints underlying steerable kernels on the one hand and spherical tensor operators from quantum mechanics on the other hand. By generalizing the famous Wigner-Eckart theorem for spherical tensor operators, we prove that steerable kernel spaces are fully understood and parameterized in terms of 1) generalized reduced matrix elements, 2) Clebsch-Gordan coefficients, and 3) harmonic basis functions on homogeneous spaces.",/pdf/2b834ab8bd8e5862e1cbaac32cff3d6594975b30.pdf,ICLR,2021,We parameterize equivariant convolution kernels by proving a generalization of the Wigner-Eckart theorem for spherical tensor operators. +J4XaMT9OcZ,rqB1GzwU_CE,1601310000000.0,1614990000000.0,2427,Mitigating Deep Double Descent by Concatenating Inputs,"[""~John_Chen3"", ""~Qihan_Wang1"", ""~Anastasios_Kyrillidis2""]","[""John Chen"", ""Qihan Wang"", ""Anastasios Kyrillidis""]","[""deep double descent"", ""feedforward neural network"", ""image classificaiton""]","The double descent curve is one of the most intriguing properties of deep neural networks. It contrasts the classical bias-variance curve with the behavior of modern neural networks, occurring where the number of samples nears the number of parameters. In this work, we explore the connection between the double descent phenomena and the number of samples in the deep neural network setting. In particular, we propose a construction which augments the existing dataset by artificially increasing the number of samples. This construction empirically mitigates the double descent curve in this setting. We reproduce existing work on deep double descent, and observe a smooth descent into the overparameterized region for our construction. 
This occurs both with respect to the model size, and with respect to the number epochs.",/pdf/e052c9f9157bbb5b14fb3897d3109a3b36064806.pdf,ICLR,2021,We introduce a construction to artificially increase the size of the dataset which mitigates deep double descent in a variety of settings. +z9k8BWL-_2u,TY_H27mm2uE,1601310000000.0,1616050000000.0,1665,Statistical inference for individual fairness,"[""~Subha_Maity1"", ""sxue@umich.edu"", ""~Mikhail_Yurochkin1"", ""~Yuekai_Sun1""]","[""Subha Maity"", ""Songkai Xue"", ""Mikhail Yurochkin"", ""Yuekai Sun""]",[],"As we rely on machine learning (ML) models to make more consequential decisions, the issue of ML models perpetuating unwanted social biases has come to the fore of the public's and the research community's attention. In this paper, we focus on the problem of detecting violations of individual fairness in ML models. We formalize the problem as measuring the susceptibility of ML models against a form of adversarial attack and develop a suite of inference tools for the adversarial loss. The tools allow practitioners to assess the individual fairness of ML models in a statistically-principled way: form confidence intervals for the adversarial loss and test hypotheses of model fairness with (asymptotic) non-coverage/Type I error rate control. We demonstrate the utility of our tools in a real-world case study.",/pdf/ac8d0be951fdef76c35e87af88b3fb817cadfde9.pdf,ICLR,2021, +r28GdiQF7vM,YaMZTQ7VXt2,1601310000000.0,1635170000000.0,546,Sharper Generalization Bounds for Learning with Gradient-dominated Objective Functions,"[""~Yunwen_Lei1"", ""~Yiming_Ying1""]","[""Yunwen Lei"", ""Yiming Ying""]","[""generalization bounds"", ""non-convex learning""]","Stochastic optimization has become the workhorse behind many successful machine learning applications, which motivates a lot of theoretical analysis to understand its empirical behavior. 
As a comparison, there is far less work to study the generalization behavior especially in a non-convex learning setting. In this paper, we study the generalization behavior of stochastic optimization by leveraging the algorithmic stability for learning with $\beta$-gradient-dominated objective functions. We develop generalization bounds of the order $O(1/(n\beta))$ plus the convergence rate of the optimization algorithm, where $n$ is the sample size. Our stability analysis significantly improves the existing non-convex analysis by removing the bounded gradient assumption and implying better generalization bounds. We achieve this improvement by exploiting the smoothness of loss functions instead of the Lipschitz condition in Charles & Papailiopoulos (2018). We apply our general results to various stochastic optimization algorithms, which show clearly how the variance-reduction techniques improve not only training but also generalization. Furthermore, our discussion explains how interpolation helps generalization for highly expressive models.",/pdf/43ac8e3a506c7ad54162957c4698675f659e050a.pdf,ICLR,2021,We develop sharper generalization bounds for learning with gradient-dominated objective functions. +rJx9vaVtDS,Hke93moPwB,1569440000000.0,1577170000000.0,605,Individualised Dose-Response Estimation using Generative Adversarial Nets,"[""ioana.bica@eng.ox.ac.uk"", ""james.jordon@wolfson.ox.ac.uk"", ""mschaar@turing.ac.uk""]","[""Ioana Bica"", ""James Jordon"", ""Mihaela van der Schaar""]","[""individualised dose-response estimation"", ""treatment effects"", ""causal inference"", ""generative adversarial networks""]","The problem of estimating treatment responses from observational data is by now a well-studied one. Less well studied, though, is the problem of treatment response estimation when the treatments are accompanied by a continuous dosage parameter. 
In this paper, we tackle this lesser studied problem by building on a modification of the generative adversarial networks (GANs) framework that has already demonstrated effectiveness in the former problem. Our model, DRGAN, is flexible, capable of handling multiple treatments each accompanied by a dosage parameter. The key idea is to use a significantly modified GAN model to generate entire dose-response curves for each sample in the training data which will then allow us to use standard supervised methods to learn an inference model capable of estimating these curves for a new sample. Our model consists of 3 blocks: (1) a generator, (2) a discriminator, (3) an inference block. In order to address the challenge presented by the introduction of dosages, we propose novel architectures for both our generator and discriminator. We model the generator as a multi-task deep neural network. In order to address the increased complexity of the treatment space (because of the addition of dosages), we develop a hierarchical discriminator consisting of several networks: (a) a treatment discriminator, (b) a dosage discriminator for each treatment. In the experiments section, we introduce a new semi-synthetic data simulation for use in the dose-response setting and demonstrate improvements over the existing benchmark models.",/pdf/5ef517dd3a07d90dba71454f18bc13fd337e1326.pdf,ICLR,2020, +Bkx8OiRcYX,ByeIosY5tX,1538090000000.0,1545360000000.0,358,Countdown Regression: Sharp and Calibrated Survival Predictions,"[""avati@cs.stanford.edu"", ""tonyduan@cs.stanford.edu"", ""sharonz@cs.stanford.edu""]","[""Anand Avati"", ""Tony Duan"", ""Sharon Zhou"", ""Kenneth Jung"", ""Nigam Shah"", ""Andrew Ng""]",[],"Personalized probabilistic forecasts of time to event (such as mortality) can be crucial in decision making, especially in the clinical setting. 
Inspired by ideas from the meteorology literature, we approach this problem through the paradigm of maximizing sharpness of prediction distributions, subject to calibration. In regression problems, it has been shown that optimizing the continuous ranked probability score (CRPS) instead of maximum likelihood leads to sharper prediction distributions while maintaining calibration. We introduce the Survival-CRPS, a generalization of the CRPS to the time to event setting, and present right-censored and interval-censored variants. To holistically evaluate the quality of predicted distributions over time to event, we present the scale agnostic Survival-AUPRC evaluation metric, an analog to area under the precision-recall curve. We apply these ideas by building a recurrent neural network for mortality prediction, using an Electronic Health Record dataset covering millions of patients. We demonstrate significant benefits in models trained by the Survival-CRPS objective instead of maximum likelihood.",/pdf/5a6093888e79a44b5ae0e03fd746ce7314f52dec.pdf,ICLR,2019, +Uh0T_Q0pg7r,SSsvSRB2MqJ,1601310000000.0,1614990000000.0,2484,Active Learning in CNNs via Expected Improvement Maximization,"[""~Udai_G._Nagpal1"", ""~David_A._Knowles1""]","[""Udai G. Nagpal"", ""David A. Knowles""]","[""active learning"", ""batch-mode active learning"", ""deep learning"", ""convolutional neural networks"", ""supervised learning"", ""regression"", ""classification"", ""MC dropout"", ""computer vision"", ""computational biology""]","Deep learning models such as Convolutional Neural Networks (CNNs) have demonstrated high levels of effectiveness in a variety of domains, including computer vision and more recently, computational biology. However, training effective models often requires assembling and/or labeling large datasets, which may be prohibitively time-consuming or costly. 
Pool-based active learning techniques have the potential to mitigate these issues, leveraging models trained on limited data to selectively query unlabeled data points from a pool in an attempt to expedite the learning process. Here we present ""Dropout-based Expected IMprOvementS"" (DEIMOS), a flexible and computationally-efficient approach to active learning that queries points that are expected to maximize the model's improvement across a representative sample of points. The proposed framework enables us to maintain a prediction covariance matrix capturing model uncertainty, and to dynamically update this matrix in order to generate diverse batches of points in the batch-mode setting. Our active learning results demonstrate that DEIMOS outperforms several existing baselines across multiple regression and classification tasks taken from computer vision and genomics.",/pdf/01719ebc0cd4b20dde88a2b7c7e3fb33ef812923.pdf,ICLR,2021,"An efficient batch-mode active learning algorithm for CNNs is proposed based on acquisition of points expected to maximize the model’s improvement upon being queried, and is found to perform well across regression and classification tasks." +sr68jSUakP,d7E-2zCVpZf,1601310000000.0,1627920000000.0,266,Orthogonal Subspace Decomposition: A New Perspective of Learning Discriminative Features for Face Clustering,"[""~Jianfeng_Wang2"", ""~Thomas_Lukasiewicz2"", ""~zhongchao_shi1""]","[""Jianfeng Wang"", ""Thomas Lukasiewicz"", ""zhongchao shi""]",[],"Face clustering is an important task, due to its wide applications in practice. Graph-based face clustering methods have recently made a great progress and achieved new state-of-the-art results. Learning discriminative node features is the key to further improve the performance of graph-based face clustering. To this end, we propose subspace learning as a new way to learn discriminative node features, which is implemented by a new orthogonal subspace decomposition (OSD) module. 
In graph-based face clustering, OSD leads to more discriminative node features, which better reflect the relationship between each pair of faces, thereby boosting the accuracy of face clustering. Extensive experiments show that OSD outperforms state-of-the-art results with a healthy margin.",/pdf/a0cfccda4d056e99db6f96f8ca2ab386469faf1d.pdf,ICLR,2021, +B1xZD1rtPr,H1xOv9adwS,1569440000000.0,1577170000000.0,1755,The Dual Information Bottleneck,"[""zoe.piran@mail.huji.ac.il"", ""tishby@cs.huji.ac.il""]","[""Zoe Piran"", ""Naftali Tishby""]","[""optimal prediction learning"", ""exponential families"", ""critical points"", ""information theory""]","The Information-Bottleneck (IB) framework suggests a general characterization of optimal representations in learning, and deep learning in particular. It is based on the optimal trade off between the representation complexity and accuracy, both of which are quantified by mutual information. The problem is solved by alternating projections between the encoder and decoder of the representation, which can be performed locally at each representation level. The framework, however, has practical drawbacks, in that mutual information is notoriously difficult to handle at high dimension, and only has closed form solutions in special cases. Further, because it aims to extract representations which are minimal sufficient statistics of the data with respect to the desired label, it does not necessarily optimize the actual prediction of unseen labels. Here we present a formal dual problem to the IB which has several interesting properties. By switching the order in the KL-divergence between the representation decoder and data, the optimal decoder becomes the geometric rather than the arithmetic mean of the input points. While providing a good approximation to the original IB, it also preserves the form of exponential families, and optimizes the mutual information on the predicted label rather than the desired one. 
We also analyze the critical points of the dualIB and discuss their importance for the quality of this approach.",/pdf/412e688ab35b39126ef71131f567c2e69083a238.pdf,ICLR,2020,"A new dual formulation of the Information Bottleneck, optimizing label prediction and preserving distributions of exponential form." +S1x2PCNKDB,SyxIu9DuvS,1569440000000.0,1577170000000.0,1190,Task-Relevant Adversarial Imitation Learning,"[""konrad.zolna@gmail.com"", ""reedscot@google.com"", ""anovikov@google.com"", ""ziyu@google.com"", ""sergomez@google.com"", ""budden@google.com"", ""cabi@google.com"", ""mdenil@google.com"", ""nandodefreitas@google.com""]","[""Konrad Zolna"", ""Scott Reed"", ""Alexander Novikov"", ""Ziyu Wang"", ""Sergio G\u00f3mez"", ""David Budden"", ""Serkan Cabi"", ""Misha Denil"", ""Nando de Freitas""]","[""adversarial"", ""imitation"", ""robot"", ""manipulation""]","We show that a critical problem in adversarial imitation from high-dimensional sensory data is the tendency of discriminator networks to distinguish agent and expert behaviour using task-irrelevant features beyond the control of the agent. We analyze this problem in detail and propose a solution as well as several baselines that outperform standard Generative Adversarial Imitation Learning (GAIL). Our proposed solution, Task-Relevant Adversarial Imitation Learning (TRAIL), uses a constrained optimization objective to overcome task-irrelevant features. Comprehensive experiments show that TRAIL can solve challenging manipulation tasks from pixels by imitating human operators, where other agents such as behaviour cloning (BC), standard GAIL, improved GAIL variants including our newly proposed baselines, and Deterministic Policy Gradients from Demonstrations (DPGfD) fail to find solutions, even when the other agents have access to task reward. 
",/pdf/e73878130090de3a503e21ae906565e28b856a83.pdf,ICLR,2020,"Improve GAIL by preventing the discriminator from exploiting task-irrelevant information, to solve difficult sim robot manipulation tasks from pixels." +Z2qyx5vC8Xn,ZCVpf8KLt2J,1601310000000.0,1614990000000.0,497,Temporal Difference Uncertainties as a Signal for Exploration,"[""~Sebastian_Flennerhag1"", ""~Jane_X_Wang1"", ""~Pablo_Sprechmann1"", ""~Francesco_Visin1"", ""~Alexandre_Galashov1"", ""~Steven_Kapturowski1"", ""~Diana_L_Borsa1"", ""~Nicolas_Heess1"", ""~Andre_Barreto1"", ""~Razvan_Pascanu1""]","[""Sebastian Flennerhag"", ""Jane X Wang"", ""Pablo Sprechmann"", ""Francesco Visin"", ""Alexandre Galashov"", ""Steven Kapturowski"", ""Diana L Borsa"", ""Nicolas Heess"", ""Andre Barreto"", ""Razvan Pascanu""]","[""deep reinforcement learning"", ""deep-rl"", ""exploration""]","An effective approach to exploration in reinforcement learning is to rely on an agent's uncertainty over the optimal policy, which can yield near-optimal exploration strategies in tabular settings. However, in non-tabular settings that involve function approximators, obtaining accurate uncertainty estimates is almost as challenging as the exploration problem itself. In this paper, we highlight that value estimates are easily biased and temporally inconsistent. In light of this, we propose a novel method for estimating uncertainty over the value function that relies on inducing a distribution over temporal difference errors. This exploration signal controls for state-action transitions so as to isolate uncertainty in value that is due to uncertainty over the agent's parameters. Because our measure of uncertainty conditions on state-action transitions, we cannot act on this measure directly. Instead, we incorporate it as an intrinsic reward and treat exploration as a separate learning problem, induced by the agent's temporal difference uncertainties. 
We introduce a distinct exploration policy that learns to collect data with high estimated uncertainty, which gives rise to a curriculum that smoothly changes throughout learning and vanishes in the limit of perfect value estimates. We evaluate our method on hard exploration tasks, including Deep Sea and Atari 2600 environments and find that our proposed form of exploration facilitates efficient exploration.",/pdf/b1f081d47000b34d557d4a38a25e4f35a83ae58a.pdf,ICLR,2021,A method for exploration based on learning from uncertainty over the agent's value-function. +H1gEP6NFwr,rke4n3cDDB,1569440000000.0,1577170000000.0,593,On the Tunability of Optimizers in Deep Learning,"[""prabhu.teja@idiap.ch"", ""florian.mai@idiap.ch"", ""thijs.vogels@epfl.ch"", ""martin.jaggi@epfl.ch"", ""francois.fleuret@idiap.ch""]","[""Prabhu Teja S*"", ""Florian Mai*"", ""Thijs Vogels"", ""Martin Jaggi"", ""Francois Fleuret""]","[""Optimization"", ""Benchmarking"", ""Hyperparameter optimization""]","There is no consensus yet on the question whether adaptive gradient methods like Adam are easier to use than non-adaptive optimization methods like SGD. In this work, we fill in the important, yet ambiguous concept of ‘ease-of-use’ by defining an optimizer’s tunability: How easy is it to find good hyperparameter configurations using automatic random hyperparameter search? We propose a practical and universal quantitative measure for optimizer tunability that can form the basis for a fair optimizer benchmark. Evaluating a variety of optimizers on an extensive set of standard datasets and architectures, we find that Adam is the most tunable for the majority of problems, especially with a low budget for hyperparameter tuning.",/pdf/c8f619eb0065a52f3dd713250c986219ddd85949.pdf,ICLR,2020,We provide a method to benchmark optimizers that is cognizant to the hyperparameter tuning process. 
+punMXQEsPr0,io7BwLqdQtR,1601310000000.0,1631080000000.0,607,BROS: A Pre-trained Language Model for Understanding Texts in Document,"[""~Teakgyu_Hong1"", ""~DongHyun_Kim3"", ""~Mingi_Ji1"", ""~Wonseok_Hwang1"", ""~Daehyun_Nam2"", ""~Sungrae_Park1""]","[""Teakgyu Hong"", ""DongHyun Kim"", ""Mingi Ji"", ""Wonseok Hwang"", ""Daehyun Nam"", ""Sungrae Park""]","[""Pre-trained model"", ""language model"", ""Document understanding"", ""Document intelligence"", ""OCR""]","Understanding document from their visual snapshots is an emerging and challenging problem that requires both advanced computer vision and NLP methods. Although the recent advance in OCR enables the accurate extraction of text segments, it is still challenging to extract key information from documents due to the diversity of layouts. To compensate for the difficulties, this paper introduces a pre-trained language model, BERT Relying On Spatiality (BROS), that represents and understands the semantics of spatially distributed texts. Different from previous pre-training methods on 1D text, BROS is pre-trained on large-scale semi-structured documents with a novel area-masking strategy while efficiently including the spatial layout information of input documents. Also, to generate structured outputs in various document understanding tasks, BROS utilizes a powerful graph-based decoder that can capture the relation between text segments. BROS achieves state-of-the-art results on four benchmark tasks: FUNSD, SROIE*, CORD, and SciTSR. Our experimental settings and implementation codes will be publicly available.",/pdf/169051de1157bbb4c4c44071420e3c267331bb6e.pdf,ICLR,2021,We propose a pre-trained language model for understanding texts in document and it shows the state-of-the-art performances on four benchmark tasks. 
+Io8oYQb4LRK,sXPCyvYBCZJ,1601310000000.0,1614990000000.0,791,Non-greedy Gradient-based Hyperparameter Optimization Over Long Horizons,"[""~Paul_Micaelli1"", ""~Amos_Storkey1""]","[""Paul Micaelli"", ""Amos Storkey""]","[""Hyperparameter optimization"", ""Meta-learning""]","Gradient-based meta-learning has earned a widespread popularity in few-shot learning, but remains broadly impractical for tasks with long horizons (many gradient steps), due to memory scaling and gradient degradation issues. A common workaround is to learn meta-parameters online, but this introduces greediness which comes with a significant performance drop. In this work, we enable non-greedy meta-learning of hyperparameters over long horizons by sharing hyperparameters that are contiguous in time, and using the sign of hypergradients rather than their magnitude to indicate convergence. We implement this with forward-mode differentiation, which we extend to the popular momentum-based SGD optimizer. We demonstrate that the hyperparameters of this optimizer can be learned non-greedily without gradient degradation over $\sim 10^4$ inner gradient steps, by only requiring $\sim 10$ outer gradient steps. On CIFAR-10, we outperform greedy and random search methods for the same computational budget by nearly $10\%$. 
Code will be available upon publication.",/pdf/1776a38a805bab15f05b215c07748ce667abdde1.pdf,ICLR,2021,Learning hyperparameter over long horizons without greediness +RmB-zwXOIVC,dXs5IHudfIW,1601310000000.0,1614990000000.0,2727,Imitation with Neural Density Models,"[""~Kuno_Kim1"", ""akshatj@cs.stanford.edu"", ""~Yang_Song1"", ""~Jiaming_Song1"", ""~Yanan_Sui1"", ""~Stefano_Ermon1""]","[""Kuno Kim"", ""Akshat Jindal"", ""Yang Song"", ""Jiaming Song"", ""Yanan Sui"", ""Stefano Ermon""]","[""Imitation Learning"", ""Reinforcement Learning"", ""Density Estimation"", ""Density Model"", ""Maximum Entropy RL"", ""Mujoco""]","We propose a new framework for Imitation Learning (IL) via density estimation of the expert's occupancy measure followed by Maximum Occupancy Entropy Reinforcement Learning (RL) using the density as a reward. Our approach maximizes a non-adversarial model-free RL objective that provably lower bounds reverse Kullback–Leibler divergence between occupancy measures of the expert and imitator. We present a practical IL algorithm, Neural Density Imitation (NDI), which obtains state-of-the-art demonstration efficiency on benchmark control tasks. ",/pdf/d0cbf2093232ab70917cec4b89c4534466141a8c.pdf,ICLR,2021,New Imitation Learning framework based on density estimation that achieves good demonstration efficiency +hE3JWimujG,kHmtvifbCC,1601310000000.0,1614990000000.0,3598,Cortico-cerebellar networks as decoupled neural interfaces,"[""oq19042@bristol.ac.uk"", ""ellen.boven@bristol.ac.uk"", ""r.apps@bristol.ac.uk"", ""~Rui_Ponte_Costa3""]","[""Joseph Pemberton"", ""Ellen Boven"", ""Richard Apps"", ""Rui Ponte Costa""]","[""systems neuroscience"", ""cerebellum"", ""neocortex"", ""decoupled neural interfaces"", ""deep learning"", ""decorrelation"", ""inverse models"", ""forward models""]","The brain solves the credit assignment problem remarkably well. 
For credit to be correctly assigned across multiple cortical areas a given area should, in principle, wait for others to finish their computation. How the brain deals with this locking problem has remained unclear. Deep learning methods suffer from similar locking constraints both on the forward and backward phase. Recently, decoupled neural interfaces (DNI) were introduced as a solution to the forward and backward locking problems. +Here we propose that a specialised brain region, the cerebellum, helps the cerebral cortex solve the locking problem closely matching the computations and architecture of DNI. In particular, we propose that classical cerebellar forward and inverse models are equivalent to solving the backward and forward locking problems, respectively. To demonstrate the potential of this framework we focus on modelling a given brain area as a recurrent neural network in which the cerebellum approximates temporal feedback signals as provided by BPTT. We tested the cortico-cerebellar-DNI (CC-DNI) model in a range of sensorimotor and cognitive tasks that have been shown to be cerebellar-dependent. First, we show that the CC-DNI unlocking mechanisms can facilitate learning in a simple target reaching task. Next, by building on the sequential MNIST task we demonstrate that these results generalise to more complex sensorimotor tasks. Our cortico-cerebellar model readily applies to a wider range of modalities, to demonstrate this we tested the model in a cognitive task, caption generation. Models without the cerebellar-DNI component exhibit deficits similar to those observed in cerebellar patients in both motor and cognitive tasks. Moreover, we used CC-DNI to generate a set of specific neuroscience predictions. Finally, we introduce a CC-DNI model with highly sparse connectivity as observed in the cerebellum, which substantially reduces the number of parameters while improving learning through decorrelation. 
+Overall, our work offers a novel perspective on the cerebellum as a brain-wide decoupling machine for efficient credit assignment and opens a new avenue of research between deep learning and neuroscience.",/pdf/9a9bb9c545f92910a0e6a756fce89db5509f430e.pdf,ICLR,2021,We propose that cerebellar-cortical networks facilitate learning across the brain by solving the locking problems inherent to credit assignment. +HJxB5sRcFQ,H1xdAnEXKX,1538090000000.0,1547280000000.0,533,LayoutGAN: Generating Graphic Layouts with Wireframe Discriminators,"[""lijianan15@gmail.com"", ""jimyang@adobe.com"", ""hertzman@adobe.com"", ""jianmzha@adobe.com"", ""ciom_xtf1@bit.edu.cn""]","[""Jianan Li"", ""Jimei Yang"", ""Aaron Hertzmann"", ""Jianming Zhang"", ""Tingfa Xu""]",[],"Layout is important for graphic design and scene generation. We propose a novel Generative Adversarial Network, called LayoutGAN, that synthesizes layouts by modeling geometric relations of different types of 2D elements. The generator of LayoutGAN takes as input a set of randomly-placed 2D graphic elements and uses self-attention modules to refine their labels and geometric parameters jointly to produce a realistic layout. Accurate alignment is critical for good layouts. We thus propose a novel differentiable wireframe rendering layer that maps the generated layout to a wireframe image, upon which a CNN-based discriminator is used to optimize the layouts in image space. 
We validate the effectiveness of LayoutGAN in various experiments including MNIST digit generation, document layout generation, clipart abstract scene generation and tangram graphic design.",/pdf/6bea64344b5868c50f04a043422f06436b39c251.pdf,ICLR,2019, +HyGDdsCcFQ,BJg0_FCtFX,1538090000000.0,1545360000000.0,365,Better Generalization with On-the-fly Dataset Denoising,"[""jiaming.tsong@gmail.com"", ""tengyuma@cs.stanford.edu"", ""michael.auli@gmail.com"", ""yann@dauphin.io""]","[""Jiaming Song"", ""Tengyu Ma"", ""Michael Auli"", ""Yann Dauphin""]","[""dataset denoising"", ""supervised learning"", ""implicit regularization""]","Memorization in over-parameterized neural networks can severely hurt generalization in the presence of mislabeled examples. However, mislabeled examples are hard to avoid in extremely large datasets. We address this problem using the implicit regularization effect of stochastic gradient descent with large learning rates, which we find to be able to separate clean and mislabeled examples with remarkable success using loss statistics. We leverage this to identify and on-the-fly discard mislabeled examples using a threshold on their losses. This leads to On-the-fly Data Denoising (ODD), a simple yet effective algorithm that is robust to mislabeled examples, while introducing almost zero computational overhead. Empirical results demonstrate the effectiveness of ODD on several datasets containing artificial and real-world mislabeled examples.",/pdf/6b4b2accf193a736c09c22d49e28446bb345c862.pdf,ICLR,2019,We introduce a fast and easy-to-implement algorithm that is robust to dataset noise. 
+S16FPMgRZ,rJaFPGeA-,1509070000000.0,1518730000000.0,225,Tensor Contraction & Regression Networks,"[""jean.kossaifi@gmail.com"", ""zlipton@cmu.edu"", ""arankhan@amazon.com"", ""tfurlanello@gmail.com"", ""animakumar@gmail.com""]","[""Jean Kossaifi"", ""Zack Chase Lipton"", ""Aran Khanna"", ""Tommaso Furlanello"", ""Anima Anandkumar""]","[""tensor contraction"", ""tensor regression"", ""network compression"", ""deep neural networks""]","Convolution neural networks typically consist of many convolutional layers followed by several fully-connected layers. While convolutional layers map between high-order activation tensors, the fully-connected layers operate on flattened activation vectors. Despite its success, this approach has notable drawbacks. Flattening discards the multi-dimensional structure of the activations, and the fully-connected layers require a large number of parameters. +We present two new techniques to address these problems. First, we introduce tensor contraction layers which can replace the ordinary fully-connected layers in a neural network. Second, we introduce tensor regression layers, which express the output of a neural network as a low-rank multi-linear mapping from a high-order activation tensor to the softmax layer. Both the contraction and regression weights are learned end-to-end by backpropagation. By imposing low rank on both, we use significantly fewer parameters. Experiments on the ImageNet dataset show that applied to the popular VGG and ResNet architectures, our methods significantly reduce the number of parameters in the fully connected layers (about 65% space savings) while negligibly impacting accuracy.",/pdf/e2e7a7729421305c784b8aef0919f6829adc2076.pdf,ICLR,2018,"We propose tensor contraction and low-rank tensor regression layers to preserve and leverage the multi-linear structure throughout the network, resulting in huge space savings with little to no impact on performance." 
+BJlRs34Fvr,SJlGMmHWvS,1569440000000.0,1583910000000.0,171,Skip Connections Matter: On the Transferability of Adversarial Examples Generated with ResNets,"[""wu-dx16@mails.tsinghua.edu.cn"", ""eewangyisen@gmail.com"", ""xiast@sz.tsinghua.edu.cn"", ""baileyj@unimelb.edu.au"", ""xingjun.ma@unimelb.edu.au""]","[""Dongxian Wu"", ""Yisen Wang"", ""Shu-Tao Xia"", ""James Bailey"", ""Xingjun Ma""]","[""Adversarial Example"", ""Transferability"", ""Skip Connection"", ""Neural Network""]","Skip connections are an essential component of current state-of-the-art deep neural networks (DNNs) such as ResNet, WideResNet, DenseNet, and ResNeXt. Despite their huge success in building deeper and more powerful DNNs, we identify a surprising \emph{security weakness} of skip connections in this paper. Use of skip connections \textit{allows easier generation of highly transferable adversarial examples}. Specifically, in ResNet-like (with skip connections) neural networks, gradients can backpropagate through either skip connections or residual modules. We find that using more gradients from the skip connections rather than the residual modules according to a decay factor, allows one to craft adversarial examples with high transferability. Our method is termed \emph{Skip Gradient Method} (SGM). We conduct comprehensive transfer attacks against state-of-the-art DNNs including ResNets, DenseNets, Inceptions, Inception-ResNet, Squeeze-and-Excitation Network (SENet) and robustly trained DNNs. We show that employing SGM on the gradient flow can greatly improve the transferability of crafted attacks in almost all cases. Furthermore, SGM can be easily combined with existing black-box attack techniques, and obtain high improvements over state-of-the-art transferability methods. 
Our findings not only motivate new research into the architectural vulnerability of DNNs, but also open up further challenges for the design of secure DNN architectures.",/pdf/4678889e89b4ae9d2de1be499ae000b936efdeaa.pdf,ICLR,2020,We identify the security weakness of skip connections in ResNet-like neural networks +S11KBYclx,,1478300000000.0,1488580000000.0,488,Learning Curve Prediction with Bayesian Neural Networks,"[""kleinaa@cs.uni-freiburg.de"", ""sfalkner@cs.uni-freiburg.de"", ""springj@cs.uni-freiburg.de"", ""fh@cs.uni-freiburg.de""]","[""Aaron Klein"", ""Stefan Falkner"", ""Jost Tobias Springenberg"", ""Frank Hutter""]","[""Deep learning"", ""Applications""]","Different neural network architectures, hyperparameters and training protocols lead to different performances as a function of time. +Human experts routinely inspect the resulting learning curves to quickly terminate runs with poor hyperparameter settings and thereby considerably speed up manual hyperparameter optimization. Exploiting the same information in automatic Bayesian hyperparameter optimization requires a probabilistic model of learning curves across hyperparameter settings. Here, we study the use of Bayesian neural networks for this purpose and improve their performance by a specialized learning curve layer.",/pdf/b55f7dcf80e10780e6603b0c9238b2802cb42e31.pdf,ICLR,2017,We present a general probabilistic method based on Bayesian neural networks to predict learning curves of iterative machine learning methods. 
+_0kaDkv3dVf,#NAME?,1601310000000.0,1616020000000.0,2323,HW-NAS-Bench: Hardware-Aware Neural Architecture Search Benchmark,"[""~Chaojian_Li1"", ""~Zhongzhi_Yu1"", ""~Yonggan_Fu1"", ""~Yongan_Zhang1"", ""~Yang_Zhao1"", ""~Haoran_You1"", ""~Qixuan_Yu1"", ""~Yue_Wang3"", ""hc.onioncc@gmail.com"", ""~Yingyan_Lin1""]","[""Chaojian Li"", ""Zhongzhi Yu"", ""Yonggan Fu"", ""Yongan Zhang"", ""Yang Zhao"", ""Haoran You"", ""Qixuan Yu"", ""Yue Wang"", ""Cong Hao"", ""Yingyan Lin""]","[""Hardware-Aware Neural Architecture Search"", ""AutoML"", ""Benchmark""]","HardWare-aware Neural Architecture Search (HW-NAS) has recently gained tremendous attention by automating the design of deep neural networks deployed in more resource-constrained daily life devices. Despite its promising performance, developing optimal HW-NAS solutions can be prohibitively challenging as it requires cross-disciplinary knowledge in the algorithm, micro-architecture, and device-specific compilation. First, to determine the hardware-cost to be incorporated into the NAS process, existing works mostly adopt either pre-collected hardware-cost look-up tables or device-specific hardware-cost models. The former can be time-consuming due to the required knowledge of the device’s compilation method and how to set up the measurement pipeline, while building the latter is often a barrier for non-hardware experts like NAS researchers. Both of them limit the development of HW-NAS innovations and impose a barrier-to-entry to non-hardware experts. Second, similar to generic NAS, it can be notoriously difficult to benchmark HW-NAS algorithms due to their significant required computational resources and the differences in adopted search spaces, hyperparameters, and hardware devices. To this end, we develop HW-NAS-Bench, the first public dataset for HW-NAS research which aims to democratize HW-NAS research to non-hardware experts and make HW-NAS research more reproducible and accessible. 
To design HW-NAS-Bench, we carefully collected the measured/estimated hardware performance (e.g., energy cost and latency) of all the networks in the search spaces of both NAS-Bench-201 and FBNet, on six hardware devices that fall into three categories (i.e., commercial edge devices, FPGA, and ASIC). Furthermore, we provide a comprehensive analysis of the collected measurements in HW-NAS-Bench to provide insights for HW-NAS research. Finally, we demonstrate exemplary user cases to (1) show that HW-NAS-Bench allows non-hardware experts to perform HW-NAS by simply querying our pre-measured dataset and (2) verify that dedicated device-specific HW-NAS can indeed lead to optimal accuracy-cost trade-offs. The codes and all collected data are available at https://github.com/RICE-EIC/HW-NAS-Bench.",/pdf/1256d25521d41df912407cdd9aea52fa6d3d5265.pdf,ICLR,2021,A Hardware-Aware Neural Architecture Search Benchmark +Ut1vF_q_vC,zj4cMdZUG4Z,1601310000000.0,1616440000000.0,952,Are Neural Rankers still Outperformed by Gradient Boosted Decision Trees?,"[""~Zhen_Qin5"", ""lyyanle@google.com"", ""~Honglei_Zhuang1"", ""~Yi_Tay1"", ""~Rama_Kumar_Pasumarthi1"", ""~Xuanhui_Wang1"", ""bemike@google.com"", ""najork@google.com""]","[""Zhen Qin"", ""Le Yan"", ""Honglei Zhuang"", ""Yi Tay"", ""Rama Kumar Pasumarthi"", ""Xuanhui Wang"", ""Michael Bendersky"", ""Marc Najork""]","[""Learning to Rank"", ""benchmark"", ""neural network"", ""gradient boosted decision trees""]","Despite the success of neural models on many major machine learning problems, their effectiveness on traditional Learning-to-Rank (LTR) problems is still not widely acknowledged. We first validate this concern by showing that most recent neural LTR models are, by a large margin, inferior to the best publicly available Gradient Boosted Decision Trees (GBDT) in terms of their reported ranking accuracy on benchmark datasets. This unfortunately was somehow overlooked in recent neural LTR papers. 
We then investigate why existing neural LTR models under-perform and identify several of their weaknesses. Furthermore, we propose a unified framework comprising counter strategies to ameliorate the existing weaknesses of neural models. Our models are the first to be able to perform equally well, compared with the best tree-based baseline, while outperforming recently published neural LTR models by a large margin. Our results can also serve as a benchmark to facilitate future improvement of neural LTR models.",/pdf/ad3fca583fdc23233f81a4e1b068afdb9ccb877f.pdf,ICLR,2021, +r1ezqaEFPr,HygfoYaDPS,1569440000000.0,1577170000000.0,698,Multi-Task Learning via Scale Aware Feature Pyramid Networks and Effective Joint Head,"[""nifeng@pku.edu.cn""]","[""Feng Ni""]","[""Multi-Task Learning"", ""Object Detection"", ""Instance Segmentation""]","As a concise and classic framework for object detection and instance segmentation, Mask R-CNN achieves promising performance in both tasks. However, considering stronger feature representation for Mask R-CNN fashion framework, there is room for improvement from two aspects. On the one hand, performing multi-task prediction needs more credible feature extraction and multi-scale features integration to handle objects with varied scales. In this paper, we address this problem by using a novel neck module called SA-FPN (Scale Aware Feature Pyramid Networks). With the enhanced feature representations, our model can accurately detect and segment the objects of multiple scales. On the other hand, in Mask R-CNN framework, isolation between parallel detection branch and instance segmentation branch exists, causing the gap between training and testing processes. To narrow this gap, we propose a unified head module named EJ-Head (Effective Joint Head) to combine two branches into one head, not only realizing the interaction between two tasks, but also enhancing the effectiveness of multi-task learning. 
Comprehensive experiments show that our proposed methods bring noticeable gains for object detection and instance segmentation. In particular, our model outperforms the original Mask R-CNN by 1~2 percent AP in both object detection and instance segmentation task on MS-COCO benchmark. Code will be available soon.",/pdf/63b8b7be773d30672c2843506aecc167fa3fac5b.pdf,ICLR,2020,Our work improves Mask R-CNN by using Scale Aware FPN to remedy scale variation issue and combining detection and segmentation branches into Effective Joint Head for more expressive multi-task learning. +S1g_S0VYvr,ByemchIdPr,1569440000000.0,1577170000000.0,1116,Learning to Combat Compounding-Error in Model-Based Reinforcement Learning,"[""chenjun@ualberta.ca"", ""yw4@andrew.cmu.edu"", ""chenchloem@gmail.com"", ""daes@ualberta.ca"", ""mmueller@ualberta.ca""]","[""Chenjun Xiao"", ""Yifan Wu"", ""Chen Ma"", ""Dale Schuurmans"", ""Martin M\u00fcller""]","[""reinforcement learning"", ""model-based RL""]","Despite its potential to improve sample complexity versus model-free approaches, model-based reinforcement learning can fail catastrophically if the model is inaccurate. An algorithm should ideally be able to trust an imperfect model over a reasonably long planning horizon, and only rely on model-free updates when the model errors get infeasibly large. In this paper, we investigate techniques for choosing the planning horizon on a state-dependent basis, where a state's planning horizon is determined by the maximum cumulative model error around that state. We demonstrate that these state-dependent model errors can be learned with Temporal Difference methods, based on a novel approach of temporally decomposing the cumulative model errors. Experimental results show that the proposed method can successfully adapt the planning horizon to account for state-dependent model accuracy, significantly improving the efficiency of policy learning compared to model-based and model-free baselines. 
+",/pdf/3666c400c6df308df7837cacd0c8d06a22f11c7a.pdf,ICLR,2020, +2Ey_1FeNtOC,Y4wrVWsbO3j,1601310000000.0,1614990000000.0,3355,Minimum Description Length Recurrent Neural Networks,"[""nurlan@mail.tau.ac.il"", ""chemla@ens.fr"", ""~Roni_Katzir1""]","[""Nur Lan"", ""Emmanuel Chemla"", ""Roni Katzir""]","[""recurrent neural network"", ""neural network"", ""language modeling"", ""minimum description length"", ""genetic algorithm"", ""semantics"", ""syntax""]","Recurrent neural networks (RNNs) face two well-known challenges: (a) the difficulty of such networks to generalize appropriately as opposed to memorizing, especially from very short input sequences (generalization); and (b) the difficulty for us to understand the knowledge that the network has attained (transparency). We explore the implications to these challenges of employing a general search through neural architectures using a genetic algorithm with Minimum Description Length (MDL) as an objective function. We find that MDL leads the networks to reach adequate levels of generalization from very small corpora, improving over backpropagation-based alternatives. We demonstrate this approach by evolving networks which perform tasks of increasing complexity with absolute correctness. The resulting networks are small, easily interpretable, and unlike classical RNNs, are provably appropriate for sequences of arbitrary length even when trained on very limited corpora. One case study is addition, for which our system grows a network with just four cells, reaching 100% accuracy (and at least .999 certainty) for arbitrary large numbers.",/pdf/2b41d4860e963d7e823df3c25425606ec075cdd5.pdf,ICLR,2021,"Description length-based optimization of recurrent neural networks leads to interpretable and absolutely-correct networks for language modeling, using very small training data. 
" +HkSOlP9lg,,1478290000000.0,1480460000000.0,353,Recurrent Inference Machines for Solving Inverse Problems,"[""patrick.putzky@gmail.com"", ""welling.max@gmail.com""]","[""Patrick Putzky"", ""Max Welling""]","[""Optimization"", ""Deep learning"", ""Computer vision""]","Inverse problems are typically solved by first defining a model and then choosing an inference procedure. With this separation of modeling from inference, inverse problems can be framed in a modular way. For example, variational inference can be applied to a broad class of models. The modularity, however, typically goes away after model parameters have been trained under a chosen inference procedure. During training, model and inference often interact in a way that the model parameters will ultimately be adapted to the chosen inference procedure, posing the two components inseparable after training. But if model and inference become inseperable after training, why separate them in the first place? + +We propose a novel learning framework which abandons the dichotomy between model and inference. Instead, we introduce Recurrent Inference Machines (RIM), a class of recurrent neural networks (RNN) that directly learn to solve inverse problems. + +We demonstrate the effectiveness of RIMs in experiments on various image reconstruction tasks. We show empirically that RIMs exhibit the desirable convergence behavior of classical inference procedures, and that they can outperform state-of- the-art methods when trained on specialized inference tasks. 
+ +Our approach bridges the gap between inverse problems and deep learning, providing a framework for fast progression in the field of inverse problems.",/pdf/1a94a67472aff744865bead1de0cf01ff3805ded.pdf,ICLR,2017, +ePh9bvqIgKL,0dDj8K1opuA,1601310000000.0,1614990000000.0,199,Discovering Parametric Activation Functions,"[""~Garrett_Bingham1"", ""~Risto_Miikkulainen1""]","[""Garrett Bingham"", ""Risto Miikkulainen""]","[""activation function"", ""parametric"", ""evolution""]","Recent studies have shown that the choice of activation function can significantly affect the performance of deep learning networks. However, the benefits of novel activation functions have been inconsistent and task-dependent, and therefore the rectified linear unit (ReLU) is still the most commonly used. This paper proposes a technique for customizing activation functions automatically, resulting in reliable improvements in performance. Evolutionary search is used to discover the general form of the function, and gradient descent to optimize its parameters for different parts of the network and over the learning process. Experiments with three different neural network architectures on the CIFAR-100 image classification dataset show that this approach is effective. It discovers different activation functions for different architectures, and consistently improves accuracy over ReLU and other recently proposed activation functions by significant margins. The approach can therefore be used as an automated optimization step in applying deep learning to new tasks.",/pdf/2c1508897cc35db6c53ddd3ffb21e3a8dba1b346.pdf,ICLR,2021,"Evolutionary search discovers the general form of novel activation functions, and gradient descent fine-tunes the shape for different parts of the network and over the learning process." +ByeTHsAqtX,rJxl1oMKY7,1538090000000.0,1545360000000.0,125,Gradient Descent Happens in a Tiny Subspace,"[""guyg@ias.edu"", ""danr@fb.com"", ""edyer@google.com""]","[""Guy Gur-Ari"", ""Daniel A. 
Roberts"", ""Ethan Dyer""]","[""Gradient Descent"", ""Hessian"", ""Deep Learning""]","We show that in a variety of large-scale deep learning scenarios the gradient dynamically converges to a very small subspace after a short period of training. The subspace is spanned by a few top eigenvectors of the Hessian (equal to the number of classes in the dataset), and is mostly preserved over long periods of training. A simple argument then suggests that gradient descent may happen mostly in this subspace. We give an example of this effect in a solvable model of classification, and we comment on possible implications for optimization and learning.",/pdf/1a5c9d880096c438ce4e5908c4b58ad58d158bee.pdf,ICLR,2019,"For classification problems with k classes, we show that the gradient tends to live in a tiny, slowly-evolving subspace spanned by the eigenvectors corresponding to the k-largest eigenvalues of the Hessian." +uFk038O5wZ,2bDV-O0k-dK,1601310000000.0,1614990000000.0,1634,Improving Abstractive Dialogue Summarization with Conversational Structure and Factual Knowledge,"[""~Lulu_Zhao1"", ""buptyzy@bupt.edu.cn"", ""xuweiran@bupt.edu.cn"", ""~Sheng_Gao1"", ""~Jun_Guo1""]","[""Lulu Zhao"", ""Zeyuan Yang"", ""Weiran Xu"", ""Sheng Gao"", ""Jun Guo""]","[""abstractive dialogue summarization"", ""long-distance cross-sentence relation"", ""conversational structure"", ""factual knowledge"", ""sparse relational graph self-attention network"", ""dual-copy mechanism""]","Recently, people have been paying more attention to the abstractive dialogue summarization task. Compared with news text, the information flows of the dialogue exchange between at least two interlocutors, which leads to the necessity of capturing long-distance cross-sentence relations. In addition, the generated summaries commonly suffer from fake facts because the key elements of dialogues often scatter in multiple utterances. However, the existing sequence-to-sequence models are difficult to address these issues. 
Therefore, it is necessary for researchers to explore the implicit conversational structure to ensure the richness and faithfulness of generated contents. In this paper, we present a Knowledge Graph Enhanced Dual-Copy network (KGEDC), a novel framework for abstractive dialogue summarization with conversational structure and factual knowledge. We use a sequence encoder to draw local features and a graph encoder to integrate global features via the sparse relational graph self-attention network, complementing each other. Besides, a dual-copy mechanism is also designed in decoding process to force the generation conditioned on both the source text and extracted factual knowledge. The experimental results show that our method produces significantly higher ROUGE scores than most of the baselines on both SAMSum corpus and Automobile Master corpus. Human judges further evaluate that outputs of our model contain richer and more faithful information.",/pdf/c9b3eabe79e36d17867d7b4e3a73f27619168eb6.pdf,ICLR,2021, +nkIDwI6oO4_,cKKNdFmUDDf,1601310000000.0,1615700000000.0,2099,Learning A Minimax Optimizer: A Pilot Study,"[""~Jiayi_Shen1"", ""~Xiaohan_Chen1"", ""~Howard_Heaton2"", ""~Tianlong_Chen1"", ""~Jialin_Liu1"", ""~Wotao_Yin1"", ""~Zhangyang_Wang1""]","[""Jiayi Shen"", ""Xiaohan Chen"", ""Howard Heaton"", ""Tianlong Chen"", ""Jialin Liu"", ""Wotao Yin"", ""Zhangyang Wang""]","[""Learning to Optimize"", ""Minimax Optimization""]","Solving continuous minimax optimization is of extensive practical interest, yet notoriously unstable and difficult. This paper introduces the learning to optimize (L2O) methodology to the minimax problems for the first time and addresses its accompanying unique challenges. We first present Twin-L2O, the first dedicated minimax L2O method consisting of two LSTMs for updating min and max variables separately. The decoupled design is found to facilitate learning, particularly when the min and max variables are highly asymmetric. 
Empirical experiments on a variety of minimax problems corroborate the effectiveness of Twin-L2O. We then discuss a crucial concern of Twin-L2O, i.e., its inevitably limited generalizability to unseen optimizees. To address this issue, we present two complementary strategies. Our first solution, Enhanced Twin-L2O, is empirically applicable for general minimax problems, by improving L2O training via leveraging curriculum learning. Our second alternative, called Safeguarded Twin-L2O, is a preliminary theoretical exploration stating that under some strong assumptions, it is possible to theoretically establish the convergence of Twin-L2O. We benchmark our algorithms on several testbed problems and compare against state-of-the-art minimax solvers. The code is available at: https://github.com/VITA-Group/L2O-Minimax.",/pdf/d13003c8f7a704dcb012e4a100a51df5332b5286.pdf,ICLR,2021,"This paper introduces the learning to optimize (L2O) methodology, called Twin L2O, for minimax optimization consisting of two LSTMs." +SkC_7v5gx,,1478290000000.0,1480790000000.0,375,The Power of Sparsity in Convolutional Neural Networks,"[""schangpi@usc.edu"", ""sandler@google.com"", ""azhmogin@google.com""]","[""Soravit Changpinyo"", ""Mark Sandler"", ""Andrey Zhmoginov""]","[""Deep learning"", ""Supervised Learning""]","Deep convolutional networks are well-known for their high computational and memory demands. Given limited resources, how does one design a network that balances its size, training time, and prediction accuracy? A surprisingly effective approach to trade accuracy for size and speed is to simply reduce the number of channels in each convolutional layer by a fixed fraction and retrain the network. In many cases this leads to significantly smaller networks with only minimal changes to accuracy. 
In this paper, we take a step further by empirically examining a strategy for deactivating connections between filters in convolutional layers in a way that allows us to harvest savings both in run-time and memory for many network architectures. More specifically, we generalize 2D convolution to use a channel-wise sparse connection structure and show that this leads to significantly better results than the baseline approach for large networks including VGG and Inception V3.",/pdf/6514d69f5b4a80e843ab8cef6af09836b69bd679.pdf,ICLR,2017,Sparse random connections that allow savings to be harvested and that are very effective at compressing CNNs. +xPw-dr5t1RH,rxETo9L_FH,1601310000000.0,1614990000000.0,1428,KETG: A Knowledge Enhanced Text Generation Framework,"[""~Yan_Cui3"", ""chx_1988@163.com"", ""~Jiang_Qian1"", ""~Bojin_Zhuang2"", ""~Shaojun_Wang1"", ""~Jing_Xiao3""]","[""Yan Cui"", ""Xi Chen"", ""Jiang Qian"", ""Bojin Zhuang"", ""Shaojun Wang"", ""Jing Xiao""]","[""text generation"", ""knowledge graph""]","Embedding logical knowledge information into text generation is a challenging NLP task. In this paper, we propose a knowledge enhanced text generation (KETG) framework, which incorporates both the knowledge and associated text corpus to address logicality and diversity in text generation. Specifically, we validate our framework on rhetorical text generation from our newly built rhetoric knowledge graph. 
Experiments show that our framework outperforms baseline models such as Transformer and GPT-2, on rhetorical type control, semantic comprehensibility and diversity.",/pdf/07486f08ef3c6f034e1dd21eafa13a633fee7ac0.pdf,ICLR,2021, +SyjsLqxR-,rJqsI9eR-,1509100000000.0,1518730000000.0,351,"Universality, Robustness, and Detectability of Adversarial Perturbations under Adversarial Training","[""janhendrik.metzen@de.bosch.com""]","[""Jan Hendrik Metzen""]","[""adversarial examples"", ""adversarial training"", ""universal perturbations"", ""safety"", ""deep learning""]","Classifiers such as deep neural networks have been shown to be vulnerable against adversarial perturbations on problems with high-dimensional input space. While adversarial training improves the robustness of classifiers against such adversarial perturbations, it leaves classifiers sensitive to them on a non-negligible fraction of the inputs. We argue that there are two different kinds of adversarial perturbations: shared perturbations which fool a classifier on many inputs and singular perturbations which only fool the classifier on a small fraction of the data. We find that adversarial training increases the robustness of classifiers against shared perturbations. Moreover, it is particularly effective in removing universal perturbations, which can be seen as an extreme form of shared perturbations. Unfortunately, adversarial training does not consistently increase the robustness against singular perturbations on unseen inputs. However, we find that adversarial training decreases robustness of the remaining perturbations against image transformations such as changes to contrast and brightness or Gaussian blurring. It thus makes successful attacks on the classifier in the physical world less likely. Finally, we show that even singular perturbations can be easily detected and must thus exhibit generalizable patterns even though the perturbations are specific for certain inputs. 
",/pdf/08eb21a1477129f3a1fa0d8814f31889b6a435b8.pdf,ICLR,2018,"We empirically show that adversarial training is effective for removing universal perturbations, makes adversarial examples less robust to image transformations, and leaves them detectable for a detection approach." +Hyepjh4FwB,rJgCMC7-vS,1569440000000.0,1577170000000.0,170,ProtoAttend: Attention-Based Prototypical Learning,"[""soarik@google.com"", ""tpfister@google.com""]","[""Sercan O. Arik"", ""Tomas Pfister""]","[""Interpretability"", ""sample-based explanations"", ""prototypes"", ""confidence estimation""]","We propose a novel inherently interpretable machine learning method that bases decisions on few relevant examples that we call prototypes. Our method, ProtoAttend, can be integrated into a wide range of neural network architectures including pre-trained models. It utilizes an attention mechanism that relates the encoded representations to samples in order to determine prototypes. The resulting model outperforms state of the art in three high impact problems without sacrificing accuracy of the original model: (1) it enables high-quality interpretability that outputs samples most relevant to the decision-making (i.e. a sample-based interpretability method); (2) it achieves state of the art confidence estimation by quantifying the mismatch across prototype labels; and (3) it obtains state of the art in distribution mismatch detection. All this can be achieved with minimal additional test time and a practically viable training time computational cost.",/pdf/7498dbb6822e4af4705b07b0b1bfa3b356fa3866.pdf,ICLR,2020,We propose a new learning framework that bases decision-making on few relevant examples that we call prototypes. +HJzgZ3JCW,rJ-xWhk0-,1509040000000.0,1519020000000.0,168,Efficient Sparse-Winograd Convolutional Neural Networks,"[""xyl@stanford.edu"", ""jpool@nvidia.com"", ""songhan@stanford.edu"", ""dally@stanford.edu""]","[""Xingyu Liu"", ""Jeff Pool"", ""Song Han"", ""William J. 
Dally""]","[""deep learning"", ""convolutional neural network"", ""pruning""]","Convolutional Neural Networks (CNNs) are computationally intensive, which limits their application on mobile devices. Their energy is dominated by the number of multiplies needed to perform the convolutions. Winograd’s minimal filtering algorithm (Lavin, 2015) and network pruning (Han et al., 2015) can reduce the operation count, but these two methods cannot be straightforwardly combined — applying the Winograd transform fills in the sparsity in both the weights and the activations. We propose two modifications to Winograd-based CNNs to enable these methods to exploit sparsity. First, we move the ReLU operation into the Winograd domain to increase the sparsity of the transformed activations. Second, we prune the weights in the Winograd domain to exploit static weight sparsity. For models on CIFAR-10, CIFAR-100 and ImageNet datasets, our method reduces the number of multiplications by 10.4x, 6.8x and 10.8x respectively with loss of accuracy less than 0.1%, outperforming previous baselines by 2.0x-3.0x. We also show that moving ReLU to the Winograd domain allows more aggressive pruning.",/pdf/577cc25450693c482d3f2922681e407b24c98a93.pdf,ICLR,2018,Prune and ReLU in Winograd domain for efficient convolutional neural network +HkNGYjR9FX,S1erhnqcKQ,1538090000000.0,1548290000000.0,427,Learning Recurrent Binary/Ternary Weights,"[""arash.ardakani@mail.mcgill.ca"", ""zhengyun.ji@mail.mcgill.ca"", ""sean.smithson@mail.mcgill.ca"", ""brett.meyer@mcgill.ca"", ""warren.gross@mcgill.ca""]","[""Arash Ardakani"", ""Zhengyun Ji"", ""Sean C. Smithson"", ""Brett H. Meyer"", ""Warren J. Gross""]","[""Quantized Recurrent Neural Network"", ""Hardware Implementation"", ""Deep Learning""]","Recurrent neural networks (RNNs) have shown excellent performance in processing sequence data. However, they are both complex and memory intensive due to their recursive nature. 
These limitations make RNNs difficult to embed on mobile devices requiring real-time processes with limited hardware resources. To address the above issues, we introduce a method that can learn binary and ternary weights during the training phase to facilitate hardware implementations of RNNs. As a result, using this approach replaces all multiply-accumulate operations by simple accumulations, bringing significant benefits to custom hardware in terms of silicon area and power consumption. On the software side, we evaluate the performance (in terms of accuracy) of our method using long short-term memories (LSTMs) and gated recurrent units (GRUs) on various sequential models including sequence classification and language modeling. We demonstrate that our method achieves competitive results on the aforementioned tasks while using binary/ternary weights during the runtime. On the hardware side, we present custom hardware for accelerating the recurrent computations of LSTMs with binary/ternary weights. Ultimately, we show that LSTMs with binary/ternary weights can achieve up to 12x memory saving and 10x inference speedup compared to the full-precision hardware implementation design.",/pdf/20ed3f83c3c07d994e560f50e97fedbc65182702.pdf,ICLR,2019,"We propose high-performance LSTMs with binary/ternary weights, that can greatly reduce implementation complexity" +lSijhyKKsct,W-ZelrLdlUX,1601310000000.0,1614990000000.0,1855,Reinforcement Learning with Latent Flow,"[""~Wenling_Shang1"", ""w.xf@berkeley.edu"", ""~Aravind_Rajeswaran1"", ""~Aravind_Srinivas1"", ""~Yang_Gao1"", ""~Pieter_Abbeel2"", ""~Michael_Laskin1""]","[""Wenling Shang"", ""Xiaofei Wang"", ""Aravind Rajeswaran"", ""Aravind Srinivas"", ""Yang Gao"", ""Pieter Abbeel"", ""Michael Laskin""]","[""reinforcement learning"", ""deep learning"", ""machine learning"", ""deep reinforcement learning""]","Temporal information is essential to learning effective policies with Reinforcement Learning (RL). 
However, current state-of-the-art RL algorithms either assume that such information is given as part of the state space or, when learning from pixels, use the simple heuristic of frame-stacking to implicitly capture temporal information present in the image observations. This heuristic is in contrast to the current paradigm in video classification architectures, which utilize explicit encodings of temporal information through methods such as optical flow and two-stream architectures to achieve state-of-the-art performance. Inspired by leading video classification architectures, we introduce the Flow of Latents for Reinforcement Learning Flare, a network architecture for RL that explicitly encodes temporal information through latent vector differences. We show that Flare (i) recovers optimal performance in state-based RL without explicit access to the state velocity, solely with positional state information, (ii) achieves state-of-the-art performance on pixel-based continuous control tasks within the DeepMind control benchmark suite, (iii) is the most sample efficient model-free pixel-based RL algorithm on challenging environments in the DeepMind control suite such as quadruped walk, hopper hop, finger turn hard, pendulum swing, and walker run, outperforming the prior model-free state-of-the-art by 1.9 and 1.5 on the 500k and 1M step benchmarks, respectively, and (iv), when augmented over rainbow DQN, outperforms or matches the baseline on a diversity of challenging Atari games at 50M time step benchmark.",/pdf/2ca91ea74f6f2000a0fb71ae0301ceda03c9fee6.pdf,ICLR,2021,We investigate explicit encoding of temporal information in Deep Reinforcement Learning through latent vector differences and show SOTA results on the DeepMind control suite benchmark. 
+S1eFtj0cKQ,SJeqiKXcK7,1538090000000.0,1545360000000.0,463,Generative Models from the perspective of Continual Learning,"[""timothee.lesort@thalesgroup.com"", ""caselles@ensta.fr"", ""mgarciaortiz@softbankrobotics.com"", ""jean-francois.goudou@thalesgroup.com"", ""david.filliat@ensta.fr""]","[""Timoth\u00e9e Lesort"", ""Hugo Caselles-Dupr\u00e9"", ""Michael Garcia-Ortiz"", ""Jean-Fran\u00e7ois Goudou"", ""David Filliat""]","[""Generative Models"", ""Continual Learning""]","Which generative model is the most suitable for Continual Learning? This paper aims at evaluating and comparing generative models on disjoint sequential image generation tasks. We investigate how several models learn and forget, considering various strategies: rehearsal, regularization, generative replay and fine-tuning. We used two quantitative metrics to estimate the generation quality and memory ability. We experiment with sequential tasks on three commonly used benchmarks for Continual Learning (MNIST, Fashion MNIST and CIFAR10). We found that among all models, the original GAN performs best and among Continual Learning strategies, generative replay outperforms all other methods. Even if we found satisfactory combinations on MNIST and Fashion MNIST, training generative models sequentially on CIFAR10 is particularly unstable, and remains a challenge.",/pdf/18f5b6a7a14a714e3317a1daddbd5ba551983225.pdf,ICLR,2019,A comparative study of generative models on Continual Learning scenarios. 
+bXLMnw03KPz,7QJptJoIvm_,1601310000000.0,1614990000000.0,1913,FERMI: Fair Empirical Risk Minimization via Exponential Rényi Mutual Information,"[""~Rakesh_Pavan1"", ""lowya@usc.edu"", ""~Sina_Baharlouei1"", ""~Meisam_Razaviyayn1"", ""~Ahmad_Beirami1""]","[""Rakesh Pavan"", ""Andrew Lowy"", ""Sina Baharlouei"", ""Meisam Razaviyayn"", ""Ahmad Beirami""]","[""algorithmic fairness""]","Several notions of fairness, such as demographic parity and equal opportunity, are defined based on statistical independence between a predicted target and a sensitive attribute. In machine learning applications, however, the data distribution is unknown to the learner and statistical independence is not verifiable. Hence, the learner could only resort to empirical evaluation of the degree of fairness violation. Many fairness violation notions are defined as a divergence/distance between the joint distribution of the target and sensitive attributes and the Kronecker product of their marginals, such as \Renyi correlation, mutual information, $L_\infty$ distance, to name a few. + In this paper, we propose another notion of fairness violation, called Exponential R\'enyi Mutual Information (ERMI) between sensitive attributes and the predicted target. We show that ERMI is a strong fairness violation notion in the sense that it provides an upper bound guarantee on all of the aforementioned notions of fairness violation. We also propose the Fair Empirical Risk Minimization via ERMI regularization framework, called FERMI. Whereas existing in-processing fairness algorithms are deterministic, we provide a stochastic optimization method for solving FERMI that is amenable to large-scale problems. In addition, we provide a batch (deterministic) method to solve FERMI. Both of our proposed algorithms come with theoretical convergence guarantees. 
Our experiments show that FERMI achieves the most favorable tradeoffs between fairness violation and accuracy on test data across different problem setups, even when fairness violation is measured in notions other than ERMI. ",/pdf/ad73a29d7afdd2d33a9ca8ab07fb6501a0b1d0be.pdf,ICLR,2021,"We propose the Fair Empirical Risk Minimization via Exponential Rényi Mutual Information (FERMI) framework, and showcase its effectiveness theoretically and empirically." +S1ey2sRcYQ,rkx81-I9KX,1538090000000.0,1545360000000.0,674,Direct Optimization through $\arg \max$ for Discrete Variational Auto-Encoder,"[""guy_lorber@campus.technion.ac.il"", ""tamir.hazan@technion.ac.il""]","[""Guy Lorberbom"", ""Tamir Hazan""]","[""discrete variational auto encoders"", ""generative models"", ""perturbation models""]","Reparameterization of variational auto-encoders is an effective method for reducing the variance of their gradient estimates. However, when the latent variables are discrete, a reparameterization is problematic due to discontinuities in the discrete space. In this work, we extend the direct loss minimization technique to discrete variational auto-encoders. We first reparameterize a discrete random variable using the $\arg \max$ function of the Gumbel-Max perturbation model. We then use direct optimization to propagate gradients through the non-differentiable $\arg \max$ using two perturbed $\arg \max$ operations. 
+",/pdf/3ddfa8f3ab7ea91f6e06dc5dfe4a138f1614bc81.pdf,ICLR,2019, +ry4SNTe0-,SJQSNTlCb,1509110000000.0,1518730000000.0,414,Improve Training Stability of Semi-supervised Generative Adversarial Networks with Collaborative Training,"[""daleiwu@gmail.com"", ""dalei.wu@huawei.com"", ""liuxh3@huawei.com""]","[""Dalei Wu"", ""Xiaohua Liu""]","[""generative adversarial training"", ""semi-supervised training"", ""collaborative training""]","Improved generative adversarial network (Improved GAN) is a successful method of using generative adversarial models to solve the problem of semi-supervised learning. However, it suffers from the problem of unstable training. In this paper, we found that the instability is mostly due to the vanishing gradients on the generator. To remedy this issue, we propose a new method to use collaborative training to improve the stability of semi-supervised GAN with the combination of Wasserstein GAN. The experiments have shown that our proposed method is more stable than the original Improved GAN and achieves comparable classification accuracy on different data sets. ",/pdf/306340feeb83895cceef15b4bedb480f439ec901.pdf,ICLR,2018,Improve Training Stability of Semi-supervised Generative Adversarial Networks with Collaborative Training +XkI_ggnfLZ4,0Kb0IQRFf0n,1601310000000.0,1614990000000.0,701,Uncovering the impact of hyperparameters for global magnitude pruning,"[""~Janice_Lan1"", ""~Rudy_Chin2"", ""~Alexei_Baevski1"", ""~Ari_S._Morcos1""]","[""Janice Lan"", ""Rudy Chin"", ""Alexei Baevski"", ""Ari S. Morcos""]","[""deep learning"", ""pruning"", ""understanding""]","A common paradigm in model pruning is to train a model, prune, and then either fine-tune or, in the lottery ticket framework, reinitialize and retrain. Prior work has implicitly assumed that the best training configuration for model evaluation is also the best configuration for mask discovery. 
However, what if a training configuration which yields worse performance actually yields a mask which trains to higher performance? To test this, we decoupled the hyperparameters for mask discovery (H_find) and mask evaluation (H_eval). Using unstructured magnitude pruning on vision classification tasks, we discovered the ""decoupled find-eval phenomenon,"" in which certain H_find values lead to models which have lower performance, but generate masks with substantially higher eventual performance compared to using the same hyperparameters for both stages. We show that this phenomenon holds across a number of models, datasets, configurations, and also for one-shot structured pruning. Finally, we demonstrate that different H_find values yield masks with materially different layerwise pruning ratios and that the decoupled find-eval phenomenon is causally mediated by these ratios. Our results demonstrate the practical utility of decoupling hyperparameters and provide clear insights into the mechanisms underlying this counterintuitive effect. ",/pdf/2cc3a313bbc00938e8a88598a059d754769870e4.pdf,ICLR,2021,"When pruning, we should decouple the hyperparameters used to find the mask and to evaluate the mask; some hyperparameters, despite leading to better accuracy pre-pruning, lead to bad layerwise pruning ratios, which causes decreased pruned accuracy." +B1gHjoRqYQ,rkgYkHu9YQ,1538090000000.0,1545360000000.0,619,An Efficient and Margin-Approaching Zero-Confidence Adversarial Attack,"[""yang.zhang2@ibm.com"", ""shiyu.chang@ibm.com"", ""yum@us.ibm.com"", ""kqian3@illinois.edu""]","[""Yang Zhang"", ""Shiyu Chang"", ""Mo Yu"", ""Kaizhi Qian""]","[""adversarial attack"", ""zero-confidence attack""]","There are two major paradigms of white-box adversarial attacks that attempt to impose input perturbations. The first paradigm, called the fix-perturbation attack, crafts adversarial samples within a given perturbation level. 
The second paradigm, called the zero-confidence attack, finds the smallest perturbation needed to cause misclassification, also known as the margin of an input feature. While the former paradigm is well-resolved, the latter is not. Existing zero-confidence attacks either introduce significant approximation errors, or are too time-consuming. We therefore propose MarginAttack, a zero-confidence attack framework that is able to compute the margin with improved accuracy and efficiency. Our experiments show that MarginAttack is able to compute a smaller margin than the state-of-the-art zero-confidence attacks, and matches the state-of-the-art fix-perturbation attacks. In addition, it runs significantly faster than the Carlini-Wagner attack, currently the most accurate zero-confidence attack algorithm.",/pdf/314981c3fbefb5652b34535e82efe8f2258bcfee.pdf,ICLR,2019,"This paper introduces MarginAttack, a stronger and faster zero-confidence adversarial attack." +8Sqhl-nF50,9cRGMRyqV5w,1601310000000.0,1616060000000.0,1725,On the Curse of Memory in Recurrent Neural Networks: Approximation and Optimization Analysis,"[""~Zhong_Li2"", ""~Jiequn_Han1"", ""~Weinan_E1"", ""~Qianxiao_Li1""]","[""Zhong Li"", ""Jiequn Han"", ""Weinan E"", ""Qianxiao Li""]","[""recurrent neural network"", ""dynamical system"", ""universal approximation"", ""optimization"", ""curse of memory""]","We study the approximation properties and optimization dynamics of recurrent neural networks (RNNs) when applied to learn input-output relationships in temporal data. We consider the simple but representative setting of using continuous-time linear RNNs to learn from data generated by linear relationships. Mathematically, the latter can be understood as a sequence of linear functionals. We prove a universal approximation theorem of such linear functionals and characterize the approximation rate. Moreover, we perform a fine-grained dynamical analysis of training linear RNNs by gradient methods. 
A unifying theme uncovered is the non-trivial effect of memory, a notion that can be made precise in our framework, on both approximation and optimization: when there is long-term memory in the target, it takes a large number of neurons to approximate it. Moreover, the training process will suffer from slowdowns. In particular, both of these effects become exponentially more pronounced with increasing memory - a phenomenon we call the “curse of memory”. These analyses represent a basic step towards a concrete mathematical understanding of new phenomena that may arise in learning temporal relationships using recurrent architectures.",/pdf/1572c344c730b678160659f1c3497d51342252bd.pdf,ICLR,2021,"We study the approximation properties and optimization dynamics of RNNs in the linear setting, where we uncover precisely the adverse effect of memory on learning." +r1f0YiCctm,BklJug59K7,1538090000000.0,1559840000000.0,493,Minimal Random Code Learning: Getting Bits Back from Compressed Model Parameters,"[""mh740@cam.ac.uk"", ""rp587@cam.ac.uk"", ""jmh233@cam.ac.uk""]","[""Marton Havasi"", ""Robert Peharz"", ""Jos\u00e9 Miguel Hern\u00e1ndez-Lobato""]","[""compression"", ""neural networks"", ""bits-back argument"", ""Bayesian"", ""Shannon"", ""information theory""]","While deep neural networks are a highly successful model class, their large memory footprint puts considerable strain on energy consumption, communication bandwidth, and storage requirements. Consequently, model size reduction has become an utmost goal in deep learning. A typical approach is to train a set of deterministic weights, while applying certain techniques such as pruning and quantization, in order that the empirical weight distribution becomes amenable to Shannon-style coding schemes. However, as shown in this paper, relaxing weight determinism and using a full variational distribution over weights allows for more efficient coding schemes and consequently higher compression rates. 
In particular, following the classical bits-back argument, we encode the network weights using a random sample, requiring only a number of bits corresponding to the Kullback-Leibler divergence between the sampled variational distribution and the encoding distribution. By imposing a constraint on the Kullback-Leibler divergence, we are able to explicitly control the compression rate, while optimizing the expected loss on the training set. The employed encoding scheme can be shown to be close to the optimal information-theoretical lower bound, with respect to the employed variational family. Our method sets new state-of-the-art in neural network compression, as it strictly dominates previous approaches in a Pareto sense: On the benchmarks LeNet-5/MNIST and VGG-16/CIFAR-10, our approach yields the best test performance for a fixed memory budget, and vice versa, it achieves the highest compression rates for a fixed test performance.",/pdf/44c2f8b0aa023cdfa824e9d917e51fb348205bb3.pdf,ICLR,2019,This paper proposes an effective method to compress neural networks based on recent results in information theory. +ZK6vTvb84s,u4_GC8pSv5-,1601310000000.0,1615990000000.0,855,A Trainable Optimal Transport Embedding for Feature Aggregation and its Relationship to Attention,"[""~Gr\u00e9goire_Mialon1"", ""~Dexiong_Chen1"", ""~Alexandre_d'Aspremont1"", ""~Julien_Mairal1""]","[""Gr\u00e9goire Mialon"", ""Dexiong Chen"", ""Alexandre d'Aspremont"", ""Julien Mairal""]","[""bioinformatics"", ""optimal transport"", ""kernel methods"", ""attention"", ""transformers""]","We address the problem of learning on sets of features, motivated by the need of performing pooling operations in long biological sequences of varying sizes, with long-range dependencies, and possibly few labeled data. 
To address this challenging task, we introduce a parametrized representation of fixed size, which embeds and then aggregates elements from a given input set according to the optimal transport plan between the set and a trainable reference. Our approach scales to large datasets and allows end-to-end training of the reference, while also providing a simple unsupervised learning mechanism with small computational cost. Our aggregation technique admits two useful interpretations: it may be seen as a mechanism related to attention layers in neural networks, or it may be seen as a scalable surrogate of a classical optimal transport-based kernel. We experimentally demonstrate the effectiveness of our approach on biological sequences, achieving state-of-the-art results for protein fold recognition and detection of chromatin profiles tasks, and, as a proof of concept, we show promising results for processing natural language sequences. We provide an open-source implementation of our embedding that can be used alone or as a module in larger learning models at https://github.com/claying/OTK.",/pdf/209f1d71b4e8e73a59634e771ff1c3a43d72b849.pdf,ICLR,2021,"We propose a new, trainable embedding for large sets of features such as biological sequences, and demonstrate its effectiveness." +nsZGadY22N4,Ejpcut_P1A3,1601310000000.0,1614990000000.0,2085,Weighted Bellman Backups for Improved Signal-to-Noise in Q-Updates,"[""~Kimin_Lee1"", ""~Michael_Laskin1"", ""~Aravind_Srinivas1"", ""~Pieter_Abbeel2""]","[""Kimin Lee"", ""Michael Laskin"", ""Aravind Srinivas"", ""Pieter Abbeel""]","[""Deep reinforcement learning"", ""ensemble learning"", ""Q-learning""]","Off-policy deep reinforcement learning (RL) has been successful in a range of challenging domains. However, standard off-policy RL algorithms can suffer from low signal and even instability in Q-learning because target values are derived from current Q-estimates, which are often noisy. 
To mitigate the issue, we propose ensemble-based weighted Bellman backups, which re-weight target Q-values based on uncertainty estimates from a Q-ensemble. We empirically observe that the proposed method stabilizes and improves learning on both continuous and discrete control benchmarks. We also specifically investigate the signal-to-noise aspect by studying environments with noisy rewards, and find that weighted Bellman backups significantly outperform standard Bellman backups. Furthermore, since our weighted Bellman backups rely on maintaining an ensemble, we investigate how weighted Bellman backups interact with UCB Exploration. By enforcing the diversity between agents using Bootstrap, we show that these different ideas are largely orthogonal and can be fruitfully integrated, together further improving the performance of existing off-policy RL algorithms, such as Soft Actor-Critic and Rainbow DQN, for both continuous and discrete control tasks on both low-dimensional and high-dimensional environments.",/pdf/2f2bba140117c02853d83e56de0d0abb80fd3fe9.pdf,ICLR,2021,"We propose ensemble-based weighted Bellman backups for preventing error propagation in Q-learning, and introduce a simple unified ensemble method that handles various issues in off-policy RL algorithms." +Sk0pHeZAW,H1RTHlZAW,1509130000000.0,1518730000000.0,587,Sparse Regularized Deep Neural Networks For Efficient Embedded Learning,"[""jb4e14@soton.ac.uk""]","[""Jia Bi""]","[""Sparse representation"", ""Compression Deep Learning Models"", ""L1 regularisation"", ""Optimisation.""]","Deep learning is becoming more widespread in its application due to its power in solving complex classification problems. However, deep learning models often require large memory and energy consumption, which may prevent them from being deployed effectively on embedded platforms, limiting their applications. 
This work addresses the problem by proposing methods {\em Weight Reduction Quantisation} for compressing the memory footprint of the models, including reducing the number of weights and the number of bits to store each weight. Beside, applying with sparsity-inducing regularization, our work focuses on speeding up stochastic variance reduced gradients (SVRG) optimization on non-convex problem. Our method that mini-batch SVRG with $\ell$1 regularization on non-convex problem has faster and smoother convergence rates than SGD by using adaptive learning rates. Experimental evaluation of our approach uses MNIST and CIFAR-10 datasets on LeNet-300-100 and LeNet-5 models, showing our approach can reduce the memory requirements both in the convolutional and fully connected layers by up to 60$\times$ without affecting their test accuracy.",/pdf/169794ee15ebfcba024cb7a63289ac13ced0574a.pdf,ICLR,2018,Compression of Deep neural networks deployed on embedded device. +S1gN8yrYwB,BkgKuIpdwS,1569440000000.0,1577170000000.0,1725,AUGMENTED POLICY GRADIENT METHODS FOR EFFICIENT REINFORCEMENT LEARNING,"[""kai.lagemann@rwth-aachen.de"", ""gregor.roering@rwth-aachen.de"", ""christoph.henke@ifu.rwth-aachen.de"", ""rene.vossen@ifu.rwth-aachen.de"", ""hees.office@ima-ifu.rwth-aachen.de""]","[""Kai Lagemann"", ""Gregor Roering"", ""Christoph Henke"", ""Rene Vossen"", ""Frank Hees""]","[""model-free reinforcement learning"", ""model-based reinforcement learning"", ""Baysian neural network"", ""deep learning"", ""reinforcement learning""]","We propose a new mixture of model-based and model-free reinforcement learning +(RL) algorithms that combines the strengths of both RL methods. Our goal is to reduce the sample complexity of model-free approaches utilizing fictitious trajectory +rollouts performed on a learned dynamics model to improve the data efficiency of +policy gradient methods while maintaining the same asymptotic behaviour. 
We +suggest to use a special type of uncertainty quantification by a stochastic dynamics +model in which the next state prediction is randomly drawn from the distribution +predicted by the dynamics model. As a result, the negative effect of exploiting +erroneously optimistic regions in the dynamics model is addressed by next state +predictions based on an uncertainty aware ensemble of dynamics models. The +influence of the ensemble of dynamics models on the policy update is controlled +by adjusting the number of virtually performed rollouts in the next iteration according to the ratio of the real and virtual total reward. Our approach, which we +call Model-Based Policy Gradient Enrichment (MBPGE), is tested on a collection of benchmark tests including simulated robotic locomotion. We compare our +approach to plain model-free algorithms and a model-based one. Our evaluation +shows that MBPGE leads to higher learning rates in an early training stage and an +improved asymptotic behaviour.",/pdf/06d9be6f2c3129c83f066142401777d6bb2566ff.pdf,ICLR,2020, +xGZG2kS5bFk,U6ucC-Pvf-E,1601310000000.0,1615980000000.0,1186, Dance Revolution: Long-Term Dance Generation with Music via Curriculum Learning,"[""~Ruozi_Huang1"", ""~Huang_Hu1"", ""~Wei_Wu1"", ""kesawada@microsoft.com"", ""mi_zhang@fudan.edu.cn"", ""djiang@microsoft.com""]","[""Ruozi Huang"", ""Huang Hu"", ""Wei Wu"", ""Kei Sawada"", ""Mi Zhang"", ""Daxin Jiang""]","[""Multimodal Learning"", ""Computer Vision"", ""Sequence Modeling"", ""Generative Models""]","Dancing to music is one of human's innate abilities since ancient times. In machine learning research, however, synthesizing dance movements from music is a challenging problem. Recently, researchers synthesize human motion sequences through autoregressive models like recurrent neural network (RNN). Such an approach often generates short sequences due to an accumulation of prediction errors that are fed back into the neural network. 
This problem becomes even more severe in the long motion sequence generation. Besides, the consistency between dance and music in terms of style, rhythm and beat is yet to be taken into account during modeling. In this paper, we formalize the music-driven dance generation as a sequence-to-sequence learning problem and devise a novel seq2seq architecture to efficiently process long sequences of music features and capture the fine-grained correspondence between music and dance. Furthermore, we propose a novel curriculum learning strategy to alleviate error accumulation of autoregressive models in long motion sequence generation, which gently changes the training process from a fully guided teacher-forcing scheme using the previous ground-truth movements, towards a less guided autoregressive scheme mostly using the generated movements instead. Extensive experiments show that our approach significantly outperforms the existing state-of-the-arts on automatic metrics and human evaluation. We also make a demo video to demonstrate the superior performance of our proposed approach at https://www.youtube.com/watch?v=lmE20MEheZ8.",/pdf/e8c5ca7f64bdc5ae6998cf6e5e47931d79341d40.pdf,ICLR,2021, +aYuZO9DIdnn,m0ggD8XEuPT,1601310000000.0,1611610000000.0,523,The Unreasonable Effectiveness of Patches in Deep Convolutional Kernels Methods,"[""~Louis_THIRY1"", ""~Michael_Arbel1"", ""~Eugene_Belilovsky1"", ""~Edouard_Oyallon1""]","[""Louis THIRY"", ""Michael Arbel"", ""Eugene Belilovsky"", ""Edouard Oyallon""]","[""convolutional kernel methods"", ""image classification""]","A recent line of work showed that various forms of convolutional kernel methods can be competitive with standard supervised deep convolutional networks on datasets like CIFAR-10, obtaining accuracies in the range of 87-90% while being more amenable to theoretical analysis. 
In this work, we highlight the importance of a data-dependent feature extraction step that is key to the obtain good performance in convolutional kernel methods. This step typically corresponds to a whitened dictionary of patches, and gives rise to a data-driven convolutional kernel methods.We extensively study its effect, demonstrating it is the key ingredient for high performance of these methods. Specifically, we show that one of the simplest instances of such kernel methods, based on a single layer of image patches followed by a linear classifier is already obtaining classification accuracies on CIFAR-10 in the same range as previous more sophisticated convolutional kernel methods. We scale this method to the challenging ImageNet dataset, showing such a simple approach can exceed all existing non-learned representation methods. This is a new baseline for object recognition without representation learning methods, that initiates the investigation of convolutional kernel models on ImageNet. We conduct experiments to analyze the dictionary that we used, our ablations showing they exhibit low-dimensional properties.",/pdf/a4eeda4ce2da499c9d47258922e4e8d57aaceb28.pdf,ICLR,2021,Patch-based representation is a key ingredient for competitive convolutional kernel methods. +r1l-VeSKwS,ByeFxLlKvr,1569440000000.0,1577170000000.0,2237,SemanticAdv: Generating Adversarial Examples via Attribute-Conditional Image Editing,"[""haonanqiu@link.cuhk.edu.cn"", ""xiaocw@umich.edu"", ""yl016@ie.cuhk.edu.hk"", ""xcyan@umich.edu"", ""honglak@eecs.umich.edu"", ""lxbosky@gmail.com""]","[""Haonan Qiu"", ""Chaowei Xiao"", ""Lei Yang"", ""Xinchen Yan"", ""HongLak Lee"", ""Bo Li""]","[""adversarial examples"", ""semantic attack""]","Deep neural networks (DNNs) have achieved great success in various applications due to their strong expressive power. 
However, recent studies have shown that DNNs are vulnerable to adversarial examples which are manipulated instances targeting to mislead DNNs to make incorrect predictions. Currently, most such adversarial examples try to guarantee “subtle perturbation"" by limiting the Lp norm of the perturbation. In this paper, we aim to explore the impact of semantic manipulation on DNNs predictions by manipulating the semantic attributes of images and generate “unrestricted adversarial examples"". Such semantic based perturbation is more practical compared with the Lp bounded perturbation. In particular, we propose an algorithm SemanticAdv which leverages disentangled semantic factors to generate adversarial perturbation by altering controlled semantic attributes to fool the learner towards various “adversarial"" targets. We conduct extensive experiments to show that the semantic based adversarial examples can not only fool different learning tasks such as face verification and landmark detection, but also achieve high targeted attack success rate against real-world black-box services such as Azure face verification service based on transferability. To further demonstrate the applicability of SemanticAdv beyond face recognition domain, we also generate semantic perturbations on street-view images. Such adversarial examples with controlled semantic manipulation can shed light on further understanding about vulnerabilities of DNNs as well as potential defensive approaches.",/pdf/f3fd4f5a5545c1c4236d566d6cdb91e5ad06d729.pdf,ICLR,2020,A novel and effective semantic adversarial attack method. 
+TlPHO_duLv,5AfydAkltQ,1601310000000.0,1614990000000.0,726,Towards Noise-resistant Object Detection with Noisy Annotations,"[""~Junnan_Li2"", ""~Caiming_Xiong1"", ""~Steven_Hoi2""]","[""Junnan Li"", ""Caiming Xiong"", ""Steven Hoi""]","[""noisy annotation"", ""object detection"", ""label noise""]","Training deep object detectors requires large amounts of human-annotated images with accurate object labels and bounding box coordinates, which are extremely expensive to acquire. Noisy annotations are much more easily accessible, but they could be detrimental for learning. We address the challenging problem of training object detectors with noisy annotations, where the noise contains a mixture of label noise and bounding box noise. We propose a learning framework which jointly optimizes object labels, bounding box coordinates, and model parameters by performing alternating noise correction and model training. To disentangle label noise and bounding box noise, we propose a two-step noise correction method. The first step performs class-agnostic bounding box correction, and the second step performs label correction and class-specific bounding box refinement. We conduct experiments on PASCAL VOC and MS-COCO dataset with both synthetic noise and machine-generated noise. Our method achieves state-of-the-art performance by effectively cleaning both label noise and bounding box noise.",/pdf/84d3f46f0195c0bf3b5e15e48baf53285d12686f.pdf,ICLR,2021,We propose a noise-resistant training framework for learning object detectors from noisy annotations with entangled label noise and bounding box noise. 
+ByZmGjkA-,rJl7GoyCb,1509040000000.0,1518730000000.0,156,Understanding Grounded Language Learning Agents,"[""felixhill@google.com"", ""kmh@google.com"", ""pblunsom@google.com"", ""clarkstephen@google.com""]","[""Felix Hill"", ""Karl Moritz Hermann"", ""Phil Blunsom"", ""Stephen Clark""]","[""Language AI Learning Reinforcement Deep""]","Neural network-based systems can now learn to locate the referents of words and phrases in images, answer questions about visual scenes, and even execute symbolic instructions as first-person actors in partially-observable worlds. To achieve this so-called grounded language learning, models must overcome certain well-studied learning challenges that are also fundamental to infants learning their first words. While it is notable that models with no meaningful prior knowledge overcome these learning obstacles, AI researchers and practitioners currently lack a clear understanding of exactly how they do so. Here we address this question as a way of achieving a clearer general understanding of grounded language learning, both to inform future research and to improve confidence in model predictions. For maximum control and generality, we focus on a simple neural network-based language learning agent trained via policy-gradient methods to interpret synthetic linguistic instructions in a simulated 3D world. We apply experimental paradigms from developmental psychology to this agent, exploring the conditions under which established human biases and learning effects emerge. 
We further propose a novel way to visualise and analyse semantic representation in grounded language learning agents that yields a plausible computational account of the observed effects.",/pdf/619857289ba3fefcd584d23aaa3a03f4461829b4.pdf,ICLR,2018,Analysing and understanding how neural network agents learn to understand simple grounded language +10XWPuAro86,GKqEiKjZ-_,1601310000000.0,1614990000000.0,1987,Hamiltonian Q-Learning: Leveraging Importance-sampling for Data Efficient RL,"[""~Udari_Madhushani1"", ""~Biswadip_Dey2"", ""~Naomi_Leonard1"", ""~Amit_Chakraborty2""]","[""Udari Madhushani"", ""Biswadip Dey"", ""Naomi Leonard"", ""Amit Chakraborty""]","[""Data efficient RL"", ""$Q$-Learning"", ""Hamiltonian Monte Carlo""]","Model-free reinforcement learning (RL), in particular $Q$-learning is widely used to learn optimal policies for a variety of planning and control problems. However, when the underlying state-transition dynamics are stochastic and high-dimensional, $Q$-learning requires a large amount of data and incurs a prohibitively high computational cost. In this paper, we introduce Hamiltonian $Q$-Learning, a data efficient modification of the $Q$-learning approach, which adopts an importance-sampling based technique for computing the $Q$ function. To exploit stochastic structure of the state-transition dynamics, we employ Hamiltonian Monte Carlo to update $Q$ function estimates by approximating the expected future rewards using $Q$ values associated with a subset of next states. Further, to exploit the latent low-rank structure of the dynamic system, Hamiltonian $Q$-Learning uses a matrix completion algorithm to reconstruct the updated $Q$ function from $Q$ value updates over a much smaller subset of state-action pairs. 
By providing an efficient way to apply $Q$-learning in stochastic, high-dimensional problems, the proposed approach broadens the scope of RL algorithms for real-world applications, including classical control tasks and environmental monitoring.",/pdf/42c486c45a66f44cf10fd6bff982a9081733f50c.pdf,ICLR,2021,"We propose a data efficient modification of the $Q$-learning approach which uses Hamiltonian Monte Carlo to compute $Q$ function for problems with stochastic, high-dimensional dynamics." +B1eXbn05t7,SyeZ2x39tm,1538090000000.0,1545360000000.0,1157,Open-Ended Content-Style Recombination Via Leakage Filtering,"[""karl.ridgeway@colorado.edu"", ""mozer@colorado.edu""]","[""Karl Ridgeway"", ""Michael C. Mozer""]",[],"We consider visual domains in which a class label specifies the content of an image, and class-irrelevant properties that differentiate instances constitute the style. We present a domain-independent method that permits the open-ended recombination of style of one image with the content of another. Open ended simply means that the method generalizes to style and content not present in the training data. The method starts by constructing a content embedding using an existing deep metric-learning technique. This trained content encoder is incorporated into a variational autoencoder (VAE), paired with a to-be-trained style encoder. The VAE reconstruction loss alone is inadequate to ensure a decomposition of the latent representation into style and content. Our method thus includes an auxiliary loss, leakage filtering, which ensures that no style information remaining in the content representation is used for reconstruction and vice versa. We synthesize novel images by decoding the style representation obtained from one image with the content representation from another. 
Using this method for data-set augmentation, we obtain state-of-the-art performance on few-shot learning tasks.",/pdf/b5a050a22c2c38c1dc0c020d13ee2bd72cd79df0.pdf,ICLR,2019, +H1GEvHcee,,1478280000000.0,1483690000000.0,257,Annealing Gaussian into ReLU: a New Sampling Strategy for Leaky-ReLU RBM,"[""chunlial@cs.cmu.edu"", ""mravanba@cs.cmu.edu"", ""bapoczos@cs.cmu.edu""]","[""Chun-Liang Li"", ""Siamak Ravanbakhsh"", ""Barnabas Poczos""]","[""Deep learning"", ""Unsupervised Learning""]","Restricted Boltzmann Machine (RBM) is a bipartite graphical model that is used as the building block in energy-based deep generative models. Due to numerical stability and quantifiability of the likelihood, RBM is commonly used with Bernoulli units. Here, we consider an alternative member of exponential family RBM with leaky rectified linear units -- called leaky RBM. We first study the joint and marginal distributions of leaky RBM under different leakiness, which provides us important insights by connecting the leaky RBM model and truncated Gaussian distributions. The connection leads us to a simple yet efficient method for sampling from this model, where the basic idea is to anneal the leakiness rather than the energy; -- i.e., start from a fully Gaussian/Linear unit and gradually decrease the leakiness over iterations. This serves as an alternative to the annealing of the temperature parameter and enables numerical estimation of the likelihood that are more efficient and more accurate than the commonly used annealed importance sampling (AIS). We further demonstrate that the proposed sampling algorithm enjoys faster mixing property than contrastive divergence algorithm, which benefits the training without any additional computational cost.",/pdf/4b8398856bc85d6225b5b22e456cc393e761ff00.pdf,ICLR,2017,We study fundamental property of leaky RBM. We link the leaky RBM and truncated Gaussian distribution and propose a novel sampling algorithm without additional computation cost. 
+ryF7rTqgl,,1478310000000.0,1479150000000.0,546,Understanding intermediate layers using linear classifier probes,"[""guillaume.alain.umontreal@gmail.com"", ""yoshua.bengio@gmail.com""]","[""Guillaume Alain"", ""Yoshua Bengio""]",[],"Neural network models have a reputation for being black boxes. We propose a new method to better understand the roles and dynamics of the intermediate layers. This has direct consequences on the design of such models and it enables the expert to be able to justify certain heuristics (such as adding auxiliary losses in middle layers). Our method uses linear classifiers, referred to as ``probes'', where a probe can only use the hidden units of a given intermediate layer as discriminating features. Moreover, these probes cannot affect the training phase of a model, and they are generally added after training. They allow the user to visualize the state of the model at multiple steps of training. We demonstrate how this can be used to develop a better intuition about models and to diagnose potential problems.",/pdf/586688417dcf976ead346448e8d1ac3dc81e4852.pdf,ICLR,2017,New useful concept of information to understand deep learning. +H1x-x309tm,BklbDgC5t7,1538090000000.0,1550860000000.0,1048,On the Convergence of A Class of Adam-Type Algorithms for Non-Convex Optimization,"[""chen5719@umn.edu"", ""sijia.liu@ibm.com"", ""ruoyus@illinois.edu"", ""mhong@umn.edu""]","[""Xiangyi Chen"", ""Sijia Liu"", ""Ruoyu Sun"", ""Mingyi Hong""]","[""nonconvex optimization"", ""Adam"", ""convergence analysis""]","This paper studies a class of adaptive gradient based momentum algorithms that update the search directions and learning rates simultaneously using past gradients. This class, which we refer to as the ''``Adam-type'', includes the popular algorithms such as Adam, AMSGrad, AdaGrad. Despite their popularity in training deep neural networks (DNNs), the convergence of these algorithms for solving non-convex problems remains an open question. 
In this paper, we develop an analysis framework and a set of mild sufficient conditions that guarantee the convergence of the Adam-type methods, with a convergence rate of order $O(\log{T}/\sqrt{T})$ for non-convex stochastic optimization. Our convergence analysis applies to a new algorithm called AdaFom (AdaGrad with First Order Momentum). We show that the conditions are essential, by identifying concrete examples in which violating the conditions makes an algorithm diverge. Besides providing one of the first comprehensive analysis for Adam-type methods in the non-convex setting, our results can also help the practitioners to easily monitor the progress of algorithms and determine their convergence behavior. ",/pdf/fc3a9833dfaa1459180681074788a35e5cb8521e.pdf,ICLR,2019,"We analyze convergence of Adam-type algorithms and provide mild sufficient conditions to guarantee their convergence, we also show violating the conditions can makes an algorithm diverge." +LT0KSFnQDWF,nE3Zs0npbFB,1601310000000.0,1614990000000.0,2153,Improving Graph Neural Network Expressivity via Subgraph Isomorphism Counting,"[""~Giorgos_Bouritsas1"", ""~Fabrizio_Frasca1"", ""~Stefanos_Zafeiriou1"", ""~Michael_M._Bronstein1""]","[""Giorgos Bouritsas"", ""Fabrizio Frasca"", ""Stefanos Zafeiriou"", ""Michael M. Bronstein""]","[""graph neural networks"", ""graph representation learning"", ""network analysis"", ""network motifs"", ""subgraph isomoprhism""]","While Graph Neural Networks (GNNs) have achieved remarkable results in a variety of applications, recent studies exposed important shortcomings in their ability to capture the structure of the underlying graph. It has been shown that the expressive power of standard GNNs is bounded by the Weisfeiler-Lehman (WL) graph isomorphism test, from which they inherit proven limitations such as the inability to detect and count graph substructures. On the other hand, there is significant empirical evidence, e.g. 
in network science and bioinformatics, that substructures are often informative for downstream tasks, suggesting that it is desirable to design GNNs capable of leveraging this important source of information. To this end, we propose a novel topologically-aware message passing scheme based on substructure encoding. We show that our architecture allows incorporating domain-specific inductive biases and that it is strictly more expressive than the WL test. Importantly, in contrast to recent works on the expressivity of GNNs, we do not attempt to adhere to the WL hierarchy; this allows us to retain multiple attractive properties of standard GNNs such as locality and linear network complexity, while being able to disambiguate even hard instances of graph isomorphism. We extensively evaluate our method on graph classification and regression tasks and show state-of-the-art results on multiple datasets including molecular graphs and social networks.",/pdf/fb78801fd3f4d7624899fc211252a2b877b3315e.pdf,ICLR,2021,We show that enhancing message passing neural networks with subgraph encodings improves their expressive power and allows incorporating domain specific prior knowledge +9p2CltauWEY,KKnC8sTM6uQ,1601310000000.0,1614990000000.0,520,On Size Generalization in Graph Neural Networks,"[""~Gilad_Yehudai2"", ""~Ethan_Fetaya1"", ""~Eli_Meirom2"", ""~Gal_Chechik1"", ""~Haggai_Maron1""]","[""Gilad Yehudai"", ""Ethan Fetaya"", ""Eli Meirom"", ""Gal Chechik"", ""Haggai Maron""]","[""graph neural networks"", ""gnn"", ""generalization"", ""Weisfeiler-Lehman""]","Graph neural networks (GNNs) can process graphs of different sizes but their capacity to generalize across sizes is still not well understood. Size generalization is key to numerous GNN applications, from solving combinatorial optimization problems to learning in molecular biology. In such problems, obtaining labels and training on large graphs can be prohibitively expensive, but training on smaller graphs is possible. 
+ +This paper puts forward the size-generalization question and characterizes important aspects of that problem theoretically and empirically. +We prove that even for very simple tasks, such as counting the number of nodes or edges in a graph, GNNs do not naturally generalize to graphs of larger size. Instead, their generalization performance is closely related to the distribution of local patterns of connectivity and features and how that distribution changes from small to large graphs. Specifically, we prove that for many tasks, there are weight assignments for GNNs that can perfectly solve the task on small graphs but fail on large graphs, if there is a discrepancy between their local patterns. We further demonstrate on several tasks, that training GNNs on small graphs results in solutions which do not generalize to larger graphs. We then formalize size generalization as a domain-adaption problem and describe two learning setups where size generalization can be improved. First, as a self-supervised learning problem (SSL) over the target domain of large graphs. Second as a semi-supervised learning problem when few samples are available in the target domain. We demonstrate the efficacy of these solutions on a diverse set of benchmark graph datasets. ",/pdf/e81854f653c0c15fc70b0201236bf9e86fa3b2b3.pdf,ICLR,2021,"Graph neural networks can process graphs of any size, yet their capacity to generalize across sizes is unclear. We study the problem of size generalization both empirically and theoretically." 
+BJeS62EtwH,S1xm-iIXwB,1569440000000.0,1583910000000.0,225,Knowledge Consistency between Neural Networks and Beyond,"[""nexuslrf@sjtu.edu.cn"", ""litl@act.buaa.edu.cn"", ""1776752575@sjtu.edu.cn"", ""wangjing215@huawei.com"", ""zqs1022@sjtu.edu.cn""]","[""Ruofan Liang"", ""Tianlin Li"", ""Longfei Li"", ""Jing Wang"", ""Quanshi Zhang""]","[""Deep Learning"", ""Interpretability"", ""Convolutional Neural Networks""]","This paper aims to analyze knowledge consistency between pre-trained deep neural networks. We propose a generic definition for knowledge consistency between neural networks at different fuzziness levels. A task-agnostic method is designed to disentangle feature components, which represent the consistent knowledge, from raw intermediate-layer features of each neural network. As a generic tool, our method can be broadly used for different applications. In preliminary experiments, we have used knowledge consistency as a tool to diagnose representations of neural networks. Knowledge consistency provides new insights to explain the success of existing deep-learning techniques, such as knowledge distillation and network compression. More crucially, knowledge consistency can also be used to refine pre-trained networks and boost performance.",/pdf/b92fb2d94a906119c4ccf7a72d9dfd1bdb164f69.pdf,ICLR,2020, +9G5MIc-goqB,fezewxx4WKL,1601310000000.0,1615780000000.0,1123,Reweighting Augmented Samples by Minimizing the Maximal Expected Loss,"[""~Mingyang_Yi1"", ""~Lu_Hou2"", ""~Lifeng_Shang1"", ""~Xin_Jiang1"", ""~Qun_Liu1"", ""~Zhi-Ming_Ma1""]","[""Mingyang Yi"", ""Lu Hou"", ""Lifeng Shang"", ""Xin Jiang"", ""Qun Liu"", ""Zhi-Ming Ma""]","[""data augmentation"", ""sample reweighting""]","Data augmentation is an effective technique to improve the generalization of deep neural networks. However, previous data augmentation methods usually treat the augmented samples equally without considering their individual impacts on the model. 
To address this, for the augmented samples from the same training example, we propose to assign different weights to them. We construct the maximal expected loss which is the supremum over any reweighted loss on augmented samples. Inspired by adversarial training, we minimize this maximal expected loss (MMEL) and obtain a simple and interpretable closed-form solution: more attention should be paid to augmented samples with large loss values (i.e., harder examples). Minimizing this maximal expected loss enables the model to perform well under any reweighting strategy. The proposed method can generally be applied on top of any data augmentation methods. Experiments are conducted on both natural language understanding tasks with token-level data augmentation, and image classification tasks with commonly-used image augmentation techniques like random crop and horizontal flip. Empirical results show that the proposed method improves the generalization performance of the model.",/pdf/e4c9206df0bf95f0a614a0ac2cc2acd08f5125ce.pdf,ICLR,2021,a new reweighting strategy on augmented samples +rkFBJv9gg,,1478290000000.0,1488550000000.0,346,Learning Features of Music From Scratch,"[""thickstn@cs.washington.edu"", ""sham@cs.washington.edu"", ""zaid@uw.edu""]","[""John Thickstun"", ""Zaid Harchaoui"", ""Sham Kakade""]","[""Applications""]","This paper introduces a new large-scale music dataset, MusicNet, to serve as a source +of supervision and evaluation of machine learning methods for music research. +MusicNet consists of hundreds of freely-licensed classical music recordings +by 10 composers, written for 11 instruments, together with instrument/note +annotations resulting in over 1 million temporal labels on 34 hours of chamber music +performances under various studio and microphone conditions. 
+ +The paper defines a multi-label classification task to predict notes in musical recordings, +along with an evaluation protocol, and benchmarks several machine learning architectures for this task: +i) learning from spectrogram features; +ii) end-to-end learning with a neural net; +iii) end-to-end learning with a convolutional neural net. +These experiments show that end-to-end models trained for note prediction learn frequency +selective filters as a low-level representation of audio. ",/pdf/b74c5c81741fefd98381e2bb1a649a7940f3b463.pdf,ICLR,2017,"We introduce a new large-scale music dataset, define a multi-label classification task, and benchmark machine learning architectures on this task." +4D4Rjrwaw3q,5uc9-UHK5e,1601310000000.0,1614990000000.0,3286,Black-Box Optimization Revisited: Improving Algorithm Selection Wizards through Massive Benchmarking,"[""~Laurent_Meunier1"", ""~Herilalaina_Rakotoarison1"", ""jrapin@fb.com"", ""paco.pkwong@gmail.com"", ""broz@fb.com"", ""~Olivier_Teytaud2"", ""antoine.moreau@uca.fr"", ""~Carola_Doerr1""]","[""Laurent Meunier"", ""Herilalaina Rakotoarison"", ""Jeremy Rapin"", ""Paco Wong"", ""Baptiste Roziere"", ""Olivier Teytaud"", ""Antoine Moreau"", ""Carola Doerr""]","[""black-box optimization"", ""mujoco"", ""wizard"", ""benchmarking"", ""BBOB"", ""LSGO""]","Existing studies in black-box optimization for machine learning suffer from low +generalizability, caused by a typically selective choice of problem instances used +for training and testing different optimization algorithms. Among other issues, +this practice promotes overfitting and poor-performing user guidelines. To address +this shortcoming, we propose in this work a benchmark suite, OptimSuite, +which covers a broad range of black-box optimization problems, ranging from +academic benchmarks to real-world applications, from discrete over numerical +to mixed-integer problems, from small to very large-scale problems, from noisy +over dynamic to static problems, etc. 
We demonstrate the advantages of such a +broad collection by deriving from it Automated Black Box Optimizer (ABBO), a +general-purpose algorithm selection wizard. Using three different types of algorithm +selection techniques, ABBO achieves competitive performance on all +benchmark suites. It significantly outperforms previous state of the art on some of +them, including YABBOB and LSGO. ABBO relies on many high-quality base +components. Its excellent performance is obtained without any task-specific +parametrization. The benchmark collection, the ABBO wizard, its base solvers, +as well as all experimental data are reproducible and open source in OptimSuite.",/pdf/ef1c96db59783dfd6f19daf621767dbea62f390b.pdf,ICLR,2021,We propose a huge benchmark aggregating many well known benchmarks and derive an algorithm selection tool on it. +sI4SVtktqJ2,eXgi85WCmZT,1601310000000.0,1614990000000.0,1746,Efficient randomized smoothing by denoising with learned score function,"[""~Kyungmin_Lee1"", ""syoh@add.re.kr""]","[""Kyungmin Lee"", ""Seyoon Oh""]","[""Adversarial Robustness"", ""Provable Adversarial Defense"", ""Randomized Smoothing"", ""Image Denoising"", ""Score Estimation""]","The randomized smoothing with various noise distributions is a promising approach to protect classifiers from $\ell_p$ adversarial attacks. However, it requires an ensemble of classifiers trained with different noise types and magnitudes, which is computationally expensive. In this work, we present an efficient method for randomized smoothing that does not require any re-training of classifiers. We built upon denoised smoothing, which prepends denoiser to the pre-trained classifier. We investigate two approaches to the image denoising problem for randomized smoothing and show that using the score function suits for both. Moreover, we present an efficient algorithm that can scale to randomized smoothing and can be applied regardless of noise types or levels. 
To validate, we demonstrate the effectiveness of our methods through extensive experiments on CIFAR-10 and ImageNet, under various $\ell_p$ adversaries.",/pdf/8bb978727b8850a1a7a783093ff49a75c0836a75.pdf,ICLR,2021,We provide efficient method to generate smoothed classifier for provable defense by exploiting score-based image denoiser +rJxq3kHKPH,BJxcHMkKPB,1569440000000.0,1577170000000.0,1962,A Simple Approach to the Noisy Label Problem Through the Gambler's Loss,"[""zliu@cat.phys.s.u-tokyo.ac.jp"", ""wangru1994305@gmail.com"", ""pliang@cs.cmu.edu"", ""rsalakhu@cs.cmu.edu"", ""morency@cs.cmu.edu"", ""ueda@phys.s.u-tokyo.ac.jp""]","[""Liu Ziyin"", ""Ru Wang"", ""Paul Pu Liang"", ""Ruslan Salakhutdinov"", ""Louis-Philippe Morency"", ""Masahito Ueda""]","[""noisy labels"", ""robust learning"", ""early stopping"", ""generalization""]","Learning in the presence of label noise is a challenging yet important task. It is crucial to design models that are robust to noisy labels. In this paper, we discover that a new class of loss functions called the gambler's loss provides strong robustness to label noise across various levels of corruption. Training with this modified loss function reduces memorization of data points with noisy labels and is a simple yet effective method to improve robustness and generalization. Moreover, using this loss function allows us to derive an analytical early stopping criterion that accurately estimates when memorization of noisy labels begins to occur. Our overall approach achieves strong results and outperforming existing baselines.",/pdf/e06213bdd4666aefeb0530dff72547f8a6e10690.pdf,ICLR,2020,We propose to a simple loss function and an analytical early stopping criterion to deal with the label noise problem. 
+FN7_BUOG78e,HufYbvAOPYs,1601310000000.0,1614990000000.0,2012,Computing Preimages of Deep Neural Networks with Applications to Safety,"[""~Kyle_Matoba1"", ""~Fran\u00e7ois_Fleuret2""]","[""Kyle Matoba"", ""Fran\u00e7ois Fleuret""]","[""Deep neural networks"", ""verification"", ""interpretation"", ""AI safety"", ""ACAS""]","To apply an algorithm in a sensitive domain it is important to understand the set of input values that result in specific decisions. Deep neural networks suffer from an inherent instability that makes this difficult: different outputs can arise from very similar inputs. + +We present a method to check that the decisions of a deep neural network are as intended by constructing the exact, analytical preimage of its predictions. Preimages generalize verification in the sense that they can be used to verify a wide class of properties, and answer much richer questions besides. We examine the functioning and failures of neural networks used in robotics, including an aircraft collision avoidance system, related to sequential decision making and extrapolation. + +Our method iterates backwards through the layers of piecewise linear deep neural networks. Uniquely, we compute \emph{all} intermediate values that correspond to a prediction, propagating this calculation through layers using analytical formulae for layer preimages. + +",/pdf/a5ab19df37d23adf099384afa0f4ca44a68c9339.pdf,ICLR,2021,We show how to progressively invert the layers of deep neural networks to give a much simpler characterization that can be used to answer questions that cannot be addressed otherwise. 
+ByeDojRcYQ,r1xmilPBKQ,1538090000000.0,1545360000000.0,629,COLLABORATIVE MULTIAGENT REINFORCEMENT LEARNING IN HOMOGENEOUS SWARMS,"[""arbaazk@seas.upenn.edu"", ""vijay.kumar@seas.upenn.edu"", ""aribeiro@seas.upenn.edu""]","[""Arbaaz Khan"", ""Clark Zhang"", ""Vijay Kumar"", ""Alejandro Ribeiro""]","[""Reinforcement Learning"", ""Multi Agent"", ""policy gradient""]","A deep reinforcement learning solution is developed for a collaborative multiagent system. Individual agents choose actions in response to the state of the environment, their own state, and possibly partial information about the state of other agents. Actions are chosen to maximize a collaborative long term discounted reward that encompasses the individual rewards collected by each agent. The paper focuses on developing a scalable approach that applies to large swarms of homogeneous agents. This is accomplished by forcing the policies of all agents to be the same resulting in a constrained formulation in which the experiences of each agent inform the learning process of the whole team, thereby enhancing the sample efficiency of the learning process. A projected coordinate policy gradient descent algorithm is derived to solve the constrained reinforcement learning problem. Experimental evaluations in collaborative navigation, a multi-predator-multi-prey game, and a multiagent survival game show marked improvements relative to methods that do not exploit the policy equivalence that naturally arises in homogeneous swarms.",/pdf/94cb941dfda2901eb8ef525105baa41ea87accdb.pdf,ICLR,2019,Novel policy gradient for multiagent systems via distributed learning. 
+r1laEnA5Ym,rJlqs7_vFX,1538090000000.0,1555380000000.0,1491,A Variational Inequality Perspective on Generative Adversarial Networks,"[""gauthier.gidel@umontreal.ca"", ""hugo.berard@gmail.com"", ""gaetan.vignoud@gmail.com"", ""vincentp@iro.umontreal.ca"", ""slacoste@iro.umontreal.ca""]","[""Gauthier Gidel"", ""Hugo Berard"", ""Ga\u00ebtan Vignoud"", ""Pascal Vincent"", ""Simon Lacoste-Julien""]","[""optimization"", ""variational inequality"", ""games"", ""saddle point"", ""extrapolation"", ""averaging"", ""extragradient"", ""generative modeling"", ""generative adversarial network""]","Generative adversarial networks (GANs) form a generative modeling approach known for producing appealing samples, but they are notably difficult to train. One common way to tackle this issue has been to propose new formulations of the GAN objective. Yet, surprisingly few studies have looked at optimization methods designed for this adversarial training. In this work, we cast GAN optimization problems in the general variational inequality framework. Tapping into the mathematical programming literature, we counter some common misconceptions about the difficulties of saddle point optimization and propose to extend methods designed for variational inequalities to the training of GANs. We apply averaging, extrapolation and a computationally cheaper variant that we call extrapolation from the past to the stochastic gradient method (SGD) and Adam.",/pdf/b980196a1b6175f85e294c1a398173ae7e38bf17.pdf,ICLR,2019,We cast GANs in the variational inequality framework and import techniques from this literature to optimize GANs better; we give algorithmic extensions and empirically test their performance for training GANs. 
+HklY120cYm,S1gBV3LFYX,1538090000000.0,1550790000000.0,1006,ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech,"[""weiping.thu@gmail.com"", ""pengkainan@baidu.com"", ""jitongc@gmail.com""]","[""Wei Ping"", ""Kainan Peng"", ""Jitong Chen""]","[""text-to-speech"", ""deep generative models"", ""end-to-end training"", ""text to waveform""]","In this work, we propose a new solution for parallel wave generation by WaveNet. In contrast to parallel WaveNet (van Oord et al., 2018), we distill a Gaussian inverse autoregressive flow from the autoregressive WaveNet by minimizing a regularized KL divergence between their highly-peaked output distributions. Our method computes the KL divergence in closed-form, which simplifies the training algorithm and provides very efficient distillation. In addition, we introduce the first text-to-wave neural architecture for speech synthesis, which is fully convolutional and enables fast end-to-end training from scratch. It significantly outperforms the previous pipeline that connects a text-to-spectrogram model to a separately trained WaveNet (Ping et al., 2018). We also successfully distill a parallel waveform synthesizer conditioned on the hidden representation in this end-to-end model.",/pdf/9d39ff3ccdd5fd7ee44de3b1481c1e8115d4c53b.pdf,ICLR,2019, +r1NDBsAqY7,SJeL5ldttQ,1538090000000.0,1545360000000.0,97,Unsupervised Word Discovery with Segmental Neural Language Models,"[""kawakamik@google.com"", ""cdyer@google.com"", ""pblunsom@google.com""]","[""Kazuya Kawakami"", ""Chris Dyer"", ""Phil Blunsom""]",[],"We propose a segmental neural language model that combines the representational power of neural networks and the structure learning mechanism of Bayesian nonparametrics, and show that it learns to discover semantically meaningful units (e.g., morphemes and words) from unsegmented character sequences. 
The model generates text as a sequence of segments, where each segment is generated either character-by-character from a sequence model or as a single draw from a lexical memory that stores multi-character units. Its parameters are fit to maximize the marginal likelihood of the training data, summing over all segmentations of the input, and its hyperparameters are likewise set to optimize held-out marginal likelihood. +To prevent the model from overusing the lexical memory, which leads to poor generalization and bad segmentation, we introduce a differentiable regularizer that penalizes based on the expected length of each segment. To our knowledge, this is the first demonstration of neural networks that have predictive distributions better than LSTM language models and also infer a segmentation into word-like units that are competitive with the best existing word discovery models.",/pdf/e28755b019f2b6f54c02f81332378de81217cadc.pdf,ICLR,2019,A LSTM language model that discovers words from unsegmented sequences of characters. +Skln2A4YDB,ByeGKoFdwB,1569440000000.0,1583910000000.0,1371,Model-Augmented Actor-Critic: Backpropagating through Paths,"[""iclavera@berkeley.edu"", ""violetfuyao@berkeley.edu"", ""pabbeel@cs.berkeley.edu""]","[""Ignasi Clavera"", ""Yao Fu"", ""Pieter Abbeel""]","[""reinforcement learning"", ""model-based"", ""actor-critic"", ""pathwise""]","Current model-based reinforcement learning approaches use the model simply as a learned black-box simulator to augment the data for policy optimization or value function learning. In this paper, we show how to make more effective use of the model by exploiting its differentiability. We construct a policy optimization algorithm that uses the pathwise derivative of the learned model and policy across future timesteps. Instabilities of learning across many timesteps are prevented by using a terminal value function, learning the policy in an actor-critic fashion. 
Furthermore, we present a derivation on the monotonic improvement of our objective in terms of the gradient error in the model and value function. We show that our approach (i) is consistently more sample efficient than existing state-of-the-art model-based algorithms, (ii) matches the asymptotic performance of model-free algorithms, and (iii) scales to long horizons, a regime where typically past model-based approaches have struggled.",/pdf/76639777b1faa1d8aa927d52182a25c3e8949f7e.pdf,ICLR,2020,Policy gradient through backpropagation through time using learned models and Q-functions. SOTA results in reinforcement learning benchmark environments. +LFs3CnHwfM,rq_ojpR5Sw,1601310000000.0,1614990000000.0,1905,A Robust Fuel Optimization Strategy For Hybrid Electric Vehicles: A Deep Reinforcement Learning Based Continuous Time Design Approach,"[""~Nilanjan_Mukherjee1"", ""~Sudeshna_Sarkar1""]","[""Nilanjan Mukherjee"", ""Sudeshna Sarkar""]","[""Deep Reinforcement Learning"", ""Optimal Control"", ""Fuel Management System"", ""Hybrid Electric vehicles"", ""H\u221e Performance Index""]","This paper deals with the fuel optimization problem for hybrid electric vehicles in reinforcement learning framework. Firstly, considering the hybrid electric vehicle as a completely observable non-linear system with uncertain dynamics, we solve an open-loop deterministic optimization problem. 
This is followed by the design of a deep reinforcement learning based optimal controller for the non-linear system using concurrent learning based system identifier such that the actual states and the control policy are able to track the optimal trajectory and optimal policy, autonomously even in the presence of external disturbances, modeling errors, uncertainties and noise and significantly reducing the computational complexity at the same time, which is in sharp contrast to the conventional methods like PID and Model Predictive Control (MPC) as well as traditional RL approaches like ADP, DDP and DQN that mostly depend on a set of pre-defined rules and provide sub-optimal solutions under similar conditions. The low value of the H-infinity ($H_{\infty}$) performance index of the proposed optimization algorithm addresses the robustness issue. The optimization technique thus proposed is compared with the traditional fuel optimization strategies for hybrid electric vehicles to illustrate the efficacy of the proposed method.",/pdf/c01950c0ac3835bd3e09fa4a559a23ddc82c7378.pdf,ICLR,2021,A state of the art continuous time robust deep reinforcement learning based fuel optimization strategy using concurrent learning for hybrid electric vehicles. +Hygxb2CqKm,H1xtrh-cYm,1538090000000.0,1548120000000.0,1136,Stable Recurrent Models,"[""miller_john@berkeley.edu"", ""hardt@berkeley.edu""]","[""John Miller"", ""Moritz Hardt""]","[""stability"", ""gradient descent"", ""non-convex optimization"", ""recurrent neural networks""]","Stability is a fundamental property of dynamical systems, yet to this date it has had little bearing on the practice of recurrent neural networks. In this work, we conduct a thorough investigation of stable recurrent models. Theoretically, we prove stable recurrent neural networks are well approximated by feed-forward networks for the purpose of both inference and training by gradient descent. 
Empirically, we demonstrate stable recurrent models often perform as well as their unstable counterparts on benchmark sequence tasks. Taken together, these findings shed light on the effective power of recurrent networks and suggest much of sequence learning happens, or can be made to happen, in the stable regime. Moreover, our results help to explain why in many cases practitioners succeed in replacing recurrent models by feed-forward models. +",/pdf/2e602c275cb2de7c656905c69b101bd66f2df83f.pdf,ICLR,2019,Stable recurrent models can be approximated by feed-forward networks and empirically perform as well as unstable models on benchmark tasks. +H1xKBCEYDr,r1xo1aUuDr,1569440000000.0,1577170000000.0,1118,Black-box Adversarial Attacks with Bayesian Optimization,"[""snshukla@cs.umass.edu"", ""anit.sahu@gmail.com"", ""devin.willmott@uky.edu"", ""zkolter@cs.cmu.edu""]","[""Satya Narayan Shukla"", ""Anit Kumar Sahu"", ""Devin Willmott"", ""J. Zico Kolter""]","[""black-box adversarial attacks"", ""bayesian optimization""]","We focus on the problem of black-box adversarial attacks, where the aim is to generate adversarial examples using information limited to loss function evaluations of input-output pairs. We use Bayesian optimization (BO) to specifically +cater to scenarios involving low query budgets to develop query efficient adversarial attacks. We alleviate the issues surrounding BO in regards to optimizing high dimensional deep learning models by effective dimension upsampling techniques. Our proposed approach achieves performance comparable to the state of the art black-box adversarial attacks albeit with a much lower average query count. 
In particular, in low query budget regimes, our proposed method reduces the query count up to 80% with respect to the state of the art methods.",/pdf/727d33735c7157749a2e3992946978691930db1d.pdf,ICLR,2020,We show that a relatively simple black-box adversarial attack scheme using Bayesian optimization and dimension upsampling is preferable to existing methods when the number of available queries is very low. +HJOZBvcel,,1478290000000.0,1483320000000.0,380,Learning to Discover Sparse Graphical Models,"[""eugene.belilovsky@inria.fr"", ""kyle.kastner@umontreal.ca"", ""gael.varoquaux@inria.fr"", ""matthew.blaschko@esat.kuleuven.be""]","[""Eugene Belilovsky"", ""Kyle Kastner"", ""Gael Varoquaux"", ""Matthew B. Blaschko""]",[],"We consider structure discovery of undirected graphical models from observational data. Inferring likely structures from few examples is a complex task often requiring the formulation of priors and sophisticated inference procedures. In the setting of Gaussian Graphical Models (GGMs) a popular estimator is a maximum likelihood objective with a penalization on the precision matrix. Adapting this estimator to capture domain-specific knowledge as priors or a new data likelihood requires great effort. In addition, structure recovery is an indirect consequence of the data-fit term. By contrast, it may be easier to generate training samples of data that arise from graphs with the desired structure properties. We propose here to leverage this latter source of information as training data to learn a function mapping from empirical covariance matrices to estimated graph structures. Learning this function brings two benefits: it implicitly models the desired structure or sparsity properties to form suitable priors, and it can be tailored to the specific problem of edge structure discovery, rather than maximizing data likelihood. 
We apply this framework to several real-world problems in structure discovery and show that it can be competitive to standard approaches such as graphical lasso, at a fraction of the execution speed. We use convolutional neural networks to parametrize our estimators due to the compositional structure of the problem. Experimentally, our learnable graph-discovery method trained on synthetic data generalizes well to different data: identifying relevant edges in real data, completely unknown at training time. We find that on genetics, brain imaging, and simulation data we obtain competitive (and generally superior) performance, compared with analytical methods. ",/pdf/84bd49390e7f5de0a7a6911fc3b31fad4884434a.pdf,ICLR,2017,Sparse graphical model structure estimators make restrictive assumptions. We show that empirical risk minimization can yield SOTA estimators for edge prediction across a wide range of graph structure distributions. +HJlLKjR9FQ,HJgm47q5F7,1538090000000.0,1556010000000.0,447,Towards Understanding Regularization in Batch Normalization,"[""pluo@ie.cuhk.edu.hk"", ""wangxinjiang@sensetime.com"", ""shaowenqi@sensetime.com"", ""zhanglinpeng@sensetime.com""]","[""Ping Luo"", ""Xinjiang Wang"", ""Wenqi Shao"", ""Zhanglin Peng""]","[""batch normalization"", ""regularization"", ""deep learning""]","Batch Normalization (BN) improves both convergence and generalization in training neural networks. This work understands these phenomena theoretically. We analyze BN by using a basic block of neural networks, consisting of a kernel layer, a BN layer, and a nonlinear activation function. This basic network helps us understand the impacts of BN in three aspects. First, by viewing BN as an implicit regularizer, BN can be decomposed into population normalization (PN) and gamma decay as an explicit regularization. Second, learning dynamics of BN and the regularization show that training converged with large maximum and effective learning rate. 
Third, generalization of BN is explored by using statistical mechanics. Experiments demonstrate that BN in convolutional neural networks share the same traits of regularization as the above analyses.",/pdf/8f671110d6fddecbe9002d668ff5d3fa121e8ba5.pdf,ICLR,2019, +Hygn2o0qKX,BygjLWK9KX,1538090000000.0,1550880000000.0,746,Deterministic PAC-Bayesian generalization bounds for deep networks via generalizing noise-resilience,"[""vaishnavh@cs.cmu.edu"", ""zkolter@cs.cmu.edu""]","[""Vaishnavh Nagarajan"", ""Zico Kolter""]","[""generalization"", ""PAC-Bayes"", ""SGD"", ""learning theory"", ""implicit regularization""]","The ability of overparameterized deep networks to generalize well has been linked to the fact that stochastic gradient descent (SGD) finds solutions that lie in flat, wide minima in the training loss -- minima where the output of the network is resilient to small random noise added to its parameters. +So far this observation has been used to provide generalization guarantees only for neural networks whose parameters are either \textit{stochastic} or \textit{compressed}. In this work, we present a general PAC-Bayesian framework that leverages this observation to provide a bound on the original network learned -- a network that is deterministic and uncompressed. What enables us to do this is a key novelty in our approach: our framework allows us to show that if on training data, the interactions between the weight matrices satisfy certain conditions that imply a wide training loss minimum, these conditions themselves {\em generalize} to the interactions between the matrices on test data, thereby implying a wide test loss minimum. We then apply our general framework in a setup where we assume that the pre-activation values of the network are not too small (although we assume this only on the training data). 
In this setup, we provide a generalization guarantee for the original (deterministic, uncompressed) network, that does not scale with product of the spectral norms of the weight matrices -- a guarantee that would not have been possible with prior approaches.",/pdf/53bae60fa685ad7fc38094b448becdd4390fc22e.pdf,ICLR,2019,"We provide a PAC-Bayes based generalization guarantee for uncompressed, deterministic deep networks by generalizing noise-resilience of the network on the training data to the test data." +H1ervR4FwH,SkgcNwwOwS,1569440000000.0,1577170000000.0,1173,Improved Structural Discovery and Representation Learning of Multi-Agent Data,"[""jennifer.hobbs@statsperform.com"", ""matthewholbrook@statsperform.com"", ""nathan.frank@statsperform.com"", ""long.sha@statsperform.com"", ""patrick.lucey@statsperform.com""]","[""Jennifer Hobbs"", ""Matthew Holbrook"", ""Nathan Frank"", ""Long Sha"", ""Patrick Lucey""]","[""multi-agent"", ""gaussian mixture"", ""permutation learning"", ""representation learning"", ""group structure""]","Central to all machine learning algorithms is data representation. For multi-agent systems, selecting a representation which adequately captures the interactions among agents is challenging due to the latent group structure which tends to vary depending on various contexts. However, in multi-agent systems with strong group structure, we can simultaneously learn this structure and map a set of agents to a consistently ordered representation for further learning. In this paper, we present a dynamic alignment method which provides a robust ordering of structured multi-agent data which allows for representation learning to occur in a fraction of the time of previous methods. We demonstrate the value of this approach using a large amount of soccer tracking data from a professional league. 
",/pdf/655d39c1b3514d3dee403296b636bf854f252aa3.pdf,ICLR,2020,We propose an improved approach to discovering the group structure and ordered representation of multi-agent data +whAxkamuuCU,kxRKWMPti4R,1601310000000.0,1614990000000.0,1162,Symbol-Shift Equivariant Neural Networks,"[""~David_Salinas1"", ""~Hady_Elsahar2""]","[""David Salinas"", ""Hady Elsahar""]","[""compositionality"", ""Symbolic"", ""Equivariance"", ""question answering"", ""Language Processing""]","Neural networks have been shown to have poor compositionality abilities: while they can produce sophisticated output given sufficient data, they perform patchy generalization and fail to generalize to new symbols (e.g. switching a name in a sentence by a less frequent one or one not seen yet). In this paper, we define a class of models whose outputs are equivariant to entity permutations (an analog being convolution networks whose outputs are invariant through translation) without requiring to specify or detect entities in a pre-processing step. We then show how two question-answering models can be made robust to entity permutation using a novel differentiable hybrid semantic-symbolic representation. The benefits of this approach are demonstrated on a set of synthetic NLP tasks where sample complexity and generalization are significantly improved even allowing models to generalize to words that are never seen in the training set. When using only 1K training examples for bAbi, we obtain a test error of 1.8% and fail only one task while the best results reported so far obtained an error of 9.9% and failed 7 tasks. +",/pdf/ac9fbdceb5409f73bbba0b858471b8d3d2e92e7e.pdf,ICLR,2021,We define a class of model whose outputs are equivariant to entity permutations without having to specify or detect such entities in a pre-processing step. 
+SJlVVAEKwS,BJxHA78dvr,1569440000000.0,1577170000000.0,1072,Adversarial Imitation Attack,"[""zhoumingyi@std.uestc.edu.cn"", ""wujing@std.uestc.edu.cn"", ""yipengliu@uestc.edu.cn"", ""xiaolinhuang@sjtu.edu.cn"", ""liushuaicheng@uestc.edu.cn"", ""engr_liaqat183@yahoo.com"", ""uestchero@uestc.edu.cn"", ""eczhu@uestc.edu.cn""]","[""Mingyi Zhou"", ""Jing Wu"", ""Yipeng Liu"", ""Xiaolin Huang"", ""Shuaicheng Liu"", ""Liaqat Ali"", ""Xiang Zhang"", ""Ce Zhu""]","[""Adversarial examples"", ""Security"", ""Machine learning"", ""Deep neural network"", ""Computer vision""]","Deep learning models are known to be vulnerable to adversarial examples. A practical adversarial attack should require as little as possible knowledge of attacked models T. Current substitute attacks need pre-trained models to generate adversarial examples and their attack success rates heavily rely on the transferability of adversarial examples. Current score-based and decision-based attacks require lots of queries for the T. In this study, we propose a novel adversarial imitation attack. First, it produces a replica of the T by a two-player game like the generative adversarial networks (GANs). The objective of the generative model G is to generate examples which lead D returning different outputs with T. The objective of the discriminative model D is to output the same labels with T under the same inputs. Then, the adversarial examples generated by D are utilized to fool the T. Compared with the current substitute attacks, imitation attack can use less training data to produce a replica of T and improve the transferability of adversarial examples. Experiments demonstrate that our imitation attack requires less training data than the black-box substitute attacks, but achieves an attack success rate close to the white-box attack on unseen data with no query. ",/pdf/d1dd433b9aff9c8b0a5061a0428f76784bbb4a85.pdf,ICLR,2020,A novel adversarial imitation attack to fool machine learning models. 
+SkxXCi0qFX,BkeI1C5ctQ,1538090000000.0,1545400000000.0,882,ProMP: Proximal Meta-Policy Search,"[""jonas.rothfuss@gmail.com"", ""dennisl88@berkeley.edu"", ""iclavera@berkeley.edu"", ""asfour@kit.edu"", ""pabbeel@cs.berkeley.edu""]","[""Jonas Rothfuss"", ""Dennis Lee"", ""Ignasi Clavera"", ""Tamim Asfour"", ""Pieter Abbeel""]","[""Meta-Reinforcement Learning"", ""Meta-Learning"", ""Reinforcement-Learning""]","Credit assignment in Meta-reinforcement learning (Meta-RL) is still poorly understood. Existing methods either neglect credit assignment to pre-adaptation behavior or implement it naively. This leads to poor sample-efficiency during meta-training as well as ineffective task identification strategies. +This paper provides a theoretical analysis of credit assignment in gradient-based Meta-RL. Building on the gained insights we develop a novel meta-learning algorithm that overcomes both the issue of poor credit assignment and previous difficulties in estimating meta-policy gradients. By controlling the statistical distance of both pre-adaptation and adapted policies during meta-policy search, the proposed algorithm endows efficient and stable meta-learning. Our approach leads to superior pre-adaptation policy behavior and consistently outperforms previous Meta-RL algorithms in sample-efficiency, wall-clock time, and asymptotic performance.",/pdf/89bfa5ff6db6832ade6925c6d0e3ebe5776de3f9.pdf,ICLR,2019,A novel and theoretically grounded meta-reinforcement learning algorithm +Bye5OiR5F7,SJgL3pKFFQ,1538090000000.0,1545360000000.0,377,Wasserstein proximal of GANs,"[""atlin@math.ucla.edu"", ""wcli@math.ucla.edu"", ""sjo@math.ucla.edu"", ""montufar@math.ucla.edu""]","[""Alex Tong Lin"", ""Wuchen Li"", ""Stanley Osher"", ""Guido Montufar""]","[""Optimal transport"", ""Wasserstein gradient"", ""Generative adversarial network"", ""Unsupervised learning""]","We introduce a new method for training GANs by applying the Wasserstein-2 metric proximal on the generators. 
+The approach is based on the gradient operator induced by optimal transport, which connects the geometry of sample space and parameter space in implicit deep generative models. From this theory, we obtain an easy-to-implement regularizer for the parameter updates. Our experiments demonstrate that this method improves the speed and stability in training GANs in terms of wall-clock time and Fr\'echet Inception Distance (FID) learning curves. ",/pdf/9df0392be5f34787936b348563f3e8e10c62af7c.pdf,ICLR,2019,We propose the Wasserstein proximal method for training GANs. +_b8l7rVPe8z,WAq5cnqjQOJ,1601310000000.0,1614990000000.0,248,Relevance Attack on Detectors,"[""~Sizhe_Chen1"", ""hf-inspire@sjtu.edu.cn"", ""~Xiaolin_Huang1"", ""~Kun_Zhang1""]","[""Sizhe Chen"", ""Fan He"", ""Xiaolin Huang"", ""Kun Zhang""]","[""adversarial attack"", ""relevance map"", ""object detection"", ""transferability"", ""black-box attack""]","This paper focuses on high-transferable adversarial attacks on detectors, which are hard to attack in a black-box manner, because of their multiple-output characteristics and the diversity across architectures. To pursue a high attack transferability, one plausible way is to find a common property across detectors, which facilitates the discovery of common weaknesses. We are the first to suggest that the relevance map for detectors is such a property. Based on it, we design a Relevance Attack on Detectors (RAD), which achieves a state-of-the-art transferability, exceeding existing results by above 20%. On MS COCO, the detection mAPs for all 8 black-box architectures are more than halved and the segmentation mAPs are also significantly influenced. 
Given the great transferability of RAD, we generate the first adversarial dataset for object detection, i.e., Adversarial Objects in COntext (AOCO), which helps to quickly evaluate and improve the robustness of detectors.",/pdf/745d9d6e4d46aa97ba592384bcd8aaa803f38d90.pdf,ICLR,2021,"We design a Relevance Attack on Detectors, a high-transferable attack framework with the state-of-the-art performance." +c5QbJ1zob73,EVrvHYzSxp,1601310000000.0,1614990000000.0,2351,Understanding Self-supervised Learning with Dual Deep Networks,"[""~Yuandong_Tian1"", ""~Lantao_Yu2"", ""~Xinlei_Chen1"", ""~Surya_Ganguli1""]","[""Yuandong Tian"", ""Lantao Yu"", ""Xinlei Chen"", ""Surya Ganguli""]","[""self-supervised learning"", ""teacher-student setting"", ""theoretical analysis"", ""hierarchical models"", ""representation learning""]","We propose a novel theoretical framework to understand self-supervised learning methods that employ dual pairs of deep ReLU networks (e.g., SimCLR, BYOL). First, we prove that in each SGD update of SimCLR, the weights at each layer are updated by a \emph{covariance operator} that specifically amplifies initial random selectivities that vary across data samples but survive averages over data augmentations. We show this leads to the emergence of hierarchical features, if the input data are generated from a hierarchical latent tree model. With the same framework, we also show analytically that in BYOL, the combination of BatchNorm and a predictor network creates an implicit contrastive term, acting as an approximate covariance operator. Additionally, for linear architectures we derive exact solutions for BYOL that provide conceptual insights into how BYOL can learn useful non-collapsed representations without any contrastive terms that separate negative pairs. Extensive ablation studies justify our theoretical findings. 
",/pdf/8d1f5ab6ed92112edc1df40c8243863c18426bea.pdf,ICLR,2021,A theoretical framework for self-supervised learning with deep ReLU networks explaining recent success of SimCLR and BYOL. +8wa7HrUsElL,H8MwOlxGlIj,1601310000000.0,1614990000000.0,2116,D3C: Reducing the Price of Anarchy in Multi-Agent Learning,"[""~Ian_Gemp1"", ""kevinrmckee@google.com"", ""~Richard_Everett1"", ""~Edgar_Alfredo_Duenez-Guzman1"", ""~Yoram_Bachrach2"", ""~David_Balduzzi1"", ""~Andrea_Tacchetti1""]","[""Ian Gemp"", ""Kevin McKee"", ""Richard Everett"", ""Edgar Alfredo Duenez-Guzman"", ""Yoram Bachrach"", ""David Balduzzi"", ""Andrea Tacchetti""]","[""multiagent"", ""social dilemma"", ""reinforcement learning""]","Even in simple multi-agent systems, fixed incentives can lead to outcomes that are poor for the group and each individual agent. We propose a method, D3C, for online adjustment of agent incentives that reduces the loss incurred at a Nash equilibrium. Agents adjust their incentives by learning to mix their incentive with that of other agents, until a compromise is reached in a distributed fashion. We show that D3C improves outcomes for each agent and the group as a whole in several social dilemmas including a traffic network with Braess’s paradox, a prisoner’s dilemma, and several reinforcement learning domains.",/pdf/cb0ca3cefad4eee6e9c22f209b9927bf44bb5692.pdf,ICLR,2021,"We propose a decentralized, gradient-based meta-algorithm to adapt the losses of agents in a multi-agent system such that the price of anarchy is reduced." +B1s6xvqlx,,1478290000000.0,1491440000000.0,354,Recurrent Environment Simulators,"[""csilvia@google.com"", ""sracaniere@google.com"", ""wierstra@google.com"", ""shakir@google.com""]","[""Silvia Chiappa"", ""S\u00e9bastien Racaniere"", ""Daan Wierstra"", ""Shakir Mohamed""]","[""Deep learning"", ""Unsupervised Learning"", ""Applications""]","Models that can simulate how environments change in response to actions can be used by agents to plan and act efficiently. 
We improve on previous environment simulators from high-dimensional pixel observations by introducing recurrent neural networks that are able to make temporally and spatially coherent predictions for hundreds of time-steps into the future. We present an in-depth analysis of the factors affecting performance, providing the most extensive attempt to advance the understanding of the properties of these models. We address the issue of computational inefficiency with a model that does not need to generate a high-dimensional image at each time-step. We show that our approach can be used to improve exploration and is adaptable to many diverse environments, namely 10 Atari games, a 3D car racing environment, and complex 3D mazes.",/pdf/dc16b626b1ac211f99b7c5940b6cac11eb9a717a.pdf,ICLR,2017, +B1kJ6H9ex,,1478280000000.0,1491580000000.0,271,Combining policy gradient and Q-learning,"[""bodonoghue@google.com"", ""munos@google.com"", ""korayk@google.com"", ""vmnih@google.com""]","[""Brendan O'Donoghue"", ""Remi Munos"", ""Koray Kavukcuoglu"", ""Volodymyr Mnih""]","[""Deep learning"", ""Reinforcement Learning""]","Policy gradient is an efficient technique for improving a policy in a reinforcement learning setting. However, vanilla online variants are on-policy only and not able to take advantage of off-policy data. In this paper we describe a new technique that combines policy gradient with off-policy Q-learning, drawing experience from a replay buffer. This is motivated by making a connection between the fixed points of the regularized policy gradient algorithm and the Q-values. This connection allows us to estimate the Q-values from the action preferences of the policy, to which we apply Q-learning updates. We refer to the new technique as ‘PGQL’, for policy gradient and Q-learning. 
We also establish an equivalency between action-value fitting techniques and actor-critic algorithms, showing that regularized policy gradient techniques can be interpreted as advantage function learning algorithms. We conclude with some numerical examples that demonstrate improved data efficiency and stability of PGQL. In particular, we tested PGQL on the full suite of Atari games and achieved performance exceeding that of both asynchronous advantage actor-critic (A3C) and Q-learning. +",/pdf/9dceb23b2c811b0566d9025dcdd13b037e3f3f6f.pdf,ICLR,2017,We combine a policy gradient style update with a Q-learning style update into a single RL algorithm we call PGQL. +33rtZ4Sjwjn,lop_CKg7HmK,1601310000000.0,1614360000000.0,402,Effective and Efficient Vote Attack on Capsule Networks,"[""~Jindong_Gu1"", ""~Baoyuan_Wu1"", ""~Volker_Tresp1""]","[""Jindong Gu"", ""Baoyuan Wu"", ""Volker Tresp""]","[""Capsule Networks"", ""Adversarial Attacks"", ""Adversarial Example Detection""]","Standard Convolutional Neural Networks (CNNs) can be easily fooled by images with small quasi-imperceptible artificial perturbations. As alternatives to CNNs, the recently proposed Capsule Networks (CapsNets) are shown to be more robust to white-box attack than CNNs under popular attack protocols. Besides, the class-conditional reconstruction part of CapsNets is also used to detect adversarial examples. In this work, we investigate the adversarial robustness of CapsNets, especially how the inner workings of CapsNets change when the output capsules are attacked. The first observation is that adversarial examples misled CapsNets by manipulating the votes from primary capsules. Another observation is the high computational cost, when we directly apply multi-step attack methods designed for CNNs to attack CapsNets, due to the computationally expensive routing mechanism. Motivated by these two observations, we propose a novel vote attack where we attack votes of CapsNets directly. 
Our vote attack is not only effective, but also efficient by circumventing the routing process. Furthermore, we integrate our vote attack into the detection-aware attack paradigm, which can successfully bypass the class-conditional reconstruction based detection method. Extensive experiments demonstrate the superior attack performance of our vote attack on CapsNets.",/pdf/93dc8fe0e28a6c86ef5c7b7c74d8c5968238850c.pdf,ICLR,2021,We propose an effective and efficient vote attack to create adversarial examples and bypass adversarial example detection on Capsule Networks. +SJTQLdqlg,,1478290000000.0,1488510000000.0,459,Learning to Remember Rare Events,"[""lukaszkaiser@google.com"", ""ofirnachum@google.com"", ""aurko@gatech.edu"", ""bengio@google.com""]","[""Lukasz Kaiser"", ""Ofir Nachum"", ""Aurko Roy"", ""Samy Bengio""]","[""Deep learning""]","Despite recent advances, memory-augmented deep neural networks are still limited +when it comes to life-long and one-shot learning, especially in remembering rare events. +We present a large-scale life-long memory module for use in deep learning. +The module exploits fast nearest-neighbor algorithms for efficiency and +thus scales to large memory sizes. +Except for the nearest-neighbor query, the module is fully differentiable +and trained end-to-end with no extra supervision. It operates in +a life-long manner, i.e., without the need to reset it during training. + +Our memory module can be easily added to any part of a supervised neural network. +To show its versatility we add it to a number of networks, from simple +convolutional ones tested on image classification to deep sequence-to-sequence +and recurrent-convolutional models. +In all cases, the enhanced network gains the ability to remember +and do life-long one-shot learning. +Our module remembers training examples shown many thousands +of steps in the past and it can successfully generalize from them. 
+We set new state-of-the-art for one-shot learning on the Omniglot dataset +and demonstrate, for the first time, life-long one-shot learning in +recurrent neural networks on a large-scale machine translation task.",/pdf/2a6234b3f2a32c8055fcdd6dfb12cb5037cd57c7.pdf,ICLR,2017,We introduce a memory module for life-long learning that adds one-shot learning capability to any supervised neural network. +rylvAA4YDB,rJep1Y5_wS,1569440000000.0,1577170000000.0,1433,IsoNN: Isomorphic Neural Network for Graph Representation Learning and Classification,"[""lin@ifmlab.org"", ""jiawei@ifmlab.org""]","[""Lin Meng"", ""Jiawei Zhang""]","[""Deep Learning"", ""Graph Neural Network""]","Deep learning models have achieved huge success in numerous fields, such as computer vision and natural language processing. However, unlike such fields, it is hard to apply traditional deep learning models on the graph data due to the ‘node-orderless’ property. Normally, adjacency matrices will cast an artificial and random node-order on the graphs, which renders the performance of deep models on graph classification tasks extremely erratic, and the representations learned by such models lack clear interpretability. To eliminate the unnecessary node-order constraint, we propose a novel model named Isomorphic Neural Network (ISONN), which learns the graph representation by extracting its isomorphic features via the graph matching between input graph and templates. ISONN has two main components: graph isomorphic feature extraction component and classification component. The graph isomorphic feature extraction component utilizes a set of subgraph templates as the kernel variables to learn the possible subgraph patterns existing in the input graph and then computes the isomorphic features. A set of permutation matrices is used in the component to break the node-order brought by the matrix representation. Three fully-connected layers are used as the classification component in ISONN. 
Extensive experiments are conducted on benchmark datasets, and the experimental results demonstrate the effectiveness of ISONN, especially compared with both classic and state-of-the-art graph classification methods.",/pdf/a2392bac115787a6a53475976d26bbc6185c2603.pdf,ICLR,2020, +8bZC3CyF-f7,1usmcF0PzIN,1601310000000.0,1614990000000.0,3130,Align-RUDDER: Learning From Few Demonstrations by Reward Redistribution,"[""~Vihang_Prakash_Patil1"", ""~Markus_Hofmarcher1"", ""dinu@ml.jku.at"", ""~Matthias_Dorfer1"", ""~Patrick_M_Blies1"", ""~Johannes_Brandstetter1"", ""~Jose_Arjona-Medina1"", ""~Sepp_Hochreiter1""]","[""Vihang Prakash Patil"", ""Markus Hofmarcher"", ""Marius-Constantin Dinu"", ""Matthias Dorfer"", ""Patrick M Blies"", ""Johannes Brandstetter"", ""Jose Arjona-Medina"", ""Sepp Hochreiter""]",[],"Reinforcement Learning algorithms require a large number of samples to solve complex tasks with sparse and delayed rewards. +Complex tasks are often hierarchically composed of sub-tasks. +A step in the Q-function indicates solving a sub-task, where the expectation of the return increases. +RUDDER identifies these steps and then redistributes reward to them, thus immediately giving reward if sub-tasks are solved. +Since the delay of rewards is reduced, learning is considerably sped up. +However, for complex tasks, current exploration strategies struggle with discovering episodes with high rewards. +Therefore, we assume that episodes with high rewards are given as demonstrations and do not have to be discovered by exploration. +Typically the number of demonstrations is small and RUDDER's LSTM model does not learn well. +Hence, we introduce Align-RUDDER, which is RUDDER with two major modifications. +First, Align-RUDDER assumes that episodes with high rewards are given as demonstrations, +replacing RUDDER’s safe exploration and lessons replay buffer. 
+Second, we substitute RUDDER’s LSTM model by a profile model that is obtained from multiple sequence alignment of demonstrations. +Profile models can be constructed from as few as two demonstrations. +Align-RUDDER inherits the concept of reward redistribution, which speeds up learning by reducing the delay of rewards. +Align-RUDDER outperforms competitors on complex artificial tasks with delayed reward and few demonstrations. +On the MineCraft ObtainDiamond task, Align-RUDDER is able to mine a diamond, though not frequently.",/pdf/6b5d5a1838a806c9854419b4956e7a69a92a188b.pdf,ICLR,2021, +SkeNlJSKvS,Hyl2v7j_wH,1569440000000.0,1577170000000.0,1502,Shallow VAEs with RealNVP Prior Can Perform as Well as Deep Hierarchical VAEs,"[""xhw15@mails.tsinghua.edu.cn"", ""chen-wx17@mails.tsinghua.edu.cn"", ""laijl16@mails.tsinghua.edu.cn"", ""lizhihan17@mails.tsinghua.edu.cn"", ""zhaoyoujian@tsinghua.edu.cn"", ""peidan@tsinghua.edu.cn""]","[""Haowen Xu"", ""Wenxiao Chen"", ""Jinlin Lai"", ""Zhihan Li"", ""Youjian Zhao"", ""Dan Pei""]","[""Variational Auto-encoder"", ""RealNVP"", ""learnable prior""]","Using powerful posterior distributions is a popular technique in variational inference. However, recent works showed that the aggregated posterior may fail to match unit Gaussian prior, even with expressive posteriors, thus learning the prior becomes an alternative way to improve the variational lower-bound. We show that using learned RealNVP prior and just one latent variable in VAE, we can achieve test NLL comparable to very deep state-of-the-art hierarchical VAE, outperforming many previous works with complex hierarchical VAE architectures. We hypothesize that, when coupled with Gaussian posteriors, the learned prior can encourage appropriate posterior overlapping, which is likely to improve reconstruction loss and lower-bound, supported by our experimental results. 
We demonstrate that, with a learned RealNVP prior, β-VAE can have a better rate-distortion curve than with a fixed Gaussian prior.",/pdf/f1b32c2a0cca43a600da9d1111e1dfe97f3dd902.pdf,ICLR,2020,"We show that VAE with learned RealNVP prior and just one latent variable can have better test NLLs than some deep hierarchical VAEs with powerful posteriors, on several datasets." +r1luCsCqFm,SklJW9q9Fm,1538090000000.0,1545360000000.0,908,Learn From Neighbour: A Curriculum That Train Low Weighted Samples By Imitating,"[""sunbenyuan@pku.edu.cn"", ""yizhou.wang@pku.edu.cn""]","[""Benyuan Sun"", ""Yizhou Wang""]","[""Curriculum Learning"", ""Internal Covariate Shift""]","Deep neural networks, which have achieved great success in a wide spectrum of applications, are often time, compute and storage hungry. Curriculum learning proposes to boost network training with a syllabus ordered from easy to hard. However, the relationship between data complexity and network training is unclear: why do hard examples harm performance at the beginning but help at the end? In this paper, we aim to investigate this problem. Similar to internal covariate shift in the network's forward pass, distribution changes in the weights of the top layers also affect the training of preceding layers during the backward pass. We call this phenomenon inverse ""internal covariate shift"". Training on hard examples aggravates the distribution shift and damages training. To address this problem, we introduce a curriculum loss that consists of two parts: a) an adaptive weight that mitigates large early punishment; b) an additional representation loss for low-weighted samples. The intuition behind the loss is very simple. We train the top layers on ""good"" samples to reduce large shifts, and encourage ""bad"" samples to learn from ""good"" samples. In detail, the adaptive weight assigns small values to hard examples, reducing the influence of noisy gradients. On the other hand, the low-weighted hard samples receive the proposed representation loss. 
Low-weighted data receive almost no training signal and can get stuck in the embedding space for a long time. The proposed representation loss aims to encourage their training. This is done by letting them learn a better representation from their superior neighbours without participating in the learning of the top layers. In this way, the fluctuation of the top layers is reduced and hard samples also receive training signals. We find that curriculum learning needs random sampling between tasks for better training. Our curriculum loss is easy to combine with existing stochastic algorithms such as SGD. Experimental results show a consistent improvement over several benchmark datasets.",/pdf/4ae687244ce452107e640d4936afefc3321e8a50.pdf,ICLR,2019, +By-7dz-AZ,B1jMdzW0W,1509140000000.0,1519340000000.0,868,A Framework for the Quantitative Evaluation of Disentangled Representations,"[""s1668298@ed.ac.uk"", ""ckiw@inf.ed.ac.uk""]","[""Cian Eastwood"", ""Christopher K. I. Williams""]",[],"Recent AI research has emphasised the importance of learning disentangled representations of the explanatory factors behind data. Despite the growing interest in models which can learn such representations, visual inspection remains the standard evaluation metric. While various desiderata have been implied in recent definitions, it is currently unclear what exactly makes one disentangled representation better than another. In this work we propose a framework for the quantitative evaluation of disentangled representations when the ground-truth latent structure is available. Three criteria are explicitly defined and quantified to elucidate the quality of learnt representations and thus compare models on an equal basis. 
To illustrate the appropriateness of the framework, we employ it to compare quantitatively the representations learned by recent state-of-the-art models.",/pdf/c7f91caeb674f420a714557bb8e33ab79244168a.pdf,ICLR,2018, +SJA7xfb0b,B1pQgGWAW,1509130000000.0,1519350000000.0,765,Sobolev GAN,"[""mroueh@us.ibm.com"", ""chunlial@cs.cmu.edu"", ""tom.sercu1@ibm.com"", ""anant.raj@tuebingen.mpg.de"", ""chengyu@us.ibm.com""]","[""Youssef Mroueh"", ""Chun-Liang Li"", ""Tom Sercu"", ""Anant Raj"", ""Yu Cheng""]","[""GAN theory"", ""Integral Probability Metrics"", ""elliptic PDE and diffusion"", ""GAN for discrete sequences"", ""semi-supervised learning.""]","We propose a new Integral Probability Metric (IPM) between distributions: the Sobolev IPM. The Sobolev IPM compares the mean discrepancy of two distributions for functions (critic) restricted to a Sobolev ball defined with respect to a dominant measure mu. We show that the Sobolev IPM compares two distributions in high dimensions based on weighted conditional Cumulative Distribution Functions (CDF) of each coordinate on a leave one out basis. The Dominant measure mu plays a crucial role as it defines the support on which conditional CDFs are compared. Sobolev IPM can be seen as an extension of the one dimensional Von-Mises Cramer statistics to high dimensional distributions. We show how Sobolev IPM can be used to train Generative Adversarial Networks (GANs). We then exploit the intrinsic conditioning implied by Sobolev IPM in text generation. Finally we show that a variant of Sobolev GAN achieves competitive results in semi-supervised learning on CIFAR-10, thanks to the smoothness enforced on the critic by Sobolev GAN which relates to Laplacian regularization.",/pdf/75ca6ac961660bda51234ae77c4ed6bbb2886a59.pdf,ICLR,2018,We define a new Integral Probability Metric (Sobolev IPM) and show how it can be used for training GANs for text generation and semi-supervised learning. 
+HJxwvCEFvH,r1lNXODuvr,1569440000000.0,1577170000000.0,1178,SPECTRA: Sparse Entity-centric Transitions,"[""rim.assouel@hotmail.fr"", ""yoshua.bengio@mila.quebec""]","[""Rim Assouel"", ""Yoshua Bengio""]","[""representation learning"", ""slot-structured representations"", ""sparse slot-structured transitions"", ""entity-centric representation"", ""unsupervised learning"", ""object-centric""]","Learning an agent that interacts with objects is ubiquitous in many RL tasks. In most of them the agent's actions have sparse effects: only a small subset of objects in the visual scene will be affected by the action taken. We introduce SPECTRA, a model for learning slot-structured transitions from raw visual observations that embodies this sparsity assumption. Our model is composed of a perception module that decomposes the visual scene into a set of latent object representations (i.e. slot-structured) and a transition module that predicts the next latent set slot-wise and in a sparse way. We show that learning a perception module jointly with a sparse slot-structured transition model not only biases the model towards more entity-centric perceptual groupings but also enables an intrinsic exploration strategy that aims at maximizing the number of objects changed in the agent’s trajectory.",/pdf/b4270b052a3a351200bd7d14c48b2b830b3e349f.pdf,ICLR,2020,Sparse slot-structured transition model. Training is done such that latent slots correspond to relevant entities of the visual scene. +B1xIj3VYvr,SkgRFVxWPS,1569440000000.0,1583910000000.0,153,Weakly Supervised Clustering by Exploiting Unique Class Count,"[""umitoner@comp.nus.edu.sg"", ""leehk@bii.a-star.edu.sg"", ""ksung@comp.nus.edu.sg""]","[""Mustafa Umit Oner"", ""Hwee Kuan Lee"", ""Wing-Kin Sung""]","[""weakly supervised clustering"", ""weakly supervised learning"", ""multiple instance learning""]","A weakly supervised learning based clustering framework is proposed in this paper. 
As the core of this framework, we introduce a novel multiple instance learning task based on a bag level label called unique class count (ucc), which is the number of unique classes among all instances inside the bag. In this task, no annotations on individual instances inside the bag are needed during training of the models. We mathematically prove that with a perfect ucc classifier, perfect clustering of individual instances inside the bags is possible even when no annotations on individual instances are given during training. We have constructed a neural network based ucc classifier and experimentally shown that the clustering performance of our framework with our weakly supervised ucc classifier is comparable to that of fully supervised learning models where labels for all instances are known. Furthermore, we have tested the applicability of our framework to a real world task of semantic segmentation of breast cancer metastases in histological lymph node sections and shown that the performance of our weakly supervised framework is comparable to the performance of a fully supervised Unet model.",/pdf/adbf99b321dd11ad7ef4e787f2da0de3f4aba58f.pdf,ICLR,2020,A weakly supervised learning based clustering framework performs comparable to that of fully supervised learning models by exploiting unique class count. +H1lqZhRcFm,S1gDB8ocF7,1538090000000.0,1546340000000.0,1198,Unsupervised Learning of the Set of Local Maxima,"[""wolf@fb.com"", ""sagiebenaim@gmail.com"", ""tomer22g@gmail.com""]","[""Lior Wolf"", ""Sagie Benaim"", ""Tomer Galanti""]","[""Unsupervised Learning"", ""One-class Classification"", ""Multi-player Optimization""]","This paper describes a new form of unsupervised learning, whose input is a set of unlabeled points that are assumed to be local maxima of an unknown value function $v$ in an unknown subset of the vector space. 
Two functions are learned: (i) a set indicator $c$, which is a binary classifier, and (ii) a comparator function $h$ that given two nearby samples, predicts which sample has the higher value of the unknown function $v$. Loss terms are used to ensure that all training samples $\vx$ are a local maxima of $v$, according to $h$ and satisfy $c(\vx)=1$. Therefore, $c$ and $h$ provide training signals to each other: a point $\vx'$ in the vicinity of $\vx$ satisfies $c(\vx)=-1$ or is deemed by $h$ to be lower in value than $\vx$. We present an algorithm, show an example where it is more efficient to use local maxima as an indicator function than to employ conventional classification, and derive a suitable generalization bound. Our experiments show that the method is able to outperform one-class classification algorithms in the task of anomaly detection and also provide an additional signal that is extracted in a completely unsupervised way. +",/pdf/b66e92e3765b7672a8cec593d0c97f95fda1e1b4.pdf,ICLR,2019, +HyXNCZbCZ,H1GVCbbRb,1509130000000.0,1518730000000.0,741,Hierarchical Adversarially Learned Inference,"[""ishmael.belghazi@gmail.com"", ""rajsai24@gmail.com"", ""oli.mastro@gmail.com"", ""negar.rostamzadeh@gmail.com"", ""jovana.mitrovic@spc.ox.ac.uk"", ""aaron.courville@gmail.com""]","[""Mohamed Ishmael Belghazi"", ""Sai Rajeswar"", ""Olivier Mastropietro"", ""Negar Rostamzadeh"", ""Jovana Mitrovic"", ""Aaron Courville""]","[""generative"", ""hierarchical"", ""unsupervised"", ""semisupervised"", ""latent"", ""ALI"", ""GAN""]","We propose a novel hierarchical generative model with a simple Markovian structure and a corresponding inference model. Both the generative and inference model are trained using the adversarial learning paradigm. We demonstrate that the hierarchical structure supports the learning of progressively more abstract representations as well as providing semantically meaningful reconstructions with different levels of fidelity. 
Furthermore, we show that minimizing the Jensen-Shannon divergence between the generative and inference network is enough to minimize the reconstruction error. The resulting semantically meaningful hierarchical latent structure discovery is exemplified on the CelebA dataset. There, we show that the features learned by our model in an unsupervised way outperform the best handcrafted features. Furthermore, the extracted features remain competitive when compared to several recent deep supervised approaches on an attribute prediction task on CelebA. Finally, we leverage the model's inference network to achieve state-of-the-art performance on a semi-supervised variant of the MNIST digit classification task. ",/pdf/d2e776621b9421c53813d28d1e0b94271e032881.pdf,ICLR,2018,Adversarially trained hierarchical generative model with robust and semantically learned latent representation. +SJl3h2EYvS,rygftyjGwB,1569440000000.0,1577170000000.0,204,CLAREL: classification via retrieval loss for zero-shot learning,"[""boris@elementai.com"", ""negar@elementai.com"", ""pedro@elementai.com"", ""christopher.pal@elementai.com""]","[""Boris N. Oreshkin"", ""Negar Rostamzadeh"", ""Pedro O. Pinheiro"", ""Christopher Pal""]","[""zero-shot learning"", ""representation learning"", ""fine-grained classification""]","We address the problem of learning fine-grained cross-modal representations. We propose an instance-based deep metric learning approach in joint visual and textual space. The key novelty of this paper is that it shows that using per-image semantic supervision leads to substantial improvement in zero-shot performance over using class-only supervision. On top of that, we provide a probabilistic justification for a metric rescaling approach that solves a very common problem in the generalized zero-shot learning setting, i.e., classifying test images from unseen classes as one of the classes seen during training. 
We evaluate our approach on two fine-grained zero-shot learning datasets: CUB and FLOWERS. We find that on the generalized zero-shot classification task CLAREL consistently outperforms the existing approaches on both datasets.",/pdf/11f88b5c98063ab7e8f0383668cfc69dfcd2bf20.pdf,ICLR,2020,We propose an instance-based deep metric learning approach in joint visual and textual space. We show that per-image semantic supervision leads to substantial improvement over class-only supervision in zero shot classification. +BJg4Z3RqF7,H1ekHTp5t7,1538090000000.0,1549630000000.0,1163,Unsupervised Adversarial Image Reconstruction,"[""arthur.pajot@lip6.fr"", ""emmanuel.de-bezenac@lip6.fr"", ""patrick.gallinari@lip6.fr""]","[""Arthur Pajot"", ""Emmanuel de Bezenac"", ""Patrick Gallinari""]","[""Deep Learning"", ""Adversarial"", ""MAP"", ""GAN"", ""neural networks""]","We address the problem of recovering an underlying signal from lossy, inaccurate observations in an unsupervised setting. Typically, we consider situations where there is little to no background knowledge on the structure of the underlying signal, no access to signal-measurement pairs, nor even unpaired signal-measurement data. The only available information is provided by the observations and the measurement process statistics. We cast the problem as finding the \textit{maximum a posteriori} estimate of the signal given each measurement, and propose a general framework for the reconstruction problem. We use a formulation of generative adversarial networks, where the generator takes as input a corrupted observation in order to produce realistic reconstructions, and add a penalty term tying the reconstruction to the associated observation. We evaluate our reconstructions on several image datasets with different types of corruptions. 
The proposed approach yields better results than alternative baselines, and comparable performance with model variants trained with additional supervision.",/pdf/8952a5ef40ce737be3059e78feb4dd0b7f297219.pdf,ICLR,2019, +rJY3vK9eg,,1478300000000.0,1485330000000.0,506,Neural Combinatorial Optimization with Reinforcement Learning,"[""ibello@google.com"", ""hyhieu@google.com"", ""qvl@google.com"", ""mnorouzi@google.com"", ""bengio@google.com""]","[""Irwan Bello*"", ""Hieu Pham*"", ""Quoc V. Le"", ""Mohammad Norouzi"", ""Samy Bengio""]","[""Reinforcement Learning"", ""Deep learning""]","This paper presents a framework to tackle combinatorial optimization problems using neural networks and reinforcement learning. We focus on the traveling salesman problem (TSP) and train a recurrent neural network that, given a set of city coordinates, predicts a distribution over different city permutations. Using negative tour length as the reward signal, we optimize the parameters of the recurrent neural network using a policy gradient method. We compare learning the network parameters on a set of training graphs against learning them on individual test graphs. Without much engineering and heuristic designing, Neural Combinatorial Optimization achieves close to optimal results on 2D Euclidean graphs with up to 100 nodes. Applied to the KnapSack, another NP-hard problem, the same method obtains optimal solutions for instances with up to 200 items. These results, albeit still far from state-of-the-art, give insights into how neural networks can be used as a general tool for tackling combinatorial optimization problems.",/pdf/dc0c13426b0f4e480cb06f70c0e511ea499242ce.pdf,ICLR,2017,This paper presents a framework to tackle combinatorial optimization problems using neural networks and reinforcement learning. 
+Byl8hhNYPS,Syg_jCXMDH,1569440000000.0,1588210000000.0,189,Neural Machine Translation with Universal Visual Representation,"[""zhangzs@sjtu.edu.cn"", ""khchen@nict.go.jp"", ""wangrui@nict.go.jp"", ""mutiyama@nict.go.jp"", ""eiichiro.sumita@nict.go.jp"", ""charlee@sjtu.edu.cn"", ""zhaohai@cs.sjtu.edu.cn""]","[""Zhuosheng Zhang"", ""Kehai Chen"", ""Rui Wang"", ""Masao Utiyama"", ""Eiichiro Sumita"", ""Zuchao Li"", ""Hai Zhao""]","[""Neural Machine Translation"", ""Visual Representation"", ""Multimodal Machine Translation"", ""Language Representation""]","Though visual information has been introduced for enhancing neural machine translation (NMT), its effectiveness strongly relies on the availability of large amounts of bilingual parallel sentence pairs with manual image annotations. In this paper, we present a universal visual representation learned over the monolingual corpora with image annotations, which overcomes the lack of large-scale bilingual sentence-image pairs, thereby extending image applicability in NMT. In detail, a group of images with similar topics to the source sentence is retrieved from a light topic-image lookup table learned over the existing sentence-image pairs, and is then encoded as image representations by a pre-trained ResNet. An attention layer with a gated weighting is used to fuse the visual information and text information as input to the decoder for predicting target translations. In particular, the proposed method enables the visual information to be integrated into large-scale text-only NMT in addition to the multimodal NMT. 
Experiments on four widely used translation datasets, including the WMT'16 English-to-Romanian, WMT'14 English-to-German, WMT'14 English-to-French, and Multi30K, show that the proposed approach achieves significant improvements over strong baselines.",/pdf/2ac2a018a5183db8cbcf9e8e33442c42d2eb55c6.pdf,ICLR,2020,"This work proposed a universal visual representation for neural machine translation (NMT) using retrieved images with similar topics to source sentence, extending image applicability in NMT." +2VXyy9mIyU3,4tE4acyxpc,1601310000000.0,1616050000000.0,1201,Learning with Instance-Dependent Label Noise: A Sample Sieve Approach,"[""~Hao_Cheng5"", ""~Zhaowei_Zhu1"", ""~Xingyu_Li2"", ""~Yifei_Gong1"", ""~Xing_Sun1"", ""~Yang_Liu3""]","[""Hao Cheng"", ""Zhaowei Zhu"", ""Xingyu Li"", ""Yifei Gong"", ""Xing Sun"", ""Yang Liu""]","[""Learning with noisy labels"", ""instance-based label noise"", ""deep neural networks.""]","Human-annotated labels are often prone to noise, and the presence of such noise will degrade the performance of the resulting deep neural network (DNN) models. Much of the literature (with several recent exceptions) of learning with noisy labels focuses on the case when the label noise is independent of features. Practically, annotation errors tend to be instance-dependent and often depend on the difficulty levels of recognizing a certain task. Applying existing results from instance-independent settings would require a significant amount of estimation of noise rates. Therefore, providing theoretically rigorous solutions for learning with instance-dependent label noise remains a challenge. In this paper, we propose CORES$^{2}$ (COnfidence REgularized Sample Sieve), which progressively sieves out corrupted examples. The implementation of CORES$^{2}$ does not require specifying noise rates and yet we are able to provide theoretical guarantees of CORES$^{2}$ in filtering out the corrupted examples. 
This high-quality sample sieve allows us to treat clean examples and the corrupted ones separately in training a DNN solution, and such a separation is shown to be advantageous in the instance-dependent noise setting. We demonstrate the performance of CORES$^{2}$ on CIFAR10 and CIFAR100 datasets with synthetic instance-dependent label noise and Clothing1M with real-world human noise. Of independent interest, our sample sieve provides a generic machinery for anatomizing noisy datasets and provides a flexible interface for various robust training techniques to further improve the performance. Code is available at https://github.com/UCSC-REAL/cores.",/pdf/d1f9063b86dd15432acc0f429ea3d22cbb41ae74.pdf,ICLR,2021,This paper proposes a dynamic sample sieve method with strong theoretical guarantees to avoid overfitting to instance-based label noise. +BJxqohNFPB,H1xig0WZvH,1569440000000.0,1577170000000.0,161,S-Flow GAN,"[""yakov.miron@gmail.com"", ""yona.coscas@gmail.com""]","[""Miron Yakov"", ""Coscas Yona""]","[""GAN"", ""Image Generation"", ""AI"", ""Generative Models"", ""CV""]","Our work offers a new method for domain translation from semantic label maps +and Computer Graphic (CG) simulation edge map images to photo-realistic im- +ages. We train a Generative Adversarial Network (GAN) in a conditional way to +generate a photo-realistic version of a given CG scene. Existing architectures of +GANs still lack the photo-realism capabilities needed to train DNNs for computer +vision tasks, we address this issue by embedding edge maps, and training it in an +adversarial mode. 
We also offer an extension to our model that uses our GAN +architecture to create visually appealing and temporally coherent videos.",/pdf/b50c333770c1f03985caadc8d39ff8bc74167d2c.pdf,ICLR,2020,Simulation to real images translation and video generation +BJ78bJZCZ,S1G8Zy-CW,1509120000000.0,1518730000000.0,474,Efficiently applying attention to sequential data with the Recurrent Discounted Attention unit,"[""brendan.maginnis@gmail.com"", ""pierre.richemond@gmail.com""]","[""Brendan Maginnis"", ""Pierre Richemond""]","[""RNNs""]","Recurrent Neural Networks architectures excel at processing sequences by +modelling dependencies over different timescales. The recently introduced +Recurrent Weighted Average (RWA) unit captures long term dependencies +far better than an LSTM on several challenging tasks. The RWA achieves +this by applying attention to each input and computing a weighted average +over the full history of its computations. Unfortunately, the RWA cannot +change the attention it has assigned to previous timesteps, and so struggles +with carrying out consecutive tasks or tasks with changing requirements. +We present the Recurrent Discounted Attention (RDA) unit that builds on +the RWA by additionally allowing the discounting of the past. +We empirically compare our model to RWA, LSTM and GRU units on +several challenging tasks. On tasks with a single output the RWA, RDA and +GRU units learn much quicker than the LSTM and with better performance. +On the multiple sequence copy task our RDA unit learns the task three +times as quickly as the LSTM or GRU units while the RWA fails to learn at +all. On the Wikipedia character prediction task the LSTM performs best +but it followed closely by our RDA unit. 
Overall our RDA unit performs +well and is sample efficient on a large variety of sequence tasks.",/pdf/eaa88eedd70fe9e6f654ed17af7007888ae02751.pdf,ICLR,2018,We introduce the Recurrent Discounted Unit which applies attention to any length sequence in linear time +MBOyiNnYthd,CVGcXynUyQ,1601310000000.0,1616060000000.0,3291,IDF++: Analyzing and Improving Integer Discrete Flows for Lossless Compression,"[""~Rianne_van_den_Berg1"", ""~Alexey_A._Gritsenko1"", ""~Mostafa_Dehghani1"", ""~Casper_Kaae_S\u00f8nderby1"", ""~Tim_Salimans1""]","[""Rianne van den Berg"", ""Alexey A. Gritsenko"", ""Mostafa Dehghani"", ""Casper Kaae S\u00f8nderby"", ""Tim Salimans""]","[""normalizing flows"", ""lossless source compression"", ""generative modeling""]","In this paper we analyse and improve integer discrete flows for lossless compression. Integer discrete flows are a recently proposed class of models that learn invertible transformations for integer-valued random variables. Their discrete nature makes them particularly suitable for lossless compression with entropy coding schemes. We start by investigating a recent theoretical claim that states that invertible flows for discrete random variables are less flexible than their continuous counterparts. We demonstrate with a proof that this claim does not hold for integer discrete flows due to the embedding of data with finite support into the countably infinite integer lattice. Furthermore, we zoom in on the effect of gradient bias due to the straight-through estimator in integer discrete flows, and demonstrate that its influence is highly dependent on architecture choices and less prominent than previously thought. 
Finally, we show how different architecture modifications improve the performance of this model class for lossless compression, and that they also enable more efficient compression: a model with half the number of flow layers performs on par with or better than the original integer discrete flow model.",/pdf/049fd6f43de5700220bd49a24b2ae38e78c3782c.pdf,ICLR,2021,We analyze and improve integer discrete normalizing flows for lossless source compression. +kmqjgSNXby,Y6rR-S6paX,1601310000000.0,1616050000000.0,1897,Autoregressive Dynamics Models for Offline Policy Evaluation and Optimization,"[""~Michael_R_Zhang1"", ""~Thomas_Paine1"", ""~Ofir_Nachum1"", ""~Cosmin_Paduraru1"", ""~George_Tucker1"", ""~ziyu_wang1"", ""~Mohammad_Norouzi1""]","[""Michael R Zhang"", ""Thomas Paine"", ""Ofir Nachum"", ""Cosmin Paduraru"", ""George Tucker"", ""ziyu wang"", ""Mohammad Norouzi""]","[""Off-policy policy evaluation"", ""autoregressive models"", ""offline reinforcement learning"", ""policy optimization""]","Standard dynamics models for continuous control make use of feedforward computation to predict the conditional distribution of next state and reward given current state and action using a multivariate Gaussian with a diagonal covariance structure. This modeling choice assumes that different dimensions of the next state and reward are conditionally independent given the current state and action and may be driven by the fact that fully observable physics-based simulation environments entail deterministic transition dynamics. In this paper, we challenge this conditional independence assumption and propose a family of expressive autoregressive dynamics models that generate different dimensions of the next state and reward sequentially conditioned on previous dimensions. We demonstrate that autoregressive dynamics models indeed outperform standard feedforward models in log-likelihood on heldout transitions. 
Furthermore, we compare different model-based and model-free off-policy evaluation (OPE) methods on RL Unplugged, a suite of offline MuJoCo datasets, and find that autoregressive dynamics models consistently outperform all baselines, achieving a new state-of-the-art. Finally, we show that autoregressive dynamics models are useful for offline policy optimization by serving as a way to enrich the replay buffer through data augmentation and improving performance using model-based planning. +",/pdf/258fc8fbf3df2a9d9783d528b562f8f503fe1167.pdf,ICLR,2021,We demonstrate autoregressive dynamics models outperform standard feedforward models and other baselines in offline policy evaluation and optimization. +HHiiQKWsOcV,GOv9wUY5kuS,1601310000000.0,1616550000000.0,2791,Explaining the Efficacy of Counterfactually Augmented Data,"[""~Divyansh_Kaushik1"", ""~Amrith_Setlur1"", ""~Eduard_H_Hovy1"", ""~Zachary_Chase_Lipton1""]","[""Divyansh Kaushik"", ""Amrith Setlur"", ""Eduard H Hovy"", ""Zachary Chase Lipton""]","[""humans in the loop"", ""annotation artifacts"", ""text classification"", ""sentiment analysis"", ""natural language inference""]","In attempts to produce machine learning models less reliant on spurious patterns in NLP datasets, researchers have recently proposed curating counterfactually augmented data (CAD) via a human-in-the-loop process in which given some documents and their (initial) labels, humans must revise the text to make a counterfactual label applicable. Importantly, edits that are not necessary to flip the applicable label are prohibited. Models trained on the augmented (original and revised) data appear, empirically, to rely less on semantically irrelevant words and to generalize better out of domain. While this work draws loosely on causal thinking, the underlying causal model (even at an abstract level) and the principles underlying the observed out-of-domain improvements remain unclear. 
In this paper, we introduce a toy analog based on linear Gaussian models, observing interesting relationships between causal models, measurement noise, out-of-domain generalization, and reliance on spurious signals. Our analysis provides some insights that help to explain the efficacy of CAD. Moreover, we develop the hypothesis that while adding noise to causal features should degrade both in-domain and out-of-domain performance, adding noise to non-causal features should lead to relative improvements in out-of-domain performance. This idea inspires a speculative test for determining whether a feature attribution technique has identified the causal spans. If adding noise (e.g., by random word flips) to the highlighted spans degrades both in-domain and out-of-domain performance on a battery of challenge datasets, but adding noise to the complement gives improvements out-of-domain, this suggests we have identified causal spans. Thus, we present a large scale empirical study comparing spans edited to create CAD to those selected by attention and saliency maps. Across numerous challenge domains and models, we find that the hypothesized phenomenon is pronounced for CAD.",/pdf/73361dc2c4d80cb501745448d7de1e3c99d2f2a8.pdf,ICLR,2021,We present a framework for thinking about counterfactually augmented data and make strides towards understanding its benefits in out-of-domain generalization. +rJe1y3CqtX,ryxcCppqF7,1538090000000.0,1545360000000.0,949,Deep Reinforcement Learning of Universal Policies with Diverse Environment Summaries,"[""befelix@inf.ethz.ch"", ""dedey@microsoft.com"", ""akapoor@microsoft.com""]","[""Felix Berkenkamp"", ""Debadeepta Dey"", ""Ashish Kapoor""]","[""Domain Randomization"", ""Diverse Summaries"", ""Reinforcement learning""]","Deep reinforcement learning has enabled robots to complete complex tasks in simulation. However, the resulting policies do not transfer to real robots due to model errors in the simulator. 
One solution is to randomize the simulation environment, so that the resulting, trained policy achieves high performance in expectation over a variety of configurations that could represent the real-world. However, the distribution over simulator configurations must be carefully selected to represent the relevant dynamic modes of the system, as otherwise it can be unlikely to sample challenging configurations frequently enough. Moreover, the ideal distribution to improve the policy changes as the policy (un)learns to solve tasks in certain configurations. In this paper, we propose to use an inexpensive, kernel-based summarization method that identifies configurations that lead to diverse behaviors. Since failure modes for the given task are naturally diverse, the policy trains on a mixture of representative and challenging configurations, which leads to more robust policies. In experiments, we show that the proposed method achieves the same performance as domain randomization in simple cases, but performs better when domain randomization does not lead to diverse dynamic modes.",/pdf/e1e59e3713fd929b0f0daf47eb8a5673cf75736b.pdf,ICLR,2019,"As an alternative to domain randomization, we summarize simulator configurations to ensure that the policy is trained on a diverse set of induced state-trajectories." +txC1ObHJ0wB,rAHYHXVzhFy,1601310000000.0,1614990000000.0,434,How to Train Your Super-Net: An Analysis of Training Heuristics in Weight-Sharing NAS,"[""~Kaicheng_Yu1"", ""~Rene_Ranftl1"", ""~Mathieu_Salzmann1""]","[""Kaicheng Yu"", ""Rene Ranftl"", ""Mathieu Salzmann""]","[""autoML"", ""neural architecture search"", ""NAS"", ""one-shot NAS"", ""weight-sharing NAS"", ""super-net""]","Weight sharing promises to make neural architecture search (NAS) tractable even on commodity hardware. Existing methods in this space rely on a diverse set of heuristics to design and train the shared-weight backbone network, a.k.a. the super-net. 
Since heuristics substantially vary across different methods and have not been carefully studied, it is unclear to which extent they impact super-net training and hence the weight-sharing NAS algorithms. In this paper, we disentangle super-net training from the search algorithm, isolate 14 frequently-used training heuristics, and evaluate them over three benchmark search spaces. Our analysis uncovers that several commonly-used heuristics negatively impact the correlation between super-net and stand-alone performance, whereas simple, but often overlooked factors, such as proper hyper-parameter settings, are key to achieve strong performance. Equipped with this knowledge, we show that simple random search achieves competitive performance to complex state-of-the-art NAS algorithms when the super-net is properly trained. +",/pdf/ae4b06f5972c52d4809417d296da867cfc4493d8.pdf,ICLR,2021,We show that simple random search achieves competitive performance to complex state-of-the-art NAS algorithms when the super-net is properly trained. +r1gc3lBFPH,BJeiOe-twr,1569440000000.0,1577170000000.0,2548,Keyword Spotter Model for Crop Pest and Disease Monitoring from Community Radio Data,"[""akeraben@gmail.com"", ""jnakatumba@cis.mak.ac.ug"", ""ali.hussein@ronininstitute.org"", ""ssendiwaladaniel@gmail.com"", ""jonmuk7@gmail.com""]","[""Benjamin Akera"", ""Joyce Nakatumba-Nabende"", ""Ali Hussein"", ""Daniel Ssendiwala"", ""Jonathan Mukiibi""]","[""keyword spotter"", ""radio data"", ""crop pest and disease"", ""agriculture""]","In societies with well developed internet infrastructure, social media is the leading medium of communication for various social issues especially for breaking news situations. In rural Uganda however, public community radio is still a dominant means for news dissemination. 
Community radio gives audience to the general public especially to individuals living in rural areas, and thus plays an important role in giving a voice to those living in the broadcast area. It is an avenue for participatory communication and a tool relevant in both economic and social development. This is supported by the rise to ubiquity of mobile phones providing access to phone-in or text-in talk shows. In this paper, we describe an approach to analysing the readily available community radio data with machine learning-based speech keyword spotting techniques. We identify the keywords of interest related to agriculture and build models to automatically identify these keywords from audio streams. Our contribution through these techniques is a cost-efficient and effective way to monitor food security concerns particularly in rural areas. Through keyword spotting and radio talk show analysis, issues such as crop diseases, pests, drought and famine can be captured and fed into an early warning system for stakeholders and policy makers.",/pdf/019586dcdcaad8040c17e338589f435e8dc2468a.pdf,ICLR,2020,This paper describes an approach to analyse community radio data with machine learning-based speech keyword spotting techniques for crop pest and disease monitoring. +SJVmjjR9FX,Hke3nJsY_7,1538090000000.0,1550270000000.0,611,Variational Bayesian Phylogenetic Inference,"[""zc.rabbit@gmail.com"", ""matsen@fredhutch.org""]","[""Cheng Zhang"", ""Frederick A. Matsen IV""]","[""Bayesian phylogenetic inference"", ""Variational inference"", ""Subsplit Bayesian networks""]","Bayesian phylogenetic inference is currently done via Markov chain Monte Carlo with simple mechanisms for proposing new states, which hinders exploration efficiency and often requires long runs to deliver accurate posterior estimates. In this paper we present an alternative approach: a variational framework for Bayesian phylogenetic analysis. 
We approximate the true posterior using an expressive graphical model for tree distributions, called a subsplit Bayesian network, together with appropriate branch length distributions. We train the variational approximation via stochastic gradient ascent and adopt multi-sample based gradient estimators for different latent variables separately to handle the composite latent space of phylogenetic models. We show that our structured variational approximations are flexible enough to provide comparable posterior estimation to MCMC, while requiring less computation due to a more efficient tree exploration mechanism enabled by variational inference. Moreover, the variational approximations can be readily used for further statistical analysis such as marginal likelihood estimation for model comparison via importance sampling. Experiments on both synthetic data and real data Bayesian phylogenetic inference problems demonstrate the effectiveness and efficiency of our methods.",/pdf/6bab1c498a92fde3b14cec1bbf11299073ef7719.pdf,ICLR,2019,"The first variational Bayes formulation of phylogenetic inference, a challenging inference problem over structures with intertwined discrete and continuous components" +uRKqXoN-Ic9,Flriyr5nZz2,1601310000000.0,1614990000000.0,1172,Evaluating Robustness of Predictive Uncertainty Estimation: Are Dirichlet-based Models Reliable?,"[""~Anna-Kathrin_Kopetzki1"", ""~Bertrand_Charpentier2"", ""~Daniel_Z\u00fcgner1"", ""giri@in.tum.de"", ""~Stephan_G\u00fcnnemann1""]","[""Anna-Kathrin Kopetzki"", ""Bertrand Charpentier"", ""Daniel Z\u00fcgner"", ""Sandhya Giri"", ""Stephan G\u00fcnnemann""]",[],"Robustness to adversarial perturbations and accurate uncertainty estimation are crucial for reliable application of deep learning in real world settings. 
Dirichlet-based uncertainty (DBU) models are a family of models that predict the parameters of a Dirichlet distribution (instead of a categorical one) and promise to signal when not to trust their predictions. Untrustworthy predictions are obtained on unknown or ambiguous samples and marked with a high uncertainty by the models. + +In this work, we show that DBU models with standard training are not robust w.r.t. three important tasks in the field of uncertainty estimation. First, we evaluate how useful the uncertainty estimates are to (1) indicate correctly classified samples. Our results show that while they are a good indicator on unperturbed data, performance on perturbed data decreases dramatically. (2) We evaluate if uncertainty estimates are able to detect adversarial examples that try to fool classification. It turns out that uncertainty estimates are able to detect FGSM attacks but not able to detect PGD attacks. We further evaluate the reliability of DBU models on the task of (3) distinguishing between in-distribution (ID) and out-of-distribution (OOD) data. To this end, we present the first study of certifiable robustness for DBU models. Furthermore, we propose novel uncertainty attacks that fool models into assigning high confidence to OOD data and low confidence to ID data, respectively. +Both approaches show that detecting OOD samples and distinguishing between ID-data and OOD-data is not robust. + +Based on our results, we explore the first approaches to make DBU models more robust. 
We use adversarial training procedures based on label attacks, uncertainty attacks, or random noise and demonstrate how they affect robustness of DBU models on ID data and OOD data.",/pdf/a3b8824e9d341078b411202db5e4c2ffa5625e72.pdf,ICLR,2021, +68747kJ0qKt,7PXG0hwGUpY,1601310000000.0,1614990000000.0,1713,"On Dropout, Overfitting, and Interaction Effects in Deep Neural Networks","[""~Ben_Lengerich1"", ""~Eric_Xing1"", ""~Rich_Caruana1""]","[""Ben Lengerich"", ""Eric Xing"", ""Rich Caruana""]","[""Dropout"", ""Interaction Effects"", ""Neural Networks"", ""Functional ANOVA""]","We examine Dropout through the perspective of interactions. Given $N$ variables, there are $\mathcal{O}(N^2)$ possible pairwise interactions, $\mathcal{O}(N^3)$ possible 3-way interactions, i.e. $\mathcal{O}(N^k)$ possible interactions of $k$ variables. Conversely, the probability of an interaction of $k$ variables surviving Dropout at rate $p$ is $\mathcal{O}((1-p)^k)$. In this paper, we show that these rates cancel, and as a result, Dropout selectively regularizes against learning higher-order interactions. We prove this new perspective analytically for Input Dropout and empirically for Activation Dropout. This perspective on Dropout has several practical implications: (1) higher Dropout rates should be used when we need stronger regularization against spurious high-order interactions, (2) caution must be used when interpreting Dropout-based feature saliency measures, and (3) networks trained with Input Dropout are biased estimators, even with infinite data. We also compare Dropout to regularization via weight decay and early stopping and find that it is difficult to obtain the same regularization against high-order interactions with these methods.",/pdf/7876515b497e6aabbdb4741796d18d21b3fbe864.pdf,ICLR,2021,We show that Dropout regularizes against interaction effects. 
+ByW2Avqgg,,1478290000000.0,1486360000000.0,444,Neural Causal Regularization under the Independence of Mechanisms Assumption,"[""bahadori@gatech.edu"", ""kjchalup@caltech.edu"", ""mp2893@gatech.edu"", ""rchen87@gatech.edu"", ""StewarWF@sutterhealth.org"", ""jsun@cc.gatech.edu""]","[""Mohammad Taha Bahadori"", ""Krzysztof Chalupka"", ""Edward Choi"", ""Robert Chen"", ""Walter F. Stewart"", ""Jimeng Sun""]","[""Deep learning"", ""Applications""]","Neural networks provide a powerful framework for learning the association between input and response variables and making accurate predictions. However, in many applications such as healthcare, it is important to identify causal relationships between the inputs and the response variables to be able to change the response variables by intervention on the inputs. In pursuit of models whose predictive power comes maximally from causal variables, we propose a novel causal regularizer based on the independence of mechanisms assumption. We utilize the causal regularizer to steer deep neural network architectures towards causally-interpretable solutions. We perform a large-scale analysis of electronic health records. Employing expert's judgment as the causal ground-truth, we show that our causally-regularized algorithm outperforms its L1-regularized equivalence both in predictive performance as well as causal relevance. Finally, we show that the proposed causal regularizer can be used together with representation learning algorithms to yield up to 20% improvement in the causality score of the generated hypotheses.",/pdf/f874a8ac3db85e31cc2fdf628d8563196d36e690.pdf,ICLR,2017,We designed a neural causal regularizer to encourage predictive models to be more causal. 
+J40FkbdldTX,F4YKOp1N89f,1601310000000.0,1614990000000.0,814,Exploring single-path Architecture Search ranking correlations,"[""~Kevin_Alexander_Laube1"", ""~Andreas_Zell1""]","[""Kevin Alexander Laube"", ""Andreas Zell""]","[""Neural Architecture Search"", ""AutoML"", ""Neural Networks""]","Recently presented benchmarks for Neural Architecture Search (NAS) provide the results of training thousands of different architectures in a specific search space, thus enabling the fair and rapid comparison of different methods. +Based on these results, we quantify the ranking correlations of single-path architecture search methods +in different search space subsets and under several training variations; +studying their impact on the expected search results. +The experiments support the few-shot approach and Linear Transformers, +provide evidence against disabling cell topology sharing during the training phase or using strong regularization in the NAS-Bench-201 search space, +and show the necessity of further research regarding super-network size and path sampling strategies.",/pdf/5d9b39e4650306f0a69f439331527e51b4dc8401.pdf,ICLR,2021,An empirical study of how several method variations affect the quality of the architecture ranking prediction. +HPGtPvFNROh,nofSZnC4fv,1601310000000.0,1614990000000.0,1656,DROPS: Deep Retrieval of Physiological Signals via Attribute-specific Clinical Prototypes,"[""~Dani_Kiyasseh1"", ""tingting.zhu@eng.ox.ac.uk"", ""~David_A._Clifton1""]","[""Dani Kiyasseh"", ""Tingting Zhu"", ""David A. Clifton""]","[""Contrastive learning"", ""information retrieval"", ""clustering"", ""physiological signals"", ""healthcare""]","The ongoing digitization of health records within the healthcare industry results in large-scale datasets. Manually extracting clinically-useful insight from such datasets is non-trivial. 
However, doing so at scale while simultaneously leveraging patient-specific attributes such as sex and age can assist with clinical-trial enrollment, medical school educational endeavours, and the evaluation of the fairness of neural networks. To facilitate the reliable extraction of clinical information, we propose to learn embeddings, known as clinical prototypes (CPs), via supervised contrastive learning. We show that CPs can be efficiently used for large-scale retrieval and clustering of physiological signals based on multiple patient attributes. We also show that CPs capture attribute-specific semantic relationships.",/pdf/572a888fa66390044820fba124b967abf7b7e5e6.pdf,ICLR,2021, +ByxhOyHYwH,rJgTgX0_PH,1569440000000.0,1577170000000.0,1819,Fast Task Adaptation for Few-Shot Learning,"[""zhangyingying7@hikvision.com"", ""zhongqiaoyong@hikvision.com"", ""xiedi@hikvision.com"", ""pushiliang@hikvision.com""]","[""Yingying Zhang"", ""Qiaoyong Zhong"", ""Di Xie"", ""Shiliang Pu""]","[""Few-Shot Learning"", ""Metric-Softmax Loss"", ""Fast Task Adaptation""]","Few-shot classification is a challenging task due to the scarcity of training examples for each class. The key lies in generalization of prior knowledge learned from large-scale base classes and fast adaptation of the classifier to novel classes. In this paper, we introduce a two-stage framework. In the first stage, we attempt to learn task-agnostic feature on base data with a novel Metric-Softmax loss. The Metric-Softmax loss is trained against the whole label set and learns more discriminative feature than episodic training. Besides, the Metric-Softmax classifier can be applied to base and novel classes in a consistent manner, which is critical for the generalizability of the learned feature. In the second stage, we design a task-adaptive transformation which adapts the classifier to each few-shot setting very fast within a few tuning epochs. 
Compared with existing fine-tuning scheme, the scarce examples of novel classes are exploited more effectively. Experiments show that our approach outperforms current state-of-the-arts by a large margin on the commonly used mini-ImageNet and CUB-200-2011 benchmarks.",/pdf/821eb7a70136f226a25bfc5644695887bb86439b.pdf,ICLR,2020,We propose a novel Metric-Softmax loss to learn task-agnostic feature and adapt the classifier to each few-shot task using a task-adaptive transformation. +1eKz1kjHO1,9Tv9bymvGU4,1601310000000.0,1614990000000.0,1473,Contextual Image Parsing via Panoptic Segment Sorting,"[""~Jyh-Jing_Hwang1"", ""~Tsung-Wei_Ke2"", ""~Stella_Yu2""]","[""Jyh-Jing Hwang"", ""Tsung-Wei Ke"", ""Stella Yu""]","[""metric learning"", ""context encoding"", ""context discovery"", ""image parsing"", ""panoptic segmentation""]","Visual context is versatile and hard to describe or label precisely. We aim to leverage the densely labeled task, image parsing, a.k.a panoptic segmentation, to learn a model that encodes and discovers object-centric context. Most existing approaches based on deep learning tackle image parsing via fusion of pixel-wise classification and instance masks from two sub-networks. Such approaches isolate things from stuff and fuse the semantic and instance masks in the later stage. To encode object-centric context inherently, we propose a metric learning framework, Panoptic Segment Sorting, that is directly trained with stuff and things jointly. Our key insight is to make the panoptic embeddings separate every instance so that the model automatically learns to leverage visual context as many instances across different images appear similar. We show that the context of our model's retrieved instances is more consistent relatively by 13.7%, further demonstrating its ability to discover novel context unsupervisedly. 
Our overall framework also achieves competitive performance across standard panoptic segmentation metrics amongst the state-of-the-art methods on two large datasets, Cityscapes and PASCAL VOC. These promising results suggest that pixel-wise embeddings can not only inject new understanding into panoptic segmentation but potentially serve for other tasks such as modeling instance relationships.",/pdf/ab67da8d1d9853933fc06bb205f7dc66f19d9054.pdf,ICLR,2021,"We present a metric learning framework, panoptic segment sorting, to leverage the dense labels from image parsing for object visual context encoding and discovery." +SkxUrTVKDH,rJgdrBUvPH,1569440000000.0,1577170000000.0,525,Split LBI for Deep Learning: Structural Sparsity via Differential Inclusion Paths,"[""yanweifu@fudan.edu.cn"", ""corwinliu9669@gmail.com"", ""donghao.li@connect.ust.hk"", ""xinsun@microsoft.com"", ""jsh.zeng@gmail.com"", ""yuany@ust.hk""]","[""Yanwei Fu"", ""Chen Liu"", ""Donghao Li"", ""Xinwei Sun"", ""Jinshan ZENG"", ""Yuan Yao""]",[],"Over-parameterization is ubiquitous nowadays in training neural networks to benefit both optimization in seeking global optima and generalization in reducing prediction error. However, compressive networks are desired in many real world applications and direct training of small networks may be trapped in local optima. In this paper, instead of pruning or distilling over-parameterized models to compressive ones, we propose a new approach based on \emph{differential inclusions of inverse scale spaces}, that generates a family of models from simple to complex ones by coupling gradient descent and mirror descent to explore model structural sparsity. It has a simple discretization, called the Split Linearized Bregman Iteration (SplitLBI), whose global convergence analysis in deep learning is established that from any initializations, algorithmic iterations converge to a critical point of empirical risks. 
Experimental evidence shows that\ SplitLBI may achieve state-of-the-art performance in large scale training on ImageNet-2012 dataset etc., while with \emph{early stopping} it unveils effective subnet architecture with comparable test accuracies to dense models after retraining instead of pruning well-trained ones.",/pdf/f91c359bbd9fb350b3cc8c5858c30f3164e9741d.pdf,ICLR,2020,"SplitLBI is applied to deep learning to explore model structural sparsity, achieving state-of-the-art performance in ImageNet-2012 and unveiling effective subnet architecture." +q-cnWaaoUTH,sp6wrF0rCkA,1601310000000.0,1614580000000.0,636,Conformation-Guided Molecular Representation with Hamiltonian Neural Networks,"[""~Ziyao_Li1"", ""swyang@pku.edu.cn"", ""~Guojie_Song1"", ""cailingsheng@pku.edu.cn""]","[""Ziyao Li"", ""Shuwen Yang"", ""Guojie Song"", ""Lingsheng Cai""]","[""Molecular Representation"", ""Neural Physics Engines"", ""Molecular Dynamics"", ""Graph Neural Networks""]","Well-designed molecular representations (fingerprints) are vital to combine medical chemistry and deep learning. Whereas incorporating 3D geometry of molecules (i.e. conformations) in their representations seems beneficial, current 3D algorithms are still in infancy. In this paper, we propose a novel molecular representation algorithm which preserves 3D conformations of molecules with a Molecular Hamiltonian Network (HamNet). In HamNet, implicit positions and momentums of atoms in a molecule interact in the Hamiltonian Engine following the discretized Hamiltonian equations. These implicit coordinations are supervised with real conformations with translation- & rotation-invariant losses, and further used as inputs to the Fingerprint Generator, a message-passing neural network. 
Experiments show that the Hamiltonian Engine can well preserve molecular conformations, and that the fingerprints generated by HamNet achieve state-of-the-art performances on MoleculeNet, a standard molecular machine learning benchmark.",/pdf/7a5a25fdbe36c7b0286d17dafd233f47bf7dd30c.pdf,ICLR,2021,"We propose a molecular representation algorithm, which preserves molecular conformations with a neural physics engine and generates fingerprints with an MPNN." +r1xH5xHYwH,B1eo20eKvS,1569440000000.0,1577170000000.0,2470,Effects of Linguistic Labels on Learned Visual Representations in Convolutional Neural Networks: Labels matter!,"[""seoyoung.ahn@stonybrook.edu"", ""gregory.zelinsky@stonybrook.edu"", ""lupyan@wisc.edu""]","[""Seoyoung Ahn"", ""Gregory Zelinsky"", ""Gary Lupyan""]","[""category learning"", ""visual representation"", ""linguistic labels"", ""human behavior prediction""]","We investigated the changes in visual representations learnt by CNNs when using different linguistic labels (e.g., trained with basic-level labels only, superordinate-level only, or both at the same time) and how they compare to human behavior when asked to select which of three images is most different. We compared CNNs with identical architecture and input, differing only in what labels were used to supervise the training. The results showed that in the absence of labels, the models learn very little categorical structure that is often assumed to be in the input. Models trained with superordinate labels (vehicle, tool, etc.) 
are most helpful in allowing the models to match human categorization, implying that human representations used in odd-one-out tasks are highly modulated by semantic information not obviously present in the visual input.",/pdf/d3f2a64020a2a95f1b232edf41bb32809acb6c0e.pdf,ICLR,2020,We investigated the changes in visual representations learnt by CNNs when using different linguistic labels +pXi-zY262sE,XWHVbhS67oY,1601310000000.0,1614990000000.0,1159,Ruminating Word Representations with Random Noise Masking,"[""~Hwiyeol_Jo1"", ""~Byoung-Tak_Zhang1""]","[""Hwiyeol Jo"", ""Byoung-Tak Zhang""]","[""representation learning for natural language processing"", ""pretrained word embeddings"", ""iterative training method"", ""model regularization""]","We introduce a training method for better word representation and performance, which we call \textbf{GraVeR} (\textbf{Gra}dual \textbf{Ve}ctor \textbf{R}umination). The method is to gradually and iteratively add random noises and bias to word embeddings after training a model, and re-train the model from scratch but initialize with the noised word embeddings. Through the re-training process, some noises can be compensated and other noises can be utilized to learn better representations. As a result, we can get word representations further fine-tuned and specialized in the task. On six text classification tasks, our method improves model performances with a large gap. When GraVeR is combined with other regularization techniques, it shows further improvements. Lastly, we investigate the usefulness of GraVeR.",/pdf/d30d95a44b737f4e51115b8f523e41ca4ed35248.pdf,ICLR,2021,An iterative method to be applied on pretrained word embeddings to find better word representations. 
+rkeJRhNYDH,r1emEZQNPr,1569440000000.0,1592160000000.0,248,TabFact: A Large-scale Dataset for Table-based Fact Verification,"[""wenhuchen@ucsb.edu"", ""hongmin@ucsb.edu"", ""chenjianshu@gmail.com"", ""yunkai_zhang@ucsb.edu"", ""hongwang600@ucsb.edu"", ""shiyangli@ucsb.edu"", ""xiyou@ucsb.edu"", ""william@cs.ucsb.edu""]","[""Wenhu Chen"", ""Hongmin Wang"", ""Jianshu Chen"", ""Yunkai Zhang"", ""Hong Wang"", ""Shiyang Li"", ""Xiyou Zhou"", ""William Yang Wang""]","[""Fact Verification"", ""Tabular Data"", ""Symbolic Reasoning""]","The problem of verifying whether a textual hypothesis holds based on the given evidence, also known as fact verification, plays an important role in the study of natural language understanding and semantic representation. However, existing studies are mainly restricted to dealing with unstructured evidence (e.g., natural language sentences and documents, news, etc), while verification under structured evidence, such as tables, graphs, and databases, remains unexplored. This paper specifically aims to study the fact verification given semi-structured data as evidence. To this end, we construct a large-scale dataset called TabFact with 16k Wikipedia tables as the evidence for 118k human-annotated natural language statements, which are labeled as either ENTAILED or REFUTED. TabFact is challenging since it involves both soft linguistic reasoning and hard symbolic reasoning. To address these reasoning challenges, we design two different models: Table-BERT and Latent Program Algorithm (LPA). Table-BERT leverages the state-of-the-art pre-trained language model to encode the linearized tables and statements into continuous vectors for verification. LPA parses statements into LISP-like programs and executes them against the tables to obtain the returned binary value for verification. Both methods achieve similar accuracy but still lag far behind human performance. 
We also perform a comprehensive analysis to demonstrate great future opportunities.",/pdf/3119bcb6199dfa1eeb714b26144530ce1d987a2a.pdf,ICLR,2020,We propose a new dataset to investigate the entailment problem under semi-structured table as premise +r14Aas09Y7,rJx2zoLuFm,1538090000000.0,1545360000000.0,854,COCO-GAN: Conditional Coordinate Generative Adversarial Network,"[""hubert052702@gmail.com"", ""chang810249@gmail.com"", ""nothinglo@cmlab.csie.ntu.edu.tw"", ""dacheng@google.com"", ""wewei@google.com"", ""htchen@cs.nthu.edu.tw""]","[""Chieh Hubert Lin"", ""Chia-Che Chang"", ""Yu-Sheng Chen"", ""Da-Cheng Juan"", ""Wei Wei"", ""Hwann-Tzong Chen""]",[],"Recent advancements on Generative Adversarial Network (GAN) have inspired a wide range of works that generate synthetic images. However, the current processes have to generate an entire image at once, and therefore resolutions are limited by memory or computational constraints. In this work, we propose COnditional COordinate GAN (COCO-GAN), which generates a specific patch of an image conditioned on a spatial position rather than the entire image at a time. The generated patches are later combined together to form a globally coherent full-image. With this process, we show that the generated image can achieve competitive quality to state-of-the-arts and the generated patches are locally smooth between consecutive neighbors. One direct implication of the COCO-GAN is that it can be applied onto any coordinate systems including the cylindrical systems which makes it feasible for generating panorama images. The fact that the patch generation process is independent to each other inspires a wide range of new applications: firstly, ""Patch-Inspired Image Generation"" enables us to generate the entire image based on a single patch. Secondly, ""Partial-Scene Generation"" allows us to generate images within a customized target region. 
Finally, thanks to COCO-GAN's patch generation and massive parallelism, which enables combining patches for generating a full-image with higher resolution than state-of-the-arts.",/pdf/8025b1196ca4fc26183bc7fec52eeef3bda69d6d.pdf,ICLR,2019, +42kiJ7n_8xO,gBKiYBgdMba,1601310000000.0,1616000000000.0,2285,The geometry of integration in text classification RNNs,"[""~Kyle_Aitken1"", ""~Vinay_Venkatesh_Ramasesh1"", ""~Ankush_Garg1"", ""~Yuan_Cao2"", ""~David_Sussillo1"", ""~Niru_Maheswaranathan1""]","[""Kyle Aitken"", ""Vinay Venkatesh Ramasesh"", ""Ankush Garg"", ""Yuan Cao"", ""David Sussillo"", ""Niru Maheswaranathan""]","[""Recurrent neural networks"", ""dynamical systems"", ""interpretability"", ""document classification"", ""reverse engineering""]","Despite the widespread application of recurrent neural networks (RNNs), a unified understanding of how RNNs solve particular tasks remains elusive. In particular, it is unclear what dynamical patterns arise in trained RNNs, and how those patterns depend on the training dataset or task. This work addresses these questions in the context of text classification, building on earlier work studying the dynamics of binary sentiment-classification networks (Maheswaranathan et al., 2019). We study text-classification tasks beyond the binary case, exploring the dynamics of RNNs trained on both natural and synthetic datasets. These dynamics, which we find to be both interpretable and low-dimensional, share a common mechanism across architectures and datasets: specifically, these text-classification networks use low-dimensional attractor manifolds to accumulate evidence for each class as they process the text. 
The dimensionality and geometry of the attractor manifold are determined by the structure of the training dataset, with the dimensionality reflecting the number of scalar quantities the network remembers in order to classify. In categorical classification, for example, we show that this dimensionality is one less than the number of classes. Correlations in the dataset, such as those induced by ordering, can further reduce the dimensionality of the attractor manifold; we show how to predict this reduction using simple word-count statistics computed on the training dataset. To the degree that integration of evidence towards a decision is a common computational primitive, this work continues to lay the foundation for using dynamical systems techniques to study the inner workings of RNNs.",/pdf/bc724aa9a5ce537c4e5005d963641086e1e41bb3.pdf,ICLR,2021,"We study text classification RNNs using tools from dynamical systems analysis, finding and explaining the geometry of low-dimensional attractor manifolds." +Hyewf3AqYX,ryx6Ob0cFX,1538090000000.0,1545360000000.0,1275,A Frank-Wolfe Framework for Efficient and Effective Adversarial Attacks,"[""jc4zg@virginia.edu"", ""yijinfeng@jd.com"", ""qgu@cs.ucla.edu""]","[""Jinghui Chen"", ""Jinfeng Yi"", ""Quanquan Gu""]",[],"Depending on how much information an adversary can access to, adversarial attacks can be classified as white-box attack and black-box attack. In both cases, optimization-based attack algorithms can achieve relatively low distortions and high attack success rates. However, they usually suffer from poor time and query complexities, thereby limiting their practical usefulness. In this work, we focus on the problem of developing efficient and effective optimization-based adversarial attack algorithms. In particular, we propose a novel adversarial attack framework for both white-box and black-box settings based on the non-convex Frank-Wolfe algorithm. 
We show in theory that the proposed attack algorithms are efficient with an $O(1/\sqrt{T})$ convergence rate. The empirical results of attacking Inception V3 model and ResNet V2 model on the ImageNet dataset also verify the efficiency and effectiveness of the proposed algorithms. More specific, our proposed algorithms attain the highest attack success rate in both white-box and black-box attacks among all baselines, and are more time and query efficient than the state-of-the-art.",/pdf/34c4089621552ebd73810e1abd1d8d0b56462657.pdf,ICLR,2019, +uIc4W6MtbDA,X4YdhQJdEwS,1601310000000.0,1614990000000.0,941,ERMAS: Learning Policies Robust to Reality Gaps in Multi-Agent Simulations,"[""~Eric_Zhao1"", ""~Alexander_R_Trott1"", ""~Caiming_Xiong1"", ""~Stephan_Zheng1""]","[""Eric Zhao"", ""Alexander R Trott"", ""Caiming Xiong"", ""Stephan Zheng""]","[""Robustness"", ""Multi-Agent Learning"", ""Sim2Real"", ""Reinforcement Learning""]","Policies for real-world multi-agent problems, such as optimal taxation, can be learned in multi-agent simulations with AI agents that emulate humans. However, simulations can suffer from reality gaps as humans often act suboptimally or optimize for different objectives (i.e., bounded rationality). We introduce $\epsilon$-Robust Multi-Agent Simulation (ERMAS), a robust optimization framework to learn AI policies that are robust to such multi-agent reality gaps. The objective of ERMAS theoretically guarantees robustness to the $\epsilon$-Nash equilibria of other agents – that is, robustness to behavioral deviations with a regret of at most $\epsilon$. ERMAS efficiently solves a first-order approximation of the robustness objective using meta-learning methods. We show that ERMAS yields robust policies for repeated bimatrix games and optimal adaptive taxation in economic simulations, even when baseline notions of robustness are uninformative or intractable. 
In particular, we show ERMAS can learn tax policies that are robust to changes in agent risk aversion, improving policy objectives (social welfare) by up to 15% in complex spatiotemporal simulations using the AI Economist (Zheng et al., 2020).",/pdf/ccacd7d61154fc1c0e6e21c5e918248a1d58284e.pdf,ICLR,2021,"ERMAS efficiently trains RL agents that are robust to reality gaps in multi-agent simulations, such as complex economic simulations." +fSTD6NFIW_b,0rLTD8sl1s,1601310000000.0,1616040000000.0,2272,Understanding the failure modes of out-of-distribution generalization,"[""~Vaishnavh_Nagarajan3"", ""ajandreassen@google.com"", ""~Behnam_Neyshabur1""]","[""Vaishnavh Nagarajan"", ""Anders Andreassen"", ""Behnam Neyshabur""]","[""out-of-distribution generalization"", ""spurious correlations"", ""empirical risk minimization"", ""theoretical study""]","Empirical studies suggest that machine learning models often rely on features, such as the background, that may be spuriously correlated with the label only during training time, resulting in poor accuracy during test-time. In this work, we identify the fundamental factors that give rise to this behavior, by explaining why models fail this way even in easy-to-learn tasks where one would expect these models to succeed. In particular, through a theoretical study of gradient-descent-trained linear classifiers on some easy-to-learn tasks, we uncover two complementary failure modes. These modes arise from how spurious correlations induce two kinds of skews in the data: one geometric in nature and another, statistical. Finally, we construct natural modifications of image classification datasets to understand when these failure modes can arise in practice. 
We also design experiments to isolate the two failure modes when training modern neural networks on these datasets.",/pdf/2790b3f2ccfda08399e0549ba75e2da20bd2d1b1.pdf,ICLR,2021,"In this theoretical study, we explain why machine learning models rely on spuriously correlated features in the dataset and fail at out-of-distribution generalization." +SkAK2jg0b,rJpthog0Z,1509110000000.0,1518730000000.0,376,An Out-of-the-box Full-network Embedding for Convolutional Neural Networks,"[""dario.garcia@bsc.es"", ""armand.vilalta@bsc.es"", ""ferran.pares@bsc.es"", ""jonatan.moreno@bsc.es"", ""eduard.ayguade@bsc.es"", ""jesus.labarta@bsc.es"", ""ia@cs.upc.edu"", ""suzumurat@gmail.com""]","[""Dario Garcia-Gasulla"", ""Armand Vilalta"", ""Ferran Par\u00e9s"", ""Jonatan Moreno"", ""Eduard Ayguad\u00e9"", ""Jes\u00fas Labarta"", ""Ulises Cort\u00e9s"", ""Toyotaro Suzumura""]","[""Embedding spaces"", ""feature extraction"", ""transfer learning.""]","Transfer learning for feature extraction can be used to exploit deep representations in contexts where there is very few training data, where there are limited computational resources, or when tuning the hyper-parameters needed for training is not an option. While previous contributions to feature extraction propose embeddings based on a single layer of the network, in this paper we propose a full-network embedding which successfully integrates convolutional and fully connected features, coming from all layers of a deep convolutional neural network. To do so, the embedding normalizes features in the context of the problem, and discretizes their values to reduce noise and regularize the embedding space. Significantly, this also reduces the computational cost of processing the resultant representations. The proposed method is shown to outperform single layer embeddings on several image classification tasks, while also being more robust to the choice of the pre-trained model used for obtaining the initial features. 
The performance gap in classification accuracy between thoroughly tuned solutions and the full-network embedding is also reduced, which makes of the proposed approach a competitive solution for a large set of applications.",/pdf/784d0b148acef99c89eb735824294354f772a27d.pdf,ICLR,2018,We present a full-network embedding of CNN which outperforms single layer embeddings for transfer learning tasks. +d_Ue2glvcY8,JwqSm-bqrF,1601310000000.0,1614990000000.0,1519,Structure Controllable Text Generation,"[""~Liming_DENG1"", ""wanglong137@pingan.com.cn"", ""wangbinzhu86@gmail.com"", ""~Jiang_Qian1"", ""~Bojin_Zhuang2"", ""~Shaojun_Wang1"", ""~Jing_Xiao3""]","[""Liming DENG"", ""Long WANG"", ""Binzhu WANG"", ""Jiang Qian"", ""Bojin Zhuang"", ""Shaojun Wang"", ""Jing Xiao""]","[""Natural language generation"", ""structure representation"", ""structure controlling"", ""conditional language model"", ""structure aware transformer""]","Controlling the presented forms (or structures) of generated text are as important as controlling the generated contents during neural text generation. It helps to reduce the uncertainty and improve the interpretability of generated text. However, the structures and contents are entangled together and realized simultaneously during text generation, which is challenging for the structure controlling. In this paper, we propose an efficient, straightforward generation framework to control the structure of generated text. A structure-aware transformer (SAT) is proposed to explicitly incorporate multiple types of multi-granularity structure information to guide the text generation with corresponding structure. The structure information is extracted from given sequence template by auxiliary model, and the type of structure for the given template can be learned, represented and imitated. Extensive experiments have been conducted on both Chinese lyrics corpus and English Penn Treebank dataset. 
Both automatic evaluation metrics and human judgement demonstrate the superior capability of our model in controlling the structure of generated text, and the quality ( like Fluency and Meaningfulness) of the generated text is even better than the state-of-the-arts model. +",/pdf/4cd5ec24342ae550985ea70ee63de828c7614431.pdf,ICLR,2021,"A straightforward, interpretable structure controlling text generation framework is proposed, which is capable of learning and controlling multigranularity sequence structure from character-level to sentence-level structure." +0cmMMy8J5q,3ySw4zI3FKj,1601310000000.0,1616070000000.0,1886,Zero-Cost Proxies for Lightweight NAS,"[""~Mohamed_S_Abdelfattah1"", ""a.mehrotra1@samsung.com"", ""~\u0141ukasz_Dudziak1"", ""~Nicholas_Donald_Lane1""]","[""Mohamed S Abdelfattah"", ""Abhinav Mehrotra"", ""\u0141ukasz Dudziak"", ""Nicholas Donald Lane""]","[""NAS"", ""AutoML"", ""proxy"", ""pruning"", ""efficient""]","Neural Architecture Search (NAS) is quickly becoming the standard methodology to design neural network models. However, NAS is typically compute-intensive because multiple models need to be evaluated before choosing the best one. To reduce the computational power and time needed, a proxy task is often used for evaluating each model instead of full training. In this paper, we evaluate conventional reduced-training proxies and quantify how well they preserve ranking between neural network models during search when compared with the rankings produced by final trained accuracy. We propose a series of zero-cost proxies, based on recent pruning literature, that use just a single minibatch of training data to compute a model's score. Our zero-cost proxies use 3 orders of magnitude less computation but can match and even outperform conventional proxies. 
For example, Spearman's rank correlation coefficient between final validation accuracy and our best zero-cost proxy on NAS-Bench-201 is 0.82, compared to 0.61 for EcoNAS (a recently proposed reduced-training proxy). Finally, we use these zero-cost proxies to enhance existing NAS search algorithms such as random search, reinforcement learning, evolutionary search and predictor-based search. For all search methodologies and across three different NAS datasets, we are able to significantly improve sample efficiency, and thereby decrease computation, by using our zero-cost proxies. For example on NAS-Bench-101, we achieved the same accuracy 4$\times$ quicker than the best previous result. Our code is made public at: https://github.com/mohsaied/zero-cost-nas.",/pdf/94fe9fb70902774ff60846c845667ac64c3c8fa1.pdf,ICLR,2021,A single minibatch of data is used to score neural networks for NAS instead of performing full training. +BJgcwh4FwS,r1eOn7YFBB,1569440000000.0,1577170000000.0,15,Neural Maximum Common Subgraph Detection with Guided Subgraph Extraction,"[""yba@ucla.edu"", ""derekqxu@ucla.edu"", ""ken.qgu@gmail.com"", ""shirley0@mail.ustc.edu.cn"", ""amarinovic@ucla.edu"", ""christopher.j.ro@gmail.com"", ""yzsun@cs.ucla.edu"", ""weiwang@cs.ucla.edu""]","[""Yunsheng Bai"", ""Derek Xu"", ""Ken Gu"", ""Xueqing Wu"", ""Agustin Marinovic"", ""Christopher Ro"", ""Yizhou Sun"", ""Wei Wang""]","[""graph matching"", ""maximum common subgraph"", ""graph neural networks"", ""subgraph extraction"", ""graph alignment""]","Maximum Common Subgraph (MCS) is defined as the largest subgraph that is commonly present in both graphs of a graph pair. Exact MCS detection is NP-hard, and its state-of-the-art exact solver based on heuristic search is slow in practice without any time complexity guarantee. 
Given the huge importance of this task yet the lack of fast solver, we propose an efficient MCS detection algorithm, NeuralMCS, consisting of a novel neural network model that learns the node-node correspondence from the ground-truth MCS result, and a subgraph extraction procedure that uses the neural network output as guidance for final MCS prediction. The whole model guarantees polynomial time complexity with respect to the number of the nodes of the larger of the two input graphs. Experiments on four real graph datasets show that the proposed model is 48.1x faster than the exact solver and more accurate than all the existing competitive approximate approaches to MCS detection.",/pdf/2c52dd2cf12199741b2f90b8be9cdb370d258054.pdf,ICLR,2020, +iTeUSEw5rl2,_qvUTopGfmK,1601310000000.0,1614990000000.0,2732,Online Continual Learning Under Domain Shift,"[""~Quang_Pham1"", ""~Chenghao_Liu1"", ""~Steven_HOI1""]","[""Quang Pham"", ""Chenghao Liu"", ""Steven HOI""]","[""Continual Learning"", ""Domain Generalization""]","Existing continual learning benchmarks often assume each task's training and test data are from the same distribution, which may not hold in practice. Towards making continual learning practical, in this paper, we introduce a novel setting of online continual learning under conditional domain shift, in which domain shift exists between training and test data of all tasks: $P^{tr}(X, Y) \neq P^{te}(X,Y)$, and the model is required to generalize to unseen domains at test time. To address this problem, we propose \emph{Conditional Invariant Experience Replay (CIER)} that can simultaneously retain old knowledge, acquire new information, and generalize to unseen domains. CIER employs an adversarial training to correct the shift in $P(X,Y)$ by matching $P(X|Y)$, which results in an invariant representation that can generalize to unseen domains during inference. 
Our extensive experiments show that CIER can bridge the domain gap in continual learning and significantly outperforms state-of-the-art methods. We will release our benchmarks and implementation upon acceptance.",/pdf/5e1f308c1d07f7bc831da527683bbd644f399954.pdf,ICLR,2021, +HyeqPJHYvH,rklLxTp_DS,1569440000000.0,1577170000000.0,1777,Stochastic Latent Residual Video Prediction,"[""jean-yves.franceschi@lip6.fr"", ""edouard.delasalles@lip6.fr"", ""mickael.chen@lip6.fr"", ""sylvain.lamprier@lip6.fr"", ""patrick.gallinari@lip6.fr""]","[""Jean-Yves Franceschi"", ""Edouard Delasalles"", ""Mickael Chen"", ""Sylvain Lamprier"", ""Patrick Gallinari""]","[""stochastic video prediction"", ""variational autoencoder"", ""residual dynamics""]","Video prediction is a challenging task: models have to account for the inherent uncertainty of the future. Most works in the literature are based on stochastic image-autoregressive recurrent networks, raising several performance and applicability issues. An alternative is to use fully latent temporal models which untie frame synthesis and dynamics. However, no such model for video prediction has been proposed in the literature yet, due to design and training difficulties. In this paper, we overcome these difficulties by introducing a novel stochastic temporal model. It is based on residual updates of a latent state, motivated by discretization schemes of differential equations. 
This first-order principle naturally models video dynamics as it allows our simpler, lightweight, interpretable, latent model to outperform prior state-of-the-art methods on challenging datasets.",/pdf/01c19eb1a00a822df4d28c39840c317978b2e296.pdf,ICLR,2020, +r1lL4a4tDB,ryx-ECGDPr,1569440000000.0,1583910000000.0,486,Variational Recurrent Models for Solving Partially Observable Control Tasks,"[""dongqi.han@oist.jp"", ""doya@oist.jp"", ""jun.tani@oist.jp""]","[""Dongqi Han"", ""Kenji Doya"", ""Jun Tani""]","[""Reinforcement Learning"", ""Deep Learning"", ""Variational Inference"", ""Recurrent Neural Network"", ""Partially Observable"", ""Robotic Control"", ""Continuous Control""]","In partially observable (PO) environments, deep reinforcement learning (RL) agents often suffer from unsatisfactory performance, since two problems need to be tackled together: how to extract information from the raw observations to solve the task, and how to improve the policy. In this study, we propose an RL algorithm for solving PO tasks. Our method comprises two parts: a variational recurrent model (VRM) for modeling the environment, and an RL controller that has access to both the environment and the VRM. The proposed algorithm was tested in two types of PO robotic control tasks, those in which either coordinates or velocities were not observable and those that require long-term memorization. 
Our experiments show that the proposed algorithm achieved better data efficiency and/or learned more optimal policy than other alternative approaches in tasks in which unobserved states cannot be inferred from raw observations in a simple manner.",/pdf/008bf8ee5b9f826cd28cccde848cfe092052030e.pdf,ICLR,2020,A deep RL algorithm for solving POMDPs by auto-encoding the underlying states using a variational recurrent model +rkehoAVtvS,BJeDmmt_vr,1569440000000.0,1577170000000.0,1333,Adversarial Paritial Multi-label Learning,"[""yanyan.nwpu@gmail.com"", ""yuhongguo.cs@gmail.com""]","[""Yan Yan"", ""Yuhong Guo""]",[],"Partial multi-label learning (PML), which tackles the problem of learning multi-label prediction models from instances with overcomplete noisy annotations, has recently started gaining attention from the research community. In this paper, we propose a novel adversarial learning model, PML-GAN, under a generalized encoder-decoder framework for partial multi-label learning. The PML-GAN model uses a disambiguation network to identify noisy labels and uses a multi-label prediction network to map the training instances to the disambiguated label vectors, while deploying a generative adversarial network as an inverse mapping from label vectors to data samples in the input feature space. The learning of the overall model corresponds to a minimax adversarial game, which enhances the correspondence of input features with the output labels. 
Extensive experiments are conducted on multiple datasets, while the proposed model demonstrates the state-of-the-art performance for partial multi-label learning.",/pdf/c8d39b52e88f5a22a5b92bc8d78415fe5bcbb98f.pdf,ICLR,2020, +rklraTNFwB,ryl1zhgODB,1569440000000.0,1577170000000.0,818,Robust Instruction-Following in a Situated Agent via Transfer-Learning from Text,"[""felixhill@google.com"", ""sonka@google.com"", ""nathanielwong@google.com"", ""tharley@google.com""]","[""Felix Hill"", ""Sona Mokra"", ""Nathaniel Wong"", ""Tim Harley""]","[""agent"", ""language"", ""3D"", ""simulation"", ""policy"", ""instruction"", ""transfer""]","Recent work has described neural-network-based agents that are trained to execute language-like commands in simulated worlds, as a step towards an intelligent agent or robot that can be instructed by human users. However, the instructions that such agents are trained to follow are typically generated from templates (by an environment simulator), and do not reflect the varied or ambiguous expressions used by real people. We address this issue by integrating language encoders that are pretrained on large text corpora into a situated, instruction-following agent. In a procedurally-randomized first-person 3D world, we first train agents to follow synthetic instructions requiring the identification, manipulation and relative positioning of visually-realistic objects models. We then show how these abilities can transfer to a context where humans provide instructions in natural language, but only when agents are endowed with language encoding components that were pretrained on text-data. We explore techniques for integrating text-trained and environment-trained components into an agent, observing clear advantages for the fully-contextual phrase representations computed by the well-known BERT model, and additional gains by integrating a self-attention operation optimized to adapt BERT's representations for the agent's tasks and environment. 
These results bridge the gap between two successful strands of recent AI research: agent-centric behavior optimization and text-based representation learning. ",/pdf/495007f6bb6a4eec969c94d1d10ee9880c8fe6b2.pdf,ICLR,2020,Transfer learning from powerful text-based language models makes an agent more robust to human instructions in a 3D simulated world. +pmj131uIL9H,eW4PclG569,1601310000000.0,1615920000000.0,1030,NeMo: Neural Mesh Models of Contrastive Features for Robust 3D Pose Estimation,"[""~Angtian_Wang2"", ""~Adam_Kortylewski1"", ""~Alan_Yuille1""]","[""Angtian Wang"", ""Adam Kortylewski"", ""Alan Yuille""]","[""Pose Estimation"", ""Robust Deep Learning"", ""Contrastive Learning"", ""Render-and-Compare""]","3D pose estimation is a challenging but important task in computer vision. In this work, we show that standard deep learning approaches to 3D pose estimation are not robust to partial occlusion. Inspired by the robustness of generative vision models to partial occlusion, we propose to integrate deep neural networks with 3D generative representations of objects into a unified neural architecture that we term NeMo. In particular, NeMo learns a generative model of neural feature activations at each vertex on a dense 3D mesh. Using differentiable rendering we estimate the 3D object pose by minimizing the reconstruction error between NeMo and the feature representation of the target image. To avoid local optima in the reconstruction loss, we train the feature extractor to maximize the distance between the individual feature representations on the mesh using contrastive learning. Our extensive experiments on PASCAL3D+, occluded-PASCAL3D+ and ObjectNet3D show that NeMo is much more robust to partial occlusion compared to standard deep networks, while retaining competitive performance on non-occluded data. 
Interestingly, our experiments also show that NeMo performs reasonably well even when the mesh representation only crudely approximates the true object geometry with a cuboid, hence revealing that the detailed 3D geometry is not needed for accurate 3D pose estimation.",/pdf/8b6175a9362d0d2b28d4e33e82e7dbcef30c90a0.pdf,ICLR,2021,"We introduce NeMo, a rendering-based approach to 3D pose estimation that models objects in terms of neural feature activations, instead of image intensities." +HJcLcw9xg,,1478290000000.0,1481660000000.0,413,The Preimage of Rectifier Network Activities,"[""stefanc@kth.se"", ""azizpour@kth.se"", ""razavian@kth.se""]","[""Stefan Carlsson"", ""Hossein Azizpour"", ""Ali Razavian""]",[],"The preimage of the activity at a certain level of a deep network is the set of inputs that result in the same node activity. For fully connected multi layer rectifier networks we demonstrate how to compute the preimages of activities at arbitrary levels from knowledge of the parameters in a deep rectifying network. If the preimage set of a certain activity in the network contains elements from more than one class it means that these classes are irreversibly mixed. This implies that preimage sets which are piecewise linear manifolds are building blocks for describing the input manifolds specific classes, i.e. all preimages should ideally be from the same class. 
We believe that the knowledge of how to compute preimages will be valuable in understanding the efficiency displayed by deep learning networks and could potentially be used in designing more efficient training algorithms.",/pdf/d292514377df32b560d6fcf877c13f34dbaac64f.pdf,ICLR,2017, +S1FQEfZA-,SJrXNfZAW,1509140000000.0,1518730000000.0,818,A Classification-Based Perspective on GAN Distributions,"[""shibani@mit.edu"", ""ludwigs@mit.edu"", ""madry@mit.edu""]","[""Shibani Santurkar"", ""Ludwig Schmidt"", ""Aleksander Madry""]","[""Generative adversarial networks"", ""classification"", ""benchmark"", ""mode collapse"", ""diversity""]","A fundamental, and still largely unanswered, question in the context of Generative Adversarial Networks (GANs) is whether GANs are actually able to capture the key characteristics of the datasets they are trained on. The current approaches to examining this issue require significant human supervision, such as visual inspection of sampled images, and often offer only fairly limited scalability. In this paper, we propose new techniques that employ classification-based perspective to evaluate synthetic GAN distributions and their capability to accurately reflect the essential properties of the training data. These techniques require only minimal human supervision and can easily be scaled and adapted to evaluate a variety of state-of-the-art GANs on large, popular datasets. They also indicate that GANs have significant problems in reproducing the more distributional properties of the training dataset. 
In particular, the diversity of such synthetic data is orders of magnitude smaller than that of the original data.",/pdf/ef76594752345fb90d5417188c12475422ccc639.pdf,ICLR,2018,We propose new methods for evaluating and quantifying the quality of synthetic GAN distributions from the perspective of classification tasks +ghjxvfgv9ht,Eb68uod08u1,1601310000000.0,1614990000000.0,156,Self-Pretraining for Small Datasets by Exploiting Patch Information,"[""~Zhang_Chunyang1""]","[""Zhang Chunyang""]","[""Learning with Small Datasets"", ""Self-Pretraining""]"," Deep learning tasks with small datasets are often tackled by pretraining models with large datasets on relevant tasks. Although pretraining methods mitigate the problem of overfitting, it can be difficult to find appropriate pretrained models sometimes. In this paper, we proposed a self-pretraining method by exploiting patch information in the dataset itself without pretraining on other datasets. Our experiments show that the self-pretraining method leads to better performance than training from scratch both in the condition of not using other data.",/pdf/fdfb8c320bbdac93302bf63d636ad0a04736b3d4.pdf,ICLR,2021,Pretraining the model using patch information in the small dataset itself +Bygh9j09KX,Byelrp_9tX,1538090000000.0,1550510000000.0,566,ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness,"[""robert@geirhos.de"", ""patricia@rubisch.net"", ""claudio.michaelis@bethgelab.org"", ""matthias.bethge@uni-tuebingen.de"", ""felix.wichmann@uni-tuebingen.de"", ""wieland.brendel@bethgelab.org""]","[""Robert Geirhos"", ""Patricia Rubisch"", ""Claudio Michaelis"", ""Matthias Bethge"", ""Felix A. 
Wichmann"", ""Wieland Brendel""]","[""deep learning"", ""psychophysics"", ""representation learning"", ""object recognition"", ""robustness"", ""neural networks"", ""data augmentation""]","Convolutional Neural Networks (CNNs) are commonly thought to recognise objects by learning increasingly complex representations of object shapes. Some recent studies suggest a more important role of image textures. We here put these conflicting hypotheses to a quantitative test by evaluating CNNs and human observers on images with a texture-shape cue conflict. We show that ImageNet-trained CNNs are strongly biased towards recognising textures rather than shapes, which is in stark contrast to human behavioural evidence and reveals fundamentally different classification strategies. We then demonstrate that the same standard architecture (ResNet-50) that learns a texture-based representation on ImageNet is able to learn a shape-based representation instead when trained on 'Stylized-ImageNet', a stylized version of ImageNet. This provides a much better fit for human behavioural performance in our well-controlled psychophysical lab setting (nine experiments totalling 48,560 psychophysical trials across 97 observers) and comes with a number of unexpected emergent benefits such as improved object detection performance and previously unseen robustness towards a wide range of image distortions, highlighting advantages of a shape-based representation.",/pdf/5e60bd7409c7ed94f7440a8beb2d3fd762dad566.pdf,ICLR,2019,ImageNet-trained CNNs are biased towards object texture (instead of shape like humans). Overcoming this major difference between human and machine vision yields improved detection performance and previously unseen robustness to image distortions. 
+BJxg_hVtwH,SJlFCj0RBS,1569440000000.0,1583910000000.0,29,StructPool: Structured Graph Pooling via Conditional Random Fields,"[""hao.yuan@tamu.edu"", ""sji@tamu.edu""]","[""Hao Yuan"", ""Shuiwang Ji""]","[""Graph Pooling"", ""Representation Learning"", ""Graph Analysis""]","Learning high-level representations for graphs is of great importance for graph analysis tasks. In addition to graph convolution, graph pooling is an important but less explored research area. In particular, most existing graph pooling techniques do not consider the graph structural information explicitly. We argue that such information is important and develop a novel graph pooling technique, known as StructPool, in this work. We consider graph pooling as a node clustering problem, which requires the learning of a cluster assignment matrix. We propose to formulate it as a structured prediction problem and employ conditional random fields to capture the relationships among assignments of different nodes. We also generalize our method to incorporate graph topological information in designing the Gibbs energy function. Experimental results on multiple datasets demonstrate the effectiveness of our proposed StructPool.",/pdf/34574996509b66e286705b252e045c4f7c300d6b.pdf,ICLR,2020,A novel graph pooling method considering relationships between different nodes via conditional random fields. 
+BJlnmgrFvS,rJltZSlYDB,1569440000000.0,1577170000000.0,2225,BAIL: Best-Action Imitation Learning for Batch Deep Reinforcement Learning,"[""xc1305@nyu.edu"", ""zz1435@nyu.edu"", ""zw1454@nyu.edu"", ""cw1681@nyu.edu"", ""yanqiu.wu@nyu.edu"", ""qd319@nyu.edu"", ""keithwross@nyu.edu""]","[""Xinyue Chen"", ""Zijian Zhou"", ""Zheng Wang"", ""Che Wang"", ""Yanqiu Wu"", ""Qing Deng"", ""Keith Ross""]","[""Deep Reinforcement Learning"", ""Batch Reinforcement Learning"", ""Sample Efficiency""]","The field of Deep Reinforcement Learning (DRL) has recently seen a surge in research in batch reinforcement learning, which aims for sample-efficient learning from a given data set without additional interactions with the environment. In the batch DRL setting, commonly employed off-policy DRL algorithms can perform poorly and sometimes even fail to learn altogether. In this paper we propose a new algorithm, Best-Action Imitation Learning (BAIL), which unlike many off-policy DRL algorithms does not involve maximizing Q functions over the action space. Striving for simplicity as well as performance, BAIL first selects from the batch the actions it believes to be high-performing actions for their corresponding states; it then uses those state-action pairs to train a policy network using imitation learning. Although BAIL is simple, we demonstrate that BAIL achieves state-of-the-art performance on the MuJoCo benchmark, typically outperforming Batch-Constrained deep Q-learning (BCQ) by a wide margin.",/pdf/53498646eb805551986a2ccef0756bbae6759040.pdf,ICLR,2020,We propose a new Batch Reinforcement Learning algorithm achieving state-of-the-art performance. 
+Sy-tszZRZ,SyRUoGZRb,1509140000000.0,1518730000000.0,940,Bounding and Counting Linear Regions of Deep Neural Networks,"[""tserra@gmail.com"", ""ctjandra@andrew.cmu.edu"", ""srikumar.ramalingam@gmail.com""]","[""Thiago Serra"", ""Christian Tjandraatmadja"", ""Srikumar Ramalingam""]","[""rectifier networks"", ""maxout networks"", ""piecewise linear functions"", ""linear regions"", ""mixed-integer programming""]","In this paper, we study the representational power of deep neural networks (DNN) that belong to the family of piecewise-linear (PWL) functions, based on PWL activation units such as rectifier or maxout. We investigate the complexity of such networks by studying the number of linear regions of the PWL function. Typically, a PWL function from a DNN can be seen as a large family of linear functions acting on millions of such regions. We directly build upon the work of Montúfar et al. (2014), Montúfar (2017), and Raghu et al. (2017) by refining the upper and lower bounds on the number of linear regions for rectified and maxout networks. In addition to achieving tighter bounds, we also develop a novel method to perform exact enumeration or counting of the number of linear regions with a mixed-integer linear formulation that maps the input space to output. We use this new capability to visualize how the number of linear regions changes while training DNNs. ",/pdf/e0cfdda7417474e3b9d0665bc0123d1c6ae8426c.pdf,ICLR,2018,We empirically count the number of linear regions of rectifier networks and refine upper and lower bounds. 
+Ua6zuk0WRH,8UWz5SFDN5B,1601310000000.0,1615310000000.0,656,Rethinking Attention with Performers,"[""~Krzysztof_Marcin_Choromanski1"", ""vl304@cam.ac.uk"", ""~David_Dohan1"", ""~Xingyou_Song1"", ""~Andreea_Gane1"", ""~Tamas_Sarlos1"", ""phawkins@google.com"", ""~Jared_Quincy_Davis1"", ""~Afroz_Mohiuddin1"", ""~Lukasz_Kaiser1"", ""~David_Benjamin_Belanger1"", ""~Lucy_J_Colwell1"", ""~Adrian_Weller1""]","[""Krzysztof Marcin Choromanski"", ""Valerii Likhosherstov"", ""David Dohan"", ""Xingyou Song"", ""Andreea Gane"", ""Tamas Sarlos"", ""Peter Hawkins"", ""Jared Quincy Davis"", ""Afroz Mohiuddin"", ""Lukasz Kaiser"", ""David Benjamin Belanger"", ""Lucy J Colwell"", ""Adrian Weller""]","[""performer"", ""transformer"", ""attention"", ""softmax"", ""approximation"", ""linear"", ""bert"", ""bidirectional"", ""unidirectional"", ""orthogonal"", ""random"", ""features"", ""FAVOR"", ""kernel"", ""generalized"", ""sparsity"", ""reformer"", ""linformer"", ""protein"", ""trembl"", ""uniprot""]","We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention-kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods. FAVOR+ can also be used to efficiently model kernelizable attention mechanisms beyond softmax. This representational power is crucial to accurately compare softmax with other kernels for the first time on large-scale tasks, beyond the reach of regular Transformers, and investigate optimal attention-kernels. 
Performers are linear architectures fully compatible with regular Transformers and with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence and low estimation variance. We tested Performers on a rich set of tasks stretching from pixel-prediction through text models to protein sequence modeling. We demonstrate competitive results with other examined efficient sparse and dense attention methods, showcasing effectiveness of the novel attention-learning paradigm leveraged by Performers. ",/pdf/f9985b6b0f77c997ffb932a86a3f3ff482aaa30d.pdf,ICLR,2021,"We introduce Performers, linear full-rank-attention Transformers via provable random feature approximation methods, without relying on sparsity or low-rankness." +H1z-PsR5KX,ryl84XP5YX,1538090000000.0,1550890000000.0,240,Identifying and Controlling Important Neurons in Neural Machine Translation,"[""abau@mit.edu"", ""belinkov@mit.edu"", ""hsajjad@hbku.edu.qa"", ""ndurrani@qf.org.qa"", ""faimaduddin@qf.org.qa"", ""glass@mit.edu""]","[""Anthony Bau"", ""Yonatan Belinkov"", ""Hassan Sajjad"", ""Nadir Durrani"", ""Fahim Dalvi"", ""James Glass""]","[""neural machine translation"", ""individual neurons"", ""unsupervised"", ""analysis"", ""correlation"", ""translation control"", ""distributivity"", ""localization""]","Neural machine translation (NMT) models learn representations containing substantial linguistic information. However, it is not clear if such information is fully distributed or if some of it can be attributed to individual neurons. We develop unsupervised methods for discovering important neurons in NMT models. Our methods rely on the intuition that different models learn similar properties, and do not require any costly external supervision. We show experimentally that translation quality depends on the discovered neurons, and find that many of them capture common linguistic phenomena. 
Finally, we show how to control NMT translations in predictable ways, by modifying activations of individual neurons.",/pdf/6c8dec838caf57bfcdc76d14aacb81cfd1dee5b6.pdf,ICLR,2019,"Unsupervised methods for finding, analyzing, and controlling important neurons in NMT" +BkxpMTEtPB,ryxCqM68PS,1569440000000.0,1583910000000.0,430,GLAD: Learning Sparse Graph Recovery,"[""hshrivastava3@gatech.edu"", ""xinshi.chen@gatech.edu"", ""binghong@gatech.edu"", ""george.lan@isye.gatech.edu"", ""aluru@cc.gatech.edu"", ""hanliu@northwestern.edu"", ""lsong@cc.gatech.edu""]","[""Harsh Shrivastava"", ""Xinshi Chen"", ""Binghong Chen"", ""Guanghui Lan"", ""Srinivas Aluru"", ""Han Liu"", ""Le Song""]","[""Meta learning"", ""automated algorithm design"", ""learning structure recovery"", ""Gaussian graphical models""]","Recovering sparse conditional independence graphs from data is a fundamental problem in machine learning with wide applications. A popular formulation of the problem is an $\ell_1$ regularized maximum likelihood estimation. Many convex optimization algorithms have been designed to solve this formulation to recover the graph structure. Recently, there is a surge of interest to learn algorithms directly based on data, and in this case, learn to map empirical covariance to the sparse precision matrix. However, it is a challenging task in this case, since the symmetric positive definiteness (SPD) and sparsity of the matrix are not easy to enforce in learned algorithms, and a direct mapping from data to precision matrix may contain many parameters. We propose a deep learning architecture, GLAD, which uses an Alternating Minimization (AM) algorithm as our model inductive bias, and learns the model parameters via supervised learning. 
We show that GLAD learns a very compact and effective model for recovering sparse graphs from data.",/pdf/de4a1ee30acfa6bdb89070aabf16e4faf3317da4.pdf,ICLR,2020,A data-driven learning algorithm based on unrolling the Alternating Minimization optimization for sparse graph recovery. +HyTrSegCb,r1hBrleAZ,1509060000000.0,1518730000000.0,211,Achieving morphological agreement with Concorde,"[""daniil.polykovskiy@gmail.com"", ""d.soloviev@corp.mail.ru""]","[""Daniil Polykovskiy"", ""Dmitry Soloviev""]","[""NLP"", ""morphology"", ""seq2seq""]","Neural conversational models are widely used in applications like personal assistants and chat bots. These models seem to give better performance when operating at the word level. However, for fusional languages like French, Russian and Polish, the vocabulary size sometimes becomes infeasible since most of the words have many word forms. We propose a neural network architecture for transforming normalized text into a grammatically correct one. Our model efficiently employs correspondence between normalized and target words and significantly outperforms character-level models while being 2x faster in training and 20\% faster at evaluation. We also propose a new pipeline for building conversational models: first generate a normalized answer and then transform it into a grammatically correct one using our network. The proposed pipeline gives better performance than character-level conversational models according to assessor testing.",/pdf/a526d821794832ca248475cb445c3438572d7867.pdf,ICLR,2018,Proposed architecture to solve morphological agreement task +SJg6nj09F7,SJgz8UoqKQ,1538090000000.0,1545360000000.0,756,NEURAL MALWARE CONTROL WITH DEEP REINFORCEMENT LEARNING,"[""yu.wang@yale.edu"", ""jstokes@microsoft.com"", ""mady@microsoft.com""]","[""Yu Wang"", ""Jack W. 
Stokes"", ""Mady Marinescu""]","[""malware"", ""execution"", ""control"", ""deep reinforcement learning""]","Antimalware products are a key component in detecting malware attacks, and their engines typically execute unknown programs in a sandbox prior to running them on the native operating system. Files cannot be scanned indefinitely so the engine employs heuristics to determine when to halt execution. Previous research has investigated analyzing the sequence of system calls generated during this emulation process to predict if an unknown file is malicious, but these models require the emulation to be stopped after executing a fixed number of events from the beginning of the file. Also, these classifiers are not accurate enough to halt emulation in the middle of the file on their own. In this paper, we propose a novel algorithm which overcomes this limitation and learns the best time to halt the file's execution based on deep reinforcement learning (DRL). Because the new DRL-based system continues to emulate the unknown file until it can make a confident decision to stop, it prevents attackers from avoiding detection by initiating malicious activity after a fixed number of system calls. Results show that the proposed malware execution control model automatically halts emulation for 91.3\% of the files earlier than heuristics employed by the engine. Furthermore, classifying the files at that time improves the true positive rate by 61.5%, at a false positive rate of 1%, compared to a baseline classifier.",/pdf/e6f889d078a11ea1762d4440553d01a70a82346a.pdf,ICLR,2019,A deep reinforcement learning-based system is proposed to control when to halt the emulation of an unknown file and to improve the detection rate of a deep malware classifier. +SJlh2jR9FX,rkeA2gTctX,1538090000000.0,1545360000000.0,745,Learning with Reflective Likelihoods,"[""abd2141@columbia.edu"", ""kyunghyun.cho@nyu.edu"", ""david.blei@columbia.edu"", ""yann@fb.com""]","[""Adji B. 
Dieng"", ""Kyunghyun Cho"", ""David M. Blei"", ""Yann LeCun""]","[""new learning criterion"", ""penalized maximum likelihood"", ""posterior inference in deep generative models"", ""input forgetting issue"", ""latent variable collapse issue""]","Models parameterized by deep neural networks have achieved state-of-the-art results in many domains. These models are usually trained using the maximum likelihood principle with a finite set of observations. However, training deep probabilistic models with maximum likelihood can lead to the issue we refer to as input forgetting. In deep generative latent-variable models, input forgetting corresponds to posterior collapse---a phenomenon in which the latent variables are driven independent from the observations. However input forgetting can happen even in the absence of latent variables. We attribute input forgetting in deep probabilistic models to the finite sample dilemma of maximum likelihood. We formalize this problem and propose a learning criterion---termed reflective likelihood---that explicitly prevents input forgetting. We empirically observe that the proposed criterion significantly outperforms the maximum likelihood objective when used in classification under a skewed class distribution. Furthermore, the reflective likelihood objective prevents posterior collapse when used to train stochastic auto-encoders with amortized inference. For example in a neural topic modeling experiment, the reflective likelihood objective leads to better quantitative and qualitative results than the variational auto-encoder and the importance-weighted auto-encoder.",/pdf/5e772cc8357f21e9349374f5428f9cfe3d7bac06.pdf,ICLR,2019,"Training deep probabilistic models with maximum likelihood often leads to ""input forgetting"". We identify a potential cause and propose a new learning criterion to alleviate the issue." 
+mRNkPVHyIVX,Ku4khURArJQ7,1601310000000.0,1614990000000.0,1148,Exploiting Safe Spots in Neural Networks for Preemptive Robustness and Out-of-Distribution Detection,"[""~Seungyong_Moon1"", ""~Gaon_An1"", ""~Hyun_Oh_Song1""]","[""Seungyong Moon"", ""Gaon An"", ""Hyun Oh Song""]","[""adversarial defense"", ""out-of-distribution detection""]","Recent advances on adversarial defense mainly focus on improving the classifier’s robustness against adversarially perturbed inputs. In this paper, we turn our attention from classifiers to inputs and explore if there exist safe spots in the vicinity of natural images that are robust to adversarial attacks. In this regard, we introduce a novel bi-level optimization algorithm that can find safe spots on over 90% of the correctly classified images for adversarially trained classifiers on CIFAR-10 and ImageNet datasets. Our experiments also show that they can be used to improve both the empirical and certified robustness on smoothed classifiers. Furthermore, by exploiting a novel safe spot inducing model training scheme and our safe spot generation method, we propose a new out-of-distribution detection algorithm which achieves the state of the art results on near-distribution outliers.",/pdf/f683439c1c6493752eb40d589f4ee1718daf647c.pdf,ICLR,2021,"We define a new problem on adversarial robustness of neural networks, named preemptive robustness, and develop a novel algorithm to improve the robustness. " +Skx6WaEYPH,Skxuk-FLDH,1569440000000.0,1577170000000.0,393,Bandlimiting Neural Networks Against Adversarial Attacks,"[""yuping@eecs.yorku.ca"", ""kasraah@eecs.yorku.ca"", ""hj@cse.yorku.ca""]","[""Yuping Lin"", ""Kasra Ahmadi K. A."", ""Hui Jiang""]","[""adversarial examples"", ""adversarial attack defense"", ""neural network"", ""Fourier analysis""]","In this paper, we study the adversarial attack and defence problem in deep learning from the perspective of Fourier analysis. 
We first explicitly compute the Fourier transform of deep ReLU neural networks and show that there exist decaying but non-zero high frequency components in the Fourier spectrum of neural networks. We then demonstrate that the vulnerability of neural networks towards adversarial samples can be attributed to these insignificant but non-zero high frequency components. Based on this analysis, we propose to use a simple post-averaging technique to smooth out these high frequency components to improve the robustness of neural networks against adversarial attacks. Experimental results on the ImageNet and the CIFAR-10 datasets have shown that our proposed method is universally effective to defend many existing adversarial attacking methods proposed in the literature, including FGSM, PGD, DeepFool and C&W attacks. Our post-averaging method is simple since it does not require any re-training, and meanwhile it can successfully defend over 80-96% of the adversarial samples generated by these methods without introducing significant performance degradation (less than 2%) on the original clean images.",/pdf/0b105ef70558ff48d04952195ddaca263fec6c59.pdf,ICLR,2020,"An insight into the reason of adversarial vulnerability, an effective defense method against adversarial attacks." +rJl4BsR5KX,H1lerc30um,1538090000000.0,1545360000000.0,79,k-Nearest Neighbors by Means of Sequence to Sequence Deep Neural Networks and Memory Networks,"[""yimingxu2020@u.northwestern.edu"", ""d-klabjan@northwestern.edu""]","[""Yiming Xu"", ""Diego Klabjan""]",[],"k-Nearest Neighbors is one of the most fundamental but effective classification models. In this paper, we propose two families of models built on a sequence to sequence model and a memory network model to mimic the k-Nearest Neighbors model, which generate a sequence of labels, a sequence of out-of-sample feature vectors and a final label for classification, and thus they could also function as oversamplers. 
We also propose `out-of-core' versions of our models which assume that only a small portion of data can be loaded into memory. Computational experiments show that our models outperform k-Nearest Neighbors, a feed-forward neural network and a memory network, due to the fact that our models must produce additional output and not just the label. As an oversampler on imbalanced datasets, the sequence to sequence kNN model often outperforms Synthetic Minority Over-sampling Technique and Adaptive Synthetic Sampling. +",/pdf/51acf3f6c9e86f207f689eceabd764aa5d591a52.pdf,ICLR,2019, +mzfqkPOhVI4,b7rOfs8U6RR,1601310000000.0,1614990000000.0,2458,Adaptive Spatial-Temporal Inception Graph Convolutional Networks for Multi-step Spatial-Temporal Network Data Forecasting,"[""~Xing_Wang4"", ""~Lin_Zhu5"", ""~Juan_Zhao2"", ""~Zhou_Xu1"", ""~Zhao_Li4"", ""~Junlan_Feng2"", ""~Chao_Deng3""]","[""Xing Wang"", ""Lin Zhu"", ""Juan Zhao"", ""Zhou Xu"", ""Zhao Li"", ""Junlan Feng"", ""Chao Deng""]",[],"Spatial-temporal data forecasting is of great importance for industries such as telecom network operation and transportation management. However, spatial-temporal data is inherent with complex spatial-temporal correlations and behaves heterogeneities among the spatial and temporal aspects, which makes the forecasting remain as a very challenging task though recently great work has been done. In this paper, we propose a novel model, Adaptive Spatial-Temporal Inception Graph Convolution Networks (ASTI-GCN), to solve the multi-step spatial-temporal data forecasting problem. The model proposes multi-scale spatial-temporal joint graph convolution block to directly model the spatial-temporal joint correlations without introducing elaborately constructed mechanisms. Moreover inception mechanism combined with the graph node-level attention is introduced to make the model capture the heterogeneous nature of the graph adaptively. 
Our experiments on three real-world datasets from two different fields consistently show ASTI-GCN outperforms the state-of-the-art performance. In addition, ASTI-GCN is proved to generalize well. ",/pdf/6ded9f18bebef622c69c4c3629aea7e77246bdb1.pdf,ICLR,2021, +PeT5p3ocagr,ajPesU0aMB,1601310000000.0,1614990000000.0,1654,PGPS : Coupling Policy Gradient with Population-based Search,"[""~Namyong_Kim1"", ""hisuk31@kaist.ac.kr"", ""~Hayong_Shin1""]","[""Namyong Kim"", ""Hyunsuk Baek"", ""Hayong Shin""]","[""Reinforcement Learning"", ""Population-based Search"", ""Policy Gradient"", ""Combining PG with PS""]","Gradient-based policy search algorithms (such as PPO, SAC or TD3) in deep reinforcement learning (DRL) have shown successful results on a range of challenging control tasks. However, they often suffer from flat or deceptive gradient problems. As an alternative to policy gradient methods, population-based evolutionary approaches have been applied to DRL. While population-based search algorithms show more robust learning in a broader range of tasks, they are usually inefficient in the use of samples. Recently, reported are a few attempts (such as CEMRL) to combine gradient with a population in searching optimal policy. This kind of hybrid algorithm takes advantage of both camps. In this paper, we propose yet another hybrid algorithm, which more tightly couples policy gradient with the population-based search. More specifically, we use the Cross-Entropy Method (CEM) for population-based search and Twin Delayed Deep Deterministic Policy Gradient (TD3) for policy gradient. In the proposed algorithm called Coupling Policy Gradient with Population-based Search (PGPS), a single TD3 agent, which learns by a gradient from all experiences generated by population, leads a population by providing its critic function Q as a surrogate to select better performing next-generation population from candidates. 
On the other hand, if the TD3 agent falls behind the CEM population, then the TD3 agent is updated toward the elite member of the CEM population using loss function augmented with the distance between the TD3 and the CEM elite. Experiments in a MuJoCo environment show that PGPS is robust to deceptive gradient and also outperforms the state-of-the-art algorithms. +",/pdf/cbdc6daadb860db6f1c6e452b2e3e69d38362da8.pdf,ICLR,2021, +ByeMB3Act7,HygObw69Km,1538090000000.0,1550730000000.0,1524,Learning to Screen for Fast Softmax Inference on Large Vocabulary Neural Networks,"[""patrickchen@g.ucla.edu"", ""sisidaisy@google.com"", ""sanjivk@google.com"", ""liyang@google.com"", ""chohsieh@cs.ucla.edu""]","[""Patrick Chen"", ""Si Si"", ""Sanjiv Kumar"", ""Yang Li"", ""Cho-Jui Hsieh""]","[""fast inference"", ""softmax computation"", ""natural language processing""]","Neural language models have been widely used in various NLP tasks, including machine translation, next word prediction and conversational agents. However, it is challenging to deploy these models on mobile devices due to their slow prediction speed, where the bottleneck is to compute top candidates in the softmax layer. In this paper, we introduce a novel softmax layer approximation algorithm by exploiting the clustering structure of context vectors. Our algorithm uses a light-weight screening model to predict a much smaller set of candidate words based on the given context, and then conducts an exact softmax only within that subset. Training such a procedure end-to-end is challenging as traditional clustering methods are discrete and non-differentiable, and thus unable to be used with back-propagation in the training process. Using the Gumbel softmax, we are able to train the screening model end-to-end on the training set to exploit data distribution. 
The algorithm achieves an order of magnitude faster inference than the original softmax layer for predicting top-k words in various tasks such as beam search in machine translation or next words prediction. For example, for machine translation task on German to English dataset with around 25K vocabulary, we can achieve 20.4 times speed up with 98.9% precision@1 and 99.3% precision@5 with the original softmax layer prediction, while state-of-the-art (Zhang et al., 2018) only achieves 6.7x speedup with 98.7% precision@1 and 98.1% precision@5 for the same task.",/pdf/8f4b29f6a03da8a162d9e2e4727ca5f80c1f6bf8.pdf,ICLR,2019, +m4baHw5LZ7M,mhH0mRcRFa9,1601310000000.0,1614990000000.0,1476,Deep Learning Solution of the Eigenvalue Problem for Differential Operators,"[""~Ido_Ben-Shaul1"", ""~Leah_Bar2"", ""~Nir_Sochen1""]","[""Ido Ben-Shaul"", ""Leah Bar"", ""Nir Sochen""]","[""Eigenvalue problem"", ""Unsupervised learning"", ""Laplacian operator""]","Solving the eigenvalue problem for differential operators is a common problem in many scientific fields. Classical numerical methods rely on intricate domain discretization, and yield non-analytic or non-smooth approximations. We introduce a novel Neural Network (NN)-based solver for the eigenvalue problem of differential self-adjoint operators where the eigenpairs are learned in an unsupervised end-to-end fashion. We propose three different training procedures, for solving increasingly challenging tasks towards the general eigenvalue problem. + +The proposed solver is able to find the M smallest eigenpairs for a general differential operator. We demonstrate the method on the Laplacian operator which is of particular interest in image processing, computer vision, shape analysis among many other applications. +Unlike other numerical methods such as finite differences, the partial derivatives of the network approximation of the eigenfunction can be analytically calculated to any order. 
Therefore, the proposed framework enables the solution of higher order operators and on free shape domain or even on a manifold. Non-linear operators can be investigated by this approach as well.",/pdf/515ecdfc93cb95ebd398546660d487bd10fcc3dd.pdf,ICLR,2021,We propose an unsupervised neural network-based solver for the eigenvalue problem of differential operators demonstrated on one and two dimensional Laplacian. +HIGSa_3kOx3,Y2wC0wzpQ0E,1601310000000.0,1616060000000.0,2690,Reset-Free Lifelong Learning with Skill-Space Planning,"[""~Kevin_Lu2"", ""~Aditya_Grover1"", ""~Pieter_Abbeel2"", ""~Igor_Mordatch4""]","[""Kevin Lu"", ""Aditya Grover"", ""Pieter Abbeel"", ""Igor Mordatch""]","[""reset-free"", ""lifelong"", ""reinforcement learning""]","The objective of \textit{lifelong} reinforcement learning (RL) is to optimize agents which can continuously adapt and interact in changing environments. However, current RL approaches fail drastically when environments are non-stationary and interactions are non-episodic. We propose \textit{Lifelong Skill Planning} (LiSP), an algorithmic framework for lifelong RL based on planning in an abstract space of higher-order skills. We learn the skills in an unsupervised manner using intrinsic rewards and plan over the learned skills using a learned dynamics model. Moreover, our framework permits skill discovery even from offline data, thereby reducing the need for excessive real-world interactions. We demonstrate empirically that LiSP successfully enables long-horizon planning and learns agents that can avoid catastrophic failures even in challenging non-stationary and non-episodic environments derived from gridworld and MuJoCo benchmarks.",/pdf/c2294e8113d0b33d3849f2a97396d946826c3de3.pdf,ICLR,2021, +HkGmDsR9YQ,r1lx4VMqtX,1538090000000.0,1545360000000.0,250,Generalization and Regularization in DQN,"[""jfarebro@ualberta.ca"", ""machado@ualberta.ca"", ""mbowling@ualberta.ca""]","[""Jesse Farebrother"", ""Marlos C. 
Machado"", ""Michael Bowling""]","[""generalization"", ""reinforcement learning"", ""dqn"", ""regularization"", ""transfer learning"", ""multitask""]","Deep reinforcement learning (RL) algorithms have shown an impressive ability to learn complex control policies in high-dimensional environments. However, despite the ever-increasing performance on popular benchmarks like the Arcade Learning Environment (ALE), policies learned by deep RL algorithms can struggle to generalize when evaluated in remarkably similar environments. These results are unexpected given the fact that, in supervised learning, deep neural networks often learn robust features that generalize across tasks. In this paper, we study the generalization capabilities of DQN in order to aid in understanding this mismatch between generalization in deep RL and supervised learning methods. We provide evidence suggesting that DQN overspecializes to the domain it is trained on. We then comprehensively evaluate the impact of traditional methods of regularization from supervised learning, $\ell_2$ and dropout, and of reusing learned representations to improve the generalization capabilities of DQN. We perform this study using different game modes of Atari 2600 games, a recently introduced modification for the ALE which supports slight variations of the Atari 2600 games used for benchmarking in the field. Despite regularization being largely underutilized in deep RL, we show that it can, in fact, help DQN learn more general features. These features can then be reused and fine-tuned on similar tasks, considerably improving the sample efficiency of DQN.",/pdf/9c03b9e2e557da1e049ad5b1e79bee6a7338bf30.pdf,ICLR,2019,"We study the generalization capabilities of DQN using the new modes and difficulties of Atari games. We show how regularization can improve DQN's ability to generalize across tasks, something it often fails to do." 
+HJxwDiActX,rylqmc3Ytm,1538090000000.0,1550460000000.0,270,StrokeNet: A Neural Painting Environment,"[""zhengningyuan@qq.com"", ""winhehe@163.com"", ""djhuang@dase.ecnu.edu.cn""]","[""Ningyuan Zheng"", ""Yifan Jiang"", ""Dingjiang Huang""]","[""image generation"", ""differentiable model"", ""reinforcement learning"", ""deep learning"", ""model based""]","We've seen tremendous success of image generating models these years. Generating images through a neural network is usually pixel-based, which is fundamentally different from how humans create artwork using brushes. To imitate human drawing, interactions between the environment and the agent is required to allow trials. However, the environment is usually non-differentiable, leading to slow convergence and massive computation. In this paper we try to address the discrete nature of software environment with an intermediate, differentiable simulation. We present StrokeNet, a novel model where the agent is trained upon a well-crafted neural approximation of the painting environment. With this approach, our agent was able to learn to write characters such as MNIST digits faster than reinforcement learning approaches in an unsupervised manner. Our primary contribution is the neural simulation of a real-world environment. Furthermore, the agent trained with the emulated environment is able to directly transfer its skills to real-world software.",/pdf/8ab6e2c743c8758ff0040d778891e122557dc572.pdf,ICLR,2019,"StrokeNet is a novel architecture where the agent is trained to draw by strokes on a differentiable simulation of the environment, which could effectively exploit the power of back-propagation." 
+nzpLWnVAyah,ru3gk47JtY9,1601310000000.0,1615980000000.0,3092,"On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines","[""~Marius_Mosbach1"", ""~Maksym_Andriushchenko1"", ""~Dietrich_Klakow1""]","[""Marius Mosbach"", ""Maksym Andriushchenko"", ""Dietrich Klakow""]","[""fine-tuning stability"", ""transfer learning"", ""pretrained language model"", ""BERT""]","Fine-tuning pre-trained transformer-based language models such as BERT has become a common practice dominating leaderboards across various NLP benchmarks. Despite the strong empirical performance of fine-tuned models, fine-tuning is an unstable process: training the same model with multiple random seeds can result in a large variance of the task performance. Previous literature (Devlin et al., 2019; Lee et al., 2020; Dodge et al., 2020) identified two potential reasons for the observed instability: catastrophic forgetting and small size of the fine-tuning datasets. In this paper, we show that both hypotheses fail to explain the fine-tuning instability. We analyze BERT, RoBERTa, and ALBERT, fine-tuned on commonly used datasets from the GLUE benchmark, and show that the observed instability is caused by optimization difficulties that lead to vanishing gradients. Additionally, we show that the remaining variance of the downstream task performance can be attributed to differences in generalization where fine-tuned models with the same training loss exhibit noticeably different test performance. Based on our analysis, we present a simple but strong baseline that makes fine-tuning BERT-based models significantly more stable than the previously proposed approaches. Code to reproduce our results is available online: https://github.com/uds-lsv/bert-stable-fine-tuning.",/pdf/ecb1af8e8fc55b9e071db6ef6b56163a21f00a44.pdf,ICLR,2021,We provide an analysis of the fine-tuning instability of BERT-based models and present a simple method to fix it. 
+9GBZBPn0Jx,Xfnwi-_qTa,1601310000000.0,1615920000000.0,2202,HalentNet: Multimodal Trajectory Forecasting with Hallucinative Intents,"[""~Deyao_Zhu1"", ""~Mohamed_Zahran1"", ""~Li_Erran_Li1"", ""~Mohamed_Elhoseiny1""]","[""Deyao Zhu"", ""Mohamed Zahran"", ""Li Erran Li"", ""Mohamed Elhoseiny""]",[],"Motion forecasting is essential for making intelligent decisions in robotic navigation. As a result, the multi-agent behavioral prediction has become a core component of modern human-robot interaction applications such as autonomous driving. Due to various intentions and interactions among agents, agent trajectories can have multiple possible futures. Hence, the motion forecasting model's ability to cover possible modes becomes essential to enable accurate prediction. Towards this goal, we introduce HalentNet to better model the future motion distribution in addition to a traditional trajectory regression learning objective by incorporating generative augmentation losses. We model intents with unsupervised discrete random variables whose training is guided by a collaboration between two key signals: A discriminative loss that encourages intents' diversity and a hallucinative loss that explores intent transitions (i.e., mixed intents) and encourages their smoothness. This regulates the neural network behavior to be more accurately predictive on uncertain scenarios due to the active yet careful exploration of possible future agent behavior. Our model's learned representation leads to better and more semantically meaningful coverage of the trajectory distribution. Our experiments show that our method can improve over the state-of-the-art trajectory forecasting benchmarks, including vehicles and pedestrians, for about 20% on average FDE and 50% on road boundary violation rate when predicting 6 seconds future. 
We also conducted human experiments to show that our predicted trajectories received 39.6% more votes than the runner-up approach and 32.2% more votes than our variant without hallucinative mixed intent loss. The code will be released soon. ",/pdf/9f7a72a2ef90cdbf07b3067bf2417e59741a2483.pdf,ICLR,2021, +ryPx38qge,,1478290000000.0,1483440000000.0,325,A hybrid network: Scattering and Convnet,"[""edouard.oyallon@ens.fr""]","[""Edouard Oyallon""]","[""Computer vision"", ""Unsupervised Learning"", ""Deep learning""]","This paper shows how, by combining prior and supervised representations, one can create architectures that lead to nearly state-of-the-art results on standard benchmarks, which mean they perform as well as a deep network learned from scratch. We use scattering as a generic and fixed initialization of the first layers of a deep network, and learn the remaining layers in a supervised manner. We numerically demonstrate that deep hybrid scattering networks generalize better on small datasets than supervised deep networks. Scattering networks could help current systems to save computation time, while guaranteeing the stability to geometric transformations and noise of the first internal layers. We also show that the learned operators explicitly build invariances to geometrical variabilities, such as local rotation and translation, by analyzing the third layer of our architecture. We demonstrate that it is possible to replace the scattering transform by a standard deep network at the cost of having to learn more parameters and potentially adding instabilities. Finally, we release a new software, ScatWave, using GPUs for fast computations of a scattering network that is integrated in Torch. 
We evaluate our model on the CIFAR10, CIFAR100 and STL10 datasets.",/pdf/4787c58ce796ae759bfe5fb0a6d5a9603d86aa8e.pdf,ICLR,2017,"This paper shows how, by combining prior and supervised representations, one can create architectures that lead to nearly state-of-the-art results on standard benchmarks." +tEw4vEEhHjI,qiTiRIyGe4X,1601310000000.0,1614990000000.0,593,Fixing Asymptotic Uncertainty of Bayesian Neural Networks with Infinite ReLU Features,"[""~Agustinus_Kristiadi1"", ""~Matthias_Hein2"", ""~Philipp_Hennig1""]","[""Agustinus Kristiadi"", ""Matthias Hein"", ""Philipp Hennig""]","[""Bayesian deep learning"", ""Gaussian processes"", ""uncertainty quantification""]","Approximate Bayesian methods can mitigate overconfidence in ReLU networks. However, far away from the training data, even Bayesian neural networks (BNNs) can still underestimate uncertainty and thus be overconfident. We suggest to fix this by considering an infinite number of ReLU features over the input domain that are never part of the training process and thus remain at prior values. Perhaps surprisingly, we show that this model leads to a tractable Gaussian process (GP) term that can be added to a pre-trained BNN's posterior at test time with negligible cost overhead. The BNN then yields structured uncertainty in the proximity of training data, while the GP prior calibrates uncertainty far away from them. As a key contribution, we prove that the added uncertainty yields cubic predictive variance growth, and thus the ideal uniform (maximum entropy) confidence in multi-class classification far from the training data.",/pdf/d8e4ced5cd5028d7ee0f29947a58b7f48eb0ccb2.pdf,ICLR,2021,"A way to add infinitely many ReLU features away from the support of the data to achieve uniform asymptotic uncertainty, at negligible additional cost." 
+BklMYjC9FQ,Hkx7Jk9cYm,1538090000000.0,1545360000000.0,423,microGAN: Promoting Variety through Microbatch Discrimination,"[""goncalo.mordido@hpi.de"", ""haojin.yang@hpi.de"", ""christoph.meinel@hpi.de""]","[""Goncalo Mordido"", ""Haojin Yang"", ""Christoph Meinel""]","[""adversarial training"", ""gans""]","We propose to tackle the mode collapse problem in generative adversarial networks (GANs) by using multiple discriminators and assigning a different portion of each minibatch, called microbatch, to each discriminator. We gradually change each discriminator's task from distinguishing between real and fake samples to discriminating samples coming from inside or outside its assigned microbatch by using a diversity parameter $\alpha$. The generator is then forced to promote variety in each minibatch to make the microbatch discrimination harder to achieve by each discriminator. Thus, all models in our framework benefit from having variety in the generated set to reduce their respective losses. We show evidence that our solution promotes sample diversity since early training stages on multiple datasets.",/pdf/bf1d67e8b7026fbbc4020588ffd69cd17fe84abf.pdf,ICLR,2019,We use microbatch discrimination on multi-adversarial GANs to mitigate mode collapse. +S1eRbANtDB,B1lsXdNuDH,1569440000000.0,1583910000000.0,985,Learning to Link,"[""ninamf@cs.cmu.edu"", ""tdick@ttic.edu"", ""manuel.lang@student.kit.edu""]","[""Maria-Florina Balcan"", ""Travis Dick"", ""Manuel Lang""]","[""Data-driven Algorithm Configuration"", ""Metric Learning"", ""Linkage Clustering"", ""Learning Algorithms""]","Clustering is an important part of many modern data analysis pipelines, including network analysis and data retrieval. There are many different clustering algorithms developed by various communities, and it is often not clear which algorithm will give the best performance on a specific clustering task. 
Similarly, we often have multiple ways to measure distances between data points, and the best clustering performance might require a non-trivial combination of those metrics. In this work, we study data-driven algorithm selection and metric learning for clustering problems, where the goal is to simultaneously learn the best algorithm and metric for a specific application. The family of clustering algorithms we consider is parameterized linkage based procedures that includes single and complete linkage. The family of distance functions we learn over are convex combinations of base distance functions. We design efficient learning algorithms which receive samples from an application-specific distribution over clustering instances and learn a near-optimal distance and clustering algorithm from these classes. We also carry out a comprehensive empirical evaluation of our techniques showing that they can lead to significantly improved clustering performance on real-world datasets.",/pdf/9938c3f64cb34c88f61529dd494643322a66a067.pdf,ICLR,2020,We show how to use data to automatically learn low-loss linkage procedures and metrics for specific clustering applications. +vhKe9UFbrJo,gig2usnr9F,1601310000000.0,1615910000000.0,1601,Relating by Contrasting: A Data-efficient Framework for Multimodal Generative Models,"[""~Yuge_Shi1"", ""~Brooks_Paige1"", ""~Philip_Torr1"", ""~Siddharth_N1""]","[""Yuge Shi"", ""Brooks Paige"", ""Philip Torr"", ""Siddharth N""]","[""Deep generative model"", ""multi-modal learning"", ""representation learning""]","Multimodal learning for generative models often refers to the learning of abstract concepts from the commonality of information in multiple modalities, such as vision and language. While it has proven effective for learning generalisable representations, the training of such models often requires a large amount of related multimodal data that shares commonality, which can be expensive to come by. 
To mitigate this, we develop a novel contrastive framework for generative model learning, allowing us to train the model not just by the commonality between modalities, but by the distinction between ""related"" and ""unrelated"" multimodal data. We show in experiments that our method enables data-efficient multimodal learning on challenging datasets for various multimodal VAE models. We also show that under our proposed framework, the generative model can accurately identify related samples from unrelated ones, making it possible to make use of the plentiful unlabeled, unpaired multimodal data.",/pdf/9f989526ad09a1935c06a52631aac027c4f90633.pdf,ICLR,2021, +ryeQmCVYPS,HkgoUvSOwB,1569440000000.0,1577170000000.0,1031,Defective Convolutional Layers Learn Robust CNNs,"[""luotg@pku.edu.cn"", ""caitianle1998@pku.edu.cn"", ""zhan147@usc.edu"", ""siyuchen@pku.edu.cn"", ""dihe@microsoft.com"", ""wanglw@cis.pku.edu.cn""]","[""Tiange Luo"", ""Tianle Cai"", ""Xiaomeng Zhang"", ""Siyu Chen"", ""Di He"", ""Liwei Wang""]","[""adversarial examples"", ""robust machine learning"", ""cnn structure"", ""deep feature representations""]","Robustness of convolutional neural networks has recently been highlighted by the adversarial examples, i.e., inputs added with well-designed perturbations which are imperceptible to humans but can cause the network to give incorrect outputs. Recent research suggests that the noises in adversarial examples break the textural structure, which eventually leads to wrong predictions by convolutional neural networks. To help a convolutional neural network make predictions relying less on textural information, we propose defective convolutional layers which contain defective neurons whose activations are set to be a constant function. 
As the defective neurons contain no information and are far different from the standard neurons in their spatial neighborhood, the textural features cannot be accurately extracted and the model has to seek other features for classification, such as the shape. We first show that predictions made by the defective CNN are less dependent on textural information, but more on shape information, and further find that adversarial examples generated by the defective CNN appear to have semantic shapes. Experimental results demonstrate the defective CNN has higher defense ability than the standard CNN against various types of attack. In particular, it achieves state-of-the-art performance against transfer-based attacks without applying any adversarial training.",/pdf/696c255b36858e0908e0fdfed5e0ea303b62c35b.pdf,ICLR,2020,We propose a technique that modifies CNN structures to make predictions rely more on shape information and improve the defense ability against several types of attack. +Sye2doC9tX,B1gfHxP5Ym,1538090000000.0,1545360000000.0,388,Exploration by Uncertainty in Reward Space,"[""nju_qwy@163.com"", ""yuy@nju.edu.cn"", ""hzlvtangjie@corp.netease.com"", ""chenyingfeng1@corp.netease.com"", ""fanchangjie@corp.netease.com""]","[""Wei-Yang Qu"", ""Yang Yu"", ""Tang-Jie Lv"", ""Ying-Feng Chen"", ""Chang-Jie Fan""]","[""Policy Exploration"", ""Uncertainty in Reward Space""]","Efficient exploration plays a key role in reinforcement learning tasks. Commonly used dithering strategies, such as $\epsilon$-greedy, try to explore the action-state space randomly; this can lead to large demand for samples. In this paper, we propose an exploration method based on the uncertainty in reward space. There are two policies in this approach: the exploration policy is used for exploratory sampling in the environment, and the benchmark policy then tries to update using the data provided by the exploration policy. The benchmark policy is used to provide the uncertainty in reward space, e.g. 
td-error, which guides updates to the exploration policy. We apply our method on two grid-world environments and four Atari games. Experimental results show that our method improves learning speed and achieves better performance than baseline policies.",/pdf/5d0758db0c8a5ab2f365bfee891eba7010a67d2f.pdf,ICLR,2019,Exploration by Uncertainty in Reward Space +rkemqsC9Fm,B1lMqa0YK7,1538090000000.0,1550970000000.0,522,Information Theoretic lower bounds on negative log likelihood,"[""lastrasl@us.ibm.com""]","[""Luis A. Lastras-Monta\u00f1o""]","[""latent variable modeling"", ""rate-distortion theory"", ""log likelihood bounds""]","In this article we use rate-distortion theory, a branch of information theory devoted to the problem of lossy compression, to shed light on an important problem in latent variable modeling of data: is there room to improve the model? One way to address this question is to find an upper bound on the probability (equivalently a lower bound on the negative log likelihood) that the model can assign to some data as one varies the prior and/or the likelihood function in a latent variable model. The core of our contribution is to formally show that the problem of optimizing priors in latent variable models is exactly an instance of the variational optimization problem that information theorists solve when computing rate-distortion functions, and then to use this to derive a lower bound on negative log likelihood. Moreover, we will show that if changing the prior can improve the log likelihood, then there is a way to change the likelihood function instead and attain the same log likelihood, and thus rate-distortion theory is of relevance to both optimizing priors as well as optimizing likelihood functions. 
We will experimentally argue for the usefulness of quantities derived from rate-distortion theory in latent variable modeling by applying them to a problem in image modeling.",/pdf/b20609fdf716184e5afd19975ed667adada0bfdb.pdf,ICLR,2019,Use rate-distortion theory to bound how much a latent variable model can be improved +BkNUFjR5KQ,SJexq5b5Ym,1538090000000.0,1545360000000.0,450,Learning Internal Dense But External Sparse Structures of Deep Neural Network,"[""duanyiquncc@gmail.com""]","[""Yiqun Duan""]","[""Convolutional Neural Network"", ""Hierarchical Neural Architecture"", ""Structural Sparsity"", ""Evolving Algorithm""]","Recent years have witnessed two seemingly opposite developments of deep convolutional neural networks (CNNs). On one hand, increasing the density of CNNs by adding cross-layer connections achieve higher accuracy. On the other hand, creating sparsity structures through regularization and pruning methods enjoys lower computational costs. In this paper, we bridge these two by proposing a new network structure with locally dense yet externally sparse connections. This new structure uses dense modules, as basic building blocks and then sparsely connects these modules via a novel algorithm during the training process. Experimental results demonstrate that the locally dense yet externally sparse structure could acquire competitive performance on benchmark tasks (CIFAR10, CIFAR100, and ImageNet) while keeping the network structure slim.",/pdf/34d98aef59177a53161263186d14c02468f6dcc9.pdf,ICLR,2019,"In this paper, we explore an internal dense yet external sparse network structure of deep neural networks and analyze its key properties." 
+SJleNCNtDH,BJgzCb8uPS,1569440000000.0,1583910000000.0,1063,Intrinsic Motivation for Encouraging Synergistic Behavior,"[""ronuchit@mit.edu"", ""shubhtuls@fb.com"", ""saurabhg@illinois.edu"", ""abhinavg@cs.cmu.edu""]","[""Rohan Chitnis"", ""Shubham Tulsiani"", ""Saurabh Gupta"", ""Abhinav Gupta""]","[""reinforcement learning"", ""intrinsic motivation"", ""synergistic"", ""robot manipulation""]","We study the role of intrinsic motivation as an exploration bias for reinforcement learning in sparse-reward synergistic tasks, which are tasks where multiple agents must work together to achieve a goal they could not individually. Our key idea is that a good guiding principle for intrinsic motivation in synergistic tasks is to take actions which affect the world in ways that would not be achieved if the agents were acting on their own. Thus, we propose to incentivize agents to take (joint) actions whose effects cannot be predicted via a composition of the predicted effect for each individual agent. We study two instantiations of this idea, one based on the true states encountered, and another based on a dynamics model trained concurrently with the policy. While the former is simpler, the latter has the benefit of being analytically differentiable with respect to the action taken. We validate our approach in robotic bimanual manipulation and multi-agent locomotion tasks with sparse rewards; we find that our approach yields more efficient learning than both 1) training with only the sparse reward and 2) using the typical surprise-based formulation of intrinsic motivation, which does not bias toward synergistic behavior. 
Videos are available on the project webpage: https://sites.google.com/view/iclr2020-synergistic.",/pdf/1c76c8b597ee076527d77a8b9befd563bdeb5733.pdf,ICLR,2020,"We propose a formulation of intrinsic motivation that is suitable as an exploration bias in synergistic multi-agent tasks, by encouraging agents to affect the world in ways that would not be achieved if they were acting individually." +rJl-b3RcF7,rklNr_YqKX,1538090000000.0,1551570000000.0,1146,"The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks","[""jfrankle@mit.edu"", ""mcarbin@csail.mit.edu""]","[""Jonathan Frankle"", ""Michael Carbin""]","[""Neural networks"", ""sparsity"", ""pruning"", ""compression"", ""performance"", ""architecture search""]","Neural network pruning techniques can reduce the parameter counts of trained networks by over 90%, decreasing storage requirements and improving computational performance of inference without compromising accuracy. However, contemporary experience is that the sparse architectures produced by pruning are difficult to train from the start, which would similarly improve training performance. + +We find that a standard pruning technique naturally uncovers subnetworks whose initializations made them capable of training effectively. Based on these results, we articulate the ""lottery ticket hypothesis:"" dense, randomly-initialized, feed-forward networks contain subnetworks (""winning tickets"") that - when trained in isolation - reach test accuracy comparable to the original network in a similar number of iterations. The winning tickets we find have won the initialization lottery: their connections have initial weights that make training particularly effective. + +We present an algorithm to identify winning tickets and a series of experiments that support the lottery ticket hypothesis and the importance of these fortuitous initializations. 
We consistently find winning tickets that are less than 10-20% of the size of several fully-connected and convolutional feed-forward architectures for MNIST and CIFAR10. Above this size, the winning tickets that we find learn faster than the original network and reach higher test accuracy.",/pdf/2c35994ea2912e6517a87c50fc55faa58f0df150.pdf,ICLR,2019,Feedforward neural networks that can have weights pruned after training could have had the same weights pruned before training +BJxDNxSFDH,B1xopIeYDB,1569440000000.0,1577170000000.0,2251,Few-Shot Regression via Learning Sparsifying Basis Functions,"[""loo_yi@sutd.edu.sg"", ""guoyl1990@outlook.com"", ""ngaiman_cheung@sutd.edu.sg""]","[""Yi Loo"", ""Yiluan Guo"", ""Ngai-Man Cheung""]","[""meta-learning"", ""few-shot learning"", ""regression"", ""learning basis functions"", ""self-attention""]","Recent few-shot learning algorithms have enabled models to quickly adapt to new tasks based on only a few training samples. Previous few-shot learning works have mainly focused on classification and reinforcement learning. In this paper, we propose a few-shot meta-learning system that focuses exclusively on regression tasks. Our model is based on the idea that the degree of freedom of the unknown function can be significantly reduced if it is represented as a linear combination of a set of sparsifying basis functions. This enables a few labeled samples to approximate the function. We design a Basis Function Learner network to encode basis functions for a task distribution, and a Weights Generator network to generate the weight vector for a novel task. We show that our model outperforms the current state of the art meta-learning methods in various regression tasks.",/pdf/48b259e15f3c0663cbf33ab632b12f529dc4aa24.pdf,ICLR,2020,We propose a method of doing few-shot regression by learning a set of basis functions to represent the function distribution. 
+SJl98sR5tX,HygFldCtF7,1538090000000.0,1545360000000.0,197,Interactive Agent Modeling by Learning to Probe,"[""tianmin.shu@ucla.edu"", ""cxiong@salesforce.com"", ""ywu@stat.ucla.edu"", ""sczhu@stat.ucla.edu""]","[""Tianmin Shu"", ""Caiming Xiong"", ""Ying Nian Wu"", ""Song-Chun Zhu""]","[""Agent Modeling"", ""Theory of Mind"", ""Deep Reinforcement Learning"", ""Multi-agent Reinforcement Learning""]","The ability of modeling the other agents, such as understanding their intentions and skills, is essential to an agent's interactions with other agents. Conventional agent modeling relies on passive observation from demonstrations. In this work, we propose an interactive agent modeling scheme enabled by encouraging an agent to learn to probe. In particular, the probing agent (i.e. a learner) learns to interact with the environment and with a target agent (i.e., a demonstrator) to maximize the change in the observed behaviors of that agent. Through probing, rich behaviors can be observed and are used for enhancing the agent modeling to learn a more accurate mind model of the target agent. Our framework consists of two learning processes: i) imitation learning for an approximated agent model and ii) pure curiosity-driven reinforcement learning for an efficient probing policy to discover new behaviors that otherwise can not be observed. We have validated our approach in four different tasks. The experimental results suggest that the agent model learned by our approach i) generalizes better in novel scenarios than the ones learned by passive observation, random probing, and other curiosity-driven approaches do, and ii) can be used for enhancing performance in multiple applications including distilling optimal planning to a policy net, collaboration, and competition. 
A video demo is available at https://www.dropbox.com/s/8mz6rd3349tso67/Probing_Demo.mov?dl=0",/pdf/c3dc67da7dd5b736f7c2e65e7d2a865b85f4a9cf.pdf,ICLR,2019,We propose an interactive agent modeling framework by learning a probing policy to diversify task settings and to incite new behaviors of a target agent for a better modeling of the target agent. +B1Lc-Gb0Z,rJB9bz-0Z,1509130000000.0,1519410000000.0,779,Deep Learning as a Mixed Convex-Combinatorial Optimization Problem,"[""afriesen@cs.washington.edu"", ""pedrod@cs.washington.edu""]","[""Abram L. Friesen"", ""Pedro Domingos""]","[""hard-threshold units"", ""combinatorial optimization"", ""target propagation"", ""straight-through estimation"", ""quantization""]","As neural networks grow deeper and wider, learning networks with hard-threshold activations is becoming increasingly important, both for network quantization, which can drastically reduce time and energy requirements, and for creating large integrated systems of deep networks, which may have non-differentiable components and must avoid vanishing and exploding gradients for effective learning. However, since gradient descent is not applicable to hard-threshold functions, it is not clear how to learn them in a principled way. We address this problem by observing that setting targets for hard-threshold hidden units in order to minimize loss is a discrete optimization problem, and can be solved as such. The discrete optimization goal is to find a set of targets such that each unit, including the output, has a linearly separable problem to solve. Given these targets, the network decomposes into individual perceptrons, which can then be learned with standard convex approaches. Based on this, we develop a recursive mini-batch algorithm for learning deep hard-threshold networks that includes the popular but poorly justified straight-through estimator as a special case. 
Empirically, we show that our algorithm improves classification accuracy in a number of settings, including for AlexNet and ResNet-18 on ImageNet, when compared to the straight-through estimator.",/pdf/e6e5be79559e4013ea4fd62c1442fad576efa561.pdf,ICLR,2018,"We learn deep networks of hard-threshold units by setting hidden-unit targets using combinatorial optimization and weights by convex optimization, resulting in improved performance on ImageNet." +ry18Ww5ee,,1478290000000.0,1488440000000.0,359,Hyperband: Bandit-Based Configuration Evaluation for Hyperparameter Optimization,"[""lishal@cs.ucla.edu"", ""kjamieson@berkeley.edu"", ""desalvo@cims.nyu.edu"", ""rostami@google.com"", ""ameet@cs.ucla.edu""]","[""Lisha Li"", ""Kevin Jamieson"", ""Giulia DeSalvo"", ""Afshin Rostamizadeh"", ""Ameet Talwalkar""]",[],"Performance of machine learning algorithms depends critically on identifying a good set of hyperparameters. While recent approaches use Bayesian Optimization to adaptively select configurations, we focus on speeding up random search through adaptive resource allocation. We present Hyperband, a novel algorithm for hyperparameter optimization that is simple, flexible, and theoretically sound. Hyperband is a principled early-stopping method that adaptively allocates a predefined resource, e.g., iterations, data samples or number of features, to randomly sampled configurations. We compare Hyperband with state-of-the-art Bayesian Optimization methods on several hyperparameter optimization problems. We observe that Hyperband can provide speedups of over an order of magnitude over competitors on a variety of neural network and kernel-based learning problems. 
",/pdf/a9263ffc193997e5041e73ce2230de60001cd855.pdf,ICLR,2017, +SJk01vogl,,1478350000000.0,1484390000000.0,556,Adversarial examples for generative models,"[""jernej@kos.mx"", ""iansf@google.com"", ""dawnsong.travel@gmail.com""]","[""Jernej Kos"", ""Ian Fischer"", ""Dawn Song""]","[""Computer vision"", ""Unsupervised Learning""]","We explore methods of producing adversarial examples on deep generative models such as the variational autoencoder (VAE) and the VAE-GAN. Deep learning architectures are known to be vulnerable to adversarial examples, but previous work has focused on the application of adversarial examples to classification tasks. Deep generative models have recently become popular due to their ability to model input data distributions and generate realistic examples from those distributions. We present three classes of attacks on the VAE and VAE-GAN architectures and demonstrate them against networks trained on MNIST, SVHN and CelebA. Our first attack leverages classification-based adversaries by attaching a classifier to the trained encoder of the target generative model, which can then be used to indirectly manipulate the latent representation. Our second attack directly uses the VAE loss function to generate a target reconstruction image from the adversarial example. Our third attack moves beyond relying on classification or the standard loss for the gradient and directly optimizes against differences in source and target latent representations. We also motivate why an attacker might be interested in deploying such techniques against a target generative network.",/pdf/503e0fb6d4e69e70b71d90a2a252a3085e7e5eb4.pdf,ICLR,2017,Exploration of ways to attack generative models with adversarial examples and why someone might want to do that. 
+BklHpjCqKm,HkxQ3ua9Y7,1538090000000.0,1554390000000.0,798,Deep Lagrangian Networks: Using Physics as Model Prior for Deep Learning,"[""michael@robot-learning.de"", ""ritter@stud.tu-darmstadt.de"", ""peters@ias.tu-darmstadt.de""]","[""Michael Lutter"", ""Christian Ritter"", ""Jan Peters""]","[""Deep Model Learning"", ""Robot Control""]","Deep learning has achieved astonishing results on many tasks with large amounts of data and generalization within the proximity of training data. For many important real-world applications, these requirements are unfeasible and additional prior knowledge on the task domain is required to overcome the resulting problems. In particular, learning physics models for model-based control requires robust extrapolation from fewer samples – often collected online in real-time – and model errors may lead to drastic damages of the system. +Directly incorporating physical insight has enabled us to obtain a novel deep model learning approach that extrapolates well while requiring fewer samples. As a first example, we propose Deep Lagrangian Networks (DeLaN) as a deep network structure upon which Lagrangian Mechanics have been imposed. DeLaN can learn the equations of motion of a mechanical system (i.e., system dynamics) with a deep network efficiently while ensuring physical plausibility. +The resulting DeLaN network performs very well at robot tracking control. The proposed method did not only outperform previous model learning approaches at learning speed but exhibits substantially improved and more robust extrapolation to novel trajectories and learns online in real-time.",/pdf/6a4c8189f356a112f26ceb69d7c1966da761c73d.pdf,ICLR,2019,This paper introduces a physics prior for Deep Learning and applies the resulting network topology for model-based control. 
+#NAME?,KODhAATnYkK,1601310000000.0,1614990000000.0,785,Adaptive norms for deep learning with regularized Newton methods,"[""~Jonas_K_Kohler1"", ""~Leonard_Adolphs1"", ""~Aurelien_Lucchi1""]","[""Jonas K Kohler"", ""Leonard Adolphs"", ""Aurelien Lucchi""]","[""Stochastic Optimization"", ""Non-convex Optimization"", ""Deep Learning"", ""Adaptive methods"", ""Newton methods"", ""Second-order optimization""]","We investigate the use of regularized Newton methods with adaptive norms for optimizing neural networks. This approach can be seen as a second-order counterpart of adaptive gradient methods, which we here show to be interpretable as first-order trust region methods with ellipsoidal constraints. In particular, we prove that the preconditioning matrix used in RMSProp and Adam satisfies the necessary conditions for provable convergence of second-order trust region methods with standard worst-case complexities on general non-convex objectives. Furthermore, we run experiments across different neural architectures and datasets to find that the ellipsoidal constraints constantly outperform their spherical counterpart both in terms of number of backpropagations and asymptotic loss value. Finally, we find comparable performance to state-of-the-art first-order methods in terms of backpropagations, but further advances in hardware are needed to render Newton methods competitive in terms of computational time.",/pdf/ef68077229f18cbc8d7b8afee78e7fd5b5ed7e19.pdf,ICLR,2021,This paper proposes second-order variants of adaptive gradient methods such as RMSProp. 
+fgd7we_uZa6,cqAsL_Ed6A,1601310000000.0,1616010000000.0,2783,How Much Over-parameterization Is Sufficient to Learn Deep ReLU Networks?,"[""~Zixiang_Chen1"", ""~Yuan_Cao1"", ""~Difan_Zou1"", ""~Quanquan_Gu1""]","[""Zixiang Chen"", ""Yuan Cao"", ""Difan Zou"", ""Quanquan Gu""]","[""deep ReLU networks"", ""neural tangent kernel"", ""(stochastic) gradient descent"", ""generalization error"", ""classification""]","A recent line of research on deep learning focuses on the extremely over-parameterized setting, and shows that when the network width is larger than a high degree polynomial of the training sample size $n$ and the inverse of the target error $\epsilon^{-1}$, deep neural networks learned by (stochastic) gradient descent enjoy nice optimization and generalization guarantees. Very recently, it is shown that under certain margin assumptions on the training data, a polylogarithmic width condition suffices for two-layer ReLU networks to converge and generalize (Ji and Telgarsky, 2020). However, whether deep neural networks can be learned with such a mild over-parameterization is still an open question. In this work, we answer this question affirmatively and establish sharper learning guarantees for deep ReLU networks trained by (stochastic) gradient descent. In specific, under certain assumptions made in previous work, our optimization and generalization guarantees hold with network width polylogarithmic in $n$ and $\epsilon^{-1}$. Our results push the study of over-parameterized deep neural networks towards more practical settings.",/pdf/7d4b4fabf3654c85ec7bc9a41516a3fe17bbccd8.pdf,ICLR,2021,We establish learning guarantees for deep ReLU networks with width polylogarithmic in sample size and the inverse of the target error. 
+r1nmx5l0W,H1s7g9g0W,1509100000000.0,1518730000000.0,344,SIC-GAN: A Self-Improving Collaborative GAN for Decoding Sketch RNNs,"[""ccchuang@datalab.cs.nthu.edu.tw"", ""zxweng@datalab.cs.nthu.edu.tw"", ""shwu@cs.nthu.edu.tw""]","[""Chi-Chun Chuang"", ""Zheng-Xin Weng"", ""Shan-Hung Wu""]","[""RNNs"", ""GANs"", ""Variational RNNs"", ""Sketch RNNs""]","Variational RNNs are proposed to output “creative” sequences. Ideally, a collection of sequences produced by a variational RNN should be of both high quality and high variety. However, existing decoders for variational RNNs suffer from a trade-off between quality and variety. In this paper, we seek to learn a variational RNN that decodes high-quality and high-variety sequences. We propose the Self-Improving Collaborative GAN (SIC-GAN), where there are two generators (variational RNNs) collaborating with each other to output a sequence and aiming to trick the discriminator into believing the sequence is of good quality. By deliberately weakening one generator, we can make another stronger in balancing quality and variety. We conduct experiments using the QuickDraw dataset and the results demonstrate the effectiveness of SIC-GAN empirically. ",/pdf/77a5bad2a2db85895ff2d771baeb99b6a4f9709e.pdf,ICLR,2018, +LwEQnp6CYev,PBJf0Ls7CDG,1601310000000.0,1616020000000.0,423,Quantifying Differences in Reward Functions,"[""~Adam_Gleave1"", ""~Michael_D_Dennis1"", ""~Shane_Legg1"", ""~Stuart_Russell1"", ""~Jan_Leike1""]","[""Adam Gleave"", ""Michael D Dennis"", ""Shane Legg"", ""Stuart Russell"", ""Jan Leike""]","[""rl"", ""irl"", ""reward learning"", ""distance"", ""benchmarks""]","For many tasks, the reward function is inaccessible to introspection or too complex to be specified procedurally, and must instead be learned from user data. Prior work has evaluated learned reward functions by evaluating policies optimized for the learned reward. 
However, this method cannot distinguish between the learned reward function failing to reflect user preferences and the policy optimization process failing to optimize the learned reward. Moreover, this method can only tell us about behavior in the evaluation environment, but the reward may incentivize very different behavior in even a slightly different deployment environment. To address these problems, we introduce the Equivalent-Policy Invariant Comparison (EPIC) distance to quantify the difference between two reward functions directly, without a policy optimization step. We prove EPIC is invariant on an equivalence class of reward functions that always induce the same optimal policy. Furthermore, we find EPIC can be efficiently approximated and is more robust than baselines to the choice of coverage distribution. Finally, we show that EPIC distance bounds the regret of optimal policies even under different transition dynamics, and we confirm empirically that it predicts policy training success. Our source code is available at https://github.com/HumanCompatibleAI/evaluating-rewards.",/pdf/c9babbffccc1b8e389a2e8de1c7aac4cee00f966.pdf,ICLR,2021,A theoretically principled distance measure on reward functions that is quick to compute and predicts policy training performance. 
+H1gcw1HYPr,Hklrgp6_PH,1569440000000.0,1577170000000.0,1776,AlignNet: Self-supervised Alignment Module,"[""tonicreswell@google.com"", ""piloto@google.com"", ""peterbattaglia@google.com"", ""knikiforou@google.com"", ""barrettdavid@google.com"", ""garnelo@google.com"", ""mshanahan@google.com"", ""draposo@google.com""]","[""Antonia Creswell"", ""Luis Piloto"", ""David Barrett"", ""Kyriacos Nikiforou"", ""David Raposo"", ""Marta Garnelo"", ""Peter Battaglia"", ""Murray Shanahan""]","[""Graph networks"", ""alignment"", ""objects"", ""relation networks""]","The natural world consists of objects that we perceive as persistent in space and time, even though these objects appear, disappear and reappear in our field of view as we move. This can be attributed to our notion of object persistence -- our knowledge that objects typically continue to exist, even if we can no longer see them -- and our ability to track objects. Drawing inspiration from the psychology literature on `sticky indices', we propose the AlignNet, a model that learns to assign unique indices to new objects when they first appear and reassign the index to subsequent instances of that object. By introducing a persistent object-based memory, the AlignNet may be used to keep track of objects across time, even if they disappear and reappear later. We implement the AlignNet as a graph network applied to a bipartite graph, in which the input nodes are objects from two sets that we wish to align. The network is trained to predict the edges which connect two instances of the same object across sets. The model is also capable of identifying when there are no matches and dealing with these cases. We perform experiments to show the model's ability to deal with the appearance, disappearance and reappearance of objects. 
Additionally, we demonstrate how a persistent object-based memory can help solve question-answering problems in a partially observable environment.",/pdf/927e3d34f24b57da550a8c95feaa607290bdcfbd.pdf,ICLR,2020,"A differentiable model for aligning pre-extracted entity representations with a slot based memory, to which new objects can be added." +rkx0g3R5tX,Bklz782qK7,1538090000000.0,1545360000000.0,1127,Partially Mutual Exclusive Softmax for Positive and Unlabeled data,"[""u.tanielian@criteo.com"", ""f.vasile@criteo.com"", ""m.gartrell@criteo.com""]","[""Ugo Tanielian"", ""Flavian vasile"", ""Mike Gartrell""]","[""Negative Sampling"", ""Sampled Softmax"", ""Word embeddings"", ""Adversarial Networks""]","In recent years, softmax together with its fast approximations has become the de-facto loss function for deep neural networks with multiclass predictions. However, softmax is used in many problems that do not fully fit the multiclass framework and where the softmax assumption of mutually exclusive outcomes can lead to biased results. This is often the case for applications such as language modeling, next event prediction and matrix factorization, where many of the potential outcomes are not mutually exclusive, but are more likely to be independent conditionally on the state. To this end, for the set of problems with positive and unlabeled data, we propose a relaxation of the original softmax formulation, where, given the observed state, each of the outcomes are conditionally independent but share a common set of negatives. Since we operate in a regime where explicit negatives are missing, we create an adversarially-trained model of negatives and derive a new negative sampling and weighting scheme which we denote as Cooperative Importance Sampling (CIS). 
We show empirically the advantages of our newly introduced negative sampling scheme by plugging it in the Word2Vec algorithm and benching it extensively against other negative sampling schemes on both language modeling and matrix factorization tasks and show large lifts in performance.",/pdf/a1b1c2167d90d666935afdc865ea935ddecff3b2.pdf,ICLR,2019,Defining a partially mutual exclusive softmax loss for positive data and implementing a cooperative based sampling scheme +SJlOq34Kwr,BJl8f6_1vS,1569440000000.0,1577170000000.0,121,Unsupervised Intuitive Physics from Past Experiences,"[""hyenal@robots.ox.ac.uk"", ""aron@nianticlabs.com"", ""n.mitra@cs.ucl.ac.uk"", ""vedaldi@robots.ox.ac.uk""]","[""Sebastien Ehrhardt"", ""Aron Monszpart"", ""Niloy Mitra"", ""Andrea Vedaldi""]","[""Intuitive physics"", ""Deep learning""]","We consider the problem of learning models of intuitive physics from raw, unlabelled visual input. Differently from prior work, in addition to learning general physical principles, we are also interested in learning ``on the fly'' physical properties specific to new environments, based on a small number of environment-specific experiences. We do all this in an unsupervised manner, using a meta-learning formulation where the goal is to predict videos containing demonstrations of physical phenomena, such as objects moving and colliding with a complex background. We introduce the idea of summarizing past experiences in a very compact manner, in our case using dynamic images, and show that this can be used to solve the problem well and efficiently. 
Empirically, we show, via extensive experiments and ablation studies, that our model learns to perform physical predictions that generalize well in time and space, as well as to a variable number of interacting physical objects.",/pdf/7c5943525da27e715edee8ae0ea4f23f5a2969d1.pdf,ICLR,2020, +HkpLeH9el,,1478280000000.0,1478280000000.0,232,Neural Functional Programming,"[""feser@csail.mit.edu"", ""mabrocks@microsoft.com"", ""t-algaun@microsoft.com"", ""dtarlow@microsoft.com""]","[""John K. Feser"", ""Marc Brockschmidt"", ""Alexander L. Gaunt"", ""Daniel Tarlow""]","[""Supervised Learning""]","We discuss a range of modeling choices that arise when constructing an end-to-end differentiable programming language suitable for learning programs from input-output examples. Taking cues from programming languages research, we study the effect of memory allocation schemes, immutable data, type systems, and built-in control-flow structures on the success rate of learning algorithms. We build a range of models leading up to a simple differentiable functional programming language. Our empirical evaluation shows that this language allows to learn far more programs than existing baselines.",/pdf/f8f084e85876380943390d08b632c235ea739dd9.pdf,ICLR,2017,A differentiable functional programming language for learning programs from input-output examples. +r1RF3ExCb,HJCt2ExC-,1509080000000.0,1518730000000.0,244,Transformation Autoregressive Networks,"[""joliva@cs.cmu.edu"", ""akdubey@cs.cmu.edu"", ""bapoczos@cs.cmu.edu"", ""epxing@cs.cmu.edu"", ""schneide@cs.cmu.edu""]","[""Junier Oliva"", ""Avinava Dubey"", ""Barnab\u00e1s P\u00f3czos"", ""Eric P. Xing"", ""Jeff Schneider""]","[""density estimation"", ""autoregressive models"", ""RNNs""]","The fundamental task of general density estimation has been of keen interest to machine learning. 
Recent advances in density estimation have either: a) proposed using a flexible model to estimate the conditional factors of the chain rule; or b) used flexible, non-linear transformations of variables of a simple base distribution. Instead, this work jointly leverages transformations of variables and autoregressive conditional models, and proposes novel methods for both. We provide a deeper understanding of our models, showing a considerable improvement with our methods through a comprehensive study over both real world and synthetic data. Moreover, we illustrate the use of our models in outlier detection and image modeling task.",/pdf/88fad0ffef7f88f3189772ef23270833358795ea.pdf,ICLR,2018, +HyxhusA9Fm,rklMN4O9FX,1538090000000.0,1545360000000.0,389,Talk The Walk: Navigating Grids in New York City through Grounded Dialogue,"[""mail@harmdevries.com"", ""kshuster@fb.com"", ""dbatra@gatech.edu"", ""parikh@gatech.edu"", ""jase@fb.com"", ""dkiela@fb.com""]","[""Harm de Vries"", ""Kurt Shuster"", ""Dhruv Batra"", ""Devi Parikh"", ""Jason Weston"", ""Douwe Kiela""]","[""Dialogue"", ""Navigation"", ""Grounded Language Learning""]","We introduce `""Talk The Walk"", the first large-scale dialogue dataset grounded in action and perception. The task involves two agents (a 'guide' and a 'tourist') that communicate via natural language in order to achieve a common goal: having the tourist navigate to a given target location. The task and dataset, which are described in detail, are challenging and their full solution is an open problem that we pose to the community. We (i) focus on the task of tourist localization and develop the novel Masked Attention for Spatial Convolutions (MASC) mechanism that allows for grounding tourist utterances into the guide's map, (ii) show it yields significant improvements for both emergent and natural language communication, and (iii) using this method, we establish non-trivial baselines on the full task. 
",/pdf/b9552e22acd86a0472886837eb0c092345e93f6d.pdf,ICLR,2019,First large-scale dialogue dataset grounded in action and perception +B1Igu2ogg,,1478380000000.0,1495150000000.0,578,Efficient Vector Representation for Documents through Corruption,"[""m.chen@criteo.com""]","[""Minmin Chen""]","[""Natural language processing"", ""Deep learning"", ""Semi-Supervised Learning""]","We present an efficient document representation learning framework, Document Vector through Corruption (Doc2VecC). Doc2VecC represents each document as a simple average of word embeddings. It ensures a representation generated as such captures the semantic meanings of the document during learning. A corruption model is included, which introduces a data-dependent regularization that favors informative or rare words while forcing the embeddings of common and non-discriminative ones to be close to zero. Doc2VecC produces significantly better word embeddings than Word2Vec. We compare Doc2VecC with several state-of-the-art document representation learning algorithms. The simple model architecture introduced by Doc2VecC matches or out-performs the state-of-the-art in generating high-quality document representations for sentiment analysis, document classification as well as semantic relatedness tasks. The simplicity of the model enables training on billions of words per hour on a single machine. At the same time, the model is very efficient in generating representations of unseen documents at test time. 
+",/pdf/5e58d0d43f34525f995cc00fe57341537cb1b88e.pdf,ICLR,2017,a simple document representation learning framework that is very efficient to train and test +SkiCjzNTZ,H19CjGEaZ,1508290000000.0,1518730000000.0,11,Spontaneous Symmetry Breaking in Deep Neural Networks,"[""ricky.fok3@gmail.com""]","[""Ricky Fok"", ""Aijun An"", ""Xiaogang Wang""]","[""deep learning"", ""physics"", ""field theory""]","We propose a framework to understand the unprecedented performance and robustness of deep neural networks using field theory. Correlations between the weights within the same layer can be described by symmetries in that layer, and networks generalize better if such symmetries are broken to reduce the redundancies of the weights. Using a two parameter field theory, we find that the network can break such symmetries itself towards the end of training in a process commonly known in physics as spontaneous symmetry breaking. This corresponds to a network generalizing itself without any user input layers to break the symmetry, but by communication with adjacent layers. In the layer decoupling limit applicable to residual networks (He et al., 2015), we show that the remnant symmetries that survive the non-linear layers are spontaneously broken based on empirical results. The Lagrangian for the non-linear and weight layers together has striking similarities with the one in quantum field theory of a scalar. 
Using results from quantum field theory we show that our framework is able to explain many experimentally observed phenomena, such as training on random labels with zero error (Zhang et al., 2017), the information bottleneck and the phase transition out of it (Shwartz-Ziv & Tishby, 2017), shattered gradients (Balduzzi et al., 2017), and many more.",/pdf/c874a0ee1778a9363e5e91483fdbcb1b5a22d672.pdf,ICLR,2018,Closed form results for deep learning in the layer decoupling limit applicable to Residual Networks +SylVNerFvr,Hy4vUIgFvS,1569440000000.0,1583910000000.0,2245,Permutation Equivariant Models for Compositional Generalization in Language,"[""jg801@cam.ac.uk"", ""dlp@fb.com"", ""mbaroni@fb.com"", ""dianeb@fb.com""]","[""Jonathan Gordon"", ""David Lopez-Paz"", ""Marco Baroni"", ""Diane Bouchacourt""]","[""Compositionality"", ""Permutation Equivariance"", ""Language Processing""]","Humans understand novel sentences by composing meanings and roles of core language components. In contrast, neural network models for natural language modeling fail when such compositional generalization is required. The main contribution of this paper is to hypothesize that language compositionality is a form of group-equivariance. Based on this hypothesis, we propose a set of tools for constructing equivariant sequence-to-sequence models. 
Throughout a variety of experiments on the SCAN tasks, we analyze the behavior of existing models under the lens of equivariance, and demonstrate that our equivariant architecture is able to achieve the type compositional generalization required in human language understanding.",/pdf/17645aa763238db6faff1b591060681d3d3f0e6d.pdf,ICLR,2020,"We propose a link between permutation equivariance and compositional generalization, and provide equivariant language models" +0gfSzsRDZFw,0VFeixTE4Jo,1601310000000.0,1614990000000.0,3375,Ablation Path Saliency,"[""~Olivier_Verdier1"", ""~Justus_Sagem\u00fcller1""]","[""Olivier Verdier"", ""Justus Sagem\u00fcller""]","[""image classification"", ""interpretability"", ""feature attribution"", ""saliency"", ""ablation""]","We consider the saliency problem for black-box classification. In image classification, this means highlighting the part of the image that is most relevant for the current decision. +We cast the saliency problem as finding an optimal ablation path between two images. An ablation path consists of a sequence of ever smaller masks, joining the current image to a reference image in another decision region. The optimal path will stay as long as possible in the current decision region. This approach extends the ablation tests in [Sturmfels et al. (2020)]. The gradient of the corresponding objective function is closely related to the integrated gradient method [Sundararajan et al. (2017)]. In the saturated case (when the classifier outputs a binary value) our method would reduce to the meaningful perturbation approach [Fong & Vedaldi (2017)], since crossing the decision boundary as late as +possible would then be equivalent to finding the smallest possible mask lying on +the decision boundary. 
+Our interpretation provides geometric understanding of existing saliency methods, and suggests a novel approach based on ablation path optimisation.",/pdf/48eff1051ca70033e76a23076ee3efe7744c8931.pdf,ICLR,2021,Understanding decisions by ablating as much of the input as possible without changing classification +BJeWOi09FQ,B1ezOXF9KX,1538090000000.0,1545360000000.0,328,SHAMANN: Shared Memory Augmented Neural Networks,"[""cosmin.bercea@fau.de"", ""olivier.pauly@gmail.com"", ""andreas.maier@fau.de"", ""florin.ghesu@siemens-healthineers.com""]","[""Cosmin I. Bercea"", ""Olivier Pauly"", ""Andreas K. Maier"", ""Florin C. Ghesu""]","[""memory networks"", ""deep learning"", ""medical image segmentation""]","Current state-of-the-art methods for semantic segmentation use deep neural networks to learn the segmentation mask from the input image signal as an image-to-image mapping. While these methods effectively exploit global image context, the learning and computational complexities are high. We propose shared memory augmented neural network actors as a dynamically scalable alternative. Based on a decomposition of the image into a sequence of local patches, we train such actors to sequentially segment each patch. To further increase the robustness and better capture shape priors, an external memory module is shared between different actors, providing an implicit mechanism for image information exchange. Finally, the patch-wise predictions are aggregated to a complete segmentation mask. We demonstrate the benefits of the new paradigm on a challenging lung segmentation problem based on chest X-Ray images, as well as on two synthetic tasks based on the MNIST dataset. On the X-Ray data, our method achieves state-of-the-art accuracy with a significantly reduced model size compared to reference methods. 
In addition, we reduce the number of failure cases by at least half.",/pdf/45eafa16ba64ffffd3a5f0aa884a777fb69fcde6.pdf,ICLR,2019,Multiple virtual actors cooperating through shared memory solve medical image segmentation. +HygQ7TNtPr,ryxXqpAIvB,1569440000000.0,1577170000000.0,444,Rethinking Neural Network Quantization,"[""jinqingking@gmail.com"", ""yljatthu@gmail.com"", ""liaozhenyu2004@gmail.com""]","[""Qing Jin"", ""Linjie Yang"", ""Zhenyu Liao""]","[""Deep Learning"", ""Convolutional Network"", ""Network Quantization"", ""Efficient Learning""]","Quantization reduces computation costs of neural networks but suffers from performance degeneration. Is this accuracy drop due to the reduced capacity, or inefficient training during the quantization procedure? After looking into the gradient propagation process of neural networks by viewing the weights and intermediate activations as random variables, we discover two critical rules for efficient training. Recent quantization approaches violates the two rules and results in degenerated convergence. To deal with this problem, we propose a simple yet effective technique, named scale-adjusted training (SAT), to comply with the discovered rules and facilitates efficient training. We also analyze the quantization error introduced in calculating the gradient in the popular parameterized clipping activation (PACT) technique. 
Through SAT together with gradient-calibrated PACT, quantized models obtain comparable or even better performance than their full-precision counterparts, achieving state-of-the-art accuracy with consistent improvement over previous quantization methods on a wide spectrum of models including MobileNet-V1/V2 and PreResNet-50.",/pdf/23208e1807939f9d05c373591c9328d027ca72ba.pdf,ICLR,2020, +fdZvTFn8Yq,pMBFTCL81oe,1601310000000.0,1614990000000.0,3163,Probabilistic Meta-Learning for Bayesian Optimization,"[""~Felix_Berkenkamp1"", ""anna.eivazi@de.bosch.com"", ""~Lukas_Grossberger1"", ""kathrin.skubch@de.bosch.com"", ""spitz.jonathan@gmail.com"", ""~Christian_Daniel1"", ""~Stefan_Falkner1""]","[""Felix Berkenkamp"", ""Anna Eivazi"", ""Lukas Grossberger"", ""Kathrin Skubch"", ""Jonathan Spitz"", ""Christian Daniel"", ""Stefan Falkner""]","[""meta-learning"", ""bayesian optimization"", ""probabilistic modelling""]","Transfer and meta-learning algorithms leverage evaluations on related tasks in order to significantly speed up learning or optimization on a new problem. For applications that depend on uncertainty estimates, e.g., in Bayesian optimization, recent probabilistic approaches have shown good performance at test time, but either scale poorly with the number of data points or under-perform with little data on the test task. In this paper, we propose a novel approach to probabilistic transfer learning that uses a generative model for the underlying data distribution and simultaneously learns a latent feature distribution to represent unknown task properties. To enable fast and accurate inference at test-time, we introduce a novel meta-loss that structures the latent space to match the prior used for inference. Together, these contributions ensure that our probabilistic model exhibits high sample-efficiency and provides well-calibrated uncertainty estimates. 
We evaluate the proposed approach and compare its performance to probabilistic models from the literature on a set of Bayesian optimization transfer-learning tasks. ",/pdf/1cdae65a486dff2ed932ea70b186812cc52f4f71.pdf,ICLR,2021,We develop a probabilistic meta-learning model to speed up Bayesian optimization +By5ugjyCb,SJK_gjyCb,1509040000000.0,1518730000000.0,154,PACT: Parameterized Clipping Activation for Quantized Neural Networks,"[""choij@us.ibm.com""]","[""Jungwook Choi"", ""Zhuo Wang"", ""Swagath Venkataramani"", ""Pierce I-Jen Chuang"", ""Vijayalakshmi Srinivasan"", ""Kailash Gopalakrishnan""]","[""deep learning"", ""quantized deep neural network"", ""activation quantization""]","Deep learning algorithms achieve high classification accuracy at the expense of significant computation cost. To address this cost, a number of quantization schemes have been proposed - but most of these techniques focused on quantizing weights, which are relatively smaller in size compared to activations. This paper proposes a novel quantization scheme for activations during training - that enables neural networks to work well with ultra low precision weights and activations without any significant accuracy degradation. This technique, PArameterized Clipping acTivation (PACT), uses an activation clipping parameter α that is optimized during training to find the right quantization scale. PACT allows quantizing activations to arbitrary bit precisions, while achieving much better accuracy relative to published state-of-the-art quantization schemes. We show, for the first time, that both weights and activations can be quantized to 4-bits of precision while still achieving accuracy comparable to full precision networks across a range of popular models and datasets. 
We also show that exploiting these reduced-precision computational units in hardware can enable a super-linear improvement in inferencing performance due to a significant reduction in the area of accelerator compute engines coupled with the ability to retain the quantized model and activation data in on-chip memories.",/pdf/72b81ce72cb94ac746c4d2fc028a8176c37f0b10.pdf,ICLR,2018,A new way of quantizing activation of Deep Neural Network via parameterized clipping which optimizes the quantization scale via stochastic gradient descent. +ep81NLpHeos,YmuKSoEcYwZ,1601310000000.0,1614990000000.0,2053,Momentum Contrastive Autoencoder,"[""~Devansh_Arpit2"", ""abhatnagar@salesforce.com"", ""~Huan_Wang1"", ""~Caiming_Xiong1""]","[""Devansh Arpit"", ""Aadyot Bhatnagar"", ""Huan Wang"", ""Caiming Xiong""]","[""generative model"", ""contrastive learning"", ""autoencoder"", ""Wasserstein autoencoder""]","Wasserstein autoencoder (WAE) shows that matching two distributions is equivalent to minimizing a simple autoencoder (AE) loss under the constraint that the latent space of this AE matches a pre-specified prior distribution. This latent space distribution matching is a core component in WAE, and is in itself a challenging task. In this paper, we propose to use the contrastive learning framework that has been shown to be effective for self-supervised representation learning, as a means to resolve this problem. We do so by exploiting the fact that contrastive learning objectives optimize the latent space distribution to be uniform over the unit hyper-sphere, which can be easily sampled from. This results in a simple and scalable algorithm that avoids many of the optimization challenges of existing generative models, while retaining the advantage of efficient sampling. Quantitatively, we show that our algorithm achieves a new state-of-the-art FID of 54.36 on CIFAR-10, and performs competitively with existing models on CelebA in terms of FID score. 
We also show qualitative results on CelebA-HQ in addition to these datasets, confirming that our algorithm can generate realistic images at multiple resolutions.",/pdf/a6a36c9ed6c222128eb4d443c489c7a69853369f.pdf,ICLR,2021,We propose a simple autoencoder based generative model that overcomes many of the optimization challenges of existing generative models by combining contrastive learning with Wasserstein autoencoder. +rkRwGg-0Z,rk2Dzg-A-,1509130000000.0,1519370000000.0,562,Beyond Word Importance: Contextual Decomposition to Extract Interactions from LSTMs,"[""jmurdoch@berkeley.edu"", ""peterjliu@google.com"", ""binyu@berkeley.edu""]","[""W. James Murdoch"", ""Peter J. Liu"", ""Bin Yu""]","[""interpretability"", ""LSTM"", ""natural language processing"", ""sentiment analysis"", ""interactions""]","The driving force behind the recent success of LSTMs has been their ability to learn complex and non-linear relationships. Consequently, our inability to describe these relationships has led to LSTMs being characterized as black boxes. To this end, we introduce contextual decomposition (CD), an interpretation algorithm for analysing individual predictions made by standard LSTMs, without any changes to the underlying model. By decomposing the output of a LSTM, CD captures the contributions of combinations of words or variables to the final prediction of an LSTM. On the task of sentiment analysis with the Yelp and SST data sets, we show that CD is able to reliably identify words and phrases of contrasting sentiment, and how they are combined to yield the LSTM's final prediction. 
Using the phrase-level labels in SST, we also demonstrate that CD is able to successfully extract positive and negative negations from an LSTM, something which has not previously been done.",/pdf/e199c8058920f0a86c631b4e6f4602a3427f758f.pdf,ICLR,2018,"We introduce contextual decompositions, an interpretation algorithm for LSTMs capable of extracting word, phrase and interaction-level importance score" +SJeHwJSYvH,r1lkqjauPB,1569440000000.0,1577170000000.0,1765,Learning De-biased Representations with Biased Representations,"[""hjj552@korea.ac.kr"", ""sanghyuk.c@navercorp.com"", ""sangdoo.yun@navercorp.com"", ""jchoo@korea.ac.kr"", ""coallaoh@linecorp.com""]","[""Hyojin Bahng"", ""Sanghyuk Chun"", ""Sangdoo Yun"", ""Jaegul Choo"", ""Seong Joon Oh""]","[""Generalization"", ""Bias"", ""Dataset bias""]","Many machine learning algorithms are trained and evaluated by splitting data from a single source into training and test sets. While such focus on in-distribution learning scenarios has led interesting advances, it has not been able to tell if models are relying on dataset biases as shortcuts for successful prediction (e.g., using snow cues for recognising snowmobiles). Such biased models fail to generalise when the bias shifts to a different class. The cross-bias generalisation problem has been addressed by de-biasing training data through augmentation or re-sampling, which are often prohibitive due to the data collection cost (e.g., collecting images of snowmobile on a desert) and the difficulty of quantifying or expressing biases in the first place. In this work, we propose a novel framework to train a de-biased representation by encouraging it to be different from a set of representations that are biased by design. This tactic is feasible in many scenarios where it is much easier to define a set of biased representations than to define and quantify bias. 
Our experiments and analyses show that our method discourages models from taking bias shortcuts, resulting in improved performances on de-biased test data.",/pdf/90255d574e940487cdf4f31fca8b09177d6f863d.pdf,ICLR,2020, +HJx81ySKwr,Sye9gT5dwH,1569440000000.0,1583910000000.0,1469,Iterative energy-based projection on a normal data manifold for anomaly localization,"[""david@anotherbrain.ai"", ""oriel@anotherbrain.ai"", ""sebastien@anotherbrain.ai"", ""pierre@anotherbrain.ai""]","[""David Dehaene"", ""Oriel Frigo"", ""S\u00e9bastien Combrexelle"", ""Pierre Eline""]","[""deep learning"", ""visual inspection"", ""unsupervised anomaly detection"", ""anomaly localization"", ""autoencoder"", ""variational autoencoder"", ""gradient descent"", ""inpainting""]","Autoencoder reconstructions are widely used for the task of unsupervised anomaly localization. Indeed, an autoencoder trained on normal data is expected to only be able to reconstruct normal features of the data, allowing the segmentation of anomalous pixels in an image via a simple comparison between the image and its autoencoder reconstruction. In practice however, local defects added to a normal image can deteriorate the whole reconstruction, making this segmentation challenging. To tackle the issue, we propose in this paper a new approach for projecting anomalous data on a autoencoder-learned normal data manifold, by using gradient descent on an energy derived from the autoencoder's loss function. This energy can be augmented with regularization terms that model priors on what constitutes the user-defined optimal projection. By iteratively updating the input of the autoencoder, we bypass the loss of high-frequency information caused by the autoencoder bottleneck. This allows to produce images of higher quality than classic reconstructions. Our method achieves state-of-the-art results on various anomaly localization datasets. 
It also shows promising results at an inpainting task on the CelebA dataset.",/pdf/53c8526763e4dc77e98e527c4d06bfbbf22b1768.pdf,ICLR,2020,We use gradient descent on a regularized autoencoder loss to correct anomalous images. +HklliySFDS,BkgmwnC_PB,1569440000000.0,1577170000000.0,1902,Continual Learning with Gated Incremental Memories for Sequential Data Processing,"[""cossu48@gmail.com"", ""antonio.carta@di.unipi.it"", ""bacciu@di.unipi.it""]","[""Andrea Cossu"", ""Antonio Carta"", ""Davide Bacciu""]","[""continual learning"", ""recurrent neural networks"", ""progressive networks"", ""gating autoencoders"", ""sequential data processing""]","The ability to learn over changing task distributions without forgetting previous knowledge, also known as continual learning, is a key enabler for scalable and trustworthy deployments of adaptive solutions. While the importance of continual learning is largely acknowledged in machine vision and reinforcement learning problems, this is mostly under-documented for sequence processing tasks. This work focuses on characterizing and quantitatively assessing the impact of catastrophic forgetting and task interference when dealing with sequential data in recurrent neural networks. We also introduce a general architecture, named Gated Incremental Memory, for augmenting recurrent models with continual learning skills, whose effectiveness is demonstrated through the benchmarks introduced in this paper.",/pdf/bb45990a672ffe2ae516eaead0db891a18b29467.pdf,ICLR,2020,"We tackled the problem of CL in sequential data processing scenarios, providing a set of domain-agnostic benchmarks against which we compared performances of a novel RNN for CL and other standard RNNs." 
+ByglLlHFDS,S1lEdFgKPB,1569440000000.0,1583910000000.0,2310,Expected Information Maximization: Using the I-Projection for Mixture Density Estimation,"[""philippbecker93@googlemail.com"", ""oleg@robot-learning.de"", ""geri@robot-learning.de""]","[""Philipp Becker"", ""Oleg Arenz"", ""Gerhard Neumann""]","[""density estimation"", ""information projection"", ""mixture models"", ""generative learning"", ""multimodal modeling""]","Modelling highly multi-modal data is a challenging problem in machine learning. Most algorithms are based on maximizing the likelihood, which corresponds to the M(oment)-projection of the data distribution to the model distribution. +The M-projection forces the model to average over modes it cannot represent. In contrast, the I(nformation)-projection ignores such modes in the data and concentrates on the modes the model can represent. Such behavior is appealing whenever we deal with highly multi-modal data where modelling single modes correctly is more important than covering all the modes. Despite this advantage, the I-projection is rarely used in practice due to the lack of algorithms that can efficiently optimize it based on data. In this work, we present a new algorithm called Expected Information Maximization (EIM) for computing the I-projection solely based on samples for general latent variable models, where we focus on Gaussian mixtures models and Gaussian mixtures of experts. Our approach applies a variational upper bound to the I-projection objective which decomposes the original objective into single objectives for each mixture component as well as for the coefficients, allowing an efficient optimization. Similar to GANs, our approach employs discriminators but uses a more stable optimization procedure, using a tight upper bound. 
We show that our algorithm is much more effective in computing the I-projection than recent GAN approaches and we illustrate the effectiveness of our approach for modelling multi-modal behavior on two pedestrian and traffic prediction datasets. ",/pdf/11a3505e1ab90cc3bf2a00810d8210a8eebc12c9.pdf,ICLR,2020,"A novel, non-adversarial, approach to learn latent variable models in general and mixture models in particular by computing the I-Projection solely based on samples." +rygjmpVFvB,BygvKdevwH,1569440000000.0,1583910000000.0,461,Difference-Seeking Generative Adversarial Network--Unseen Sample Generation,"[""r06942076@ntu.edu.tw"", ""parvaty316@hotmail.com"", ""peisc@ntu.edu.tw"", ""lcs@iis.sinica.edu.tw""]","[""Yi Lin Sung"", ""Sung-Hsien Hsieh"", ""Soo-Chang Pei"", ""Chun-Shien Lu""]","[""generative adversarial network"", ""semi-supervised learning"", ""novelty detection""]"," +Unseen data, which are not samples from the distribution of training data and are difficult to collect, have exhibited importance in numerous applications, ({\em e.g.,} novelty detection, semi-supervised learning, and adversarial training). In this paper, we introduce a general framework called \textbf{d}ifference-\textbf{s}eeking \textbf{g}enerative \textbf{a}dversarial \textbf{n}etwork (DSGAN), to generate various types of unseen data. Its novelty is the consideration of the probability density of the unseen data distribution as the difference between two distributions $p_{\bar{d}}$ and $p_{d}$ whose samples are relatively easy to collect. + +The DSGAN can learn the target distribution, $p_{t}$, (or the unseen data distribution) from only the samples from the two distributions, $p_{d}$ and $p_{\bar{d}}$. In our scenario, $p_d$ is the distribution of the seen data, and $p_{\bar{d}}$ can be obtained from $p_{d}$ via simple operations, so that we only need the samples of $p_{d}$ during the training. 
+Two key applications, semi-supervised learning and novelty detection, are taken as case studies to illustrate that the DSGAN enables the production of various unseen data. We also provide theoretical analyses about the convergence of the DSGAN. + +",/pdf/61b8949ecc88a2f28518687319b880b264bbc8ec.pdf,ICLR,2020,We proposed a novel GAN framework to generate unseen data. +B1xFhiC9Y7,rkeHxA5aOm,1538090000000.0,1545360000000.0,728,Domain Adaptation for Structured Output via Disentangled Patch Representations,"[""wasidennis@gmail.com"", ""kihyuk.sohn@gmail.com"", ""samuel@nec-labs.com"", ""manu@nec-labs.com""]","[""Yi-Hsuan Tsai"", ""Kihyuk Sohn"", ""Samuel Schulter"", ""Manmohan Chandraker""]","[""Domain Adaptation"", ""Feature Representation Learning"", ""Semantic Segmentation""]","Predicting structured outputs such as semantic segmentation relies on expensive per-pixel annotations to learn strong supervised models like convolutional neural networks. However, these models trained on one data domain may not generalize well to other domains unequipped with annotations for model finetuning. To avoid the labor-intensive process of annotation, we develop a domain adaptation method to adapt the source data to the unlabeled target domain. To this end, we propose to learn discriminative feature representations of patches based on label histograms in the source domain, through the construction of a disentangled space. With such representations as guidance, we then use an adversarial learning scheme to push the feature representations in target patches to the closer distributions in source ones. In addition, we show that our framework can integrate a global alignment process with the proposed patch-level alignment and achieve state-of-the-art performance on semantic segmentation. 
Extensive ablation studies and experiments are conducted on numerous benchmark datasets with various settings, such as synthetic-to-real and cross-city scenarios.",/pdf/36a181969a1f4cc1dc50fbfcf4fe69011d9eda6e.pdf,ICLR,2019,A domain adaptation method for structured output via learning patch-level discriminative feature representations +JBAa9we1AL,CQ4iPDvKrs4,1601310000000.0,1616050000000.0,2257,Individually Fair Gradient Boosting,"[""ahsvargo@umich.edu"", ""zhangfan4@shanghaitech.edu.cn"", ""~Mikhail_Yurochkin1"", ""~Yuekai_Sun1""]","[""Alexander Vargo"", ""Fan Zhang"", ""Mikhail Yurochkin"", ""Yuekai Sun""]","[""Algorithmic fairness"", ""boosting"", ""non-smooth models""]","We consider the task of enforcing individual fairness in gradient boosting. Gradient boosting is a popular method for machine learning from tabular data, which arise often in applications where algorithmic fairness is a concern. At a high level, our approach is a functional gradient descent on a (distributionally) robust loss function that encodes our intuition of algorithmic fairness for the ML task at hand. Unlike prior approaches to individual fairness that only work with smooth ML models, our approach also works with non-smooth models such as decision trees. We show that our algorithm converges globally and generalizes. We also demonstrate the efficacy of our algorithm on three ML problems susceptible to algorithmic bias.",/pdf/edda3e5c0ee042a5a03ebfdd14665355f34cd023.pdf,ICLR,2021,We propose an algorithm for training individually fair gradient boosted decision trees classifiers. 
+ryb-q1Olg,,1478130000000.0,1478130000000.0,53,Rectified Factor Networks for Biclustering,"[""okko@bioinf.jku.at"", ""unterthiner@bioinf.jku.at"", ""hochreit@bioinf.jku.at""]","[""Djork-Arn\u00e9 Clevert"", ""Thomas Unterthiner"", ""Sepp Hochreiter""]","[""Deep learning"", ""Unsupervised Learning"", ""Applications""]","Biclustering is evolving into one of the major tools for analyzing large datasets given as matrix of samples times features. Biclustering has several noteworthy applications and has been successfully applied in life sciences and e-commerce for drug design and recommender systems, respectively. + +FABIA is one of the most successful biclustering methods and is used by companies like Bayer, Janssen, or Zalando. FABIA is a generative model that represents each bicluster by two sparse membership vectors: one for the samples and one for the features. However, FABIA is restricted to about 20 code units because of the high computational complexity of computing the posterior. Furthermore, code units are sometimes insufficiently decorrelated. Sample membership is difficult to determine because vectors do not have exact zero entries and can have both large positive and large negative values. + +We propose to use the recently introduced unsupervised Deep Learning approach Rectified Factor Networks (RFNs) to overcome the drawbacks of existing biclustering methods. RFNs efficiently construct very sparse, non-linear, high-dimensional representations of the input via their posterior means. RFN learning is a generalized alternating minimization algorithm based on the posterior regularization method which enforces non-negative and normalized posterior means. Each code unit represents a bicluster, where samples for which the code unit is active belong to the bicluster and features that have activating weights to the code unit belong to the bicluster. 
+ +On 400 benchmark datasets with artificially implanted biclusters, RFN significantly outperformed 13 other biclustering competitors including FABIA. In biclustering experiments on three gene expression datasets with known clusters that were determined by separate measurements, RFN biclustering was two times significantly better than the other 13 methods and once on second place. On data of the 1000 Genomes Project, RFN could identify DNA segments which indicate, that interbreeding with other hominins starting already before ancestors of modern humans left Africa.",/pdf/dde80649dd3335264f25e9dfbb0f5cfa392cea7b.pdf,ICLR,2017, +HJOQ7MgAW,HydXXMe0W,1509070000000.0,1518730000000.0,222,Long Short-Term Memory as a Dynamically Computed Element-wise Weighted Sum,"[""omerlevy@gmail.com"", ""kentonl@cs.washington.edu"", ""nfitz@cs.washington.edu"", ""lsz@cs.washington.edu""]","[""Omer Levy"", ""Kenton Lee"", ""Nicholas FitzGerald"", ""Luke Zettlemoyer""]",[],"Long short-term memory networks (LSTMs) were introduced to combat vanishing gradients in simple recurrent neural networks (S-RNNs) by augmenting them with additive recurrent connections controlled by gates. We present an alternate view to explain the success of LSTMs: the gates themselves are powerful recurrent models that provide more representational power than previously appreciated. We do this by showing that the LSTM's gates can be decoupled from the embedded S-RNN, producing a restricted class of RNNs where the main recurrence computes an element-wise weighted sum of context-independent functions of the inputs. 
Experiments on a range of challenging NLP problems demonstrate that the simplified gate-based models work substantially better than S-RNNs, and often just as well as the original LSTMs, strongly suggesting that the gates are doing much more in practice than just alleviating vanishing gradients.",/pdf/e75a7d96e70725601efc593b9a80da4a6a412cf3.pdf,ICLR,2018,"Gates do all the heavy lifting in LSTMs by computing element-wise weighted sums, and removing the internal simple RNN does not degrade model performance." +zx_uX-BO7CH,IQ6r4sTolrb,1601310000000.0,1616000000000.0,2729,Contextual Transformation Networks for Online Continual Learning,"[""~Quang_Pham1"", ""~Chenghao_Liu1"", ""~Doyen_Sahoo1"", ""~Steven_HOI1""]","[""Quang Pham"", ""Chenghao Liu"", ""Doyen Sahoo"", ""Steven HOI""]","[""Continual Learning""]","Continual learning methods with fixed architectures rely on a single network to learn models that can perform well on all tasks. +As a result, they often only accommodate common features of those tasks but neglect each task's specific features. On the other hand, dynamic architecture methods can have a separate network for each task, but they are too expensive to train and not scalable in practice, especially in online settings. +To address this problem, we propose a novel online continual learning method named ``Contextual Transformation Networks” (CTN) to efficiently model the \emph{task-specific features} while enjoying neglectable complexity overhead compared to other fixed architecture methods. +Moreover, inspired by the Complementary Learning Systems (CLS) theory, we propose a novel dual memory design and an objective to train CTN that can address both catastrophic forgetting and knowledge transfer simultaneously. +Our extensive experiments show that CTN is competitive with a large scale dynamic architecture network and consistently outperforms other fixed architecture methods under the same standard backbone. 
Our implementation can be found at \url{https://github.com/phquang/Contextual-Transformation-Network}.",/pdf/677e7eacc15f5a8cfa20b7a38a726599b2f960ca.pdf,ICLR,2021,This paper develops a novel method that can model task-specific features with minimal complexity overhead. +BkUDW_lCb,S1BPbOgCZ,1509090000000.0,1518730000000.0,302,Pointing Out SQL Queries From Text,"[""clwang@cs.washington.edu"", ""mabrocks@microsoft.com"", ""risin@microsoft.com""]","[""Chenglong Wang"", ""Marc Brockschmidt"", ""Rishabh Singh""]","[""Program Synthesis"", ""Semantic Parsing"", ""WikiTable"", ""SQL"", ""Pointer Network""]","The digitization of data has resulted in making datasets available to millions of users in the form of relational databases and spreadsheet tables. However, a majority of these users come from diverse backgrounds and lack the programming expertise to query and analyze such tables. We present a system that allows for querying data tables using natural language questions, where the system translates the question into an executable SQL query. We use a deep sequence to sequence model in which the decoder uses a simple type system of SQL expressions to structure the output prediction. Based on the type, the decoder either copies an output token from the input question using an attention-based copying mechanism or generates it from a fixed vocabulary. We also introduce a value-based loss function that transforms a distribution over locations to copy from into a distribution over the set of input tokens to improve training of our model. 
We evaluate our model on the recently released WikiSQL dataset and show that our model trained using only supervised learning significantly outperforms the current state-of-the-art Seq2SQL model that uses reinforcement learning.",/pdf/44974b4b6331ec41ad3325321a0d27e683f00edf.pdf,ICLR,2018,We present a type-based pointer network model together with a value-based loss method to effectively train a neural model to translate natural language to SQL. +rJlRKjActQ,rkx9zS9ttm,1538090000000.0,1545360000000.0,491,Manifold Mixup: Learning Better Representations by Interpolating Hidden States,"[""vikasverma.iitm@gmail.com"", ""lambalex@iro.umontreal.ca"", ""christopher.j.beckham@gmail.com"", ""najafy@ce.sharif.edu"", ""aaron.courville@gmail.com"", ""imitliagkas@gmail.com"", ""yoshua.umontreal@gmail.com""]","[""Vikas Verma"", ""Alex Lamb"", ""Christopher Beckham"", ""Amir Najafi"", ""Aaron Courville"", ""Ioannis Mitliagkas"", ""Yoshua Bengio""]","[""Regularizer"", ""Supervised Learning"", ""Semi-supervised Learning"", ""Better representation learning"", ""Deep Neural Networks.""]","Deep networks often perform well on the data distribution on which they are trained, yet give incorrect (and often very confident) answers when evaluated on points from off of the training distribution. This is exemplified by the adversarial examples phenomenon but can also be seen in terms of model generalization and domain shift. Ideally, a model would assign lower confidence to points unlike those from the training distribution. We propose a regularizer which addresses this issue by training with interpolated hidden states and encouraging the classifier to be less confident at these points. Because the hidden states are learned, this has an important effect of encouraging the hidden states for a class to be concentrated in such a way so that interpolations within the same class or between two different classes do not intersect with the real data points from other classes. 
This has a major advantage in that it avoids the underfitting which can result from interpolating in the input space. We prove that the exact condition for this problem of underfitting to be avoided by Manifold Mixup is that the dimensionality of the hidden states exceeds the number of classes, which is often the case in practice. Additionally, this concentration can be seen as making the features in earlier layers more discriminative. We show that despite requiring no significant additional computation, Manifold Mixup achieves large improvements over strong baselines in supervised learning, robustness to single-step adversarial attacks, semi-supervised learning, and Negative Log-Likelihood on held out samples.",/pdf/2f2d795bbe8d303946267d61c7e2bdd0447bcb90.pdf,ICLR,2019,"A method for learning better representations, that acts as a regularizer and despite its no significant additional computation cost , achieves improvements over strong baselines on Supervised and Semi-supervised Learning tasks." +PQlC91XxqK5,qFPjLHaoHb4,1601310000000.0,1614990000000.0,21,Segmenting Natural Language Sentences via Lexical Unit Analysis,"[""~Yangming_Li1"", ""~lemao_liu1"", ""~Shuming_Shi1""]","[""Yangming Li"", ""lemao liu"", ""Shuming Shi""]","[""Neural Sequence Labeling"", ""Neural Sequence Segmentation"", ""Dynamic Programming""]","In this work, we present Lexical Unit Analysis (LUA), a framework for general sequence segmentation tasks. Given a natural language sentence, LUA scores all the valid segmentation candidates and utilizes dynamic programming (DP) to extract the maximum scoring one. LUA enjoys a number of appealing properties such as inherently guaranteeing the predicted segmentation to be valid and facilitating globally optimal training and inference. Besides, the practical time complexity of LUA can be reduced to linear time, which is very efficient. 
We have conducted extensive experiments on 5 tasks, including syntactic chunking, named entity recognition (NER), slot filling, Chinese word segmentation, and Chinese part-of-speech (POS) tagging, across 15 datasets. Our models have achieved the state-of-the-art performances on 13 of them. The results also show that the F1 score of identifying long-length segments is notably improved.",/pdf/266fcb02911f133e588a9072a28f59467f6d09b7.pdf,ICLR,2021,"We propose LUA, a novel framework for neural sequence segmentation, which facilitates globally optimal training and inference." +09-528y2Fgf,3S-hD4XZcSF,1601310000000.0,1615790000000.0,772,Rethinking Positional Encoding in Language Pre-training,"[""~Guolin_Ke3"", ""~Di_He1"", ""~Tie-Yan_Liu1""]","[""Guolin Ke"", ""Di He"", ""Tie-Yan Liu""]","[""Natural Language Processing"", ""Pre-training""]","In this work, we investigate the positional encoding methods used in language pre-training (e.g., BERT) and identify several problems in the existing formulations. First, we show that in the absolute positional encoding, the addition operation applied on positional embeddings and word embeddings brings mixed correlations between the two heterogeneous information resources. It may bring unnecessary randomness in the attention and further limit the expressiveness of the model. Second, we question whether treating the position of the symbol \texttt{[CLS]} the same as other words is a reasonable design, considering its special role (the representation of the entire sentence) in the downstream tasks. Motivated from above analysis, we propose a new positional encoding method called \textbf{T}ransformer with \textbf{U}ntied \textbf{P}ositional \textbf{E}ncoding (TUPE). In the self-attention module, TUPE computes the word contextual correlation and positional correlation separately with different parameterizations and then adds them together. 
This design removes the mixed and noisy correlations over heterogeneous embeddings and offers more expressiveness by using different projection matrices. Furthermore, TUPE unties the \texttt{[CLS]} symbol from other positions, making it easier to capture information from all positions. Extensive experiments and ablation studies on GLUE benchmark demonstrate the effectiveness of the proposed method. Codes and models are released at \url{https://github.com/guolinke/TUPE}.",/pdf/33fed0683748564aa65aa880cab67c6104dfd26a.pdf,ICLR,2021,A novel and better positional encoding method for Transformer-based language pre-training models. +OOsR8BzCnl5,38rTshOgITr,1601310000000.0,1618900000000.0,591,Trusted Multi-View Classification,"[""~Zongbo_Han1"", ""~Changqing_Zhang1"", ""~Huazhu_Fu4"", ""~Joey_Tianyi_Zhou1""]","[""Zongbo Han"", ""Changqing Zhang"", ""Huazhu Fu"", ""Joey Tianyi Zhou""]","[""Multi-Modal Learning"", ""Multi-View Learning"", ""Uncertainty Machine Learning""]","Multi-view classification (MVC) generally focuses on improving classification accuracy by using information from different views, typically integrating them into a unified comprehensive representation for downstream tasks. However, it is also crucial to dynamically assess the quality of a view for different samples in order to provide reliable uncertainty estimations, which indicate whether predictions can be trusted. To this end, we propose a novel multi-view classification method, termed trusted multi-view classification, which provides a new paradigm for multi-view learning by dynamically integrating different views at an evidence level. The algorithm jointly utilizes multiple views to promote both classification reliability (uncertainty estimation during testing) and robustness (out-of-distribution-awareness during training) by integrating evidence from each view. 
To achieve this, the Dirichlet distribution is used to model the distribution of the class probabilities, parameterized with evidence from different views and integrated with the Dempster-Shafer theory. The unified learning framework induces accurate uncertainty and accordingly endows the model with both reliability and robustness for out-of-distribution samples. Extensive experimental results validate the effectiveness of the proposed model in accuracy, reliability and robustness.",/pdf/4ae336db914c13c1db09afbb3dea3d948ad4aa37.pdf,ICLR,2021, +H1ebc0VYvH,SyejytudPH,1569440000000.0,1577170000000.0,1273,Unaligned Image-to-Sequence Transformation with Loop Consistency,"[""siw030@ucsd.edu"", ""jlazarow@ucsd.edu"", ""kwl042@ucsd.edu"", ""ztu@ucsd.edu""]","[""Siyang Wang"", ""Justin Lazarow"", ""Kwonjoon Lee"", ""Zhuowen Tu""]",[],"We tackle the problem of modeling sequential visual phenomena. Given examples of a phenomena that can be divided into discrete time steps, we aim to take an input from any such time and realize this input at all other time steps in the sequence. Furthermore, we aim to do this \textit{without} ground-truth aligned sequences --- avoiding the difficulties needed for gathering aligned data. This generalizes the unpaired image-to-image problem from generating pairs to generating sequences. We extend cycle consistency to \textit{loop consistency} and alleviate difficulties associated with learning in the resulting long chains of computation. We show competitive results compared to existing image-to-image techniques when modeling several different data sets including the Earth's seasons and aging of human faces.",/pdf/fb85101445593cfec1d43633ba3d178869ce4e38.pdf,ICLR,2020,LoopGAN extends cycle length in CycleGAN to enable unaligned sequential transformation for more than two time steps. 
+ryggIs0cYQ,HJei3rRFY7,1538090000000.0,1556010000000.0,143,Differentiable Learning-to-Normalize via Switchable Normalization,"[""pluo@ie.cuhk.edu.hk"", ""renjiamin@sensetime.com"", ""pengzhanglin@sensetime.com"", ""zhangruimao@sensetime.com"", ""lijingyu@sensetime.com""]","[""Ping Luo"", ""Jiamin Ren"", ""Zhanglin Peng"", ""Ruimao Zhang"", ""Jingyu Li""]","[""normalization"", ""deep learning"", ""CNN"", ""computer vision""]","We address a learning-to-normalize problem by proposing Switchable Normalization (SN), which learns to select different normalizers for different normalization layers of a deep neural network. SN employs three distinct scopes to compute statistics (means and variances) including a channel, a layer, and a minibatch. SN switches between them by learning their importance weights in an end-to-end manner. It has several good properties. First, it adapts to various network architectures and tasks (see Fig.1). Second, it is robust to a wide range of batch sizes, maintaining high performance even when small minibatch is presented (e.g. 2 images/GPU). Third, SN does not have sensitive hyper-parameter, unlike group normalization that searches the number of groups as a hyper-parameter. Without bells and whistles, SN outperforms its counterparts on various challenging benchmarks, such as ImageNet, COCO, CityScapes, ADE20K, and Kinetics. Analyses of SN are also presented. We hope SN will help ease the usage and understand the normalization techniques in deep learning. The code of SN will be released.",/pdf/5170b99841d34d19e04153419f4002c4234b0c57.pdf,ICLR,2019, +rklz9iAcKQ,SkgHFy_FFX,1538090000000.0,1545410000000.0,513,Deep Graph Infomax,"[""petar.velickovic@cst.cam.ac.uk"", ""liam.fedus@gmail.com"", ""wleif@stanford.edu"", ""pietro.lio@cst.cam.ac.uk"", ""yoshua.umontreal@gmail.com"", ""devon.hjelm@microsoft.com""]","[""Petar Veli\u010dkovi\u0107"", ""William Fedus"", ""William L. 
Hamilton"", ""Pietro Li\u00f2"", ""Yoshua Bengio"", ""R Devon Hjelm""]","[""Unsupervised Learning"", ""Graph Neural Networks"", ""Graph Convolutions"", ""Mutual Information"", ""Infomax"", ""Deep Learning""]","We present Deep Graph Infomax (DGI), a general approach for learning node representations within graph-structured data in an unsupervised manner. DGI relies on maximizing mutual information between patch representations and corresponding high-level summaries of graphs---both derived using established graph convolutional network architectures. The learnt patch representations summarize subgraphs centered around nodes of interest, and can thus be reused for downstream node-wise learning tasks. In contrast to most prior approaches to unsupervised learning with GCNs, DGI does not rely on random walk objectives, and is readily applicable to both transductive and inductive learning setups. We demonstrate competitive performance on a variety of node classification benchmarks, which at times even exceeds the performance of supervised learning.",/pdf/67df6b5ffbf0ef252ee5f21442c63f5a1bab1023.pdf,ICLR,2019,"A new method for unsupervised representation learning on graphs, relying on maximizing mutual information between local and global representations in a graph. State-of-the-art results, competitive with supervised learning." 
+q_Q9MMGwSQu,8OdZQvr98FY,1601310000000.0,1614990000000.0,2981,A Simple and Effective Baseline for Out-of-Distribution Detection using Abstention,"[""~Sunil_Thulasidasan1"", ""~Sushil_Thapa2"", ""sayeradbl@lanl.gov"", ""~Gopinath_Chennupati1"", ""~Tanmoy_Bhattacharya1"", ""~Jeff_Bilmes1""]","[""Sunil Thulasidasan"", ""Sushil Thapa"", ""Sayera Dhaubhadel"", ""Gopinath Chennupati"", ""Tanmoy Bhattacharya"", ""Jeff Bilmes""]","[""deep learning"", ""out-of-distribution detection""]","Refraining from confidently predicting when faced with categories of inputs different from those seen during training is an important requirement for the safe deployment of deep learning systems. While simple to state, this has been a particularly challenging problem in deep learning, where models often end up making overconfident predictions in such situations. In this work we present a simple, but highly effective approach to deal with out-of-distribution detection that uses the principle of abstention: when encountering a sample from an unseen class, the desired behavior is to abstain from predicting. Our approach uses a network with an extra abstention class and is trained on a dataset that is augmented with an uncurated set that consists of a large number of out-of-distribution (OoD) samples that are assigned the label of the abstention class; the model is then trained to learn an effective discriminator between in and out-of-distribution samples. + + We compare this relatively simple approach against a wide variety of more complex methods that have been proposed both for out-of-distribution detection as well as uncertainty modeling in deep learning, and empirically demonstrate its effectiveness on a wide variety of benchmarks and deep architectures for image recognition and text classification, often outperforming existing approaches by significant margins. 
Given the simplicity and effectiveness of this method, we propose that this approach be used as a new additional baseline for future work in this domain.",/pdf/246a9f067a15bac0dfcca296a4d84cf4f0979d03.pdf,ICLR,2021,"Deep neural networks augmented with an extra abstention class and trained on in and out-of-distribution data show strong out-of-distribution detection performance, often exceeding existing state-of-the-art." +BJlITC4KDB,Hyl8yG9uwr,1569440000000.0,1577170000000.0,1395,Multi-Sample Dropout for Accelerated Training and Better Generalization,"[""inouehrs@jp.ibm.com""]","[""Hiroshi Inoue""]","[""dropout"", ""regularization"", ""convolutional neural networks""]","Dropout is a simple but efficient regularization technique for achieving better generalization of deep neural networks (DNNs); hence it is widely used in tasks based on DNNs. During training, dropout randomly discards a portion of the neurons to avoid overfitting. This paper presents an enhanced dropout technique, which we call multi-sample dropout, for both accelerating training and improving generalization over the original dropout. The original dropout creates a randomly selected subset (called a dropout sample) from the input in each training iteration while the multi-sample dropout creates multiple dropout samples. The loss is calculated for each sample, and then the sample losses are averaged to obtain the final loss. This technique can be easily implemented without implementing a new operator by duplicating a part of the network after the dropout layer while sharing the weights among the duplicated fully connected layers. Experimental results showed that multi-sample dropout significantly accelerates training by reducing the number of iterations until convergence for image classification tasks using the ImageNet, CIFAR-10, CIFAR-100, and SVHN datasets. 
Multi-sample dropout does not significantly increase computation cost per iteration for deep convolutional networks because most of the computation time is consumed in the convolution layers before the dropout layer, which are not duplicated. Experiments also showed that networks trained using multi-sample dropout achieved lower error rates and losses for both the training set and validation set. +",/pdf/0abd45e461ce7bc62ab4c78edff5232551a2d017.pdf,ICLR,2020, +Hkxi2gHYvH,rkeV9eZYPH,1569440000000.0,1577170000000.0,2550,Predictive Coding for Boosting Deep Reinforcement Learning with Sparse Rewards,"[""xingyulu0701@berkeley.edu"", ""pabbeel@cs.berkeley.edu"", ""stas@berkeley.edu""]","[""Xingyu Lu"", ""Pieter Abbeel"", ""Stas Tiomkin""]","[""reinforcement learning"", ""representation learning"", ""reward shaping"", ""predictive coding""]","While recent progress in deep reinforcement learning has enabled robots to learn complex behaviors, tasks with long horizons and sparse rewards remain an ongoing challenge. In this work, we propose an effective reward shaping method through predictive coding to tackle sparse reward problems. By learning predictive representations offline and using these representations for reward shaping, we gain access to reward signals that understand the structure and dynamics of the environment. In particular, our method achieves better learning by providing reward signals that 1) understand environment dynamics, 2) emphasize features most useful for learning, and 3) resist noise in learned representations through reward accumulation. We demonstrate the usefulness of this approach in different domains ranging from robotic manipulation to navigation, and we show that reward signals produced through predictive coding are as effective for learning as hand-crafted rewards.",/pdf/3cc45a51eac1fd4b392f78aa0a9d78a26c19c6ff.pdf,ICLR,2020,We apply predictive coding to provide reward signals in sparse reward problems. 
+ByqiJIqxg,,1478280000000.0,1487800000000.0,280,Online Bayesian Transfer Learning for Sequential Data Modeling,"[""pjaini@uwaterloo.ca"", ""chenzhitang2@huawei.com"", ""pablo@veedata.io"", ""edith.law@uwaterloo.ca"", ""lmiddlet@uwaterloo.ca"", ""kregan@uwaterloo.ca"", ""mschaekermann@uwaterloo.ca"", ""g.trimponias@huawei.com"", ""james.tung@uwaterloo.ca"", ""ppoupart@uwaterloo.ca""]","[""Priyank Jaini"", ""Zhitang Chen"", ""Pablo Carbajal"", ""Edith Law"", ""Laura Middleton"", ""Kayla Regan"", ""Mike Schaekermann"", ""George Trimponias"", ""James Tung"", ""Pascal Poupart""]","[""Unsupervised Learning"", ""Transfer Learning"", ""Applications""]","We consider the problem of inferring a sequence of hidden states associated with a sequence of observations produced by an individual within a population. Instead of learning a single sequence model for the population (which does not account for variations within the population), we learn a set of basis sequence models based on different individuals. The sequence of hidden states for a new individual is inferred in an online fashion by estimating a distribution over the basis models that best explain the sequence of observations of this new individual. We explain how to do this in the context of hidden Markov models with Gaussian mixture models that are learned based on streaming data by online Bayesian moment matching. The resulting transfer learning technique is demonstrated with three real-world applications: activity recognition based on smartphone sensors, sleep classification based on electroencephalography data and the prediction of the direction of future packet flows between a pair of servers in telecommunication networks. 
",/pdf/55bcf7454fc94122042a63a947299353b6099fe2.pdf,ICLR,2017, +N9oPAFcuYWX,26mx2c105vk,1601310000000.0,1614990000000.0,474,Understanding and Mitigating Accuracy Disparity in Regression,"[""~Jianfeng_Chi1"", ""~Han_Zhao1"", ""~Geoff_Gordon2"", ""~Yuan_Tian2""]","[""Jianfeng Chi"", ""Han Zhao"", ""Geoff Gordon"", ""Yuan Tian""]","[""Algorithmic Fairness"", ""Representation Learning""]","With the widespread deployment of large-scale prediction systems in high-stakes domains, e.g., face recognition, criminal justice, etc., disparity on prediction accuracy between different demographic subgroups has called for fundamental understanding on the source of such disparity and algorithmic intervention to mitigate it. In this paper, we study the accuracy disparity problem in regression. To begin with, we first propose an error decomposition theorem, which decomposes the accuracy disparity into the distance between label populations and the distance between conditional representations, to help explain why such accuracy disparity appears in practice. Motivated by this error decomposition and the general idea of distribution alignment with statistical distances, we then propose an algorithm to reduce this disparity, and analyze its game-theoretic optima of the proposed objective function. We conduct experiments on four real-world datasets. 
The experimental results suggest that our proposed algorithms can effectively mitigate accuracy disparity while maintaining the predictive power of the regression models.",/pdf/0bd24517866d7e9f14ce7cc883e46356c7af51cd.pdf,ICLR,2021, +Bki4EfWCb,ryP4VMW0b,1509140000000.0,1518730000000.0,820,Inference Suboptimality in Variational Autoencoders,"[""ccremer@cs.toronto.edu"", ""lxuechen@cs.toronto.edu"", ""duvenaud@cs.toronto.edu""]","[""Chris Cremer"", ""Xuechen Li"", ""David Duvenaud""]","[""Approximate Inference"", ""Amortization"", ""Posterior Approximations"", ""Variational Autoencoder""]","Amortized inference has led to efficient approximate inference for large datasets. The quality of posterior inference is largely determined by two factors: a) the ability of the variational distribution to model the true posterior and b) the capacity of the recognition network to generalize inference over all datapoints. We analyze approximate inference in variational autoencoders in terms of these factors. We find that suboptimal inference is often due to amortizing inference rather than the limited complexity of the approximating distribution. We show that this is due partly to the generator learning to accommodate the choice of approximation. Furthermore, we show that the parameters used to increase the expressiveness of the approximation play a role in generalizing inference rather than simply improving the complexity of the approximation.",/pdf/64d3ccecd88e19eee1cc56616f73294601451592.pdf,ICLR,2018,We decompose the gap between the marginal log-likelihood and the evidence lower bound and study the effect of the approximate posterior on the true posterior distribution in VAEs. 
+rkg8xTEtvB,r1ge_IxIPH,1569440000000.0,1577170000000.0,339,Hierarchical Disentangle Network for Object Representation Learning,"[""qiaoshishi14@mails.ucas.ac.cn"", ""wangruiping@ict.ac.cn"", ""sgshan@ict.ac.cn"", ""xlchen@ict.ac.cn""]","[""Shishi Qiao"", ""Ruiping Wang"", ""Shiguang Shan"", ""Xilin Chen""]",[],"An object can be described as the combination of primary visual attributes. Disentangling such underlying primitives is a long-standing objective of representation learning. It is observed that categories have natural multi-granularity or hierarchical characteristics, i.e. any two objects can share some common primitives in a particular category granularity while they may possess their unique ones in another granularity. However, previous works usually operate in a flat manner (i.e. in a particular granularity) to disentangle the representations of objects. Though they may obtain the primitives to constitute objects as the categories in that granularity, their results are neither efficient nor complete. In this paper, we propose the hierarchical disentangle network (HDN) to exploit the rich hierarchical characteristics among categories to divide the disentangling process in a coarse-to-fine manner, such that each level only focuses on learning the specific representations in its granularity and finally the common and unique representations in all granularities jointly constitute the raw object. Specifically, HDN is designed based on an encoder-decoder architecture. To simultaneously ensure the disentanglement and interpretability of the encoded representations, a novel hierarchical generative adversarial network (GAN) is elaborately designed. 
Quantitative and qualitative evaluations on four object datasets validate the effectiveness of our method.",/pdf/c2e51052dbff2225f639ed34102a38d146ce68a6.pdf,ICLR,2020,Disentangle the primitives of objects in different hierarchy levels +3jjmdp7Hha,w6aqRfrg7E,1601310000000.0,1611610000000.0,2063,Meta Back-Translation,"[""~Hieu_Pham1"", ""~Xinyi_Wang1"", ""~Yiming_Yang1"", ""~Graham_Neubig1""]","[""Hieu Pham"", ""Xinyi Wang"", ""Yiming Yang"", ""Graham Neubig""]","[""meta learning"", ""machine translation"", ""back translation""]","Back-translation is an effective strategy to improve the performance of Neural Machine Translation~(NMT) by generating pseudo-parallel data. However, several recent works have found that better translation quality in the pseudo-parallel data does not necessarily lead to a better final translation model, while lower-quality but diverse data often yields stronger results instead. +In this paper we propose a new way to generate pseudo-parallel data for back-translation that directly optimizes the final model performance. Specifically, we propose a meta-learning framework where the back-translation model learns to match the forward-translation model's gradients on the development data with those on the pseudo-parallel data. In our evaluations in both the standard datasets WMT En-De'14 and WMT En-Fr'14, as well as a multilingual translation setting, our method leads to significant improvements over strong baselines. ",/pdf/30daa849a22aebea5ef566f6b6a0f9cf99027a34.pdf,ICLR,2021,Use meta learning to teach the back-translation model to generate better back-translated sentences. 
+Hyx0slrFvH,ByxpRkbKvH,1569440000000.0,1583910000000.0,2519,Mixed Precision DNNs: All you need is a good parametrization,"[""stefan.uhlich@sony.com"", ""lukas.mauch@sony.com"", ""fabien.cardinaux@sony.com"", ""kazuki.yoshiyama@sony.com"", ""javier.alonso@sony.com"", ""stephen.tiedemann@sony.com"", ""thomas.kemp@sony.com"", ""akira.b.nakamura@sony.com""]","[""Stefan Uhlich"", ""Lukas Mauch"", ""Fabien Cardinaux"", ""Kazuki Yoshiyama"", ""Javier Alonso Garcia"", ""Stephen Tiedemann"", ""Thomas Kemp"", ""Akira Nakamura""]","[""Deep Neural Network Compression"", ""Quantization"", ""Straight through gradients""]","Efficient deep neural network (DNN) inference on mobile or embedded devices typically involves quantization of the network parameters and activations. In particular, mixed precision networks achieve better performance than networks with homogeneous bitwidth for the same size constraint. Since choosing the optimal bitwidths is not straightforward, training methods that can learn them are desirable. Differentiable quantization with straight-through gradients allows learning the quantizer's parameters using gradient methods. We show that a suitable parametrization of the quantizer is the key to achieving stable training and good final performance. Specifically, we propose to parametrize the quantizer with the step size and dynamic range. The bitwidth can then be inferred from them. Other parametrizations, which explicitly use the bitwidth, consistently perform worse. 
We confirm our findings with experiments on CIFAR-10 and ImageNet and we obtain mixed precision DNNs with learned quantization parameters, achieving state-of-the-art performance.",/pdf/3ad26f430240ea72fd17805954df5a7e3b96b883.pdf,ICLR,2020, +HJe9cR4KvB,HJxFUpu_wB,1569440000000.0,1577170000000.0,1292,Learning to Contextually Aggregate Multi-Source Supervision for Sequence Labeling,"[""olan@usc.edu"", ""huan183@usc.edu"", ""yuchen.lin@usc.edu"", ""jian567@usc.edu"", ""xiangren@usc.edu""]","[""Ouyu Lan*"", ""Xiao Huang*"", ""Bill Yuchen Lin"", ""He Jiang"", ""Xiang Ren""]","[""crowdsourcing"", ""domain adaptation"", ""sequence labeling"", ""named entity recognition"", ""weak supervision""]","Sequence labeling is a fundamental framework for various natural language processing problems including part-of-speech tagging and named entity recognition. Its performance is largely influenced by the annotation quality and quantity in supervised learning scenarios. In many cases, ground truth labels are costly and time-consuming to collect or even non-existent, while imperfect ones could be easily accessed or transferred from different domains. A typical example is crowd-sourced datasets which have multiple annotations for each sentence which may be noisy or incomplete. Additionally, predictions from multiple source models in transfer learning can be seen as a case of multi-source supervision. In this paper, we propose a novel framework named Consensus Network (CONNET) to conduct training with imperfect annotations from multiple sources. It learns the representation for every weak supervision source and dynamically aggregates them by a context-aware attention mechanism. Finally, it leads to a model reflecting the consensus among multiple sources. We evaluate the proposed framework in two practical settings of multi-source learning: learning with crowd annotations and unsupervised cross-domain model adaptation. 
Extensive experimental results show that our model achieves significant improvements over existing methods in both settings.",/pdf/33d72e3977a46779f756e0ed5fa803c8a3805bd9.pdf,ICLR,2020,A model to contextually aggregate multi-source supervision for sequence learning. +S1xKd24twB,rkgBJvZwIS,1569440000000.0,1583910000000.0,50,SQIL: Imitation Learning via Reinforcement Learning with Sparse Rewards,"[""sgr@berkeley.edu"", ""anca@berkeley.edu"", ""svlevine@eecs.berkeley.edu""]","[""Siddharth Reddy"", ""Anca D. Dragan"", ""Sergey Levine""]","[""Imitation Learning"", ""Reinforcement Learning""]","Learning to imitate expert behavior from demonstrations can be challenging, especially in environments with high-dimensional, continuous observations and unknown dynamics. Supervised learning methods based on behavioral cloning (BC) suffer from distribution shift: because the agent greedily imitates demonstrated actions, it can drift away from demonstrated states due to error accumulation. Recent methods based on reinforcement learning (RL), such as inverse RL and generative adversarial imitation learning (GAIL), overcome this issue by training an RL agent to match the demonstrations over a long horizon. Since the true reward function for the task is unknown, these methods learn a reward function from the demonstrations, often using complex and brittle approximation techniques that involve adversarial training. We propose a simple alternative that still uses RL, but does not require learning a reward function. The key idea is to provide the agent with an incentive to match the demonstrations over a long horizon, by encouraging it to return to demonstrated states upon encountering new, out-of-distribution states. We accomplish this by giving the agent a constant reward of r=+1 for matching the demonstrated action in a demonstrated state, and a constant reward of r=0 for all other behavior. 
Our method, which we call soft Q imitation learning (SQIL), can be implemented with a handful of minor modifications to any standard Q-learning or off-policy actor-critic algorithm. Theoretically, we show that SQIL can be interpreted as a regularized variant of BC that uses a sparsity prior to encourage long-horizon imitation. Empirically, we show that SQIL outperforms BC and achieves competitive results compared to GAIL, on a variety of image-based and low-dimensional tasks in Box2D, Atari, and MuJoCo. This paper is a proof of concept that illustrates how a simple imitation method based on RL with constant rewards can be as effective as more complex methods that use learned rewards.",/pdf/d416daeb3577d1f6c75f4064d2b45f3c5edae7c9.pdf,ICLR,2020,"A simple and effective alternative to adversarial imitation learning: initialize experience replay buffer with demonstrations, set their reward to +1, set reward for all other data to 0, run Q-learning or soft actor-critic to train." +ryghZJBKPS,SygmFTjdwr,1569440000000.0,1583910000000.0,1557,"Deep Batch Active Learning by Diverse, Uncertain Gradient Lower Bounds","[""jordanta@cs.princeton.edu"", ""chichengz@cs.arizona.edu"", ""akshay.krishnamurthy@microsoft.com"", ""jcl@microsoft.com"", ""alekha@microsoft.com""]","[""Jordan T. Ash"", ""Chicheng Zhang"", ""Akshay Krishnamurthy"", ""John Langford"", ""Alekh Agarwal""]","[""deep learning"", ""active learning"", ""batch active learning""]","We design a new algorithm for batch active learning with deep neural network models. Our algorithm, Batch Active learning by Diverse Gradient Embeddings (BADGE), samples groups of points that are disparate and high-magnitude when represented in a hallucinated gradient space, a strategy designed to incorporate both predictive uncertainty and sample diversity into every selected batch. Crucially, BADGE trades off between diversity and uncertainty without requiring any hand-tuned hyperparameters. 
While other approaches sometimes succeed for particular batch sizes or architectures, BADGE consistently performs as well or better, making it a useful option for real world active learning problems.",/pdf/333298af887073a9582b3fb8f532a53ce4bcf078.pdf,ICLR,2020,"We introduce a new batch active learning algorithm that's robust to model architecture, batch size, and dataset." +Bkln2a4tPB,BklL9EldDS,1569440000000.0,1577170000000.0,796,Customizing Sequence Generation with Multi-Task Dynamical Systems,"[""abird@turing.ac.uk"", ""ckiw@inf.ed.ac.uk""]","[""Alex Bird"", ""Christopher K. I. Williams""]","[""Time-series modelling"", ""Dynamical systems"", ""RNNs"", ""Multi-task learning""]","Dynamical system models (including RNNs) often lack the ability to adapt the sequence generation or prediction to a given context, limiting their real-world application. In this paper we show that hierarchical multi-task dynamical systems (MTDSs) provide direct user control over sequence generation, via use of a latent code z that specifies the customization to the +individual data sequence. This enables style transfer, interpolation and morphing within generated sequences. We show the MTDS can improve predictions via latent code interpolation, and avoid the long-term performance degradation of standard RNN approaches.",/pdf/ec770730c67f30b7b7573c53a68663efe12a036a.pdf,ICLR,2020,Tailoring predictions from sequence models (such as LDSs and RNNs) via an explicit latent code. 
+S1xJFREKvB,BJxpbz__PS,1569440000000.0,1577170000000.0,1233,Amortized Nesterov's Momentum: Robust and Lightweight Momentum for Deep Learning,"[""kwzhou@cse.cuhk.edu.hk"", ""jinyh@preferred.jp"", ""qhding@cse.cuhk.edu.hk"", ""jcheng@cse.cuhk.edu.hk""]","[""Kaiwen Zhou"", ""Yanghua Jin"", ""Qinghua Ding"", ""James Cheng""]","[""momentum"", ""nesterov"", ""optimization"", ""deep learning"", ""neural networks""]","Stochastic Gradient Descent (SGD) with Nesterov's momentum is a widely used optimizer in deep learning, which is observed to have excellent generalization performance. However, due to the large stochasticity, SGD with Nesterov's momentum is not robust, i.e., its performance may deviate significantly from the expectation. In this work, we propose Amortized Nesterov's Momentum, a special variant of Nesterov's momentum which has more robust iterates, faster convergence in the early stage and higher efficiency. Our experimental results show that this new momentum achieves similar (sometimes better) generalization performance with little-to-no tuning. In the convex case, we provide optimal convergence rates for our new methods and discuss how the theorems explain the empirical results. ",/pdf/b3a9c490fedc2454fde4cb11353b711c479bacc1.pdf,ICLR,2020,"Amortizing Nesterov's momentum for more robust, lightweight and fast deep learning training." 
+HJli2hNKDH,rJe7JRcfvr,1569440000000.0,1583910000000.0,203,Observational Overfitting in Reinforcement Learning,"[""xsong@berkeley.edu"", ""ydjiang@google.com"", ""stephentu@google.com"", ""yilundu@mit.edu"", ""neyshabur@google.com""]","[""Xingyou Song"", ""Yiding Jiang"", ""Stephen Tu"", ""Yilun Du"", ""Behnam Neyshabur""]","[""observational"", ""overfitting"", ""reinforcement"", ""learning"", ""generalization"", ""implicit"", ""regularization"", ""overparametrization""]","A major component of overfitting in model-free reinforcement learning (RL) involves the case where the agent may mistakenly correlate reward with certain spurious features from the observations generated by the Markov Decision Process (MDP). We provide a general framework for analyzing this scenario, which we use to design multiple synthetic benchmarks from only modifying the observation space of an MDP. When an agent overfits to different observation spaces even if the underlying MDP dynamics is fixed, we term this observational overfitting. Our experiments expose intriguing properties especially with regards to implicit regularization, and also corroborate results from previous works in RL generalization and supervised learning (SL). ",/pdf/414967553544dd06517120ef72ad27b1bffaf61b.pdf,ICLR,2020,We isolate one factor of RL generalization by analyzing the case when the agent only overfits to the observations. We show that architectural implicit regularizations occur in this regime. +jwgZh4Y4U7,YPFFR3BMj2v,1601310000000.0,1614990000000.0,2117,Temporal and Object Quantification Nets,"[""~Jiayuan_Mao1"", ""~Zhezheng_Luo1"", ""~Chuang_Gan1"", ""~Joshua_B._Tenenbaum1"", ""~Jiajun_Wu1"", ""~Leslie_Pack_Kaelbling1"", ""~Tomer_Ullman1""]","[""Jiayuan Mao"", ""Zhezheng Luo"", ""Chuang Gan"", ""Joshua B. 
Tenenbaum"", ""Jiajun Wu"", ""Leslie Pack Kaelbling"", ""Tomer Ullman""]","[""Temporal Modeling"", ""Object-Centric Representations""]","We aim to learn generalizable representations for complex activities by quantifying over both entities and time, as in “the kicker is behind all the other players,” or “the player controls the ball until it moves toward the goal.” Such a structural inductive bias of object relations, object quantification, and temporal orders will enable the learned representation to generalize to situations with varying numbers of agents, objects, and time courses. In this paper, we present Temporal and Object Quantification Nets (TOQ-Nets), which provide such structural inductive bias for learning composable action concepts from time sequences that describe the properties and relations of multiple entities. We evaluate TOQ-Nets on two benchmarks: trajectory-based soccer event detection, and 6D pose-based manipulation concept learning. We demonstrate that TOQ-Nets can generalize from small amounts of data to scenarios where there are more agents and objects than were present during training. The learned concepts are also robust with respect to temporally warped sequences and easily transfer to other prediction tasks in a similar domain.",/pdf/bad5b357e2f031c30768a32d2993467eb096a779.pdf,ICLR,2021,We present a framework for learning generalizable representations for complex activities by quantifying over both entities and time. +IrM64DGB21,gALrZeRuajhr,1601310000000.0,1615980000000.0,1895,On the role of planning in model-based deep reinforcement learning,"[""~Jessica_B_Hamrick1"", ""~Abram_L._Friesen1"", ""~Feryal_Behbahani1"", ""~Arthur_Guez1"", ""~Fabio_Viola2"", ""switherspoon@google.com"", ""~Thomas_Anthony1"", ""~Lars_Holger_Buesing1"", ""~Petar_Veli\u010dkovi\u01071"", ""~Theophane_Weber1""]","[""Jessica B Hamrick"", ""Abram L. 
Friesen"", ""Feryal Behbahani"", ""Arthur Guez"", ""Fabio Viola"", ""Sims Witherspoon"", ""Thomas Anthony"", ""Lars Holger Buesing"", ""Petar Veli\u010dkovi\u0107"", ""Theophane Weber""]","[""model-based RL"", ""planning"", ""MuZero""]","Model-based planning is often thought to be necessary for deep, careful reasoning and generalization in artificial agents. While recent successes of model-based reinforcement learning (MBRL) with deep function approximation have strengthened this hypothesis, the resulting diversity of model-based methods has also made it difficult to track which components drive success and why. In this paper, we seek to disentangle the contributions of recent methods by focusing on three questions: (1) How does planning benefit MBRL agents? (2) Within planning, what choices drive performance? (3) To what extent does planning improve generalization? To answer these questions, we study the performance of MuZero (Schrittwieser et al., 2019), a state-of-the-art MBRL algorithm with strong connections and overlapping components with many other MBRL algorithms. We perform a number of interventions and ablations of MuZero across a wide range of environments, including control tasks, Atari, and 9x9 Go. Our results suggest the following: (1) Planning is most useful in the learning process, both for policy updates and for providing a more useful data distribution. (2) Using shallow trees with simple Monte-Carlo rollouts is as performant as more complex methods, except in the most difficult reasoning tasks. (3) Planning alone is insufficient to drive strong generalization. These results indicate where and how to utilize planning in reinforcement learning settings, and highlight a number of open questions for future MBRL research.",/pdf/e451ad8cff01119850da4fdb852266ebaa8184ce.pdf,ICLR,2021,An empirical investigation into how planning drives performance in model-based RL algorithms. 
+SJAr0QFxe,,1478210000000.0,1485090000000.0,88,Demystifying ResNet,"[""lisihan13@mails.tsinghua.edu.cn"", ""jiantao@stanford.edu"", ""yjhan@stanford.edu"", ""tsachy@stanford.edu""]","[""Sihan Li"", ""Jiantao Jiao"", ""Yanjun Han"", ""Tsachy Weissman""]","[""Deep learning"", ""Optimization"", ""Theory""]","We provide a theoretical explanation for the superb performance of ResNet via the study of deep linear networks and some nonlinear variants. We show that with or without nonlinearities, by adding shortcuts that have depth two, the condition number of the Hessian of the loss function at the zero initial point is depth-invariant, which makes training very deep models no more difficult than shallow ones. Shortcuts of higher depth result in an extremely flat (high-order) stationary point initially, from which the optimization algorithm is hard to escape. The 1-shortcut, however, is essentially equivalent to no shortcuts. Extensive experiments are provided accompanying our theoretical results. We show that initializing the network to small weights with 2-shortcuts achieves significantly better results than random Gaussian (Xavier) initialization, orthogonal initialization, and shortcuts of deeper depth, from various perspectives ranging from final loss, learning dynamics and stability, to the behavior of the Hessian along the learning process.",/pdf/f1c641e470a3645d5942b5097834a42c0ba01dd8.pdf,ICLR,2017, +xTV-wQ-pMrU,vIfBiPZORC,1601310000000.0,1614990000000.0,3358,Shuffle to Learn: Self-supervised learning from permutations via differentiable ranking,"[""~Andrew_N_Carr1"", ""~Quentin_Berthet2"", ""~Mathieu_Blondel1"", ""~Olivier_Teboul2"", ""~Neil_Zeghidour1""]","[""Andrew N Carr"", ""Quentin Berthet"", ""Mathieu Blondel"", ""Olivier Teboul"", ""Neil Zeghidour""]",[],"Self-supervised pre-training using so-called ""pretext"" tasks has recently shown impressive performance across a wide range of tasks. 
In this work we advance self-supervised learning from permutations, that consists in shuffling parts of input and training a model to reorder them, improving downstream performance in classification. To do so, we overcome the main challenges of integrating permutation inversions (a discontinuous operation) into an end-to-end training scheme, heretofore sidestepped by casting the reordering task as classification, fundamentally reducing the space of permutations that can be exploited. These advances rely on two main, independent contributions. First, we use recent advances in differentiable ranking to integrate the permutation inversion flawlessly into a neural network, enabling us to use the full set of permutations, at no additional computing cost. Our experiments validate that learning from all possible permutations (up to $10^{18}$) improves the quality of the pre-trained representations over using a limited, fixed set. Second, we successfully demonstrate that inverting permutations is a meaningful pretext task in a diverse range of modalities, beyond images, which does not require modality-specific design. In particular, we also improve music understanding by reordering spectrogram patches in the frequency space, as well as video classification by reordering frames along the time axis. We furthermore analyze the influence of the patches that we use (vertical, horizontal, 2-dimensional), as well as the benefit of our approach in different data regimes. ",/pdf/2c3a505ad201e10cefa00cb6938a8c5c746a0293.pdf,ICLR,2021,We use recent advances in differentiable ranking to allow for self-supervised pre-training using the full set of permutations. 
+rygZJ2RcF7,HJgdo-69tQ,1538090000000.0,1545360000000.0,958,Out-of-Sample Extrapolation with Neuron Editing,"[""matthew.amodio@yale.edu"", ""davidvandijk@gmail.com"", ""ruth.montgomery@yale.edu"", ""guy.wolf@yale.edu"", ""smita.krishnaswamy@yale.edu""]","[""Matthew Amodio"", ""David van Dijk"", ""Ruth Montgomery"", ""Guy Wolf"", ""Smita Krishnaswamy""]","[""generative adversarial networks"", ""computational biology"", ""generating"", ""generation"", ""extrapolation"", ""out-of-sample"", ""neural network inference""]","While neural networks can be trained to map from one specific dataset to another, they usually do not learn a generalized transformation that can extrapolate accurately outside the space of training. For instance, a generative adversarial network (GAN) exclusively trained to transform images of cars from light to dark might not have the same effect on images of horses. This is because neural networks are good at generation within the manifold of the data that they are trained on. However, generating new samples outside of the manifold or extrapolating ""out-of-sample"" is a much harder problem that has been less well studied. To address this, we introduce a technique called neuron editing that learns how neurons encode an edit for a particular transformation in a latent space. We use an autoencoder to decompose the variation within the dataset into activations of different neurons and generate transformed data by defining an editing transformation on those neurons. By performing the transformation in a latent trained space, we encode fairly complex and non-linear transformations to the data with much simpler distribution shifts to the neuron's activations. 
We showcase our technique on image domain/style transfer and two biological applications: removal of batch artifacts representing unwanted noise and modeling the effect of drug treatments to predict synergy between drugs.",/pdf/ee973975dafaef863742edcf43043414dbeb15f7.pdf,ICLR,2019,"We reframe the generation problem as one of editing existing points, and as a result extrapolate better than traditional GANs." +kmG8vRXTFv,hegFlMVGfw0,1601310000000.0,1615990000000.0,761,Augmenting Physical Models with Deep Networks for Complex Dynamics Forecasting,"[""~Yuan_Yin1"", ""~Vincent_LE_GUEN1"", ""~J\u00e9r\u00e9mie_DONA2"", ""~Emmanuel_de_Bezenac1"", ""~Ibrahim_Ayed1"", ""~Nicolas_THOME2"", ""~patrick_gallinari1""]","[""Yuan Yin"", ""Vincent LE GUEN"", ""J\u00e9r\u00e9mie DONA"", ""Emmanuel de Bezenac"", ""Ibrahim Ayed"", ""Nicolas THOME"", ""patrick gallinari""]","[""spatio-temporal forecasting"", ""deep learning"", ""physics"", ""differential equations"", ""hybrid systems""]","Forecasting complex dynamical phenomena in settings where only partial knowledge of their dynamics is available is a prevalent problem across various scientific fields. While purely data-driven approaches are arguably insufficient in this context, standard physical modeling based approaches tend to be over-simplistic, inducing non-negligible errors. In this work, we introduce the APHYNITY framework, a principled approach for augmenting incomplete physical dynamics described by differential equations with deep data-driven models. It consists in decomposing the dynamics into two components: a physical component accounting for the dynamics for which we have some prior knowledge, and a data-driven component accounting for errors of the physical model. The learning problem is carefully formulated such that the physical model explains as much of the data as possible, while the data-driven component only describes information that cannot be captured by the physical model, no more, no less. 
This not only provides the existence and uniqueness for this decomposition, but also ensures interpretability and benefits generalization. Experiments made on three important use cases, each representative of a different family of phenomena, i.e. reaction-diffusion equations, wave equations and the non-linear damped pendulum, show that APHYNITY can efficiently leverage approximate physical models to accurately forecast the evolution of the system and correctly identify relevant physical parameters.",/pdf/26f8f5665ae13d683ef87f2ff8c2e5a321fac386.pdf,ICLR,2021,"We propose a new principled framework for combining physical models with deep data-driven networks, for which we provide theoretical decomposition guarantees." +SJaP_-xAb,ry6w_-g0W,1509070000000.0,1519420000000.0,217,Deep Learning with Logged Bandit Feedback,"[""tj@cs.cornell.edu"", ""adswamin@microsoft.com"", ""derijke@uva.nl""]","[""Thorsten Joachims"", ""Adith Swaminathan"", ""Maarten de Rijke""]","[""Batch Learning from Bandit Feedback"", ""Counterfactual Learning""]","We propose a new output layer for deep neural networks that permits the use of logged contextual bandit feedback for training. Such contextual bandit feedback can be available in huge quantities (e.g., logs of search engines, recommender systems) at little cost, opening up a path for training deep networks on orders of magnitude more data. To this effect, we propose a Counterfactual Risk Minimization (CRM) approach for training deep networks using an equivariant empirical risk estimator with variance regularization, BanditNet, and show how the resulting objective can be decomposed in a way that allows Stochastic Gradient Descent (SGD) training. We empirically demonstrate the effectiveness of the method by showing how deep networks -- ResNets in particular -- can be trained for object recognition without conventionally labeled images. 
",/pdf/bdcb5ca1186d8d3eacb5eafbd0f8e1f6a0d33eb2.pdf,ICLR,2018,The paper proposes a new output layer for deep networks that permits the use of logged contextual bandit feedback for training. +8xeBUgD8u9,7kHdZ5uo710,1601310000000.0,1615360000000.0,3663,Continual learning in recurrent neural networks,"[""behret@ethz.ch"", ""~Christian_Henning1"", ""~Maria_Cervera1"", ""~Alexander_Meulemans1"", ""~Johannes_Von_Oswald1"", ""~Benjamin_F_Grewe1""]","[""Benjamin Ehret"", ""Christian Henning"", ""Maria Cervera"", ""Alexander Meulemans"", ""Johannes Von Oswald"", ""Benjamin F Grewe""]","[""Recurrent Neural Networks"", ""Continual Learning""]","While a diverse collection of continual learning (CL) methods has been proposed to prevent catastrophic forgetting, a thorough investigation of their effectiveness for processing sequential data with recurrent neural networks (RNNs) is lacking. Here, we provide the first comprehensive evaluation of established CL methods on a variety of sequential data benchmarks. Specifically, we shed light on the particularities that arise when applying weight-importance methods, such as elastic weight consolidation, to RNNs. In contrast to feedforward networks, RNNs iteratively reuse a shared set of weights and require working memory to process input samples. We show that the performance of weight-importance methods is not directly affected by the length of the processed sequences, but rather by high working memory requirements, which lead to an increased need for stability at the cost of decreased plasticity for learning subsequent tasks. We additionally provide theoretical arguments supporting this interpretation by studying linear RNNs. Our study shows that established CL methods can be successfully ported to the recurrent case, and that a recent regularization approach based on hypernetworks outperforms weight-importance methods, thus emerging as a promising candidate for CL in RNNs. 
Overall, we provide insights on the differences between CL in feedforward networks and RNNs, while guiding towards effective solutions to tackle CL on sequential data.",/pdf/c8abb8aebd4518ff958e3cde3bb8c972fecd7cba.pdf,ICLR,2021,This paper studies the behavior of established approaches to the problem of continual learning in the context of recurrent neural networks. +45uOPa46Kh,vOJT79ASPjG,1601310000000.0,1615970000000.0,1868,Generative Language-Grounded Policy in Vision-and-Language Navigation with Bayes' Rule,"[""~Shuhei_Kurita1"", ""~Kyunghyun_Cho1""]","[""Shuhei Kurita"", ""Kyunghyun Cho""]","[""vision-and-language-navigation""]","Vision-and-language navigation (VLN) is a task in which an agent is embodied in a realistic 3D environment and follows an instruction to reach the goal node. While most of the previous studies have built and investigated a discriminative approach, we notice that there are in fact two possible approaches to building such a VLN agent: discriminative and generative. In this paper, we design and investigate a generative language-grounded policy which uses a language model to compute the distribution over all possible instructions i.e. all possible sequences of vocabulary tokens given action and the transition history. In experiments, we show that the proposed generative approach outperforms the discriminative approach in the Room-2-Room (R2R) and Room-4-Room (R4R) datasets, especially in the unseen environments. We further show that the combination of the generative and discriminative policies achieves close to the state-of-the art results in the R2R dataset, demonstrating that the generative and discriminative policies capture the different aspects of VLN.",/pdf/9971055694256fe2362f6a3739e1101c82e2941a.pdf,ICLR,2021,We propose the novel generative language-grounded policy for vision-and-language navigation(VLN). 
+B1IDRdeCW,SJSwCdgCZ,1509100000000.0,1519100000000.0,322,The High-Dimensional Geometry of Binary Neural Networks,"[""aga@berkeley.edu"", ""cberg500@berkeley.edu""]","[""Alexander G. Anderson"", ""Cory P. Berg""]","[""Binary Neural Networks"", ""Neural Network Visualization""]","Recent research has shown that one can train a neural network with binary weights and activations at train time by augmenting the weights with a high-precision continuous latent variable that accumulates small changes from stochastic gradient descent. However, there is a dearth of work to explain why one can effectively capture the features in data with binary weights and activations. Our main result is that the neural networks with binary weights and activations trained using the method of Courbariaux, Hubara et al. (2016) work because of the high-dimensional geometry of binary vectors. In particular, the ideal continuous vectors that extract out features in the intermediate representations of these BNNs are well-approximated by binary vectors in the sense that dot products are approximately preserved. Compared to previous research that demonstrated good classification performance with BNNs, our work explains why these BNNs work in terms of HD geometry. Furthermore, the results and analysis used on BNNs are shown to generalize to neural networks with ternary weights and activations. Our theory serves as a foundation for understanding not only BNNs but a variety of methods that seek to compress traditional neural networks. 
Furthermore, a better understanding of multilayer binary neural networks serves as a starting point for generalizing BNNs to other neural network architectures such as recurrent neural networks.",/pdf/f93f2b1aed30dcadc468595627c40bf498a0345b.pdf,ICLR,2018,Recent successes of Binary Neural Networks can be understood based on the geometry of high-dimensional binary vectors +rklEj2EFvB,SygIyRRxvr,1569440000000.0,1603310000000.0,149,Estimating Gradients for Discrete Random Variables by Sampling without Replacement,"[""w.w.m.kool@uva.nl"", ""h.c.vanhoof@uva.nl"", ""m.welling@uva.nl""]","[""Wouter Kool"", ""Herke van Hoof"", ""Max Welling""]","[""gradient"", ""estimator"", ""discrete"", ""categorical"", ""sampling"", ""without replacement"", ""reinforce"", ""baseline"", ""variance"", ""gumbel"", ""vae"", ""structured prediction""]","We derive an unbiased estimator for expectations over discrete random variables based on sampling without replacement, which reduces variance as it avoids duplicate samples. We show that our estimator can be derived as the Rao-Blackwellization of three different estimators. Combining our estimator with REINFORCE, we obtain a policy gradient estimator and we reduce its variance using a built-in control variate which is obtained without additional model evaluations. The resulting estimator is closely related to other gradient estimators. 
Experiments with a toy problem, a categorical Variational Auto-Encoder and a structured prediction problem show that our estimator is the only estimator that is consistently among the best estimators in both high and low entropy settings.",/pdf/8dc784339488d200937befc54b87ddf715be7eb1.pdf,ICLR,2020,"We derive a low-variance, unbiased gradient estimator for expectations over discrete random variables based on sampling without replacement" +t0TaKv0Gx6Z,Qt3C15fsJla,1601310000000.0,1615970000000.0,3174,Sliced Kernelized Stein Discrepancy,"[""~Wenbo_Gong1"", ""~Yingzhen_Li1"", ""~Jos\u00e9_Miguel_Hern\u00e1ndez-Lobato1""]","[""Wenbo Gong"", ""Yingzhen Li"", ""Jos\u00e9 Miguel Hern\u00e1ndez-Lobato""]","[""kernel methods"", ""variational inference"", ""particle inference""]","Kernelized Stein discrepancy (KSD), though being extensively used in goodness-of-fit tests and model learning, suffers from the curse-of-dimensionality. We address this issue by proposing the sliced Stein discrepancy and its scalable and kernelized variants, which employs kernel-based test functions defined on the optimal one-dimensional projections. When applied to goodness-of-fit tests, extensive experiments show the proposed discrepancy significantly outperforms KSD and various baselines in high dimensions. For model learning, we show its advantages by training an independent component analysis when compared with existing Stein discrepancy baselines. We further propose a novel particle inference method called sliced Stein variational gradient descent (S-SVGD) which alleviates the mode-collapse issue of SVGD in training variational autoencoders.",/pdf/39d9fa2661eb33fc05f7d9de6fddb979108767c4.pdf,ICLR,2021,"We proposed a method to tackle the curse-of-dimensionality issue of kernelized Stein discrepancy with RBF kernel, along with a novel particle inference algorithm resolving the vanishing repulsive issue of Stein variational gradient descent." 
+r1eVXa4KvH,rkxx_ZywDH,1569440000000.0,1577170000000.0,446,Concise Multi-head Attention Models,"[""bsrinadh@google.com"", ""chulheey@mit.edu"", ""ankitsrawat@google.com"", ""sashank@google.com"", ""sanjivk@google.com""]","[""Srinadh Bhojanapalli"", ""Chulhee Yun"", ""Ankit Singh Rawat"", ""Sashank Reddi"", ""Sanjiv Kumar""]","[""Transformers"", ""Attention"", ""Multihead"", ""expressive power"", ""embedding size""]","Attention based Transformer architecture has enabled significant advances in the field of natural language processing. In addition to new pre-training techniques, recent improvements crucially rely on working with a relatively larger embedding dimension for tokens. This leads to models that are prohibitively large to be employed in the downstream tasks. In this paper we identify one of the important factors contributing to the large embedding size requirement. In particular, our analysis highlights that the scaling between the number of heads and the size of each head in the existing architectures gives rise to this limitation, which we further validate with our experiments. As a solution, we propose a new way to set the projection size in attention heads that allows us to train models with a relatively smaller embedding dimension, without sacrificing the performance.",/pdf/21f779988c315a1e1614050079022c5a4b405a3b.pdf,ICLR,2020,Fixing the head size of the Transformer models allows one to train them with a smaller embedding size. 
+DE0MSwKv32y,XEy3MGklL5y,1601310000000.0,1614990000000.0,3396,"Trust, but verify: model-based exploration in sparse reward environments","[""~Konrad_Czechowski1"", ""tomaszo@impan.pl"", ""m.izworski@student.uw.edu.pl"", ""marek.zbysinski@gmail.com"", ""~\u0141ukasz_Kuci\u0144ski1"", ""~Piotr_Mi\u0142o\u015b1""]","[""Konrad Czechowski"", ""Tomasz Odrzyg\u00f3\u017ad\u017a"", ""Micha\u0142 Izworski"", ""Marek Zbysi\u0144ski"", ""\u0141ukasz Kuci\u0144ski"", ""Piotr Mi\u0142o\u015b""]","[""reinforcement learning"", ""model-based"", ""exploration"", ""on-line planning"", ""imperfect environment model""]","We propose $\textit{trust-but-verify}$ (TBV) mechanism, a new method which uses model uncertainty estimates to guide exploration. The mechanism augments graph search planning algorithms by the capacity to deal with learned model's imperfections. We identify certain type of frequent model errors, which we dub $\textit{false loops}$, and which are particularly dangerous for graph search algorithms in discrete environments. These errors impose falsely pessimistic expectations and thus hinder exploration. We confirm this experimentally and show that TBV can effectively alleviate them. TBV combined with MCTS or Best First Search forms an effective model-based reinforcement learning solution, which is able to robustly solve sparse reward problems. ",/pdf/eae2c04c4e4d20aec275fc01da17962bbee22714.pdf,ICLR,2021,We address exploration problems arising from on-line planning with learned environment models. 
+HklZUpEtvr,rJgnPz_DPH,1569440000000.0,1577170000000.0,550,"OPTIMAL TRANSPORT, CYCLEGAN, AND PENALIZED LS FOR UNSUPERVISED LEARNING IN INVERSE PROBLEMS","[""byeongsu.s@kaist.ac.kr"", ""okt0711@kaist.ac.kr"", ""sungjunlim@gmail.com"", ""jong.ye@kaist.ac.kr""]","[""Byeongsu Sim"", ""Gyutaek Oh"", ""Sungjun Lim"", ""and Jong Chul Ye""]","[""Optimal transport"", ""CycleGAN"", ""penalized LS"", ""unsupervised learning"", ""and inverse problems""]","The penalized least squares (PLS) is a classic approach to inverse problems, where a regularization term is added to stabilize the solution. Optimal transport (OT) is another mathematical framework for computer vision tasks by providing means to transport one measure to another at minimal cost. Cycle-consistent generative adversarial network (cycleGAN) is a recent extension of GAN to learn target distributions with less mode collapsing behavior. Although similar in that no supervised training is required, the algorithms look different, so the mathematical relationship between these approaches is not clear. In this article, we provide an important advance to unveil the missing link. Specifically, we reveal that a cycleGAN architecture can be derived as a dual formulation of the optimal transport problem, if the PLS with a deep learning penalty is used as a transport cost between the two probability measures from measurements and unknown images. This suggests that cycleGAN can be considered as stochastic generalization of classical PLS approaches. +Our derivation is so general that various types of cycleGAN architecture can be easily derived by merely changing the transport cost. 
As proofs of concept, this paper provides novel cycleGAN architecture for unsupervised learning in accelerated MRI and deconvolution microscopy problems, which confirm the efficacy and the flexibility of the theory.",/pdf/0dd7689b495c6dc23bdc4d56e74ecebe9a317135.pdf,ICLR,2020, +SylOlp4FvH,BkelAgmLvB,1569440000000.0,1583910000000.0,345,V-MPO: On-Policy Maximum a Posteriori Policy Optimization for Discrete and Continuous Control,"[""songf@google.com"", ""aabdolmaleki@google.com"", ""springenberg@google.com"", ""aidanclark@google.com"", ""soyer@google.com"", ""jwrae@google.com"", ""snoury@google.com"", ""arahuja@google.com"", ""liusiqi@google.com"", ""dhruvat@google.com"", ""heess@google.com"", ""danbelov@google.com"", ""riedmiller@google.com"", ""botvinick@google.com""]","[""H. Francis Song"", ""Abbas Abdolmaleki"", ""Jost Tobias Springenberg"", ""Aidan Clark"", ""Hubert Soyer"", ""Jack W. Rae"", ""Seb Noury"", ""Arun Ahuja"", ""Siqi Liu"", ""Dhruva Tirumala"", ""Nicolas Heess"", ""Dan Belov"", ""Martin Riedmiller"", ""Matthew M. Botvinick""]","[""reinforcement learning"", ""policy iteration"", ""multi-task learning"", ""continuous control""]","Some of the most successful applications of deep reinforcement learning to challenging domains in discrete and continuous control have used policy gradient methods in the on-policy setting. However, policy gradients can suffer from large variance that may limit performance, and in practice require carefully tuned entropy regularization to prevent policy collapse. As an alternative to policy gradient algorithms, we introduce V-MPO, an on-policy adaptation of Maximum a Posteriori Policy Optimization (MPO) that performs policy iteration based on a learned state-value function. We show that V-MPO surpasses previously reported scores for both the Atari-57 and DMLab-30 benchmark suites in the multi-task setting, and does so reliably without importance weighting, entropy regularization, or population-based tuning of hyperparameters. 
On individual DMLab and Atari levels, the proposed algorithm can achieve scores that are substantially higher than has previously been reported. V-MPO is also applicable to problems with high-dimensional, continuous action spaces, which we demonstrate in the context of learning to control simulated humanoids with 22 degrees of freedom from full state observations and 56 degrees of freedom from pixel observations, as well as example OpenAI Gym tasks where V-MPO achieves substantially higher asymptotic scores than previously reported.",/pdf/350ecb6259be416a19088c7bc18cff0a63802328.pdf,ICLR,2020,A state-value function-based version of MPO that achieves good results in a wide range of tasks in discrete and continuous control. +SJlEs1HKDr,Hyg_Ap0uvS,1569440000000.0,1577170000000.0,1912,Attentive Sequential Neural Processes,"[""jaesik817@gmail.com"", ""singh.gautam.iitg@gmail.com"", ""sjn.ahn@gmail.com""]","[""Jaesik Yoon"", ""Gautam Singh"", ""Sungjin Ahn""]","[""meta-learning"", ""neural processes"", ""attention"", ""sequential modeling""]","Sequential Neural Processes (SNP) is a new class of models that can meta-learn a temporal stochastic process of stochastic processes by modeling temporal transition between Neural Processes. As Neural Processes (NP) suffers from underfitting, SNP is also prone to the same problem, even more severely due to its temporal context compression. Applying attention which resolves the problem of NP, however, is a challenge in SNP, because it cannot store the past contexts over which it is supposed to apply attention. In this paper, we propose the Attentive Sequential Neural Processes (ASNP) that resolve the underfitting in SNP by introducing a novel imaginary context as a latent variable and by applying attention over the imaginary context. We evaluate our model on 1D Gaussian Process regression and 2D moving MNIST/CelebA regression. 
We apply ASNP to implement Attentive Temporal GQN and evaluate on the moving-CelebA task.",/pdf/222f621981d73170e62664875711b2bfc5b62c6b.pdf,ICLR,2020,"We introduce a new model, Attentive Sequential Neural Processes, that resolves the problem of augmenting attention mechanism on SNP." +BklXkCNYDB,H1ealUfdDS,1569440000000.0,1577170000000.0,886,Fast Training of Sparse Graph Neural Networks on Dense Hardware,"[""matej.balog@gmail.com"", ""bartvm@google.com"", ""smoitra@google.com"", ""yujiali@google.com"", ""dtarlow@google.com""]","[""Matej Balog"", ""Bart van Merri\u00ebnboer"", ""Subhodeep Moitra"", ""Yujia Li"", ""Daniel Tarlow""]",[],"Graph neural networks have become increasingly popular in recent years due to their ability to naturally encode relational input data and their ability to operate on large graphs by using a sparse representation of graph adjacency matrices. As we look to scale up these models using custom hardware, a natural assumption would be that we need hardware tailored to sparse operations and/or dynamic control flow. In this work, we question this assumption by scaling up sparse graph neural networks using a platform targeted at dense computation on fixed-size data. Drawing inspiration from optimization of numerical algorithms on sparse matrices, we develop techniques that enable training the sparse graph neural network model from Allamanis et al. (2018) in 13 minutes using a 512-core TPUv2 Pod, whereas the original training takes almost a day.",/pdf/32a75cd031b98506707ea3a594c98c5eea0f6f0c.pdf,ICLR,2020,Is sparse hardware necessary for training sparse GNNs? No. Does large-batch training work for sparse GNNs? Yes. So what? We can train a model in 13 minutes that previously took almost a day. 
+SkxJ-309FQ,H1xgIIpqtm,1538090000000.0,1545360000000.0,1131,Hallucinations in Neural Machine Translation,"[""katherinelee@google.com"", ""orhanf@google.com"", ""agarwal@google.com"", ""clarafy@berkeley.edu"", ""sussillo@google.com""]","[""Katherine Lee"", ""Orhan Firat"", ""Ashish Agarwal"", ""Clara Fannjiang"", ""David Sussillo""]","[""nmt"", ""translate"", ""dynamics"", ""rnn""]","Neural machine translation (NMT) systems have reached state of the art performance in translating text and are in wide deployment. Yet little is understood about how these systems function or break. Here we show that NMT systems are susceptible to producing highly pathological translations that are completely untethered from the source material, which we term hallucinations. Such pathological translations are problematic because they are deeply disturbing of user trust and easy to find with a simple search. We describe a method to generate hallucinations and show that many common variations of the NMT architecture are susceptible to them. We study a variety of approaches to reduce the frequency of hallucinations, including data augmentation, dynamical systems and regularization techniques, showing that data augmentation significantly reduces hallucination frequency. Finally, we analyze networks that produce hallucinations and show that there are signatures in the attention matrix as well as in the hidden states of the decoder.",/pdf/69cadf38432b0fbc59e9feb9172dcb2f37e17466.pdf,ICLR,2019,"We introduce and analyze the phenomenon of ""hallucinations"" in NMT, or spurious translations unrelated to source text, and propose methods to reduce its frequency." 
+Syxt2jC5FX,rJeidVj9tX,1538090000000.0,1550370000000.0,732,From Hard to Soft: Understanding Deep Network Nonlinearities via Vector Quantization and Statistical Inference,"[""randallbalestriero@gmail.com"", ""richb@rice.edu""]","[""Randall Balestriero"", ""Richard Baraniuk""]","[""Spline"", ""Vector Quantization"", ""Inference"", ""Nonlinearities"", ""Deep Network""]","Nonlinearity is crucial to the performance of a deep (neural) network (DN). +To date there has been little progress understanding the menagerie of available nonlinearities, but recently progress has been made on understanding the r\^{o}le played by piecewise affine and convex nonlinearities like the ReLU and absolute value activation functions and max-pooling. +In particular, DN layers constructed from these operations can be interpreted as {\em max-affine spline operators} (MASOs) that have an elegant link to vector quantization (VQ) and $K$-means. +While this is good theoretical progress, the entire MASO approach is predicated on the requirement that the nonlinearities be piecewise affine and convex, which precludes important activation functions like the sigmoid, hyperbolic tangent, and softmax. +{\em This paper extends the MASO framework to these and an infinitely large class of new nonlinearities by linking deterministic MASOs with probabilistic Gaussian Mixture Models (GMMs).} +We show that, under a GMM, piecewise affine, convex nonlinearities like ReLU, absolute value, and max-pooling can be interpreted as solutions to certain natural ``hard'' VQ inference problems, while sigmoid, hyperbolic tangent, and softmax can be interpreted as solutions to corresponding ``soft'' VQ inference problems. +We further extend the framework by hybridizing the hard and soft VQ optimizations to create a $\beta$-VQ inference that interpolates between hard, soft, and linear VQ inference. 
+A prime example of a $\beta$-VQ DN nonlinearity is the {\em swish} nonlinearity, which offers state-of-the-art performance in a range of computer vision tasks but was developed ad hoc by experimentation. +Finally, we validate with experiments an important assertion of our theory, namely that DN performance can be significantly improved by enforcing orthogonality in its linear filters. +",/pdf/586d2cb1a01d67a4a95dc22e4ebf511b0aa52404.pdf,ICLR,2019,Reformulate deep networks nonlinearities from a vector quantization scope and bridge most known nonlinearities together. +H1xscnEKDr,BylgTkllwS,1569440000000.0,1583910000000.0,128,Defending Against Physically Realizable Attacks on Image Classification,"[""tongwu@wustl.edu"", ""liangtong@wustl.edu"", ""yvorobeychik@wustl.edu""]","[""Tong Wu"", ""Liang Tong"", ""Yevgeniy Vorobeychik""]","[""defense against physical attacks"", ""adversarial machine learning""]","We study the problem of defending deep neural network approaches for image classification from physically realizable attacks. First, we demonstrate that the two most scalable and effective methods for learning robust models, adversarial training with PGD attacks and randomized smoothing, exhibit very limited effectiveness against three of the highest profile physical attacks. Next, we propose a new abstract adversarial model, rectangular occlusion attacks, in which an adversary places a small adversarially crafted rectangle in an image, and develop two approaches for efficiently computing the resulting adversarial examples. 
Finally, we demonstrate that adversarial training using our new attack yields image classification models that exhibit high robustness against the physically realizable attacks we study, offering the first effective generic defense against such attacks.",/pdf/429b971c79c90fdb29acb165334ce76ef2ea9206.pdf,ICLR,2020,Defending Against Physically Realizable Attacks on Image Classification +jz7tDvX6XYR,LGTEAF72IVK,1601310000000.0,1614990000000.0,2680,Speeding up Deep Learning Training by Sharing Weights and Then Unsharing,"[""~Shuo_Yang6"", ""~Le_Hou1"", ""~Xiaodan_Song1"", ""~qiang_liu4"", ""~Denny_Zhou1""]","[""Shuo Yang"", ""Le Hou"", ""Xiaodan Song"", ""qiang liu"", ""Denny Zhou""]","[""fast training"", ""BERT"", ""transformer"", ""weight sharing"", ""deep learning""]","It has been widely observed that increasing deep learning model sizes often leads to significant performance improvements on a variety of natural language processing and computer vision tasks. In the meantime, however, computational costs and training time would dramatically increase when models get larger. In this paper, we propose a simple approach to speed up training for a particular kind of deep networks which contain repeated structures, such as the transformer module. In our method, we first train such a deep network with the weights shared across all the repeated layers till some point. We then stop weight sharing and continue training until convergence. The untying point is automatically determined by monitoring gradient statistics. Our adaptive untying criterion is obtained from a theoretic analysis over deep linear networks. Empirical results show that our method is able to reduce the training time of BERT by 50%. 
",/pdf/de4e739274cfaebbffa6e2a7fe7eb5ee356b0ba8.pdf,ICLR,2021,Speeding up deep learning training by sharing weights and then unsharing +oxnp2q-PGL4,RcVSk8SXuow,1601310000000.0,1615650000000.0,3256,Lossless Compression of Structured Convolutional Models via Lifting,"[""~Gustav_Sourek1"", ""zelezny@fel.cvut.cz"", ""~Ondrej_Kuzelka1""]","[""Gustav Sourek"", ""Filip Zelezny"", ""Ondrej Kuzelka""]","[""weight sharing"", ""graph neural networks"", ""lifted inference"", ""relational learning"", ""dynamic computation graphs"", ""convolutional models""]","Lifting is an efficient technique to scale up graphical models generalized to relational domains by exploiting the underlying symmetries. Concurrently, neural models are continuously expanding from grid-like tensor data into structured representations, such as various attributed graphs and relational databases. To address the irregular structure of the data, the models typically extrapolate on the idea of convolution, effectively introducing parameter sharing in their, dynamically unfolded, computation graphs. The computation graphs themselves then reflect the symmetries of the underlying data, similarly to the lifted graphical models. Inspired by lifting, we introduce a simple and efficient technique to detect the symmetries and compress the neural models without loss of any information. We demonstrate through experiments that such compression can lead to significant speedups of structured convolutional models, such as various Graph Neural Networks, across various tasks, such as molecule classification and knowledge-base completion.",/pdf/6ca46d0a2419236e20aac30bbf133f4c81154953.pdf,ICLR,2021,"Speeding up weight-sharing dynamic neural computation graphs, such as GNNs, with lifted inference." 
+BJlBSkHtDS,SyxF6ba_wr,1569440000000.0,1583910000000.0,1690,Padé Activation Units: End-to-end Learning of Flexible Activation Functions in Deep Networks,"[""molina@cs.tu-darmstadt.de"", ""schramowski@cs.tu-darmstadt.de"", ""kersting@cs.tu-darmstadt.de""]","[""Alejandro Molina"", ""Patrick Schramowski"", ""Kristian Kersting""]",[],"The performance of deep network learning strongly depends on the choice of the non-linear activation function associated with each neuron. However, deciding on the best activation is non-trivial and the choice depends on the architecture, hyper-parameters, and even on the dataset. Typically these activations are fixed by hand before training. Here, we demonstrate how to eliminate the reliance on first picking fixed activation functions by using flexible parametric rational functions instead. The resulting Padé Activation Units (PAUs) can both approximate common activation functions and also learn new ones while providing compact representations. Our empirical evidence shows that end-to-end learning deep networks with PAUs can increase the predictive performance. Moreover, PAUs pave the way to approximations with provable robustness.",/pdf/7c3c245f99116c181b7fcca470559c4c2456ce8d.pdf,ICLR,2020,"We introduce PAU, a new learnable activation function for neural networks. They free the network designers from the activation selection process and increase the test prediction accuracy." +i7aMbliTkHs,yHSIQGVIGi,1601310000000.0,1614990000000.0,406,TAM: Temporal Adaptive Module for Video Recognition,"[""~Zhaoyang_Liu1"", ""~Limin_Wang1"", ""~Wayne_Wu1"", ""~Chen_Qian1"", ""~Tong_Lu1""]","[""Zhaoyang Liu"", ""Limin Wang"", ""Wayne Wu"", ""Chen Qian"", ""Tong Lu""]","[""Action Recognition"", ""Temporal Adaptive Module"", ""Temporal Adaptive Network""]","Temporal modeling is crucial for capturing spatiotemporal structure in videos for action recognition. 
Video data is with extremely complex dynamics along its temporal dimension due to various factors such as camera motion, speed variation, and different activities. To effectively capture this diverse motion pattern, this paper presents a new temporal adaptive module ({\bf TAM}) to generate video-specific kernels based on its own feature maps. TAM proposes a unique two-level adaptive modeling scheme by decoupling dynamic kernels into a location sensitive importance map and a location invariant aggregation weight. The importance map is learned in a local temporal window to capture short term information, while the aggregation weight is generated from a global view with a focus on long-term structure. TAM is a principled module and could be integrated into 2D CNNs to yield a powerful video architecture (TANet) with a very small extra computational cost. The extensive experiments on Kinetics-400 and Something-Something datasets, demonstrate that the TAM outperforms other temporal modeling methods consistently owing to its temporal adaptive modeling strategy.",/pdf/41f7405b8750a9cc2034f2da4cb42cf2c1dcc0f3.pdf,ICLR,2021,"As the video data has extremely complex dynamics along its temporal dimension, we thus propose a temporal adaptive module decoupled by a location sensitive map and a location invariant weight to capture the temporal clues in a dynamic scheme." +JCRblSgs34Z,DOKtlhUKmn1,1601310000000.0,1616060000000.0,2340,Fantastic Four: Differentiable and Efficient Bounds on Singular Values of Convolution Layers,"[""~Sahil_Singla1"", ""~Soheil_Feizi2""]","[""Sahil Singla"", ""Soheil Feizi""]","[""spectral regularization"", ""spectral normalization""]","In deep neural networks, the spectral norm of the Jacobian of a layer bounds the factor by which the norm of a signal changes during forward/backward propagation. Spectral norm regularizations have been shown to improve generalization, robustness and optimization of deep learning methods. 
Existing methods to compute the spectral norm of convolution layers either rely on heuristics that are efficient in computation but lack guarantees or are theoretically-sound but computationally expensive. In this work, we obtain the best of both worlds by deriving {\it four} provable upper bounds on the spectral norm of a standard 2D multi-channel convolution layer. These bounds are differentiable and can be computed efficiently during training with negligible overhead. One of these bounds is in fact the popular heuristic method of Miyato et al. (multiplied by a constant factor depending on filter sizes). Each of these four bounds can achieve the tightest gap depending on convolution filters. Thus, we propose to use the minimum of these four bounds as a tight, differentiable and efficient upper bound on the spectral norm of convolution layers. Moreover, our spectral bound is an effective regularizer and can be used to bound either the lipschitz constant or curvature values (eigenvalues of the Hessian) of neural networks. Through experiments on MNIST and CIFAR-10, we demonstrate the effectiveness of our spectral bound in improving generalization and robustness of deep networks.",/pdf/6c7018c5dcc64de7e42204d28cf786cb4a596c69.pdf,ICLR,2021,"We derive four provable upper bounds on the largest singular value of convolution layers that are differentiable, independent of size of input image and can be computed efficiently during training with negligible overhead." +r1gixp4FPH,Skg9sL48vH,1569440000000.0,1601780000000.0,350,Accelerating SGD with momentum for over-parameterized learning,"[""liu.2656@buckeyemail.osu.edu"", ""mbelkin@cse.ohio-state.edu""]","[""Chaoyue Liu"", ""Mikhail Belkin""]","[""SGD"", ""acceleration"", ""momentum"", ""stochastic"", ""over-parameterized"", ""Nesterov""]"," +Nesterov SGD is widely used for training modern neural networks and other machine learning models. Yet, its advantages over SGD have not been theoretically clarified. 
Indeed, as we show in this paper, both theoretically and empirically, Nesterov SGD with any parameter selection does not in general provide acceleration over ordinary SGD. Furthermore, Nesterov SGD may diverge for step sizes that ensure convergence of ordinary SGD. This is in contrast to the classical results in the deterministic setting, where the same step size ensures accelerated convergence of the Nesterov's method over optimal gradient descent. + +To address the non-acceleration issue, we introduce a compensation term to Nesterov SGD. The resulting algorithm, which we call MaSS, converges for same step sizes as SGD. We prove that MaSS obtains an accelerated convergence rates over SGD for any mini-batch size in the linear setting. For full batch, the convergence rate of MaSS matches the well-known accelerated rate of the Nesterov's method. + +We also analyze the practically important question of the dependence of the convergence rate and optimal hyper-parameters on the mini-batch size, demonstrating three distinct regimes: linear scaling, diminishing returns and saturation. + +Experimental evaluation of MaSS for several standard architectures of deep networks, including ResNet and convolutional networks, shows improved performance over SGD, Nesterov SGD and Adam. ",/pdf/67006b8932120072903be0eb7bfca6fc324e26c6.pdf,ICLR,2020,"This work proves the non-acceleration of Nesterov SGD with any hyper-parameters, and proposes new algorithm which provably accelerates SGD in the over-parameterized setting." 
+SysEexbRb,B15EglbRW,1509130000000.0,1519090000000.0,549,Critical Points of Linear Neural Networks: Analytical Forms and Landscape Properties,"[""zhou.1172@osu.edu"", ""liang.889@osu.edu""]","[""Yi Zhou"", ""Yingbin Liang""]","[""neural networks"", ""critical points"", ""analytical form"", ""landscape""]","Due to the success of deep learning to solving a variety of challenging machine learning tasks, there is a rising interest in understanding loss functions for training neural networks from a theoretical aspect. Particularly, the properties of critical points and the landscape around them are of importance to determine the convergence performance of optimization algorithms. In this paper, we provide a necessary and sufficient characterization of the analytical forms for the critical points (as well as global minimizers) of the square loss functions for linear neural networks. We show that the analytical forms of the critical points characterize the values of the corresponding loss functions as well as the necessary and sufficient conditions to achieve global minimum. Furthermore, we exploit the analytical forms of the critical points to characterize the landscape properties for the loss functions of linear neural networks and shallow ReLU networks. One particular conclusion is that: While the loss function of linear networks has no spurious local minimum, the loss function of one-hidden-layer nonlinear networks with ReLU activation function does have local minimum that is not global minimum.",/pdf/839424a5fb3ab577c5cd4849b07aef28c62993ee.pdf,ICLR,2018,"We provide necessary and sufficient analytical forms for the critical points of the square loss functions for various neural networks, and exploit the analytical forms to characterize the landscape properties for the loss functions of these neural networks." 
+1Jv6b0Zq3qi,Fa_YBPWF0S,1601310000000.0,1616530000000.0,1680,Uncertainty in Gradient Boosting via Ensembles,"[""~Andrey_Malinin1"", ""~Liudmila_Prokhorenkova1"", ""austimenko@yandex-team.ru""]","[""Andrey Malinin"", ""Liudmila Prokhorenkova"", ""Aleksei Ustimenko""]","[""uncertainty"", ""ensembles"", ""gradient boosting"", ""decision trees"", ""knowledge uncertainty""]","For many practical, high-risk applications, it is essential to quantify uncertainty in a model's predictions to avoid costly mistakes. While predictive uncertainty is widely studied for neural networks, the topic seems to be under-explored for models based on gradient boosting. However, gradient boosting often achieves state-of-the-art results on tabular data. This work examines a probabilistic ensemble-based framework for deriving uncertainty estimates in the predictions of gradient boosting classification and regression models. We conducted experiments on a range of synthetic and real datasets and investigated the applicability of ensemble approaches to gradient boosting models that are themselves ensembles of decision trees. Our analysis shows that ensembles of gradient boosting models successfully detect anomalous inputs while having limited ability to improve the predicted total uncertainty. Importantly, we also propose a concept of a virtual ensemble to get the benefits of an ensemble via only one gradient boosting model, which significantly reduces complexity. ",/pdf/d20bb72ed1f0c69fa43f4363ec90050b0e79af3d.pdf,ICLR,2021,Propose and analyze an ensemble-based framework for deriving uncertainty estimates in GBDT models. 
+Pbj8H_jEHYv,AdNVcoggT7k,1601310000000.0,1618440000000.0,2647,Orthogonalizing Convolutional Layers with the Cayley Transform,"[""~Asher_Trockman1"", ""~J_Zico_Kolter1""]","[""Asher Trockman"", ""J Zico Kolter""]","[""orthogonal layers"", ""Lipschitz constrained networks"", ""adversarial robustness""]","Recent work has highlighted several advantages of enforcing orthogonality in the weight layers of deep networks, such as maintaining the stability of activations, preserving gradient norms, and enhancing adversarial robustness by enforcing low Lipschitz constants. Although numerous methods exist for enforcing the orthogonality of fully-connected layers, those for convolutional layers are more heuristic in nature, often focusing on penalty methods or limited classes of convolutions. In this work, we propose and evaluate an alternative approach to directly parameterize convolutional layers that are constrained to be orthogonal. Specifically, we propose to apply the Cayley transform to a skew-symmetric convolution in the Fourier domain, so that the inverse convolution needed by the Cayley transform can be computed efficiently. We compare our method to previous Lipschitz-constrained and orthogonal convolutional layers and show that it indeed preserves orthogonality to a high degree even for large convolutions. Applied to the problem of certified adversarial robustness, we show that networks incorporating the layer outperform existing deterministic methods for certified defense against $\ell_2$-norm-bounded adversaries, while scaling to larger architectures than previously investigated. 
Code is available at https://github.com/locuslab/orthogonal-convolutions.",/pdf/6ae88022a0cffd7671fd9602be34926437a0f6cb.pdf,ICLR,2021, +_zHHAZOLTVh,CqroyhqAVsC,1601310000000.0,1614990000000.0,1943,A Maximum Mutual Information Framework for Multi-Agent Reinforcement Learning,"[""~Woojun_Kim1"", ""~Whiyoung_Jung1"", ""ms.cho@kaist.ac.kr"", ""~Youngchul_Sung1""]","[""Woojun Kim"", ""Whiyoung Jung"", ""Myungsik Cho"", ""Youngchul Sung""]","[""Multi-agent reinforcement learning"", ""coordination"", ""mutual information""]","In this paper, we propose a maximum mutual information (MMI) framework for multi-agent reinforcement learning (MARL) to enable multiple agents to learn coordinated behaviors by regularizing the accumulated return with the mutual information between actions. By introducing a latent variable to induce nonzero mutual information between actions and applying a variational bound, we derive a tractable lower bound on the considered MMI-regularized objective function. Applying policy iteration to maximize the derived lower bound, we propose a practical algorithm named variational maximum mutual information multi-agent actor-critic (VM3-AC), which follows centralized learning with decentralized execution (CTDE). We evaluated VM3-AC for several games requiring coordination, and numerical results show that VM3-AC outperforms MADDPG and other MARL algorithms in multi-agent tasks requiring coordination.",/pdf/f712f0bffe83005c7488610daf17831e2f42d641.pdf,ICLR,2021,This paper proposes a new framework for multi-agent reinforcement learning named maximum mutual information to enable the multiple agents to learn coordinated behaviors. 
+KOtxfjpQsq,pAH-UC3-n3,1601310000000.0,1614990000000.0,162,Meta-Model-Based Meta-Policy Optimization,"[""takuya-h1@nec.com"", ""~Takahisa_Imagawa1"", ""~Voot_Tangkaratt1"", ""~Takayuki_Osa1"", ""takashi.onishi@nec.com"", ""~Yoshimasa_Tsuruoka1""]","[""Takuya Hiraoka"", ""Takahisa Imagawa"", ""Voot Tangkaratt"", ""Takayuki Osa"", ""Takashi Onishi"", ""Yoshimasa Tsuruoka""]",[],"Model-based reinforcement learning (MBRL) has been applied to meta-learning settings and has demonstrated its high sample efficiency. +However, in previous MBRL for meta-learning settings, policies are optimized via rollouts that fully rely on a predictive model of an environment. +Thus, its performance in a real environment tends to degrade when the predictive model is inaccurate. +In this paper, we prove that performance degradation can be suppressed by using branched meta-rollouts. +On the basis of this theoretical analysis, we propose Meta-Model-based Meta-Policy Optimization (M3PO), in which the branched meta-rollouts are used for policy optimization. +We demonstrate that M3PO outperforms existing meta reinforcement learning methods in continuous-control benchmarks. ",/pdf/285ba1b769e7bf3431a80d5dde0267d27a2c8804.pdf,ICLR,2021, +SygfNCEYDH,BJlAIfUuvS,1569440000000.0,1577170000000.0,1066,Weakly-supervised Knowledge Graph Alignment with Adversarial Learning,"[""meng.qu@umontreal.ca"", ""jian.tang@hec.ca"", ""yoshua.bengio@mila.quebec""]","[""Meng Qu"", ""Jian Tang"", ""Yoshua Bengio""]",[],"This paper studies aligning knowledge graphs from different sources or languages. Most existing methods train supervised methods for the alignment, which usually require a large number of aligned knowledge triplets. However, such a large number of aligned knowledge triplets may not be available or are expensive to obtain in many domains. 
Therefore, in this paper we propose to study aligning knowledge graphs in fully-unsupervised or weakly-supervised fashion, i.e., without or with only a few aligned triplets. We propose an unsupervised framework to align the entity and relation embeddings of different knowledge graphs with an adversarial learning framework. Moreover, a regularization term which maximizes the mutual information between the embeddings of different knowledge graphs is used to mitigate the problem of mode collapse when learning the alignment functions. Such a framework can be further seamlessly integrated with existing supervised methods by utilizing a limited number of aligned triples as guidance. Experimental results on multiple datasets prove the effectiveness of our proposed approach in both the unsupervised and the weakly-supervised settings.",/pdf/cdacfcbed90f4ba17ce9aa8c16d3b1995dcf32a2.pdf,ICLR,2020, +0F_OC_oROWb,134GXlIaKgA,1601310000000.0,1614990000000.0,1082,RSO: A Gradient Free Sampling Based Approach For Training Deep Neural Networks,"[""~Rohun_Tripathi1"", ""~Bharat_Singh2""]","[""Rohun Tripathi"", ""Bharat Singh""]",[],"We propose RSO (random search optimization), a gradient free, sampling based approach for training deep neural networks. To this end, RSO adds a perturbation to a weight in a deep neural network and tests if it reduces the loss on a mini-batch. If this reduces the loss, the weight is updated, otherwise the existing weight is retained. Surprisingly, we find that repeating this process a few times for each weight is sufficient to train a deep neural network. The number of weight updates for RSO is an order of magnitude lesser when compared to backpropagation with SGD. RSO can make aggressive weight updates in each step as there is no concept of learning rate. The weight update step for individual layers is also not coupled with the magnitude of the loss. 
RSO is evaluated on classification tasks on MNIST and CIFAR-10 datasets with deep neural networks of 6 to 10 layers where it achieves an accuracy of 99.1% and 81.8% respectively. We also find that after updating the weights just 5 times, the algorithm obtains a classification accuracy of 98% on MNIST.",/pdf/342d230034444482105d6a4df3ca7ec2f61c20a3.pdf,ICLR,2021, +HkMCybx0-,SJMAy-eAZ,1509060000000.0,1518730000000.0,214,Improving Deep Learning by Inverse Square Root Linear Units (ISRLUs),"[""bradcarlile@yahoo.com"", ""info@aiperf.com""]","[""Brad Carlile"", ""Guy Delamarter"", ""Paul Kinney"", ""Akiko Marti"", ""Brian Whitney""]","[""Deep learning"", ""Theory""]","We introduce the “inverse square root linear unit” (ISRLU) to speed up learning in deep neural networks. ISRLU has better performance than ELU but has many of the same benefits. ISRLU and ELU have similar curves and characteristics. Both have negative values, allowing them to push mean unit activation closer to zero, and bring the normal gradient closer to the unit natural gradient, ensuring a noise- robust deactivation state, lessening the over fitting risk. The significant performance advantage of ISRLU on traditional CPUs also carry over to more efficient HW implementations on HW/SW codesign for CNNs/RNNs. In experiments with TensorFlow, ISRLU leads to faster learning and better generalization than ReLU on CNNs. This work also suggests a computationally efficient variant called the “inverse square root unit” (ISRU) which can be used for RNNs. Many RNNs use either long short-term memory (LSTM) and gated recurrent units (GRU) which are implemented with tanh and sigmoid activation functions. ISRU has less computational complexity but still has a similar curve to tanh and sigmoid.",/pdf/395b9a0c69410f349dadbfd5e8f8fe25526e3481.pdf,ICLR,2018,We introduce the ISRLU activation function which is continuously differentiable and faster than ELU. The related ISRU replaces tanh & sigmoid. 
+rkl44TEtwH,SJxJf2Mvvr,1569440000000.0,1577170000000.0,483,Composable Semi-parametric Modelling for Long-range Motion Generation,"[""xjwxjw@sjtu.edu.cn"", ""huazhe_xu@eecs.berkeley.edu"", ""nibingbing@sjtu.edu.cn"", ""xkyang@sjtu.edu.cn"", ""trevor@eecs.berkeley.edu""]","[""Jingwei Xu"", ""Huazhe Xu"", ""Bingbing Ni"", ""Xiaokang Yang"", ""Trevor Darrell""]","[""Semi-parametric"", ""Long-range"", ""Motion Generation""]","Learning diverse and natural behaviors is one of the longstanding goals for creating intelligent characters in the animated world. In this paper, we propose ``COmposable Semi-parametric MOdelling'' (COSMO), a method for generating long range diverse and distinctive behaviors to achieve a specific goal location. Our proposed method learns to model the motion of human by combining the complementary strengths of both non-parametric techniques and parametric ones. Given the starting and ending state, a memory bank is used to retrieve motion references that are provided as source material to a deep network. The synthesis is performed by a deep network that controls the style of the provided motion material and modifies it to become natural. On skeleton datasets with diverse motion, we show that the proposed method outperforms existing parametric and non-parametric baselines. We also demonstrate the generated sequences are useful as subgoals for actual physical execution in the animated world. ",/pdf/cfafeb5863514ebcd798584cd7a4d84bc79e9675.pdf,ICLR,2020,"We propose a semi-parametric model to generate long-range, diverse and visually natural motion sequence." 
+SJf_XhCqKm,SJgZ-z0qt7,1538090000000.0,1545360000000.0,1373,Open Loop Hyperparameter Optimization and Determinantal Point Processes,"[""jessed@cs.cmu.edu"", ""jamieson@cs.washington.edu"", ""nasmith@cs.washington.edu""]","[""Jesse Dodge"", ""Kevin Jamieson"", ""Noah Smith""]","[""hyperparameter optimization"", ""black box optimization""]","Driven by the need for parallelizable hyperparameter optimization methods, this paper studies open loop search methods: sequences that are predetermined and can be generated before a single configuration is evaluated. Examples include grid search, uniform random search, low discrepancy sequences, and other sampling distributions. +In particular, we propose the use of k-determinantal point processes in hyperparameter optimization via random search. Compared to conventional uniform random search where hyperparameter settings are sampled independently, a k-DPP promotes diversity. We describe an approach that transforms hyperparameter search spaces for efficient use with a k-DPP. In addition, we introduce a novel Metropolis-Hastings algorithm which can sample from k-DPPs defined over any space from which uniform samples can be drawn, including spaces with a mixture of discrete and continuous dimensions or tree structure. Our experiments show significant benefits in realistic scenarios with a limited budget for training supervised learners, whether in serial or parallel.",/pdf/3531cd38c529efc4f988285026fa60bc6ffe9575.pdf,ICLR,2019,We address fully parallel hyperparameter optimization with Determinantal Point Processes. +Hyfg5o0qtm,SylvyZEwYQ,1538090000000.0,1545360000000.0,504,Temporal Gaussian Mixture Layer for Videos,"[""ajpiergi@indiana.edu"", ""mryoo@indiana.edu""]","[""AJ Piergiovanni"", ""Michael S. Ryoo""]",[],"We introduce a new convolutional layer named the Temporal Gaussian Mixture (TGM) layer and present how it can be used to efficiently capture longer-term temporal information in continuous activity videos. 
The TGM layer is a temporal convolutional layer governed by a much smaller set of parameters (e.g., location/variance of Gaussians) that are fully differentiable. We present our fully convolutional video models with multiple TGM layers for activity detection. The experiments on multiple datasets including Charades and MultiTHUMOS confirm the effectiveness of TGM layers, outperforming the state-of-the-arts.",/pdf/dc92e315d1776df077123cd50dfd23c6cc843e98.pdf,ICLR,2019, +H1ep5TNKwr,Bkeb2VCvwB,1569440000000.0,1577170000000.0,725,Hebbian Graph Embeddings,"[""shalin.shah@target.com"", ""venkataramana.kini@target.com""]","[""Shalin Shah"", ""Venkataramana Kini""]","[""graph embeddings"", ""hebbian learning"", ""simulated annealing""]","Representation learning has recently been successfully used to create vector representations of entities in language learning, recommender systems and in similarity learning. Graph embeddings exploit the locality structure of a graph and generate embeddings for nodes which could be words in a language, products on a retail website; and the nodes are connected based on a context window. In this paper, we consider graph embeddings with an error-free associative learning update rule, which models the embedding vector of node as a non-convex Gaussian mixture of the embeddings of the nodes in its immediate vicinity with some constant variance that is reduced as iterations progress. It is very easy to parallelize our algorithm without any form of shared memory, which makes it possible to use it on very large graphs with a much higher dimensionality of the embeddings. We study the efficacy of proposed method on several benchmark data sets in Goyal & Ferrara(2018b) and favorably compare with state of the art methods. 
Further, proposed method is applied to generate relevant recommendations for a large retailer.",/pdf/09ac4ce2cdccca6e5954e3869793211362ecc33e.pdf,ICLR,2020,"Graph embeddings for link prediction, reconstruction and for a recommender system" +ryAe2WBee,,1477940000000.0,1481740000000.0,26,Multi-label learning with semantic embeddings,"[""lpjing@bjtu.edu.cn"", ""15112085@bjtu.edu.cn"", ""11112191@bjtu.edu.cn"", ""gittens@icsi.berkeley.edu"", ""mmahoney@stat.berkeley.edu""]","[""Liping Jing"", ""MiaoMiao Cheng"", ""Liu Yang"", ""Alex Gittens"", ""Michael W. Mahoney""]","[""Supervised Learning""]","Multi-label learning aims to automatically assign to an instance (e.g., an image or a document) the most relevant subset of labels from a large set of possible labels. The main challenge is to maintain accurate predictions while scaling efficiently on data sets with extremely large label sets and many training data points. We propose a simple but effective neural net approach, the Semantic Embedding Model (SEM), that models the labels for an instance as draws from a multinomial distribution parametrized by nonlinear functions of the instance features. A Gauss-Seidel mini-batch adaptive gradient descent algorithm is used to fit the model. To handle extremely large label sets, we propose and experimentally validate the efficacy of fitting randomly chosen marginal label distributions. Experimental results on eight real-world data sets show that SEM garners significant performance gains over existing methods. In particular, we compare SEM to four recent state-of-the-art algorithms (NNML, BMLPL, REmbed, and SLEEC) and find that SEM uniformly outperforms these algorithms in several widely used evaluation metrics while requiring significantly less training time. 
+",/pdf/d3baddd0392bb8aa948871c918b11cd345b68566.pdf,ICLR,2017,"The SEM approach to multi-label learning models labels using multinomial distributions parametrized by nonlinear functions of the instance features, is scalable and outperforms current state-of-the-art algorithms" +Bkel6ertwS,ryg5RxWFDB,1569440000000.0,1577170000000.0,2562,Learning DNA folding patterns with Recurrent Neural Networks ,"[""michal.rozenwald@gmail.com"", ""agalitzina@gmail.com"", ""ekhrameeva@gmail.com"", ""grigory.sapunov@gmail.codelfm"", ""mikhail.gelfand@gmail.com""]","[""Michal Rozenwald"", ""Aleksandra Galitsyna"", ""Ekaterina Khrameeva"", ""Grigory Sapunov"", ""Mikhail S. Gelfand""]","[""Machine Learning"", ""Recurrent Neural Networks"", ""3D chromatin structure"", ""topologically associating domains"", ""computational biology.""]"," +The recent expansion of machine learning applications to molecular biology proved to have a significant contribution to our understanding of biological systems, and genome functioning in particular. Technological advances enabled the collection of large epigenetic datasets, including information about various DNA binding factors (ChIP-Seq) and DNA spatial structure (Hi-C). Several studies have confirmed the correlation between DNA binding factors and Topologically Associating Domains (TADs) in DNA structure. However, the information about physical proximity represented by genomic coordinate was not yet used for the improvement of the prediction models. + +In this research, we focus on Machine Learning methods for prediction of folding patterns of DNA in a classical model organism Drosophila melanogaster. The paper considers linear models with four types of regularization, Gradient Boosting and Recurrent Neural Networks for the prediction of chromatin folding patterns from epigenetic marks. The bidirectional LSTM RNN model outperformed all the models and gained the best prediction scores. 
This demonstrates the utilization of complex models and the importance of memory of sequential DNA states for the chromatin folding. We identify informative epigenetic features that lead to the further conclusion of their biological significance.",/pdf/203cd6f32544775beb171bdda9ed7221c2e0829e.pdf,ICLR,2020,We apply RNN to solve the biological problem of chromatin folding patterns prediction from epigenetic marks and demonstrate for the first time that utilization of memory of sequential states on DNA molecule is significant for the best performance. +i7aDkDEXJQU,b8yQ3op6KFj,1601310000000.0,1614990000000.0,2699,Demystifying Learning of Unsupervised Neural Machine Translation,"[""~Guanlin_Li1"", ""~lemao_liu1"", ""~Taro_Watanabe1"", ""~Conghui_Zhu2"", ""~Tiejun_Zhao1""]","[""Guanlin Li"", ""lemao liu"", ""Taro Watanabe"", ""Conghui Zhu"", ""Tiejun Zhao""]","[""Unsupervised Neural Machine Translation"", ""Marginal Likelihood Maximization"", ""Mutual Information""]","Unsupervised Neural Machine Translation or UNMT has received great attention +in recent years. Though tremendous empirical improvements have been achieved, +there still lacks theory-oriented investigation and thus some fundamental +questions like \textit{why} certain training protocol can work or not under +\textit{what} circumstances have not yet been well understood. This paper +attempts to provide theoretical insights for the above questions. Specifically, +following the methodology of comparative study, we leverage two perspectives, +i) \textit{marginal likelihood maximization} and ii) \textit{mutual information} +from information theory, to understand the different learning effects from the +standard training protocol and its variants. Our detailed analyses reveal +several critical conditions for the successful training of UNMT.",/pdf/8a659619cdddad8a041bd3c3d26715920b20ee05.pdf,ICLR,2021,Try to demystify why dae+bt training can lead to successfully trained UNMT model with decent performance. 
+S1lyyANYwr,rkxaVXzuvr,1569440000000.0,1577170000000.0,878,Constrained Markov Decision Processes via Backward Value Functions,"[""harsh.satija@mail.mcgill.ca"", ""philip.amortila@mail.mcgill.ca"", ""jpineau@cs.mcgill.ca""]","[""Harsh Satija"", ""Philip Amortila"", ""Joelle Pineau""]","[""Reinforcement Learning"", ""Constrained Markov Decision Processes"", ""Deep Reinforcement Learning""]","Although Reinforcement Learning (RL) algorithms have found tremendous success in simulated domains, they often cannot directly be applied to physical systems, especially in cases where there are hard constraints to satisfy (e.g. on safety or resources). In standard RL, the agent is incentivized to explore any behavior as long as it maximizes rewards, but in the real world undesired behavior can damage either the system or the agent in a way that breaks the learning process itself. In this work, we model the problem of learning with constraints as a Constrained Markov Decision Process, and provide a new on-policy formulation for solving it. A key contribution of our approach is to translate cumulative cost constraints into state-based constraints. Through this, we define a safe policy improvement method which maximizes returns while ensuring that the constraints are satisfied at every step. We provide theoretical guarantees under which the agent converges while ensuring safety over the course of training. We also highlight computational advantages of this approach. The effectiveness of our approach is demonstrated on safe navigation tasks and in safety-constrained versions of MuJoCo environments, with deep neural networks.",/pdf/796b869a914ad48c5e95954aa6c8cffc372299d9.pdf,ICLR,2020,"We present an on-policy method for solving constrained MDPs that respects trajectory-level constraints by converting them into local state-dependent constraints, and works for both discrete and continuous high-dimensional spaces." 
+HygrAR4tPS,B1gJxucOPH,1569440000000.0,1577170000000.0,1430,On Empirical Comparisons of Optimizers for Deep Learning,"[""choidami@cs.toronto.edu"", ""shallue@google.com"", ""znado@google.com"", ""jaehlee@google.com"", ""cmaddis@google.com"", ""gdahl@google.com""]","[""Dami Choi"", ""Christopher J. Shallue"", ""Zachary Nado"", ""Jaehoon Lee"", ""Chris J. Maddison"", ""George E. Dahl""]","[""Deep learning"", ""optimization"", ""adaptive gradient methods"", ""Adam"", ""hyperparameter tuning""]","Selecting an optimizer is a central step in the contemporary deep learning pipeline. In this paper we demonstrate the sensitivity of optimizer comparisons to the metaparameter tuning protocol. Our findings suggest that the metaparameter search space may be the single most important factor explaining the rankings obtained by recent empirical comparisons in the literature. In fact, we show that these results can be contradicted when metaparameter search spaces are changed. As tuning effort grows without bound, more general update rules should never underperform the ones they can approximate (i.e., Adam should never perform worse than momentum), but the recent attempts to compare optimizers either assume these inclusion relationships are not relevant in practice or restrict the metaparameters they tune to break the inclusions. In our experiments, we find that the inclusion relationships between optimizers matter in practice and always predict optimizer comparisons. In particular, we find that the popular adaptive gradient methods never underperform momentum or gradient descent. We also report practical tips around tuning rarely-tuned metaparameters of adaptive gradient methods and raise concerns about fairly benchmarking optimizers for neural network training.",/pdf/46b36dabfb24c43ab927f1afc21652594f807d33.pdf,ICLR,2020,Optimizer comparisons depend more than you would think on metaparameter tuning details and our prior should be that more general update rules (e.g. 
adaptive gradient methods) are better. +BkPrDFgR-,ryISwKg0-,1509100000000.0,1518730000000.0,337,Piecewise Linear Neural Networks verification: A comparative study,"[""rudy@robots.ox.ac.uk"", ""ilker.turkaslan@lmh.ox.ac.uk"", ""philip.torr@eng.ox.ac.uk"", ""pushmeet@google.com"", ""pawan@robots.ox.ac.uk""]","[""Rudy Bunel"", ""Ilker Turkaslan"", ""Philip H.S. Torr"", ""Pushmeet Kohli"", ""M. Pawan Kumar""]","[""Verification"", ""SMT solver"", ""Mixed Integer Programming"", ""Neural Networks""]","The success of Deep Learning and its potential use in many important safety- +critical applications has motivated research on formal verification of Neural Net- +work (NN) models. Despite the reputation of learned NN models to behave as +black boxes and theoretical hardness results of the problem of proving their prop- +erties, researchers have been successful in verifying some classes of models by +exploiting their piecewise linear structure. Unfortunately, most of these works +test their algorithms on their own models and do not offer any comparison with +other approaches. As a result, the advantages and downsides of the different al- +gorithms are not well understood. Motivated by the need of accelerating progress +in this very important area, we investigate the trade-offs of a number of different +approaches based on Mixed Integer Programming, Satisfiability Modulo Theory, +as well as a novel method based on the Branch-and-Bound framework. We also +propose a new data set of benchmarks, in addition to a collection of previously +released testcases that can be used to compare existing methods. Our analysis not +only allowed a comparison to be made between different strategies, the compar- +ision of results from different solvers also revealed implementation bugs in pub- +lished methods. 
We expect that the availability of our benchmark and the analysis +of the different approaches will allow researchers to invent and evaluate promising +approaches for making progress on this important topic.",/pdf/ec5705be22cc228fb5fad6ddc13ed333bbf7f436.pdf,ICLR,2018, +VMtftZqMruq,CA4faetHzS,1601310000000.0,1614990000000.0,1210,Towards Understanding Linear Value Decomposition in Cooperative Multi-Agent Q-Learning,"[""~Jianhao_Wang1"", ""~Zhizhou_Ren1"", ""~Beining_Han1"", ""~Jianing_Ye1"", ""~Chongjie_Zhang1""]","[""Jianhao Wang"", ""Zhizhou Ren"", ""Beining Han"", ""Jianing Ye"", ""Chongjie Zhang""]","[""Multi-agent reinforcement learning"", ""Fitted Q-iteration"", ""Value factorization"", ""Credit assignment""]","Value decomposition is a popular and promising approach to scaling up multi-agent reinforcement learning in cooperative settings. However, the theoretical understanding of such methods is limited. In this paper, we introduce a variant of the fitted Q-iteration framework for analyzing multi-agent Q-learning with value decomposition. Based on this framework, we derive a closed-form solution to the empirical Bellman error minimization with linear value decomposition. With this novel solution, we further reveal two interesting insights: 1) linear value decomposition implicitly implements a classical multi-agent credit assignment called counterfactual difference rewards; and 2) On-policy data distribution or richer Q function classes can improve the training stability of multi-agent Q-learning. In the empirical study, our experiments demonstrate the realizability of our theoretical closed-form formulation and implications in the didactic examples and a broad set of StarCraft II unit micromanagement tasks, respectively. ",/pdf/893d97d02d2f8662c05b96b4807161bbd2fe6652.pdf,ICLR,2021,A theoretical and empirical understanding of cooperative multi-agent Q-learning with linear value decomposition. 
+BJlahxHYDS,Skge6eWFwS,1569440000000.0,1587120000000.0,2553,Conservative Uncertainty Estimation By Fitting Prior Networks,"[""kamil.ciosek@microsoft.com"", ""fortuin@inf.ethz.ch"", ""ryoto@microsoft.com"", ""katja.hofmann@microsoft.com"", ""ret26@cam.ac.uk""]","[""Kamil Ciosek"", ""Vincent Fortuin"", ""Ryota Tomioka"", ""Katja Hofmann"", ""Richard Turner""]","[""uncertainty quantification"", ""deep learning"", ""Gaussian process"", ""epistemic uncertainty"", ""random network"", ""prior"", ""Bayesian inference""]","Obtaining high-quality uncertainty estimates is essential for many applications of deep neural networks. In this paper, we theoretically justify a scheme for estimating uncertainties, based on sampling from a prior distribution. Crucially, the uncertainty estimates are shown to be conservative in the sense that they never underestimate a posterior uncertainty obtained by a hypothetical Bayesian algorithm. We also show concentration, implying that the uncertainty estimates converge to zero as we get more data. Uncertainty estimates obtained from random priors can be adapted to any deep network architecture and trained using standard supervised learning pipelines. We provide experimental evaluation of random priors on calibration and out-of-distribution detection on typical computer vision tasks, demonstrating that they outperform deep ensembles in practice.",/pdf/523d0ff6817176db2a8897bf673b70ca56eb81c8.pdf,ICLR,2020,We provide theoretical support to uncertainty estimates for deep learning obtained fitting random priors. 
+BJe4JJBYwS,HJlDN35ODr,1569440000000.0,1577170000000.0,1465,CROSS-DOMAIN CASCADED DEEP TRANSLATION,"[""orenkatzir@mail.tau.ac.il"", ""cohenor@gmail.com"", ""danix3d@gmail.com""]","[""Oren Katzir"", ""Dani Lischinski"", ""Daniel Cohen-Or""]","[""computer vision"", ""image translation"", ""generative adversarial networks""]","In recent years we have witnessed tremendous progress in unpaired image-to-image translation methods, propelled by the emergence of DNNs and adversarial training strategies. However, most existing methods focus on transfer of style and appearance, rather than on shape translation. The latter task is challenging, due to its intricate non-local nature, which calls for additional supervision. We mitigate this by descending the deep layers of a pre-trained network, where the deep features contain more semantics, and applying the translation between these deep features. Specifically, we leverage VGG, which is a classification network, pre-trained with large-scale semantic supervision. Our translation is performed in a cascaded, deep-to-shallow, fashion, along the deep feature hierarchy: we first translate between the deepest layers that encode the higher-level semantic content of the image, proceeding to translate the shallower layers, conditioned on the deeper ones. We show that our method is able to translate between different domains, which exhibit significantly different shapes. We evaluate our method both qualitatively and quantitatively and compare it to state-of-the-art image-to-image translation methods. 
Our code and trained models will be made available.",/pdf/c960a1569181f2268170779c1c8fb1545ef8944e.pdf,ICLR,2020,"Image-to-image translation in a cascaded, deep-to-shallow, fashion, along the deep feature of a pre-trained classification network" +ryx3_iAcY7,S1xXrCoOF7,1538090000000.0,1545360000000.0,390,Contextualized Role Interaction for Neural Machine Translation,"[""dirk.weissenborn@gmail.com"", ""dkiela@fb.com"", ""jase@fb.com"", ""kyunghyun.cho@nyu.edu""]","[""Dirk Weissenborn"", ""Douwe Kiela"", ""Jason Weston"", ""Kyunghyun Cho""]","[""Neural Machine Translation"", ""Natural Language Processing""]","Word inputs tend to be represented as single continuous vectors in deep neural networks. It is left to the subsequent layers of the network to extract relevant aspects of a word's meaning based on the context in which it appears. In this paper, we investigate whether word representations can be improved by explicitly incorporating the idea of latent roles. That is, we propose a role interaction layer (RIL) that consists of context-dependent (latent) role assignments and role-specific transformations. We evaluate the RIL on machine translation using two language pairs (En-De and En-Fi) and three datasets of varying size. We find that the proposed mechanism improves translation quality over strong baselines with limited amounts of data, but that the improvement diminishes as the size of data grows, indicating that powerful neural MT systems are capable of implicitly modeling role-word interaction by themselves. Our qualitative analysis reveals that the RIL extracts meaningful context-dependent roles and that it allows us to inspect more deeply the internal mechanisms of state-of-the-art neural machine translation systems.",/pdf/6b6b523e7619e51cc0ba56eab35e7ab9853dcfc7.pdf,ICLR,2019,We propose a role interaction layer that explicitly models the modulation of token representations by contextualized roles. 
+BJe4V1HFPr,Hkxqm3h_DB,1569440000000.0,1577170000000.0,1650,Disentangling Style and Content in Anime Illustrations,"[""sitaoxia@usc.edu"", ""hao@hao-li.com""]","[""Sitao Xiang"", ""Hao Li""]","[""Adversarial Training"", ""Generative Models"", ""Style Transfer"", ""Anime""]","Existing methods for AI-generated artworks still struggle with generating high-quality stylized content, where high-level semantics are preserved, or separating fine-grained styles from various artists. We propose a novel Generative Adversarial Disentanglement Network which can disentangle two complementary factors of variations when only one of them is labelled in general, and fully decompose complex anime illustrations into style and content in particular. Training such model is challenging, since given a style, various content data may exist but not the other way round. Our approach is divided into two stages, one that encodes an input image into a style independent content, and one based on a dual-conditional generator. We demonstrate the ability to generate high-fidelity anime portraits with a fixed content and a large variety of styles from over a thousand artists, and vice versa, using a single end-to-end network and with applications in style transfer. We show this unique capability as well as superior output to the current state-of-the-art.",/pdf/1c87ed366f97c9e3b5d9e9aba741584609bee0de.pdf,ICLR,2020,"An adversarial training-based method for disentangling two complementary sets of variations in a dataset where only one of them is labelled, tested on style vs. content in anime illustrations." 
+BJxwPJHFwS,BJl2_36Ovr,1569440000000.0,1599390000000.0,1770,Robustness Verification for Transformers,"[""zhouxingshichn@gmail.com"", ""huan@huan-zhang.com"", ""kw@kwchang.net"", ""aihuang@tsinghua.edu.cn"", ""chohsieh@cs.ucla.edu""]","[""Zhouxing Shi"", ""Huan Zhang"", ""Kai-Wei Chang"", ""Minlie Huang"", ""Cho-Jui Hsieh""]","[""Robustness"", ""Verification"", ""Transformers""]","Robustness verification that aims to formally certify the prediction behavior of neural networks has become an important tool for understanding model behavior and obtaining safety guarantees. However, previous methods can usually only handle neural networks with relatively simple architectures. In this paper, we consider the robustness verification problem for Transformers. Transformers have complex self-attention layers that pose many challenges for verification, including cross-nonlinearity and cross-position dependency, which have not been discussed in previous works. We resolve these challenges and develop the first robustness verification algorithm for Transformers. The certified robustness bounds computed by our method are significantly tighter than those by naive Interval Bound Propagation. These bounds also shed light on interpreting Transformers as they consistently reflect the importance of different words in sentiment analysis.",/pdf/3e590e59bac677a1a5d07e28177f4d0b4604cdc2.pdf,ICLR,2020,We propose the first algorithm for verifying the robustness of Transformers. 
+HkGsHj05tQ,BkeUZP9wtX,1538090000000.0,1545360000000.0,117,Effective and Efficient Batch Normalization Using Few Uncorrelated Data for Statistics' Estimation,"[""chenzd15@mails.tsinghua.edu.cn"", ""leideng@ucsb.edu"", ""liguoqi@mail.tsinghua.edu.cn"", ""sunjw15@mails.tsinghua.edu.cn"", ""xinghu@ucsb.edu"", ""lingliang@ucsb.edu"", ""yufeiding@cs.ucsb.edu"", ""yuanxie@ucsb.edu""]","[""Zhaodong Chen"", ""Lei Deng"", ""Guoqi Li"", ""Jiawei Sun"", ""Xing Hu"", ""Ling Liang"", ""YufeiDing"", ""Yuan Xie""]","[""batch normalization"", ""acceleration"", ""correlation"", ""sampling""]","Deep Neural Networks (DNNs) thrive in recent years in which Batch Normalization (BN) plays an indispensable role. However, it has been observed that BN is costly due to the reduction operations. In this paper, we propose alleviating the BN’s cost by using only a small fraction of data for mean & variance estimation at each iteration. The key challenge to reach this goal is how to achieve a satisfactory balance between normalization effectiveness and execution efficiency. We identify that the effectiveness expects less data correlation while the efficiency expects regular execution pattern. To this end, we propose two categories of approach: sampling or creating few uncorrelated data for statistics’ estimation with certain strategy constraints. The former includes “Batch Sampling (BS)” that randomly selects few samples from each batch and “Feature Sampling (FS)” that randomly selects a small patch from each feature map of all samples, and the latter is “Virtual Dataset Normalization (VDN)” that generates few synthetic random samples. Accordingly, multi-way strategies are designed to reduce the data correlation for accurate estimation and optimize the execution pattern for running acceleration in the meantime. 
All the proposed methods are comprehensively evaluated on various DNN models, where an overall training speedup by up to 21.7% on modern GPUs can be practically achieved without the support of any specialized libraries, and the loss of model accuracy and convergence rate are negligible. Furthermore, our methods demonstrate powerful performance when solving the well-known “micro-batch normalization” problem in the case of tiny batch size.",/pdf/b1996b897e431bf246533b158a4734a55be2b69a.pdf,ICLR,2019,"We propose accelerating Batch Normalization (BN) through sampling less correlated data for reduction operations with regular execution pattern, which achieves up to 2x and 20% speedup for BN itself and the overall training, respectively." +B1M8JF9xx,,1478300000000.0,1488570000000.0,468,On the Quantitative Analysis of Decoder-Based Generative Models,"[""ywu@cs.toronto.edu"", ""yburda@openai.com"", ""rsalakhu@cs.cmu.edu"", ""rgrosse@cs.toronto.edu""]","[""Yuhuai Wu"", ""Yuri Burda"", ""Ruslan Salakhutdinov"", ""Roger Grosse""]","[""Deep learning"", ""Unsupervised Learning""]","The past several years have seen remarkable progress in generative models which produce convincing samples of images and other modalities. A shared component of some popular models such as generative adversarial networks and generative moment matching networks, is a decoder network, a parametric deep neural net that defines a generative distribution. Unfortunately, it can be difficult to quantify the performance of these models because of the intractability of log-likelihood estimation, and inspecting samples can be misleading. We propose to use Annealed Importance Sampling for evaluating log-likelihoods for decoder-based models and validate its accuracy using bidirectional Monte Carlo. 
Using this technique, we analyze the performance of decoder-based models, the effectiveness of existing log-likelihood estimators, the degree of overfitting, and the degree to which these models miss important modes of the data distribution.",/pdf/e22d0adc0c056dd2641fdd179725b94402def68a.pdf,ICLR,2017,"We propose to use Annealed Importance Sampling to evaluate decoder-based generative network, and investigate various properties of these models." +HygHbTVYPB,SyluxVw8vr,1569440000000.0,1577170000000.0,373,LDMGAN: Reducing Mode Collapse in GANs with Latent Distribution Matching,"[""zzwcs@zju.edu.cn"", ""cszhl@zju.edh.cn"", ""qinglanwuji@zju.edu.cn"", ""moqihang@zju.edu.cn"", ""feng123@zju.edu.cn"", ""endywon@zju.edu.cn"", ""11921050@zju.edu.cn"", ""zjusheldon@zju.edu.cn"", ""wxing@zju.edu.cn"", ""ldm@zju.edu.cn""]","[""Zhiwen Zuo"", ""Lei Zhao"", ""Huiming Zhang"", ""Qihang Mo"", ""Haibo Chen"", ""Zhizhong Wang"", ""AiLin Li"", ""Lihong Qiu"", ""Wei Xing"", ""Dongming Lu""]","[""Deep Learning"", ""Unsupervised Learning"", ""Generative Adversarial Networks"", ""Mode Collapse"", ""AutoEncoder""]","Generative Adversarial Networks (GANs) have shown impressive results in modeling distributions over complicated manifolds such as those of natural images. However, GANs often suffer from mode collapse, which means they are prone to characterize only a single or a few modes of the data distribution. In order to address this problem, we propose a novel framework called LDMGAN. We first introduce Latent Distribution Matching (LDM) constraint which regularizes the generator by aligning distribution of generated samples with that of real samples in latent space. To make use of such latent space, we propose a regularized AutoEncoder (AE) that maps the data distribution to prior distribution in encoded space. 
Extensive experiments on synthetic data and real world datasets show that our proposed framework significantly improves GAN’s stability and diversity.",/pdf/c2d2cb4aa7945f63266ec09181830f67be5ba3b3.pdf,ICLR,2020,We propose an AE-based GAN that alleviates mode collapse in GANs. +BJe-Sn0ctm,H1eS8STcFm,1538090000000.0,1545360000000.0,1516,Ain't Nobody Got Time for Coding: Structure-Aware Program Synthesis from Natural Language,"[""jakub.bednarek@put.poznan.pl"", ""kar.piaskowski@gmail.com"", ""krawiec@cs.put.poznan.pl""]","[""Jakub Bednarek"", ""Karol Piaskowski"", ""Krzysztof Krawiec""]","[""Program synthesis"", ""tree2tree autoencoders"", ""soft attention"", ""doubly-recurrent neural networks"", ""LSTM"", ""nlp2tree""]","Program synthesis from natural language (NL) is practical for humans and, once technically feasible, would significantly facilitate software development and revolutionize end-user programming. We present SAPS, an end-to-end neural network capable of mapping relatively complex, multi-sentence NL specifications to snippets of executable code. The proposed architecture relies exclusively on neural components, and is built upon a tree2tree autoencoder trained on abstract syntax trees, combined with a pretrained word embedding and a bi-directional multi-layer LSTM for NL processing. The decoder features a doubly-recurrent LSTM with a novel signal propagation scheme and soft attention mechanism. When applied to a large dataset of problems proposed in a previous study, SAPS performs on par with or better than the method proposed there, producing correct programs in over 90% of cases. In contrast to other methods, it does not involve any non-neural components to post-process the resulting programs, and uses a fixed-dimensional latent representation as the only link between the NL analyzer and source code generator. 
",/pdf/e103fbc714d0b47312dde3e5cf1bedf23742b9f6.pdf,ICLR,2019,"We generate source code based on short descriptions in natural language, using deep neural networks." +E3UZoJKHxuk,QBSD-5hLwL7,1601310000000.0,1614990000000.0,729,Latent Causal Invariant Model,"[""~Xinwei_Sun1"", ""~Botong_Wu1"", ""~Chang_Liu10"", ""~Xiangyu_Zheng1"", ""~Wei_Chen1"", ""~Tao_Qin1"", ""~Tie-Yan_Liu1""]","[""Xinwei Sun"", ""Botong Wu"", ""Chang Liu"", ""Xiangyu Zheng"", ""Wei Chen"", ""Tao Qin"", ""Tie-Yan Liu""]","[""invariance"", ""causality"", ""spurious correlation"", ""out-of-distribution generalization"", ""interpretability"", ""variational auto-encoder""]","Current supervised learning can learn spurious correlation during the data-fitting process, imposing issues regarding interpretability, out-of-distribution (OOD) generalization, and robustness. To avoid spurious correlation, we propose a \textbf{La}tent \textbf{C}ausal \textbf{I}nvariance \textbf{M}odel (LaCIM) which pursues \emph{causal prediction}. Specifically, we introduce latent variables that are separated into (a) output-causative factors and (b) others that are spuriously correlated to the output via confounders, to model the underlying causal factors. We further assume the generating mechanisms from latent space to observed data to be \emph{causally invariant}. We give the identifiable claim of such invariance, particularly the disentanglement of output-causative factors from others, as a theoretical guarantee for precise inference and avoiding spurious correlation. We propose a Variational-Bayesian-based method for estimation and to optimize over the latent space for prediction. The utility of our approach is verified by improved interpretability, prediction power on various OOD scenarios (including healthcare) and robustness on security. 
",/pdf/2e26ec30a62a18e04b9a3c89e5647a0e18c9ef96.pdf,ICLR,2021,"We leverage causal invariance to avoid spurious correlation for better out-of-distribution generalization, interpretability and robustness." +UoAFJMzCNM,2jxaHTIydo,1601310000000.0,1614990000000.0,2400,Multi-agent Deep FBSDE Representation For Large Scale Stochastic Differential Games,"[""~Tianrong_Chen1"", ""~Ziyi_Wang1"", ""~Ioannis_Exarchos1"", ""~Evangelos_Theodorou1""]","[""Tianrong Chen"", ""Ziyi Wang"", ""Ioannis Exarchos"", ""Evangelos Theodorou""]","[""Multi-agent Deep FBSDE Representation For Large Scale Stochastic Differential Games""]"," In this paper we present a deep learning framework for solving large-scale multi-agent non-cooperative stochastic games using fictitious play. The Hamilton-Jacobi-Bellman (HJB) PDE associated with each agent is reformulated into a set of Forward-Backward Stochastic Differential Equations (FBSDEs) and solved via forward sampling on a suitably defined neural network architecture. Decision-making in multi-agent systems suffers from the curse of dimensionality and strategy degeneration as the number of agents and time horizon increase. We propose a novel Deep FBSDE controller framework which is shown to outperform the current state-of-the-art deep fictitious play algorithm on a high dimensional inter-bank lending/borrowing problem. More importantly, our approach mitigates the curse of many agents and reduces computational and memory complexity, allowing us to scale up to 1,000 agents in simulation, a scale which, to the best of our knowledge, represents a new state of the art. Finally, we showcase the framework's applicability in robotics on a belief-space autonomous racing problem.",/pdf/4284de988d5d6b8ee2a6a0a8640657806df581dc.pdf,ICLR,2021,"In this paper, we propose a novel and scalable deep learning framework for solving multi-agent stochastic differential game using fictitious play." 
+SkeHuCVFDr,BJexo2v_PS,1569440000000.0,1583910000000.0,1210,BERTScore: Evaluating Text Generation with BERT,"[""zty27x@gmail.com"", ""vk352@cornell.edu"", ""fw245@cornell.edu"", ""kqw4@cornell.edu"", ""yoav@cs.cornell.edu""]","[""Tianyi Zhang*"", ""Varsha Kishore*"", ""Felix Wu*"", ""Kilian Q. Weinberger"", ""Yoav Artzi""]","[""Metric"", ""Evaluation"", ""Contextual Embedding"", ""Text Generation""]","We propose BERTScore, an automatic evaluation metric for text generation. Analogously to common metrics, BERTScore computes a similarity score for each token in the candidate sentence with each token in the reference sentence. However, instead of exact matches, we compute token similarity using contextual embeddings. We evaluate using the outputs of 363 machine translation and image captioning systems. BERTScore correlates better with human judgments and provides stronger model selection performance than existing metrics. Finally, we use an adversarial paraphrase detection task and show that BERTScore is more robust to challenging examples compared to existing metrics. ",/pdf/aa3e86a43472a9ca29b3f649f0da62f511c55548.pdf,ICLR,2020,"We propose BERTScore, an automatic evaluation metric for text generation, which correlates better with human judgments and provides stronger model selection performance than existing metrics." +HyesW2C9YQ,ryg45Aa5F7,1538090000000.0,1545360000000.0,1202,I Know the Feeling: Learning to Converse with Empathy,"[""hrashkin@cs.washington.edu"", ""ems@fb.com"", ""hadasah@gmail.com"", ""ylan@fb.com""]","[""Hannah Rashkin"", ""Eric Michael Smith"", ""Margaret Li"", ""Y-Lan Boureau""]","[""dialogue generation"", ""nlp applications"", ""grounded text generation"", ""contextual representation learning""]","Beyond understanding what is being discussed, human communication requires an awareness of what someone is feeling. 
One challenge for dialogue agents is recognizing feelings in the conversation partner and replying accordingly, a key communicative skill that is trivial for humans. Research in this area is made difficult by the paucity of suitable publicly available datasets both for emotion and dialogues. This work proposes a new task for empathetic dialogue generation and EmpatheticDialogues, a dataset of 25k conversations grounded in emotional situations to facilitate training and evaluating dialogue systems. Our experiments indicate that dialogue models that use our dataset are perceived to be more empathetic by human evaluators, while improving on other metrics as well (e.g. perceived relevance of responses, BLEU scores), compared to models merely trained on large-scale Internet conversation data. We also present empirical comparisons of several ways to improve the performance of a given model by leveraging existing models or datasets without requiring lengthy re-training of the full model.",/pdf/e454a624cfa5c86063aa8969e753c012e31d3f33.pdf,ICLR,2019,"We improve existing dialogue systems for responding to people sharing personal stories, incorporating emotion prediction representations and also release a new benchmark and dataset of empathetic dialogues." +O-6Pm_d_Q-,z5c8baQxpyG,1601310000000.0,1618000000000.0,863,Deep Networks and the Multiple Manifold Problem,"[""~Sam_Buchanan1"", ""~Dar_Gilboa1"", ""~John_Wright1""]","[""Sam Buchanan"", ""Dar Gilboa"", ""John Wright""]","[""deep learning"", ""overparameterized neural networks"", ""low-dimensional structure""]","We study the multiple manifold problem, a binary classification task modeled on applications in machine vision, in which a deep fully-connected neural network is trained to separate two low-dimensional submanifolds of the unit sphere. 
We provide an analysis of the one-dimensional case, proving for a simple manifold configuration that when the network depth $L$ is large relative to certain geometric and statistical properties of the data, the network width $n$ grows as a sufficiently large polynomial in $L$, and the number of i.i.d. samples from the manifolds is polynomial in $L$, randomly-initialized gradient descent rapidly learns to classify the two manifolds perfectly with high probability. Our analysis demonstrates concrete benefits of depth and width in the context of a practically-motivated model problem: the depth acts as a fitting resource, with larger depths corresponding to smoother networks that can more readily separate the class manifolds, and the width acts as a statistical resource, enabling concentration of the randomly-initialized network and its gradients. The argument centers around the ""neural tangent kernel"" of Jacot et al. and its role in the nonasymptotic analysis of training overparameterized neural networks; to this literature, we contribute essentially optimal rates of concentration for the neural tangent kernel of deep fully-connected ReLU networks, requiring width $n \geq L\,\mathrm{poly}(d_0)$ to achieve uniform concentration of the initial kernel over a $d_0$-dimensional submanifold of the unit sphere $\mathbb{S}^{n_0-1}$, and a nonasymptotic framework for establishing generalization of networks trained in the ""NTK regime"" with structured data. The proof makes heavy use of martingale concentration to optimally treat statistical dependencies across layers of the initial random network. 
This approach should be of use in establishing similar results for other network architectures.",/pdf/56f1d814ac75024a949f12d01b0c52e5a0e108f3.pdf,ICLR,2021,"We prove a finite-time generalization result for deep fully-connected neural networks trained by gradient descent to classify structured data, where the required width, depth, and sample complexity depend only on intrinsic properties of the data." +HJxEhREKDH,Skek_DK_vr,1569440000000.0,1583910000000.0,1353,On the Global Convergence of Training Deep Linear ResNets,"[""knowzou@ucla.edu"", ""plong@google.com"", ""qgu@cs.ucla.edu""]","[""Difan Zou"", ""Philip M. Long"", ""Quanquan Gu""]",[],"We study the convergence of gradient descent (GD) and stochastic gradient descent (SGD) for training $L$-hidden-layer linear residual networks (ResNets). We prove that for training deep residual networks with certain linear transformations at input and output layers, which are fixed throughout training, both GD and SGD with zero initialization on all hidden weights can converge to the global minimum of the training loss. Moreover, when specializing to appropriate Gaussian random linear transformations, GD and SGD provably optimize wide enough deep linear ResNets. Compared with the global convergence result of GD for training standard deep linear networks \citep{du2019width}, our condition on the neural network width is sharper by a factor of $O(\kappa L)$, where $\kappa$ denotes the condition number of the covariance matrix of the training data. We further propose a modified identity input and output transformations, and show that a $(d+k)$-wide neural network is sufficient to guarantee the global convergence of GD/SGD, where $d,k$ are the input and output dimensions respectively.",/pdf/40f991e7d57fcf3070e9cc9829c477216a40cc72.pdf,ICLR,2020,"Under certain condition on the input and output linear transformations, both GD and SGD can achieve global convergence for training deep linear ResNets." 
+BygMreSYPB,BJgWm_eFvH,1569440000000.0,1577170000000.0,2277,Learning Latent Dynamics for Partially-Observed Chaotic Systems,"[""said.ouala@imt-atlantique.fr"", ""van.nguyen1@imt-atlantique.fr"", ""lucas.drumetz@imt-atlantique.fr"", ""bertrand.chapron@ifremer.fr"", ""ananda.pascual@imedea.uib-csic.es"", ""dr.fab@oceandatalab.com"", ""lucile.gaultier@oceandatalab.com"", ""ronan.fablet@imt-atlantique.fr""]","[""Said ouala"", ""Duong Nguyen"", ""Lucas Drumetz"", ""Bertrand Chapron"", ""Ananda Pascual"", ""Fabrice Collard"", ""Lucile Gaultier"", ""Ronan Fablet""]","[""Dynamical systems"", ""Neural networks"", ""Embedding"", ""Partially observed systems"", ""Forecasting"", ""chaos""]","This paper addresses the data-driven identification of latent representations of partially-observed dynamical systems, i.e. dynamical systems whose some components are never observed, with an emphasis on forecasting applications and long-term asymptotic patterns. Whereas state-of-the-art data-driven approaches rely on delay embeddings and linear decompositions of the underlying operators, we introduce a framework based on the data-driven identification of an augmented state-space model using a neural-network-based representation. For a given training dataset, it amounts to jointly reconstructing the latent states and learning an ODE (Ordinary Differential Equation) representation in this space. Through numerical experiments, we demonstrate the relevance of the proposed framework w.r.t. state-of-the-art approaches in terms of short-term forecasting errors and long-term behaviour. 
We further discuss how the proposed framework relates to Koopman operator theory and Takens' embedding theorem.",/pdf/501ceeda5fc8eb083fac34b2951e76c3a12e9890.pdf,ICLR,2020,Data driven identification of ODE representations for partially observed chaotic systems +rJzoujRct7,BJlhaQqctX,1538090000000.0,1545360000000.0,386,A Solution to China Competitive Poker Using Deep Learning,"[""liuzx@smzy.cc"", ""humaoyu@smzy.cc"", ""zzf@smzy.cc""]","[""Zhenxing Liu"", ""Maoyu Hu"", ""Zhangfei Zhang""]","[""artificial intelligence"", ""China competitive poker"", ""Dou dizhu"", ""CNN"", ""imperfect information game""]","Recently, deep neural networks have achieved superhuman performance in various games such as Go, chess and Shogi. Compared to Go, China Competitive Poker, also known as Dou dizhu, is a type of imperfect information game, including hidden information, randomness, multi-agent cooperation and competition. It has become widespread and is now a national game in China. We introduce an approach to play China Competitive Poker using Convolutional Neural Network (CNN) to predict actions. This network is trained by supervised learning from human game records. Without any search, the network already beats the best AI program by a large margin, and also beats the best human amateur players in duplicate mode.",/pdf/6bd5dc3c8fd0f148ceafc85f824209321289bb49.pdf,ICLR,2019,"This paper introduces a method to play China competitive poker using deep neural network, gets the state of the art performance." +Skx5txzb0W,BJ5FxfZ0-,1509130000000.0,1532340000000.0,771,A Boo(n) for Evaluating Architecture Performance,"[""ondrej@bajgar.org"", ""rudolf_kadlec@cz.ibm.com"", ""jankle@cz.ibm.com""]","[""Ondrej Bajgar"", ""Rudolf Kadlec"", ""and Jan Kleindienst""]","[""evaluation"", ""methodology""]","We point out important problems with the common practice of using the best single model performance for comparing deep learning architectures, and we propose a method that corrects these flaws. 
Each time a model is trained, one gets a different result due to random factors in the training process, which include random parameter initialization and random data shuffling. Reporting the best single model performance does not appropriately address this stochasticity. We propose a normalized expected best-out-of-n performance (Boo_n) as a way to correct these problems.",/pdf/e912f3e25a76c697152ccac7600058aa1b6b5ab5.pdf,ICLR,2018,"We point out important problems with the common practice of using the best single model performance for comparing deep learning architectures, and we propose a method that corrects these flaws." +bM4Iqfg8M2k,t3RV0UQ6Kg,1601310000000.0,1615890000000.0,779,Graph Information Bottleneck for Subgraph Recognition,"[""~Junchi_Yu1"", ""~Tingyang_Xu1"", ""~Yu_Rong1"", ""~Yatao_Bian1"", ""~Junzhou_Huang2"", ""~Ran_He1""]","[""Junchi Yu"", ""Tingyang Xu"", ""Yu Rong"", ""Yatao Bian"", ""Junzhou Huang"", ""Ran He""]",[],"Given the input graph and its label/property, several key problems of graph learning, such as finding interpretable subgraphs, graph denoising and graph compression, can be attributed to the fundamental problem of recognizing a subgraph of the original one. This subgraph shall be as informative as possible, yet contains less redundant and noisy structure. This problem setting is closely related to the well-known information bottleneck (IB) principle, which, however, has less been studied for the irregular graph data and graph neural networks (GNNs). In this paper, we propose a framework of Graph Information Bottleneck (GIB) for the subgraph recognition problem in deep graph learning. Under this framework, one can recognize the maximally informative yet compressive subgraph, named IB-subgraph. However, the GIB objective is notoriously hard to optimize, mostly due to the intractability of the mutual information of irregular graph data and the unstable optimization process. 
In order to tackle these challenges, we propose: i) a GIB objective based on a mutual information estimator for the irregular graph data; ii) a bi-level optimization scheme to maximize the GIB objective; iii) a connectivity loss to stabilize the optimization process. We evaluate the properties of the IB-subgraph in three application scenarios: improvement of graph classification, graph interpretation and graph denoising. Extensive experiments demonstrate that the information-theoretic IB-subgraph enjoys superior graph properties. ",/pdf/45a07fde0c34644e0b294e4bb7bb3c045bc3429a.pdf,ICLR,2021, +ryiAv2xAZ,SJcRw2x0-,1509110000000.0,1519410000000.0,393,Training Confidence-calibrated Classifiers for Detecting Out-of-Distribution Samples,"[""kiminlee@kaist.ac.kr"", ""honglak@eecs.umich.edu"", ""kibok@umich.edu"", ""jinwoos@kaist.ac.kr""]","[""Kimin Lee"", ""Honglak Lee"", ""Kibok Lee"", ""Jinwoo Shin""]",[],"The problem of detecting whether a test sample is from in-distribution (i.e., training distribution by a classifier) or out-of-distribution sufficiently different from it arises in many real-world machine learning applications. However, the state-of-the-art deep neural networks are known to be highly overconfident in their predictions, i.e., do not distinguish in- and out-of-distributions. Recently, to handle this issue, several threshold-based detectors have been proposed given pre-trained neural classifiers. However, the performance of prior works highly depends on how to train the classifiers since they only focus on improving inference procedures. In this paper, we develop a novel training method for classifiers so that such inference algorithms can work better. In particular, we suggest two additional terms added to the original loss (e.g., cross entropy). The first one forces samples from out-of-distribution less confident by the classifier and the second one is for (implicitly) generating most effective training samples for the first one. 
In essence, our method jointly trains both classification and generative neural networks for out-of-distribution. We demonstrate its effectiveness using deep convolutional neural networks on various popular image datasets.",/pdf/0d5105956a4f6279a24413241916c5b3df9cd1e3.pdf,ICLR,2018, +HkxWrsC5FQ,rJt3xy4tX,1538090000000.0,1545360000000.0,60,Provable Guarantees on Learning Hierarchical Generative Models with Deep CNNs,"[""eran.malach@mail.huji.ac.il"", ""shais@cs.huji.ac.il""]","[""Eran Malach"", ""Shai Shalev-Shwartz""]","[""deep learning"", ""theory""]","Learning deep networks is computationally hard in the general case. To show any positive theoretical results, one must make assumptions on the data distribution. Current theoretical works often make assumptions that are very far from describing real data, like sampling from Gaussian distribution or linear separability of the data. We describe an algorithm that learns convolutional neural network, +assuming the data is sampled from a deep generative model that generates images level by level, +where lower resolution images correspond to latent semantic classes. We analyze the convergence rate of our algorithm assuming the data is indeed generated according to this model (as well as +additional assumptions). While we do not pretend to claim that the assumptions are realistic for natural images, we do believe that they capture some true properties of real data. 
Furthermore, we show that on CIFAR-10, the algorithm we analyze achieves results in the same ballpark with vanilla convolutional neural networks that are trained with SGD.",/pdf/6bcb58e88ad4f1461f872e14c0291fbade0db69d.pdf,ICLR,2019,A generative model for deep CNNs with provable theoretical guarantees that actually works +sTeoJiB4uR,GrvaDIA5Aj2,1601310000000.0,1615840000000.0,1523,Reducing the Computational Cost of Deep Generative Models with Binary Neural Networks,"[""~Thomas_Bird1"", ""fhkingma@gmail.com"", ""~David_Barber1""]","[""Thomas Bird"", ""Friso Kingma"", ""David Barber""]","[""binary"", ""generative"", ""optimization"", ""compression""]","Deep generative models provide a powerful set of tools to understand real-world data. But as these models improve, they increase in size and complexity, so their computational cost in memory and execution time grows. Using binary weights in neural networks is one method which has shown promise in reducing this cost. However, whether binary neural networks can be used in generative models is an open problem. In this work we show, for the first time, that we can successfully train generative models which utilize binary neural networks. This reduces the computational cost of the models massively. We develop a new class of binary weight normalization, and provide insights for architecture designs of these binarized generative models. We demonstrate that two state-of-the-art deep generative models, the ResNet VAE and Flow++ models, can be binarized effectively using these techniques. 
We train binary models that achieve loss values close to those of the regular models but are 90%-94% smaller in size, and also allow significant speed-ups in execution time.",/pdf/4bc920fb592820a6d9fba3d58e93f6561f456f66.pdf,ICLR,2021,We demonstrate that deep generative models can be effectively trained using binary weights and/or activations +B1gTShAct7,HJlhT3T5K7,1538090000000.0,1556850000000.0,1583,Learning to Learn without Forgetting by Maximizing Transfer and Minimizing Interference,"[""mdriemer@us.ibm.com"", ""cases@stanford.edu"", ""ajemian@mit.edu"", ""miao.liu1@ibm.com"", ""rish@us.ibm.com"", ""yuhai@us.ibm.com"", ""gtesauro@us.ibm.com""]","[""Matthew Riemer"", ""Ignacio Cases"", ""Robert Ajemian"", ""Miao Liu"", ""Irina Rish"", ""Yuhai Tu"", ""Gerald Tesauro""]",[],"Lack of performance when it comes to continual learning over non-stationary distributions of data remains a major challenge in scaling neural network learning to more human realistic settings. In this work we propose a new conceptualization of the continual learning problem in terms of a temporally symmetric trade-off between transfer and interference that can be optimized by enforcing gradient alignment across examples. We then propose a new algorithm, Meta-Experience Replay (MER), that directly exploits this view by combining experience replay with optimization based meta-learning. This method learns parameters that make interference based on future gradients less likely and transfer based on future gradients more likely. We conduct experiments across continual lifelong supervised learning benchmarks and non-stationary reinforcement learning environments demonstrating that our approach consistently outperforms recently proposed baselines for continual learning. Our experiments show that the gap between the performance of MER and baseline algorithms grows both as the environment gets more non-stationary and as the fraction of the total experiences stored gets smaller. 
",/pdf/93ad46fe7cff088bd67ef50a6ebc39b64b15344b.pdf,ICLR,2019, +BJe55gBtvH,HJgMfk-twr,1569440000000.0,1583910000000.0,2482,Depth-Width Trade-offs for ReLU Networks via Sharkovsky's Theorem,"[""vaggos@cs.stanford.edu"", ""sai_nagarajan@mymail.sutd.edu.sg"", ""ioannis@sutd.edu.sg"", ""xiao_wang@sutd.edu.sg""]","[""Vaggos Chatziafratis"", ""Sai Ganesh Nagarajan"", ""Ioannis Panageas"", ""Xiao Wang""]","[""Depth-Width trade-offs"", ""ReLU networks"", ""chaos theory"", ""Sharkovsky Theorem"", ""dynamical systems""]","Understanding the representational power of Deep Neural Networks (DNNs) and how their structural properties (e.g., depth, width, type of activation unit) affect the functions they can compute, has been an important yet challenging question in deep learning and approximation theory. In a seminal paper, Telgarsky highlighted the benefits of depth by presenting a family of functions (based on simple triangular waves) for which DNNs achieve zero classification error, whereas shallow networks with fewer than exponentially many nodes incur constant error. Even though Telgarsky’s work reveals the limitations of shallow neural networks, it doesn’t inform us on why these functions are difficult to represent and in fact he states it as a tantalizing open question to characterize those functions that cannot be well-approximated by smaller depths. +In this work, we point to a new connection between DNNs expressivity and Sharkovsky’s Theorem from dynamical systems, that enables us to characterize the depth-width trade-offs of ReLU networks for representing functions based on the presence of a generalized notion of fixed points, called periodic points (a fixed point is a point of period 1). 
Motivated by our observation that the triangle waves used in Telgarsky’s work contain points of period 3 – a period that is special in that it implies chaotic behaviour based on the celebrated result by Li-Yorke – we proceed to give general lower bounds for the width needed to represent periodic functions as a function of the depth. Technically, the crux of our approach is based on an eigenvalue analysis of the dynamical systems associated with such functions.",/pdf/45974c4cd154860a2542454f3e4be33d90d08215.pdf,ICLR,2020,"In this work, we point to a new connection between DNNs expressivity and Sharkovsky’s Theorem from dynamical systems, that enables us to characterize the depth-width trade-offs of ReLU networks " +Sklsm20ctX,S1l_Fc65KQ,1538090000000.0,1550360000000.0,1391,Competitive experience replay,"[""lhao499@gmail.com"", ""atrott@salesforce.com"", ""rsocher@salesforce.com"", ""cxiong@salesforce.com""]","[""Hao Liu"", ""Alexander Trott"", ""Richard Socher"", ""Caiming Xiong""]","[""reinforcement learning"", ""sparse reward"", ""goal-based learning""]","Deep learning has achieved remarkable successes in solving challenging reinforcement learning (RL) problems when dense reward function is provided. However, in sparse reward environment it still often suffers from the need to carefully shape reward function to guide policy optimization. This limits the applicability of RL in the real world since both reinforcement learning and domain-specific knowledge are required. It is therefore of great practical importance to develop algorithms which can learn from a binary signal indicating successful task completion or other unshaped, sparse reward signals. We propose a novel method called competitive experience replay, which efficiently supplements a sparse reward by placing learning in the context of an exploration competition between a pair of agents. 
Our method complements the recently proposed hindsight experience replay (HER) by inducing an automatic exploratory curriculum. We evaluate our approach on the tasks of reaching various goal locations in an ant maze and manipulating objects with a robotic arm. Each task provides only binary rewards indicating whether or not the goal is achieved. Our method asymmetrically augments these sparse rewards for a pair of agents each learning the same task, creating a competitive game designed to drive exploration. Extensive experiments demonstrate that this method leads to faster convergence and improved task performance.",/pdf/b1ae1567540f13f76fb15a7e83801cffd3eabf2a.pdf,ICLR,2019,a novel method to learn with sparse reward using adversarial reward re-labeling +FGqiDsBUKL0,biNkVBx2IvD,1601310000000.0,1615520000000.0,140,Do 2D GANs Know 3D Shape? Unsupervised 3D Shape Reconstruction from 2D Image GANs,"[""~Xingang_Pan1"", ""~Bo_Dai2"", ""~Ziwei_Liu1"", ""~Chen_Change_Loy2"", ""~Ping_Luo2""]","[""Xingang Pan"", ""Bo Dai"", ""Ziwei Liu"", ""Chen Change Loy"", ""Ping Luo""]","[""Generative Adversarial Network"", ""3D Reconstruction""]","Natural images are projections of 3D objects on a 2D image plane. While state-of-the-art 2D generative models like GANs show unprecedented quality in modeling the natural image manifold, it is unclear whether they implicitly capture the underlying 3D object structures. And if so, how could we exploit such knowledge to recover the 3D shapes of objects in the images? To answer these questions, in this work, we present the first attempt to directly mine 3D geometric cues from an off-the-shelf 2D GAN that is trained on RGB images only. Through our investigation, we found that such a pre-trained GAN indeed contains rich 3D knowledge and thus can be used to recover 3D shape from a single 2D image in an unsupervised manner. 
The core of our framework is an iterative strategy that explores and exploits diverse viewpoint and lighting variations in the GAN image manifold. The framework does not require 2D keypoint or 3D annotations, or strong assumptions on object shapes (e.g. shapes are symmetric), yet it successfully recovers 3D shapes with high precision for human faces, cats, cars, and buildings. The recovered 3D shapes immediately allow high-quality image editing like relighting and object rotation. We quantitatively demonstrate the effectiveness of our approach compared to previous methods in both 3D shape reconstruction and face rotation. Our code is available at https://github.com/XingangPan/GAN2Shape.",/pdf/ced9a36d7c1942426cc47913c3ec31367b60dad3.pdf,ICLR,2021,Unsupervised 3D Shape Reconstruction from 2D Image GANs +rJJzTyWCZ,SJCba1-CZ,1509130000000.0,1518730000000.0,525,Large-scale Cloze Test Dataset Designed by Teachers,"[""qizhex@gmail.com"", ""guokun@cs.cmu.edu"", ""zander.dai@gmail.com"", ""hovy@cs.cmu.edu""]","[""Qizhe Xie"", ""Guokun Lai"", ""Zihang Dai"", ""Eduard Hovy""]","[""dataset"", ""human-designed"", ""language understanding""]","Cloze test is widely adopted in language exams to evaluate students' language proficiency. In this paper, we propose the first large-scale human-designed cloze test dataset CLOTH in which the questions were used in middle-school and high-school language exams. With the missing blanks carefully created by teachers and candidate choices purposely designed to be confusing, CLOTH requires a deeper language understanding and a wider attention span than previous automatically generated cloze datasets. We show humans outperform dedicated designed baseline models by a significant margin, even when the model is trained on sufficiently large external data. 
We investigate the source of the performance gap, trace model deficiencies to some distinct properties of CLOTH, and identify the limited ability of comprehending a long-term context to be the key bottleneck. In addition, we find that human-designed data leads to a larger gap between the model's performance and human performance when compared to automatically generated data. ",/pdf/c467d55ffc4d0ceb8d88467df0d56f660a005dfe.pdf,ICLR,2018,A cloze test dataset designed by teachers to assess language proficiency +S1z9ehAqYX,HJex_Ja5KQ,1538090000000.0,1545360000000.0,1102,Shrinkage-based Bias-Variance Trade-off for Deep Reinforcement Learning,"[""yihao@cs.utexas.edu"", ""uestcliuhao@gmail.com"", ""jianpeng@illinois.edu"", ""lqiang@cs.utexas.edu""]","[""Yihao Feng"", ""Hao Liu"", ""Jian Peng"", ""Qiang Liu""]","[""bias-variance trade-off"", ""James-stein estimator"", ""reinforcement learning""]","Deep reinforcement learning has achieved remarkable successes in solving various challenging artificial intelligence tasks. A variety of different algorithms have been introduced and improved towards human-level performance. Although technical advances have been developed for each individual algorithms, there has been strong evidence showing that further substantial improvements can be achieved by properly combining multiple approaches with difference biases and variances. In this work, we propose to use the James-Stein (JS) shrinkage estimator to combine on-policy policy gradient estimators which have low bias but high variance, with low-variance high-bias gradient estimates such as those constructed based on model-based methods or temporally smoothed averaging of historical gradients. Empirical results show that our simple shrinkage approach is very effective in practice and substantially improve the sample efficiency of the state-of-the-art on-policy methods on various continuous control tasks. 
+",/pdf/0669c9494e7e1ad3a35b87edb358a4af9aeb09d1.pdf,ICLR,2019, +SJeT_oRcY7,BJlEGrccK7,1538090000000.0,1545360000000.0,397,Localized random projections challenge benchmarks for bio-plausible deep learning,"[""bernd.illing@epfl.ch"", ""wulfram.gerstner@epfl.ch"", ""johanni.brea@epfl.ch""]","[""Bernd Illing"", ""Wulfram Gerstner"", ""Johanni Brea""]","[""deep learning"", ""bio-plausibility"", ""random projections"", ""spiking networks"", ""unsupervised learning"", ""MNIST"", ""spike timing dependent plasticity""]","Similar to models of brain-like computation, artificial deep neural networks rely +on distributed coding, parallel processing and plastic synaptic weights. Training +deep neural networks with the error-backpropagation algorithm, however, is +considered bio-implausible. An appealing alternative to training deep neural networks +is to use one or a few hidden layers with fixed random weights or trained +with an unsupervised, local learning rule and train a single readout layer with a +supervised, local learning rule. We find that a network of leaky-integrate-and-fire +neurons with fixed random, localized receptive fields in the hidden layer and +spike timing dependent plasticity to train the readout layer achieves 98.1% test +accuracy on MNIST, which is close to the optimal result achievable with error-backpropagation +in non-convolutional networks of rate neurons with one hidden +layer. To support the design choices of the spiking network, we systematically +compare the classification performance of rate networks with a single hidden +layer, where the weights of this layer are either random and fixed, trained with +unsupervised Principal Component Analysis or Sparse Coding, or trained with +the backpropagation algorithm. 
This comparison revealed, first, that unsupervised +learning does not lead to better performance than fixed random projections for +large hidden layers on digit classification (MNIST) and object recognition (CIFAR10); +second, networks with random projections and localized receptive fields +perform significantly better than networks with all-to-all connectivity and almost +reach the performance of networks trained with the backpropagation algorithm. +The performance of these simple random projection networks is comparable to +most current models of bio-plausible deep learning and thus provides an interesting +benchmark for future approaches.",/pdf/8328494c5d217bda8468467da8e145609b15c351.pdf,ICLR,2019,Spiking networks using localized random projections and STDP challenge current MNIST benchmark models for bio-plausible deep learning +ba82GniSJdc,o5amnKw3xLr,1601310000000.0,1614990000000.0,304,Task Calibration for Distributional Uncertainty in Few-Shot Classification,"[""~Sungnyun_Kim1"", ""~Se-Young_Yun1""]","[""Sungnyun Kim"", ""Se-Young Yun""]","[""few-shot learning"", ""meta-learning"", ""uncertainty estimation""]","As numerous meta-learning algorithms improve performance when solving few-shot classification problems for practical applications, accurate prediction of uncertainty, though challenging, has been considered essential. In this study, we contemplate modeling uncertainty in a few-shot classification framework and propose a straightforward method that appropriately predicts task uncertainty. We suppose that the random sampling of tasks can generate those in which it may be hard for the model to infer the queries from the support examples. Specifically, measuring the distributional mismatch between support and query sets via class-wise similarities, we propose novel meta-training that lets the model predict with careful confidence. Moreover, our method is algorithm-agnostic and readily expanded to include a range of meta-learning models. 
Through extensive experiments including dataset shift, we present that our training strategy helps the model avoid being indiscriminately confident, and thereby, produce calibrated classification results without the loss of accuracy.",/pdf/4cb98bda9becb90d7df5ee490da711e17c38b9aa.pdf,ICLR,2021, +Sy3XxCx0Z,BkiQxAx0Z,1509120000000.0,1518730000000.0,441,Natural Language Inference with External Knowledge,"[""cq1231@mail.ustc.edu.cn"", ""xiaodan.zhu@queensu.ca"", ""zhling@ustc.edu.cn"", ""diana.inkpen@uottawa.ca""]","[""Qian Chen"", ""Xiaodan Zhu"", ""Zhen-Hua Ling"", ""Diana Inkpen""]","[""natural language inference"", ""external knowledge"", ""state of the art""]","Modeling informal inference in natural language is very challenging. With the recent availability of large annotated data, it has become feasible to train complex models such as neural networks to perform natural language inference (NLI), which have achieved state-of-the-art performance. Although there exist relatively large annotated data, can machines learn all knowledge needed to perform NLI from the data? If not, how can NLI models benefit from external knowledge and how to build NLI models to leverage it? In this paper, we aim to answer these questions by enriching the state-of-the-art neural natural language inference models with external knowledge. We demonstrate that the proposed models with external knowledge further improve the state of the art on the Stanford Natural Language Inference (SNLI) dataset. ",/pdf/87ccc2c9ede781d0333680f36c73c674be067f4b.pdf,ICLR,2018,the proposed models with external knowledge further improve the state of the art on the SNLI dataset. 
+HkzZBi0cFQ,HJxCng_fYm,1538090000000.0,1545360000000.0,62,Quantization for Rapid Deployment of Deep Neural Networks,"[""junhaeng2.lee@samsung.com"", ""sw815.ha@samsung.com"", ""sincere.choi@samsung.com"", ""w-j.lee@samsung.com"", ""seungw.lee@samsung.com""]","[""Jun Haeng Lee"", ""Sangwon Ha"", ""Saerom Choi"", ""Won-Jo Lee"", ""Seungwon Lee""]",[],"This paper aims at rapid deployment of the state-of-the-art deep neural networks (DNNs) to energy efficient accelerators without time-consuming fine tuning or the availability of the full datasets. Converting DNNs in full precision to limited precision is essential in taking advantage of the accelerators with reduced memory footprint and computation power. However, such a task is not trivial since it often requires the full training and validation datasets for profiling the network statistics and fine tuning the networks to recover the accuracy lost after quantization. To address these issues, we propose a simple method recognizing channel-level distribution to reduce the quantization-induced accuracy loss and minimize the required image samples for profiling. We evaluated our method on eleven networks trained on the ImageNet classification benchmark and a network trained on the Pascal VOC object detection benchmark. The results prove that the networks can be quantized into 8-bit integer precision without fine tuning.",/pdf/5864e94427975d97800a5d8618480224312c914d.pdf,ICLR,2019, +S1lACa4YDS,H1lDkmf_wr,1569440000000.0,1577170000000.0,876,Meta-Learning for Variational Inference,"[""rz297@cornell.edu"", ""yingzhen.li@microsoft.com"", ""cdesa@cs.cornell.edu"", ""sam.devlin@microsoft.com"", ""cheng.zhang@microsoft.com""]","[""Ruqi Zhang"", ""Yingzhen Li"", ""Chris De Sa"", ""Sam Devlin"", ""Cheng Zhang""]","[""Variational inference"", ""Meta-learning""]","Variational inference (VI) plays an essential role in approximate Bayesian inference due to its computational efficiency and general applicability. 
+Crucial to the performance of VI is the selection of the divergence measure in the optimization objective, as it affects the properties of the approximate posterior significantly. In this paper, we propose a meta-learning algorithm to learn (i) the divergence measure suited for the task of interest to automate the design of the VI method; and (ii) initialization of the variational parameters, which reduces the number of VI optimization steps drastically. We demonstrate the learned divergence outperforms the hand-designed divergence on Gaussian mixture distribution approximation, Bayesian neural network regression, and partial variational autoencoder based recommender systems.",/pdf/3e6bd24c98886a22ea2ea77cd7a78a422f3bc7ed.pdf,ICLR,2020, +Bke_DertPB,rygqojeFDH,1569440000000.0,1583910000000.0,2365,Adversarial Lipschitz Regularization,"[""david.terjek92@gmail.com""]","[""D\u00e1vid Terj\u00e9k""]","[""generative adversarial networks"", ""wasserstein generative adversarial networks"", ""lipschitz regularization"", ""adversarial training""]","Generative adversarial networks (GANs) are one of the most popular approaches when it comes to training generative models, among which variants of Wasserstein GANs are considered superior to the standard GAN formulation in terms of learning stability and sample quality. However, Wasserstein GANs require the critic to be 1-Lipschitz, which is often enforced implicitly by penalizing the norm of its gradient, or by globally restricting its Lipschitz constant via weight normalization techniques. Training with a regularization term penalizing the violation of the Lipschitz constraint explicitly, instead of through the norm of the gradient, was found to be practically infeasible in most situations. 
Inspired by Virtual Adversarial Training, we propose a method called Adversarial Lipschitz Regularization, and show that using an explicit Lipschitz penalty is indeed viable and leads to competitive performance when applied to Wasserstein GANs, highlighting an important connection between Lipschitz regularization and adversarial training.",/pdf/875fb3e608e6625c223a3ac5bf9bae1da679b7af.pdf,ICLR,2020,alternative to gradient penalty +r1eiu2VtwH,H1e_Oz1uLS,1569440000000.0,1583910000000.0,53,Neural Oblivious Decision Ensembles for Deep Learning on Tabular Data,"[""sapopov@yandex-team.ru"", ""stanis-morozov@yandex.ru"", ""artem.babenko@phystech.edu""]","[""Sergei Popov"", ""Stanislav Morozov"", ""Artem Babenko""]","[""tabular data"", ""architectures"", ""DNN""]","Nowadays, deep neural networks (DNNs) have become the main instrument for machine learning tasks within a wide range of domains, including vision, NLP, and speech. Meanwhile, in an important case of heterogenous tabular data, the advantage of DNNs over shallow counterparts remains questionable. In particular, there is no sufficient evidence that deep learning machinery allows constructing methods that outperform gradient boosting decision trees (GBDT), which are often the top choice for tabular problems. In this paper, we introduce Neural Oblivious Decision Ensembles (NODE), a new deep learning architecture, designed to work with any tabular data. In a nutshell, the proposed NODE architecture generalizes ensembles of oblivious decision trees, but benefits from both end-to-end gradient-based optimization and the power of multi-layer hierarchical representation learning. With an extensive experimental comparison to the leading GBDT packages on a large number of tabular datasets, we demonstrate the advantage of the proposed NODE architecture, which outperforms the competitors on most of the tasks. 
We open-source the PyTorch implementation of NODE and believe that it will become a universal framework for machine learning on tabular data.",/pdf/a232ed6c3759aed211ca949a7fe572d04605f653.pdf,ICLR,2020,We propose a new DNN architecture for deep learning on tabular data +#NAME?,6obWYLmuN3y,1601310000000.0,1616020000000.0,2541,In Defense of Pseudo-Labeling: An Uncertainty-Aware Pseudo-label Selection Framework for Semi-Supervised Learning,"[""~Mamshad_Nayeem_Rizve1"", ""~Kevin_Duarte1"", ""~Yogesh_S_Rawat1"", ""~Mubarak_Shah3""]","[""Mamshad Nayeem Rizve"", ""Kevin Duarte"", ""Yogesh S Rawat"", ""Mubarak Shah""]","[""Semi-Supervised Learning"", ""Pseudo-Labeling"", ""Uncertainty"", ""Calibration"", ""Deep Learning""]","The recent research in semi-supervised learning (SSL) is mostly dominated by consistency regularization based methods which achieve strong performance. However, they heavily rely on domain-specific data augmentations, which are not easy to generate for all data modalities. Pseudo-labeling (PL) is a general SSL approach that does not have this constraint but performs relatively poorly in its original formulation. We argue that PL underperforms due to the erroneous high confidence predictions from poorly calibrated models; these predictions generate many incorrect pseudo-labels, leading to noisy training. We propose an uncertainty-aware pseudo-label selection (UPS) framework which improves pseudo labeling accuracy by drastically reducing the amount of noise encountered in the training process. Furthermore, UPS generalizes the pseudo-labeling process, allowing for the creation of negative pseudo-labels; these negative pseudo-labels can be used for multi-label classification as well as negative learning to improve the single-label classification. We achieve strong performance when compared to recent SSL methods on the CIFAR-10 and CIFAR-100 datasets. 
Also, we demonstrate the versatility of our method on the video dataset UCF-101 and the multi-label dataset Pascal VOC.",/pdf/c979bcaed90f2b14dbf27b5e90fdbb74407f161b.pdf,ICLR,2021,We present an uncertainty-aware pseudo-label selection framework for semi-supervised learning which greatly reduces the noise introduced by the pseudo-labeling process. +cL4wkyoxyDJ,NvFemz-PR2o,1601310000000.0,1614990000000.0,296,Towards Counteracting Adversarial Perturbations to Resist Adversarial Examples,"[""~Haimin_ZHANG1"", ""~Min_Xu5""]","[""Haimin ZHANG"", ""Min Xu""]","[""adversarial robustness"", ""resisting adversarial examples""]","Studies show that neural networks are susceptible to adversarial attacks. This exposes a potential threat to neural network-based artificial intelligence systems. We observe that the probability of the correct result outputted by the network increases by applying small perturbations generated for class labels other than the original predicted one to adversarial examples. Based on this observation, we propose a method of counteracting adversarial perturbations to resist adversarial examples. In our method, we randomly select a number of class labels and generate small perturbations for these selected labels. The generated perturbations are added together and then clamped to a specified space. The obtained perturbation is finally added to the adversarial example to counteract the adversarial perturbation contained in the example. The proposed method is applied at inference time and does not require retraining or finetuning the model. We validate the proposed method on CIFAR-10 and CIFAR-100. 
The experimental results demonstrate that our method effectively improves the defense performance of the baseline methods, especially against strong adversarial examples generated using more iterations.",/pdf/1f333a0a3909163d7fefb639933365702372974d.pdf,ICLR,2021,This paper proposes a method that uses small first-order perturbations to defend against adversarial attacks. +SJfPFjA9Fm,H1gizWicFX,1538090000000.0,1551910000000.0,455,ACCELERATING NONCONVEX LEARNING VIA REPLICA EXCHANGE LANGEVIN DIFFUSION,"[""yichen2016@u.northwestern.edu"", ""jinglinc@illinois.edu"", ""jd2736@columbia.edu"", ""jianpeng@illinois.edu"", ""zhaoranwang@gmail.com""]","[""Yi Chen"", ""Jinglin Chen"", ""Jing Dong"", ""Jian Peng"", ""Zhaoran Wang""]",[],"Langevin diffusion is a powerful method for nonconvex optimization, which enables the escape from local minima by injecting noise into the gradient. In particular, the temperature parameter controlling the noise level gives rise to a tradeoff between ``global exploration'' and ``local exploitation'', which correspond to high and low temperatures. To attain the advantages of both regimes, we propose to use replica exchange, which swaps between two Langevin diffusions with different temperatures. We theoretically analyze the acceleration effect of replica exchange from two perspectives: (i) the convergence in $\chi^2$-divergence, and (ii) the large deviation principle. Such an acceleration effect allows us to faster approach the global minima. Furthermore, by discretizing the replica exchange Langevin diffusion, we obtain a discrete-time algorithm. For such an algorithm, we quantify its discretization error in theory and demonstrate its acceleration effect in practice. 
",/pdf/61073f72577a50f075e4f466d2ec08b720670c92.pdf,ICLR,2019, +lWaz5a9lcFU,0-mDHt0Fmz,1601310000000.0,1616040000000.0,67,EEC: Learning to Encode and Regenerate Images for Continual Learning,"[""~Ali_Ayub1"", ""~Alan_Wagner2""]","[""Ali Ayub"", ""Alan Wagner""]","[""Continual Learning"", ""Catastrophic Forgetting"", ""Cognitively-inspired Learning""]","The two main impediments to continual learning are catastrophic forgetting and memory limitations on the storage of data. To cope with these challenges, we propose a novel, cognitively-inspired approach which trains autoencoders with Neural Style Transfer to encode and store images. Reconstructed images from encoded episodes are replayed when training the classifier model on a new task to avoid catastrophic forgetting. The loss function for the reconstructed images is weighted to reduce its effect during classifier training to cope with image degradation. When the system runs out of memory the encoded episodes are converted into centroids and covariance matrices, which are used to generate pseudo-images during classifier training, keeping classifier performance stable with less memory. Our approach increases classification accuracy by 13-17% over state-of-the-art methods on benchmark datasets, while requiring 78% less storage space.",/pdf/e0fae0a996116aba06f3a1e0ff97fd0078ca47d2.pdf,ICLR,2021,We train autoencoders with Neural Style Transfer to replay old tasks data for continual learning. The encoded features are converted into centroids and covariances to keep memory footprint from growing while keeping classifier performance stable. +H1gsz30cKX,HJxQcfC9tX,1538090000000.0,1552350000000.0,1297,Fixup Initialization: Residual Learning Without Normalization,"[""hongyiz@mit.edu"", ""yann@dauphin.io"", ""tengyuma@stanford.edu""]","[""Hongyi Zhang"", ""Yann N. 
Dauphin"", ""Tengyu Ma""]","[""deep learning"", ""residual networks"", ""initialization"", ""batch normalization"", ""layer normalization""]","Normalization layers are a staple in state-of-the-art deep neural network architectures. They are widely believed to stabilize training, enable higher learning rate, accelerate convergence and improve generalization, though the reason for their effectiveness is still an active research topic. In this work, we challenge the commonly-held beliefs by showing that none of the perceived benefits is unique to normalization. Specifically, we propose fixed-update initialization (Fixup), an initialization motivated by solving the exploding and vanishing gradient problem at the beginning of training via properly rescaling a standard initialization. We find training residual networks with Fixup to be as stable as training with normalization -- even for networks with 10,000 layers. Furthermore, with proper regularization, Fixup enables residual networks without normalization to achieve state-of-the-art performance in image classification and machine translation.",/pdf/05cb835b985926148d75a3864c4f566ff6c459be.pdf,ICLR,2019,All you need to train deep residual networks is a good initialization; normalization layers are not necessary. +S1Ow_e-Rb,rkwwdlZAb,1509130000000.0,1518730000000.0,606,How do deep convolutional neural networks learn from raw audio waveforms?,"[""ygong1@nd.edu"", ""cpoellab@nd.edu""]","[""Yuan Gong"", ""Christian Poellabauer""]","[""Convolutional neural networks"", ""Audio processing"", ""Speech processing""]","Prior work on speech and audio processing has demonstrated the ability to obtain excellent performance when learning directly from raw audio waveforms using convolutional neural networks (CNNs). However, the exact inner workings of a CNN remain unclear, which hinders further developments and improvements into this direction. 
In this paper, we theoretically analyze and explain how deep CNNs learn from raw audio waveforms and identify potential limitations of existing network structures. Based on this analysis, we further propose a new network architecture (called SimpleNet), which offers a very simple but concise structure and high model interpretability. ",/pdf/c9682e22a6541db936a911318ebf3efafee3ca2f.pdf,ICLR,2018, +rJlqoTEtDB,HklQdV1OwH,1569440000000.0,1577170000000.0,753,PowerSGD: Powered Stochastic Gradient Descent Methods for Accelerated Non-Convex Optimization,"[""j.liu@uwaterloo.ca"", ""zhoubt@hust.edu.cn"", ""sunweigao@outlook.com"", ""ruijuanchen@hust.edu.cn"", ""tomlin@eecs.berkeley.edu"", ""yye@hust.edu.cn""]","[""Jun Liu"", ""Beitong Zhou"", ""Weigao Sun"", ""Ruijuan Chen"", ""Claire J. Tomlin"", ""Ye Yuan""]","[""stochastic gradient descent"", ""non-convex optimization"", ""powerball function"", ""acceleration""]","In this paper, we propose a novel technique for improving the stochastic gradient descent (SGD) method to train deep networks, which we term \emph{PowerSGD}. The proposed PowerSGD method simply raises the stochastic gradient to a certain power $\gamma\in[0,1]$ during iterations and introduces only one additional parameter, namely, the power exponent $\gamma$ (when $\gamma=1$, PowerSGD reduces to SGD). We further propose PowerSGD with momentum, which we term \emph{PowerSGDM}, and provide convergence rate analysis on both PowerSGD and PowerSGDM methods. Experiments are conducted on popular deep learning models and benchmark datasets. Empirical results show that the proposed PowerSGD and PowerSGDM obtain faster initial training speed than adaptive gradient methods, comparable generalization ability with SGD, and improved robustness to hyper-parameter selection and vanishing gradients. PowerSGD is essentially a gradient modifier via a nonlinear transformation. 
As such, it is orthogonal and complementary to other techniques for accelerating gradient-based optimization. ",/pdf/382c34338a2d67d1f42dd117328cecc1ef57d115.pdf,ICLR,2020,We propose a new class of optimizers for accelerated non-convex optimization via a nonlinear gradient transformation. +H1gS364FwS,Bkg_j1euvB,1569440000000.0,1577170000000.0,779,Event extraction from unstructured Amharic text,"[""ephe11ta@gmail.com"", ""rosatsegaye@gmail.com"", ""kuulaa@gmail.com""]","[""Ephrem Tadesse"", ""Rosa Tsegaye"", ""Kuulaa Qaqqabaa""]","[""Event extraction"", ""machine learning classifiers"", ""Nominal events""]","In information extraction, event extraction is one of the types that extract the specific knowledge of certain incidents from texts. Event extraction has been done on different languages texts but not on one of the Semitic language Amharic. In this study, we present a system that extracts an event from unstructured Amharic text. The system has designed by the integration of supervised machine learning and rule-based approaches together. We call it a hybrid system. The model from the supervised machine learning detects events from the text, then, handcrafted rules and the rule-based rules extract the event from the text. The hybrid system has compared with the standalone rule-based method that is well known for event extraction. The study has shown that the hybrid system has outperformed the standalone rule-based method. For the event extraction, we have been extracting event arguments. Event arguments identify event triggering words or phrases that clearly express the occurrence of the event. The event argument attributes can be verbs, nouns, occasionally adjectives such as ሰርግ/wedding and time as well.",/pdf/ed3f0358b109578122b6509603b59fe3e3bb0e0c.pdf,ICLR,2020,This paper extract events from Amharic text. 
+HyGcghRct7,HJeFZPJ9Ym,1538090000000.0,1550870000000.0,1103,Random mesh projectors for inverse problems,"[""kkothar3@illinois.edu"", ""gupta67@illinois.edu"", ""mdehoop@rice.edu"", ""dokmanic@illinois.edu""]","[""Konik Kothari*"", ""Sidharth Gupta*"", ""Maarten v. de Hoop"", ""Ivan Dokmanic""]","[""imaging"", ""inverse problems"", ""subspace projections"", ""random Delaunay triangulations"", ""CNN"", ""geophysics"", ""regularization""]","We propose a new learning-based approach to solve ill-posed inverse problems in imaging. We address the case where ground truth training samples are rare and the problem is severely ill-posed---both because of the underlying physics and because we can only get few measurements. This setting is common in geophysical imaging and remote sensing. We show that in this case the common approach to directly learn the mapping from the measured data to the reconstruction becomes unstable. Instead, we propose to first learn an ensemble of simpler mappings from the data to projections of the unknown image into random piecewise-constant subspaces. We then combine the projections to form a final reconstruction by solving a deconvolution-like problem. We show experimentally that the proposed method is more robust to measurement noise and corruptions not seen during training than a directly learned inverse.",/pdf/5e6c0c38852c72d569bf5a787cfdb24e433aaa7a.pdf,ICLR,2019,We solve ill-posed inverse problems with scarce ground truth examples by estimating an ensemble of random projections of the model instead of the model itself. 
+H1lefTEKDS,Bkx38BY8PB,1569440000000.0,1577170000000.0,399,Benchmarking Model-Based Reinforcement Learning,"[""tingwuwang@cs.toronto.edu"", ""xuchan.bao@mail.utoronto.ca"", ""iclavera@berkeley.edu"", ""jhoang@cs.toronto.edu"", ""ywen@cs.toronto.edu"", ""edl@cs.toronto.edu"", ""matthew.zhang@mail.utoronto.ca"", ""gdzhang@cs.toronto.edu"", ""pabbeel@cs.berkeley.edu"", ""jba@cs.toronto.edu""]","[""Tingwu Wang"", ""Xuchan Bao"", ""Ignasi Clavera"", ""Jerrick Hoang"", ""Yeming Wen"", ""Eric Langlois"", ""Shunshi Zhang"", ""Guodong Zhang"", ""Pieter Abbeel"", ""Jimmy Ba""]","[""Reinforcement learning"", ""model based Reinforcement learning"", ""Benchmarking""]","Model-based reinforcement learning (MBRL) is widely seen as having the potential +to be significantly more sample efficient than model-free RL. However, research in +model-based RL has not been very standardized. It is fairly common for authors to +experiment with self-designed environments, and there are several separate lines of +research, which are sometimes closed-sourced or not reproducible. Accordingly, it +is an open question how these various existing algorithms perform relative to each +other. To facilitate research in MBRL, in this paper we gather a wide collection +of MBRL algorithms and propose over 18 benchmarking environments specially +designed for MBRL. We benchmark these algorithms with unified problem settings, +including noisy environments. Beyond cataloguing performance, we explore +and unify the underlying algorithmic differences across MBRL algorithms. We +characterize three key research challenges for future MBRL research: the dynamics +bottleneck, the planning horizon dilemma, and the early-termination dilemma. 
+Finally, to facilitate future research on MBRL, we open-source our benchmark.",/pdf/e7af495c74dc695ddfdef50c99b483a991db88dd.pdf,ICLR,2020,Benchmarking Model-Based Reinforcement Learning in continuous control tasks +H1gN6kSFwS,B1gDXEkFPS,1569440000000.0,1577170000000.0,1986,Learning Neural Causal Models from Unknown Interventions,"[""rosemary.nan.ke@gmail.com"", ""obilaniu@gmail.com"", ""anirudhgoyal9119@gmail.com"", ""stefan.a.bauer@gmail.com"", ""hugolarochelle@google.com"", ""chris.j.pal@gmail.com"", ""yoshua.bengio@mila.quebec""]","[""Nan Rosemary Ke"", ""Olexa Bilaniuk"", ""Anirudh Goyal"", ""Stephan Bauer"", ""Hugol Larochelle"", ""Chris Pal"", ""Yoshua Bengio""]","[""deep learning"", ""graphical models"", ""meta learning""]","Meta-learning over a set of distributions can be interpreted as learning different types of parameters corresponding to short-term vs long-term aspects of the mechanisms underlying the generation of data. These are respectively captured by quickly-changing \textit{parameters} and slowly-changing \textit{meta-parameters}. We present a new framework for meta-learning causal models where the relationship between each variable and its parents is modeled by a neural network, modulated by structural meta-parameters which capture the overall topology of a directed graphical model. Our approach avoids a discrete search over models in favour of a continuous optimization procedure. We study a setting where interventional distributions are induced as a result of a random intervention on a single unknown variable of an unknown ground truth causal model, and the observations arising after such an intervention constitute one meta-example. To disentangle the slow-changing aspects of each conditional from the fast-changing adaptations to each intervention, we parametrize the neural network into fast parameters and slow meta-parameters. 
We introduce a meta-learning objective that favours solutions \textit{robust} to frequent but sparse interventional distribution change, and which generalize well to previously unseen interventions. Optimizing this objective is shown experimentally to recover the structure of the causal graph. Finally, we find that when the learner is unaware of the intervention variable, it is able to infer that information, improving results further and focusing the parameter and meta-parameter updates where needed.",/pdf/909851dd475666c92f50fb6a01f5c71b8b6432f3.pdf,ICLR,2020,Using end-to-end deep learning to discover the structure of a graphical model which is robust to interventions and trained without knowing what the interventions are +HJGwcKclx,,1478300000000.0,1492430000000.0,516,Soft Weight-Sharing for Neural Network Compression,"[""karen.ullrich@uva.nl"", ""tmeeds@gmail.com"", ""welling.max@gmail.com""]","[""Karen Ullrich"", ""Edward Meeds"", ""Max Welling""]","[""Deep learning"", ""Optimization""]","The success of deep learning in numerous application domains created the desire to run and train them on mobile devices. This however, conflicts with their computationally, memory and energy intense nature, leading to a growing interest in compression. +Recent work by Han et al. (2016) propose a pipeline that involves retraining, pruning and quantization of neural network weights, obtaining state-of-the-art compression rates. +In this paper, we show that competitive compression rates can be achieved by using a version of ""soft weight-sharing"" (Nowlan & Hinton, 1991). Our method achieves both quantization and pruning in one simple (re-)training procedure. +This point of view also exposes the relation between compression and the minimum description length (MDL) principle. ",/pdf/30991bcc3803c08ad381efb5299066b7e9558ba2.pdf,ICLR,2017,We use soft weight-sharing to compress neural network weights. 
+HJGXzmspb,HyG7zXoTZ,1508750000000.0,1518730000000.0,47,Training and Inference with Integers in Deep Neural Networks,"[""wus15@mails.tsinghua.edu.cn"", ""liguoqi@mail.tsinghua.edu.cn"", ""chenfeng@mail.tsinghua.edu.cn"", ""lpshi@mail.tsinghua.edu.cn""]","[""Shuang Wu"", ""Guoqi Li"", ""Feng Chen"", ""Luping Shi""]","[""quantization"", ""training"", ""bitwidth"", ""ternary weights""]","Researches on deep neural networks with discrete parameters and their deployment in embedded systems have been active and promising topics. Although previous works have successfully reduced precision in inference, transferring both training and inference processes to low-bitwidth integers has not been demonstrated simultaneously. In this work, we develop a new method termed as ``""WAGE"" to discretize both training and inference, where weights (W), activations (A), gradients (G) and errors (E) among layers are shifted and linearly constrained to low-bitwidth integers. To perform pure discrete dataflow for fixed-point devices, we further replace batch normalization by a constant scaling layer and simplify other components that are arduous for integer implementation. Improved accuracies can be obtained on multiple datasets, which indicates that WAGE somehow acts as a type of regularization. 
Empirically, we demonstrate the potential to deploy training in hardware systems such as integer-based deep learning accelerators and neuromorphic chips with comparable accuracy and higher energy efficiency, which is crucial to future AI applications in variable scenarios with transfer and continual learning demands.",/pdf/516345e75eb2cf918642c571d05976a33898d715.pdf,ICLR,2018,We apply training and inference with only low-bitwidth integers in DNNs +S1gBgnR9Y7,BklI_1ccYm,1538090000000.0,1545360000000.0,1070,End-to-end learning of pharmacological assays from high-resolution microscopy images,"[""hofmarcher@ml.jku.at"", ""rumetshofer@ml.jku.at"", ""hochreit@ml.jku.at"", ""klambauer@ml.jku.at""]","[""Markus Hofmarcher"", ""Elisabeth Rumetshofer"", ""Sepp Hochreiter"", ""G\u00fcnter Klambauer""]","[""Convolutional Neural Networks"", ""High-resolution images"", ""Multiple-Instance Learning"", ""Drug Discovery"", ""Molecular Biology""]","Predicting the outcome of pharmacological assays based on high-resolution microscopy +images of treated cells is a crucial task in drug discovery which tremendously +increases discovery rates. However, end-to-end learning on these images +with convolutional neural networks (CNNs) has not been ventured for this task +because it has been considered infeasible and overly complex. On the largest +available public dataset, we compare several state-of-the-art CNNs trained in an +end-to-end fashion with models based on a cell-centric approach involving segmentation. +We found that CNNs operating on full images containing hundreds +of cells perform significantly better at assay prediction than networks operating +on a single-cell level. Surprisingly, we could predict 29% of the 209 pharmacological +assays at high predictive performance (AUC > 0.9). 
We compared a +novel CNN architecture called “GapNet” against four competing CNN architectures +and found that it performs on par with the best methods and at the same time +has the lowest training time. Our results demonstrate that end-to-end learning on +high-resolution imaging data is not only possible but even outperforms cell-centric +and segmentation-dependent approaches. Hence, the costly cell segmentation and +feature extraction steps are not necessary, in fact they even hamper predictive performance. +Our work further suggests that many pharmacological assays could +be replaced by high-resolution microscopy imaging together with convolutional +neural networks.",/pdf/fe37e3036fcbbc64d83e27660d42d148a502e1d6.pdf,ICLR,2019, +rJlJ-2CqtX,B1eHgWA5tX,1538090000000.0,1545360000000.0,1134,Success at any cost: value constrained model-free continuous control,"[""sbohez@google.com"", ""aabdolmaleki@google.com"", ""neunertm@google.com"", ""buchli@google.com"", ""heess@google.com"", ""raia@google.com""]","[""Steven Bohez"", ""Abbas Abdolmaleki"", ""Michael Neunert"", ""Jonas Buchli"", ""Nicolas Heess"", ""Raia Hadsell""]","[""reinforcement learning"", ""continuous control"", ""robotics"", ""constrained optimization"", ""multi-objective optimization""]","Naively applying Reinforcement Learning algorithms to continuous control problems -- such as locomotion and robot control -- to maximize task reward often results in policies which rely on high-amplitude, high-frequency control signals, known colloquially as bang-bang control. While such policies can implement the optimal solution, particularly in simulated systems, they are often not desirable for real world systems since bang-bang control can lead to increased wear and tear and energy consumption and tends to excite undesired second-order dynamics. To counteract this issue, multi-objective optimization can be used to simultaneously optimize both the reward and some auxiliary cost that discourages undesired (e.g. 
high-amplitude) control. In principle, such an approach can yield the sought after, smooth, control policies. It can, however, be hard to find the correct trade-off between cost and return that results in the desired behavior. In this paper we propose a new constraint-based approach which defines a lower bound on the return while minimizing one or more costs (such as control effort). We employ Lagrangian relaxation to learn both (a) the parameters of a control policy that satisfies the desired constraints and (b) the Lagrangian multipliers for the optimization. Moreover, we demonstrate policy optimization which satisfies constraints either in expectation or in a per-step fashion, and we learn a single conditional policy that is able to dynamically change the trade-off between return and cost. We demonstrate the efficiency of our approach using a number of continuous control benchmark tasks as well as a realistic, energy-optimized quadruped locomotion task.",/pdf/9aec4778c427d21b35ddb4d551d06582515c9b60.pdf,ICLR,2019,"We apply constrained optimization to continuous control tasks subject to a penalty to ensure a lower bound on the return, and learn the resulting conditional Lagrangian multipliers simultaneously with the policy." +#NAME?,bsB2X_HD9Dj,1601310000000.0,1614990000000.0,1023,Rethinking Convolution: Towards an Optimal Efficiency,"[""~Tao_Wei1"", ""~Yonghong_Tian1"", ""~Chang_Wen_Chen1""]","[""Tao Wei"", ""Yonghong Tian"", ""Chang Wen Chen""]",[],"In this paper, we present our recent research about the computational efficiency in convolution. Convolution operation is the most critical component in recent surge of deep learning research. Conventional 2D convolution takes $O(C^{2}K^{2}HW)$ to calculate, where $C$ is the channel size, $K$ is the kernel size, while $H$ and $W$ are the output height and width. Such computation has become really costly considering that these parameters increased over the past few years to meet the needs of demanding applications. 
Among various implementation of the convolution, separable convolution has been proven to be more efficient in reducing the computational demand. For example, depth separable convolution reduces the complexity to $O(CHW\cdot(C+K^{2}))$ while spatial separable convolution reduces the complexity to $O(C^{2}KHW)$. However, these are considered an ad hoc design which cannot ensure that they can in general achieve optimal separation. In this research, we propose a novel operator called \emph{optimal separable convolution} which can be calculated at $O(C^{\frac{3}{2}}KHW)$ by optimal design for the internal number of groups and kernel sizes for general separable convolutions. When there is no restriction in the number of separated convolutions, an even lower complexity at $O(CHW\cdot\log(CK^{2}))$ can be achieved. Experimental results demonstrate that the proposed optimal separable convolution is able to achieve an improved accuracy-FLOPs and accuracy-#Params trade-offs over both conventional and depth/spatial separable convolutions.",/pdf/2b296a7d6d78b6b052a13387791cc11a7178f750.pdf,ICLR,2021, +r1E0OsA9tX,SJg3g3tcFX,1538090000000.0,1545360000000.0,403,Learning From the Experience of Others: Approximate Empirical Bayes in Neural Networks,"[""han.zhao@cs.cmu.edu"", ""yaohungt@cs.cmu.edu"", ""rsalakhu@cs.cmu.edu"", ""ggordon@cs.cmu.edu""]","[""Han Zhao"", ""Yao-Hung Hubert Tsai"", ""Ruslan Salakhutdinov"", ""Geoff Gordon""]","[""Empirical Bayes"", ""Bayesian Deep Learning""]","Learning deep neural networks could be understood as the combination of representation learning and learning halfspaces. While most previous work aims to diversify representation learning by data augmentations and regularizations, we explore the opposite direction through the lens of empirical Bayes method. Specifically, we propose a matrix-variate normal prior whose covariance matrix has a Kronecker product structure to capture the correlations in learning different neurons through backpropagation. 
The prior encourages neurons to learn from the experience of others, hence it provides an effective regularization when training large networks on small datasets. To optimize the model, we design an efficient block coordinate descent algorithm with analytic solutions. Empirically, we show that the proposed method helps the network converge to better local optima that also generalize better, and we verify the effectiveness of the approach on both multiclass classification and multitask regression problems with various network structures. ",/pdf/e2e5f44c53d9106e0d8eac99cb211444ca9f9b70.pdf,ICLR,2019, +SJiHXGWAZ,r1jBmfbCb,1509140000000.0,1519330000000.0,804,Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting,"[""yaguang@usc.edu"", ""rose@caltech.edu"", ""shahabi@usc.edu"", ""yanliu.cs@usc.edu""]","[""Yaguang Li"", ""Rose Yu"", ""Cyrus Shahabi"", ""Yan Liu""]","[""Traffic prediction"", ""spatiotemporal forecasting"", ""diffusion"", ""graph convolution"", ""random walk"", ""long-term forecasting""]","Spatiotemporal forecasting has various applications in neuroscience, climate and transportation domain. Traffic forecasting is one canonical example of such learning task. The task is challenging due to (1) complex spatial dependency on road networks, (2) non-linear temporal dynamics with changing road conditions and (3) inherent difficulty of long-term forecasting. To address these challenges, we propose to model the traffic flow as a diffusion process on a directed graph and introduce Diffusion Convolutional Recurrent Neural Network (DCRNN), a deep learning framework for traffic forecasting that incorporates both spatial and temporal dependency in the traffic flow. Specifically, DCRNN captures the spatial dependency using bidirectional random walks on the graph, and the temporal dependency using the encoder-decoder architecture with scheduled sampling. 
We evaluate the framework on two real-world large-scale road network traffic datasets and observe consistent improvement of 12% - 15% over state-of-the-art baselines",/pdf/2d6e22060ead279ab261e0243001f58aa4b37c5f.pdf,ICLR,2018,A neural sequence model that learns to forecast on a directed graph. +Syxss0EYPS,SJgQQ7tuPS,1569440000000.0,1577170000000.0,1331,Agent as Scientist: Learning to Verify Hypotheses,"[""kdmarino@cs.cmu.edu"", ""fergus@cs.nyu.edu"", ""aszlam@fb.com"", ""abhinavg@cs.cmu.edu""]","[""Kenneth Marino"", ""Rob Fergus"", ""Arthur Szlam"", ""Abhinav Gupta""]",[],"In this paper, we formulate hypothesis verification as a reinforcement learning problem. Specifically, we aim to build an agent that, given a hypothesis about the dynamics of the world can take actions to generate observations which can help predict whether the hypothesis is true or false. Our first observation is that agents trained end-to-end with the reward fail to learn to solve this problem. In order to train the agents, we exploit the underlying structure in the majority of hypotheses -- they can be formulated as triplets (pre-condition, action sequence, post-condition). Once the agents have been pretrained to verify hypotheses with this structure, they can be fine-tuned to verify more general hypotheses. Our work takes a step towards a ``scientist agent'' that develops an understanding of the world by generating and testing hypotheses about its environment.",/pdf/9048a84e46ecb65831e1e06a8d0988018d13e87a.pdf,ICLR,2020, +ByzcS3AcYX,ryeyFG0qYQ,1538090000000.0,1550900000000.0,1570,Neural TTS Stylization with Adversarial and Collaborative Games,"[""shuangma@buffalo.edu"", ""damcduff@microsoft.com"", ""yalesong@csail.mit.edu""]","[""Shuang Ma"", ""Daniel Mcduff"", ""Yale Song""]","[""Text-To-Speech synthesis"", ""GANs""]","The modeling of style when synthesizing natural human speech from text has been the focus of significant attention. 
Some state-of-the-art approaches train an encoder-decoder network on paired text and audio samples (x_txt, x_aud) by encouraging its output to reconstruct x_aud. The synthesized audio waveform is expected to contain the verbal content of x_txt and the auditory style of x_aud. Unfortunately, modeling style in TTS is somewhat under-determined and training models with a reconstruction loss alone is insufficient to disentangle content and style from other factors of variation. In this work, we introduce an end-to-end TTS model that offers enhanced content-style disentanglement ability and controllability. We achieve this by combining a pairwise training procedure, an adversarial game, and a collaborative game into one training scheme. The adversarial game concentrates the true data distribution, and the collaborative game minimizes the distance between real samples and generated samples in both the original space and the latent space. As a result, the proposed model delivers a highly controllable generator, and a disentangled representation. Benefiting from the separate modeling of style and content, our model can generate human fidelity speech that satisfies the desired style conditions. Our model achieves state-of-the-art results across multiple tasks, including style transfer (content and style swapping), emotion modeling, and identity transfer (fitting a new speaker's voice).",/pdf/6d5794a4f1f02090ae73f699de775147ad35630d.pdf,ICLR,2019,a generative adversarial network for style modeling in a text-to-speech system +HyRnez-RW,H1pnezWRW,1509130000000.0,1519430000000.0,774,Multi-Mention Learning for Reading Comprehension with Neural Cascades,"[""swabha@cs.cmu.edu"", ""aparikh@google.com"", ""tomkwiat@google.com""]","[""Swabha Swayamdipta"", ""Ankur P. 
Parikh"", ""Tom Kwiatkowski""]","[""reading comprehension"", ""multi-loss"", ""question answering"", ""scalable"", ""TriviaQA"", ""feed-forward"", ""latent variable"", ""attention""]","Reading comprehension is a challenging task, especially when executed across longer or across multiple evidence documents, where the answer is likely to reoccur. Existing neural architectures typically do not scale to the entire evidence, and hence, resort to selecting a single passage in the document (either via truncation or other means), and carefully searching for the answer within that passage. However, in some cases, this strategy can be suboptimal, since by focusing on a specific passage, it becomes difficult to leverage multiple mentions of the same answer throughout the document. In this work, we take a different approach by constructing lightweight models that are combined in a cascade to find the answer. Each submodel consists only of feed-forward networks equipped with an attention mechanism, making it trivially parallelizable. We show that our approach can scale to approximately an order of magnitude larger evidence documents and can aggregate information from multiple mentions of each answer candidate across the document. Empirically, our approach achieves state-of-the-art performance on both the Wikipedia and web domains of the TriviaQA dataset, outperforming more complex, recurrent architectures.",/pdf/cbcbb8d05678fdc8f59e81b09eb91e8debee4b58.pdf,ICLR,2018,"We propose neural cascades, a simple and trivially parallelizable approach to reading comprehension, consisting only of feed-forward nets and attention that achieves state-of-the-art performance on the TriviaQA dataset." 
+rJlNKCNtPB,r1eOKmOdwr,1569440000000.0,1577170000000.0,1244,Adaptive Learned Bloom Filter (Ada-BF): Efficient Utilization of the Classifier,"[""zd11@rice.edu"", ""anshumali@rice.edu""]","[""Zhenwei Dai"", ""Anshumali Shrivastava""]","[""Ada-BF"", ""Bloom filter"", ""machine learning"", ""memory efficient""]","Recent work suggests improving the performance of Bloom filter by incorporating a machine learning model as a binary classifier. However, such learned Bloom filter does not take full advantage of the predicted probability scores. We proposed new algorithms that generalize the learned Bloom filter by using the complete spectrum of the scores regions. We proved our algorithms have lower False Positive Rate (FPR) and memory usage compared with the existing approaches to learned Bloom filter. We also demonstrated the improved performance of our algorithms on real-world datasets.",/pdf/1d455c57569ef90999a9ce5235fb005720bd916a.pdf,ICLR,2020,Propose an efficient algorithm to improve the Bloom filter by incorporating the machine learning model in a clever way +27acGyyI1BY,pcZv_isvL9k,1601310000000.0,1616000000000.0,3579,Neural ODE Processes,"[""alex.norcliffe98@gmail.com"", ""~Cristian_Bodnar1"", ""~Ben_Day1"", ""jm2311@cam.ac.uk"", ""~Pietro_Li\u00f21""]","[""Alexander Norcliffe"", ""Cristian Bodnar"", ""Ben Day"", ""Jacob Moss"", ""Pietro Li\u00f2""]","[""differential equations"", ""neural processes"", ""dynamics"", ""deep learning"", ""neural ode""]","Neural Ordinary Differential Equations (NODEs) use a neural network to model the instantaneous rate of change in the state of a system. However, despite their apparent suitability for dynamics-governed time-series, NODEs present a few disadvantages. First, they are unable to adapt to incoming data-points, a fundamental requirement for real-time applications imposed by the natural direction of time. 
Second, time-series are often composed of a sparse set of measurements that could be explained by many possible underlying dynamics. NODEs do not capture this uncertainty. In contrast, Neural Processes (NPs) are a new class of stochastic processes providing uncertainty estimation and fast data-adaptation, but lack an explicit treatment of the flow of time. To address these problems, we introduce Neural ODE Processes (NDPs), a new class of stochastic processes determined by a distribution over Neural ODEs. By maintaining an adaptive data-dependent distribution over the underlying ODE, we show that our model can successfully capture the dynamics of low-dimensional systems from just a few data-points. At the same time, we demonstrate that NDPs scale up to challenging high-dimensional time-series with unknown latent dynamics such as rotating MNIST digits. ",/pdf/cd93774612deb1de0bb3d59b6ecfd4411fbac85f.pdf,ICLR,2021,Neural Processes with time-awareness +r1lZgyBYwS,Syg8MGs_vS,1569440000000.0,1583910000000.0,1496,HiLLoC: lossless image compression with hierarchical latent variable models,"[""james.townsend@cs.ucl.ac.uk"", ""thomas.bird@cs.ucl.ac.uk"", ""julius.kunze@cs.ucl.ac.uk"", ""david.barber@ucl.ac.uk""]","[""James Townsend"", ""Thomas Bird"", ""Julius Kunze"", ""David Barber""]","[""compression"", ""variational inference"", ""lossless compression"", ""deep latent variable models""]","We make the following striking observation: fully convolutional VAE models trained on 32x32 ImageNet can generalize well, not just to 64x64 but also to far larger photographs, with no changes to the model. We use this property, applying fully convolutional models to lossless compression, demonstrating a method to scale the VAE-based 'Bits-Back with ANS' algorithm for lossless compression to large color photographs, and achieving state of the art for compression of full size ImageNet images. 
We release Craystack, an open source library for convenient prototyping of lossless compression using probabilistic models, along with full implementations of all of our compression results.",/pdf/34119517f309200cc11db1cbe6d3f127b858fbcf.pdf,ICLR,2020,"We scale up lossless compression with latent variables, achieving state of the art on full-size ImageNet images." +HyMS8iRcK7,SkgSccnLFm,1538090000000.0,1545360000000.0,171,SEQUENCE MODELLING WITH AUTO-ADDRESSING AND RECURRENT MEMORY INTEGRATING NETWORKS,"[""zhanghengli@pku.edu.cn"", ""jxzhong@pku.edu.cn"", ""jjhuang@pku.edu.cn"", ""t_zhang@pku.edu.cn"", ""thomasli@pkusz.edu.cn"", ""geli@ece.pku.edu.cn""]","[""Zhangheng Li"", ""Jia-Xing Zhong"", ""Jingjia Huang"", ""Tao Zhang"", ""Thomas Li"", ""Ge Li""]","[""Memory Network"", ""RNN"", ""Sequence Modelling""]","Processing sequential data with long term dependencies and learning complex transitions are two major challenges in many deep learning applications. In this paper, we introduce a novel architecture, the Auto-addressing and Recurrent Memory Integrating Network (ARMIN) to address these issues. The ARMIN explicitly stores previous hidden states and recurrently integrates useful past states into the current time-step by an efficient memory addressing mechanism. Compared to existing memory networks, the ARMIN is more light-weight and inference-time efficient. Our network can be trained on small slices of long sequential data, and thus, can boost its training speed. Experiments on various tasks demonstrate the efficiency of the ARMIN architecture. Codes and models will be available.",/pdf/f377a7f2ac4b1486d69acc89ee5d404fccfd7214.pdf,ICLR,2019,We propose a light-weight Memory-Augmented RNN (MARNN) for sequence modelling.
+B1gzLaNYvr,SygnSmdwDH,1569440000000.0,1577170000000.0,551,TSInsight: A local-global attribution framework for interpretability in time-series data,"[""shoaib_ahmed.siddiqui@dfki.de"", ""dominique.mercier@dfki.de"", ""andreas.dengel@dfki.de"", ""sheraz.ahmed@dfki.de""]","[""Shoaib Ahmed Siddiqui"", ""Dominique Mercier"", ""Andreas Dengel"", ""Sheraz Ahmed""]","[""Deep Learning"", ""Representation Learning"", ""Convolutional Neural Networks"", ""Time-Series Analysis"", ""Feature Importance"", ""Visualization"", ""Demystification""]","With the rise in employment of deep learning methods in safety-critical scenarios, interpretability is more essential than ever before. Although many different directions regarding interpretability have been explored for visual modalities, time-series data has been neglected with only a handful of methods tested due to their poor intelligibility. We approach the problem of interpretability in a novel way by proposing TSInsight where we attach an auto-encoder with a sparsity-inducing norm on its output to the classifier and fine-tune it based on the gradients from the classifier and a reconstruction penalty. The auto-encoder learns to preserve features that are important for the prediction by the classifier and suppresses the ones that are irrelevant i.e. serves as a feature attribution method to boost interpretability. In other words, we ask the network to only reconstruct parts which are useful for the classifier i.e. are correlated or causal for the prediction. In contrast to most other attribution frameworks, TSInsight is capable of generating both instance-based and model-based explanations. We evaluated TSInsight along with other commonly used attribution methods on a range of different time-series datasets to validate its efficacy. Furthermore, we analyzed the set of properties that TSInsight achieves out of the box including adversarial robustness and output space contraction. 
The obtained results advocate that TSInsight can be an effective tool for the interpretability of deep time-series models.",/pdf/bca86c726e22ebe6140fda9db57283806b943d19.pdf,ICLR,2020,We present an attribution technique leveraging sparsity inducing norms to achieve interpretability. +xVzlFUD3uC,JXUzDGstNm,1601310000000.0,1614990000000.0,753,Data augmentation as stochastic optimization,"[""~Boris_Hanin1"", ""~Yi_Sun3""]","[""Boris Hanin"", ""Yi Sun""]","[""data augmentation"", ""stochastic optimization"", ""scheduling"", ""convex optimization"", ""overparametrization""]","We present a theoretical framework recasting data augmentation as stochastic optimization for a sequence of time-varying proxy losses. This provides a unified language for understanding techniques commonly thought of as data augmentation, including synthetic noise and label-preserving transformations, as well as more traditional ideas in stochastic optimization such as learning rate and batch size scheduling. We then specialize our framework to study arbitrary augmentations in the context of a simple model (overparameterized linear regression). We extend in this setting the classical Monro-Robbins theorem to include augmentation and obtain rates of convergence, giving conditions on the learning rate and augmentation schedule under which augmented gradient descent converges. Special cases give provably good schedules for augmentation with additive noise, minibatch SGD, and minibatch SGD with noise.",/pdf/e335d16b39e6773dcba943ddf475061c0e75e184.pdf,ICLR,2021,We develop a principled theoretical approach relating data augmentation to stochastic optimization and apply it to obtain provably good augmentation schedules in the overparametrized linear setting. 
+B1eEKi0qYQ,SJg_s659KX,1538090000000.0,1545360000000.0,434,Interactive Parallel Exploration for Reinforcement Learning in Continuous Action Spaces,"[""wy.jung@kaist.ac.kr"", ""gs.park@kaist.ac.kr"", ""ycsung@kaist.ac.kr""]","[""Whiyoung Jung"", ""Giseung Park"", ""Youngchul Sung""]","[""reinforcement learning"", ""continuous action space RL""]","In this paper, a new interactive parallel learning scheme is proposed to enhance the performance of off-policy continuous-action reinforcement learning. In the proposed interactive parallel learning scheme, multiple identical learners with their own value-functions and policies share a common experience replay buffer, and search for a good policy in collaboration with the guidance of the best policy information. The information of the best policy is fused in a soft manner by constructing an augmented loss function for policy update to enlarge the overall search space by the multiple learners. The guidance by the previous best policy and the enlarged search space by the proposed interactive parallel learning scheme enable faster and better policy search in the policy parameter space. Working algorithms are constructed by applying the proposed interactive parallel learning scheme to several off-policy reinforcement learning algorithms such as the twin delayed deep deterministic (TD3) policy gradient algorithm and the soft actor-critic (SAC) algorithm, and numerical results show that the constructed IPE-enhanced algorithms outperform most of the current state-of-the-art reinforcement learning algorithms for continuous action control.",/pdf/6fdadb08a71ac98a3b07187e325fbddd22a2d140.pdf,ICLR,2019, +BylRkAEKDH,Skl6HymdPS,1569440000000.0,1577170000000.0,911,TabNet: Attentive Interpretable Tabular Learning,"[""soarik@google.com"", ""tpfister@google.com""]","[""Sercan O.
Arik"", ""Tomas Pfister""]","[""Tabular data"", ""interpretable neural networks"", ""attention models""]","We propose a novel high-performance interpretable deep tabular data learning network, TabNet. TabNet utilizes a sequential attention mechanism that softly selects features to reason from at each decision step and then aggregates the processed information to make a final prediction decision. By explicitly selecting sparse features, TabNet learns very efficiently as the model capacity at each decision step is fully utilized for the most relevant features, resulting in a high performance model. This sparsity also enables more interpretable decision making through the visualization of feature selection masks. We demonstrate that TabNet outperforms other neural network and decision tree variants on a wide range of tabular data learning datasets and yields interpretable feature attributions and insights into the global model behavior.",/pdf/a697c415227fcb1698f53db6eb5a25b37053bacb.pdf,ICLR,2020,We propose a novel high-performance interpretable deep tabular data learning network. +HNA0kUAFdbv,TfgGZ5qgM34,1601310000000.0,1614990000000.0,3557,CANVASEMB: Learning Layout Representation with Large-scale Pre-training for Graphic Design,"[""~Yuxi_Xie1"", ""~Danqing_Huang1"", ""wjp.pku@gmail.com"", ""~Chin-Yew_Lin1""]","[""Yuxi Xie"", ""Danqing Huang"", ""Jinpeng Wang"", ""Chin-Yew Lin""]","[""Layout Representation"", ""Pre-training""]","Layout representation, which models visual elements in a canvas and their inter-relations, plays a crucial role in graphic design intelligence. +With a large variety of layout designs and the unique characteristic of layouts that visual elements are defined as a list of categorical (e.g. shape type) and numerical (e.g. position and size) properties, it is challenging to learn a general and compact representation with limited data. 
Inspired by the recent success of self-supervised pre-training techniques in various natural language processing tasks, in this paper, we propose CanvasEmb (Canvas Embedding), which pre-trains deep representation from unlabeled graphic designs by jointly conditioning on all the context elements in the same canvas, with a multi-dimensional feature encoder and a multi-task learning objective. The pre-trained CanvasEmb model can be fine-tuned with just one additional output layer and with a small size of training data to create models for a wide range of downstream tasks. We verify our approach with presentation slides data. We construct a large-scale dataset with more than one million slides, and propose two novel layout understanding tasks with human labeling sets, namely element role labeling and image captioning. Evaluation results on these two tasks show that our model with fine-tuning achieves state-of-the-art performances. Furthermore, we conduct a deep analysis aiming to understand the modeling mechanism of CanvasEmb, and demonstrate its great potential use on more applications such as layout auto completion and layout retrieval.",/pdf/0af124d5ea9d11e3e4c80ea2049cfb9b5f64c43c.pdf,ICLR,2021,"we propose CanvasEmb for layout representation learning, which pre-trains deep representation from large-scale unlabeled graphic designs and facilitates downstream tasks for design intelligence." +SJg7IsC5KQ,HyeTtDCbKX,1538090000000.0,1545360000000.0,162,On the Convergence and Robustness of Batch Normalization,"[""matcyon@nus.edu.sg"", ""liqix@ihpc.a-star.edu.sg"", ""matzuows@nus.edu.sg""]","[""Yongqiang Cai"", ""Qianxiao Li"", ""Zuowei Shen""]","[""Batch normalization"", ""Convergence analysis"", ""Gradient descent"", ""Ordinary least squares"", ""Deep neural network""]","Despite its empirical success, the theoretical underpinnings of the stability, convergence and acceleration properties of batch normalization (BN) remain elusive. 
In this paper, we attack this problem from a modelling approach, where we perform thorough theoretical analysis on BN applied to a simplified model: ordinary least squares (OLS). We discover that gradient descent on OLS with BN has interesting properties, including a scaling law, convergence for arbitrary learning rates for the weights, asymptotic acceleration effects, as well as insensitivity to choice of learning rates. We then demonstrate numerically that these findings are not specific to the OLS problem and hold qualitatively for more complex supervised learning problems. This points to a new direction towards uncovering the mathematical principles that underlie batch normalization.",/pdf/a0a09b83a8a320849fdf58cc14c47e565d18547a.pdf,ICLR,2019,We mathematically analyze the effect of batch normalization on a simple model and obtain key new insights that apply to general supervised learning. +PXDdWQDBsCG,PyOYWEmvoJM,1601310000000.0,1614990000000.0,91,Shape Defense,"[""~ali_borji1""]","[""ali borji""]","[""adversarial robustness"", ""adversarial defense"", ""adversarial attack"", ""shape"", ""background subtraction""]","Humans rely heavily on shape information to recognize objects. Conversely, convolutional +neural networks (CNNs) are biased more towards texture. This fact +is perhaps the main reason why CNNs are susceptible to adversarial examples. +Here, we explore how shape bias can be incorporated into CNNs to improve their +robustness. Two algorithms are proposed, based on the observation that edges are +invariant to moderate imperceptible perturbations. In the first one, a classifier is +adversarially trained on images with the edge map as an additional channel. At +inference time, the edge map is recomputed and concatenated to the image. In the +second algorithm, a conditional GAN is trained to translate the edge maps, from +clean and/or perturbed images, into clean images.
The inference is done over the +generated image corresponding to the input’s edge map. A large number of experiments +with more than 10 data sets have proved the effectiveness of the proposed +algorithms against FGSM and $\ell_\infty$ PGD-40 attacks. +Further, we show that edge information can a) benefit other adversarial training methods, b) be even more effective +in conjunction with background subtraction, c) be used to defend against poisoning +attacks, and d) make CNNs more robust against natural image corruptions +such as motion blur, impulse noise, and JPEG compression, than CNNs trained +solely on RGB images. From a broader perspective, our study suggests that CNNs +do not adequately account for image structures and operations that are crucial for +robustness. The code is available at: https://github.com/[masked].",/pdf/f9d1df3495aaa212eaa7b389a519ae9fe84aac5c.pdf,ICLR,2021,"Inspired by human vision, we propose two adversarial defense methods that utilize shape, and show that edge redetection makes models robust to adversarial attacks such as FGSM and PGD-40." +HkewNJStDr,HkxNMph_DS,1569440000000.0,1577170000000.0,1658,Efficient High-Dimensional Data Representation Learning via Semi-Stochastic Block Coordinate Descent Methods,"[""bkwei028@gmail.com"", ""1615401247li@gmail.com"", ""fhshang@xidian.edu.cn"", ""yyliu@xidian.edu.cn"", ""hyliu@xidian.edu.cn"", ""jane.shen@pensees.ai""]","[""Bingkun Wei"", ""Yangyang Li"", ""Fanhua Shang"", ""Yuanyuan Liu"", ""Hongying Liu"", ""Shengmei Shen""]","[""Sparse learning"", ""Hard thresholding"", ""High-dimensional regression""]","With the increase of data volume and data dimension, sparse representation learning attracts more and more attention. For high-dimensional data, randomized block coordinate descent methods perform well because they do not need to calculate the gradient along the whole dimension.
Existing hard thresholding algorithms evaluate gradients followed by a hard thresholding operation to update the model parameter, which leads to slow convergence. To address this issue, we propose a novel hard thresholding algorithm, called Semi-stochastic Block Coordinate Descent Hard Thresholding Pursuit (SBCD-HTP). Moreover, we present its sparse and asynchronous parallel variants. We theoretically analyze the convergence properties of our algorithms, which show that they have a significantly lower hard thresholding complexity than existing algorithms. Our empirical evaluations on real-world datasets and face recognition tasks demonstrate the superior performance of our algorithms for sparsity-constrained optimization problems.",/pdf/a73c57879f77d857754f6e15f65fcb3b9304da17.pdf,ICLR,2020, +rJxF73R9tX,Bkx26o69Km,1538090000000.0,1545360000000.0,1381,Knows When it Doesn’t Know: Deep Abstaining Classifiers,"[""sunil@lanl.gov"", ""tanmoy@lanl.gov"", ""bilmes@uw.edu"", ""gchennupati@lanl.gov"", ""jamal@lanl.gov""]","[""Sunil Thulasidasan"", ""Tanmoy Bhattacharya"", ""Jeffrey Bilmes"", ""Gopinath Chennupati"", ""Jamal Mohd-Yusof""]","[""deep learning"", ""robust learning"", ""abstention"", ""representation learning"", ""abstaining classifier"", ""open-set detection""]","We introduce the deep abstaining classifier -- a deep neural network trained with a novel loss function that provides an abstention option during training. This allows the DNN to abstain on confusing or difficult-to-learn examples while improving performance on the non-abstained samples. 
We show that such deep abstaining classifiers can: (i) learn representations for structured noise -- where noisy training labels or confusing examples are correlated with underlying features -- and then learn to abstain based on such features; (ii) enable robust learning in the presence of arbitrary or unstructured noise by identifying noisy samples; and (iii) be used as an effective out-of-category detector that learns to reliably abstain when presented with samples from unknown classes. We provide analytical results on loss function behavior that enable automatic tuning of accuracy and coverage, and demonstrate the utility of the deep abstaining classifier using multiple image benchmarks. Results indicate significant improvement in learning in the presence of label noise.",/pdf/d748f17dd4c23d2a45b79db6b5b55b715f9d3a83.pdf,ICLR,2019,A deep abstaining neural network trained with a novel loss function that learns representations for when to abstain enabling robust learning in the presence of different types of noise. +rJl3S2A9t7,Skg-0h5tt7,1538090000000.0,1545360000000.0,1577,Policy Optimization via Stochastic Recursive Gradient Algorithm,"[""yuanhz@pku.edu.cn"", ""junchi.li.duke@gmail.com"", ""yuhaotang97@gmail.com"", ""yuren.zhou@duke.edu""]","[""Huizhuo Yuan"", ""Chris Junchi Li"", ""Yuhao Tang"", ""Yuren Zhou""]","[""reinforcement learning"", ""policy gradient"", ""variance reduction"", ""stochastic recursive gradient algorithm""]","In this paper, we propose the StochAstic Recursive grAdient Policy Optimization (SARAPO) algorithm which is a novel variance reduction method on Trust Region Policy Optimization (TRPO). The algorithm incorporates the StochAstic Recursive grAdient algoritHm (SARAH) into the TRPO framework. Compared with the existing Stochastic Variance Reduced Policy Optimization (SVRPO), our algorithm is more stable in the variance.
Furthermore, through theoretical analysis of the ordinary differential equation and the stochastic differential equation (ODE/SDE) of SARAH, we analyze its convergence property and stability. Our experiments demonstrate its performance on a variety of benchmark tasks. We show that our algorithm gets better improvement in each iteration and matches or even outperforms SVRPO and TRPO. +",/pdf/7d40e2c6a08eaad3b498c61952a6fbfabc34581e.pdf,ICLR,2019,"This paper proposes the StochAstic Recursive Gradient Policy Optimization (SARAPO) algorithm based on the novel SARAH method, and exemplifies its advantages over existing policy gradient methods from both theory and experiments." +HJMXTsCqYQ,Syl0iSTqY7,1538090000000.0,1545360000000.0,790,Constrained Bayesian Optimization for Automatic Chemical Design,"[""rrg27@cam.ac.uk"", ""jmh233@cam.ac.uk""]","[""Ryan-Rhys Griffiths"", ""Jos\u00e9 Miguel Hern\u00e1ndez-Lobato""]","[""Bayesian Optimization"", ""Generative Models""]","Automatic Chemical Design provides a framework for generating novel molecules with optimized molecular properties. The current model suffers from the pathology that it tends to produce invalid molecular structures. By reformulating the search procedure as a constrained Bayesian optimization problem, we showcase improvements in both the validity and quality of the generated molecules. We demonstrate that the model consistently produces novel molecules ranking above the 90th percentile of the distribution over training set scores across a range of objective functions.
Importantly, our method suffers no degradation in the complexity or the diversity of the generated molecules.",/pdf/46ca8dd5165cf6bcab2612a307a19c62ec0fb7d3.pdf,ICLR,2019, +ryxnHhRqFm,ryl8EsY9tm,1538090000000.0,1547540000000.0,1581,Global-to-local Memory Pointer Networks for Task-Oriented Dialogue,"[""jason.wu@connect.ust.hk"", ""rsocher@salesforce.com"", ""cxiong@salesforce.com""]","[""Chien-Sheng Wu"", ""Richard Socher"", ""Caiming Xiong""]","[""pointer networks"", ""memory networks"", ""task-oriented dialogue systems"", ""natural language processing""]","End-to-end task-oriented dialogue is challenging since knowledge bases are usually large, dynamic and hard to incorporate into a learning framework. We propose the global-to-local memory pointer (GLMP) networks to address this issue. In our model, a global memory encoder and a local memory decoder are proposed to share external knowledge. The encoder encodes dialogue history, modifies global contextual representation, and generates a global memory pointer. The decoder first generates a sketch response with unfilled slots. Next, it passes the global memory pointer to filter the external knowledge for relevant information, then instantiates the slots via the local memory pointers. We empirically show that our model can improve copy accuracy and mitigate the common out-of-vocabulary problem. As a result, GLMP is able to improve over the previous state-of-the-art models in both simulated bAbI Dialogue dataset and human-human Stanford Multi-domain Dialogue dataset on automatic and human evaluation.",/pdf/db8acf203bd7c5a73dd104a07690d8f6f8d85757.pdf,ICLR,2019,"GLMP: Global memory encoder (context RNN, global pointer) and local memory decoder (sketch RNN, local pointer) that share external knowledge (MemNN) are proposed to strengthen response generation in task-oriented dialogue." 
+FyucNzzMba-,2tuR60FwSE1,1601310000000.0,1614990000000.0,38,Forward Prediction for Physical Reasoning,"[""~Rohit_Girdhar5"", ""~Laura_Gustafson1"", ""~Aaron_B._Adcock1"", ""~Laurens_van_der_Maaten3""]","[""Rohit Girdhar"", ""Laura Gustafson"", ""Aaron B. Adcock"", ""Laurens van der Maaten""]","[""Forward prediction"", ""physical reasoning""]","Physical reasoning requires forward prediction: the ability to forecast what will happen next given some initial world state. We study the performance of state-of-the-art forward-prediction models in the complex physical-reasoning tasks of the PHYRE benchmark (Bakhtin et al., 2019). We do so by incorporating models that operate on object or pixel-based representations of the world into simple physical-reasoning agents. We find that forward-prediction models can improve physical-reasoning performance, particularly on complex tasks that involve many objects. However, we also find that these improvements are contingent on the test tasks being small variations of train tasks, and that generalization to completely new task templates is challenging. Surprisingly, we observe that forward predictors with better pixel accuracy do not necessarily lead to better physical-reasoning performance. Nevertheless, our best models set a new state-of-the-art on the PHYRE benchmark.",/pdf/e180e8c4f7d7853340e173bc01a551dbe946e4c3.pdf,ICLR,2021,"We experiment with forward prediction models through the lens of a challenging new physical reasoning benchmark, PHYRE. We establish a new state of the art, and report rigorous analysis on where these models work and where future work is needed." +W1uVrPNO8Bw,c6CLibx0ID,1601310000000.0,1614990000000.0,2444,Implicit Regularization of SGD via Thermophoresis,"[""~Mingwei_Wei1"", ""~David_J._Schwab1""]","[""Mingwei Wei"", ""David J. 
Schwab""]","[""SGD"", ""regularization"", ""generalization"", ""statistical mechanics"", ""thermophoresis""]","A central ingredient in the impressive predictive performance of deep neural networks is optimization via stochastic gradient descent (SGD). While some theoretical progress has been made, the effect of SGD in neural networks is still unclear, especially during the early phase of training. Here we generalize the theory of thermophoresis from statistical mechanics and show that there exists an effective entropic force from SGD that pushes to reduce the gradient variance. We study this effect in detail in a simple two-layer model, where the thermophoretic force functions to decrease the weight norm and activation rate of the units. The strength of this effect is proportional to squared learning rate and inverse batch size, and is more effective during the early phase of training when the model's predictions are poor. Lastly we test our quantitative predictions with experiments on various models and datasets. +",/pdf/b012a7427521ec5de78a72abe830fa502b9db2f7.pdf,ICLR,2021,We generalize the theory of thermophoresis to show that there exists an effective entropic force from SGD. +SyAbZb-0Z,ry0ZZW-C-,1509130000000.0,1518730000000.0,649,Transfer Learning to Learn with Multitask Neural Model Search,"[""catwong@cs.stanford.edu"", ""agesmundo@google.com""]","[""Catherine Wong"", ""Andrea Gesmundo""]","[""Learning to Learn"", ""Meta learning"", ""Reinforcement learning"", ""Transfer learning""]","Deep learning models require extensive architecture design exploration and hyperparameter optimization to perform well on a given task. The exploration of the model design space is often made by a human expert, and optimized using a combination of grid search and search heuristics over a large space of possible choices. Neural Architecture Search (NAS) is a Reinforcement Learning approach that has been proposed to automate architecture design.
NAS has been successfully applied to generate Neural Networks that rival the best human-designed architectures. However, NAS requires sampling, constructing, and training hundreds to thousands of models to achieve well-performing architectures. This procedure needs to be executed from scratch for each new task. The application of NAS to a wide set of tasks currently lacks a way to transfer generalizable knowledge across tasks. +In this paper, we present the Multitask Neural Model Search (MNMS) controller. Our goal is to learn a generalizable framework that can condition model construction on successful model searches for previously seen tasks, thus significantly speeding up the search for new tasks. We demonstrate that MNMS can conduct an automated architecture search for multiple tasks simultaneously while still learning well-performing, specialized models for each task. We then show that pre-trained MNMS controllers can transfer learning to new tasks. By leveraging knowledge from previous searches, we find that pre-trained MNMS models start from a better location in the search space and reduce search time on unseen tasks, while still discovering models that outperform published human-designed models.",/pdf/f6bc0e4771facae5fab7d132c34c17000c9d09bd.pdf,ICLR,2018,"We present Multitask Neural Model Search, a Meta-learner that can design models for multiple tasks simultaneously and transfer learning to unseen tasks." 
+HkxaFoC9KQ,Bke2bwo5FX,1538090000000.0,1550880000000.0,486,Deep reinforcement learning with relational inductive biases,"[""vzambaldi@google.com"", ""draposo@google.com"", ""adamsantoro@google.com"", ""vbapst@google.com"", ""yujiali@google.com"", ""ibab@google.com"", ""karltuyls@google.com"", ""reichert@google.com"", ""countzero@google.com"", ""locked@google.com"", ""mshanahan@google.com"", ""vlangston@google.com"", ""razp@google.com"", ""botvinick@google.com"", ""vinyals@google.com"", ""peterbattaglia@google.com""]","[""Vinicius Zambaldi"", ""David Raposo"", ""Adam Santoro"", ""Victor Bapst"", ""Yujia Li"", ""Igor Babuschkin"", ""Karl Tuyls"", ""David Reichert"", ""Timothy Lillicrap"", ""Edward Lockhart"", ""Murray Shanahan"", ""Victoria Langston"", ""Razvan Pascanu"", ""Matthew Botvinick"", ""Oriol Vinyals"", ""Peter Battaglia""]","[""relational reasoning"", ""reinforcement learning"", ""graph neural networks"", ""starcraft"", ""generalization"", ""inductive bias""]","We introduce an approach for augmenting model-free deep reinforcement learning agents with a mechanism for relational reasoning over structured representations, which improves performance, learning efficiency, generalization, and interpretability. Our architecture encodes an image as a set of vectors, and applies an iterative message-passing procedure to discover and reason about relevant entities and relations in a scene. In six of seven StarCraft II Learning Environment mini-games, our agent achieved state-of-the-art performance, and surpassed human grandmaster-level on four. In a novel navigation and planning task, our agent's performance and learning efficiency far exceeded non-relational baselines, it was able to generalize to more complex scenes than it had experienced during training. Moreover, when we examined its learned internal representations, they reflected important structure about the problem and the agent's intentions. 
The main contribution of this work is to introduce techniques for representing and reasoning about states in model-free deep reinforcement learning agents via relational inductive biases. Our experiments show this approach can offer advantages in efficiency, generalization, and interpretability, and can scale up to meet some of the most challenging test environments in modern artificial intelligence.",/pdf/b1ca2a9380e6534782a5ba79e91f2971a1b8ab8a.pdf,ICLR,2019,Relational inductive biases improve out-of-distribution generalization capacities in model-free reinforcement learning agents +S1erpeBFPB,HklBGbZYPr,1569440000000.0,1583910000000.0,2572,How to 0wn the NAS in Your Spare Time,"[""shhong@cs.umd.edu"", ""michael.davinroy@gmail.com"", ""cankaya@umiacs.umd.edu"", ""danadach@ece.umd.edu"", ""tdumitra@umiacs.umd.edu""]","[""Sanghyun Hong"", ""Michael Davinroy"", ""Yi\u01e7itcan Kaya"", ""Dana Dachman-Soled"", ""Tudor Dumitra\u015f""]","[""Reconstructing Novel Deep Learning Systems""]","New data processing pipelines and novel network architectures increasingly drive the success of deep learning. In consequence, the industry considers top-performing architectures as intellectual property and devotes considerable computational resources to discovering such architectures through neural architecture search (NAS). This provides an incentive for adversaries to steal these novel architectures; when used in the cloud, to provide Machine Learning as a Service (MLaaS), the adversaries also have an opportunity to reconstruct the architectures by exploiting a range of hardware side-channels. However, it is challenging to reconstruct novel architectures and pipelines without knowing the computational graph (e.g., the layers, branches or skip connections), the architectural parameters (e.g., the number of filters in a convolutional layer) or the specific pre-processing steps (e.g. embeddings). 
In this paper, we design an algorithm that reconstructs the key components of a novel deep learning system by exploiting a small amount of information leakage from a cache side-channel attack, Flush+Reload. We use Flush+Reload to infer the trace of computations and the timing for each computation. Our algorithm then generates candidate computational graphs from the trace and eliminates incompatible candidates through a parameter estimation process. We implement our algorithm in PyTorch and Tensorflow. We demonstrate experimentally that we can reconstruct MalConv, a novel data pre-processing pipeline for malware detection, and ProxylessNAS-CPU, a novel network architecture for the ImageNet classification optimized to run on CPUs, without knowing the architecture family. In both cases, we achieve 0% error. These results suggest hardware side channels are a practical attack vector against MLaaS, and more efforts should be devoted to understanding their impact on the security of deep learning systems.",/pdf/598041527370f8bd568a4493447d8e208055cea4.pdf,ICLR,2020,"We design an algorithm that reconstructs the key components of a novel deep learning system by exploiting a small amount of information leakage from a cache side-channel attack, Flush+Reload." +Skld1aVtPB,HkxKO0DSvS,1569440000000.0,1577170000000.0,307,Deep Mining: Detecting Anomalous Patterns in Neural Network Activations with Subset Scanning,"[""skyler@ke.ibm.com"", ""celia.cintas@ibm.com"", ""victor.akinwande1@ibm.com"", ""sriharis.sridharan@ke.ibm.com"", ""mcfowland@umn.edu""]","[""Skyler Speakman"", ""Celia Cintas"", ""Victor Akinwande"", ""Srihari Sridharan"", ""Edward McFowland III""]","[""anomalous pattern detection"", ""subset scanning"", ""node activations"", ""adversarial noise""]","This work views neural networks as data generating systems and applies anomalous pattern detection techniques on that data in order to detect when a network is processing a group of anomalous inputs. 
Detecting anomalies is a critical component for multiple machine learning problems including detecting the presence of adversarial noise added to inputs. More broadly, this work is a step towards giving neural networks the ability to detect groups of out-of-distribution samples. This work introduces Subset Scanning methods from the anomalous pattern detection domain to the task of detecting anomalous inputs to neural networks. Subset Scanning allows us to answer the question: ""Which subset of inputs have larger-than-expected activations at which subset of nodes?"" Framing the adversarial detection problem this way allows us to identify systematic patterns in the activation space that span multiple adversarially noised images. Such images are ""weird together"". Leveraging this common anomalous pattern, we show increased detection power as the proportion of noised images increases in a test set. Detection power and accuracy results are provided for targeted adversarial noise added to CIFAR-10 images on a 20-layer ResNet using the Basic Iterative Method attack. ",/pdf/f7c39049e2d42184c2a68bc6e508dc4d00326338.pdf,ICLR,2020,We efficiently find a subset of images that have higher than expected activations for some subset of nodes. These images appear more anomalous and easier to detect when viewed as a group. +BygNAa4YPH,rygYPdZuvr,1569440000000.0,1577170000000.0,852,Out-of-distribution Detection in Few-shot Classification,"[""wangkua1@cs.toronto.edu"", ""pvicol@cs.toronto.edu"", ""eleni@cs.toronto.edu"", ""cc.liu2018@gmail.com"", ""zemel@cs.toronto.edu""]","[""Kuan-Chieh Wang"", ""Paul Vicol"", ""Eleni Triantafillou"", ""Chia-Cheng Liu"", ""Richard Zemel""]","[""few-shot classification"", ""out-of-distribution detection"", ""uncertainty estimate""]","In many real-world settings, a learning model must perform few-shot classification: learn to classify examples from unseen classes using only a few labeled examples per class. 
+Additionally, to be safely deployed, it should have the ability to detect out-of-distribution inputs: examples that do not belong to any of the classes. +While both few-shot classification and out-of-distribution detection are popular topics, +their combination has not been studied. In this work, we propose tasks for out-of-distribution detection in the few-shot setting and establish benchmark datasets, based on four popular few-shot classification datasets. Then, we propose two new methods for this task and investigate their performance. +In sum, we establish baseline out-of-distribution detection results using standard metrics on new benchmark datasets and show improved results with our proposed methods.",/pdf/4eaf6f094cc13d70bc181b44218fa357c0945fb9.pdf,ICLR,2020,"We quantitatively study out-of-distribution detection in few-shot setting, establish baseline results with ProtoNet, MAML, ABML, and improved upon them." +H13WofbAb,HypC9GWAW,1509140000000.0,1518730000000.0,915,Faster Distributed Synchronous SGD with Weak Synchronization,"[""cx2@illinois.edu"", ""sanmi@illinois.edu"", ""indy@illinois.edu""]","[""Cong Xie"", ""Oluwasanmi O. Koyejo"", ""Indranil Gupta""]","[""distributed"", ""deep learning"", ""straggler""]","Distributed training of deep learning is widely conducted with large neural networks and large datasets. Besides asynchronous stochastic gradient descent~(SGD), synchronous SGD is a reasonable alternative with better convergence guarantees. However, synchronous SGD suffers from stragglers. To make things worse, although there are some strategies dealing with slow workers, the issue of slow servers is commonly ignored. In this paper, we propose a new parameter server~(PS) framework dealing with not only slow workers, but also slow servers by weakening the synchronization criterion. 
The empirical results show good performance when there are stragglers.",/pdf/1c1ff517c33f6e17855efaec19617cddbfe2b617.pdf,ICLR,2018, +H1gmHaEKwB,Hkxyu6BDPr,1569440000000.0,1586580000000.0,517,Data-Independent Neural Pruning via Coresets,"[""bengordoncshaifa@gmail.com"", ""rita@cs.haifa.ac.il"", ""vova@cs.jhu.edu"", ""samsonzhou@gmail.com"", ""dannyf.post@gmail.co""]","[""Ben Mussay"", ""Margarita Osadchy"", ""Vladimir Braverman"", ""Samson Zhou"", ""Dan Feldman""]","[""coresets"", ""neural pruning"", ""network compression""]","Previous work showed empirically that large neural networks can be significantly reduced in size while preserving their accuracy. Model compression became a central research topic, as it is crucial for deployment of neural networks on devices with limited computational and memory resources. The majority of the compression methods are based on heuristics and offer no worst-case guarantees on the trade-off between the compression rate and the approximation error for an arbitrarily new sample. + +We propose the first efficient, data-independent neural pruning algorithm with a provable trade-off between its compression rate and the approximation error for any future test sample. Our method is based on the coreset framework, which finds a small weighted subset of points that provably approximates the original inputs. Specifically, we approximate the output of a layer of neurons by a coreset of neurons in the previous layer and discard the rest. We apply this framework in a layer-by-layer fashion from the top to the bottom. Unlike previous works, our coreset is data independent, meaning that it provably guarantees the accuracy of the function for any input $x\in \mathbb{R}^d$, including an adversarial one. We demonstrate the effectiveness of our method on popular network architectures. 
In particular, our coresets yield 90% compression of the LeNet-300-100 architecture on MNIST while improving the accuracy.",/pdf/5120c7006cf551983e41c4ccd53877c232cc2c21.pdf,ICLR,2020,"We propose an efficient, provable and data independent method for network compression via neural pruning using coresets of neurons -- a novel construction proposed in this paper." +BJgy-n0cK7,Hkgn6AhqFX,1538090000000.0,1545360000000.0,1135,Inter-BMV: Interpolation with Block Motion Vectors for Fast Semantic Segmentation on Video,"[""samvit@eecs.berkeley.edu"", ""jegonzal@cs.berkeley.edu""]","[""Samvit Jain"", ""Joseph Gonzalez""]","[""semantic segmentation"", ""video"", ""efficient inference"", ""video segmentation"", ""video compression""]","Models optimized for accuracy on single images are often prohibitively slow to +run on each frame in a video, especially on challenging dense prediction tasks, +such as semantic segmentation. Recent work exploits the use of optical flow to +warp image features forward from select keyframes, as a means to conserve computation +on video. This approach, however, achieves only limited speedup, even +when optimized, due to the accuracy degradation introduced by repeated forward +warping, and the inference cost of optical flow estimation. To address these problems, +we propose a new scheme that propagates features using the block motion +vectors (BMV) present in compressed video (e.g. H.264 codecs), instead of optical +flow, and bi-directionally warps and fuses features from enclosing keyframes +to capture scene context on each video frame. Our technique, interpolation-BMV, +enables us to accurately estimate the features of intermediate frames, while keeping +inference costs low. We evaluate our system on the CamVid and Cityscapes +datasets, comparing to both a strong single-frame baseline and related work. 
We +find that we are able to substantially accelerate segmentation on video, achieving +near real-time frame rates (20+ frames per second) on large images (e.g. 960 x 720 +pixels), while maintaining competitive accuracy. This represents an improvement +of almost 6x over the single-frame baseline and 2.5x over the fastest prior work.",/pdf/4104c17308e988fc00ff924440548974c27431cc.pdf,ICLR,2019,"We exploit video compression techniques (in particular, the block motion vectors in H.264 video) and feature similarity across frames to accelerate a classical image recognition task, semantic segmentation, on video." +v-9E8egy_i,GoEZ3LCNOfP,1601310000000.0,1614990000000.0,3645,Gated Relational Graph Attention Networks,"[""~Denis_Lukovnikov1"", ""~Asja_Fischer1""]","[""Denis Lukovnikov"", ""Asja Fischer""]","[""graph neural networks"", ""GNN"", ""long-range dependencies"", ""deep GNN"", ""relational GNN""]","Relational Graph Neural Networks (GNN) are a class of GNN that are capable of handling multi-relational graphs. Like all GNNs, they suffer from a drop in performance when training deeper networks, which may be caused by vanishing gradients, over-parameterization, and oversmoothing. Previous works have investigated methods that improve the training of deeper GNNs, which include normalization techniques and various types of skip connection within a node. However, learning long-range patterns in multi-relational graphs using GNNs remains an under-explored topic. In this work, we propose a novel GNN architecture based on the Graph Attention Network (GAT) that uses gated skip connections to improve long-range modeling between nodes and uses a more scalable vector-based approach for parameterizing relations. We perform an extensive experimental analysis on synthetic and real data, focusing explicitly on learning long-range patterns. 
The results indicate that the proposed method significantly outperforms several commonly used relational GNN variants when used in deeper configurations and stays competitive to existing architectures in a shallow setup.",/pdf/d07a1c16bf91ec7447d9ad4bf0af7dd2034c2d00.pdf,ICLR,2021,We propose a novel GAT-based architecture to better model long-range patterns in multi-relational graphs. +0O_cQfw6uEh,lCwGYP4aJ4F,1601310000000.0,1616430000000.0,3694,Gradient Origin Networks,"[""~Sam_Bond-Taylor1"", ""christopher.g.willcocks@durham.ac.uk""]","[""Sam Bond-Taylor"", ""Chris G. Willcocks""]","[""Deep Learning"", ""Generative Models"", ""Implicit Representation""]","This paper proposes a new type of generative model that is able to quickly learn a latent representation without an encoder. This is achieved using empirical Bayes to calculate the expectation of the posterior, which is implemented by initialising a latent vector with zeros, then using the gradient of the log-likelihood of the data with respect to this zero vector as new latent points. The approach has similar characteristics to autoencoders, but with a simpler architecture, and is demonstrated in a variational autoencoder equivalent that permits sampling. This also allows implicit representation networks to learn a space of implicit functions without requiring a hypernetwork, retaining their representation advantages across datasets. The experiments show that the proposed method converges faster, with significantly lower reconstruction error than autoencoders, while requiring half the parameters.",/pdf/afd0b57d3c44292d9b9855c9f88b7f8f152fc9dd.pdf,ICLR,2021,A new model that uses the negative gradient of the loss with respect to the origin as a latent vector is found to be superior to equivalent networks. 
+U_mat0b9iv,SoaaAzY_Jy1,1601310000000.0,1615940000000.0,738,Multi-Prize Lottery Ticket Hypothesis: Finding Accurate Binary Neural Networks by Pruning A Randomly Weighted Network,"[""~James_Diffenderfer1"", ""~Bhavya_Kailkhura1""]","[""James Diffenderfer"", ""Bhavya Kailkhura""]","[""Binary Neural Networks"", ""Pruning"", ""Lottery Ticket Hypothesis""]","Recently, Frankle & Carbin (2019) demonstrated that randomly-initialized dense networks contain subnetworks that once found can be trained to reach test accuracy comparable to the trained dense network. However, finding these high performing trainable subnetworks is expensive, requiring iterative process of training and pruning weights. In this paper, we propose (and prove) a stronger Multi-Prize Lottery Ticket Hypothesis: + +A sufficiently over-parameterized neural network with random weights contains several subnetworks (winning tickets) that (a) have comparable accuracy to a dense target network with learned weights (prize 1), (b) do not require any further training to achieve prize 1 (prize 2), and (c) is robust to extreme forms of quantization (i.e., binary weights and/or activation) (prize 3). + +This provides a new paradigm for learning compact yet highly accurate binary neural networks simply by pruning and quantizing randomly weighted full precision neural networks. We also propose an algorithm for finding multi-prize tickets (MPTs) and test it by performing a series of experiments on CIFAR-10 and ImageNet datasets. Empirical results indicate that as models grow deeper and wider, multi-prize tickets start to reach similar (and sometimes even higher) test accuracy compared to their significantly larger and full-precision counterparts that have been weight-trained. 
Without ever updating the weight values, our MPTs-1/32 not only set new binary weight network state-of-the-art (SOTA) Top-1 accuracy -- 94.8% on CIFAR-10 and 74.03% on ImageNet -- but also outperform their full-precision counterparts by 1.78% and 0.76%, respectively. Further, our MPT-1/1 achieves SOTA Top-1 accuracy (91.9%) for binary neural networks on CIFAR-10. Code and pre-trained models are available at: https://github.com/chrundle/biprop.",/pdf/a893180f3b4ea79ef89fca056e7170bcd084f923.pdf,ICLR,2021,A new paradigm for learning compact yet accurate binary neural networks by pruning and quantizing randomly weighted full precision DNNs +EoFNy62JGd,t5BDojjW3z,1601310000000.0,1615370000000.0,1539,Neural gradients are near-lognormal: improved quantized and sparse training,"[""~Brian_Chmiel1"", ""liadgo2@gmail.com"", ""~Moran_Shkolnik1"", ""~Elad_Hoffer1"", ""~Ron_Banner1"", ""~Daniel_Soudry1""]","[""Brian Chmiel"", ""Liad Ben-Uri"", ""Moran Shkolnik"", ""Elad Hoffer"", ""Ron Banner"", ""Daniel Soudry""]",[],"While training can mostly be accelerated by reducing the time needed to propagate neural gradients (loss gradients with respect to the intermediate neural layer outputs) back throughout the model, most previous works focus on the quantization/pruning of weights and activations. These methods are often not applicable to neural gradients, which have very different statistical properties. Distinguished from weights and activations, we find that the distribution of neural gradients is approximately lognormal. Considering this, we suggest two closed-form analytical methods to reduce the computational and memory burdens of neural gradients. The first method optimizes the floating-point format and scale of the gradients. The second method accurately sets sparsity thresholds for gradient pruning. Each method achieves state-of-the-art results on ImageNet. 
To the best of our knowledge, this paper is the first to (1) quantize the gradients to 6-bit floating-point formats, or (2) achieve up to 85% gradient sparsity --- in each case without accuracy degradation. +Reference implementation accompanies the paper in the supplementary material.",/pdf/6931f8bb8366bb42bd29d92eb0e841b8ac24f29e.pdf,ICLR,2021, +BJxYUaVtPB,Bkl8bFtDDr,1569440000000.0,1577170000000.0,568,Match prediction from group comparison data using neural networks,"[""koishkim@gmail.com"", ""jmj427@lunit.io"", ""chsuh@kaist.ac.kr""]","[""Sunghyun Kim"", ""Minje jang"", ""Changho Suh""]","[""Neural networks"", ""Group comparison"", ""Match prediction"", ""Rank aggregation""]","We explore the match prediction problem where one seeks to estimate the likelihood of a group of M items preferred over another, based on partial group comparison data. Challenges arise in practice. As existing state-of-the-art algorithms are tailored to certain statistical models, we have different best algorithms across distinct scenarios. Worse yet, we have no prior knowledge on the underlying model for a given scenario. These call for a unified approach that can be universally applied to a wide range of scenarios and achieve consistently high performances. To this end, we incorporate deep learning architectures so as to reflect the key structural features that most state-of-the-art algorithms, some of which are optimal in certain settings, share in common. This enables us to infer hidden models underlying a given dataset, which govern in-group interactions and statistical patterns of comparisons, and hence to devise the best algorithm tailored to the dataset at hand. Through extensive experiments on synthetic and real-world datasets, we evaluate our framework in comparison to state-of-the-art algorithms. 
It turns out that our framework consistently leads to the best performance across all datasets in terms of cross entropy loss and prediction accuracy, while the state-of-the-art algorithms suffer from inconsistent performances across different datasets. Furthermore, we show that it can be easily extended to attain satisfactory performances in rank aggregation tasks, suggesting that it can be adaptable for other tasks as well.",/pdf/26d73b4e18d94a6152bad5846e99ab973aef0122.pdf,ICLR,2020,"We investigate the merits of employing neural networks in the match prediction problem where one seeks to estimate the likelihood of a group of M items preferred over another, based on partial group comparison data." +xCcdBRQEDW,_-hMm1Etobp,1601310000000.0,1616040000000.0,2543,PlasticineLab: A Soft-Body Manipulation Benchmark with Differentiable Physics,"[""~Zhiao_Huang1"", ""~Yuanming_Hu1"", ""~Tao_Du1"", ""elegyhunter@gmail.com"", ""~Hao_Su1"", ""~Joshua_B._Tenenbaum1"", ""~Chuang_Gan1""]","[""Zhiao Huang"", ""Yuanming Hu"", ""Tao Du"", ""Siyuan Zhou"", ""Hao Su"", ""Joshua B. Tenenbaum"", ""Chuang Gan""]","[""Soft Body"", ""Differentiable Physics"", ""Benchmark""]","Simulated virtual environments serve as one of the main driving forces behind developing and evaluating skill learning algorithms. However, existing environments typically only simulate rigid body physics. Additionally, the simulation process usually does not provide gradients that might be useful for planning and control optimizations. We introduce a new differentiable physics benchmark called PlasticineLab, which includes a diverse collection of soft body manipulation tasks. In each task, the agent uses manipulators to deform the plasticine into a desired configuration. The underlying physics engine supports differentiable elastic and plastic deformation using the DiffTaichi system, posing many under-explored challenges to robotic agents. 
We evaluate several existing reinforcement learning (RL) methods and gradient-based methods on this benchmark. Experimental results suggest that 1) RL-based approaches struggle to solve most of the tasks efficiently; 2) gradient-based approaches, by optimizing open-loop control sequences with the built-in differentiable physics engine, can rapidly find a solution within tens of iterations, but still fall short on multi-stage tasks that require long-term planning. We expect that PlasticineLab will encourage the development of novel algorithms that combine differentiable physics and RL for more complex physics-based skill learning tasks. PlasticineLab will be made publicly available.",/pdf/9e12adbc865507edd6c3c49f14880127967bfea8.pdf,ICLR,2021,We propose a soft-body manipulation benchmark with differentiable physics support. +ryb83alCZ,rklI26lRW,1509120000000.0,1518730000000.0,429,Towards Unsupervised Classification with Deep Generative Models,"[""dkal@iti.gr"", ""ntina_kotta@yahoo.com"", ""kalamar@iti.gr"", ""anasvaf@iti.gr"", ""a.c.rawstron@leeds.ac.uk"", ""dimitrios.tzovaras@iti.gr"", ""kostas.stamatopoulos@gmail.com""]","[""Dimitris Kalatzis"", ""Konstantia Kotta"", ""Ilias Kalamaras"", ""Anastasios Vafeiadis"", ""Andrew Rawstron"", ""Dimitris Tzovaras"", ""Kostas Stamatopoulos""]","[""variational inference"", ""vae"", ""variational autoencoders"", ""generative modeling"", ""representation learning"", ""classification""]","Deep generative models have advanced the state-of-the-art in semi-supervised classification, however their capacity for deriving useful discriminative features in a completely unsupervised fashion for classification in difficult real-world data sets, where adequate manifold separation is required has not been adequately explored. Most methods rely on defining a pipeline of deriving features via generative modeling and then applying clustering algorithms, separating the modeling and discriminative processes. 
We propose a deep hierarchical generative model which uses a mixture of discrete and continuous distributions to learn to effectively separate the different data manifolds and is trainable end-to-end. We show that by specifying the form of the discrete variable distribution we are imposing a specific structure on the model's latent representations. We test our model's discriminative performance on the task of CLL diagnosis against baselines from the field of computational FC, as well as the Variational Autoencoder literature.",/pdf/ff5681a1a9316f4ac3e53e4a01ccd192d4d6e187.pdf,ICLR,2018,Unsupervised classification via deep generative modeling with controllable feature learning evaluated in a difficult real world task +SylJ1D1C-,rJy1ywJCZ,1509020000000.0,1518730000000.0,128,PDE-Net: Learning PDEs from Data,"[""zlong@pku.edu.cn"", ""luyiping9712@pku.edu.cn"", ""xianzhongma@pku.edu.cn"", ""dongbin@math.pku.edu.cn""]","[""Zichao Long"", ""Yiping Lu"", ""Xianzhong Ma"", ""Bin Dong""]","[""deep convolution network"", ""partial differential equation"", ""physical laws""]","Partial differential equations (PDEs) play a prominent role in many disciplines such as applied mathematics, physics, chemistry, material science, computer science, etc. PDEs are commonly derived based on physical laws or empirical observations. However, the governing equations for many complex systems in modern applications are still not fully known. With the rapid development of sensors, computational power, and data storage in the past decade, huge quantities of data can be easily collected and efficiently stored. Such vast quantity of data offers new opportunities for data-driven discovery of hidden physical laws. Inspired by the latest development of neural network designs in deep learning, we propose a new feed-forward deep network, called PDE-Net, to fulfill two objectives at the same time: to accurately predict dynamics of complex systems and to uncover the underlying hidden PDE models. 
The basic idea of the proposed PDE-Net is to learn differential operators by learning convolution kernels (filters), and apply neural networks or other machine learning methods to approximate the unknown nonlinear responses. Compared with existing approaches, which either assume the form of the nonlinear response is known or fix certain finite difference approximations of differential operators, our approach has the most flexibility by learning both differential operators and the nonlinear responses. A special feature of the proposed PDE-Net is that all filters are properly constrained, which enables us to easily identify the governing PDE models while still maintaining the expressive and predictive power of the network. These constraints are carefully designed by fully exploiting the relation between the orders of differential operators and the orders of sum rules of filters (an important concept originated from wavelet theory). We also discuss relations of the PDE-Net with some existing networks in computer vision such as Network-In-Network (NIN) and Residual Neural Network (ResNet). Numerical experiments show that the PDE-Net has the potential to uncover the hidden PDE of the observed dynamics, and predict the dynamical behavior for a relatively long time, even in a noisy environment.",/pdf/e665c50f60e89faa2f770856dadbf91598940f87.pdf,ICLR,2018,"This paper proposes a new feed-forward network, called PDE-Net, to learn PDEs from data. " +ryGgSsAcFQ,B1loMZhHt7,1538090000000.0,1550930000000.0,57,"Deep, Skinny Neural Networks are not Universal Approximators","[""jejo.math@gmail.com""]","[""Jesse Johnson""]","[""neural network"", ""universality"", ""expressability""]","In order to choose a neural network architecture that will be effective for a particular modeling problem, one must understand the limitations imposed by each of the potential options. 
These limitations are typically described in terms of information theoretic bounds, or by comparing the relative complexity needed to approximate example functions between different architectures. In this paper, we examine the topological constraints that the architecture of a neural network imposes on the level sets of all the functions that it is able to approximate. This approach is novel for both the nature of the limitations and the fact that they are independent of network depth for a broad family of activation functions.",/pdf/3dfd6280ba045de3af9e85183f1db0ea5fe7840a.pdf,ICLR,2019,"This paper proves that skinny neural networks cannot approximate certain functions, no matter how deep they are." +jnMjOctlfbZ,xUIoHErbqOg,1601310000000.0,1614990000000.0,2327,ATOM3D: Tasks On Molecules in Three Dimensions,"[""~Raphael_John_Lamarre_Townshend1"", ""mvoegele@stanford.edu"", ""psuriana@stanford.edu"", ""aderry@stanford.edu"", ""lxpowers@stanford.edu"", ""jlalouda@stanford.edu"", ""sidhikab@stanford.edu"", ""~Brandon_M_Anderson1"", ""~Stephan_Eismann1"", ""~Risi_Kondor1"", ""~Russ_Altman1"", ""~Ron_O._Dror1""]","[""Raphael John Lamarre Townshend"", ""Martin Vogele"", ""Patricia Suriana"", ""Alex Derry"", ""Alex Powers"", ""Yianni Laloudakis"", ""Sidhika Balachandar"", ""Brandon M Anderson"", ""Stephan Eismann"", ""Risi Kondor"", ""Russ Altman"", ""Ron O. Dror""]","[""machine learning"", ""structural biology"", ""biomolecules""]","While a variety of methods have been developed for predicting molecular properties, deep learning networks that operate directly on three-dimensional molecular structure have recently demonstrated particular promise. In this work we present ATOM3D, a collection of both novel and existing datasets spanning several key classes of biomolecules, to systematically assess such learning methods. 
We develop three-dimensional molecular learning networks for each of these tasks, finding that they consistently improve performance relative to one- and two-dimensional methods. The specific choice of architecture proves to be critical for performance, with three-dimensional convolutional networks excelling at tasks involving complex geometries, while graph networks perform well on systems requiring detailed positional information. Furthermore, equivariant networks show significant promise but are currently unable to scale. Our results indicate many molecular problems stand to gain from three-dimensional molecular learning. All code and datasets are available at github.com/xxxxxxx/xxxxxx.",/pdf/6b6d038cad017ba59dd7e2d47cd5cffdd55afa31.pdf,ICLR,2021,ATOM3D is a collection of benchmark datasets for learning algorithms that work with 3D biomolecular structure. +BkeaxAEKvB,SJeKQ37dPS,1569440000000.0,1577170000000.0,945,New Loss Functions for Fast Maximum Inner Product Search,"[""guorq@google.com"", ""qgeng@google.com"", ""dsimcha@google.com"", ""fchern@google.com"", ""sunphil@google.com"", ""sanjivk@google.com""]","[""Ruiqi Guo"", ""Quan Geng"", ""David Simcha"", ""Felix Chern"", ""Phil Sun"", ""Sanjiv Kumar""]",[],"Quantization based methods are popular for solving large scale maximum inner product search problems. However, in most traditional quantization works, the objective is to minimize the reconstruction error for datapoints to be searched. In this work, we focus directly on minimizing error in inner product approximation and derive a new class of quantization loss functions. One key aspect of the new loss functions is that we weight the error term based on the value of the inner product, giving more importance to pairs of queries and datapoints whose inner products are high. 
We provide theoretical grounding to the new quantization loss function, which is simple, intuitive and able to work with a variety of quantization techniques, including binary quantization and product quantization. We conduct experiments on public benchmarking datasets \url{http://ann-benchmarks.com} to demonstrate that our method using the new objective outperforms other state-of-the-art methods. We are committed to release our source code.",/pdf/1a6e7ff2b3b00348070a28893c6c3018051d187f.pdf,ICLR,2020, +ryA-jdlA-,HyabjOeRW,1509100000000.0,1518730000000.0,319,A closer look at the word analogy problem,"[""siddharthkumar@upwork.com""]","[""Siddharth Krishna Kumar""]","[""word2vec"", ""glove"", ""word analogy"", ""word relationships"", ""word vectors""]","Although word analogy problems have become a standard tool for evaluating word vectors, little is known about why word vectors are so good at solving these problems. In this paper, I attempt to further our understanding of the subject, by developing a simple, but highly accurate generative approach to solve the word analogy problem for the case when all terms involved in the problem are nouns. My results demonstrate the ambiguities associated with learning the relationship between a word pair, and the role of the training dataset in determining the relationship which gets most highlighted. Furthermore, my results show that the ability of a model to accurately solve the word analogy problem may not be indicative of a model’s ability to learn the relationship between a word pair the way a human does. 
+",/pdf/f89cff1f0a3406cc6753a9c3f2ab68f629fbbef3.pdf,ICLR,2018,"Simple generative approach to solve the word analogy problem which yields insights into word relationships, and the problems with estimating them" +8PS8m9oYtNy,wM_f7lvmmnL,1601310000000.0,1615970000000.0,3381,Implicit Normalizing Flows,"[""~Cheng_Lu5"", ""~Jianfei_Chen1"", ""~Chongxuan_Li1"", ""~Qiuhao_Wang1"", ""~Jun_Zhu2""]","[""Cheng Lu"", ""Jianfei Chen"", ""Chongxuan Li"", ""Qiuhao Wang"", ""Jun Zhu""]","[""Normalizing flows"", ""deep generative models"", ""probabilistic inference"", ""implicit functions""]","Normalizing flows define a probability distribution by an explicit invertible transformation $\boldsymbol{\mathbf{z}}=f(\boldsymbol{\mathbf{x}})$. In this work, we present implicit normalizing flows (ImpFlows), which generalize normalizing flows by allowing the mapping to be implicitly defined by the roots of an equation $F(\boldsymbol{\mathbf{z}}, \boldsymbol{\mathbf{x}})= \boldsymbol{\mathbf{0}}$. ImpFlows build on residual flows (ResFlows) with a proper balance between expressiveness and tractability. Through theoretical analysis, we show that the function space of ImpFlow is strictly richer than that of ResFlows. Furthermore, for any ResFlow with a fixed number of blocks, there exists some function that ResFlow has a non-negligible approximation error. However, the function is exactly representable by a single-block ImpFlow. We propose a scalable algorithm to train and draw samples from ImpFlows. Empirically, we evaluate ImpFlow on several classification and density modeling tasks, and ImpFlow outperforms ResFlow with a comparable amount of parameters on all the benchmarks.",/pdf/19708593079fad9a0012ae316f6d90a49b83c54b.pdf,ICLR,2021,"We generalize normalizing flows, allowing the mapping to be implicitly defined by the roots of an equation and enlarging the expressiveness power while retaining the tractability." 
+ASAJvUPWaDI,KElnRjbinx,1601310000000.0,1614990000000.0,3307,A Near-Optimal Recipe for Debiasing Trained Machine Learning Models,"[""~Ibrahim_Alabdulmohsin1"", ""~Mario_Lucic1""]","[""Ibrahim Alabdulmohsin"", ""Mario Lucic""]","[""Fairness"", ""Classification"", ""Statistical Parity"", ""Deep Learning""]","We present an efficient and scalable algorithm for debiasing trained models, including deep neural networks (DNNs), which we prove to be near-optimal by bounding its excess Bayes risk. Unlike previous black-box reduction methods to cost-sensitive classification rules, the proposed algorithm operates on models that have been trained without having to retrain the model. Furthermore, as the algorithm is based on projected stochastic gradient descent (SGD), it is particularly attractive for deep learning applications. We empirically validate the proposed algorithm on standard benchmark datasets across both classical algorithms and modern DNN architectures and demonstrate that it outperforms previous post-processing approaches for unbiased classification.",/pdf/ce61746ea333b965b4c3c174135301f23ab7a7cc.pdf,ICLR,2021, The paper introduces a new near-optimal algorithm debiasing learned models. +T0tmb7uhRhD,VoAaTS8lBXp,1601310000000.0,1614990000000.0,596,Model-Agnostic Round-Optimal Federated Learning via Knowledge Transfer,"[""~Qinbin_Li1"", ""~Bingsheng_He1"", ""~Dawn_Song1""]","[""Qinbin Li"", ""Bingsheng He"", ""Dawn Song""]","[""Federated Learning"", ""Communication-Bounded Learning""]","Federated learning enables multiple parties to collaboratively learn a model without exchanging their local data. Currently, federated averaging (FedAvg) is the most widely used federated learning algorithm. However, FedAvg or its variants have obvious shortcomings. It can only be used to learn differentiable models and needs many communication rounds to converge. 
In this paper, we propose a novel federated learning algorithm FedKT that needs only a single communication round (i.e., round-optimal). With applying the knowledge transfer approach, our algorithm can be applied to any classification model. Moreover, we develop the differentially private versions of FedKT and theoretically analyze the privacy loss. The experiments show that our method can achieve close or better accuracy compared with the other state-of-the-art federated learning algorithms. ",/pdf/1eb0057f89b8a6fe8037ba2ad41c2ce90817d526.pdf,ICLR,2021,The paper presents a new federated learning framework with a single communication round. +SyF7Erp6W,B1u7VraaW,1508890000000.0,1518730000000.0,71,Learning to play slot cars and Atari 2600 games in just minutes,"[""lionel.cordesses@renault.com"", ""omar.bentahar@renault.com"", ""ju.page@hotmail.com""]","[""Lionel Cordesses"", ""Omar Bentahar"", ""Julien Page""]","[""Artificial Intelligence"", ""Signal processing"", ""Philosophy"", ""Analogy"", ""ALE"", ""Slot Car""]","Machine learning algorithms for controlling devices will need to learn quickly, with few trials. Such a goal can be attained with concepts borrowed from continental philosophy and formalized using tools from the mathematical theory of categories. Illustrations of this approach are presented on a cyberphysical system: the slot car game, and also on Atari 2600 games.",/pdf/fc6ded92e6fd8448563646537a4d8e4256544f68.pdf,ICLR,2018,Continental-philosophy-inspired approach to learn with few data. 
+B1eB5xSFvr,HyednCetPr,1569440000000.0,1583910000000.0,2469,DiffTaichi: Differentiable Programming for Physical Simulation,"[""yuanmhu@gmail.com"", ""lukea@mit.edu"", ""tzumao@berkeley.edu"", ""qisu@adobe.com"", ""ncarr@adobe.com"", ""jrk@berkeley.edu"", ""fredo@mit.edu""]","[""Yuanming Hu"", ""Luke Anderson"", ""Tzu-Mao Li"", ""Qi Sun"", ""Nathan Carr"", ""Jonathan Ragan-Kelley"", ""Fredo Durand""]","[""Differentiable programming"", ""robotics"", ""optimal control"", ""physical simulation"", ""machine learning system""]","We present DiffTaichi, a new differentiable programming language tailored for building high-performance differentiable physical simulators. Based on an imperative programming language, DiffTaichi generates gradients of simulation steps using source code transformations that preserve arithmetic intensity and parallelism. A light-weight tape is used to record the whole simulation program structure and replay the gradient kernels in a reversed order, for end-to-end backpropagation. +We demonstrate the performance and productivity of our language in gradient-based learning and optimization tasks on 10 different physical simulators. For example, a differentiable elastic object simulator written in our language is 4.2x shorter than the hand-engineered CUDA version yet runs as fast, and is 188x faster than the TensorFlow implementation. +Using our differentiable programs, neural network controllers are typically optimized within only tens of iterations.",/pdf/6d9976c7113eb4ad907e38be6d4797388ff35a3b.pdf,ICLR,2020,"We study the problem of learning and optimizing through physical simulations via differentiable programming, using our proposed DiffSim programming language and compiler." +Bke7MANKvS,Hyx8ih4ODB,1569440000000.0,1577170000000.0,997,A Kolmogorov Complexity Approach to Generalization in Deep Learning,"[""hazar.yueksel@ibm.com"", ""krvarshn@us.ibm.com"", ""bedk@us.ibm.com""]","[""Hazar Yueksel"", ""Kush R. 
Varshney"", ""Brian Kingsbury""]","[""Kolmogorov complexity"", ""information distance"", ""generalization""]","Deep artificial neural networks can achieve an extremely small difference between training and test accuracies on identically distributed training and test sets, which is a standard measure of generalization. However, the training and test sets may not be sufficiently representative of the empirical sample set, which consists of real-world input samples. When samples are drawn from an underrepresented or unrepresented subset during inference, the gap between the training and inference accuracies can be significant. To address this problem, we first reformulate a classification algorithm as a procedure for searching for a source code that maps input features to classes. We then derive a necessary and sufficient condition for generalization using a universal cognitive similarity metric, namely information distance, based on Kolmogorov complexity. Using this condition, we formulate an optimization problem to learn a more general classification function. To achieve this end, we extend the input features by concatenating encodings of them, and then train the classifier on the extended features. As an illustration of this idea, we focus on image classification, where we use channel codes on the input features as a systematic way to improve the degree to which the training and test sets are representative of the empirical sample set. 
To showcase our theoretical findings, considering that corrupted or perturbed input features belong to the empirical sample set, but typically not to the training and test sets, we demonstrate through extensive systematic experiments that, as a result of learning a more general classification function, a model trained on encoded input features is significantly more robust to common corruptions, e.g., Gaussian and shot noise, as well as adversarial perturbations, e.g., those found via projected gradient descent, than the model trained on uncoded input features.",/pdf/48546b850012b582b87c1dc39cc75b966a1b704e.pdf,ICLR,2020,"We present a theoretical and experimental framework for defining, understanding, and achieving generalization, and as a result robustness, in deep learning by drawing on algorithmic information theory and coding theory." +5FRJWsiLRmA,4eBspEQlq3q,1601310000000.0,1614990000000.0,399,Reservoir Transformers,"[""sheng.s@berkeley.edu"", ""~Alexei_Baevski1"", ""~Ari_S._Morcos1"", ""~Kurt_Keutzer1"", ""~Michael_Auli1"", ""~Douwe_Kiela1""]","[""Sheng Shen"", ""Alexei Baevski"", ""Ari S. Morcos"", ""Kurt Keutzer"", ""Michael Auli"", ""Douwe Kiela""]",[],"We demonstrate that transformers obtain impressive performance even when some of the layers are randomly initialized and never updated. 
Inspired by old and well-established ideas in machine learning, we explore a variety of non-linear reservoir layers interspersed with regular transformer layers, and show improvements in wall-clock compute time until convergence, as well as overall performance, on various machine translation and (masked) language modelling tasks.",/pdf/03dfd303778baf8efa60c572dea9de9dd9edaf41.pdf,ICLR,2021, +rylqmxBKvH,HJeNpEeKPS,1569440000000.0,1577170000000.0,2220,Unsupervised Spatiotemporal Data Inpainting,"[""yuan.yin@lip6.fr"", ""arthur.pajot@lip6.fr"", ""emmanuel.de-bezenac@lip6.fr"", ""patrick.gallinari@lip6.fr""]","[""Yuan Yin"", ""Arthur Pajot"", ""Emmanuel de B\u00e9zenac"", ""Patrick Gallinari""]","[""Deep Learning"", ""Adversarial"", ""MAP"", ""GAN"", ""neural networks"", ""video""]","We tackle the problem of inpainting occluded area in spatiotemporal sequences, such as cloud occluded satellite observations, in an unsupervised manner. We place ourselves in the setting where there is neither access to paired nor unpaired training data. We consider several cases in which the underlying information of the observed sequence in certain areas is lost through an observation operator. In this case, the only available information is provided by the observation of the sequence, the nature of the measurement process and its associated statistics. We propose an unsupervised-learning framework to retrieve the most probable sequence using a generative adversarial network. We demonstrate the capacity of our model to exhibit strong reconstruction capacity on several video datasets such as satellite sequences or natural videos. 
+",/pdf/f36a71efb8e57f69c60d07121b6278aa551afe04.pdf,ICLR,2020, +cR91FAodFMe,K76NZMMPaOs,1601310000000.0,1613080000000.0,3405,Learning to Set Waypoints for Audio-Visual Navigation,"[""~Changan_Chen2"", ""~Sagnik_Majumder1"", ""~Ziad_Al-Halah2"", ""~Ruohan_Gao2"", ""~Santhosh_Kumar_Ramakrishnan1"", ""~Kristen_Grauman1""]","[""Changan Chen"", ""Sagnik Majumder"", ""Ziad Al-Halah"", ""Ruohan Gao"", ""Santhosh Kumar Ramakrishnan"", ""Kristen Grauman""]","[""visual navigation"", ""audio visual learning"", ""embodied vision""]","In audio-visual navigation, an agent intelligently travels through a complex, unmapped 3D environment using both sights and sounds to find a sound source (e.g., a phone ringing in another room). Existing models learn to act at a fixed granularity of agent motion and rely on simple recurrent aggregations of the audio observations. We introduce a reinforcement learning approach to audio-visual navigation with two key novel elements: 1) waypoints that are dynamically set and learned end-to-end within the navigation policy, and 2) an acoustic memory that provides a structured, spatially grounded record of what the agent has heard as it moves. Both new ideas capitalize on the synergy of audio and visual data for revealing the geometry of an unmapped space. We demonstrate our approach on two challenging datasets of real-world 3D scenes, Replica and Matterport3D. 
Our model improves the state of the art by a substantial margin, and our experiments reveal that learning the links between sights, sounds, and space is essential for audio-visual navigation.",/pdf/fa0a991905ae30b2fa74ca7b101b3acabd532c13.pdf,ICLR,2021,We introduce a hierarchical reinforcement learning approach to audio-visual navigation that learns to dynamically set waypoints in an end-to-end fashion +uvEgLKYMBF9,Zlhx3kiEIbf,1601310000000.0,1614990000000.0,1936,Variance Reduction in Hierarchical Variational Autoencoders,"[""~Adeel_Pervez1"", ""~Efstratios_Gavves1""]","[""Adeel Pervez"", ""Efstratios Gavves""]",[],"Variational autoencoders with deep hierarchies of stochastic layers have been known to suffer from the problem of posterior collapse, where the top layers fall back to the prior and become independent of input. +We suggest that the hierarchical VAE objective explicitly includes the variance of the function parameterizing the mean and variance of the latent Gaussian distribution which itself is often a high variance function. +Building on this we generalize VAE neural networks by incorporating a smoothing parameter motivated by Gaussian analysis to reduce variance in parameterizing functions and show that this can help to solve the problem of posterior collapse. +We further show that under such smoothing the VAE loss exhibits a phase transition, where the top layer KL divergence sharply drops to zero at a critical value of the smoothing parameter. 
+We validate the phenomenon across model configurations and datasets.",/pdf/e756906c3f124a660b078c8ebfb7e146209461ad.pdf,ICLR,2021, +JeweO9-QqV-,xxWRczrQhnpS,1601310000000.0,1614990000000.0,1445,SoGCN: Second-Order Graph Convolutional Networks,"[""~Peihao_Wang1"", ""~Yuehao_Wang1"", ""~Hua_Lin1"", ""~Jianbo_Shi1""]","[""Peihao Wang"", ""Yuehao Wang"", ""Hua Lin"", ""Jianbo Shi""]","[""Graph Convolutional Networks"", ""Filter Representation Power"", ""Graph Polynomial Filters""]","We introduce a second-order graph convolution (SoGC), a maximally localized kernel, that can express a polynomial spectral filter with arbitrary coefficients. We contrast our SoGC with vanilla GCN, first-order (one-hop) aggregation, and higher-order (multi-hop) aggregation by analyzing graph convolutional layers via generalized filter space. We argue that SoGC is a simple design capable of forming the basic building block of graph convolution, playing the same role as $3 \times 3$ kernels in CNNs. We build purely topological Second-Order Graph Convolutional Networks (SoGCN) and demonstrate that SoGCN consistently achieves state-of-the-art performance on the latest benchmark. Moreover, we introduce the Gated Recurrent Unit (GRU) to spectral GCNs. This explorative attempt further improves our experimental results.",/pdf/a11b4d3b7841850f0fd5a804fb2989b109ca0f5d.pdf,ICLR,2021,"We introduce a second-order graph convolution (SoGC), a maximally localized kernel, that can express a polynomial spectral filter of order $K$ with arbitrary coefficients. 
" +SJxy5A4twS,H1l6pwu_PB,1569440000000.0,1577170000000.0,1269,Superbloom: Bloom filter meets Transformer,"[""janders@google.com"", ""qqhuang@google.com"", ""walidk@google.com"", ""srendle@google.com"", ""liqzhang@google.com""]","[""John Anderson"", ""Qingqing Huang"", ""Walid Krichene"", ""Steffen Rendle"", ""Li Zhang""]","[""Bloom filter"", ""Transformer"", ""word pieces"", ""contextual embeddings""]","We extend the idea of word pieces in natural language models to machine learning tasks on opaque ids. This is achieved by applying hash functions to map each id to multiple hash tokens in a much smaller space, similarly to a Bloom filter. We show that by applying a multi-layer Transformer to these Bloom filter digests, we are able to obtain models with high accuracy. They outperform models of a similar size without hashing and, to a large degree, models of a much larger size trained using sampled softmax with the same computational budget. Our key observation is that it is important to use a multi-layer Transformer for Bloom filter digests to remove ambiguity in the hashed input. We believe this provides an alternative method to solving problems with large vocabulary size.",/pdf/83c074889aa9a046029f2f75efc0cb78a2a7f5b3.pdf,ICLR,2020,We apply Transformer on Bloom filter digests and show it achieves good quality. +ry0WOxbRZ,B1p-OxbA-,1509130000000.0,1518730000000.0,604,IVE-GAN: Invariant Encoding Generative Adversarial Networks,"[""robin.winter@bayer.com"", ""djork-arne.clevert@bayer.com""]","[""Robin Winter"", ""Djork-Arn\u00e8 Clevert""]","[""Deep learning"", ""Unsupervised Learning""]","Generative adversarial networks (GANs) are a powerful framework for generative tasks. However, they are difficult to train and tend to miss modes of the true data generation process. Although GANs can learn a rich representation of the covered modes of the data in their latent space, the framework misses an inverse mapping from data to this latent space. 
We propose Invariant Encoding Generative Adversarial Networks (IVE-GANs), a novel GAN framework that introduces such a mapping for individual samples from the data by utilizing features in the data which are invariant to certain transformations. Since the model maps individual samples to the latent space, it naturally encourages the generator to cover all modes. We demonstrate the effectiveness of our approach in terms of generative performance and learning rich representations on several datasets including common benchmark image generation tasks.",/pdf/15ee38bd8da6fcb3f771978bb8a5be0520ee3203.pdf,ICLR,2018,A novel GAN framework that utilizes transformation-invariant features to learn rich representations and strong generators. +SkMQg3C5K7,Hyg4r_a5FQ,1538090000000.0,1545390000000.0,1063,A Convergence Analysis of Gradient Descent for Deep Linear Neural Networks,"[""arora@cs.princeton.edu"", ""cohennadav@ias.edu"", ""ngolowich@college.harvard.edu"", ""huwei@cs.princeton.edu""]","[""Sanjeev Arora"", ""Nadav Cohen"", ""Noah Golowich"", ""Wei Hu""]","[""Deep Learning"", ""Learning Theory"", ""Non-Convex Optimization""]","We analyze speed of convergence to global optimum for gradient descent training a deep linear neural network by minimizing the L2 loss over whitened data. Convergence at a linear rate is guaranteed when the following hold: (i) dimensions of hidden layers are at least the minimum of the input and output dimensions; (ii) weight matrices at initialization are approximately balanced; and (iii) the initial loss is smaller than the loss of any rank-deficient solution. The assumptions on initialization (conditions (ii) and (iii)) are necessary, in the sense that violating any one of them may lead to convergence failure. Moreover, in the important case of output dimension 1, i.e. scalar regression, they are met, and thus convergence to global optimum holds, with constant probability under a random initialization scheme. 
Our results significantly extend previous analyses, e.g., of deep linear residual networks (Bartlett et al., 2018).",/pdf/85f94616aa707234fc6d02a284f15d53b7cc8841.pdf,ICLR,2019,"We analyze gradient descent for deep linear neural networks, providing a guarantee of convergence to global optimum at a linear rate." +sfgcqgOm2F_,eebXaibofzt,1601310000000.0,1614990000000.0,2173,Natural Compression for Distributed Deep Learning,"[""~Samuel_Horv\u00e1th1"", ""chenyu.ho@kaust.edu.sa"", ""ludovit.horvath.94@gmail.com"", ""atal.sahu@kaust.edu.sa"", ""~Marco_Canini1"", ""~Peter_Richtarik1""]","[""Samuel Horv\u00e1th"", ""Chen-Yu Ho"", ""Ludovit Horv\u00e1th"", ""Atal Narayan Sahu"", ""Marco Canini"", ""Peter Richtarik""]",[],"Modern deep learning models are often trained in parallel over a collection of distributed machines to reduce training time. In such settings, communication of model updates among machines becomes a significant performance bottleneck and various lossy update compression techniques have been proposed to alleviate this problem. In this work, we introduce a new, simple yet theoretically and practically effective compression technique: {\em natural compression ($C_{nat}$)}. Our technique is applied individually to all entries of the to-be-compressed update vector and works by randomized rounding to the nearest (negative or positive) power of two, which can be computed in a ``natural'' way by ignoring the mantissa. We show that compared to no compression, $C_{nat}$ increases the second moment of the compressed vector by not more than the tiny factor $\nicefrac{9}{8}$, which means that the effect of $C_{nat}$ on the convergence speed of popular training algorithms, such as distributed SGD, is negligible. However, the communications savings enabled by $C_{nat}$ are substantial, +leading to {\em $3$-$4\times$ improvement in overall theoretical running time}. 
For applications requiring more aggressive compression, we generalize $C_{nat}$ to {\em natural dithering}, which we prove is {\em exponentially better} than the common random dithering technique. Our compression operators can be used on their own or in combination with existing operators for a more aggressive combined effect, and offer new state-of-the-art both in theory and practice.",/pdf/b60ec47f0f26a6e6ee0ebb62b44962e3b1e71443.pdf,ICLR,2021, +Byl28eBtwH,Bkx0uqxFDS,1569440000000.0,1577170000000.0,2337,Learning Cluster Structured Sparsity by Reweighting,"[""yljblues@whu.edu.cn"", ""ly.wd@whu.edu.cn"", ""haijian.zhang@whu.edu.cn"", ""liuzhou@whu.edu.cn""]","[""Yulun Jiang"", ""Lei Yu"", ""Haijian Zhang"", ""Zhou Liu""]","[""Sparse Recovery"", ""Sparse Representation"", ""Structured Sparsity""]","Recently, the paradigm of unfolding iterative algorithms into finite-length feed-forward neural networks has achieved great success in the area of sparse recovery. Benefiting from available training data, the learned networks have achieved state-of-the-art performance with respect to both speed and accuracy. However, the structure behind sparsity, imposing constraint on the support of sparse signals, is often an essential prior knowledge but seldom considered in the existing networks. In this paper, we aim at bridging this gap. Specifically, exploiting the iterative reweighted $\ell_1$ minimization (IRL1) algorithm, we propose to learn the cluster structured sparsity (CSS) by reweighting adaptively. In particular, we first unfold the Reweighted Iterative Shrinkage Algorithm (RwISTA) into an end-to-end trainable deep architecture termed as RW-LISTA. Then instead of the element-wise reweighting, global and local reweighting schemes are proposed for the cluster structured sparse learning. Numerical experiments further show the superiority of our algorithm against both classical algorithms and learning-based networks on different tasks. 
",/pdf/3bb43c33c1bab83811763506a2b81008d95f501e.pdf,ICLR,2020, +ByxkijC5FQ,Hyg3U1c5FQ,1538090000000.0,1550750000000.0,584,Neural Persistence: A Complexity Measure for Deep Neural Networks Using Algebraic Topology,"[""bastian.rieck@bsse.ethz.ch"", ""matteo.togninalli@bsse.ethz.ch"", ""christian.bock@bsse.ethz.ch"", ""michael.moor@bsse.ethz.ch"", ""max.horn@bsse.ethz.ch"", ""thomas.gumbsch@bsse.ethz.ch"", ""karsten.borgwardt@bsse.ethz.ch""]","[""Bastian Rieck"", ""Matteo Togninalli"", ""Christian Bock"", ""Michael Moor"", ""Max Horn"", ""Thomas Gumbsch"", ""Karsten Borgwardt""]","[""Algebraic topology"", ""persistent homology"", ""network complexity"", ""neural network""]","While many approaches to make neural networks more fathomable have been proposed, they are restricted to interrogating the network with input data. Measures for characterizing and monitoring structural properties, however, have not been developed. In this work, we propose neural persistence, a complexity measure for neural network architectures based on topological data analysis on weighted stratified graphs. To demonstrate the usefulness of our approach, we show that neural persistence reflects best practices developed in the deep learning community such as dropout and batch normalization. Moreover, we derive a neural persistence-based stopping criterion that shortens the training process while achieving comparable accuracies as early stopping based on validation loss.",/pdf/e08531064d88c56cd19cfe26840297accf25652b.pdf,ICLR,2019,We develop a new topological complexity measure for deep neural networks and demonstrate that it captures their salient properties. 
+Hk4fpoA5Km,r1g87_6cY7,1538090000000.0,1550880000000.0,784,Discriminator-Actor-Critic: Addressing Sample Inefficiency and Reward Bias in Adversarial Imitation Learning,"[""kostrikov@cs.nyu.edu"", ""kumarkagrawal@gmail.com"", ""debidatta@google.com"", ""slevine@google.com"", ""tompson@google.com""]","[""Ilya Kostrikov"", ""Kumar Krishna Agrawal"", ""Debidatta Dwibedi"", ""Sergey Levine"", ""Jonathan Tompson""]","[""deep learning"", ""reinforcement learning"", ""imitation learning"", ""adversarial learning""]","We identify two issues with the family of algorithms based on the Adversarial Imitation Learning framework. The first problem is implicit bias present in the reward functions used in these algorithms. While these biases might work well for some environments, they can also lead to sub-optimal behavior in others. Secondly, even though these algorithms can learn from few expert demonstrations, they require a prohibitively large number of interactions with the environment in order to imitate the expert for many real-world applications. In order to address these issues, we propose a new algorithm called Discriminator-Actor-Critic that uses off-policy Reinforcement Learning to reduce policy-environment interaction sample complexity by an average factor of 10. Furthermore, since our reward function is designed to be unbiased, we can apply our algorithm to many problems without making any task-specific adjustments. ",/pdf/14c0807f98709fad7a6be7cff5051151b71b397e.pdf,ICLR,2019,We address sample inefficiency and reward bias in adversarial imitation learning algorithms such as GAIL and AIRL. 
+HJlMkTNYvH,HJek9wVBDH,1569440000000.0,1577170000000.0,292,MODiR: Multi-Objective Dimensionality Reduction for Joint Data Visualisation,"[""tim.repke@hpi.uni-potsdam.de"", ""ralf.krestel@hpi.de""]","[""Tim Repke"", ""Ralf Krestel""]","[""dimensionality reduction"", ""visualisation"", ""text visualisation"", ""network drawing""]","Many large text collections exhibit graph structures, either inherent to the content itself or encoded in the metadata of the individual documents. +Example graphs extracted from document collections are co-author networks, citation networks, or named-entity-cooccurrence networks. +Furthermore, social networks can be extracted from email corpora, tweets, or social media. +When it comes to visualising these large corpora, either the textual content or the network graph are used. + +In this paper, we propose to incorporate both, text and graph, to not only visualise the semantic information encoded in the documents' content but also the relationships expressed by the inherent network structure. +To this end, we introduce a novel algorithm based on multi-objective optimisation to jointly position embedded documents and graph nodes in a two-dimensional landscape. +We illustrate the effectiveness of our approach with real-world datasets and show that we can capture the semantics of large document collections better than other visualisations based on either the content or the network information.",/pdf/3d61cf503e9dd2bb8b536afe887b0f8e8ab6e70e.pdf,ICLR,2020,"Dimensionality reduction algorithm to visualise text with network information, for example an email corpus or co-authorships." 
+rygp3iRcF7,S1xrZR3cKm,1538090000000.0,1545360000000.0,751,Area Attention,"[""liyang@google.com"", ""lukaszkaiser@google.com"", ""bengio@google.com"", ""sisidaisy@google.com""]","[""Yang Li"", ""Lukasz Kaiser"", ""Samy Bengio"", ""Si Si""]","[""Deep Learning"", ""attentional mechanisms"", ""neural machine translation"", ""image captioning""]","Existing attention mechanisms are mostly item-based in that a model is trained to attend to individual items in a collection (the memory) where each item has a predefined, fixed granularity, e.g., a character or a word. Intuitively, an area in the memory consisting of multiple items can be worth attending to as a whole. We propose area attention: a way to attend to an area of the memory, where each area contains a group of items that are either spatially adjacent when the memory has a 2-dimensional structure, such as images, or temporally adjacent for 1-dimensional memory, such as natural language sentences. Importantly, the size of an area, i.e., the number of items in an area or the level of aggregation, is dynamically determined via learning, which can vary depending on the learned coherence of the adjacent items. By giving the model the option to attend to an area of items, instead of only individual items, a model can attend to information with varying granularity. Area attention can work alongside multi-head attention for attending to multiple areas in the memory. We evaluate area attention on two tasks: neural machine translation (both character and token-level) and image captioning, and improve upon strong (state-of-the-art) baselines in all the cases. These improvements are obtainable with a basic form of area attention that is parameter free. 
In addition to proposing the novel concept of area attention, we contribute an efficient way for computing it by leveraging the technique of summed area tables.",/pdf/0aa7f168abb30d826e6d0124fedaf39cccb84aba.pdf,ICLR,2019,The paper presents a novel approach for attentional mechanisms that can benefit a range of tasks such as machine translation and image captioning. +fGF8qAqpXXG,v9dBYH0rg7,1601310000000.0,1616040000000.0,2926,Vector-output ReLU Neural Network Problems are Copositive Programs: Convex Analysis of Two Layer Networks and Polynomial-time Algorithms,"[""~Arda_Sahiner1"", ""~Tolga_Ergen1"", ""~John_M._Pauly1"", ""~Mert_Pilanci3""]","[""Arda Sahiner"", ""Tolga Ergen"", ""John M. Pauly"", ""Mert Pilanci""]","[""neural networks"", ""theory"", ""convex optimization"", ""copositive programming"", ""convex duality"", ""nonnegative PCA"", ""semi-nonnegative matrix factorization"", ""computational complexity"", ""global optima"", ""semi-infinite duality"", ""convolutional neural networks""]","We describe the convex semi-infinite dual of the two-layer vector-output ReLU neural network training problem. This semi-infinite dual admits a finite dimensional representation, but its support is over a convex set which is difficult to characterize. In particular, we demonstrate that the non-convex neural network training problem is equivalent to a finite-dimensional convex copositive program. Our work is the first to identify this strong connection between the global optima of neural networks and those of copositive programs. We thus demonstrate how neural networks implicitly attempt to solve copositive programs via semi-nonnegative matrix factorization, and draw key insights from this formulation. We describe the first algorithms for provably finding the global minimum of the vector output neural network training problem, which are polynomial in the number of samples for a fixed data rank, yet exponential in the dimension. 
However, in the case of convolutional architectures, the computational complexity is exponential in only the filter size and polynomial in all other parameters. We describe the circumstances in which we can find the global optimum of this neural network training problem exactly with soft-thresholded SVD, and provide a copositive relaxation which is guaranteed to be exact for certain classes of problems, and which corresponds with the solution of Stochastic Gradient Descent in practice.",/pdf/0222bcef2a87d75e3670c0707c8b848554ecbe31.pdf,ICLR,2021,"We demonstrate that two-layer vector-output ReLU networks can be expressed as copositive programs, and introduce algorithms for provably finding their global optima, which are polynomial in the number of samples for a fixed data rank." +BJMvBjC5YQ,H1g-c-aHFQ,1538090000000.0,1545360000000.0,95,Cutting Down Training Memory by Re-fowarding,"[""jfeng1@andrew.cmu.edu"", ""donghuang@cmu.edu""]","[""Jianwei Feng"", ""Dong Huang""]","[""deep learning"", ""training memory"", ""computation-memory trade off"", ""optimal solution""]","Deep Neural Networks (DNNs) require huge GPU memory when training on modern image/video databases. Unfortunately, the GPU memory as a hardware resource is always finite, which limits the image resolution, batch size, and learning rate that could be used for better DNN performance. In this paper, we propose a novel training approach, called Re-forwarding, that substantially reduces memory usage in training. Our approach automatically finds a subset of vertices in a DNN computation graph, and stores tensors only at these vertices during the first forward. During backward, extra local forwards (called the Re-forwarding process) are conducted to compute the missing tensors between the subset of vertices. The total memory cost becomes the sum of (1) the memory cost at the subset of vertices and (2) the maximum memory cost among local re-forwards. 
Re-forwarding trades training time overheads for memory and does not compromise any performance in testing. We propose theories and algorithms that achieve the optimal memory solutions for DNNs with either linear or arbitrary computation graphs. Experiments show that Re-forwarding cuts down up-to 80% of training memory on popular DNNs such as Alexnet, VGG, ResNet, Densenet and Inception net.",/pdf/3f8164c00eb764f7ee94ac23d52a1943a493b8dc.pdf,ICLR,2019,"This paper proposes fundamental theory and optimal algorithms for DNN training, which reduce up to 80% of training memory for popular DNNs." +rJ7yZ2P6-,BJzJZ2v6b,1508520000000.0,1518730000000.0,27,Enhance Word Representation for Out-of-Vocabulary on Ubuntu Dialogue Corpus,"[""jdongca2003@gmail.com"", ""ccjimhuang@gmail.com""]","[""JIANXIONG DONG"", ""Jim Huang""]","[""next utterance selection"", ""ubuntu dialogue corpus"", ""out-of-vocabulary"", ""word representation""]","Ubuntu dialogue corpus is the largest public available dialogue corpus to make it feasible to build end-to-end +deep neural network models directly from the conversation data. One challenge of Ubuntu dialogue corpus is +the large number of out-of-vocabulary words. In this paper we proposed an algorithm which combines the general pre-trained word embedding vectors with those generated on the task-specific training set to address this issue. We integrated character embedding into Chen et al's Enhanced LSTM method (ESIM) and used it to evaluate the effectiveness of our proposed method. For the task of next utterance selection, the proposed method has demonstrated a significant performance improvement against original ESIM and the new model has achieved state-of-the-art results on both Ubuntu dialogue corpus and Douban conversation corpus. In addition, we investigated the performance impact of end-of-utterance and end-of-turn token tags. 
",/pdf/136ec9497d120ac8863cf8ff0fd325638b1a4bfb.pdf,ICLR,2018,Combine information between pre-built word embedding and task-specific word representation to address out-of-vocabulary issue +SJ25-B5eg,,1478280000000.0,1488570000000.0,240,The Neural Noisy Channel,"[""lei.yu@cs.ox.ac.uk"", ""pblunsom@google.com"", ""cdyer@google.com"", ""etg@google.com"", ""tkocisky@google.com""]","[""Lei Yu"", ""Phil Blunsom"", ""Chris Dyer"", ""Edward Grefenstette"", ""Tomas Kocisky""]","[""Natural language processing"", ""Deep learning"", ""Semi-Supervised Learning""]","We formulate sequence to sequence transduction as a noisy channel decoding problem and use recurrent neural networks to parameterise the source and channel models. Unlike direct models which can suffer from explaining-away effects during training, noisy channel models must produce outputs that explain their inputs, and their component models can be trained with not only paired training samples but also unpaired samples from the marginal output distribution. Using a latent variable to control how much of the conditioning sequence the channel model needs to read in order to generate a subsequent symbol, we obtain a tractable and effective beam search decoder. Experimental results on abstractive sentence summarisation, morphological inflection, and machine translation show that noisy channel models outperform direct models, and that they significantly benefit from increased amounts of unpaired output data that direct models cannot easily use.",/pdf/587cf5979157e003499aa4188ed3aa8e78064016.pdf,ICLR,2017,We formulate sequence to sequence transduction as a noisy channel decoding problem and use recurrent neural networks to parameterise the source and channel models. 
+Syx0Mh05YQ,S1e47365Y7,1538090000000.0,1550890000000.0,1317,Learning Grid Cells as Vector Representation of Self-Position Coupled with Matrix Representation of Self-Motion,"[""ruiqigao@ucla.edu"", ""jianwen@ucla.edu"", ""sczhu@stat.ucla.edu"", ""ywu@stat.ucla.edu""]","[""Ruiqi Gao"", ""Jianwen Xie"", ""Song-Chun Zhu"", ""Ying Nian Wu""]",[],"This paper proposes a representational model for grid cells. In this model, the 2D self-position of the agent is represented by a high-dimensional vector, and the 2D self-motion or displacement of the agent is represented by a matrix that transforms the vector. Each component of the vector is a unit or a cell. The model consists of the following three sub-models. (1) Vector-matrix multiplication. The movement from the current position to the next position is modeled by matrix-vector multiplication, i.e., the vector of the next position is obtained by multiplying the matrix of the motion to the vector of the current position. (2) Magnified local isometry. The angle between two nearby vectors equals the Euclidean distance between the two corresponding positions multiplied by a magnifying factor. (3) Global adjacency kernel. The inner product between two vectors measures the adjacency between the two corresponding positions, which is defined by a kernel function of the Euclidean distance between the two positions. Our representational model has explicit algebra and geometry. 
It can learn hexagon patterns of grid cells, and it is capable of error correction, path integral and path planning.",/pdf/1553f8c38acacffb76f7dd1b0bb9bd75785a02f4.pdf,ICLR,2019, +B14ejsA5YQ,rklhtMAqtm,1538090000000.0,1545360000000.0,594,Neural Causal Discovery with Learnable Input Noise,"[""tailin@mit.edu"", ""tbreuel@nvidia.com"", ""jkautz@nvidia.com""]","[""Tailin Wu"", ""Thomas Breuel"", ""Jan Kautz""]","[""neural causal learning"", ""learnable noise""]","Learning causal relations from observational time series with nonlinear interactions and complex causal structures is a key component of human intelligence, and has a wide range of applications. Although neural nets have demonstrated their effectiveness in a variety of fields, their application in learning causal relations has been scarce. This is due to both a lack of theoretical results connecting risk minimization and causality (enabling function approximators like neural nets to apply), and a lack of scalability in prior causal measures to allow for expressive function approximators like neural nets to apply. In this work, we propose a novel causal measure and algorithm using risk minimization to infer causal relations from time series. We demonstrate the effectiveness and scalability of our algorithms to learn nonlinear causal models in synthetic datasets as comparing to other methods, and its effectiveness in inferring causal relations in a video game environment and real-world heart-rate vs. 
breath-rate and rat brain EEG datasets.",/pdf/57a11848006e91d0eeb3c9006f26442258c407cd.pdf,ICLR,2019, +BJQPG5lR-,S1fwG9l0b,1509100000000.0,1518730000000.0,347,Avoiding degradation in deep feed-forward networks by phasing out skip-connections,"[""r.monti@ucl.ac.uk"", ""sina@gatsby.ucl.ac.uk"", ""robin.cao@ucl.ac.uk""]","[""Ricardo Pio Monti"", ""Sina Tootoonian"", ""Robin Cao""]","[""optimization"", ""vanishing gradients"", ""shattered gradients"", ""skip-connections""]","A widely observed phenomenon in deep learning is the degradation problem: increasing +the depth of a network leads to a decrease in performance on both test and training data. Novel architectures such as ResNets and Highway networks have addressed this issue by introducing various flavors of skip-connections or gating mechanisms. However, the degradation problem persists in the context of plain feed-forward networks. In this work we propose a simple method to address this issue. The proposed method poses the learning of weights in deep networks as a constrained optimization problem where the presence of skip-connections is penalized by Lagrange multipliers. This allows for skip-connections to be introduced during the early stages of training and subsequently phased out in a principled manner. We demonstrate the benefits of such an approach with experiments on MNIST, fashion-MNIST, CIFAR-10 and CIFAR-100 where the proposed method is shown to greatly decrease the degradation effect (compared to plain networks) and is often competitive with ResNets.",/pdf/d8e035737222ad3316426edcea95c8a1a3dab73d.pdf,ICLR,2018,Phasing out skip-connections in a principled manner avoids degradation in deep feed-forward networks. 
+B1jnyXXJx,,1476760000000.0,1481750000000.0,4,Charged Point Normalization: An Efficient Solution to the Saddle Point Problem,"[""armen.ag@live.com""]","[""Armen Aghajanyan""]","[""Deep learning"", ""Computer vision"", ""Optimization""]","Recently, the problem of local minima in very high dimensional non-convex optimization has been challenged and the problem of saddle points has been introduced. This paper introduces a dynamic type of normalization that forces the system to escape saddle points. Unlike other saddle point escaping algorithms, second order information is not utilized, and the system can be trained with an arbitrary gradient descent learner. The system drastically improves learning in a range of deep neural networks on various data-sets in comparison to non-CPN neural networks.",/pdf/4f4854eeac3c1a8dc5416dfce2adba583b6fb529.pdf,ICLR,2017, +Hyg5TRNtDH,r1egd79OPH,1569440000000.0,1577170000000.0,1404,Unsupervised Temperature Scaling: Robust Post-processing Calibration for Domain Shift,"[""azadeh-sadat.mozafari.1@ulaval.ca"", ""hugo.siqueira-gomes.1@ulaval.ca"", ""christian.gagne@gel.ulaval.ca""]","[""Azadeh Sadat Mozafari"", ""Hugo Siqueira Gomes"", ""Christian Gagne""]","[""calibration"", ""domain shift"", ""uncertainty prediction"", ""deep neural networks"", ""temperature scaling""]","The uncertainty estimation is critical in real-world decision making applications, especially when distributional shift between the training and test data are prevalent. Many calibration methods in the literature have been proposed to improve the predictive uncertainty of DNNs which are generally not well-calibrated. However, none of them is specifically designed to work properly under domain shift condition. In this paper, we propose Unsupervised Temperature Scaling (UTS) as a robust calibration method to domain shift. It exploits test samples to adjust the uncertainty prediction of deep models towards the test distribution. 
UTS utilizes a novel loss function, weighted NLL, that allows unsupervised calibration. We evaluate UTS on a wide range of model-datasets which shows the possibility of calibration without labels and demonstrate the robustness of UTS compared to other methods (e.g., TS, MC-dropout, SVI, ensembles) in shifted domains. ",/pdf/b514a7d6338d59505b41116d83c40c13f93205de.pdf,ICLR,2020,A robust post-processing calibration method for domain shift. +r1aGWUqgg,,1478280000000.0,1484340000000.0,288,Unsupervised Learning of State Representations for Multiple Tasks,"[""antonin.raffin@ensta-paristech.fr"", ""sebastian.hoefer@tu-berlin.de"", ""rico.jonschkowski@tu-berlin.de"", ""oliver.brock@tu-berlin.de"", ""freek.stulp@dlr.de""]","[""Antonin Raffin"", ""Sebastian H\u00f6fer"", ""Rico Jonschkowski"", ""Oliver Brock"", ""Freek Stulp""]","[""Reinforcement Learning"", ""Unsupervised Learning""]","We present an approach for learning state representations in multi-task reinforcement learning. Our method learns multiple low-dimensional state representations from raw observations in an unsupervised fashion, without any knowledge of which task is executed, nor of the number of tasks involved. +The method is based on a gated neural network architecture, trained with an extension of the learning with robotic priors objective. 
In simulated experiments, we show that our method is able to learn better state representations for reinforcement learning, and we analyze why and when it manages to do so.",/pdf/20543da4c263b038cc463144a1390f187a5e1780.pdf,ICLR,2017,Learning method for automatic detection of multiple reinforcement tasks and extraction of state representations from raw observations +Dtahsj2FkrK,aolEkFNzGaD,1601310000000.0,1614990000000.0,488,A REINFORCEMENT LEARNING FRAMEWORK FOR TIME DEPENDENT CAUSAL EFFECTS EVALUATION IN A/B TESTING,"[""~Chengchun_Shi1"", ""wxyinucas@gmail.com"", ""luoshikai@didiglobal.com"", ""~Rui_Song2"", ""~Hongtu_Zhu2"", ""~Jieping_Ye3""]","[""Chengchun Shi"", ""Xiaoyu Wang"", ""Shikai Luo"", ""Rui Song"", ""Hongtu Zhu"", ""Jieping Ye""]","[""reinforcement learning"", ""A/B testing"", ""causal inference"", ""sequential testing""]","A/B testing, or online experiment is a standard business strategy to compare a new product with an old one in pharmaceutical, technological, and traditional industries. The aim of this paper is to introduce a reinforcement learning framework for carrying A/B testing in two-sided marketplace platforms, while characterizing the long-term treatment effects. Our proposed testing procedure allows for sequential monitoring and online updating. It is generally applicable to a variety of treatment designs in different industries. In addition, we systematically investigate the theoretical properties (e.g., size and power) of our testing procedure. Finally, we apply our framework to both synthetic data and a real-world data example obtained from a technological company to illustrate its advantage over the current practice. +",/pdf/5a1a068d6d0d088d2a3fa01e6770aac6b889e790.pdf,ICLR,2021,We introduce a reinforcement learning framework to evaluate time dependent causal effects in A/B testing. 
+HJeQToAqKQ,ryl-4uTcKX,1538090000000.0,1545360000000.0,786,TherML: The Thermodynamics of Machine Learning,"[""alemi@google.com"", ""iansf@google.com""]","[""Alexander A. Alemi"", ""Ian Fischer""]","[""representation learning"", ""information theory"", ""information bottleneck"", ""thermodynamics"", ""predictive information""]",In this work we offer an information-theoretic framework for representation learning that connects with a wide class of existing objectives in machine learning. We develop a formal correspondence between this work and thermodynamics and discuss its implications.,/pdf/7cbd3e389d18abb3b90642dbd453925a05355a94.pdf,ICLR,2019,We offer a framework for representation learning that connects with a wide class of existing objectives and is analogous to thermodynamics. +CaCHjsqCBJV,X1aVroCMsCC,1601310000000.0,1614990000000.0,270,Differentiable Optimization of Generalized Nondecomposable Functions using Linear Programs,"[""~Zihang_Meng1"", ""~Lopamudra_Mukherjee1"", ""~Vikas_Singh1"", ""~Sathya_N._Ravi1""]","[""Zihang Meng"", ""Lopamudra Mukherjee"", ""Vikas Singh"", ""Sathya N. Ravi""]","[""linear programming"", ""nondecomposable functions"", ""differentiable"", ""AUC"", ""Fscore""]","We propose a framework which makes it feasible to directly train deep neural networks with respect to popular families of task-specific non-decomposable performance measures such as AUC, multi-class AUC, F-measure and others, as well as models such as non-negative matrix factorization. A common feature of the optimization model that emerges from these tasks is that it involves solving a Linear Program (LP) during training where representations learned by upstream layers influence the constraints. The constraint matrix is not only large but the constraints are also modified at each iteration. 
We show how adopting a set of influential ideas proposed by Mangasarian for 1-norm SVMs – which advocates for solving LPs with a generalized Newton method – provides a simple and effective solution. In particular, this strategy needs little unrolling, which makes it more efficient during backward pass. While a number of specialized algorithms have been proposed for the models that we describe here, our module turns out to be applicable without any specific adjustments or relaxations. We describe each use case, study its properties and demonstrate the efficacy of the approach over alternatives which use surrogate lower bounds and often, specialized optimization schemes. Frequently, we achieve superior computational behavior and performance improvements on common datasets used in the literature. +",/pdf/4d4ca52452dfe2dde7188f41e23da05da6ddf495.pdf,ICLR,2021,We propose a framework which makes it feasible to directly train deep neural networks with respect to popular families of task-specific non-decomposable performance measures. +rkN2Il-RZ,ryXhUebC-,1509130000000.0,1518730000000.0,593,SCAN: Learning Hierarchical Compositional Visual Concepts,"[""irinah@google.com"", ""sonnerat@google.com"", ""lmatthey@google.com"", ""arkap@google.com"", ""cpburgess@google.com"", ""matko@google.com"", ""mshanahan@google.com"", ""botvinick@google.com"", ""demishassabis@google.com"", ""lerchner@google.com""]","[""Irina Higgins"", ""Nicolas Sonnerat"", ""Loic Matthey"", ""Arka Pal"", ""Christopher P Burgess"", ""Matko Bo\u0161njak"", ""Murray Shanahan"", ""Matthew Botvinick"", ""Demis Hassabis"", ""Alexander Lerchner""]","[""grounded visual concepts"", ""compositional representation"", ""concept hierarchy"", ""disentangling"", ""beta-VAE"", ""variational autoencoder"", ""deep learning"", ""generative model""]","The seemingly infinite diversity of the natural world arises from a relatively small set of coherent rules, such as the laws of physics or chemistry. 
We conjecture that these rules give rise to regularities that can be discovered through primarily unsupervised experiences and represented as abstract concepts. If such representations are compositional and hierarchical, they can be recombined into an exponentially large set of new concepts. This paper describes SCAN (Symbol-Concept Association Network), a new framework for learning such abstractions in the visual domain. SCAN learns concepts through fast symbol association, grounding them in disentangled visual primitives that are discovered in an unsupervised manner. Unlike state of the art multimodal generative model baselines, our approach requires very few pairings between symbols and images and makes no assumptions about the form of symbol representations. Once trained, SCAN is capable of multimodal bi-directional inference, generating a diverse set of image samples from symbolic descriptions and vice versa. It also allows for traversal and manipulation of the implicit hierarchy of visual concepts through symbolic instructions and learnt logical recombination operations. Such manipulations enable SCAN to break away from its training data distribution and imagine novel visual concepts through symbolically instructed recombination of previously learnt concepts.",/pdf/47b6e5497ffcfbeb4b7f78766afce2d082f3954b.pdf,ICLR,2018,We present a neural variational model for learning language-guided compositional visual concepts. +HQoCa9WODc0,cL_eOhvmIm1,1601310000000.0,1614990000000.0,1076,Suppressing Outlier Reconstruction in Autoencoders for Out-of-Distribution Detection,"[""~Sangwoong_Yoon1"", ""~Yung-Kyun_Noh1"", ""~Frank_C._Park1""]","[""Sangwoong Yoon"", ""Yung-Kyun Noh"", ""Frank C. Park""]","[""autoencoder"", ""outlier detection"", ""novelty detection"", ""energy-based model""]","While only trained to reconstruct training data, autoencoders may produce high-quality reconstructions of inputs that are well outside the training data distribution. 
This phenomenon, which we refer to as outlier reconstruction, has a detrimental effect on the use of autoencoders for outlier detection, as an autoencoder will misclassify a clear outlier as being in-distribution. In this paper, we introduce the Energy-Based Autoencoder (EBAE), an autoencoder that is considerably less susceptible to outlier reconstruction. +The core idea of EBAE is to treat the reconstruction error as an energy function of a normalized density and to strictly enforce the normalization constraint. We show that the reconstruction of non-training inputs can be suppressed, and the reconstruction error made highly discriminative to outliers, by enforcing this constraint. We empirically show that EBAE significantly outperforms both existing autoencoders and other generative models for several out-of-distribution detection tasks.",/pdf/15e1d6e96e5e2fe72ef50d79dcc9f79b5893879f.pdf,ICLR,2021,"We investigate the phenomenon of an autoencoder reconstructing outliers and propose Energy-Based Autoencoder, where the reconstruction of outliers are explicitly suppressed through an energy-based formulation." +HJ_X8GupW,H1vXIfd6Z,1508550000000.0,1518730000000.0,30,Multi-label Learning for Large Text Corpora using Latent Variable Model with Provable Gurantees,"[""sayandg@umich.edu""]","[""Sayantan Dasgupta""]","[""Spectral Method"", ""Multi-label Learning"", ""Tensor Factorisation""]","Here we study the problem of learning labels for large text corpora where each document can be assigned a variable number of labels. The problem is trivial when the label dimensionality is small and can be easily solved by a series of one-vs-all classifiers. However, as the label dimensionality increases, the parameter space of such one-vs-all classifiers becomes extremely large and outstrips the memory. Here we propose a latent variable model to reduce the size of the parameter space, but still efficiently learn the labels. 
We learn the model using spectral learning and show how to extract the parameters using only three passes through the training dataset. Further, we analyse the sample complexity of our model using PAC learning theory and then demonstrate the performance of our algorithm on several benchmark datasets in comparison with existing algorithms. +",/pdf/04fcbc94a08d3687368bb8c05f9045b7a0220be9.pdf,ICLR,2018, +wZ4yWvQ_g2y,#NAME?,1601310000000.0,1614990000000.0,1743,Task-Agnostic and Adaptive-Size BERT Compression,"[""~Jin_Xu5"", ""~Xu_Tan1"", ""~Renqian_Luo1"", ""~Kaitao_Song1"", ""~Li_Jian1"", ""~Tao_Qin1"", ""~Tie-Yan_Liu1""]","[""Jin Xu"", ""Xu Tan"", ""Renqian Luo"", ""Kaitao Song"", ""Li Jian"", ""Tao Qin"", ""Tie-Yan Liu""]","[""BERT compression"", ""neural architecture search"", ""adaptive sizes"", ""across tasks"", ""knowledge distillation""]","While pre-trained language models such as BERT and RoBERTa have achieved impressive results on various natural language processing tasks, they have huge numbers of parameters and suffer from huge computational and memory costs, which make them difficult for real-world deployment. Hence, model compression should be performed in order to reduce the computation and memory cost of pre-trained models. In this work, we aim to compress BERT and address the following two challenging practical issues: (1) The compression algorithm should be able to output multiple compressed models with different sizes and latencies, so as to support devices with different kinds of memory and latency limitations; (2) the algorithm should be downstream task agnostic, so that the compressed models are generally applicable for different downstream tasks. We leverage techniques in neural architecture search (NAS) and propose NAS-BERT, an efficient method for BERT compression. NAS-BERT trains a big supernet on a carefully designed search space containing various architectures and outputs multiple compressed models with adaptive sizes and latency. 
Furthermore, the training of NAS-BERT is conducted on standard self-supervised pre-training tasks (e.g., masked language model) and does not depend on specific downstream tasks. Thus, the models it produces can be used across various downstream tasks. The technical challenge of NAS-BERT is that training a big supernet on the pre-training task is extremely costly. We employ several techniques including block-wise search, search space pruning, and performance approximation to improve search efficiency and accuracy. Extensive experiments on GLUE benchmark datasets demonstrate that NAS-BERT can find lightweight models with better accuracy than previous approaches, and can be directly applied to different downstream tasks with adaptive model sizes for different requirements of memory or latency.",/pdf/b9075151b7b155f2bc523a3e13c7717f7cff6256.pdf,ICLR,2021,"we propose NAS-BERT, which leverages neural architecture search for BERT compression with adaptive model sizes and across downstream tasks." +BygqBiRcFQ,H1ebUtqFtQ,1538090000000.0,1550880000000.0,110,Diffusion Scattering Transforms on Graphs,"[""fgama@seas.upenn.edu"", ""aribeiro@seas.upenn.edu"", ""bruna@cims.nyu.edu""]","[""Fernando Gama"", ""Alejandro Ribeiro"", ""Joan Bruna""]","[""graph neural networks"", ""deep learning"", ""stability"", ""scattering transforms"", ""convolutional neural networks""]","Stability is a key aspect of data analysis. In many applications, the natural notion of stability is geometric, as illustrated for example in computer vision. Scattering transforms construct deep convolutional representations which are certified stable to input deformations. This stability to deformations can be interpreted as stability with respect to changes in the metric structure of the domain. 
+ +In this work, we show that scattering transforms can be generalized to non-Euclidean domains using diffusion wavelets, while preserving a notion of stability with respect to metric changes in the domain, measured with diffusion maps. The resulting representation is stable to metric perturbations of the domain while being able to capture ''high-frequency'' information, akin to the Euclidean Scattering. ",/pdf/76146599058fc3e1d8ba58e094a1e5f2d0686847.pdf,ICLR,2019,Stability of scattering transform representations of graph data to deformations of the underlying graph support. +SJz6MnC5YQ,HylebAT5Ym,1538090000000.0,1545360000000.0,1309,DEEP GRAPH TRANSLATION,"[""xguo7@gmu.edu"", ""lwu@email.wm.edu"", ""lzhao9@gmu.edu""]","[""Xiaojie Guo"", ""Lingfei Wu"", ""Liang Zhao""]",[],"The tremendous success of deep generative models on generating continuous data +like image and audio has been achieved; however, few deep graph generative models +have been proposed to generate discrete data such as graphs. The recently proposed +approaches are typically unconditioned generative models which have no +control over modes of the graphs being generated. Differently, in this paper, we +are interested in a new problem named Deep Graph Translation: given an input +graph, the goal is to infer a target graph by learning their underlying translation +mapping. Graph translation could be highly desirable in many applications such +as disaster management and rare event forecasting, where the rare and abnormal +graph patterns (e.g., traffic congestions and terrorism events) will be inferred prior +to their occurrence even without historical data on the abnormal patterns for this +specific graph (e.g., a road network or human contact network). To this end, we +propose a novel Graph-Translation-Generative Adversarial Networks (GT-GAN) +which translates one mode of the input graphs to its target mode. 
GT-GAN consists +of a graph translator where we propose new graph convolution and deconvolution +layers to learn the global and local translation mapping. A new conditional +graph discriminator has also been proposed to classify target graphs by conditioning +on input graphs. Extensive experiments on multiple synthetic and real-world +datasets demonstrate the effectiveness and scalability of the proposed GT-GAN.",/pdf/e9283840b94e1153d88d1c5f8874d4bc0490cddf.pdf,ICLR,2019, +RepN5K31PT3,qnI3gfD1FtI,1601310000000.0,1614990000000.0,2101,On the Dynamic Regret of Online Multiple Mirror Descent,"[""~Nima_Eshraghi1"", ""~and_Ben_Liang1""]","[""Nima Eshraghi"", ""and Ben Liang""]","[""Online learning"", ""Online Convex Optimization"", ""Mirror Descent""]","We study the problem of online convex optimization, where a learner makes sequential decisions to minimize an accumulation of strongly convex costs over time. The quality of decisions is given in terms of the dynamic regret, which measures the performance of the learner relative to a sequence of dynamic minimizers. Prior works on gradient descent and mirror descent have shown that the dynamic regret can be upper bounded using the path length, which depend on the differences between successive minimizers, and an upper bound using the squared path length has also been shown when multiple gradient queries are allowed per round. However, they all require the cost functions to be Lipschitz continuous, which imposes a strong requirement especially when the cost functions are also strongly convex. In this work, we consider Online Multiple Mirror Descent (OMMD), which is based on mirror descent but uses multiple mirror descent steps per online round. Without requiring the cost functions to be Lipschitz continuous, we derive two upper bounds on the dynamic regret based on the path length and squared path length. 
We further derive a third upper bound that relies on the gradient of cost functions, which can be much smaller than the path length or squared path length, especially when the cost functions are smooth but fluctuate over time. Thus, we show that the dynamic regret of OMMD scales linearly with the minimum among the path length, squared path length, and sum squared gradients. Our experimental results further show substantial improvement on the dynamic regret compared with existing alternatives.",/pdf/64579c85b01740e273ee99a2606f7bd66fe9f6db.pdf,ICLR,2021, +n7wIfYPdVet,y9799AmUlTB,1601310000000.0,1615830000000.0,1149,Auxiliary Learning by Implicit Differentiation,"[""~Aviv_Navon1"", ""~Idan_Achituve1"", ""~Haggai_Maron1"", ""~Gal_Chechik1"", ""~Ethan_Fetaya1""]","[""Aviv Navon"", ""Idan Achituve"", ""Haggai Maron"", ""Gal Chechik"", ""Ethan Fetaya""]","[""Auxiliary Learning"", ""Multi-task Learning""]","Training neural networks with auxiliary tasks is a common practice for improving the performance on a main task of interest. +Two main challenges arise in this multi-task learning setting: (i) designing useful auxiliary tasks; and (ii) combining auxiliary tasks into a single coherent loss. Here, we propose a novel framework, AuxiLearn, that targets both challenges based on implicit differentiation. First, when useful auxiliaries are known, we propose learning a network that combines all losses into a single coherent objective function. This network can learn non-linear interactions between tasks. Second, when no useful auxiliary task is known, we describe how to learn a network that generates a meaningful, novel auxiliary task. 
We evaluate AuxiLearn in a series of tasks and domains, including image segmentation and learning with attributes in the low data regime, and find that it consistently outperforms competing methods.",/pdf/2571967094ddf29d39923e16c68d6ddb7f5bb72e.pdf,ICLR,2021,Learn to combine auxiliary tasks in a nonlinear fashion and to design them automatically. +4dXmpCDGNp7,ZKyVvDZMT4u,1601310000000.0,1617920000000.0,1959,Evaluations and Methods for Explanation through Robustness Analysis,"[""~Cheng-Yu_Hsieh1"", ""~Chih-Kuan_Yeh1"", ""~Xuanqing_Liu1"", ""~Pradeep_Kumar_Ravikumar1"", ""~Seungyeon_Kim1"", ""~Sanjiv_Kumar1"", ""~Cho-Jui_Hsieh1""]","[""Cheng-Yu Hsieh"", ""Chih-Kuan Yeh"", ""Xuanqing Liu"", ""Pradeep Kumar Ravikumar"", ""Seungyeon Kim"", ""Sanjiv Kumar"", ""Cho-Jui Hsieh""]","[""Interpretability"", ""Explanations"", ""Adversarial Robustness""]","Feature based explanations, that provide importance of each feature towards the model prediction, is arguably one of the most intuitive ways to explain a model. In this paper, we establish a novel set of evaluation criteria for such feature based explanations by robustness analysis. In contrast to existing evaluations which require us to specify some way to ""remove"" features that could inevitably introduces biases and artifacts, we make use of the subtler notion of smaller adversarial perturbations. By optimizing towards our proposed evaluation criteria, we obtain new explanations that are loosely necessary and sufficient for a prediction. We further extend the explanation to extract the set of features that would move the current prediction to a target class by adopting targeted adversarial attack for the robustness analysis. 
Through experiments across multiple domains and a user study, we validate the usefulness of our evaluation criteria and our derived explanations.",/pdf/d6f38258eb28ef886c3aee4efb08046abbe76e5a.pdf,ICLR,2021,We propose a suite of objective measurements for evaluating feature based explanations by the notion of robustness analysis; we further derive new explanation that captures different characteristics of explanation comparing to existing methods. +rklz16Vtvr,Byx2kuVBvB,1569440000000.0,1577170000000.0,293,ISBNet: Instance-aware Selective Branching Networks,"[""shaofeng@comp.nus.edu.sg"", ""shuyao@comp.nus.edu.sg"", ""wangwei@comp.nus.edu.sg"", ""cg@zju.edu.cn"", ""ooibc@comp.nus.edu.sg""]","[""Shaofeng Cai"", ""Yao Shu"", ""Wei Wang"", ""Gang Chen"", ""Beng Chin Ooi""]","[""neural networks"", ""neural architecture search"", ""efficient inference""]","Recent years have witnessed growing interests in designing efficient neural networks and neural architecture search (NAS). Although remarkable efficiency and accuracy have been achieved, existing expert designed and NAS models neglect the fact that input instances are of varying complexity and thus different amounts of computation are required. Inference with a fixed model that processes all instances through the same transformations would incur computational resources unnecessarily. Customizing the model capacity in an instance-aware manner is required to alleviate such a problem. In this paper, we propose a novel Instance-aware Selective Branching Network-ISBNet to support efficient instance-level inference by selectively bypassing transformation branches of insignificant importance weight. These weights are dynamically determined by a lightweight hypernetwork SelectionNet and recalibrated by gumbel-softmax for sparse branch selection. Extensive experiments show that ISBNet achieves extremely efficient inference in terms of parameter size and FLOPs comparing to existing networks. 
For example, ISBNet takes only 8.70% parameters and 31.01% FLOPs of the efficient network MobileNetV2 with comparable accuracy on CIFAR-10.",/pdf/9055bc5ef917306a87b00e1550cd345f9d3c03d8.pdf,ICLR,2020, +UfJn-cstSF,#NAME?,1601310000000.0,1614990000000.0,1525,Learned ISTA with Error-based Thresholding for Adaptive Sparse Coding,"[""~Li_Ziang1"", ""~Wu_Kailun1"", ""~Yiwen_Guo1"", ""~Changshui_Zhang2""]","[""Li Ziang"", ""Wu Kailun"", ""Yiwen Guo"", ""Changshui Zhang""]","[""Sparse coding"", ""Learned ISTA"", ""Convergence Analysis""]","The learned iterative shrinkage thresholding algorithm (LISTA) introduces deep unfolding models with learnable thresholds in the shrinkage function for sparse coding. Drawing on some theoretical insights, we advocate an error-based thresholding (EBT) mechanism for LISTA, which leverages a function of the layer-wise reconstruction error to suggest an appropriate threshold value for each observation on each layer. We show that the EBT mechanism well-disentangles the learnable parameters in the shrinkage functions from the reconstruction errors, making them more adaptive to the various observations. With rigorous theoretical analyses, we show that the proposed EBT can lead to faster convergence on the basis of LISTA and its variants, in addition to its higher adaptivity. Extensive experimental results confirm our theoretical analyses and verify the effectiveness of our methods.",/pdf/af0eb62f571d0d053e5876db6e8c2a4e485d7b41.pdf,ICLR,2021,"We advocate an error-based thresholding (EBT) mechanism for LISTA, with superior performance and no extra learnable parameters." 
+H1aIuk-RW,BynUO1WCZ,1509120000000.0,1519210000000.0,489,Active Learning for Convolutional Neural Networks: A Core-Set Approach,"[""ozansener@cs.stanford.edu"", ""ssilvio@stanford.edu""]","[""Ozan Sener"", ""Silvio Savarese""]","[""Active Learning"", ""Convolutional Neural Networks"", ""Core-Set Selection""]","Convolutional neural networks (CNNs) have been successfully applied to many recognition and learning tasks using a universal recipe; training a deep model on a very large dataset of supervised examples. However, this approach is rather restrictive in practice since collecting a large set of labeled images is very expensive. One way to ease this problem is coming up with smart ways for choosing images to be labelled from a very large collection (i.e. active learning). + +Our empirical study suggests that many of the active learning heuristics in the literature are not effective when applied to CNNs when applied in batch setting. Inspired by these limitations, we define the problem of active learning as core-set selection, i.e. choosing set of points such that a model learned over the selected subset is competitive for the remaining data points. We further present a theoretical result characterizing the performance of any selected subset using the geometry of the datapoints. As an active learning algorithm, we choose the subset which is expected to yield best result according to our characterization. Our experiments show that the proposed method significantly outperforms existing approaches in image classification experiments by a large margin. +",/pdf/e03df38b7410cbd2c6a27c4578c872e0dc1ce300.pdf,ICLR,2018,We approach to the problem of active learning as a core-set selection problem and show that this approach is especially useful in the batch active learning setting which is crucial when training CNNs. 
+rJWechg0Z,B1lxq3g0b,1509110000000.0,1519050000000.0,399,Minimal-Entropy Correlation Alignment for Unsupervised Deep Domain Adaptation,"[""pietro.morerio@iit.it"", ""jacopo.cavazza@iit.it"", ""vittorio.murino@iit.it""]","[""Pietro Morerio"", ""Jacopo Cavazza"", ""Vittorio Murino""]","[""unsupervised domain adaptation"", ""entropy minimization"", ""image classification"", ""deep transfer learning""]","In this work, we face the problem of unsupervised domain adaptation with a novel deep learning approach which leverages our finding that entropy minimization is induced by the optimal alignment of second order statistics between source and target domains. We formally demonstrate this hypothesis and, aiming at achieving an optimal alignment in practical cases, we adopt a more principled strategy which, differently from the current Euclidean approaches, deploys alignment along geodesics. Our pipeline can be implemented by adding to the standard classification loss (on the labeled source domain), a source-to-target regularizer that is weighted in an unsupervised and data-driven fashion. 
We provide extensive experiments to assess the superiority of our framework on standard domain and modality adaptation benchmarks.",/pdf/2dbd37c1167a2d85b843db7e76dbd4cb087a0c8d.pdf,ICLR,2018,A new unsupervised deep domain adaptation technique which efficiently unifies correlation alignment and entropy minimization +SyxL2TNtvr,SJgWugeuPB,1569440000000.0,1583910000000.0,783,Unsupervised Model Selection for Variational Disentangled Representation Learning,"[""sunnyd@google.com"", ""lmatthey@google.com"", ""andresnds@google.com"", ""nwatters@google.com"", ""cpburgess@google.com"", ""lerchner@google.com"", ""irinah@google.com""]","[""Sunny Duan"", ""Loic Matthey"", ""Andre Saraiva"", ""Nick Watters"", ""Chris Burgess"", ""Alexander Lerchner"", ""Irina Higgins""]","[""unsupervised disentanglement metric"", ""disentangling"", ""representation learning""]","Disentangled representations have recently been shown to improve fairness, data efficiency and generalisation in simple supervised and reinforcement learning tasks. To extend the benefits of disentangled representations to more complex domains and practical applications, it is important to enable hyperparameter tuning and model selection of existing unsupervised approaches without requiring access to ground truth attribute labels, which are not available for most datasets. This paper addresses this problem by introducing a simple yet robust and reliable method for unsupervised disentangled model selection. We show that our approach performs comparably to the existing supervised alternatives across 5400 models from six state of the art unsupervised disentangled representation learning model classes. 
Furthermore, we show that the ranking produced by our approach correlates well with the final task performance on two different domains.",/pdf/e7d741db6e2baa1fb3b937144540ebd6f0e70681.pdf,ICLR,2020,We introduce a method for unsupervised disentangled model selection for VAE-based disentangled representation learning approaches. +HygBZnRctX,BklG1p6cKQ,1538090000000.0,1556300000000.0,1168,Transferring Knowledge across Learning Processes,"[""sflennerhag@turing.ac.uk"", ""morepabl@amazon.com"", ""lawrennd@amazon.com"", ""damianou@amazon.com""]","[""Sebastian Flennerhag"", ""Pablo G. Moreno"", ""Neil D. Lawrence"", ""Andreas Damianou""]","[""meta-learning"", ""transfer learning""]","In complex transfer learning scenarios new tasks might not be tightly linked to previous tasks. Approaches that transfer information contained only in the final parameters of a source model will therefore struggle. Instead, transfer learning at at higher level of abstraction is needed. We propose Leap, a framework that achieves this by transferring knowledge across learning processes. We associate each task with a manifold on which the training process travels from initialization to final parameters and construct a meta-learning objective that minimizes the expected length of this path. Our framework leverages only information obtained during training and can be computed on the fly at negligible cost. We demonstrate that our framework outperforms competing methods, both in meta-learning and transfer learning, on a set of computer vision tasks. Finally, we demonstrate that Leap can transfer knowledge across learning processes in demanding reinforcement learning environments (Atari) that involve millions of gradient steps.",/pdf/d6a95271a4893d361a3ced54895d26e51229e05e.pdf,ICLR,2019,"We propose Leap, a framework that transfers knowledge across learning processes by minimizing the expected distance the training process travels on a task's loss surface." 
+rkl2s34twS,HylWQMfZwH,1569440000000.0,1577170000000.0,167,Wildly Unsupervised Domain Adaptation and Its Powerful and Efficient Solution,"[""feng.liu-2@student.uts.edu.au"", ""jie.lu@uts.edu.au"", ""bo.han@riken.jp"", ""gang.niu@riken.jp"", ""guangquan.zhang@uts.edu.au"", ""sugi@k.u-tokyo.ac.jp""]","[""Feng Liu"", ""Jie Lu"", ""Bo Han"", ""Gang Niu"", ""Guangquan Zhang"", ""Masashi Sugiyama""]",[],"In unsupervised domain adaptation (UDA), classifiers for the target domain (TD) are trained with clean labeled data from the source domain (SD) and unlabeled data from TD. However, in the wild, it is hard to acquire a large amount of perfectly clean labeled data in SD given limited budget. Hence, we consider a new, more realistic and more challenging problem setting, where classifiers have to be trained with noisy labeled data from SD and unlabeled data from TD---we name it wildly UDA (WUDA). We show that WUDA ruins all UDA methods if taking no care of label noise in SD, and to this end, we propose a Butterfly framework, a powerful and efficient solution to WUDA. Butterfly maintains four models (e.g., deep networks) simultaneously, where two take care of all adaptations (i.e., noisy-to-clean, labeled-to-unlabeled, and SD-to-TD-distributional) and then the other two can focus on classification in TD. As a consequence, Butterfly possesses all the conceptually necessary components for solving WUDA. 
Experiments demonstrate that under WUDA, Butterfly significantly outperforms existing baseline methods.",/pdf/5d4363f65f80ef0786970034fb1dc46384734a76.pdf,ICLR,2020, +BygfiAEtwS,BygbgWKOPH,1569440000000.0,1577170000000.0,1311,Inducing Stronger Object Representations in Deep Visual Trackers,"[""goroshin@google.com"", ""tompson@google.com"", ""debidatta@google.com""]","[""Ross Goroshin"", ""Jonathan Tompson"", ""Debidatta Dwibedi""]","[""Object Tracking"", ""Computer Vision"", ""Deep Learning""]","Fully convolutional deep correlation networks are integral components of state-of- +the-art approaches to single object visual tracking. It is commonly assumed that +these networks perform tracking by detection by matching features of the object +instance with features of the entire frame. Strong architectural priors and conditioning +on the object representation is thought to encourage this tracking strategy. +Despite these strong priors, we show that deep trackers often default to “tracking- +by-saliency” detection – without relying on the object instance representation. Our +analysis shows that despite being a useful prior, salience detection can prevent the +emergence of more robust tracking strategies in deep networks. This leads us to +introduce an auxiliary detection task that encourages more discriminative object +representations that improve tracking performance.",/pdf/512bebc7cdcc248f17bf64e4e587aebdfbae07e3.pdf,ICLR,2020, +jDIWFyftpQh,dM8ucSt-9fh,1601310000000.0,1614990000000.0,1187,Discriminative Cross-Modal Data Augmentation for Medical Imaging Applications,"[""~Yue_Yang2"", ""~Pengtao_Xie3""]","[""Yue Yang"", ""Pengtao Xie""]","[""Deep learning"", ""Medical imaging"", ""Cross-Modal Learning""]","While deep learning methods have shown great success in medical image analysis, they require a number of medical images to train. 
Due to data privacy concerns and unavailability of medical annotators, it is oftentimes very difficult to obtain a lot of labeled medical images for model training. In this paper, we study cross-modality data augmentation to mitigate the data deficiency issue in medical imaging domain. We propose a discriminative unpaired image-to-image translation model which translate images in source modality into images in target modality where the translation task is conducted jointly with the downstream prediction task and the translation is guided by the prediction. Experiments on two applications demonstrate the effectiveness of our method. ",/pdf/8fb6967f0b15eb8138cdacade03b26fe04371ae8.pdf,ICLR,2021, +S1lEX04tPr,B1gwqOS_vr,1569440000000.0,1583910000000.0,1035,CM3: Cooperative Multi-goal Multi-stage Multi-agent Reinforcement Learning,"[""yjiachen@gmail.com"", ""anakhaei@honda-ri.com"", ""disele@honda-ri.com"", ""kfujimura@honda-ri.com"", ""zha@cc.gatech.edu""]","[""Jiachen Yang"", ""Alireza Nakhaei"", ""David Isele"", ""Kikuo Fujimura"", ""Hongyuan Zha""]","[""multi-agent reinforcement learning""]","A variety of cooperative multi-agent control problems require agents to achieve individual goals while contributing to collective success. This multi-goal multi-agent setting poses difficulties for recent algorithms, which primarily target settings with a single global reward, due to two new challenges: efficient exploration for learning both individual goal attainment and cooperation for others' success, and credit-assignment for interactions between actions and goals of different agents. To address both challenges, we restructure the problem into a novel two-stage curriculum, in which single-agent goal attainment is learned prior to learning multi-agent cooperation, and we derive a new multi-goal multi-agent policy gradient with a credit function for localized credit assignment. We use a function augmentation scheme to bridge value and policy functions across the curriculum. 
The complete architecture, called CM3, learns significantly faster than direct adaptations of existing algorithms on three challenging multi-goal multi-agent problems: cooperative navigation in difficult formations, negotiating multi-vehicle lane changes in the SUMO traffic simulator, and strategic cooperation in a Checkers environment.",/pdf/87850f6ffe6f3c3098ce5843d5d5e9199fffa32a.pdf,ICLR,2020,"A modular method for fully cooperative multi-goal multi-agent reinforcement learning, based on curriculum learning for efficient exploration and credit assignment for action-goal interactions." +ZJGnFbd6vW,nCChE6hYYkR,1601310000000.0,1614990000000.0,1644,PCPs: Patient Cardiac Prototypes,"[""~Dani_Kiyasseh1"", ""tingting.zhu@eng.ox.ac.uk"", ""~David_A._Clifton1""]","[""Dani Kiyasseh"", ""Tingting Zhu"", ""David A. Clifton""]","[""Contrastive learning"", ""dataset distillation"", ""patient-similarity"", ""physiological signals"", ""healthcare""]","Existing deep learning methodologies within the medical domain are typically population-based and difficult to interpret. This limits their clinical utility as population-based findings may not generalize to the individual patient. To overcome these obstacles, we propose to learn patient-specific representations, entitled patient cardiac prototypes (PCPs), that efficiently summarize the cardiac state of a patient. 
We show that PCPs, learned in an end-to-end manner via contrastive learning, allow for the discovery of similar patients both within and across datasets, and can be exploited for dataset distillation as a compact substitute for the original dataset.",/pdf/226b44764233931a940f56444ece588b90e36730.pdf,ICLR,2021, +dFBRrTMjlyL,TfPvmBi7Ids,1601310000000.0,1614990000000.0,1014,Bidirectionally Self-Normalizing Neural Networks,"[""yao.lu@anu.edu.au"", ""~Stephen_Gould1"", ""~Thalaiyasingam_Ajanthan1""]","[""Yao Lu"", ""Stephen Gould"", ""Thalaiyasingam Ajanthan""]",[],"The problem of exploding and vanishing gradients has been a long-standing obstacle that hinders the effective training of neural networks. Despite various tricks and techniques that have been employed to alleviate the problem in practice, there still lacks satisfactory theories or provable solutions. In this paper, we address the problem from the perspective of high-dimensional probability theory. We provide a rigorous result that shows, under mild conditions, how the exploding/vanishing gradient problem disappears with high probability if the neural networks have sufficient width. Our main idea is to constrain both forward and backward signal propagation in a nonlinear neural network through a new class of activation functions, namely Gaussian-Poincaré normalized functions, and orthogonal weight matrices. Experiments on both synthetic and real-world data validate our theory and confirm its effectiveness on very deep neural networks when applied in practice.",/pdf/cfe65c31875b1490fc7a8459a92eaaf3f8fd1e20.pdf,ICLR,2021,We theoretically solve the exploding and vanishing gradients problem in neural network training. 
+HkgR8erKwB,HkxD65lYPr,1569440000000.0,1577170000000.0,2343,PAC-Bayesian Neural Network Bounds,"[""yossiadidrum@gmail.com"", ""aschwing@illinois.edu"", ""tamir.hazan@technion.ac.il""]","[""Yossi Adi"", ""Alex Schwing"", ""Tamir Hazan""]","[""PAC-Bayesian bounds"", ""PAC-Bayes"", ""Generalization bounds"", ""Bayesian inference""]","Bayesian neural networks, which both use the negative log-likelihood loss function and average their predictions using a learned posterior over the parameters, have been used successfully across many scientific fields, partly due to their ability to `effortlessly' extract desired representations from many large-scale datasets. However, generalization bounds for this setting is still missing. +In this paper, we present a new PAC-Bayesian generalization bound for the negative log-likelihood loss which utilizes the \emph{Herbst Argument} for the log-Sobolev inequality to bound the moment generating function of the learners risk.",/pdf/92510bae6d9695944200529fd0b20cb43467e467.pdf,ICLR,2020,We derive a new PAC-Bayesian Bound for unbounded loss functions (e.g. Negative Log-Likelihood). +crAi7c41xTh,79mKtVDAXr6,1601310000000.0,1614990000000.0,1062,Shape Matters: Understanding the Implicit Bias of the Noise Covariance,"[""~Jeff_Z._HaoChen1"", ""~Colin_Wei1"", ""~Jason_D._Lee1"", ""~Tengyu_Ma1""]","[""Jeff Z. HaoChen"", ""Colin Wei"", ""Jason D. Lee"", ""Tengyu Ma""]","[""implicit regularization"", ""implicit bias"", ""algorithmic regularization"", ""over-parameterization"", ""learning theory""]","The noise in stochastic gradient descent (SGD) provides a crucial implicit regularization effect for training overparameterized models. Prior theoretical work largely focuses on spherical Gaussian noise, whereas empirical studies demonstrate the phenomenon that parameter-dependent noise --- induced by mini-batches or label perturbation --- is far more effective than Gaussian noise. 
+This paper theoretically characterizes this phenomenon on a quadratically-parameterized model introduced by Vaskevicius et al. and Woodworth et al. We show that in an over-parameterized setting, SGD with label noise recovers the sparse ground-truth with an arbitrary initialization, whereas SGD with Gaussian noise or gradient descent overfits to dense solutions with large norms. Our analysis reveals that parameter-dependent noise introduces a bias towards local minima with smaller noise variance, whereas spherical Gaussian noise does not. ",/pdf/1eaa42a8029f10c2aaa236a956889a4fc8fbc67a.pdf,ICLR,2021,"We theoretically prove that in an over-parameterized setting, SGD with label noise recovers the ground-truth, whereas SGD with spherical Gaussian noise overfits." +jYVY_piet7m,GKGL0LuVfme,1601310000000.0,1614990000000.0,1482,Hybrid-Regressive Neural Machine Translation,"[""~Qiang_Wang8"", ""~Heng_Yu1"", ""~Shaohui_Kuang1"", ""weihua.luowh@alibaba-inc.com""]","[""Qiang Wang"", ""Heng Yu"", ""Shaohui Kuang"", ""Weihua Luo""]","[""neural machine translation"", ""non-autoregressive translation""]","Although the non-autoregressive translation model based on iterative refinement has achieved comparable performance to the autoregressive counterparts with faster decoding, we empirically found that such aggressive iterations make the acceleration rely heavily on small batch size (e.g., 1) and computing device (e.g., GPU). +By designing synthetic experiments, we highlight that iteration times can be significantly reduced when providing a good (partial) target context. +Inspired by this, we propose a two-stage translation prototype -- Hybrid-Regressive Translation (HRT). HRT first jumpily generates a discontinuous sequence by autoregression (e.g., make a prediction every k tokens, k>1). Then, with the help of the partially deterministic target context, HRT fills all the previously skipped tokens with one iteration in a non-autoregressive way. 
+The experimental results on WMT'16 En-Ro and WMT'14 En-De show that our model outperforms the state-of-the-art non-autoregressive models with multiple iterations, even autoregressive models. Moreover, compared with autoregressive models, HRT can be steadily accelerated 1.5 times regardless of batch size and device.",/pdf/e6ab7ee69fcaa957e9f19d449ae13b791ed627fd.pdf,ICLR,2021,"Conventional non-autoregressive translation with multiple iterations cannot accelerate decoding when using a small batch size (especially on CPU), and we propose Hybrid-Regressive Translation (HRT) to overcome this issue." +SkeuipVKDH,Hyx-l4JdDB,1569440000000.0,1577170000000.0,750,RTC-VAE: HARNESSING THE PECULIARITY OF TOTAL CORRELATION IN LEARNING DISENTANGLED REPRESENTATIONS,"[""ze.cheng@cn.bosch.com"", ""junchenl@cs.cmu.edu"", ""chenxujwang@gmail.com"", ""jixuan.gu@sjtu.edu.cn"", ""hao.xu-1@colorado.edu"", ""xinjianl@cs.cmu.edu"", ""fmetze@cs.cmu.edu""]","[""Ze Cheng"", ""Juncheng B Li"", ""Chenxu Wang"", ""Jixuan Gu"", ""Hao Xu"", ""Xinjian Li"", ""Florian Metze""]","[""Total Correlation"", ""VAEs"", ""Disentanglement""]","In the problem of unsupervised learning of disentangled representations, one of the promising methods is to penalize the total correlation of sampled latent vari-ables. Unfortunately, this well-motivated strategy often fail to achieve disentanglement due to a problematic difference between the sampled latent representation and its corresponding mean representation. We provide a theoretical explanation that low total correlation of sample distribution cannot guarantee low total correlation of the mean representation. We prove that for the mean representation of arbitrarily high total correlation, there exist distributions of latent variables of abounded total correlation. 
However, we still believe that total correlation could be a key to the disentanglement of unsupervised representative learning, and we propose a remedy, RTC-VAE, which rectifies the total correlation penalty. Experiments show that our model has a more reasonable distribution of the mean representation compared with baseline models, e.g.,β-TCVAE and FactorVAE.",/pdf/791065a4afc28b0129574cec055b91961ecf6bd6.pdf,ICLR,2020,diagnosed all the problem of STOA VAEs theoretically and qualitatively +H1a37GWCZ,Byj2QfZCZ,1509140000000.0,1518730000000.0,811,UNSUPERVISED SENTENCE EMBEDDING USING DOCUMENT STRUCTURE-BASED CONTEXT,"[""taesung.lee@ibm.com"", ""young_park@us.ibm.com""]","[""Taesung Lee"", ""Youngja Park""]","[""distributed representation"", ""sentence embedding"", ""structure"", ""technical documents"", ""sentence embedding"", ""out-of-vocabulary""]","We present a new unsupervised method for learning general-purpose sentence embeddings. +Unlike existing methods which rely on local contexts, such as words +inside the sentence or immediately neighboring sentences, our method selects, for +each target sentence, influential sentences in the entire document based on a document +structure. We identify a dependency structure of sentences using metadata +or text styles. Furthermore, we propose a novel out-of-vocabulary word handling +technique to model many domain-specific terms, which were mostly discarded by +existing sentence embedding methods. We validate our model on several tasks +showing 30% precision improvement in coreference resolution in a technical domain, +and 7.5% accuracy increase in paraphrase detection compared to baselines.",/pdf/80b6fe527ac8b5682356fe041d123f162a57c28c.pdf,ICLR,2018,"To train a sentence embedding using technical documents, our approach considers document structure to find broader context and handle out-of-vocabulary words." 
+rkg0_eHtDr,H1gYD6eKwH,1569440000000.0,1577170000000.0,2417,Benefits of Overparameterization in Single-Layer Latent Variable Generative Models,"[""rbuhai@mit.edu"", ""aristesk@andrew.cmu.edu"", ""yhalpern@google.com"", ""dsontag@csail.mit.edu""]","[""Rares-Darius Buhai"", ""Andrej Risteski"", ""Yoni Halpern"", ""David Sontag""]","[""overparameterization"", ""unsupervised"", ""parameter recovery"", ""rigorous experiments""]","One of the most surprising and exciting discoveries in supervising learning was the benefit of overparameterization (i.e. training a very large model) to improving the optimization landscape of a problem, with minimal effect on statistical performance (i.e. generalization). In contrast, unsupervised settings have been under-explored, despite the fact that it has been observed that overparameterization can be helpful as early as Dasgupta & Schulman (2007). In this paper, we perform an exhaustive study of different aspects of overparameterization in unsupervised learning via synthetic and semi-synthetic experiments. We discuss benefits to different metrics of success (recovering the parameters of the ground-truth model, held-out log-likelihood), sensitivity to variations of the training algorithm, and behavior as the amount of overparameterization increases. We find that, when learning using methods such as variational inference, larger models can significantly increase the number of ground truth latent variables recovered.",/pdf/5a9b3f9bba59f819e208f1c030b90ee105134505.pdf,ICLR,2020,Overparameterization aids parameter recovery in unsupervised settings. 
+S1eEdj0cK7,Bye4agXYKm,1538090000000.0,1545360000000.0,344,On the Relationship between Neural Machine Translation and Word Alignment,"[""znculee@gmail.com"", ""redmondliu@tencent.com"", ""epsilonlee.green@gmail.com"", ""max.meng@ieee.org"", ""shumingshi@tencent.com""]","[""Xintong Li"", ""Lemao Liu"", ""Guanlin Li"", ""Max Meng"", ""Shuming Shi""]","[""Neural Machine Translation"", ""Word Alignment"", ""Neural Network"", ""Pointwise Mutual Information""]","Prior researches suggest that attentional neural machine translation (NMT) is able to capture word alignment by attention, however, to our surprise, it almost fails for NMT models with multiple attentional layers except for those with a single layer. This paper introduce two methods to induce word alignment from general neural machine translation models. Experiments verify that both methods obtain much better word alignment than the method by attention. Furthermore, based on one of the proposed method, we design a criterion to divide target words into two categories (i.e. those mostly contributed from source ""CFS"" words and the other words mostly contributed from target ""CFT"" words), and analyze word alignment under these two categories in depth. We find that although NMT models are difficult to capture word alignment for CFT words but these words do not sacrifice translation quality significantly, which provides an explanation why NMT is more successful for translation yet worse for word alignment compared to statistical machine translation. We further demonstrate that word alignment errors for CFS words are responsible for translation errors in some extent by measuring the correlation between word alignment and translation for several NMT systems.",/pdf/162c682634a59e444545ecdf11cf9fb49d3d4688.pdf,ICLR,2019,It proposes methods to induce word alignment for neural machine translation (NMT) and uses them to interpret the relationship between NMT and word alignment. 
+ryxepo0cFX,r1gtQhZ5tX,1538090000000.0,1550620000000.0,768,AntisymmetricRNN: A Dynamical System View on Recurrent Neural Networks,"[""bchang@stat.ubc.ca"", ""minminc@google.com"", ""haber@math.ubc.ca"", ""edchi@google.com""]","[""Bo Chang"", ""Minmin Chen"", ""Eldad Haber"", ""Ed H. Chi""]",[],"Recurrent neural networks have gained widespread use in modeling sequential data. Learning long-term dependencies using these models remains difficult though, due to exploding or vanishing gradients. In this paper, we draw connections between recurrent networks and ordinary differential equations. A special form of recurrent networks called the AntisymmetricRNN is proposed under this theoretical framework, which is able to capture long-term dependencies thanks to the stability property of its underlying differential equation. Existing approaches to improving RNN trainability often incur significant computation overhead. In comparison, AntisymmetricRNN achieves the same goal by design. We showcase the advantage of this new architecture through extensive simulations and experiments. AntisymmetricRNN exhibits much more predictable dynamics. It outperforms regular LSTM models on tasks requiring long-term memory and matches the performance on tasks where short-term dependencies dominate despite being much simpler.",/pdf/bb1ee1fd82aac529efcebca0ddd48ad14d3d4209.pdf,ICLR,2019, +rygtPhVtDS,SylFePeurB,1569440000000.0,1577170000000.0,13,Noise Regularization for Conditional Density Estimation,"[""jonas.rothfuss@gmail.com"", ""fabioferreira@mailbox.org"", ""simonboehm@gmx.de"", ""simon.walther@kit.edu"", ""maxim.ulrich@kit.edu"", ""asfour@kit.edu"", ""krausea@ethz.ch""]","[""Jonas Rothfuss"", ""Fabio Ferreira"", ""Simon Boehm"", ""Simon Walther"", ""Maxim Ulrich"", ""Tamim Asfour"", ""Andreas Krause""]",[],"Modelling statistical relationships beyond the conditional mean is crucial in many settings. 
Conditional density estimation (CDE) aims to learn the full conditional probability density from data. Though highly expressive, neural network based CDE models can suffer from severe over-fitting when trained with the maximum likelihood objective. Due to the inherent structure of such models, classical regularization approaches in the parameter space are rendered ineffective. To address this issue, we develop a model-agnostic noise regularization method for CDE that adds random perturbations to the data during training. We demonstrate that the proposed approach corresponds to a smoothness regularization and prove its asymptotic consistency. In our experiments, noise regularization significantly and consistently outperforms other regularization methods across seven data sets and three CDE models. The effectiveness of noise regularization makes neural network based CDE the preferable method over previous non- and semi-parametric approaches, even when training data is scarce. ",/pdf/94e720b86702780f551af0a2c2274287eea4e87a.pdf,ICLR,2020,A model-agnostic regularization scheme for neural network-based conditional density estimation. +BJGWO9k0Z,HJWWuqJAZ,1509040000000.0,1519290000000.0,149,Critical Percolation as a Framework to Analyze the Training of Deep Networks,"[""zoharahoz@gmail.com"", ""rodrigo.bem@gmail.com""]","[""Zohar Ringel"", ""Rodrigo Andrade de Bem""]","[""Deep Convolutional Networks"", ""Loss function landscape"", ""Graph Structured Data"", ""Training Complexity"", ""Theory of deep learning"", ""Percolation theory"", ""Anderson Localization""]","In this paper we approach two relevant deep learning topics: i) tackling of graph structured input data and ii) a better understanding and analysis of deep networks and related learning algorithms. With this in mind we focus on the topological classification of reachability in a particular subset of planar graphs (Mazes). 
Doing so, we are able to model the topology of data while staying in Euclidean space, thus allowing its processing with standard CNN architectures. We suggest a suitable architecture for this problem and show that it can express a perfect solution to the classification task. The shape of the cost function around this solution is also derived and, remarkably, does not depend on the size of the maze in the large maze limit. Responsible for this behavior are rare events in the dataset which strongly regulate the shape of the cost function near this global minimum. We further identify an obstacle to learning in the form of poorly performing local minima in which the network chooses to ignore some of the inputs. We further support our claims with training experiments and numerical analysis of the cost function on networks with up to $128$ layers.",/pdf/ab3cedcdcc4f95c60e44b97419f70d20678b74e6.pdf,ICLR,2018,A toy dataset based on critical percolation in a planar graph provides an analytical window to the training dynamics of deep neural networks +po-DLlBuAuz,Sr4T93jMd4v,1601310000000.0,1616070000000.0,216,Batch Reinforcement Learning Through Continuation Method,"[""~Yijie_Guo1"", ""~Shengyu_Feng1"", ""~Nicolas_Le_Roux2"", ""~Ed_Chi1"", ""~Honglak_Lee2"", ""~Minmin_Chen1""]","[""Yijie Guo"", ""Shengyu Feng"", ""Nicolas Le Roux"", ""Ed Chi"", ""Honglak Lee"", ""Minmin Chen""]","[""batch reinforcement learning"", ""continuation method"", ""relaxed regularization""]","Many real-world applications of reinforcement learning (RL) require the agent to learn from a fixed set of trajectories, without collecting new interactions. Policy optimization under this setting is extremely challenging as: 1) the geometry of the objective function is hard to optimize efficiently; 2) the shift of data distributions causes high noise in the value estimation. 
In this work, we propose a simple yet effective policy iteration approach to batch RL using global optimization techniques known as continuation. By constraining the difference between the learned policy and the behavior policy that generates the fixed trajectories, and continuously relaxing the constraint, our method 1) helps the agent escape local optima; 2) reduces the error in policy evaluation in the optimization procedure. We present results on a variety of control tasks, game environments, and a recommendation task to empirically demonstrate the efficacy of our proposed method.",/pdf/84a7a35d996f84ab9fbbbcabccbdc21f44f2ba68.pdf,ICLR,2021, +hecuSLbL_vC,K2zUGGrwah,1601310000000.0,1614990000000.0,2367,Generalisation Guarantees For Continual Learning With Orthogonal Gradient Descent,"[""~Mehdi_Abbana_Bennani1"", ""~Thang_Doan1"", ""~Masashi_Sugiyama1""]","[""Mehdi Abbana Bennani"", ""Thang Doan"", ""Masashi Sugiyama""]","[""Continual Learning"", ""Neural Tangent Kernel"", ""Optimisation""]","In Continual Learning settings, deep neural networks are prone to Catastrophic Forgetting. Orthogonal Gradient Descent (Farajtabar et al., 2019) was proposed to tackle the challenge. However, no theoretical guarantees have been proven yet. We present a theoretical framework to study Continual Learning algorithms in the NTK regime. This framework comprises closed form expression of the model through tasks and proxies for transfer learning, generalisation and tasks similarity. In this framework, we prove that OGD is robust to Catastrophic Forgetting then derive the first generalisation bound for SGD and OGD for Continual Learning. Finally, we study the limits of this framework in practice for OGD and highlight the importance of the NTK variation for Continual Learning.",/pdf/ed8f61f90640bce7f2903fa1e2562fa9813e6e81.pdf,ICLR,2021,"An NTK framework for Continual Learning, with robustness and generalisation guarantees for Orthogonal Gradient Descent." 
+dV19Yyi1fS3,POO4hVNsW5Y,1601310000000.0,1613400000000.0,630,Training with Quantization Noise for Extreme Model Compression,"[""~Pierre_Stock1"", ""~Angela_Fan2"", ""~Benjamin_Graham1"", ""~Edouard_Grave1"", ""~R\u00e9mi_Gribonval1"", ""~Herve_Jegou1"", ""~Armand_Joulin1""]","[""Pierre Stock"", ""Angela Fan"", ""Benjamin Graham"", ""Edouard Grave"", ""R\u00e9mi Gribonval"", ""Herve Jegou"", ""Armand Joulin""]","[""Compression"", ""Efficiency"", ""Product Quantization""]","We tackle the problem of producing compact models, maximizing their accuracy for a given model size. A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the gradients approximated with the Straight-Through Estimator. In this paper, we extend this approach to work with extreme compression methods where the approximations introduced by STE are severe. Our proposal is to only quantize a different random subset of weights during each forward, allowing for unbiased gradients to flow through the other weights. Controlling the amount of noise and its form allows for extreme compression rates while maintaining the performance of the original model. As a result we establish new state-of-the-art compromises between accuracy and model size both in natural language processing and image classification. 
For example, applying our method to state-of-the-art Transformer and ConvNet architectures, we can achieve 82.5% accuracy on MNLI by compressing RoBERTa to 14 MB and 80.0% top-1 accuracy on ImageNet by compressing an EfficientNet-B3 to 3.3 MB.",/pdf/e7c435ae8b65ac9199efd2c6f55258018a8a229b.pdf,ICLR,2021, +r1e13s05YX,HyxApcj5Ym,1538090000000.0,1548750000000.0,672,Neural network gradient-based learning of black-box function interfaces,"[""alon.jacovi@il.ibm.com"", ""guyh@il.ibm.com"", ""einatke@il.ibm.com"", ""boazc@il.ibm.com"", ""oferl@il.ibm.com"", ""gkour@ibm.com"", ""joberant@cs.tau.ac.il""]","[""Alon Jacovi"", ""Guy Hadash"", ""Einat Kermany"", ""Boaz Carmeli"", ""Ofer Lavi"", ""George Kour"", ""Jonathan Berant""]","[""neural networks"", ""black box functions"", ""gradient descent""]","Deep neural networks work well at approximating complicated functions when provided with data and trained by gradient descent methods. At the same time, there is a vast amount of existing functions that programmatically solve different tasks in a precise manner eliminating the need for training. In many cases, it is possible to decompose a task to a series of functions, of which for some we may prefer to use a neural network to learn the functionality, while for others the preferred method would be to use existing black-box functions. We propose a method for end-to-end training of a base neural network that integrates calls to existing black-box functions. We do so by approximating the black-box functionality with a differentiable neural network in a way that drives the base network to comply with the black-box function interface during the end-to-end optimization process. At inference time, we replace the differentiable estimator with its external black-box non-differentiable counterpart such that the base network output matches the input arguments of the black-box function. 
Using this ``Estimate and Replace'' paradigm, we train a neural network, end to end, to compute the input to black-box functionality while eliminating the need for intermediate labels. We show that by leveraging the existing precise black-box function during inference, the integrated model generalizes better than a fully differentiable model, and learns more efficiently compared to RL-based methods.",/pdf/32f89b33cd2283a8bdbd1da127f1142f36a834e1.pdf,ICLR,2019,Training DNNs to interface w\ black box functions w\o intermediate labels by using an estimator sub-network that can be replaced with the black box after training +B16dGcqlx,,1478300000000.0,1488760000000.0,531,Third Person Imitation Learning,"[""bstadie@openai.com"", ""pieter@openai.com"", ""ilyasu@openai.com""]","[""Bradly C Stadie"", ""Pieter Abbeel"", ""Ilya Sutskever""]",[],"Reinforcement learning (RL) makes it possible to train agents capable of achieving +sophisticated goals in complex and uncertain environments. A key difficulty in +reinforcement learning is specifying a reward function for the agent to optimize. +Traditionally, imitation learning in RL has been used to overcome this problem. +Unfortunately, hitherto imitation learning methods tend to require that demonstrations +are supplied in the first-person: the agent is provided with a sequence of +states and a specification of the actions that it should have taken. While powerful, +this kind of imitation learning is limited by the relatively hard problem of collecting +first-person demonstrations. Humans address this problem by learning from +third-person demonstrations: they observe other humans perform tasks, infer the +task, and accomplish the same task themselves. +In this paper, we present a method for unsupervised third-person imitation learning. 
+Here third-person refers to training an agent to correctly achieve a simple +goal in a simple environment when it is provided a demonstration of a teacher +achieving the same goal but from a different viewpoint; and unsupervised refers +to the fact that the agent receives only these third-person demonstrations, and is +not provided a correspondence between teacher states and student states. Our +methods primary insight is that recent advances from domain confusion can be +utilized to yield domain agnostic features which are crucial during the training +process. To validate our approach, we report successful experiments on learning +from third-person demonstrations in a pointmass domain, a reacher domain, and +inverted pendulum.",/pdf/ad3dffc5a4859d914d90c0e7a948350dfa425511.pdf,ICLR,2017,Agent watches another agent at a different camera angle completing the task and learns via raw pixels how to imitate. +USCNapootw,_kXYS2y5LqL,1601310000000.0,1615980000000.0,3751,Certify or Predict: Boosting Certified Robustness with Compositional Architectures,"[""~Mark_Niklas_Mueller2"", ""~Mislav_Balunovic1"", ""~Martin_Vechev1""]","[""Mark Niklas Mueller"", ""Mislav Balunovic"", ""Martin Vechev""]","[""Provable Robustness"", ""Network Architecture"", ""Robustness"", ""Adversarial Accuracy"", ""Certified Robustness""]","A core challenge with existing certified defense mechanisms is that while they improve certified robustness, they also tend to drastically decrease natural accuracy, making it difficult to use these methods in practice. In this work, we propose a new architecture which addresses this challenge and enables one to boost the certified robustness of any state-of-the-art deep network, while controlling the overall accuracy loss, without requiring retraining. The key idea is to combine this model with a (smaller) certified network where at inference time, an adaptive selection mechanism decides on the network to process the input sample. 
The approach is compositional: one can combine any pair of state-of-the-art (e.g., EfficientNet or ResNet) and certified networks, without restriction. The resulting architecture enables much higher natural accuracy than previously possible with certified defenses alone, while substantially boosting the certified robustness of deep networks. We demonstrate the effectiveness of this adaptive approach on a variety of datasets and architectures. For instance, on CIFAR-10 with an $\ell_\infty$ perturbation of 2/255, we are the first to obtain a high natural accuracy (90.1%) with non-trivial certified robustness (27.5%). Notably, prior state-of-the-art methods incur a substantial drop in accuracy for a similar certified robustness.",/pdf/9e6eb18330c25797cf3af7f9a23a4ac3a44de56c.pdf,ICLR,2021,"We propose a compositional network architecture boosting the certified robustness of an accurate state-of-the-art network, by combining it with a shallow, provable network using a certified, adaptive selection mechanism." +S1xiOjC9F7,rJxCN_4cYQ,1538090000000.0,1545360000000.0,384,Graph Matching Networks for Learning the Similarity of Graph Structured Objects,"[""yujiali@google.com"", ""gcj@google.com"", ""thomasdullien@google.com"", ""vinyals@google.com"", ""pushmeet@google.com""]","[""Yujia Li"", ""Chenjie Gu"", ""Thomas Dullien"", ""Oriol Vinyals"", ""Pushmeet Kohli""]","[""Similarity learning"", ""structured objects"", ""graph matching networks""]","This paper addresses the challenging problem of retrieval and matching of graph structured objects, and makes two key contributions. First, we demonstrate how Graph Neural Networks (GNN), which have emerged as an effective model for various supervised prediction problems defined on structured data, can be trained to produce embedding of graphs in vector spaces that enables efficient similarity reasoning. 
Second, we propose a novel Graph Matching Network model that, given a pair of graphs as input, computes a similarity score between them by jointly reasoning on the pair through a new cross-graph attention-based matching mechanism. We demonstrate the effectiveness of our models on different domains including the challenging problem of control-flow-graph based function similarity search that plays an important role in the detection of vulnerabilities in software systems. The experimental analysis demonstrates that our models are not only able to exploit structure in the context of similarity learning but they can also outperform domain-specific baseline systems that have been carefully hand-engineered for these problems.",/pdf/66aa6fbe79298e8b0fdb69f8118a45d1a38f4fd7.pdf,ICLR,2019,"We tackle the problem of similarity learning for structured objects with applications in particular in computer security, and propose a new model graph matching networks that excels on this task." +Y9McSeEaqUh,MqafZmMny9K,1601310000000.0,1615850000000.0,1583,Predicting Classification Accuracy When Adding New Unobserved Classes,"[""~Yuli_Slavutsky1"", ""~Yuval_Benjamini1""]","[""Yuli Slavutsky"", ""Yuval Benjamini""]","[""multiclass classification"", ""classification"", ""extrapolation"", ""accuracy"", ""ROC""]","Multiclass classifiers are often designed and evaluated only on a sample from the classes on which they will eventually be applied. Hence, their final accuracy remains unknown. In this work we study how a classifier’s performance over the initial class sample can be used to extrapolate its expected accuracy on a larger, unobserved set of classes. For this, we define a measure of separation between correct and incorrect classes that is independent of the number of classes: the ""reversed ROC"" (rROC), which is obtained by replacing the roles of classes and data-points in the common ROC. 
We show that the classification accuracy is a function of the rROC in multiclass classifiers, for which the learned representation of data from the initial class sample remains unchanged when new classes are added. Using these results we formulate a robust neural-network-based algorithm, ""CleaneX"", which learns to estimate the accuracy of such classifiers on arbitrarily large sets of classes. Unlike previous methods, our method uses both the observed accuracies of the classifier and densities of classification scores, and therefore achieves remarkably better predictions than current state-of-the-art methods on both simulations and real datasets of object detection, face recognition, and brain decoding.",/pdf/0a83be3c3bae29a302683c1ec66eeea1b7307703.pdf,ICLR,2021,A new prediction method of multiclass classification accuracy for an increased number of classes. +QzKDLiosEd,hd8ihxy48V,1601310000000.0,1614990000000.0,1655,Can one hear the shape of a neural network?: Snooping the GPU via Magnetic Side Channel,"[""~Henrique_Teles_Maia1"", ""~Chang_Xiao1"", ""~Dingzeyu_Li2"", ""~Eitan_Grinspun3"", ""~Changxi_Zheng1""]","[""Henrique Teles Maia"", ""Chang Xiao"", ""Dingzeyu Li"", ""Eitan Grinspun"", ""Changxi Zheng""]","[""side channel"", ""model extraction"", ""GPU"", ""magnetic induction"", ""sensors""]","We examine the magnetic flux emanating from a graphics processing unit’s (GPU’s) power cable, as acquired by a cheap $3 induction sensor, and find that this signal betrays the detailed topology and hyperparameters of a black-box neural network model. The attack acquires the magnetic signal for one query with unknown input values, but known input dimension and batch size. The reconstruction is possible due to the modular layer sequence in which deep neural networks are evaluated. 
We find that each layer component’s evaluation produces an identifiable magnetic signal signature, from which layer topology, width, function type, and sequence order can be inferred using a suitably trained classifier and an optimization based on integer programming. We study the extent to which network specifications can be recovered, and consider metrics for comparing network similarity. We demonstrate the potential accuracy of this side channel attack in recovering the details for a broad range of network architectures including also random designs. We consider applications that may exploit this novel side channel exposure, such as adversarial transfer attacks. In response, we discuss countermeasures to protect against our method and other similar snooping techniques.",/pdf/a54e8fb6b7c012c819236654144d0a327bcc5b41.pdf,ICLR,2021,We examine the magnetic flux emanating from a graphics processing unit’s (GPU’s) power cable and find that this signal betrays the detailed topology and hyperparameters of a black-box neural network model. +SkJd_y-Cb,BkAPukWA-,1509120000000.0,1518730000000.0,491,Word2net: Deep Representations of Language,"[""marirudolph@gmail.com"", ""f.ruiz@columbia.edu"", ""david.blei@columbia.edu""]","[""Maja Rudolph"", ""Francisco Ruiz"", ""David Blei""]","[""neural language models"", ""word embeddings"", ""neural networks""]","Word embeddings extract semantic features of words from large datasets of text. +Most embedding methods rely on a log-bilinear model to predict the occurrence +of a word in a context of other words. Here we propose word2net, a method that +replaces their linear parametrization with neural networks. For each term in the +vocabulary, word2net posits a neural network that takes the context as input and +outputs a probability of occurrence. Further, word2net can use the hierarchical +organization of its word networks to incorporate additional meta-data, such as +syntactic features, into the embedding model. 
For example, we show how to share +parameters across word networks to develop an embedding model that includes +part-of-speech information. We study word2net with two datasets, a collection +of Wikipedia articles and a corpus of U.S. Senate speeches. Quantitatively, we +found that word2net outperforms popular embedding methods on predicting held- +out words and that sharing parameters based on part of speech further boosts +performance. Qualitatively, word2net learns interpretable semantic representations +and, compared to vector-based methods, better incorporates syntactic information.",/pdf/9a7a0e8409dc920a6c97ed2d4777df567c7dfbb8.pdf,ICLR,2018,Word2net is a novel method for learning neural network representations of words that can use syntactic information to learn better semantic features. +pD9x3TmLONE,pxJgk1CmGJy,1601310000000.0,1614990000000.0,2663,XMixup: Efficient Transfer Learning with Auxiliary Samples by Cross-Domain Mixup,"[""~Xingjian_Li1"", ""~Haoyi_Xiong1"", ""~Haozhe_An1"", ""~Cheng-zhong_Xu1"", ""~Dejing_Dou1""]","[""Xingjian Li"", ""Haoyi Xiong"", ""Haozhe An"", ""Cheng-zhong Xu"", ""Dejing Dou""]","[""transfer learning"", ""deep learning""]","Transferring knowledge from large source datasets is an effective way to fine-tune the deep neural networks of the target task with a small sample size. A great number of algorithms have been proposed to facilitate deep transfer learning, and these techniques could be generally categorized into two groups – Regularized Learning of the target task using models that have been pre-trained from source datasets, and Multitask Learning with both source and target datasets to train a shared backbone neural network. In this work, we aim to improve the multitask paradigm for deep transfer learning via Cross-domain Mixup (XMixup). 
While the existing multitask learning algorithms need to run backpropagation over both the source and target datasets and usually consume a higher gradient complexity, XMixup transfers the knowledge from source to target tasks more efficiently: for every class of the target task, XMixup selects the auxiliary samples from the source dataset and augments training samples via the simple mixup strategy. We evaluate XMixup over six real world transfer learning datasets. Experiment results show that XMixup improves the accuracy by 1.9% on average. Compared with other state-of-the-art transfer learning approaches, XMixup costs much less training time while still obtains higher accuracy.",/pdf/1a88444fb3cd58103a150ec21b79c2bbfd9960d6.pdf,ICLR,2021,This paper presents an effective and efficient deep transfer learning algorithm. +tqOvYpjPax2,bh5AAX70-bTM,1601310000000.0,1614160000000.0,1769,Intraclass clustering: an implicit learning ability that regularizes DNNs,"[""~Simon_Carbonnelle1"", ""~Christophe_De_Vleeschouwer1""]","[""Simon Carbonnelle"", ""Christophe De Vleeschouwer""]","[""deep learning"", ""generalization"", ""implicit regularization""]","Several works have shown that the regularization mechanisms underlying deep neural networks' generalization performances are still poorly understood. In this paper, we hypothesize that deep neural networks are regularized through their ability to extract meaningful clusters among the samples of a class. This constitutes an implicit form of regularization, as no explicit training mechanisms or supervision target such behaviour. To support our hypothesis, we design four different measures of intraclass clustering, based on the neuron- and layer-level representations of the training data. 
We then show that these measures constitute accurate predictors of generalization performance across variations of a large set of hyperparameters (learning rate, batch size, optimizer, weight decay, dropout rate, data augmentation, network depth and width).",/pdf/ad12f1d02acbf2f8844ae158fd9d58309060ee34.pdf,ICLR,2021,This paper provides empirical evidence that deep neural networks are implicitly regularized through their ability to extract meaningful clusters among the samples of a class. +ARQAdp7F8OQ,JhlaN9F7QF,1601310000000.0,1614990000000.0,3727,Brain-like approaches to unsupervised learning of hidden representations - a comparative study ,"[""~Naresh_Balaji1"", ""ala@kth.se"", ""paherman@kth.se""]","[""Naresh Balaji"", ""Anders Lansner"", ""Pawel Herman""]","[""neural networks"", ""bio-inspired"", ""brain-like"", ""unsupervised learning"", ""structural plasticity""]","Unsupervised learning of hidden representations has been one of the most vibrant research directions in machine learning in recent years. In this work we study the brain-like Bayesian Confidence Propagating Neural Network (BCPNN) model, recently extended to extract sparse distributed high-dimensional representations. The saliency and separability of the hidden representations when trained on MNIST dataset is studied using an external linear classifier and compared with other unsupervised learning methods that include restricted Boltzmann machines and autoencoders. ",/pdf/c27ecded1ca0d20207838904cf974fb228841ddb.pdf,ICLR,2021,"We compare unsupervised learning algorithms implementing biologically plausible local plasticity rules on MNIST dataset, with special emphasis on the Bayesian Confidence Propagating Neural Network (BCPNN)." +S1xLN3C9YX,BJl8cQ35Fm,1538090000000.0,1550880000000.0,1455,Learnable Embedding Space for Efficient Neural Architecture Compression,"[""caoshengcao@pku.edu.cn"", ""xiaofan2@cs.cmu.edu"", ""kkitani@cs.cmu.edu""]","[""Shengcao Cao"", ""Xiaofang Wang"", ""Kris M. 
Kitani""]","[""Network Compression"", ""Neural Architecture Search"", ""Bayesian Optimization"", ""Architecture Embedding""]","We propose a method to incrementally learn an embedding space over the domain of network architectures, to enable the careful selection of architectures for evaluation during compressed architecture search. Given a teacher network, we search for a compressed network architecture by using Bayesian Optimization (BO) with a kernel function defined over our proposed embedding space to select architectures for evaluation. We demonstrate that our search algorithm can significantly outperform various baseline methods, such as random search and reinforcement learning (Ashok et al., 2018). The compressed architectures found by our method are also better than the state-of-the-art manually-designed compact architecture ShuffleNet (Zhang et al., 2018). We also demonstrate that the learned embedding space can be transferred to new settings for architecture search, such as a larger teacher network or a teacher network in a different architecture family, without any training.",/pdf/b96481e350451fe4d2d428a29a6ea6272af23ab2.pdf,ICLR,2019,"We propose a method to incrementally learn an embedding space over the domain of network architectures, to enable the careful selection of architectures for evaluation during compressed architecture search." +rJgBd2NYPH,rkgwRYDNIH,1569440000000.0,1583910000000.0,40,Learning deep graph matching with channel-independent embedding and Hungarian attention,"[""tianshuy@asu.edu"", ""runzhong.wang@sjtu.edu.cn"", ""yanjunchi@sjtu.edu.cn"", ""baoxin.li@asu.edu""]","[""Tianshu Yu"", ""Runzhong Wang"", ""Junchi Yan"", ""Baoxin Li""]","[""deep graph matching"", ""edge embedding"", ""combinatorial problem"", ""Hungarian loss""]","Graph matching aims to establishing node-wise correspondence between two graphs, which is a classic combinatorial problem and in general NP-complete. 
Until very recently, deep graph matching methods start to resort to deep networks to achieve unprecedented matching accuracy. Along this direction, this paper makes two complementary contributions which can also be reused as plugin in existing works: i) a novel node and edge embedding strategy which stimulates the multi-head strategy in attention models and allows the information in each channel to be merged independently. In contrast, only node embedding is accounted in previous works; ii) a general masking mechanism over the loss function is devised to improve the smoothness of objective learning for graph matching. Using Hungarian algorithm, it dynamically constructs a structured and sparsely connected layer, taking into account the most contributing matching pairs as hard attention. Our approach performs competitively, and can also improve state-of-the-art methods as plugin, regarding with matching accuracy on three public benchmarks.",/pdf/081851b20cafd686bf26e23e9e87d288e7b318fd.pdf,ICLR,2020,"We proposed a deep graph matching method with novel channel-independent embedding and Hungarian loss, which achieved state-of-the-art performance." +L7WD8ZdscQ5,5AcpKg-spIB,1601310000000.0,1615970000000.0,1139,The Role of Momentum Parameters in the Optimal Convergence of Adaptive Polyak's Heavy-ball Methods,"[""~Wei_Tao3"", ""ls15186322349@163.com"", ""gaowei.wu@ia.ac.cn"", ""qing.tao@ia.ac.cn""]","[""Wei Tao"", ""Sheng Long"", ""Gaowei Wu"", ""Qing Tao""]","[""Deep learning"", ""convex optimization"", ""momentum methods"", ""adaptive heavy-ball methods"", ""optimal convergence""]","The adaptive stochastic gradient descent (SGD) with momentum has been widely adopted in deep learning as well as convex optimization. In practice, the last iterate is commonly used as the final solution. However, the available regret analysis and the setting of constant momentum parameters only guarantee the optimal convergence of the averaged solution. 
In this paper, we fill this theory-practice gap by investigating the convergence of the last iterate (referred to as {\it individual convergence}), which is a more difficult task than convergence analysis of the averaged solution. Specifically, in the constrained convex cases, we prove that the adaptive Polyak's Heavy-ball (HB) method, in which the step size is only updated using the exponential moving average strategy, attains an individual convergence rate of $O(\frac{1}{\sqrt{t}})$, as opposed to that of $O(\frac{\log t}{\sqrt {t}})$ of SGD, where $t$ is the number of iterations. Our new analysis not only shows how the HB momentum and its time-varying weight help us to achieve the acceleration in convex optimization but also gives valuable hints how the momentum parameters should be scheduled in deep learning. Empirical results validate the correctness of our convergence analysis in optimizing convex functions and demonstrate the improved performance of the adaptive HB methods in training deep networks.",/pdf/88887f8c02852ef3a9f82b1ae8a010345d221175.pdf,ICLR,2021,A theory-practice gap in convex optimization and deep learning is bridged by giving a novel convergence analysis of the last Iterate of adaptive Heavy-ball methods. +rkecJ6VFvr,SkeHALcBDr,1569440000000.0,1583910000000.0,311,Logic and the 2-Simplicial Transformer,"[""jamesedwardclift@gmail.com"", ""dmitry.doryn@gmail.com"", ""d.murfet@unimelb.edu.au"", ""james.wallbridge@gmail.com""]","[""James Clift"", ""Dmitry Doryn"", ""Daniel Murfet"", ""James Wallbridge""]","[""transformer"", ""logic"", ""reinforcement learning"", ""reasoning""]","We introduce the 2-simplicial Transformer, an extension of the Transformer which includes a form of higher-dimensional attention generalising the dot-product attention, and uses this attention to update entity representations with tensor products of value vectors. 
We show that this architecture is a useful inductive bias for logical reasoning in the context of deep reinforcement learning. +",/pdf/3c77c97bf2a64439c08bd8853682b3a03a8b74ed.pdf,ICLR,2020,We introduce the 2-simplicial Transformer and show that this architecture is a useful inductive bias for logical reasoning in the context of deep reinforcement learning. +Hkxr1nCcFm,HkeT8V9cFQ,1538090000000.0,1545360000000.0,985,An investigation of model-free planning,"[""aguez@google.com"", ""mmirza@google.com"", ""karolg@google.com"", ""rkabra@google.com"", ""sracaniere@google.com"", ""theophane@google.com"", ""draposo@google.com"", ""adamsantoro@google.com"", ""lorseau@google.com"", ""eccles@google.com"", ""gregwayne@google.com"", ""davidsilver@google.com"", ""countzero@google.com""]","[""Arthur Guez"", ""Mehdi Mirza"", ""Karol Gregor"", ""Rishabh Kabra"", ""S\u00e9bastien Racani\u00e8re"", ""Th\u00e9ophane Weber"", ""David Raposo"", ""Adam Santoro"", ""Laurent Orseau"", ""Tom Eccles"", ""Greg Wayne"", ""David Silver"", ""Timothy Lillicrap""]",[],"The field of reinforcement learning (RL) is facing increasingly challenging domains with combinatorial complexity. For an RL agent to address these challenges, it is essential that it can plan effectively. Prior work has typically utilized an explicit model of the environment, combined with a specific planning algorithm (such as tree search). More recently, a new family of methods have been proposed that learn how to plan, by providing the structure for planning via an inductive bias in the function approximator (such as a tree structured neural network), trained end-to-end by a model-free RL algorithm. In this paper, we go even further, and demonstrate empirically that an entirely model-free approach, without special structure beyond standard neural network components such as convolutional networks and LSTMs, can learn to exhibit many of the hallmarks that we would typically associate with a model-based planner. 
We measure our agent's effectiveness at planning in terms of its ability to generalize across a combinatorial and irreversible state space, its data efficiency, and its ability to utilize additional thinking time. We find that our agent has the characteristics that one might expect to find in a planning algorithm. Furthermore, it exceeds the state-of-the-art in challenging combinatorial domains such as Sokoban and outperforms other model-free approaches that utilize strong inductive biases toward planning.",/pdf/92e5d6efde2451dc172d4c2fa698e60a71d06923.pdf,ICLR,2019, +rd_bm8CK7o0,dqNzt9Bdo-Z,1601310000000.0,1614990000000.0,3207,Q-Value Weighted Regression: Reinforcement Learning with Limited Data,"[""~Piotr_Kozakowski3"", ""~Lukasz_Kaiser1"", ""~Henryk_Michalewski1"", ""~Afroz_Mohiuddin1"", ""kkanska@google.com""]","[""Piotr Kozakowski"", ""Lukasz Kaiser"", ""Henryk Michalewski"", ""Afroz Mohiuddin"", ""Katarzyna Ka\u0144ska""]","[""reinforcement learning"", ""rl"", ""offline rl"", ""continuous control"", ""atari"", ""sample efficiency""]","Sample efficiency and performance in the offline setting have emerged as among the main +challenges of deep reinforcement learning. We introduce Q-Value Weighted Regression (QWR), +a simple RL algorithm that excels in these aspects. + +QWR is an extension of Advantage Weighted Regression (AWR), an off-policy actor-critic algorithm +that performs very well on continuous control tasks, also in the offline setting, but struggles +on tasks with discrete actions and in sample efficiency. We perform a theoretical analysis +of AWR that explains its shortcomings and use the insights to motivate QWR theoretically. + +We show experimentally that QWR matches state-of-the-art algorithms both on tasks with +continuous and discrete actions. We study the main hyperparameters of QWR +and find that it is stable in a wide range of their choices and on different tasks. 
+In particular, QWR yields results on par with SAC on the MuJoCo suite and -- with +the same set of hyperparameters -- yields results on par with a highly tuned Rainbow +implementation on a set of Atari games. We also verify that QWR performs well in the +offline RL setting, making it a compelling choice for reinforcement learning in domains +with limited data.",/pdf/2fe86a5b1a4ca14cbc0dd8499758f9f8fe3678d8.pdf,ICLR,2021,"We analyze the sample-efficiency of actor-critic RL algorithms, and introduce a new algorithm, achieving superior sample-efficiency while maintaining competitive final performance on the MuJoCo task suite and on Atari games." +Skn9Shcxe,,1478310000000.0,1489530000000.0,539,Highway and Residual Networks learn Unrolled Iterative Estimation,"[""klaus@idsia.ch"", ""rupesh@idsia.ch"", ""juergen@idsia.ch""]","[""Klaus Greff"", ""Rupesh K. Srivastava"", ""J\u00fcrgen Schmidhuber""]","[""Theory"", ""Deep learning"", ""Supervised Learning""]","The past year saw the introduction of new architectures such as Highway networks and Residual networks which, for the first time, enabled the training of feedforward networks with dozens to hundreds of layers using simple gradient descent. +While depth of representation has been posited as a primary reason for their success, there are indications that these architectures defy a popular view of deep learning as a hierarchical computation of increasingly abstract features at each layer. + +In this report, we argue that this view is incomplete and does not adequately explain several recent findings. +We propose an alternative viewpoint based on unrolled iterative estimation---a group of successive layers iteratively refine their estimates of the same features instead of computing an entirely new representation. +We demonstrate that this viewpoint directly leads to the construction of highway and residual networks. 
+Finally we provide preliminary experiments to discuss the similarities and differences between the two architectures.",/pdf/0bbf16911fcc27702f7d59202487194811132f4a.pdf,ICLR,2017, +ryZERzWCZ,ryme0fZC-,1509140000000.0,1518730000000.0,1045,The Information-Autoencoding Family: A Lagrangian Perspective on Latent Variable Generative Modeling,"[""sjzhao@stanford.edu"", ""tsong@cs.stanford.edu"", ""ermon@cs.stanford.edu""]","[""Shengjia Zhao"", ""Jiaming Song"", ""Stefano Ermon""]","[""Generative Models"", ""Variational Autoencoder"", ""Generative Adversarial Network""]","A variety of learning objectives have been recently proposed for training generative models. We show that many of them, including InfoGAN, ALI/BiGAN, ALICE, CycleGAN, VAE, $\beta$-VAE, adversarial autoencoders, AVB, and InfoVAE, are Lagrangian duals of the same primal optimization problem. This generalization reveals the implicit modeling trade-offs between flexibility and computational requirements being made by these models. Furthermore, we characterize the class of all objectives that can be optimized under certain computational constraints. 
+Finally, we show how this new Lagrangian perspective can explain undesirable behavior of existing methods and provide new principled solutions.",/pdf/27ef4955b028c7a77d5b76170b017d35d32e5569.pdf,ICLR,2018, +WrNjg9tCLUt,46RozSZw43,1601310000000.0,1614990000000.0,3266,BAFFLE: TOWARDS RESOLVING FEDERATED LEARNING’S DILEMMA - THWARTING BACKDOOR AND INFERENCE ATTACKS,"[""ducthien.nguyen@trust.tu-darmstadt.de"", ""phillip.rieger@trust.tu-darmstadt.de"", ""yalame@encrypto.cs.tu-darmstadt.de"", ""moellering@encrypto.cs.tu-darmstadt.de"", ""hossein.fereidooni@trust.tu-darmstadt.de"", ""samuel.marchal@aalto.fi"", ""markus.miettinen@trust.tu-darmstadt.de"", ""~Azalia_Mirhoseini1"", ""ahmad.sadeghi@trust.tu-darmstadt.de"", ""schneider@encrypto.cs.tu-darmstadt.de"", ""shaza.zeitouni@trust.tu-darmstadt.de""]","[""Thien Duc Nguyen"", ""Phillip Rieger"", ""Hossein Yalame"", ""Helen M\u00f6llering"", ""Hossein Fereidooni"", ""Samuel Marchal"", ""Markus Miettinen"", ""Azalia Mirhoseini"", ""Ahmad-Reza Sadeghi"", ""Thomas Schneider"", ""Shaza Zeitouni""]","[""federated learning"", ""secure machine learning"", ""backdoor attacks"", ""inference attacks"", ""data privacy""]","Recently, federated learning (FL) has been subject to both security and privacy attacks posing a dilemmatic challenge on the underlying algorithmic designs: On the one hand, FL is shown to be vulnerable to backdoor attacks that stealthily manipulate the global model output using malicious model updates, and on the other hand, FL is shown vulnerable to inference attacks by a malicious aggregator inferring information about clients’ data from their model updates. Unfortunately, existing defenses against these attacks are insufficient and mitigating both attacks at the same time is highly challenging, because while defeating backdoor attacks requires the analysis of model updates, protection against inference attacks prohibits access to the model updates to avoid information leakage. 
In this work, we introduce BAFFLE, a novel in-depth defense for FL that tackles this challenge. To mitigate backdoor attacks, it applies a multilayered defense by using a Model Filtering layer to detect and reject malicious model updates and a Poison Elimination layer to eliminate any effect of a remaining undetected weak manipulation. To impede inference attacks, we build private BAFFLE that securely evaluates the BAFFLE algorithm under encryption using sophisticated secure computation techniques. We extensively evaluate BAFFLE against state-of-the-art backdoor attacks on several datasets and applications, including image classification, word prediction, and IoT intrusion. We show that BAFFLE can entirely remove backdoors with a negligible effect on accuracy and that private BAFFLE is practical.",/pdf/9a628e0b39be7ca1cceb483ece7a4c21ab42c640.pdf,ICLR,2021,"We introduce BAFFLE, a novel in-depth defense for federated learning that tackles both backdoor and inference attacks." +Skvd-myR-,HJLuWQkAb,1509010000000.0,1518730000000.0,120,Learning Non-Metric Visual Similarity for Image Retrieval,"[""garciadn@aston.ac.uk"", ""g.vogiatzis@aston.ac.uk""]","[""Noa Garcia"", ""George Vogiatzis""]","[""image retrieval"", ""visual similarity"", ""non-metric learning""]","Measuring visual (dis)similarity between two or more instances within a data distribution is a fundamental task in many applications, specially in image retrieval. Theoretically, non-metric distances are able to generate a more complex and accurate similarity model than metric distances, provided that the non-linear data distribution is precisely captured by the similarity model. 
In this work, we analyze a simple approach for deep learning networks to be used as an approximation of non-metric similarity functions and we study how these models generalize across different image retrieval datasets.",/pdf/3f0b53420f77a2ba56ffc6caf689f2ea4220bae8.pdf,ICLR,2018,Similarity network to learn a non-metric visual similarity estimation between a pair of images +H1eqviAqYX,rklh6v_9FX,1538090000000.0,1545360000000.0,290,Why Do Neural Response Generation Models Prefer Universal Replies?,"[""jasonwbw@yahoo.com"", ""nanjiang@buaa.edu.cn"", ""gao_zhifeng@pku.edu.cn"", ""jasonwang0512@gmail.com"", ""lisuke@ss.pku.edu.cn"", ""w.rong@buaa.edu.cn"", ""baoxun.wang@gmail.com""]","[""Bowen Wu"", ""Nan Jiang"", ""Zhifeng Gao"", ""Zongsheng Wang"", ""Suke Li"", ""Wenge Rong"", ""Baoxun Wang""]","[""Neural Response Generation"", ""Universal Replies"", ""Optimization Goal Analysis"", ""Max-Marginal Ranking Regularization""]","Recent advances in neural Sequence-to-Sequence (Seq2Seq) models reveal a purely data-driven approach to the response generation task. Despite its diverse variants and applications, the existing Seq2Seq models are prone to producing short and generic replies, which blocks such neural network architectures from being utilized in practical open-domain response generation tasks. In this research, we analyze this critical issue from the perspective of the optimization goal of models and the specific characteristics of human-to-human conversational corpora. Our analysis is conducted by decomposing the goal of Neural Response Generation (NRG) into the optimizations of word selection and ordering. It can be derived from this decomposition that Seq2Seq based NRG models naturally tend to select common words to compose responses, and ignore the semantics of queries in word ordering. On the basis of the analysis, we propose a max-marginal ranking regularization term to prevent Seq2Seq models from producing generic and uninformative responses. 
The empirical experiments on benchmarks with several metrics have validated our analysis and proposed methodology.",/pdf/35e85684d7d6058129e0028d68054dd120b1d976.pdf,ICLR,2019,Analyze the reason for neural response generative models preferring universal replies; Propose a method to avoid it. +LucJxySuJcE,h9md5uETeT,1601310000000.0,1615080000000.0,215,Protecting DNNs from Theft using an Ensemble of Diverse Models,"[""~Sanjay_Kariyappa1"", ""~Atul_Prakash1"", ""~Moinuddin_K_Qureshi2""]","[""Sanjay Kariyappa"", ""Atul Prakash"", ""Moinuddin K Qureshi""]","[""Model stealing"", ""machine learning security""]","Several recent works have demonstrated highly effective model stealing (MS) attacks on Deep Neural Networks (DNNs) in black-box settings, even when the training data is unavailable. These attacks typically use some form of Out of Distribution (OOD) data to query the target model and use the predictions obtained to train a clone model. Such a clone model learns to approximate the decision boundary of the target model, achieving high accuracy on in-distribution examples. We propose Ensemble of Diverse Models (EDM) to defend against such MS attacks. EDM is made up of models that are trained to produce dissimilar predictions for OOD inputs. By using a different member of the ensemble to service different queries, our defense produces predictions that are highly discontinuous in the input space for the adversary's OOD queries. Such discontinuities cause the clone model trained on these predictions to have poor generalization on in-distribution examples. Our evaluations on several image classification tasks demonstrate that EDM defense can severely degrade the accuracy of clone models (up to $39.7\%$). 
Our defense has minimal impact on the target accuracy, negligible computational costs during inference, and is compatible with existing defenses for MS attacks.",/pdf/fb157da741cd6b15681066ac173712bdecd45316.pdf,ICLR,2021,Discontinuous predictions produced by an ensemble of diverse models can be used to create an effective defense against model stealing attacks. +avBunqDXFS,teiTCvBDfrr,1601310000000.0,1614990000000.0,2073,Memory-Efficient Semi-Supervised Continual Learning: The World is its Own Replay Buffer,"[""~James_Smith1"", ""~Jonathan_C_Balloch1"", ""~Yen-Chang_Hsu1"", ""~Zsolt_Kira1""]","[""James Smith"", ""Jonathan C Balloch"", ""Yen-Chang Hsu"", ""Zsolt Kira""]","[""continual learning"", ""semi-supervised learning""]","Rehearsal is a critical component for class-incremental continual learning, yet it requires a substantial memory budget. Our work investigates whether we can significantly reduce this memory budget by leveraging unlabeled data from an agent's environment in a realistic and challenging continual learning paradigm. Specifically, we explore and formalize a novel semi-supervised continual learning (SSCL) setting, where labeled data is scarce yet non-i.i.d. unlabeled data from the agent's environment is plentiful. Importantly, data distributions in the SSCL setting are realistic and therefore reflect object class correlations between, and among, the labeled and unlabeled data distributions. We show that a strategy built on pseudo-labeling, consistency regularization, Out-of-Distribution (OoD) detection, and knowledge distillation reduces forgetting in this setting. Our approach, DistillMatch, increases performance over the state-of-the-art by no less than 8.7% average task accuracy and up to a 54.5% increase in average task accuracy in SSCL CIFAR-100 experiments. Moreover, we demonstrate that DistillMatch can save up to 0.23 stored images per processed unlabeled image compared to the next best method which only saves 0.08. 
Our results suggest that focusing on realistic correlated distributions is a significantly new perspective, which accentuates the importance of leveraging the world's structure as a continual learning strategy.",/pdf/f7cb935ef838d9afbda9e597e2f199d03c2f23d0.pdf,ICLR,2021,"We propose the realistic semi-supervised continual learning (SSCL) setting and show that a strategy built on pseudo-labeling, consistency regularization, Out-of-Distribution detection, and knowledge distillation reduces forgetting in this setting." +pqZV_srUVmK,Xbgx3IN-NLh,1601310000000.0,1612740000000.0,1194,Single-Timescale Actor-Critic Provably Finds Globally Optimal Policy,"[""~Zuyue_Fu1"", ""~Zhuoran_Yang1"", ""~Zhaoran_Wang1""]","[""Zuyue Fu"", ""Zhuoran Yang"", ""Zhaoran Wang""]",[],"We study the global convergence and global optimality of actor-critic, one of the most popular families of reinforcement learning algorithms. While most existing works on actor-critic employ bi-level or two-timescale updates, we focus on the more practical single-timescale setting, where the actor and critic are updated simultaneously. Specifically, in each iteration, the critic update is obtained by applying the Bellman evaluation operator only once while the actor is updated in the policy gradient direction computed using the critic. Moreover, we consider two function approximation settings where both the actor and critic are represented by linear or deep neural networks. For both cases, we prove that the actor sequence converges to a globally optimal policy at a sublinear $O(K^{-1/2})$ rate, where $K$ is the number of iterations. To the best of our knowledge, we establish the rate of convergence and global optimality of single-timescale actor-critic with linear function approximation for the first time. 
Moreover, under the broader scope of policy optimization with nonlinear function approximation, we prove that actor-critic with deep neural network finds the globally optimal policy at a sublinear rate for the first time. ",/pdf/23d733db99e61b484d06180f5dfbb470789b7ff0.pdf,ICLR,2021, +BkQqq0gRb,ByM9qReCW,1509120000000.0,1526830000000.0,463,Variational Continual Learning,"[""vcn22@cam.ac.uk"", ""yl494@cam.ac.uk"", ""tdb40@cam.ac.uk"", ""ret26@cam.ac.uk""]","[""Cuong V. Nguyen"", ""Yingzhen Li"", ""Thang D. Bui"", ""Richard E. Turner""]","[""continual learning"", ""online variational inference""]","This paper develops variational continual learning (VCL), a simple but general framework for continual learning that fuses online variational inference (VI) and recent advances in Monte Carlo VI for neural networks. The framework can successfully train both deep discriminative models and deep generative models in complex continual learning settings where existing tasks evolve over time and entirely new tasks emerge. Experimental results show that VCL outperforms state-of-the-art continual learning methods on a variety of tasks, avoiding catastrophic forgetting in a fully automatic way.",/pdf/337aa9dfb6d74edadccfaf45f22d58771ece25f6.pdf,ICLR,2018,This paper develops a principled method for continual learning in deep models. +JoCR4h9O3Ew,6DLW4JPpt5R,1601310000000.0,1616070000000.0,745,ARMOURED: Adversarially Robust MOdels using Unlabeled data by REgularizing Diversity,"[""~Kangkang_Lu1"", ""~Cuong_Manh_Nguyen1"", ""~Xun_Xu1"", ""~Kiran_Chari1"", ""~Yu_Jing_Goh1"", ""~Chuan-Sheng_Foo1""]","[""Kangkang Lu"", ""Cuong Manh Nguyen"", ""Xun Xu"", ""Kiran Chari"", ""Yu Jing Goh"", ""Chuan-Sheng Foo""]","[""Adversarial Robustness"", ""Semi-supervised Learning"", ""Multi-view Learning"", ""Diversity Regularization"", ""Entropy Maximization""]","Adversarial attacks pose a major challenge for modern deep neural networks. 
Recent advancements show that adversarially robust generalization requires a large amount of labeled data for training. If annotation becomes a burden, can unlabeled data help bridge the gap? In this paper, we propose ARMOURED, an adversarially robust training method based on semi-supervised learning that consists of two components. The first component applies multi-view learning to simultaneously optimize multiple independent networks and utilizes unlabeled data to enforce labeling consistency. The second component reduces adversarial transferability among the networks via diversity regularizers inspired by determinantal point processes and entropy maximization. Experimental results show that under small perturbation budgets, ARMOURED is robust against strong adaptive adversaries. Notably, ARMOURED does not rely on generating adversarial samples during training. When used in combination with adversarial training, ARMOURED yields competitive performance with the state-of-the-art adversarially-robust benchmarks on SVHN and outperforms them on CIFAR-10, while offering higher clean accuracy.",/pdf/e402c5559bfaf850b8b9316d0ada923c0a35e9b6.pdf,ICLR,2021,ARMOURED is a novel technique for adversarially robust learning that elegantly unifies semi-supervised learning and diversity regularization through a multi-view learning framework. +OItp-Avs6Iy,9zXxjheGl0,1601310000000.0,1614990000000.0,2353,Concentric Spherical GNN for 3D Representation Learning,"[""~James_S_Fox1"", ""bzhao68@gatech.edu"", ""srajama@sandia.gov"", ""rampi.ramprasad@mse.gatech"", ""~Le_Song1""]","[""James S Fox"", ""Bo Zhao"", ""Sivasankaran Rajamanickam"", ""Rampi Ramprasad"", ""Le Song""]","[""spherical cnn"", ""GNN"", ""graph convolution"", ""rotation equivariance"", ""3D""]","Learning 3D representations that generalize well to arbitrarily oriented inputs is a challenge of practical importance in applications varying from computer vision to physics and chemistry. 
+We propose a novel multi-resolution convolutional architecture for learning over concentric spherical feature maps, of which the single sphere representation is a special case. +Our hierarchical architecture is based on alternatively learning to incorporate both intra-sphere and inter-sphere information. +We show the applicability of our method for two different types of 3D inputs, mesh objects, which can be regularly sampled, and point clouds, which are irregularly distributed. +We also propose an efficient mapping of point clouds to concentric spherical images using radial basis functions, thereby bridging spherical convolutions on grids with general point clouds. +We demonstrate the effectiveness of our approach in achieving state-of-the-art performance on 3D classification tasks with rotated data.",/pdf/784bb25ea4d6920d38a0657912ff99d6abf9a466.pdf,ICLR,2021,We propose a spherical GNN based on concentric spheres representation for 3D representation learning. +cbtV7xGO9pS,ICQvNHfDHGJ,1601310000000.0,1614990000000.0,770,TEAC: Intergrating Trust Region and Max Entropy Actor Critic for Continuous Control,"[""~Hongyu_Zang1"", ""~Xin_Li31"", ""~Li_Zhang18"", ""~Peiyao_Zhao1"", ""~Mingzhong_Wang1""]","[""Hongyu Zang"", ""Xin Li"", ""Li Zhang"", ""Peiyao Zhao"", ""Mingzhong Wang""]","[""Reinforcement Learning"", ""Trust region methods"", ""Maximum Entropy Reinforcement Learning"", ""Deep Reinforcement Learning""]","Trust region methods and maximum entropy methods are two state-of-the-art branches used in reinforcement learning (RL) for the benefits of stability and exploration in continuous environments, respectively. This paper proposes to integrate both branches in a unified framework, thus benefiting from both sides. We first transform the original RL objective to a constraint optimization problem and then proposes trust entropy actor-critic (TEAC), an off-policy algorithm to learn stable and sufficiently explored policies for continuous states and actions. 
TEAC trains the critic by minimizing the refined Bellman error and updates the actor by minimizing KL-divergence loss derived from the closed-form solution to the Lagrangian. +We prove that the policy evaluation and policy improvement in TEAC are guaranteed to converge. +We compare TEAC with 4 state-of-the-art solutions on 6 tasks in the MuJoCo environment. The results show that TEAC outperforms state-of-the-art solutions in terms of efficiency and effectiveness. +",/pdf/0a33e759eb91142035576988ff487e6462d5e20d.pdf,ICLR,2021,We propose a novel off-policy trust entropy actor critic method to learn stable and sufficiently explored policies for continuous states and actions. +r1gfweBFPB,HylbNoeFvH,1569440000000.0,1577170000000.0,2352,Learning by shaking: Computing policy gradients by physical forward-propagation,"[""amehrjou@tuebingen.mpg.de"", ""soli.ashkan98@gmail.com"", ""stefan.bauer@tuebingen.mpg.de"", ""bs@tuebingen.mpg.de""]","[""Arash Mehrjou"", ""Ashkan Soleymani"", ""Stefan Bauer"", ""Bernhard Sch\u00f6lkopf""]","[""Reinforcement Learning"", ""Control Theory""]","Model-free and model-based reinforcement learning are two ends of a spectrum. Learning a good policy without a dynamic model can be prohibitively expensive. Learning the dynamic model of a system can reduce the cost of learning the policy, but it can also introduce bias if it is not accurate. We propose a middle ground where instead of the transition model, the sensitivity of the trajectories with respect to the perturbation (shaking) of the parameters is learned. This allows us to predict the local behavior of the physical system around a set of nominal policies without knowing the actual model. We assay our method on a custom-built physical robot in extensive experiments and show the feasibility of the approach in practice. 
We investigate potential challenges when applying our method to physical systems and propose solutions to each of them.",/pdf/3265027a0c22c75ee4fae9b9b7a0a25dfd906246.pdf,ICLR,2020,We propose a method to learn the effect of changing the parameters of the policy on the produced trajectories directly from the physical system. +rksfwnFxl,,1478240000000.0,1484810000000.0,135,LSTM-Based System-Call Language Modeling and Ensemble Method for Host-Based Intrusion Detection,"[""kgwmath@snu.ac.kr"", ""hyyi@snu.ac.kr"", ""ubuntu@snu.ac.kr"", ""ypaek@snu.ac.kr"", ""sryoon@snu.ac.kr""]","[""Gyuwan Kim"", ""Hayoon Yi"", ""Jangho Lee"", ""Yunheung Paek"", ""Sungroh Yoon""]",[],"In computer security, designing a robust intrusion detection system is one of the most fundamental and important problems. In this paper, we propose a system-call language-modeling approach for designing anomaly-based host intrusion detection systems. To remedy the issue of high false-alarm rates commonly arising in conventional methods, we employ a novel ensemble method that blends multiple thresholding classifiers into a single one, making it possible to accumulate `highly normal' sequences. The proposed system-call language model has various advantages leveraged by the fact that it can learn the semantic meaning and interactions of each system call that existing methods cannot effectively consider. Through diverse experiments on public benchmark datasets, we demonstrate the validity and effectiveness of the proposed method. 
Moreover, we show that our model possesses high portability, which is one of the key aspects of realizing successful intrusion detection systems.",/pdf/fe56e36cd82b26a03c7f34c4530924ff1a0c36f8.pdf,ICLR,2017, +BJlVeyHFwH,BygusXs_Dr,1569440000000.0,1577170000000.0,1503,On the Invertibility of Invertible Neural Networks,"[""jensb@uni-bremen.de"", ""pvicol@cs.toronto.edu"", ""wangkua1@cs.toronto.edu"", ""rgrosse@cs.toronto.edu"", ""j.jacobsen@vectorinstitute.ai""]","[""Jens Behrmann"", ""Paul Vicol"", ""Kuan-Chieh Wang"", ""Roger B. Grosse"", ""J\u00f6rn-Henrik Jacobsen""]","[""Invertible Neural Networks"", ""Stability"", ""Normalizing Flows"", ""Generative Models"", ""Evaluation of Generative Models""]","Guarantees in deep learning are hard to achieve due to the interplay of flexible modeling schemes and complex tasks. Invertible neural networks (INNs), however, provide several mathematical guarantees by design, such as the ability to approximate non-linear diffeomorphisms. One less studied advantage of INNs is that they enable the design of bi-Lipschitz functions. This property has been used implicitly by various works to design generative models, memory-saving gradient computation, regularize classifiers, and solve inverse problems. +In this work, we study Lipschitz constants of invertible architectures in order to investigate guarantees on stability of their inverse and forward mapping. Our analysis reveals that commonly-used INN building blocks can easily become non-invertible, leading to questionable ``exact'' log likelihood computations and training difficulties. We introduce a set of numerical analysis tools to diagnose non-invertibility in practice. Finally, based on our theoretical analysis, we show how to guarantee numerical invertibility for one of the most common INN architectures.",/pdf/eb0edbe55cb1f82e10b84eb06541e4f32df3d7ee.pdf,ICLR,2020,"Little known fact: Invertible Neural Networks can be non-invertible; we show why, when and how to fix it." 
+B1eX_a4twH,ByxjO1hwvS,1569440000000.0,1577170000000.0,627,Superseding Model Scaling by Penalizing Dead Units and Points with Separation Constraints,"[""blauigris@gmail.com"", ""camilorey@gmail.com"", ""epuertas@ub.edu"", ""oriol_pujol@ub.edu""]","[""Carles Riera"", ""Camilo Rey-Torres"", ""Eloi Puertas"", ""Oriol Pujol""]","[""Dead Point"", ""Dead Unit"", ""Model Scaling"", ""Separation Constraints"", ""Dying ReLU"", ""Constant Width"", ""Deep Neural Networks"", ""Backpropagation""]","In this article, we study a proposal that enables training extremely thin (4 or 8 neurons per layer) and relatively deep (more than 100 layers) feedforward networks without resorting to any architectural modification such as Residual or Dense connections, data normalization or model scaling. We accomplish that by alleviating two problems. One is neurons whose output is zero for the entire dataset, which renders them useless. This problem is known to the academic community as \emph{dead neurons}. The other is a less studied problem, dead points. Dead points refer to data points that are mapped to zero during the forward pass of the network. As such, the gradient generated by those points is not propagated back past the layer where they die, thus having no effect in the training process. In this work, we characterize both problems and propose a constraint formulation that, added to the standard loss function, solves them both. As an additional benefit, the proposed method allows the network weights to be initialized with constant or even zero values while still allowing the network to converge to reasonable results. We show very promising results on toy, MNIST, and CIFAR-10 datasets.",/pdf/f69c928c53b00bf01472c1cad86046a7928f1d24.pdf,ICLR,2020,We propose using a set of constraints to penalize dead neurons and points in order to train very deep networks of constant width. 
+HygS91rYvH,Hkx0Eq0uwr,1569440000000.0,1577170000000.0,1877,Universal Adversarial Attack Using Very Few Test Examples,"[""amitdesh@microsoft.com"", ""ksandeshk@cmi.ac.in"", ""kv@cmi.ac.in""]","[""Amit Deshpande"", ""Sandesh Kamath"", ""K V Subrahmanyam""]","[""universal"", ""adversarial"", ""SVD""]","Adversarial attacks such as Gradient-based attacks, Fast Gradient Sign Method (FGSM) by Goodfellow et al.(2015) and DeepFool by Moosavi-Dezfooli et al. (2016) are input-dependent, small pixel-wise perturbations of images which fool state of the art neural networks into misclassifying images but are unlikely to fool any human. On the other hand a universal adversarial attack is an input-agnostic perturbation. The same perturbation is applied to all inputs and yet the neural network is fooled on a large fraction of the inputs. In this paper, we show that multiple known input-dependent pixel-wise perturbations share a common spectral property. Using this spectral property, we show that the top singular vector of input-dependent adversarial attack directions can be used as a very simple universal adversarial attack on neural networks. We evaluate the error rates and fooling rates of three universal attacks, SVD-Gradient, SVD-DeepFool and SVD-FGSM, on state of the art neural networks. We show that these universal attack vectors can be computed using a small sample of test inputs. We establish our results both theoretically and empirically. On VGG19 and VGG16, the fooling rate of SVD-DeepFool and SVD-Gradient perturbations constructed from observing less than 0.2% of the validation set of ImageNet is as good as the universal attack of Moosavi-Dezfooli et al. (2017a). To prove our theoretical results, we use matrix concentration inequalities and spectral perturbation bounds. 
For completeness, we also discuss another recent approach to universal adversarial perturbations based on (p, q)-singular vectors, proposed independently by Khrulkov & Oseledets (2018), and point out the simplicity and efficiency of our universal attack as the key difference.",/pdf/8e32d12377f555ac660a03167fc32b376ced6ed8.pdf,ICLR,2020, +rkE3y85ee,,1478280000000.0,1487870000000.0,281,Categorical Reparameterization with Gumbel-Softmax,"[""ejang@google.com"", ""sg717@cam.ac.uk"", ""poole@cs.stanford.edu""]","[""Eric Jang"", ""Shixiang Gu"", ""Ben Poole""]","[""Deep learning"", ""Semi-Supervised Learning"", ""Optimization"", ""Structured prediction""]","Categorical variables are a natural choice for representing discrete structure in the world. However, stochastic neural networks rarely use categorical latent variables due to the inability to backpropagate through samples. In this work, we present an efficient gradient estimator that replaces the non-differentiable sample from a categorical distribution with a differentiable sample from a novel Gumbel-Softmax distribution. This distribution has the essential property that it can be smoothly annealed into a categorical distribution. We show that our Gumbel-Softmax estimator outperforms state-of-the-art gradient estimators on structured output prediction and unsupervised generative modeling tasks with categorical latent variables, and enables large speedups on semi-supervised classification.",/pdf/14d989bed02fa62e71fbb485fcec6e26d6c91ccf.pdf,ICLR,2017,"Simple, differentiable sampling mechanism for categorical variables that can be trained in neural nets via standard backprop." 
+BkxXe0Etwr,BkxVXE7_PS,1569440000000.0,1583910000000.0,923,CAQL: Continuous Action Q-Learning,"[""mkryu@google.com"", ""yinlamchow@google.com"", ""rander@google.com"", ""ctjandra@google.com"", ""cboutilier@google.com""]","[""Moonkyung Ryu"", ""Yinlam Chow"", ""Ross Anderson"", ""Christian Tjandraatmadja"", ""Craig Boutilier""]","[""Reinforcement learning (RL)"", ""DQN"", ""Continuous control"", ""Mixed-Integer Programming (MIP)""]","Reinforcement learning (RL) with value-based methods (e.g., Q-learning) has shown success in a variety of domains such as +games and recommender systems (RSs). When the action space is finite, these algorithms implicitly finds a policy by learning the optimal value function, which are often very efficient. +However, one major challenge of extending Q-learning to tackle continuous-action RL problems is that obtaining optimal Bellman backup requires solving a continuous action-maximization (max-Q) problem. While it is common to restrict the parameterization of the Q-function to be concave in actions to simplify the max-Q problem, such a restriction might lead to performance degradation. Alternatively, when the Q-function is parameterized with a generic feed-forward neural network (NN), the max-Q problem can be NP-hard. In this work, we propose the CAQL method which minimizes the Bellman residual using Q-learning with one of several plug-and-play action optimizers. In particular, leveraging the strides of optimization theories in deep NN, we show that max-Q problem can be solved optimally with mixed-integer programming (MIP)---when the Q-function has sufficient representation power, this MIP-based optimization induces better policies and is more robust than counterparts, e.g., CEM or GA, that approximate the max-Q solution. To speed up training of CAQL, we develop three techniques, namely (i) dynamic tolerance, (ii) dual filtering, and (iii) clustering. 
+To speed up inference of CAQL, we introduce the action function that concurrently learns the optimal policy. +To demonstrate the efficiency of CAQL we compare it with state-of-the-art RL algorithms on benchmark continuous control problems that have different degrees of action constraints and show that CAQL significantly outperforms policy-based methods in heavily constrained environments.",/pdf/49cfb54fa89345e85f991893706d894f0d953510.pdf,ICLR,2020,A general framework of value-based reinforcement learning for continuous control +ryen_CEFwr,Sye6WxuuvS,1569440000000.0,1577170000000.0,1226,"Unsupervised Disentanglement of Pose, Appearance and Background from Images and Videos","[""aysegul.dundar89@gmail.com"", ""kjshih2@illinois.edu"", ""garg@cs.stanford.edu"", ""rpottorff@gmail.com"", ""atao@nvidia.com"", ""bcatanzaro@nvidia.com""]","[""Aysegul Dundar"", ""Kevin J Shih"", ""Animesh Garg"", ""Robert Pottorf"", ""Andrew Tao"", ""Bryan Catanzaro""]","[""unsupervised landmark discovery""]","Unsupervised landmark learning is the task of learning semantic keypoint-like +representations without the use of expensive keypoint-level annotations. A popular approach is to factorize an image into a pose and appearance data stream, then to reconstruct the image from the factorized components. The pose representation should capture a set of consistent and tightly localized landmarks in order to facilitate reconstruction of the input image. Ultimately, we wish for our learned landmarks to focus on the foreground object of interest. However, the reconstruction task of the entire image forces the model to allocate landmarks to model the background. This work explores the effects of factorizing the reconstruction task into separate foreground and background reconstructions, conditioning only the foreground reconstruction on the unsupervised landmarks. Our experiments demonstrate that the proposed factorization results in landmarks that are focused on the foreground object of interest. 
Furthermore, the rendered background quality is also improved, as the background rendering pipeline no longer requires the ill-suited landmarks to model its pose and appearance. We demonstrate this improvement in the context of video prediction.",/pdf/a857ab1f20de366e275f0e5cbf8036ed6470e4de.pdf,ICLR,2020, +SksY3deAW,HyoY2_xAW,1509100000000.0,1518730000000.0,320,Learning Deep ResNet Blocks Sequentially using Boosting Theory,"[""furongh@cs.umd.edu"", ""jordantash@gmail.com"", ""jcl@microsoft.com"", ""schapire@microsoft.com""]","[""Furong Huang"", ""Jordan T. Ash"", ""John Langford"", ""Robert E. Schapire""]","[""residual network"", ""boosting theory"", ""training error guarantee""]","We prove a multiclass boosting theory for the ResNet architectures which simultaneously creates a new technique for multiclass boosting and provides a new algorithm for ResNet-style architectures. Our proposed training algorithm, BoostResNet, is particularly suitable in non-differentiable architectures. Our method only requires the relatively inexpensive sequential training of T ""shallow ResNets"". We prove that the training error decays exponentially with the depth T if the weak module classifiers that we train perform slightly better than some weak baseline. In other words, we propose a weak learning condition and prove a boosting theory for ResNet under the weak learning condition. A generalization error bound based on margin theory is proved and suggests that ResNet could be resistant to overfitting using a network with l_1 norm bounded weights.",/pdf/0a6dc2f2538631a7cba056687a3c4b1e5b157cbc.pdf,ICLR,2018,We prove a multiclass boosting theory for the ResNet architectures which simultaneously creates a new technique for multiclass boosting and provides a new algorithm for ResNet-style architectures. 
+rJlg1n05YX,Sklkvtp5Km,1538090000000.0,1545360000000.0,957,Penetrating the Fog: the Path to Efficient CNN Models,"[""kun@cs.ucsb.edu"", ""boyuan@cs.ucsb.edu"", ""shuyang1995@ucsb.edu"", ""yufeiding@cs.ucsb.edu""]","[""Kun Wan"", ""Boyuan Feng"", ""Shu Yang"", ""Yufei Ding""]","[""Efficient CNN models"", ""Computer Vision""]","With the increasing demand to deploy convolutional neural networks (CNNs) on mobile platforms, the sparse kernel approach was proposed, which could save more parameters than the standard convolution while maintaining accuracy. However, despite the great potential, no prior research has pointed out how to craft an sparse kernel design with such potential (i.e., effective design), and all prior works just adopt simple combinations of existing sparse kernels such as group convolution. Meanwhile due to the large design space it is also impossible to try all combinations of existing sparse kernels. In this paper, we are the first in the field to consider how to craft an effective sparse kernel design by eliminating the large design space. Specifically, we present a sparse kernel scheme to illustrate how to reduce the space from three aspects. First, in terms of composition we remove designs composed of repeated layers. Second, to remove designs with large accuracy degradation, we find an unified property named~\emph{information field} behind various sparse kernel designs, which could directly indicate the final accuracy. Last, we remove designs in two cases where a better parameter efficiency could be achieved. Additionally, we provide detailed efficiency analysis on the final 4 designs in our scheme. 
Experimental results validate the idea of our scheme by showing that our scheme is able to find designs which are more efficient in using parameters and computation with similar or higher accuracy.",/pdf/09dd80df69fc22779a3fa0bb14dedb8f31fa2232.pdf,ICLR,2019,"We are the first in the field to show how to craft an effective sparse kernel design from three aspects: composition, performance and efficiency." +TuK6agbdt27,2Bq9qUP_WUk,1601310000000.0,1615890000000.0,3467,Learning Associative Inference Using Fast Weight Memory,"[""~Imanol_Schlag3"", ""~Tsendsuren_Munkhdalai1"", ""~J\u00fcrgen_Schmidhuber1""]","[""Imanol Schlag"", ""Tsendsuren Munkhdalai"", ""J\u00fcrgen Schmidhuber""]","[""memory-augmented neural networks"", ""tensor product"", ""fast weights""]","Humans can quickly associate stimuli to solve problems in novel contexts. Our novel neural network model learns state representations of facts that can be composed to perform such associative inference. To this end, we augment the LSTM model with an associative memory, dubbed \textit{Fast Weight Memory} (FWM). Through differentiable operations at every step of a given input sequence, the LSTM \textit{updates and maintains} compositional associations stored in the rapidly changing FWM weights. 
Our model is trained end-to-end by gradient descent and yields excellent performance on compositional language reasoning problems, meta-reinforcement-learning for POMDPs, and small-scale word-level language modelling.",/pdf/96ccad214bb6dc5b347aa32436f14fdd5391d21b.pdf,ICLR,2021,We present a Recurrent Neural Network model which is augmented with an associative memory to generalise in a more systematically +SkMx_iC9K7,HkgMU1u9K7,1538090000000.0,1545360000000.0,323,DelibGAN: Coarse-to-Fine Text Generation via Adversarial Network,"[""wangke17@pku.edu.cn"", ""wanxiaojun@pku.edu.cn""]","[""Ke Wang"", ""Xiaojun Wan""]","[""unsupervised text generation"", ""coarse-to-fine generator"", ""multiple instance discriminator"", ""GAN"", ""DelibGAN""]","In this paper, we propose a novel adversarial learning framework, namely DelibGAN, for generating high-quality sentences without supervision. Our framework consists of a coarse-to-fine generator, which contains a first-pass decoder and a second-pass decoder, and a multiple instance discriminator. And we propose two training mechanisms DelibGAN-I and DelibGAN-II. The discriminator is used to fine-tune the second-pass decoder in DelibGAN-I and further evaluate the importance of each word and tune the first-pass decoder in DelibGAN-II. We compare our models with several typical and state-of-the-art unsupervised generic text generation models on three datasets (a synthetic dataset, a descriptive text dataset and a sentimental text dataset). Both qualitative and quantitative experimental results show that our models produce more realistic samples, and DelibGAN-II performs best.",/pdf/b5b08afed1fa2473c2a8db2feedd59ae47035382.pdf,ICLR,2019,"A novel adversarial learning framework, namely DelibGAN, is proposed for generating high-quality sentences without supervision." 
+kB8DkEKSDH,LWb7nda5Gqh,1601310000000.0,1614990000000.0,3784,Hellinger Distance Constrained Regression,"[""~Egor_Rotinov1""]","[""Egor Rotinov""]","[""offline"", ""Reinforcement Learning"", ""off-policy"", ""control""]","This paper introduces an off-policy reinforcement learning method that uses Hellinger distance between sampling policy (from what samples were collected) and current policy (policy being optimized) as a constraint. +Hellinger distance squared multiplied by two is greater than or equal to total variation distance squared and less than or equal to Kullback-Leibler divergence, therefore a lower bound for expected discounted return for the new policy is improved compared to the lower bound for training with KL. +Also, Hellinger distance is less than or equal to 1, so there is a policy-independent lower bound for expected discounted return. +HDCR is capable of training with Experience Replay, a common setting for distributed RL when collecting trajectories using different policies and learning from this data centralized. +HDCR shows results comparable to or better than Advantage-weighted Behavior Model and Advantage-Weighted Regression on MuJoCo tasks using tiny offline datasets collected by random agents. On bigger datasets (100k timesteps) obtained by pretrained behavioral policy, HDCR outperforms ABM and AWR methods on 3 out of 4 tasks. ",/pdf/b2424686d8d6aba3981f12c8c0777b99db8fc501.pdf,ICLR,2021,This paper presents an offline reinforcement learning method based on the Hellinger distance constraint. +SkxoqRNKwr,rJxXgRudwr,1569440000000.0,1577170000000.0,1295,Adversarial Privacy Preservation under Attribute Inference Attack,"[""han.zhao@cs.cmu.edu"", ""jc6ub@virginia.edu"", ""yuant@virginia.edu"", ""geoff.gordon@microsoft.com""]","[""Han Zhao"", ""Jianfeng Chi"", ""Yuan Tian"", ""Geoffrey J. Gordon""]",[],"With the prevalence of machine learning services, crowdsourced data containing sensitive information poses substantial privacy challenges. 
Existing work focusing on protecting against membership inference attacks under the rigorous framework of differential privacy are vulnerable to attribute inference attacks. In light of the current gap between theory and practice, we develop a novel theoretical framework for privacy-preservation under the attack of attribute inference. Under our framework, we propose a minimax optimization formulation to protect the given attribute and analyze its privacy guarantees against arbitrary adversaries. On the other hand, it is clear that privacy constraint may cripple utility when the protected attribute is correlated with the target variable. To this end, we also prove an information-theoretic lower bound to precisely characterize the fundamental trade-off between utility and privacy. Empirically, we extensively conduct experiments to corroborate our privacy guarantee and validate the inherent trade-offs in different privacy preservation algorithms. Our experimental results indicate that the adversarial representation learning approaches achieve the best trade-off in terms of privacy preservation and utility maximization.",/pdf/b31145bfcb2e3b9d874cc9d86395c79058655a0f.pdf,ICLR,2020, +PI_CwQparl_,cFBGV-0eznW,1601310000000.0,1614990000000.0,3403,Image Modeling with Deep Convolutional Gaussian Mixture Models,"[""~Alexander_Gepperth1"", ""~Benedikt_Pf\u00fclb1""]","[""Alexander Gepperth"", ""Benedikt Pf\u00fclb""]","[""Gaussian Mixture Model"", ""Deep Learning"", ""Unsupervised Representation Learning"", ""Sampling""]","In this conceptual work, we present DCGMM, a deep hierarchical Gaussian Mixture Model (GMM) that is particularly suited for describing and generating images. +Vanilla (i.e., ""flat"") GMMs require a very large number of components to well describe images, leading to long training times and memory issues. +DCGMMs avoid this by a stacked architecture of multiple GMM layers, linked by convolution and pooling operations. 
+This allows to exploit the compositionality of images in a similar way as deep CNNs do. +This sets them apart from vanilla GMMs which are trained by EM, requiring a prior k-means initialization which is infeasible in a layered structure. +For generating sharp images with DCGMM, we introduce a new gradient-based technique for sampling through non-invertible operations like convolution and pooling. +Based on the MNIST and FashionMNIST datasets, we validate the DCGMM model by demonstrating its superiority over ""flat"" GMMs for clustering, sampling and outlier detection. +We additionally demonstrate the applicability of DCGMM to variant generation, in-painting and class-conditional sampling. ",/pdf/d3265d489a19bf176b7d84e78e5d2aef32000838.pdf,ICLR,2021,"We present a deep Gaussian Mixture Model, leveraging typical CNN concepts like convolutions and pooling for describing images at a manageable computational cost." +Hye6uoC9tm,HJgNi45qt7,1538090000000.0,1545360000000.0,394,Incremental Hierarchical Reinforcement Learning with Multitask LMDPs,"[""adamchristopherearle@gmail.com"", ""andrew.saxe@psy.ox.ac.uk"", ""benjros@gmail.com""]","[""Adam C Earle"", ""Andrew M Saxe"", ""Benjamin Rosman""]","[""Reinforcement learning"", ""hierarchy"", ""linear markov decision process"", ""lmdl"", ""subtask discovery"", ""incremental""]","Exploration is a well known challenge in Reinforcement Learning. One principled way of overcoming this challenge is to find a hierarchical abstraction of the base problem and explore at these higher levels, rather than in the space of primitives. However, discovering a deep abstraction autonomously remains a largely unsolved problem, with practitioners typically hand-crafting these hierarchical control architectures. Recent work with multitask linear Markov decision processes, allows for the autonomous discovery of deep hierarchical abstractions, but operates exclusively in the offline setting. 
By extending this work, we develop an agent that is capable of incrementally growing a hierarchical representation, and using its experience to date to improve exploration.",/pdf/9705b6ec7657158f7d433b7b6df194ba813c0028.pdf,ICLR,2019,"We develop an agent capable of incrementally growing a hierarchical representation, and using its experience to date to improve exploration." +j39sWOYhfEg,hvrnaqCoPu,1601310000000.0,1614990000000.0,1256,The shape and simplicity biases of adversarially robust ImageNet-trained CNNs,"[""peijiechenauburn@gmail.com"", ""~Chirag_Agarwal1"", ""~Anh_Nguyen1""]","[""Peijie Chen"", ""Chirag Agarwal"", ""Anh Nguyen""]","[""shape bias"", ""texture bias"", ""interpretability"", ""smoothness"", ""visualization""]","Adversarial training has been the topic of dozens of studies and a leading method for defending against adversarial attacks. +Yet, it remains largely unknown (a) how adversarially-robust ImageNet classifiers (R classifiers) generalize to out-of-distribution examples; and (b) how their generalization capability relates to their hidden representations. In this paper, we perform a thorough, systematic study to answer these two questions across AlexNet, GoogLeNet, and ResNet-50 architectures. We found that while standard ImageNet classifiers have a strong texture bias, their R counterparts rely heavily on shapes. Remarkably, adversarial training induces three simplicity biases into hidden neurons in the process of ""robustifying"" networks. That is, each convolutional neuron in R networks often changes to detecting (1) pixel-wise smoother patterns i.e. a mechanism that blocks high-frequency noise from passing through the network; (2) more lower-level features i.e. textures and colors (instead of objects); and (3) fewer types of inputs. Our findings reveal the interesting mechanisms that made networks more adversarially robust and also explain some recent findings e.g. 
why R networks benefit from much larger capacity and can act as a strong image prior in image synthesis. ",/pdf/1311ac377834b5fdc8b6e19ca67aa2be74ebe963.pdf,ICLR,2021,adversarial training makes networks more robust by smoothing out kernels and reducing the space of active inputs of each neuron +Sy-dQG-Rb,BJ9w7zW0Z,1509140000000.0,1519560000000.0,806,Neural Speed Reading via Skim-RNN,"[""minjoon@cs.washington.edu"", ""shmsw25@snu.ac.kr"", ""ali@cs.washington.edu"", ""hannaneh@washington.edu""]","[""Minjoon Seo"", ""Sewon Min"", ""Ali Farhadi"", ""Hannaneh Hajishirzi""]","[""Natural Language Processing"", ""RNN"", ""Inference Speed""]","Inspired by the principles of speed reading, we introduce Skim-RNN, a recurrent neural network (RNN) that dynamically decides to update only a small fraction of the hidden state for relatively unimportant input tokens. Skim-RNN gives a significant computational advantage over an RNN that always updates the entire hidden state. Skim-RNN uses the same input and output interfaces as a standard RNN and can be easily used instead of RNNs in existing models. In our experiments, we show that Skim-RNN can achieve significantly reduced computational cost without losing accuracy compared to standard RNNs across five different natural language tasks. In addition, we demonstrate that the trade-off between accuracy and speed of Skim-RNN can be dynamically controlled during inference time in a stable manner. 
Our analysis also shows that Skim-RNN running on a single CPU offers lower latency compared to standard RNNs on GPUs.",/pdf/9dd779561969b56e400aa5b93404a0a9abb2b6c2.pdf,ICLR,2018, +w6p7UMtf-0S,0P_glDZZU5o,1601310000000.0,1614990000000.0,2158,Improving Few-Shot Visual Classification with Unlabelled Examples,"[""~Peyman_Bateni1"", ""jarred.barber@gmail.com"", ""~Jan-Willem_van_de_Meent1"", ""~Frank_Wood2""]","[""Peyman Bateni"", ""Jarred Barber"", ""Jan-Willem van de Meent"", ""Frank Wood""]","[""Meta-learning"", ""Few-shot image classification"", ""Transductive few-shot learning""]","We propose a transductive meta-learning method that uses unlabelled instances to improve few-shot image classification performance. Our approach combines a regularized Mahalanobis-distance-based soft k-means clustering procedure with a modified state of the art neural adaptive feature extractor to achieve improved test-time classification accuracy using unlabelled data. We evaluate our method on transductive few-shot learning tasks, in which the goal is to jointly predict labels for query (test) examples given a set of support (training) examples. We achieve new state of the art performance on the Meta-Dataset and the mini-ImageNet and tiered-ImageNet benchmarks.",/pdf/87bedd9a16fc096e6ad38275d8835ef9ce2c1dfc.pdf,ICLR,2021,"We propose Transductive CNAPS, a neural adaptive Mahalanobis-distance based soft k-means approach for transductive few-shot image classification." 
+Iw4ZGwenbXf,gllss8qHcg,1601310000000.0,1615850000000.0,2254,NOVAS: Non-convex Optimization via Adaptive Stochastic Search for End-to-end Learning and Control,"[""~Ioannis_Exarchos1"", ""~Marcus_Aloysius_Pereira1"", ""~Ziyi_Wang1"", ""~Evangelos_Theodorou1""]","[""Ioannis Exarchos"", ""Marcus Aloysius Pereira"", ""Ziyi Wang"", ""Evangelos Theodorou""]","[""deep neural networks"", ""nested optimization"", ""stochastic control"", ""deep FBSDEs""]","In this work we propose the use of adaptive stochastic search as a building block for general, non-convex optimization operations within deep neural network architectures. Specifically, for an objective function located at some layer in the network and parameterized by some network parameters, we employ adaptive stochastic search to perform optimization over its output. This operation is differentiable and does not obstruct the passing of gradients during backpropagation, thus enabling us to incorporate it as a component in end-to-end learning. We study the proposed optimization module's properties and benchmark it against two existing alternatives on a synthetic energy-based structured prediction task, and further showcase its use in stochastic optimal control applications.",/pdf/19f093001a03a82a092d19740971a45fff9f47a8.pdf,ICLR,2021, +HJJ23bW0b,r1Ai2-bRZ,1509130000000.0,1519380000000.0,721,Initialization matters: Orthogonal Predictive State Recurrent Neural Networks,"[""kchoro@google.com"", ""cmdowney@cs.cmu.edu"", ""bboots@cc.gatech.edu""]","[""Krzysztof Choromanski"", ""Carlton Downey"", ""Byron Boots""]","[""recurrent neural networks"", ""orthogonal random features"", ""predictive state representations""]","Learning to predict complex time-series data is a fundamental challenge in a range of disciplines including Machine Learning, Robotics, and Natural Language Processing. Predictive State Recurrent Neural Networks (PSRNNs) (Downey et al.) 
are a state-of-the-art approach for modeling time-series data which combine the benefits of probabilistic filters and Recurrent Neural Networks into a single model. PSRNNs leverage the concept of Hilbert Space Embeddings of distributions (Smola et al.) to embed predictive states into a Reproducing Kernel Hilbert Space, then estimate, predict, and update these embedded states using Kernel Bayes Rule. Practical implementations of PSRNNs are made possible by the machinery of Random Features, where input features are mapped into a new space where dot products approximate the kernel well. Unfortunately PSRNNs often require a large number of RFs to obtain good results, resulting in large models which are slow to execute and slow to train. Orthogonal Random Features (ORFs) (Choromanski et al.) is an improvement on RFs which has been shown to decrease the number of RFs required for pointwise kernel approximation. Unfortunately, it is not clear that ORFs can be applied to PSRNNs, as PSRNNs rely on Kernel Ridge Regression as a core component of their learning algorithm, and the theoretical guarantees of ORF do not apply in this setting. In this paper, we extend the theory of ORFs to Kernel Ridge Regression and show that ORFs can be used to obtain Orthogonal PSRNNs (OPSRNNs), which are smaller and faster than PSRNNs. 
In particular, we show that OPSRNN models clearly outperform LSTMs and furthermore, can achieve accuracy similar to PSRNNs with an order of magnitude smaller number of features needed.",/pdf/dbeaab9a90e3833072ace5a9b61a9430527b32b4.pdf,ICLR,2018,Improving Predictive State Recurrent Neural Networks via Orthogonal Random Features +rJ0JwFcex,,1478300000000.0,1488510000000.0,498,Neuro-Symbolic Program Synthesis,"[""eparisot@andrew.cmu.edu"", ""asamir@microsoft.com"", ""risin@microsoft.com"", ""lihongli@microsoft.com"", ""denzho@microsoft.com"", ""pkohli@microsoft.com""]","[""Emilio Parisotto"", ""Abdel-rahman Mohamed"", ""Rishabh Singh"", ""Lihong Li"", ""Dengyong Zhou"", ""Pushmeet Kohli""]","[""Deep learning"", ""Structured prediction""]","Recent years have seen the proposal of a number of neural architectures for the problem of Program Induction. Given a set of input-output examples, these architectures are able to learn mappings that generalize to new test inputs. While achieving impressive results, these approaches have a number of important limitations: (a) they are computationally expensive and hard to train, (b) a model has to be trained for each task (program) separately, and (c) it is hard to interpret or verify the correctness of the learnt mapping (as it is defined by a neural network). In this paper, we propose a novel technique, Neuro-Symbolic Program Synthesis, to overcome the above-mentioned problems. Once trained, our approach can automatically construct computer programs in a domain-specific language that are consistent with a set of input-output examples provided at test time. Our method is based on two novel neural modules. The first module, called the cross correlation I/O network, given a set of input-output examples, produces a continuous representation of the set of I/O examples. 
The second module, the Recursive-Reverse-Recursive Neural Network (R3NN), given the continuous representation of the examples, synthesizes a program by incrementally expanding partial programs. We demonstrate the effectiveness of our approach by applying it to the rich and complex domain of regular expression based string transformations. Experiments show that the R3NN model is not only able to construct programs from new input-output examples, but it is also able to construct new programs for tasks that it had never observed before during training.",/pdf/0cc9921e54883e42e2fe02be08463c42f9ddbbb5.pdf,ICLR,2017,A neural architecture for learning programs in a domain-specific language that are consistent with a given set of input-output examples +6fb4mex_pUT,qKtBSMf2wn,1601310000000.0,1614990000000.0,1307,An Algorithm for Out-Of-Distribution Attack to Neural Network Encoder ,"[""~Liang_Liang2"", ""~Linhai_Ma1"", ""lxq93@miami.edu"", ""jasonchen@miami.edu""]","[""Liang Liang"", ""Linhai Ma"", ""Linchen Qian"", ""Jiasong Chen""]","[""Out-Of-Distribution"", ""DNN"", ""image classification""]","Deep neural networks (DNNs), especially convolutional neural networks, have achieved superior performance on image classification tasks. However, such performance is only guaranteed if the input to a trained model is similar to the training samples, i.e., the input follows the probability distribution of the training set. Out-Of-Distribution (OOD) samples do not follow the distribution of training set, and therefore the predicted class labels on OOD samples become meaningless. Classification-based methods have been proposed for OOD detection; however, in this study we show that this type of method has no theoretical guarantee and is practically breakable by our OOD Attack algorithm because of dimensionality reduction in the DNN models. We also show that Glow likelihood-based OOD detection is breakable as well. 
",/pdf/fe645f02d59fa9ec53ad65f760790ab9956d0ff5.pdf,ICLR,2021,Neural network is easily fooled by OOD samples due to non-bijective mapping caused by dimensionality reduction: a new method to generate OOD samples. +SyzKd1bCW,HJ-FdkZRW,1509120000000.0,1519430000000.0,492,Backpropagation through the Void: Optimizing control variates for black-box gradient estimation,"[""wgrathwohl@cs.toronto.edu"", ""choidami@cs.toronto.edu"", ""ywu@cs.toronto.edu"", ""roeder@cs.toronto.edu"", ""duvenaud@cs.toronto.edu""]","[""Will Grathwohl"", ""Dami Choi"", ""Yuhuai Wu"", ""Geoff Roeder"", ""David Duvenaud""]","[""optimization"", ""machine learning"", ""variational inference"", ""reinforcement learning"", ""gradient estimation"", ""deep learning"", ""discrete optimization""]","Gradient-based optimization is the foundation of deep learning and reinforcement learning. +Even when the mechanism being optimized is unknown or not differentiable, optimization using high-variance or biased gradient estimates is still often the best strategy. We introduce a general framework for learning low-variance, unbiased gradient estimators for black-box functions of random variables, based on gradients of a learned function. +These estimators can be jointly trained with model parameters or policies, and are applicable in both discrete and continuous settings. We give unbiased, adaptive analogs of state-of-the-art reinforcement learning methods such as advantage actor-critic. We also demonstrate this framework for training discrete latent-variable models.",/pdf/fa6e4f023de69c1031cf9f4bc7f54c69f223b9fb.pdf,ICLR,2018,We present a general method for unbiased estimation of gradients of black-box functions of random variables. We apply this method to discrete variational inference and reinforcement learning. 
+0aZG2VcWLY,cTLRP_TRbZ2,1601310000000.0,1614990000000.0,2664,Signal Coding and Reconstruction using Spike Trains,"[""~Anik_Chattopadhyay1"", ""~Arunava_Banerjee2""]","[""Anik Chattopadhyay"", ""Arunava Banerjee""]","[""spike trains"", ""signal encoding"", ""reconstruction"", ""kernel"", ""representer theorem"", ""compression"", ""convolutional matching pursuit"", ""COMP""]","In many animal sensory pathways, the transformation from external stimuli to spike trains is essentially deterministic. In this context, a new mathematical framework for coding and reconstruction, based on a biologically plausible model of the spiking neuron, is presented. The framework considers encoding of a signal through spike trains generated by an ensemble of neurons via a standard convolve-then-threshold mechanism, albeit with a wide variety of convolution kernels. Neurons are distinguished by their convolution kernels and threshold values. Reconstruction is posited as a convex optimization minimizing energy. Formal conditions under which perfect reconstruction of the signal from the spike trains is possible are then identified. Coding experiments on a large audio dataset are presented to demonstrate the strength of the framework.",/pdf/1171a3c1ecfbd2ebc3d1d12d70fbab1a5267507b.pdf,ICLR,2021," A mathematical framework for signal encoding and decoding, based on a model of biological neurons, is formulated and its efficacy is established both via mathematical results and through simulation experiments on large corpora of audio signals." +rkE8pVcle,,1478280000000.0,1488770000000.0,220,Learning through Dialogue Interactions by Asking Questions,"[""jiwel@fb.com"", ""ahm@fb.com"", ""spchopra@fb.com"", ""ranzato@fb.com"", ""jase@fb.com""]","[""Jiwei Li"", ""Alexander H. 
Miller"", ""Sumit Chopra"", ""Marc'Aurelio Ranzato"", ""Jason Weston""]","[""Natural language processing""]","A good dialogue agent should have the ability to interact with users by both responding to questions and by asking questions, and importantly to learn from both types of interactions. In this work, we explore this direction by designing a simulator and a set of synthetic tasks in the movie domain that allow such interactions between a learner and a teacher. We investigate how a learner can benefit from asking questions in both offline and online reinforcement learning settings, and demonstrate that the learner improves when asking questions. Our work represents a first step in developing such end-to-end learned interactive dialogue agents. +",/pdf/d3f95730beec457da2c74f0167b649461cdb77be.pdf,ICLR,2017,We investigate how a bot can benefit from interacting with users and asking questions. +edku48LG0pT,RRBA9OzPnxd,1601310000000.0,1614990000000.0,2093,A Neural Network MCMC sampler that maximizes Proposal Entropy,"[""~ZENGYI_LI1"", ""~Yubei_Chen1"", ""~Friedrich_Sommer1""]","[""ZENGYI LI"", ""Yubei Chen"", ""Friedrich Sommer""]","[""MCMC"", ""Adaptive MCMC"", ""Neural MCMC"", ""Normalizing Flow"", ""Entropy based speed Measure"", ""HMC"", ""Energy-based Model"", ""Sampling""]","Markov Chain Monte Carlo (MCMC) methods sample from unnormalized probability distributions and offer guarantees of exact sampling. However, in the continuous case, unfavorable geometry of the target distribution can greatly limit the efficiency of MCMC methods. Augmenting samplers with neural networks can potentially improve their efficiency. Previous neural network based samplers were trained with objectives that either did not explicitly encourage exploration, or used a L2 jump objective which could only be applied to well structured distributions. Thus it seems promising to instead maximize the proposal entropy for adapting the proposal to distributions of any shape. 
To allow direct optimization of the proposal entropy, we propose a neural network MCMC sampler that has a flexible and tractable proposal distribution. Specifically, our network architecture utilizes the gradient of the target distribution for generating proposals. Our model achieves significantly higher efficiency than previous neural network MCMC techniques in a variety of sampling tasks. Further, the sampler is applied on training of a convergent energy-based model of natural images. The learned sampler achieves significantly higher proposal entropy and sample quality compared to Langevin dynamics sampler.",/pdf/b65c7cd09ae6688097859e95829675444d767b1e.pdf,ICLR,2021,Neural network MCMC sampler that tractably maximize proposal entropy objective +B1xtFpVtvB,BygBwzpvwB,1569440000000.0,1577170000000.0,678,Improving the Generalization of Visual Navigation Policies using Invariance Regularization,"[""michel.aractingi@naverlabs.com"", ""christopher.dance@naverlabs.com"", ""julien.perez@naverlabs.com"", ""tomi.silander@naverlabs.com""]","[""Michel Aractingi"", ""Christopher Dance"", ""Julien Perez"", ""Tomi Silander""]","[""Generalization"", ""Deep Reinforcement Learning"", ""Invariant Representation""]","Training agents to operate in one environment often yields overfitted models that are unable to generalize to the changes in that environment. However, due to the numerous variations that can occur in the real-world, the agent is often required to be robust in order to be useful. This has not been the case for agents trained with reinforcement learning (RL) algorithms. In this paper, we investigate the overfitting of RL agents to the training environments in visual navigation tasks. Our experiments show that deep RL agents can overfit even when trained on multiple environments simultaneously. 
+We propose a regularization method which combines RL with supervised learning methods by adding a term to the RL objective that would encourage the invariance of a policy to variations in the observations that ought not to affect the action taken. The results of this method, called invariance regularization, show an improvement in the generalization of policies to environments not seen during training. +",/pdf/8ec83accc4a3d7c2085209fa6cf874ba78e97703.pdf,ICLR,2020,"We propose a regularization term that, when added to the reinforcement learning objective, allows the policy to maximize the reward and simultaneously learn to be invariant to the irrelevant changes within the input.." +m08OHhXxl-5,PvsWMonDBK2,1601310000000.0,1614990000000.0,1459,Privacy Preserving Recalibration under Domain Shift,"[""~Rachel_Luo1"", ""~Shengjia_Zhao1"", ""~Jiaming_Song1"", ""~Jonathan_Kuck1"", ""~Stefano_Ermon1"", ""~Silvio_Savarese1""]","[""Rachel Luo"", ""Shengjia Zhao"", ""Jiaming Song"", ""Jonathan Kuck"", ""Stefano Ermon"", ""Silvio Savarese""]","[""uncertainty calibration"", ""differential privacy""]","Classifiers deployed in high-stakes applications must output calibrated confidence scores, i.e. their predicted probabilities should reflect empirical frequencies. Typically this is achieved with recalibration algorithms that adjust probability estimates based on the real-world data; however, existing algorithms are not applicable in real-world situations where the test data follows a different distribution from the training data, and privacy preservation is paramount (e.g. protecting patient records). We introduce a framework that provides abstractions for performing recalibration under differential privacy constraints. This framework allows us to adapt existing recalibration algorithms to satisfy differential privacy while remaining effective for domain-shift situations. 
Guided by our framework, we also design a novel recalibration algorithm, accuracy temperature scaling, that is tailored to the requirements of differential privacy. In an extensive empirical study, we find that our algorithm improves calibration on domain-shift benchmarks under the constraints of differential privacy. On the 15 highest severity perturbations of the ImageNet-C dataset, our method achieves a median ECE of 0.029, over 2x better than the next best recalibration method and almost 5x better than without recalibration.",/pdf/0003c5ea8027bb4b102dabb9588e9ad76a5e0fb5.pdf,ICLR,2021,"We introduce a framework that provides abstractions for performing recalibration under differential privacy constraints, design a novel recalibration algorithm that works well in this setting, and extensively validate our method experimentally. " +H1fF0iR9KX,H1lWyATqKX,1538090000000.0,1545360000000.0,914,Geometry aware convolutional filters for omnidirectional images representation,"[""renata.khasanova@epfl.ch"", ""pascal.frossard@epfl.ch""]","[""Renata Khasanova"", ""Pascal Frossard""]","[""omnidirectional images"", ""classification"", ""deep learning"", ""graph signal processing""]","Due to their wide field of view, omnidirectional cameras are frequently used by autonomous vehicles, drones and robots for navigation and other computer vision tasks. The images captured by such cameras, are often analysed and classified with techniques designed for planar images that unfortunately fail to properly handle the native geometry of such images. That results in suboptimal performance, and lack of truly meaningful visual features. In this paper we aim at improving popular deep convolutional neural networks so that they can properly take into account the specific properties of omnidirectional data. In particular we propose an algorithm that adapts convolutional layers, which often serve as a core building block of a CNN, to the properties of omnidirectional images. 
Thus, our filters have a shape and size that adapts with the location on the omnidirectional image. We show that our method is not limited to spherical surfaces and is able to incorporate the knowledge about any kind of omnidirectional geometry inside the deep learning network. As depicted by our experiments, our method outperforms the existing deep neural network techniques for omnidirectional image classification and compression tasks.",/pdf/ded664f48f0530981c63bb43298026520f16c441.pdf,ICLR,2019, +HkNGsseC-,BJ7fiigR-,1509110000000.0,1519310000000.0,374,On the Expressive Power of Overlapping Architectures of Deep Learning,"[""or.sharir@cs.huji.ac.il"", ""shashua@cs.huji.ac.il""]","[""Or Sharir"", ""Amnon Shashua""]","[""Deep Learning"", ""Expressive Efficiency"", ""Overlapping"", ""Receptive Fields""]","Expressive efficiency refers to the relation between two architectures A and B, whereby any function realized by B could be replicated by A, but there exists functions realized by A, which cannot be replicated by B unless its size grows significantly larger. For example, it is known that deep networks are exponentially efficient with respect to shallow networks, in the sense that a shallow network must grow exponentially large in order to approximate the functions represented by a deep network of polynomial size. In this work, we extend the study of expressive efficiency to the attribute of network connectivity and in particular to the effect of ""overlaps"" in the convolutional process, i.e., when the stride of the convolution is smaller than its filter size (receptive field). +To theoretically analyze this aspect of network's design, we focus on a well-established surrogate for ConvNets called Convolutional Arithmetic Circuits (ConvACs), and then demonstrate empirically that our results hold for standard ConvNets as well. 
Specifically, our analysis shows that having overlapping local receptive fields, and more broadly denser connectivity, results in an exponential increase in the expressive capacity of neural networks. Moreover, while denser connectivity can increase the expressive capacity, we show that the most common types of modern architectures already exhibit exponential increase in expressivity, without relying on fully-connected layers.",/pdf/d94fb288204c263b5fc1be7d0bdd504368c3d8e4.pdf,ICLR,2018,We analyze how the degree of overlaps between the receptive fields of a convolutional network affects its expressive power. +r1lclxBYDS,ryls26kFDB,1569440000000.0,1577170000000.0,2109,On the implicit minimization of alternative loss functions when training deep networks,"[""alexandre.lemire-paquin.1@ulaval.ca"", ""brahim.chaib-draa@ift.ulaval.ca"", ""philippe.giguere@ift.ulaval.ca""]","[""Alexandre Lemire Paquin"", ""Brahim Chaib-draa"", ""Philippe Gigu\u00e8re""]","[""implicit minimization"", ""optimization bias"", ""margin based loss functions"", ""flat minima""]","Understanding the implicit bias of optimization algorithms is important in order to improve generalization of neural networks. One approach to try to exploit such understanding would be to then make the bias explicit in the loss function. Conversely, an interesting approach to gain more insights into the implicit bias could be to study how different loss functions are being implicitly minimized when training the network. In this work, we concentrate our study on the inductive bias occurring when minimizing the cross-entropy loss with different batch sizes and learning rates. We investigate how three loss functions are being implicitly minimized during training. These three loss functions are the Hinge loss with different margins, the cross-entropy loss with different temperatures and a newly introduced Gcdf loss with different standard deviations. 
This Gcdf loss establishes a connection between a sharpness measure for the 0−1 loss and margin based loss functions. We find that a common behavior is emerging for all the loss functions considered.",/pdf/a3c36c4f3b86a0ccd2afa515b2f44eb9a406b6b7.pdf,ICLR,2020,We study how the batch size and the learning affect the implicit minimization of different loss functions. +2wjKRmraNan,sGUx_F4FaFM,1601310000000.0,1614990000000.0,143,Non-Inherent Feature Compatible Learning,"[""~Yantao_Shen2"", ""~Fanzi_Wu1"", ""~Ying_Shan2""]","[""Yantao Shen"", ""Fanzi Wu"", ""Ying Shan""]","[""Deep Learning"", ""Feature Learning"", ""Compatible Learning""]","The need of Feature Compatible Learning (FCL) arises from many large scale retrieval-based applications, where updating the entire library of embedding vectors is expensive. When an upgraded embedding model shows potential, it is desired to transform the benefit of the new model without refreshing the library. While progresses have been made along this new direction, existing approaches for feature compatible learning mostly rely on old training data and classifiers, which are not available in many industry settings. In this work, we introduce an approach for feature compatible learning without inheriting old classifier and training data, i.e., Non-Inherent Feature Compatible Learning. Our approach requires only features extracted by \emph{old} model's backbone and \emph{new} training data, and makes no assumption about the overlap between old and new training data. We propose a unified framework for FCL, and extend it to handle the case where the old model is a black-box. Specifically, we learn a simple pseudo classifier in lieu of the old model, and further enhance it with a random walk algorithm. As a result, the embedding features produced by the new model can be matched with those from the old model without sacrificing performance. 
Experiments on ImageNet ILSVRC 2012 and Places365 data proved the efficacy of the proposed approach.",/pdf/295d87afa38298e1616cd1f3c4e779839e7bec09.pdf,ICLR,2021, +BklKFo09YX,Hyg12foctQ,1538090000000.0,1545360000000.0,466,Mol-CycleGAN - a generative model for molecular optimization,"[""l.maziarka@gmail.com"", ""lamiane.chan@gmail.com"", ""jan.kaczmarczyk@ardigen.com"", ""michal.warchol@ardigen.com""]","[""\u0141ukasz Maziarka"", ""Agnieszka Pocha"", ""Jan Kaczmarczyk"", ""Micha\u0142 Warcho\u0142""]","[""generative adversarial networks"", ""drug design"", ""deep learning"", ""molecule optimization""]","Designing a molecule with desired properties is one of the biggest challenges in drug development, as it requires optimization of chemical compound structures with respect to many complex properties. To augment the compound design process we introduce Mol-CycleGAN -- a CycleGAN-based model that generates optimized compounds with a chemical scaffold of interest. Namely, given a molecule our model generates a structurally similar one with an optimized value of the considered property. We evaluate the performance of the model on selected optimization objectives related to structural properties (presence of halogen groups, number of aromatic rings) and to a physicochemical property (penalized logP). In the task of optimization of penalized logP of drug-like molecules our model significantly outperforms previous results. ",/pdf/7203e57ca5c9da0c706368097c2cafa5d27ab1bf.pdf,ICLR,2019,We introduce Mol-CycleGAN - a new generative model for optimization of molecules to augment drug design. 
+JNtw9rUJnV,BZZqv3IAtQ6,1601310000000.0,1614990000000.0,1346,Real-Time AutoML,"[""~Iddo_Drori1"", ""bjk224@cornell.edu"", ""agk2151@columbia.edu"", ""ll3252@columbia.edu"", ""~Qiang_Ma3"", ""jd3599@columbia.edu"", ""ns625@cornell.edu"", ""~Madeleine_Udell1""]","[""Iddo Drori"", ""Brandon Kates"", ""Anant Kharkar"", ""Lu Liu"", ""Qiang Ma"", ""Jonah Deykin"", ""Nihar Sidhu"", ""Madeleine Udell""]","[""Automated machine learning"", ""zero-shot learning"", ""graph neural networks"", ""transformers""]","We present a new zero-shot approach to automated machine learning (AutoML) that predicts a high-quality model for a supervised learning task and dataset in real-time without fitting a single model. In contrast, most AutoML systems require tens or hundreds of model evaluations. Hence our approach accelerates AutoML by orders of magnitude. Our method uses a transformer-based language embedding to represent datasets and algorithms using their free-text descriptions and a meta-feature extractor to represent the data. We train a graph neural network in which each node represents a dataset to predict the best machine learning pipeline for a new test dataset. The graph neural network generalizes to new datasets and new sets of datasets. Our approach leverages the progress of unsupervised representation learning in natural language processing to provide a significant boost to AutoML. 
Performance is competitive with state-of-the-art AutoML systems while reducing running time from minutes to seconds and prediction time from minutes to milliseconds, providing AutoML in real-time.",/pdf/f6bffce3f915fe3658066439ab0161cd4820dcde.pdf,ICLR,2021, +HygkpxStvr,BygkAeWYDS,1569440000000.0,1577170000000.0,2558,Weakly-Supervised Trajectory Segmentation for Learning Reusable Skills,"[""parsa.m@berkeley.edu"", ""trevor@eecs.berkeley.edu"", ""pathak@berkeley.edu""]","[""Parsa Mahmoudieh"", ""Trevor Darrell"", ""Deepak Pathak""]","[""skills"", ""demonstration"", ""agent"", ""sub-task"", ""primitives"", ""robot learning"", ""manipulation""]","Learning useful and reusable skill, or sub-task primitives, is a long-standing problem in sensorimotor control. This is challenging because it's hard to define what constitutes a useful skill. Instead of direct manual supervision which is tedious and prone to bias, in this work, our goal is to extract reusable skills from a collection of human demonstrations collected directly for several end-tasks. We propose a weakly-supervised approach for trajectory segmentation following the classic work on multiple instance learning. Our approach is end-to-end trainable, works directly from high-dimensional input (e.g., images) and only requires the knowledge of what skill primitives are present at training, without any need of segmentation or ordering of primitives. We evaluate our approach via rigorous experimentation across four environments ranging from simulation to real world robots, procedurally generated to human collected demonstrations and discrete to continuous action space. Finally, we leverage the generated skill segmentation to demonstrate preliminary evidence of zero-shot transfer to new combinations of skills. 
Result videos at https://sites.google.com/view/trajectory-segmentation/",/pdf/7d52d37ebb5d2121ed3bf02485cc7b3d2456a777.pdf,ICLR,2020,Weakly supervised segmentation of human demonstrations into skill primitives by only using trajectory-level labels at training with neither time-step segmentation labels nor ordering information. +S1ecYANtPr,ryeAyIddDr,1569440000000.0,1577170000000.0,1258,Representation Learning Through Latent Canonicalizations,"[""orlitany@gmail.com"", ""arimorcos@gmail.com"", ""ssrinath@cs.stanford.edu"", ""guibas@cs.stanford.edu"", ""judy@gatech.edu""]","[""Or Litany"", ""Ari Morcos"", ""Srinath Sridhar"", ""Leonidas Guibas"", ""Judy Hoffman""]","[""representation learning"", ""latent canonicalization"", ""sim2real"", ""few shot"", ""disentanglement""]","We seek to learn a representation on a large annotated data source that generalizes to a target domain using limited new supervision. Many prior approaches to this problem have focused on learning disentangled representations so that as individual factors vary in a new domain, only a portion of the representation need be updated. In this work, we seek the generalization power of disentangled representations, but relax the requirement of explicit latent disentanglement and instead encourage linearity of individual factors of variation by requiring them to be manipulable by learned linear transformations. We dub these transformations latent canonicalizers, as they aim to modify the value of a factor to a pre-determined (but arbitrary) canonical value (e.g., recoloring the image foreground to black). Assuming a source domain with access to meta-labels specifying the factors of variation within an image, we demonstrate experimentally that our method helps reduce the number of observations needed to generalize to a similar target domain when compared to a number of supervised baselines. 
",/pdf/df2471b4b7e5155a21a27c2b5a06328c7553f34b.pdf,ICLR,2020,We introduce latent canonicalizers: linear transformations meant to structure latent representations for improved sim2real adaptation +KWToR-Phbrz,XDpUysCy5z0,1601310000000.0,1614990000000.0,1375,Beyond Trivial Counterfactual Generations with Diverse Valuable Explanations,"[""~Pau_Rodriguez2"", ""~Massimo_Caccia1"", ""~Alexandre_Lacoste1"", ""~Lee_Zamparo1"", ""~Issam_H._Laradji1"", ""~Laurent_Charlin1"", ""~David_Vazquez1""]","[""Pau Rodriguez"", ""Massimo Caccia"", ""Alexandre Lacoste"", ""Lee Zamparo"", ""Issam H. Laradji"", ""Laurent Charlin"", ""David Vazquez""]","[""Interpretability"", ""Counterfactual"", ""Explanations"", ""Black-Box""]","Explainability of machine learning models has gained considerable attention within our research community given the importance of deploying more reliable machine-learning systems. Explanability can also be helpful for model debugging. In computer vision applications, most methods explain models by displaying the regions in the input image that they focus on for their prediction, but it is difficult to improve models based on these explanations since they do not indicate why the model fail. Counterfactual methods, on the other hand, indicate how to perturb the input to change the model prediction, providing details about the model's decision-making. Unfortunately, current counterfactual methods make ambiguous interpretations as they combine multiple biases of the model and the data in a single counterfactual interpretation of the model's decision. Moreover, these methods tend to generate trivial counterfactuals about the model's decision, as they often suggest to exaggerate or remove the presence of the attribute being classified. Trivial counterfactuals are usually not valuable, since the information they provide is often already known to the system's designer. 
In this work, we propose a counterfactual method that learns a perturbation in a disentangled latent space that is constrained using a diversity-enforcing loss to uncover multiple valuable explanations about the model's prediction. Further, we introduce a mechanism to prevent the model from producing trivial explanations. Experiments on CelebA and Synbols demonstrate that our model improves the success rate of producing high-quality valuable explanations when compared to previous state-of-the-art methods. We will make the code public.",/pdf/2a9430e565061e45e19c65485e181e17f4c2edc6.pdf,ICLR,2021,DiVE generates counterfactual explanations that are non-trivial or obvious and thus more informative +HygrdpVKvr,ryenOg2wDB,1569440000000.0,1583910000000.0,630,NAS evaluation is frustratingly hard,"[""antoineyang3@gmail.com"", ""pedro.esperanca@huawei.com"", ""fabiom.carlucci@gmail.com""]","[""Antoine Yang"", ""Pedro M. Esperan\u00e7a"", ""Fabio M. Carlucci""]","[""neural architecture search"", ""nas"", ""benchmark"", ""reproducibility"", ""harking""]","Neural Architecture Search (NAS) is an exciting new field which promises to be as much as a game-changer as Convolutional Neural Networks were in 2012. Despite many great works leading to substantial improvements on a variety of tasks, comparison between different methods is still very much an open issue. While most algorithms are tested on the same datasets, there is no shared experimental protocol followed by all. As such, and due to the under-use of ablation studies, there is a lack of clarity regarding why certain methods are more effective than others. Our first contribution is a benchmark of 8 NAS methods on 5 datasets. To overcome the hurdle of comparing methods with different search spaces, we propose using a method’s relative improvement over the randomly sampled average architecture, which effectively removes advantages arising from expertly engineered search spaces or training protocols. 
Surprisingly, we find that many NAS techniques struggle to significantly beat the average architecture baseline. We perform further experiments with the commonly used DARTS search space in order to understand the contribution of each component in the NAS pipeline. These experiments highlight that: (i) the use of tricks in the evaluation protocol has a predominant impact on the reported performance of architectures; (ii) the cell-based search space has a very narrow accuracy range, such that the seed has a considerable impact on architecture rankings; (iii) the hand-designed macrostructure (cells) is more important than the searched micro-structure (operations); and (iv) the depth-gap is a real phenomenon, evidenced by the change in rankings between 8 and 20 cell architectures. To conclude, we suggest best practices, that we hope will prove useful for the community and help mitigate current NAS pitfalls, e.g. difficulties in reproducibility and comparison of search methods. The +code used is available at https://github.com/antoyang/NAS-Benchmark.",/pdf/2d0d2643d5295b6885c025c9313be38f0d3ce47c.pdf,ICLR,2020,"A study of how different components in the NAS pipeline contribute to the final accuracy. Also, a benchmark of 8 methods on 5 datasets." +lbHDMllIYI1,wCtkRTvBRcb,1601310000000.0,1614990000000.0,3317,Sparse matrix products for neural network compression,"[""~Luc_Giffon1"", ""~hachem_kadri2"", ""~Stephane_Ayache1"", ""~Ronan_Sicre3"", ""~thierry_artieres1""]","[""Luc Giffon"", ""hachem kadri"", ""Stephane Ayache"", ""Ronan Sicre"", ""thierry artieres""]","[""Compression"", ""sparsity""]","Over-parameterization of neural networks is a well known issue that comes along with their great performance. Among the many approaches proposed to tackle this problem, low-rank tensor decompositions are largely investigated to compress deep neural networks. Such techniques rely on a low-rank assumption of the layer weight tensors that does not always hold in practice. 
Following this observation, this paper studies sparsity inducing techniques to build new sparse matrix product layer for high-rate neural networks compression. Specifically, we explore recent advances in sparse optimization to replace each layer's weight matrix, either convolutional or fully connected, by a product of sparse matrices. Our experiments validate that our approach provides a better compression-accuracy trade-off than most popular low-rank-based compression techniques. +",/pdf/c8aa1d080e30039bded234619005c11003ef7334.pdf,ICLR,2021,The paper explores high rate neural networks compression with factorisation of weight matrices as product of sparse matrices. +ryxf9CEKDr,HJgUUKOdPH,1569440000000.0,1577170000000.0,1275,Efficient Saliency Maps for Explainable AI,"[""mundhenk1@llnl.gov"", ""chen52@llnl.gov"", ""fractor@eecs.berkeley.edu""]","[""T. Nathan Mundhenk"", ""Barry Chen"", ""Gerald Friedland""]","[""Saliency"", ""XAI"", ""Efficent"", ""Information""]","We describe an explainable AI saliency map method for use with deep convolutional neural networks (CNN) that is much more efficient than popular gradient methods. It is also quantitatively similar or better in accuracy. Our technique works by measuring information at the end of each network scale. This is then combined into a single saliency map. We describe how saliency measures can be made more efficient by exploiting Saliency Map Order Equivalence. Finally, we visualize individual scale/layer contributions by using a Layer Ordered Visualization of Information. This provides an interesting comparison of scale information contributions within the network not provided by other saliency map methods. Our method is generally straight forward and should be applicable to the most commonly used CNNs. 
(Full source code is available at http://www.anonymous.submission.com).",/pdf/fa8d58f0ff836f15c0d17b1786ec0e1e9f48b606.pdf,ICLR,2020,An efficent method for determining which locations in an image are informative to a CNN. +dFwBosAcJkN,#NAME?,1601310000000.0,1625510000000.0,261,Perceptual Adversarial Robustness: Defense Against Unseen Threat Models,"[""~Cassidy_Laidlaw1"", ""~Sahil_Singla1"", ""~Soheil_Feizi2""]","[""Cassidy Laidlaw"", ""Sahil Singla"", ""Soheil Feizi""]",[],"A key challenge in adversarial robustness is the lack of a precise mathematical characterization of human perception, used in the definition of adversarial attacks that are imperceptible to human eyes. Most current attacks and defenses try to get around this issue by considering restrictive adversarial threat models such as those bounded by $L_2$ or $L_\infty$ distance, spatial perturbations, etc. However, models that are robust against any of these restrictive threat models are still fragile against other threat models, i.e. they have poor generalization to unforeseen attacks. Moreover, even if a model is robust against the union of several restrictive threat models, it is still susceptible to other imperceptible adversarial examples that are not contained in any of the constituent threat models. To resolve these issues, we propose adversarial training against the set of all imperceptible adversarial examples. Since this set is intractable to compute without a human in the loop, we approximate it using deep neural networks. We call this threat model the neural perceptual threat model (NPTM); it includes adversarial examples with a bounded neural perceptual distance (a neural network-based approximation of the true perceptual distance) to natural images. Through an extensive perceptual study, we show that the neural perceptual distance correlates well with human judgements of perceptibility of adversarial examples, validating our threat model. 
+ +Under the NPTM, we develop novel perceptual adversarial attacks and defenses. Because the NPTM is very broad, we find that Perceptual Adversarial Training (PAT) against a perceptual attack gives robustness against many other types of adversarial attacks. We test PAT on CIFAR-10 and ImageNet-100 against five diverse adversarial attacks: $L_2$, $L_\infty$, spatial, recoloring, and JPEG. We find that PAT achieves state-of-the-art robustness against the union of these five attacks—more than doubling the accuracy over the next best model—without training against any of them. That is, PAT generalizes well to unforeseen perturbation types. This is vital in sensitive applications where a particular threat model cannot be assumed, and to the best of our knowledge, PAT is the first adversarial training defense with this property. + +Code and data are available at https://github.com/cassidylaidlaw/perceptual-advex",/pdf/eb768a2d394baa94aebe2efba341cc5a5244428d.pdf,ICLR,2021,Adversarial training against a perceptually-aligned attack gives high robustness against many diverse adversarial threat models. +Hy_o3x-0b,rJwjnx-C-,1509130000000.0,1518730000000.0,628,Feature Map Variational Auto-Encoders,"[""larsma@dtu.dk"", ""olwi@dtu.dk""]","[""Lars Maal\u00f8e"", ""Ole Winther""]","[""deep learning"", ""representation learning"", ""variational auto-encoders"", ""variational inference"", ""generative models""]","There have been multiple attempts with variational auto-encoders (VAE) to learn powerful global representations of complex data using a combination of latent stochastic variables and an autoregressive model over the dimensions of the data. However, for the most challenging natural image tasks the purely autoregressive model with stochastic variables still outperform the combined stochastic autoregressive models. In this paper, we present simple additions to the VAE framework that generalize to natural images by embedding spatial information in the stochastic layers. 
We significantly improve the state-of-the-art results on MNIST, OMNIGLOT, CIFAR10 and ImageNet when the feature map parameterization of the stochastic variables are combined with the autoregressive PixelCNN approach. Interestingly, we also observe close to state-of-the-art results without the autoregressive part. This opens the possibility for high quality image generation with only one forward-pass. +",/pdf/e6a071da6490a0996d08f3ddbc463bea39e1f913.pdf,ICLR,2018,We present a generative model that proves state-of-the-art results on gray-scale and natural images. +SkEqro0ctQ,rkeJSn9KFm,1538090000000.0,1547600000000.0,113,Hierarchical interpretations for neural network predictions,"[""chandan_singh@berkeley.edu"", ""jmurdoch@berkeley.edu"", ""binyu@berkeley.edu""]","[""Chandan Singh"", ""W. James Murdoch"", ""Bin Yu""]","[""interpretability"", ""natural language processing"", ""computer vision""]","Deep neural networks (DNNs) have achieved impressive predictive performance due to their ability to learn complex, non-linear relationships between variables. However, the inability to effectively visualize these relationships has led to DNNs being characterized as black boxes and consequently limited their applications. To ameliorate this problem, we introduce the use of hierarchical interpretations to explain DNN predictions through our proposed method: agglomerative contextual decomposition (ACD). Given a prediction from a trained DNN, ACD produces a hierarchical clustering of the input features, along with the contribution of each cluster to the final prediction. This hierarchy is optimized to identify clusters of features that the DNN learned are predictive. We introduce ACD using examples from Stanford Sentiment Treebank and ImageNet, in order to diagnose incorrect predictions, identify dataset bias, and extract polarizing phrases of varying lengths. 
Through human experiments, we demonstrate that ACD enables users both to identify the more accurate of two DNNs and to better trust a DNN's outputs. We also find that ACD's hierarchy is largely robust to adversarial perturbations, implying that it captures fundamental aspects of the input and ignores spurious noise.",/pdf/edd9240b121ad8ce494d42bb4ce31f44f4d8d045.pdf,ICLR,2019,"We introduce and validate hierarchical local interpretations, the first technique to automatically search for and display important interactions for individual predictions made by LSTMs and CNNs." +kDnal_bbb-E,WBMK5Uno697,1601310000000.0,1616020000000.0,2807,DialoGraph: Incorporating Interpretable Strategy-Graph Networks into Negotiation Dialogues,"[""~Rishabh_Joshi1"", ""~Vidhisha_Balachandran1"", ""~Shikhar_Vashishth1"", ""~Alan_Black1"", ""~Yulia_Tsvetkov1""]","[""Rishabh Joshi"", ""Vidhisha Balachandran"", ""Shikhar Vashishth"", ""Alan Black"", ""Yulia Tsvetkov""]","[""negotiation"", ""dialogue"", ""graph neural networks"", ""interpretability"", ""structure""]","To successfully negotiate a deal, it is not enough to communicate fluently: pragmatic planning of persuasive negotiation strategies is essential. While modern dialogue agents excel at generating fluent sentences, they still lack pragmatic grounding and cannot reason strategically. We present DialoGraph, a negotiation system that incorporates pragmatic strategies in a negotiation dialogue using graph neural networks. DialoGraph explicitly incorporates dependencies between sequences of strategies to enable improved and interpretable prediction of next optimal strategies, given the dialogue context. Our graph-based method outperforms prior state-of-the-art negotiation models both in the accuracy of strategy/dialogue act prediction and in the quality of downstream dialogue response generation. 
We qualitatively show further benefits of learned strategy-graphs in providing explicit associations between effective negotiation strategies over the course of the dialogue, leading to interpretable and strategic dialogues.",/pdf/1f09e2eb0a2962d022f2fc8411de57bb2f420a25.pdf,ICLR,2021,"We propose DialoGraph, a negotiation dialogue system that leverages Graph Attention Networks to model complex negotiation strategies while providing interpretability for the model via intermediate structures." +SJxpsxrYPS,r1eDRybFvS,1569440000000.0,1583910000000.0,2518,PROGRESSIVE LEARNING AND DISENTANGLEMENT OF HIERARCHICAL REPRESENTATIONS,"[""zl7904@rit.edu"", ""jvm6526@rit.edu"", ""pkg2182@rit.edu"", ""linwei.wang@rit.edu""]","[""Zhiyuan Li"", ""Jaideep Vitthal Murkute"", ""Prashnna Kumar Gyawali"", ""Linwei Wang""]","[""generative model"", ""disentanglement"", ""progressive learning"", ""VAE""]","Learning rich representation from data is an important task for deep generative models such as variational auto-encoder (VAE). However, by extracting high-level abstractions in the bottom-up inference process, the goal of preserving all factors of variations for top-down generation is compromised. Motivated by the concept of “starting small”, we present a strategy to progressively learn independent hierarchical representations from high- to low-levels of abstractions. The model starts with learning the most abstract representation, and then progressively grow the network architecture to introduce new representations at different levels of abstraction. We quantitatively demonstrate the ability of the presented model to improve disentanglement in comparison to existing works on two benchmark datasets using three disentanglement metrics, including a new metric we proposed to complement the previously-presented metric of mutual information gap. 
We further present both qualitative and quantitative evidence on how the progression of learning improves disentangling of hierarchical representations. By drawing on the respective advantage of hierarchical representation learning and progressive learning, this is to our knowledge the first attempt to improve disentanglement by progressively growing the capacity of VAE to learn hierarchical representations.",/pdf/efb4325df6680b31749512d779002476a362c608.pdf,ICLR,2020,We proposed a progressive learning method to improve learning and disentangling latent representations at different levels of abstraction. +SyxeqhP9ll,,1478290000000.0,1487880000000.0,434,Calibrating Energy-based Generative Adversarial Networks,"[""zander.dai@gmail.com"", ""amjadmahayri@gmail.com"", ""phil.bachman@gmail.com"", ""hovy@cmu.edu"", ""aaron.courville@gmail.com""]","[""Zihang Dai"", ""Amjad Almahairi"", ""Philip Bachman"", ""Eduard Hovy"", ""Aaron Courville""]","[""Deep learning""]","In this paper, we propose to equip Generative Adversarial Networks with the ability to produce direct energy estimates for samples. +Specifically, we propose a flexible adversarial training framework, and prove this framework not only ensures the generator converges to the true data distribution, but also enables the discriminator to retain the density information at the global optimal. +We derive the analytic form of the induced solution, and analyze the properties. +In order to make the proposed framework trainable in practice, we introduce two effective approximation techniques. 
+Empirically, the experiment results closely match our theoretical analysis, verifying the discriminator is able to recover the energy of data distribution.",/pdf/b8945bf8e29ebf04fdda251269f36cbd8986c26b.pdf,ICLR,2017, +CYHMIhbuLFl,gs9a3E0rC6_,1601310000000.0,1614990000000.0,1881,Contextual HyperNetworks for Novel Feature Adaptation,"[""~Angus_Lamb1"", ""e.s.saveliev@gmail.com"", ""~Yingzhen_Li1"", ""~Sebastian_Tschiatschek1"", ""camilla.longden@microsoft.com"", ""simon.woodhead@eedi.co.uk"", ""~Jos\u00e9_Miguel_Hern\u00e1ndez-Lobato1"", ""~Richard_E_Turner1"", ""~Pashmina_Cameron1"", ""~Cheng_Zhang1""]","[""Angus Lamb"", ""Evgeny Saveliev"", ""Yingzhen Li"", ""Sebastian Tschiatschek"", ""Camilla Longden"", ""Simon Woodhead"", ""Jos\u00e9 Miguel Hern\u00e1ndez-Lobato"", ""Richard E Turner"", ""Pashmina Cameron"", ""Cheng Zhang""]","[""Meta learning"", ""few-shot learning"", ""continual learning"", ""recommender systems"", ""deep learning""]","While deep learning has obtained state-of-the-art results in many applications, the adaptation of neural network architectures to incorporate new features remains a research challenge. This issue is particularly severe in online learning settings, where new features are added continually with few or no associated observations. As such, methods for adapting neural networks to novel features which are both time and data-efficient are desired. To address this, we propose the Contextual HyperNetwork (CHN), which predicts the network weights associated with new features by incorporating information from both existing data as well as the few observations for the new feature and any associated feature metadata. At prediction time, the CHN requires only a single forward pass through a small neural network, yielding a significant speed-up when compared to re-training and fine-tuning approaches. 
In order to showcase the performance of CHNs, in this work we use a CHN to augment a partial variational autoencoder (P-VAE), a flexible deep generative model which can impute the values of missing features in sparsely-observed data. We show that this system obtains significantly improved performance for novel feature adaptation over existing imputation and meta-learning baselines across recommender systems, e-learning, and healthcare tasks.",/pdf/dad11a3cea27e7d0d62ec1154262f381e27fe808.pdf,ICLR,2021,"We introduce an auxiliary neural network to extend existing neural networks to make accurate predictions for new features in the few-shot learning regime, given a small number of observations and/or metadata for the new feature." +ES9cpVTyLL,#NAME?,1601310000000.0,1614990000000.0,970,"Weak and Strong Gradient Directions: Explaining Memorization, Generalization, and Hardness of Examples at Scale","[""zielinski@google.com"", ""~Shankar_Krishnan1"", ""~Satrajit_Chatterjee1""]","[""Piotr Zielinski"", ""Shankar Krishnan"", ""Satrajit Chatterjee""]","[""generalization"", ""deep learning"", ""hardness of examples""]","Coherent Gradients (CGH) [Chatterjee, ICLR 20] is a recently proposed hypothesis to explain why over-parameterized neural networks trained with gradient descent generalize well even though they have sufficient capacity to memorize the training set. The key insight of CGH is that, since the overall gradient for a single step of SGD is the sum of the per-example gradients, it is strongest in directions that reduce the loss on multiple examples if such directions exist. In this paper, we validate CGH on ResNet, Inception, and VGG models on ImageNet. Since the techniques presented in the original paper do not scale beyond toy models and datasets, we propose new methods. By posing the problem of suppressing weak gradient directions as a problem of robust mean estimation, we develop a coordinate-based median of means approach. 
We present two versions of this algorithm, M3, which partitions a mini-batch into 3 groups and computes the median, and a more efficient version RM3, which reuses gradients from previous two time steps to compute the median. Since they suppress weak gradient directions without requiring per-example gradients, they can be used to train models at scale. Experimentally, we find that they indeed greatly reduce overfitting (and memorization) and thus provide the first convincing evidence that CGH holds at scale. We also propose a new test of CGH that does not depend on adding noise to training labels or on suppressing weak gradient directions. Using the intuition behind CGH, we posit that the examples learned early in the training process (i.e., ""easy"" examples) are precisely those that have more in common with other training examples. Therefore, as per CGH, the easy examples should generalize better amongst themselves than the hard examples amongst themselves. We validate this hypothesis with detailed experiments, and believe that it provides further orthogonal evidence for CGH.",/pdf/9a7a885bad60d56487a63b60a48bcddd5142a959.pdf,ICLR,2021,"We present a new algorithm that allows us to test the Coherent Gradient hypothesis for the first time at scale, and also present a fundamentally new test of Coherent Gradients based on hardness of examples." +rkhlb8lCZ,BJhxbLlRW,1509090000000.0,1519180000000.0,270,Wavelet Pooling for Convolutional Neural Networks,"[""tlwilli3@aggies.ncat.edu"", ""eeli@ncat.edu""]","[""Travis Williams"", ""Robert Li""]","[""Pooling"", ""Wavelet"", ""CNN"", ""Neural Network"", ""Deep Learning"", ""Classification"", ""Machine Learning"", ""Object Recognition""]","Convolutional Neural Networks continuously advance the progress of 2D and 3D image and object classification. The steadfast usage of this algorithm requires constant evaluation and upgrading of foundational concepts to maintain progress. 
Network regularization techniques typically focus on convolutional layer operations, while leaving pooling layer operations without suitable options. We introduce Wavelet Pooling as another alternative to traditional neighborhood pooling. This method decomposes features into a second level decomposition, and discards the first-level subbands to reduce feature dimensions. This method addresses the overfitting problem encountered by max pooling, while reducing features in a more structurally compact manner than pooling via neighborhood regions. Experimental results on four benchmark classification datasets demonstrate our proposed method outperforms or performs comparatively with methods like max, mean, mixed, and stochastic pooling. ",/pdf/ca8156ee78d51d1d47cea8c49ffab3a6d818c7ea.pdf,ICLR,2018,"Pooling is achieved using wavelets instead of traditional neighborhood approaches (max, average, etc)." +B1eWbxStPH,SkgMH1gKwB,1569440000000.0,1583910000000.0,2126,Directional Message Passing for Molecular Graphs,"[""klicpera@in.tum.de"", ""grossja@in.tum.de"", ""guennemann@in.tum.de""]","[""Johannes Klicpera"", ""Janek Gro\u00df"", ""Stephan G\u00fcnnemann""]","[""GNN"", ""Graph neural network"", ""message passing"", ""graphs"", ""equivariance"", ""molecules""]","Graph neural networks have recently achieved great successes in predicting quantum mechanical properties of molecules. These models represent a molecule as a graph using only the distance between atoms (nodes). They do not, however, consider the spatial direction from one atom to another, despite directional information playing a central role in empirical potentials for molecules, e.g. in angular potentials. To alleviate this limitation we propose directional message passing, in which we embed the messages passed between atoms instead of the atoms themselves. Each message is associated with a direction in coordinate space. 
These directional message embeddings are rotationally equivariant since the associated directions rotate with the molecule. We propose a message passing scheme analogous to belief propagation, which uses the directional information by transforming messages based on the angle between them. Additionally, we use spherical Bessel functions and spherical harmonics to construct theoretically well-founded, orthogonal representations that achieve better performance than the currently prevalent Gaussian radial basis representations while using fewer than 1/4 of the parameters. We leverage these innovations to construct the directional message passing neural network (DimeNet). DimeNet outperforms previous GNNs on average by 76% on MD17 and by 31% on QM9. Our implementation is available online.",/pdf/0255ee67aa275932d71e466955039aabc8460e41.pdf,ICLR,2020,Directional message passing incorporates spatial directional information to improve graph neural networks. +BJlgt2EYwr,rkg-JaL5US,1569440000000.0,1577170000000.0,66,Stabilizing DARTS with Amended Gradient Estimation on Architectural Parameters,"[""bikaifeng@huawei.com"", ""huchangping@huawei.com"", ""198808xc@gmail.com"", ""1410452@tongji.edu.cn"", ""weilonghui1@huawei.com"", ""tian.qi1@huawei.com""]","[""Kaifeng Bi"", ""Changping Hu"", ""Lingxi Xie"", ""Xin Chen"", ""Longhui Wei"", ""Qi Tian""]","[""Neural Architecture Search"", ""DARTS"", ""Stability""]","Differentiable neural architecture search has been a popular methodology of exploring architectures for deep learning. Despite the great advantage of search efficiency, it often suffers weak stability, which obstacles it from being applied to a large search space or being flexibly adjusted to different scenarios. This paper investigates DARTS, the currently most popular differentiable search algorithm, and points out an important factor of instability, which lies in its approximation on the gradients of architectural parameters. 
In the current status, the optimization algorithm can converge to another point which results in dramatic inaccuracy in the re-training process. Based on this analysis, we propose an amending term for computing architectural gradients by making use of a direct property of the optimality of network parameter optimization. Our approach mathematically guarantees that gradient estimation follows a roughly correct direction, which leads the search stage to converge on reasonable architectures. In practice, our algorithm is easily implemented and added to DARTS-based approaches efficiently. Experiments on CIFAR and ImageNet demonstrate that our approach enjoys accuracy gain and, more importantly, enables DARTS-based approaches to explore much larger search spaces that have not been studied before.",/pdf/29b80854791c31b9909a8044bc9dbbc8086e0eb9.pdf,ICLR,2020,An improved optimization of differentiable NAS that largely improves search stability +EeeOTYhLlVm,D4O-w6-N1S,1601310000000.0,1614990000000.0,3672,EpidemiOptim: A Toolbox for the Optimization of Control Policies in Epidemiological Models,"[""~C\u00e9dric_Colas1"", ""boris.hejblum@u-bordeaux.fr"", ""sebastien.rouillon@u-bordeaux.fr"", ""rodolphe.thiebaut@inria.fr"", ""~Pierre-Yves_Oudeyer1"", ""~Cl\u00e9ment_Moulin-Frier2"", ""melanie.prague@u-bordeaux.fr""]","[""C\u00e9dric Colas"", ""Boris Hejblum"", ""S\u00e9bastien Rouillon"", ""Rodolphe Thiebaut"", ""Pierre-Yves Oudeyer"", ""Cl\u00e9ment Moulin-Frier"", ""M\u00e9lanie Prague""]","[""epidemiology"", ""covid19"", ""reinforcement learning"", ""evolutionary algorithms"", ""multi-objective optimization"", ""decision-making"", ""toolbox""]","Epidemiologists model the dynamics of epidemics in order to propose control strategies based on pharmaceutical and non-pharmaceutical interventions (contact limitation, lock down, vaccination, etc). 
Hand-designing such strategies is not trivial because of the number of possible interventions and the difficulty to predict long-term effects. This task can be cast as an optimization problem where state-of-the-art machine learning algorithms such as deep reinforcement learning might bring significant value. However, the specificity of each domain - epidemic modelling or solving optimization problems - requires strong collaborations between researchers from different fields of expertise. +This is why we introduce EpidemiOptim, a Python toolbox that facilitates collaborations between researchers in epidemiology and optimization. EpidemiOptim turns epidemiological models and cost functions into optimization problems via a standard interface commonly used by optimization practitioners (OpenAI Gym). Reinforcement learning algorithms based on Q-Learning with deep neural networks (DQN) and evolutionary algorithms (NSGA-II) are already implemented. We illustrate the use of EpidemiOptim to find optimal policies for dynamical on-off lock-down control under the optimization of death toll and economic recess using a Susceptible-Exposed-Infectious-Removed (SEIR) model for SARS-CoV-2/COVID-19. +Using EpidemiOptim and its interactive visualization platform in Jupyter notebooks, epidemiologists, optimization practitioners and others (e.g. economists) can easily compare epidemiological models, costs functions and optimization algorithms to address important choices to be made by health decision-makers.",/pdf/42e495f744094c499493e66b4478cef58395b642.pdf,ICLR,2021,"We present EpidemiOptim, a toolbox that facilitates collaborations between epidemiologists and optimization practitioners by formulating the search of epidemic control policies as standard optimization problems. 
" +XG1Drw7VbLJ,tYPascUvLLi,1601310000000.0,1614990000000.0,1737,Defining Benchmarks for Continual Few-Shot Learning,"[""~Antreas_Antoniou2"", ""~Massimiliano_Patacchiola1"", ""~Mateusz_Ochal1"", ""~Amos_Storkey1""]","[""Antreas Antoniou"", ""Massimiliano Patacchiola"", ""Mateusz Ochal"", ""Amos Storkey""]","[""few-shot learning"", ""continual learning"", ""benchmark""]","In recent years there has been substantial progress in few-shot learning, where a model is trained on a small labeled dataset related to a specific task, and in continual learning, where a model has to retain knowledge acquired on a sequence of datasets. Both of these fields are different abstractions of the same real world scenario, where a learner has to adapt to limited information from different changing sources and be able to generalize in and from each of them. Combining these two paradigms, where a model is trained on several sequential few-shot tasks, and then tested on a validation set stemming from all those tasks, helps by explicitly defining the competing requirements for both efficient integration and continuity. In this paper we propose such a setting, naming it Continual Few-Shot Learning (CFSL). We first define a theoretical framework for CFSL, then we propose a range of flexible benchmarks to unify the evaluation criteria. As part of the benchmark, we introduce a compact variant of ImageNet, called SlimageNet64, which retains all original 1000 classes but only contains 200 instances of each one (a total of 200K data-points) downscaled to 64 by 64 pixels. We provide baselines for the proposed benchmarks using a number of popular few-shot and continual learning methods, exposing previously unknown strengths and weaknesses of those algorithms. The dataloader and dataset will be released with an open-source license.",/pdf/9d040a3a2bfb3fa9f9f1e9c64fe193c40efcbe90.pdf,ICLR,2021,The paper propose a benchmark for bridging the gap between few-shot and continual learning. 
+gJYlaqL8i8,ZGExPjz4lsU,1601310000000.0,1615990000000.0,3454,Learning to Sample with Local and Global Contexts in Experience Replay Buffer,"[""~Youngmin_Oh2"", ""~Kimin_Lee1"", ""~Jinwoo_Shin1"", ""~Eunho_Yang1"", ""~Sung_Ju_Hwang1""]","[""Youngmin Oh"", ""Kimin Lee"", ""Jinwoo Shin"", ""Eunho Yang"", ""Sung Ju Hwang""]","[""reinforcement learning"", ""experience replay buffer"", ""off-policy RL""]","Experience replay, which enables the agents to remember and reuse experience from the past, has played a significant role in the success of off-policy reinforcement learning (RL). To utilize the experience replay efficiently, the existing sampling methods select more meaningful experiences by imposing priorities on them based on certain metrics (e.g. TD-error). However, they may result in sampling highly biased, redundant transitions since they compute the sampling rate for each transition independently, without consideration of its importance in relation to other transitions. In this paper, we aim to address the issue by proposing a new learning-based sampling method that can compute the relative importance of transitions. To this end, we design a novel permutation-equivariant neural architecture that takes contexts from not only features of each transition (local) but also those of others (global) as inputs. We validate our framework, which we refer to as Neural Experience Replay Sampler (NERS), on multiple benchmarks for both continuous and discrete control tasks and show that it can significantly improve the performance of various off-policy RL methods. Further analysis confirms that the improvements in sample efficiency are indeed due to sampling diverse and meaningful transitions by NERS that considers both local and global contexts. ",/pdf/92ef8e632b99778a17bd8e0187962812d2cd42c5.pdf,ICLR,2021,We propose a learning-based neural replay which calculates the relative importance to sample experience for off-policy RL. 
+fV2ScEA03Hg,rTaw6w21I9s,1601310000000.0,1614990000000.0,3100,AutoCleansing: Unbiased Estimation of Deep Learning with Mislabeled Data,"[""~Koichi_Kuriyama1""]","[""Koichi Kuriyama""]","[""Automatic Data Cleansing"", ""Incorrect Labels"", ""Multiple Objects""]","Mislabeled samples cause prediction errors. This study proposes a solution to the problem of incorrect labels, called AutoCleansing, to automatically capture the effect of incorrect labels and mitigate it without removing the mislabeled samples. AutoCleansing consists of a base network model and sample-category specific constants. Both the parameters of the base model and the sample-category constants are estimated simultaneously using the training data. Thereafter, predictions for test data are made using the base model without the constants capturing the mislabeled effects. A theoretical model for AutoCleansing is developed, showing that the gradient of the loss function of the proposed method can be zero at true parameters with mislabeled data if the model is correctly constructed. Experimental results show that AutoCleansing has better performance in test accuracy than previous studies for CIFAR-10, CIFAR-100, SVHN, and ImageNet datasets.",/pdf/3b5902797c2072fc5c0be807ea4b660fd02763a2.pdf,ICLR,2021,AutoCleansing can capture the effect of incorrect labels and mitigate it without removing the mislabeled samples. 
+NqWY3s0SILo,ll7Aqfbihb9,1601310000000.0,1614990000000.0,517,Differentiable Graph Optimization for Neural Architecture Search,"[""~Chengyue_Huang1"", ""~Lingfei_Wu1"", ""~Yadong_Ding1"", ""~Siliang_Tang1"", ""lili@yixue.us"", ""zongchang@zju.edu.cn"", ""chilie.tan@tongdun.net"", ""~Yueting_Zhuang1""]","[""Chengyue Huang"", ""Lingfei Wu"", ""Yadong Ding"", ""Siliang Tang"", ""Fangli Xu"", ""Chang Zong"", ""Chilie Tan"", ""Yueting Zhuang""]","[""Neural Architecture Search"", ""Graph Structure Learning""]","In this paper, we propose Graph Optimized Neural Architecture Learning (GOAL), a novel gradient-based method for Neural Architecture Search (NAS), to find better architectures with fewer evaluated samples. Popular NAS methods usually employ black-box optimization based approaches like reinforcement learning, evolutionary algorithms or Bayesian optimization, which may be inefficient in huge combinatorial NAS search spaces. In contrast, we aim to explicitly model the NAS search space as graphs, and then perform gradient-based optimization to learn graph structure with efficient exploitation. To this end, we learn a differentiable graph neural network as a surrogate model to rank candidate architectures, which enables us to obtain gradients w.r.t. the input architectures. To cope with the difficulty in gradient-based optimization on the discrete graph structures, we propose to leverage proximal gradient descent to find potentially better architectures. 
+Our empirical results show that GOAL outperforms mainstream black-box methods on existing NAS benchmarks in terms of search efficiency.",/pdf/b022c3c21589cd211be319cbfd39547863896534.pdf,ICLR,2021, +rkedXgrKDH,HyghSVxYDS,1569440000000.0,1577170000000.0,2216,Trajectory growth through random deep ReLU networks,"[""ilan.price@maths.ox.ac.uk"", ""tanner@maths.ox.ac.uk""]","[""Ilan Price"", ""Jared Tanner""]","[""Deep networks"", ""expressivity"", ""trajectory growth"", ""sparse neural networks""]","This paper considers the growth in the length of one-dimensional trajectories as they are passed through deep ReLU neural networks, which, among other things, is one measure of the expressivity of deep networks. We generalise existing results, providing an alternative, simpler method for lower bounding expected trajectory growth through random networks, for a more general class of weights distributions, including sparsely connected networks. We illustrate this approach by deriving bounds for sparse-Gaussian, sparse-uniform, and sparse-discrete-valued random nets. We prove that trajectory growth can remain exponential in depth with these new distributions, including their sparse variants, with the sparsity parameter appearing in the base of the exponent.",/pdf/372479fc655b12abb02339ed4b2e7c85fb9dfae1.pdf,ICLR,2020,The expected trajectory growth of a random sparsely connected deep neural network is exponential in depth across many distributions including the default initialisations used in Tensorflow and Pytorch +4zr9e5xwZ9Y,PulXyTtNVnx,1601310000000.0,1614990000000.0,1617,Distributed Training of Graph Convolutional Networks using Subgraph Approximation,"[""~Alexandra_Angerd1"", ""keshavba@usc.edu"", ""~Murali_Annavaram1""]","[""Alexandra Angerd"", ""Keshav Balasubramanian"", ""Murali Annavaram""]",[],"Modern machine learning techniques are successfully being adapted to data modeled as graphs. 
However, many real-world graphs are typically very large and do not fit in memory, often making the problem of training machine learning models on them intractable. Distributed training has been successfully employed to alleviate memory problems and speed up training in machine learning domains in which the input data is assumed to be independently identical distributed (i.i.d). However, distributing the training of non i.i.d data such as graphs that are used as training inputs in Graph Convolutional Networks (GCNs) causes accuracy problems since information is lost at the graph partitioning boundaries. + +In this paper, we propose a training strategy that mitigates the lost information across multiple partitions of a graph through a subgraph approximation scheme. Our proposed approach augments each sub-graph with a small amount of edge and vertex information that is approximated from all other sub-graphs. The subgraph approximation approach helps the distributed training system converge at single-machine accuracy, while keeping the memory footprint low and minimizing synchronization overhead between the machines. ",/pdf/63fcb680a4c3494c75e9a014b5858d8e3cdc1ed9.pdf,ICLR,2021, +rkeYUsRqKQ,ryxlBR45tm,1538090000000.0,1545360000000.0,196,An Adversarial Learning Framework for a Persona-based Multi-turn Dialogue Model,"[""oluwatobi.olabiyi@capitalone.com"", ""anish.khazan@capitalone.com"", ""alan.salimov@capitalone.com"", ""erik.mueller@capitalone.com""]","[""Oluwatobi O. Olabiyi"", ""Anish Khazane"", ""Alan Salimov"", ""Erik T.Mueller""]","[""conversation model"", ""dialogue system"", ""adversarial net"", ""persona""]","In this paper, we extend the persona-based sequence-to-sequence (Seq2Seq) neural network conversation model to a multi-turn dialogue scenario by modifying the state-of-the-art hredGAN architecture to simultaneously capture utterance attributes such as speaker identity, dialogue topic, speaker sentiments and so on. 
The proposed system, phredGAN has a persona-based HRED generator (PHRED) and a conditional discriminator. We also explore two approaches to accomplish the conditional discriminator: (1) $phredGAN_a$, a system that passes the attribute representation as an additional input into a traditional adversarial discriminator, and (2) $phredGAN_d$, a dual discriminator system which in addition to the adversarial discriminator, collaboratively predicts the attribute(s) that generated the input utterance. To demonstrate the superior performance of phredGAN over the persona SeqSeq model, we experiment with two conversational datasets, the Ubuntu Dialogue Corpus (UDC) and TV series transcripts from the Big Bang Theory and Friends. Performance comparison is made with respect to a variety of quantitative measures as well as crowd-sourced human evaluation. We also explore the trade-offs from using either variant of $phredGAN$ on datasets with many but weak attribute modalities (such as with Big Bang Theory and Friends) and ones with few but strong attribute modalities (customer-agent interactions in Ubuntu dataset).",/pdf/aae86fa9eccfebea5bbe0a03b1c7fbd37e026724.pdf,ICLR,2019,This paper develops an adversarial learning framework for neural conversation models with persona +B1eKk2CcKm,S1gr5TTqt7,1538090000000.0,1545360000000.0,1005,Towards the Latent Transcriptome,"[""trofimov.assya@gmail.com"", ""frdutil@gmail.com"", ""claude.perreault@umontreal.ca"", ""s.lemieux@umontreal.ca"", ""yoshua.bengio@mila.quebec"", ""joseph@josephpcohen.com""]","[""Assya Trofimov"", ""Francis Dutil"", ""Claude Perreault"", ""Sebastien Lemieux"", ""Yoshua Bengio"", ""Joseph Paul Cohen""]","[""representation learning"", ""RNA-Seq"", ""gene expression"", ""bioinformatics"", ""computational biology"", ""transcriptomics"", ""deep learning"", ""genomics""]","In this work we propose a method to compute continuous embeddings for kmers from raw RNA-seq data, in a reference-free fashion. 
We report that our model captures information of both DNA sequence similarity as well as DNA sequence abundance in the embedding latent space. We confirm the quality of these vectors by comparing them to known gene sub-structures and report that the latent space recovers exon information from raw RNA-Seq data from acute myeloid leukemia patients. Furthermore we show that this latent space allows the detection of genomic abnormalities such as translocations as well as patient-specific mutations, making this representation space both useful for visualization as well as analysis.",/pdf/2370fd672042c7948e940fa6fc68b501b183f3b6.pdf,ICLR,2019, +H1T2hmZAb,HkjnhX-0Z,1509140000000.0,1519440000000.0,1172,Deep Complex Networks,"[""chiheb.trabelsi@polymtl.ca"", ""olexa.bilaniuk@umontreal.ca"", ""ying.zhang@umontreal.ca"", ""serdyuk@iro.umontreal.ca"", ""sandeep.subramanian.1@umontreal.ca"", ""jfsantos@emt.inrs.ca"", ""soroush.mehri@microsoft.com"", ""negar@elementai.com"", ""yoshua.bengio@umontreal.ca"", ""christopher.pal@polymtl.ca""]","[""Chiheb Trabelsi"", ""Olexa Bilaniuk"", ""Ying Zhang"", ""Dmitriy Serdyuk"", ""Sandeep Subramanian"", ""Joao Felipe Santos"", ""Soroush Mehri"", ""Negar Rostamzadeh"", ""Yoshua Bengio"", ""Christopher J Pal""]","[""deep learning"", ""complex-valued neural networks""]","At present, the vast majority of building blocks, techniques, and architectures for deep learning are based on real-valued operations and representations. However, recent work on recurrent neural networks and older fundamental theoretical analysis suggests that complex numbers could have a richer representational capacity and could also facilitate noise-robust memory retrieval mechanisms. Despite their attractive properties and potential for opening up entirely new neural architectures, complex-valued deep neural networks have been marginalized due to the absence of the building blocks required to design such models. 
In this work, we provide the key atomic components for complex-valued deep neural networks and apply them to convolutional feed-forward networks. More precisely, we rely on complex convolutions and present algorithms for complex batch-normalization, complex weight initialization strategies for complex-valued neural nets and we use them in experiments with end-to-end training schemes. We demonstrate that such complex-valued models are competitive with their real-valued counterparts. We test deep complex models on several computer vision tasks, on music transcription using the MusicNet dataset and on Speech spectrum prediction using TIMIT. We achieve state-of-the-art performance on these audio-related tasks.",/pdf/29671e1339962c29eaf91a3e54ee106980863e1f.pdf,ICLR,2018, +Sywh5KYex,,1478230000000.0,1486150000000.0,113,Learning Identity Mappings with Residual Gates,"[""savarese@land.ufrj.br"", ""leonardomazza@poli.ufrj.br"", ""daniel@land.ufrj.br""]","[""Pedro H. P. Savarese"", ""Leonardo O. Mazza"", ""Daniel R. Figueiredo""]","[""Computer vision"", ""Deep learning"", ""Optimization""]","We propose a layer augmentation technique that adds shortcut connections with a linear gating mechanism, and can be applied to almost any network model. By using a scalar parameter to control each gate, we provide a way to learn identity mappings by optimizing only one parameter. We build upon the motivation behind Highway Neural Networks and Residual Networks, where a layer is reformulated in order to make learning identity mappings less problematic to the optimizer. The augmentation introduces only one extra parameter per layer, and provides easier optimization by making degeneration into identity mappings simpler. Experimental results show that augmenting layers provides better optimization, increased performance, and more layer independence. 
We evaluate our method on MNIST using fully-connected networks, showing empirical indications that our augmentation facilitates the optimization of deep models, and that it provides high tolerance to full layer removal: the model retains over 90% of its performance even after half of its layers have been randomly removed. In our experiments, augmented plain networks -- which can be interpreted as simplified Highway Neural Networks -- outperform ResNets, raising new questions on how shortcut connections should be designed. We also evaluate our model on CIFAR-10 and CIFAR-100 using augmented Wide ResNets, achieving 3.65% and 18.27% test error, respectively.",/pdf/7c82ff06cea8f836f8982b749b2630804d9acd5c.pdf,ICLR,2017,This paper proposes adding simple gates to layers to make learning identity mappings trivial. It also introduces Gated Plain Networks and Gated Residual Networks. +mLtPtH2SIHX,ihramOpxDekM,1601310000000.0,1614990000000.0,1706,Bypassing the Random Input Mixing in Mixup,"[""~Hongyu_Guo1""]","[""Hongyu Guo""]","[""Deep Learning"", ""Data Augmentation"", ""Mixup""]","Mixup and its variants have promoted a surge of interest due to their capability of boosting the accuracy of deep models. For a random sample pair, such approaches generate a set of synthetic samples through interpolating both the inputs and their corresponding one-hot labels. Current methods either interpolate random features from an input pair or learn to mix salient features from the pair. Nevertheless, the former methods can create misleading synthetic samples or remove important features from the given inputs, and the latter strategies incur significant computation cost for selecting descriptive input regions. In this paper, we show that the effort needed for the input mixing can be bypassed. For a given sample pair, averaging the features from the two inputs and then assigning it with a set of soft labels can effectively regularize the training. 
We empirically show that the proposed approach performs on par with state-of-the-art strategies in terms of predictive accuracy. ",/pdf/3ede09fe30999ca281b179cdafdfbc91398f9f76.pdf,ICLR,2021,We report a finding that one can bypass the random input mixing in Mixup to conduct effective model regularization. +BJgPCveAW,rkyPRDlA-,1509090000000.0,1518730000000.0,300,Characterizing Sparse Connectivity Patterns in Neural Networks,"[""souryade@usc.edu"", ""kuanwenh@usc.edu"", ""pabeerel@usc.edu"", ""chugg@usc.edu""]","[""Sourya Dey"", ""Kuan-Wen Huang"", ""Peter A. Beerel"", ""Keith M. Chugg""]","[""Machine learning"", ""Neural networks"", ""Sparse neural networks"", ""Pre-defined sparsity"", ""Scatter"", ""Connectivity patterns"", ""Adjacency matrix"", ""Parameter Reduction"", ""Morse code""]","We propose a novel way of reducing the number of parameters in the storage-hungry fully connected layers of a neural network by using pre-defined sparsity, where the majority of connections are absent prior to starting training. Our results indicate that convolutional neural networks can operate without any loss of accuracy at less than 0.5% classification layer connection density, or less than 5% overall network connection density. We also investigate the effects of pre-defining the sparsity of networks with only fully connected layers. Based on our sparsifying technique, we introduce the `scatter' metric to characterize the quality of a particular connection pattern. As proof of concept, we show results on CIFAR, MNIST and a new dataset on classifying Morse code symbols, which highlights some interesting trends and limits of sparse connection patterns.",/pdf/b605751d37d5913194d55f4d62151378dc994643.pdf,ICLR,2018,Neural networks can be pre-defined to have sparse connectivity without performance degradation. 
+B1NGT8xCZ,SJNGaUxA-,1509090000000.0,1518730000000.0,278,Principled Hybrids of Generative and Discriminative Domain Adaptation,"[""han.zhao@cs.cmu.edu"", ""zhenyaozhu@baidu.com"", ""junjieh@cmu.edu"", ""adamcoates@baidu.com"", ""ggordon@cs.cmu.edu""]","[""Han Zhao"", ""Zhenyao Zhu"", ""Junjie Hu"", ""Adam Coates"", ""Geoff Gordon""]","[""domain adaptation"", ""neural networks"", ""generative models"", ""discriminative models""]","We propose a probabilistic framework for domain adaptation that blends both generative and discriminative modeling in a principled way. Under this framework, generative and discriminative models correspond to specific choices of the prior over parameters. This provides us a very general way to interpolate between generative and discriminative extremes through different choices of priors. By maximizing both the marginal and the conditional log-likelihoods, models derived from this framework can use both labeled instances from the source domain as well as unlabeled instances from \emph{both} source and target domains. Under this framework, we show that the popular reconstruction loss of autoencoder corresponds to an upper bound of the negative marginal log-likelihoods of unlabeled instances, where marginal distributions are given by proper kernel density estimations. This provides a way to interpret the empirical success of autoencoders in domain adaptation and semi-supervised learning. We instantiate our framework using neural networks, and build a concrete model, \emph{DAuto}. Empirically, we demonstrate the effectiveness of DAuto on text, image and speech datasets, showing that it outperforms related competitors when domain adaptation is possible. 
+",/pdf/e557edd1fa3e2a9b24aa6dc1b4069763ef957073.pdf,ICLR,2018, +HyePberFvH,rJfbtggFDB,1569440000000.0,1577170000000.0,2140,Monte Carlo Deep Neural Network Arithmetic,"[""julian.faraone@sydney.edu.au"", ""philip.leong@sydney.edu.au""]","[""Julian Faraone"", ""Philip Leong""]","[""deep learning"", ""quantization"", ""floating point"", ""monte carlo methods""]","Quantization is a crucial technique for achieving low-power, low latency and high throughput hardware implementations of Deep Neural Networks. Quantized floating point representations have received recent interest due to their hardware efficiency benefits and ability to represent a higher dynamic range than fixed point representations, leading to improvements in accuracy. We present a novel technique, Monte Carlo Deep Neural Network Arithmetic (MCA), for determining the sensitivity of Deep Neural Networks to quantization in floating point arithmetic. We do this by applying Monte Carlo Arithmetic to the inference computation and analyzing the relative standard deviation of the neural network loss. The method makes no assumptions regarding the underlying parameter distributions. We evaluate our method on pre-trained image classification models on the CIFAR10 and ImageNet datasets. For the same network topology and dataset, we demonstrate the ability to gain the equivalent of bits of precision by simply choosing weight parameter sets which demonstrate a lower loss of significance from the Monte Carlo trials. 
Additionally, we can apply MCA to compare the sensitivity of different network topologies to quantization effects.",/pdf/5cbe0a087592d00d6cde7285273883d9ff46d291.pdf,ICLR,2020,Determining the sensitivity of Deep Neural Networks to floating point rounding error using Monte Carlo Methods +rkxs0yHFPH,HJgeAwJYDB,1569440000000.0,1583910000000.0,2037,SpikeGrad: An ANN-equivalent Computation Model for Implementing Backpropagation with Spikes,"[""johannes.thiele@cea.fr"", ""olivier.bichler@cea.fr"", ""antoine.dupret@cea.fr""]","[""Johannes C. Thiele"", ""Olivier Bichler"", ""Antoine Dupret""]","[""spiking neural network"", ""neuromorphic engineering"", ""backpropagation""]","Event-based neuromorphic systems promise to reduce the energy consumption of deep neural networks by replacing expensive floating point operations on dense matrices by low energy, sparse operations on spike events. While these systems can be trained increasingly well using approximations of the backpropagation algorithm, this usually requires high precision errors and is therefore incompatible with the typical communication infrastructure of neuromorphic circuits. In this work, we analyze how the gradient can be discretized into spike events when training a spiking neural network. To accelerate our simulation, we show that using a special implementation of the integrate-and-fire neuron allows us to describe the accumulated activations and errors of the spiking neural network in terms of an equivalent artificial neural network, allowing us to largely speed up training compared to an explicit simulation of all spike events. This way we are able to demonstrate that even for deep networks, the gradients can be discretized sufficiently well with spikes if the gradient is properly rescaled. 
This form of spike-based backpropagation enables us to achieve equivalent or better accuracies on the MNIST and CIFAR10 datasets than comparable state-of-the-art spiking neural networks trained with full precision gradients. The algorithm, which we call SpikeGrad, is based on only accumulation and comparison operations and can naturally exploit sparsity in the gradient computation, which makes it an interesting choice for a spiking neuromorphic systems with on-chip learning capacities.",/pdf/f6708d62ff26dc70ee82019e32d3c6c59f6989d0.pdf,ICLR,2020,An implementation of the backpropagation algorithm using spiking neurons for forward and backward propagation. +Bye6weHFvB,Hkxvz2gKvB,1569440000000.0,1577170000000.0,2377,Plan2Vec: Unsupervised Representation Learning by Latent Plans,"[""yangge1987@gmail.com"", ""amyzhang2011@gmail.com"", ""arimorcos@gmail.com"", ""jpineau@cs.mcgill.ca"", ""pabbeel@cs.berkeley.edu"", ""rcalandra@fb.com""]","[""Ge Yang"", ""Amy Zhang"", ""Ari Morcos"", ""Joelle Pineau"", ""Pieter Abbeel"", ""Roberto Calandra""]","[""Unsupervised Learning"", ""Reinforcement Learning"", ""Manifold Learning""]","Creating a useful representation of the world takes more than just rote memorization of individual data samples. This is because fundamentally, we use our internal representation to plan, to solve problems, and to navigate the world. For a representation to be amenable to planning, it is critical for it to embody some notion of optimality. A representation learning objective that explicitly considers some form of planning should generate representations which are more computationally valuable than those that memorize samples. In this paper, we introduce \textbf{Plan2Vec}, an unsupervised representation learning objective inspired by value-based reinforcement learning methods. 
By abstracting away low-level control with a learned local metric, we show that it is possible to learn plannable representations that inform long-range structures, entirely passively from high-dimensional sequential datasets without supervision. A latent space is learned by playing an ``Imagined Planning Game"" on the graph formed by the data points, using a local metric function trained contrastively from context. We show that the global metric on this learned embedding can be used to plan with O(1) complexity by linear interpolation. This exponential speed-up is critical for planning with a learned representation on any problem containing non-trivial global topology. We demonstrate the effectiveness of Plan2Vec on simulated toy tasks from both proprioceptive and image states, as well as two real-world image datasets, showing that Plan2Vec can effectively plan using learned representations. Additional results and videos can be found at \url{https://sites.google.com/view/plan2vec}.",/pdf/1aab5aa19b2a3cd1726f8258818794967d68f257.pdf,ICLR,2020,"Plan2Vec poses unsupervised representation learning as an RL problem, to extend local information to a consistent global embedding" +1cEEqSp9kXV,dJ1ZrfvjByk,1601310000000.0,1614990000000.0,2312,Constructing Multiple High-Quality Deep Neural Networks: A TRUST-TECH Based Approach,"[""~Zhiyong_Hao1"", ""~Hsiao-Dong_Chiang1"", ""bw297@cornell.edu""]","[""Zhiyong Hao"", ""Hsiao-Dong Chiang"", ""Bin Wang""]","[""Nonlinear Dynamical Systems"", ""Global Optimization"", ""Deep Neural Networks"", ""Ensemble.""]","The success of deep neural networks relied heavily on efficient stochastic gradient descent-like training methods. However, these methods are sensitive to initialization and hyper-parameters. 
+In this paper, a systematical method for finding multiple high-quality local optimal deep neural networks from a single training session, using the TRUST-TECH (TRansformation Under Stability-reTaining Equilibria Characterization) method, is introduced. +To realize effective TRUST-TECH searches to train deep neural networks on large datasets, a dynamic search paths (DSP) method is proposed to provide an improved search guidance in TRUST-TECH method. +The proposed DSP-TT method is implemented such that the computation graph remains constant during the search process, with only minor GPU memory overhead and requires just one training session to obtain multiple local optimal solutions (LOS). To take advantage of these LOSs, we also propose an improved ensemble method. Experiments on image classification datasets show that our method improves the testing performance by a substantial margin. Specifically, our fully-trained DSP-TT ResNet ensemble improves the SGD baseline by 20\% (CIFAR10) and 15\% (CIFAR100). Furthermore, our method shows several advantages over other ensembling methods. ",/pdf/cc35da6347cfb331c52993ff76dac0d4089dc5ee.pdf,ICLR,2021,We propose a novel method of obtaining multiple diverse networks systematically in one training run. +SyeQFiCcF7,Hyl1LL5cFX,1538090000000.0,1545360000000.0,429,Siamese Capsule Networks,"[""james.o-neill@liverpool.ac.uk""]","[""James O' Neill""]","[""capsule networks"", ""pairwise learning"", ""few-shot learning"", ""face verification""]","Capsule Networks have shown encouraging results on de facto benchmark computer vision datasets such as MNIST, CIFAR and smallNORB. However, they are yet to be tested on tasks where (1) the entities detected inherently have more complex internal representations and (2) there are very few instances per class to learn from and (3) where point-wise classification is not suitable. 
Hence, this paper carries out experiments on face verification in both controlled and uncontrolled settings that together address these points. In doing so we introduce Siamese Capsule Networks, a new variant that can be used for pairwise learning tasks. The model is trained using contrastive loss with l2-normalized capsule encoded pose features. We find that Siamese Capsule Networks perform well against strong baselines on both pairwise learning datasets, yielding best results in the few-shot learning setting where image pairs in the test set contain unseen subjects.",/pdf/56751b128ffa0216b6353ede94a96d1d8c587d49.pdf,ICLR,2019,A variant of capsule networks that can be used for pairwise learning tasks. Results show that Siamese Capsule Networks work well in the few-shot learning setting. +BJJ9bz-0-,SkRKWz-Ab,1509130000000.0,1518730000000.0,778,Reinforcement Learning from Imperfect Demonstrations,"[""yg@eecs.berkeley.edu"", ""huazhe_xu@eecs.berkeley.edu"", ""lin-j14@mails.tsinghua.edu.cn"", ""fy@eecs.berkeley.edu"", ""svlevine@eecs.berkeley.edu"", ""trevor@eecs.berkeley.edu""]","[""Yang Gao"", ""Huazhe(Harry) Xu"", ""Ji Lin"", ""Fisher Yu"", ""Sergey Levine"", ""Trevor Darrell""]","[""learning from demonstration"", ""reinforcement learning"", ""maximum entropy learning""]","Robust real-world learning should benefit from both demonstrations and interaction with the environment. Current approaches to learning from demonstration and reward perform supervised learning on expert demonstration data and use reinforcement learning to further improve performance based on reward from the environment. These tasks have divergent losses which are difficult to jointly optimize; further, such methods can be very sensitive to noisy demonstrations. We propose a unified reinforcement learning algorithm that effectively normalizes the Q-function, reducing the Q-values of actions unseen in the demonstration data. 
Our Normalized Actor-Critic (NAC) method can learn from demonstration data of arbitrary quality and also leverages rewards from an interactive environment. NAC learns an initial policy network from demonstration and refines the policy in a real environment. Crucially, both learning from demonstration and interactive refinement use exactly the same objective, unlike prior approaches that combine distinct supervised and reinforcement losses. This makes NAC robust to suboptimal demonstration data, since the method is not forced to mimic all of the examples in the dataset. We show that our unified reinforcement learning algorithm can learn robustly and outperform existing baselines when evaluated on several realistic driving games.",/pdf/472dfbe7059af65461c84e0e210c27f8245c9421.pdf,ICLR,2018, +#NAME?,lQuU_BLMWcw,1601310000000.0,1614990000000.0,3004,Explore with Dynamic Map: Graph Structured Reinforcement Learning,"[""~Jiarui_Jin1"", ""zhousijin@sjtu.edu.cn"", ""~Weinan_Zhang1"", ""~Rasool_Fakoor1"", ""~David_Wipf1"", ""~Tong_He5"", ""~Yong_Yu1"", ""~Zheng_Zhang1"", ""~Alex_Smola1""]","[""Jiarui Jin"", ""Sijin Zhou"", ""Weinan Zhang"", ""Rasool Fakoor"", ""David Wipf"", ""Tong He"", ""Yong Yu"", ""Zheng Zhang"", ""Alex Smola""]","[""Deep Reinforcement Learning"", ""Graph Structured Reinforcement Learning"", ""Exploration""]","In reinforcement learning, a map with states and transitions built based on historical trajectories is often helpful in exploration and exploitation. Even so, learning and planning on such a map within a sparse environment remains a challenge. As a step towards this goal, we propose Graph Structured Reinforcement Learning (GSRL), which utilizes historical trajectories to slowly adjust exploration directions and learn related experiences while rapidly updating the value function estimation. 
GSRL constructs a dynamic graph on top of state transitions in the replay buffer based on historical trajectories, and develops an attention strategy on the map to select an appropriate goal direction, which decomposes the task of reaching a distant goal state into a sequence of easier tasks. We also leverage graph structure to sample related trajectories for efficient value learning. Results demonstrate that GSRL can outperform the state-of-the-art algorithms in terms of sample efficiency on benchmarks with sparse reward functions. ",/pdf/03ca48a13c4411ee18caa4920549c8e552ea8598.pdf,ICLR,2021,"In this paper, we propose to construct a dynamic graph on top of state transitions in replay buffer and leverage graph structure for learning and planning." +S1GcHsAqtm,Hylz7FIYY7,1538090000000.0,1545360000000.0,112,Adaptive Pruning of Neural Language Models for Mobile Devices,"[""r33tang@uwaterloo.ca"", ""jimmylin@uwaterloo.ca""]","[""Raphael Tang"", ""Jimmy Lin""]","[""Inference-time pruning"", ""Neural Language Models""]","Neural language models (NLMs) exist in an accuracy-efficiency tradeoff space where better perplexity typically comes at the cost of greater computation complexity. In a software keyboard application on mobile devices, this translates into higher power consumption and shorter battery life. This paper represents the first attempt, to our knowledge, in exploring accuracy-efficiency tradeoffs for NLMs. Building on quasi-recurrent neural networks (QRNNs), we apply pruning techniques to provide a ""knob"" to select different operating points. In addition, we propose a simple technique to recover some perplexity using a negligible amount of memory. Our empirical evaluations consider both perplexity as well as energy consumption on a Raspberry Pi, where we demonstrate which methods provide the best perplexity-power consumption operating point. 
At one operating point, one of the techniques is able to provide energy savings of 40% over the state of the art with only a 17% relative increase in perplexity.",/pdf/a713b9c004963c36d77ba4d80f0444ad0de3a461.pdf,ICLR,2019, +RtNpzLdHUAW,s1IeiPbDsS4,1601310000000.0,1614990000000.0,588,Stochastic Subset Selection for Efficient Training and Inference of Neural Networks,"[""~Andreis_Bruno1"", ""~Tuan_Nguyen3"", ""~Juho_Lee2"", ""~Eunho_Yang1"", ""~Sung_Ju_Hwang1""]","[""Andreis Bruno"", ""Tuan Nguyen"", ""Juho Lee"", ""Eunho Yang"", ""Sung Ju Hwang""]","[""efficient deep learning"", ""meta learning"", ""efficient training"", ""data compression"", ""instance selection""]","Current machine learning algorithms are designed to work with huge volumes of high dimensional data such as images. However, these algorithms are being increasingly deployed to resource constrained systems such as mobile devices and embedded systems. Even in cases where large computing infrastructure is available, the size of each data instance, as well as datasets, can provide a huge bottleneck in data transfer across communication channels. Also, there is a huge incentive both in energy and monetary terms in reducing both the computational and memory requirements of these algorithms. For non-parametric models that require to leverage the stored training data at the inference time, the increased cost in memory and computation could be even more problematic. In this work, we aim to reduce the volume of data these algorithms must process through an end-to-end two-stage neural subset selection model, where the first stage selects a set of candidate points using a conditionally independent Bernoulli mask followed by an iterative coreset selection via a conditional Categorical distribution. The subset selection model is trained by meta-learning with a distribution of sets. 
We validate our method on set reconstruction and classification tasks with feature selection as well as the selection of representative samples from a given dataset, on which our method outperforms relevant baselines. We also show in our experiments that our method enhances scalability of non-parametric models such as Neural Processes.",/pdf/36e4ddca1ab6d9563fdd5c17e972be76bb776bf4.pdf,ICLR,2021,We learn a stochastic subset selection model for instance and pixel selection that reduces computational and storage cost. +Skxw-REFwS,rkevCBVOwS,1569440000000.0,1577170000000.0,969,Unsupervised Progressive Learning and the STAM Architecture,"[""jamessealesmith@gatech.edu"", ""constantine@gatech.edu""]","[""James Smith"", ""Constantine Dovrolis""]","[""continual learning"", ""unsupervised learning"", ""online learning""]","We first pose the Unsupervised Progressive Learning (UPL) problem: learning salient representations from a non-stationary stream of unlabeled data in which the number of object classes increases with time. If some limited labeled data is also available, those representations can be associated with specific classes, thus enabling classification tasks. To solve the UPL problem, we propose an architecture that involves an online clustering module, called Self-Taught Associative Memory (STAM). Layered hierarchies of STAM modules learn based on a combination of online clustering, novelty detection, forgetting outliers, and storing only prototypical representations rather than specific examples. The goal of this paper is to introduce the UPL problem, describe the STAM architecture, and evaluate the latter in the UPL context. ",/pdf/1fb579c390e9e16814037ac30da12039ffe6457c.pdf,ICLR,2020,We introduce Unsupervised Progressive Learning (UPL) and evaluate a neuro-inspired architecture: Self-Taught Associative Memory (STAM). 
+HyeSin4FPB,rJeeaSyWDr,1569440000000.0,1583910000000.0,151,Learning to Control PDEs with Differentiable Physics,"[""philipp.holl@tum.de"", ""nils.thuerey@tum.de"", ""vkoltun@gmail.com""]","[""Philipp Holl"", ""Nils Thuerey"", ""Vladlen Koltun""]","[""Differentiable physics"", ""Optimal control"", ""Deep learning""]","Predicting outcomes and planning interactions with the physical world are long-standing goals for machine learning. A variety of such tasks involves continuous physical systems, which can be described by partial differential equations (PDEs) with many degrees of freedom. Existing methods that aim to control the dynamics of such systems are typically limited to relatively short time frames or a small number of interaction parameters. We present a novel hierarchical predictor-corrector scheme which enables neural networks to learn to understand and control complex nonlinear physical systems over long time frames. We propose to split the problem into two distinct tasks: planning and control. To this end, we introduce a predictor network that plans optimal trajectories and a control network that infers the corresponding control parameters. Both stages are trained end-to-end using a differentiable PDE solver. We demonstrate that our method successfully develops an understanding of complex physical systems and learns to control them for tasks involving PDEs such as the incompressible Navier-Stokes equations.",/pdf/53667f36c5866c5204576e75122a448bb6ad7eeb.pdf,ICLR,2020,We train a combination of neural networks to predict optimal trajectories for complex physical systems. 
+ByliZgBKPH,Skl31ZxYvH,1569440000000.0,1620730000000.0,2148,Policy path programming,"[""daniel.c.mcnamee@gmail.com""]","[""Daniel McNamee""]","[""markov decision process"", ""planning"", ""hierarchical"", ""reinforcement learning""]","We develop a normative theory of hierarchical model-based policy optimization for Markov decision processes resulting in a full-depth, full-width policy iteration algorithm. This method performs policy updates which integrate reward information over all states at all horizons simultaneously thus sequentially maximizing the expected reward obtained per algorithmic iteration. Effectively, policy path programming ascends the expected cumulative reward gradient in the space of policies defined over all state-space paths. An exact formula is derived which finitely parametrizes these path gradients in terms of action preferences. Policy path gradients can be directly computed using an internal model thus obviating the need to sample paths in order to optimize in depth. They are quadratic in successor representation entries and afford natural generalizations to higher-order gradient techniques. 
In simulations, it is shown that intuitive hierarchical reasoning is emergent within the associated policy optimization dynamics.",/pdf/2bfb0137b82e2798a3ef6cf5693e7ecfafb12c4d.pdf,ICLR,2020,A normative theory of hierarchical model-based policy optimization +S1exA2NtDB,BylPSB44DB,1569440000000.0,1583910000000.0,250,ES-MAML: Simple Hessian-Free Meta Learning,"[""xsong@berkeley.edu"", ""wg2279@columbia.edu"", ""yxyang@google.com"", ""kchoro@google.com"", ""pacchiano@berkeley.edu"", ""yt2541@columbia.edu""]","[""Xingyou Song"", ""Wenbo Gao"", ""Yuxiang Yang"", ""Krzysztof Choromanski"", ""Aldo Pacchiano"", ""Yunhao Tang""]","[""ES"", ""MAML"", ""evolution"", ""strategies"", ""meta"", ""learning"", ""gaussian"", ""perturbation"", ""reinforcement"", ""learning"", ""adaptation""]","We introduce ES-MAML, a new framework for solving the model agnostic meta learning (MAML) problem based on Evolution Strategies (ES). Existing algorithms for MAML are based on policy gradients, and incur significant difficulties when attempting to estimate second derivatives using backpropagation on stochastic policies. We show how ES can be applied to MAML to obtain an algorithm which avoids the problem of estimating second derivatives, and is also conceptually simple and easy to implement. Moreover, ES-MAML can handle new types of nonsmooth adaptation operators, and other techniques for improving performance and estimation of ES methods become applicable. We show empirically that ES-MAML is competitive with existing methods and often yields better adaptation with fewer queries.",/pdf/e3cf50e5461bd1dcdea3f36650ad042605fb45e6.pdf,ICLR,2020,"We provide a new framework for MAML in the ES/blackbox setting, and show that it allows deterministic and linear policies, better exploration, and non-differentiable adaptation operators." 
+rygHq6EFwr,BJg_W36vwr,1569440000000.0,1577170000000.0,705,GResNet: Graph Residual Network for Reviving Deep GNNs from Suspended Animation,"[""jiawei@ifmlab.org"", ""lin@ifmlab.org""]","[""Jiawei Zhang"", ""Lin Meng""]","[""Graph Neural Networks"", ""Node Classification"", ""Representation Learning""]","The existing graph neural networks (GNNs) based on the spectral graph convolutional operator have been criticized for their performance degradation, which is especially common for models with deep architectures. In this paper, we further identify the suspended animation problem with the existing GNNs. Such a problem happens when the model depth reaches the suspended animation limit, at which point the model no longer responds to the training data and becomes unlearnable. An analysis of the causes of the suspended animation problem with existing GNNs is provided in this paper, and several other peripheral factors that impact the problem are reported as well. To resolve the problem, we introduce the GRESNET (Graph Residual Network) framework in this paper, which creates extensively connected highways to involve nodes’ raw features or intermediate representations throughout the graph for all the model layers. Unlike in other learning settings, the extensive connections in graph data cause the existing simple residual learning methods to fail. We prove the effectiveness of the introduced new graph residual terms from the norm preservation perspective, which helps avoid dramatic changes to the node’s representations between sequential layers. Detailed studies of the GRESNET framework for many existing GNNs, including GCN, GAT and LOOPYNET, are reported in the paper with extensive empirical experiments on real-world benchmark datasets.",/pdf/eb260039993ec90976aaef54278ffb7f24011343.pdf,ICLR,2020,"Identifying suspended animation problem with GNNs, propose a new model to resolve the problem with graph residual learning." 
+S1JG13oee,,1478370000000.0,1481800000000.0,570,b-GAN: Unified Framework of Generative Adversarial Networks,"[""uehara-masatoshi136@g.ecc.u-tokyo.ac.jp"", ""sato@k.u-tokyo.ac.jp"", ""masa@weblab.t.u-tokyo.ac.jp"", ""nakayama@weblab.t.u-tokyo.ac.jp"", ""matsuo@weblab.t.u-tokyo.ac.jp""]","[""Masatosi Uehara"", ""Issei Sato"", ""Masahiro Suzuki"", ""Kotaro Nakayama"", ""Yutaka Matsuo""]","[""Deep learning"", ""Unsupervised Learning""]","Generative adversarial networks (GANs) are successful deep generative models. They are based on a two-player minimax game. However, the objective function derived in the original motivation is changed to obtain stronger gradients when learning the generator. We propose a novel algorithm that repeats density ratio estimation and f-divergence minimization. Our algorithm offers a new unified perspective toward understanding GANs and is able to make use of multiple viewpoints obtained from the density ratio estimation research, e.g. what divergence is stable and relative density ratio is useful. ",/pdf/debdd3674598cc40a4812df3eeaf7e814823d3bd.pdf,ICLR,2017,New Unified Framework of Generative Adversarial Networks using Bregman divergence beyond f-GAN +rJeB22VFvS,r1xMFTbMDr,1569440000000.0,1577170000000.0,188,Towards More Realistic Neural Network Uncertainties,"[""joachim.sicking@iais.fraunhofer.de"", ""alexander.kister@iais.fraunhofer.de"", ""matthias.fahrland@iav.de"", ""stefan.eickeler@iais.fraunhofer.de"", ""fabian.hueger@volkswagen.de"", ""stefan.rueping@iais.fraunhofer.de"", ""peter.schlicht@volkswagen.de"", ""tim.wirtz@iais.fraunhofer.de""]","[""Joachim Sicking"", ""Alexander Kister"", ""Matthias Fahrland"", ""Stefan Eickeler"", ""Fabian Hueger"", ""Stefan Rueping"", ""Peter Schlicht"", ""Tim Wirtz""]","[""uncertainty"", ""variational inference"", ""MC dropout"", ""variational autoencoder"", ""evaluation""]","Statistical models are inherently uncertain. 
Quantifying or at least upper-bounding their uncertainties is vital for safety-critical systems. While standard neural networks do not report this information, several approaches exist to integrate uncertainty estimates into them. Assessing the quality of these uncertainty estimates is not straightforward, as no direct ground truth labels are available. Instead, implicit statistical assessments are required. For regression, we propose to evaluate uncertainty realism---a strict quality criterion---with a Mahalanobis distance-based statistical test. An empirical evaluation reveals the need for uncertainty measures that are appropriate to upper-bound heavy-tailed empirical errors. Alongside, we transfer the variational U-Net classification architecture to standard supervised image-to-image tasks. It provides two uncertainty mechanisms and significantly improves uncertainty realism compared to a plain encoder-decoder model.",/pdf/e306175babb8beb2eee62de1cbbe1aadb099c82c.pdf,ICLR,2020,We assess and improve the quality of neural network uncertainties by proposing an evaluation criterion and introducing a new uncertainty mechanism. +rJqBEPcxe,,1478290000000.0,1486780000000.0,378,Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations,"[""davidscottkrueger@gmail.com"", ""tegan.jrm@gmail.com"", ""ballas.n@gmail.com""]","[""David Krueger"", ""Tegan Maharaj"", ""Janos Kramar"", ""Mohammad Pezeshki"", ""Nicolas Ballas"", ""Nan Rosemary Ke"", ""Anirudh Goyal"", ""Yoshua Bengio"", ""Aaron Courville"", ""Christopher Pal""]","[""Deep learning""]","We propose zoneout, a novel method for regularizing RNNs. +At each timestep, zoneout stochastically forces some hidden units to maintain their previous values. +Like dropout, zoneout uses random noise to train a pseudo-ensemble, improving generalization. 
+But by preserving instead of dropping hidden units, gradient information and state information are more readily propagated through time, as in feedforward stochastic depth networks. +We perform an empirical investigation of various RNN regularizers, and find that zoneout gives significant performance improvements across tasks. We achieve competitive results with relatively simple models in character- and word-level language modelling on the Penn Treebank and Text8 datasets, and combining with recurrent batch normalization yields state-of-the-art results on permuted sequential MNIST.",/pdf/fdd2ff27bd56cb0045a9dee87e9dce9f8c134b2e.pdf,ICLR,2017,Zoneout is like dropout (for RNNs) but uses identity masks instead of zero masks +HkgxasA5Ym,BJlHtDh9t7,1538090000000.0,1545360000000.0,770,Reliable Uncertainty Estimates in Deep Neural Networks using Noise Contrastive Priors,"[""mail@danijar.com"", ""trandustin@google.com"", ""countzero@google.com"", ""alexirpan@google.com"", ""james@electric-thought.com""]","[""Danijar Hafner"", ""Dustin Tran"", ""Timothy Lillicrap"", ""Alex Irpan"", ""James Davidson""]","[""uncertainty estimates"", ""out of distribution"", ""bayesian neural network"", ""neural network priors"", ""regression"", ""active learning""]","Obtaining reliable uncertainty estimates of neural network predictions is a long standing challenge. Bayesian neural networks have been proposed as a solution, but it remains open how to specify their prior. In particular, the common practice of a standard normal prior in weight space imposes only weak regularities, causing the function posterior to possibly generalize in unforeseen ways on inputs outside of the training distribution. We propose noise contrastive priors (NCPs) to obtain reliable uncertainty estimates. The key idea is to train the model to output high uncertainty for data points outside of the training distribution. 
NCPs do so using an input prior, which adds noise to the inputs of the current mini batch, and an output prior, which is a wide distribution given these inputs. NCPs are compatible with any model that can output uncertainty estimates, are easy to scale, and yield reliable uncertainty estimates throughout training. Empirically, we show that NCPs prevent overfitting outside of the training distribution and result in uncertainty estimates that are useful for active learning. We demonstrate the scalability of our method on the flight delays data set, where we significantly improve upon previously published results.",/pdf/cc9cd0c72f18313016d47b5caea803b18bbc832c.pdf,ICLR,2019,We train neural networks to be uncertain on noisy inputs to avoid overconfident predictions outside of the training distribution. +IX3Nnir2omJ,_LX35PPtzuV,1601310000000.0,1616000000000.0,3152,Characterizing signal propagation to close the performance gap in unnormalized ResNets,"[""~Andrew_Brock1"", ""~Soham_De2"", ""~Samuel_L_Smith1""]","[""Andrew Brock"", ""Soham De"", ""Samuel L Smith""]","[""normalizers"", ""signal propagation"", ""deep learning"", ""neural networks"", ""ResNets"", ""EfficientNets"", ""ImageNet"", ""CNNs"", ""ConvNets""]","Batch Normalization is a key component in almost all state-of-the-art image classifiers, but it also introduces practical challenges: it breaks the independence between training examples within a batch, can incur compute and memory overhead, and often results in unexpected bugs. Building on recent theoretical analyses of deep ResNets at initialization, we propose a simple set of analysis tools to characterize signal propagation on the forward pass, and leverage these tools to design highly performant ResNets without activation normalization layers. Crucial to our success is an adapted version of the recently proposed Weight Standardization. 
Our analysis tools show how this technique preserves the signal in ReLU networks by ensuring that the per-channel activation means do not grow with depth. Across a range of FLOP budgets, our networks attain performance competitive with state-of-the-art EfficientNets on ImageNet.",/pdf/796f0f646a7dc728f2d8d89bc6d55288c9457889.pdf,ICLR,2021,"We show how to train ResNets completely without normalization, and attain performance competitive with batch-normalized EfficientNets." +r1h2DllAW,Hkt2DeeCb,1509060000000.0,1518730000000.0,213,Discrete-Valued Neural Networks Using Variational Inference,"[""roth@tugraz.at"", ""pernkopf@tugraz.at""]","[""Wolfgang Roth"", ""Franz Pernkopf""]","[""low-precision"", ""neural networks"", ""resource efficient"", ""variational inference"", ""Bayesian""]","The increasing demand for neural networks (NNs) being employed on embedded devices has led to plenty of research investigating methods for training low precision NNs. While most methods involve a quantization step, we propose a principled Bayesian approach where we first infer a distribution over a discrete weight space from which we subsequently derive hardware-friendly low precision NNs. To this end, we introduce a probabilistic forward pass to approximate the intractable variational objective that allows us to optimize over discrete-valued weight distributions for NNs with sign activation functions. In our experiments, we show that our model achieves state of the art performance on several real world data sets. 
In addition, the resulting models exhibit a substantial amount of sparsity that can be utilized to further reduce the computational costs for inference.",/pdf/5c610a586fbc2330c4ac85bd02f7226eb8fb2794.pdf,ICLR,2018,Variational Inference for infering a discrete distribution from which a low-precision neural network is derived +4qgEGwOtxU,nlP-N0r62n,1601310000000.0,1614990000000.0,3050,Importance and Coherence: Methods for Evaluating Modularity in Neural Networks,"[""~Shlomi_Hod1"", ""~Stephen_Casper1"", ""~Daniel_Filan1"", ""~Cody_Wild1"", ""~Andrew_Critch1"", ""~Stuart_Russell1""]","[""Shlomi Hod"", ""Stephen Casper"", ""Daniel Filan"", ""Cody Wild"", ""Andrew Critch"", ""Stuart Russell""]","[""interpretability"", ""modularity""]","As deep neural networks become more advanced and widely-used, it is important to understand their inner workings. Toward this goal, modular interpretations are appealing because they offer flexible levels of abstraction aside from standard architectural building blocks (e.g., neurons, channels, layers). In this paper, we consider the problem of assessing how functionally interpretable a given partitioning of neurons is. We propose two proxies for this: importance which reflects how crucial sets of neurons are to network performance, and coherence which reflects how consistently their neurons associate with input/output features. To measure these proxies, we develop a set of statistical methods based on techniques that have conventionally been used for the interpretation of individual neurons. We apply these methods on partitionings generated by a spectral clustering algorithm which uses a graph representation of the network's neurons and weights. We show that despite our partitioning algorithm using neither activations nor gradients, it reveals clusters with a surprising amount of importance and coherence. 
Together, these results support the use of modular interpretations, and graph-based partitionings in particular, for interpretability.",/pdf/19928882e1498c5cdfdc43aadc7c1b6bf90850e8.pdf,ICLR,2021,"Toward better tools for interpretability, we develop methods for evaluating modularity in neural networks and apply them to partitions of neurons from a graph-based clustering algorithm." +rkxJus0cFX,BJgO3UGFFm,1538090000000.0,1545360000000.0,320,RedSync : Reducing Synchronization Traffic for Distributed Deep Learning,"[""fang_jiarui@163.com"", ""rainfarmer@gmail.com""]","[""Jiarui Fang"", ""Cho-Jui Hsieh""]","[""Data parallel"", ""Deep Learning"", ""Multiple GPU system"", ""Communication Compression"", ""Sparsification"", ""Quantization""]","Data parallelism has become a dominant method to scale Deep Neural Network (DNN) training across multiple nodes. Since the synchronization of the local models or gradients can be a bottleneck for large-scale distributed training, compressing communication traffic has gained widespread attention recently. Among several recent proposed compression algorithms, +Residual Gradient Compression (RGC) is one of the most successful approaches---it can significantly compress the transmitting message size (0.1% of the gradient size) of each node and still preserve accuracy. However, the literature on compressing deep networks focuses almost exclusively on achieving good compression rate, while the efficiency of RGC in real implementation has been less investigated. In this paper, we develop an RGC method that achieves significant training time improvement in real-world multi-GPU systems. Our proposed RGC system design called RedSync, introduces a set of optimizations to reduce communication bandwidth while introducing limited overhead. We examine the performance of RedSync on two different multiple GPU platforms, including a supercomputer and a multi-card server. 
Our test cases include image classification on Cifar10 and ImageNet, and language modeling tasks on Penn Treebank and Wiki2 datasets. For DNNs featured with high communication to computation ratio, which has long been considered with poor scalability, RedSync shows significant performance improvement.",/pdf/86315f9a2f989e5fd7a4e26fed9f84dc9ab08e0b.pdf,ICLR,2019,We proposed an implementation to accelerate DNN data parallel training by reducing communication bandwidth requirement. +HkzSQhCcK7,r1gQfcp9Fm,1538090000000.0,1550420000000.0,1356,STCN: Stochastic Temporal Convolutional Networks,"[""eaksan@inf.ethz.ch"", ""otmar.hilliges@inf.ethz.ch""]","[""Emre Aksan"", ""Otmar Hilliges""]","[""latent variables"", ""variational inference"", ""temporal convolutional networks"", ""sequence modeling"", ""auto-regressive modeling""]","Convolutional architectures have recently been shown to be competitive on many +sequence modelling tasks when compared to the de-facto standard of recurrent neural networks (RNNs) while providing computational and modelling advantages due to inherent parallelism. However, currently, there remains a performance +gap to more expressive stochastic RNN variants, especially those with several layers of dependent random variables. In this work, we propose stochastic temporal convolutional networks (STCNs), a novel architecture that combines the computational advantages of temporal convolutional networks (TCN) with the representational power and robustness of stochastic latent spaces. In particular, we propose a hierarchy of stochastic latent variables that captures temporal dependencies at different time-scales. The architecture is modular and flexible due to the decoupling of the deterministic and stochastic layers. We show that the proposed architecture achieves state of the art log-likelihoods across several tasks. 
Finally, the model is capable of predicting high-quality synthetic samples over a long-range temporal horizon in modelling of handwritten text.",/pdf/40ea3242cdd7d5226389d47bdac8a9e9fdf0de66.pdf,ICLR,2019,We combine the computational advantages of temporal convolutional architectures with the expressiveness of stochastic latent variables. +ry3iBFqgl,,1478300000000.0,1484750000000.0,489,NEWSQA: A MACHINE COMPREHENSION DATASET,"[""adam.trischler@maluuba.com"", ""tong.wang@maluuba.com"", ""eric.yuan@maluuba.com"", ""justin.harris@maluuba.com"", ""alessandro.sordoni@maluuba.com"", ""phil.bachman@maluuba.com"", ""k.suleman@maluuba.com""]","[""Adam Trischler"", ""Tong Wang"", ""Xingdi Yuan"", ""Justin Harris"", ""Alessandro Sordoni"", ""Philip Bachman"", ""Kaheer Suleman""]","[""Natural language processing"", ""Deep learning""]","We present NewsQA, a challenging machine comprehension dataset of over 100,000 question-answer pairs. Crowdworkers supply questions and answers based on a set of over 10,000 news articles from CNN, with answers consisting in spans of text from the corresponding articles. We collect this dataset through a four- stage process designed to solicit exploratory questions that require reasoning. A thorough analysis confirms that NewsQA demands abilities beyond simple word matching and recognizing entailment. We measure human performance on the dataset and compare it to several strong neural models. The performance gap between humans and machines (25.3% F1) indicates that significant progress can be made on NewsQA through future research. 
The dataset is freely available at datasets.maluuba.com/NewsQA.",/pdf/458bb7690cd230d9285b4c8b5b04c52991c9de7f.pdf,ICLR,2017,Crowdsourced QA dataset with natural language questions and multi-word answers +VYfotZsQV5S,BYk70U8jJtJ,1601310000000.0,1614990000000.0,3698,MISSO: Minimization by Incremental Stochastic Surrogate Optimization for Large Scale Nonconvex and Nonsmooth Problems,"[""~Belhal_Karimi1"", ""~Hoi_To_Wai1"", ""~Eric_Moulines1"", ""~Ping_Li3""]","[""Belhal Karimi"", ""Hoi To Wai"", ""Eric Moulines"", ""Ping Li""]","[""nonconvex"", ""optimization"", ""stochastic"", ""sampling"", ""MCMC"", ""majorization-minimization""]","Many constrained, nonconvex and nonsmooth optimization problems can be tackled using the majorization-minimization (MM) method which alternates between constructing a surrogate function which upper bounds the objective function, and then minimizing this surrogate. For problems which minimize a finite sum of functions, a stochastic version of the MM method selects a batch of functions at random at each iteration and optimizes the accumulated surrogate. +However, in many cases of interest such as variational inference for latent variable models, the surrogate functions are expressed as an expectation. In this contribution, we propose a doubly stochastic MM method based on Monte Carlo approximation of these stochastic surrogates. +We establish asymptotic and non-asymptotic convergence of our scheme in a constrained, nonconvex, nonsmooth optimization setting. We apply our new framework for inference of logistic regression model with missing data and for variational inference of Bayesian variants of LeNet-5 and Resnet-18 on respectively the MNIST and CIFAR-10 datasets.",/pdf/c89a0d39ba7014dec13711023a9967b751205e79.pdf,ICLR,2021,"We develop an incremental optimization method with stochastic surrogate functions for nonconvex and nonsmooth problems, providing asymptotic and non-asymptotic guarantees for it." 
+HkJq1Ocxl,,1478290000000.0,1488890000000.0,451,Programming With a Differentiable Forth Interpreter,"[""m.bosnjak@cs.ucl.ac.uk"", ""t.rocktaschel@cs.ucl.ac.uk"", ""j.narad@cs.ucl.ac.uk"", ""s.riedel@cs.ucl.ac.uk""]","[""Matko Bo\u0161njak"", ""Tim Rockt\u00e4schel"", ""Jason Naradowsky"", ""Sebastian Riedel""]",[],"There are families of neural networks that can learn to compute any function, provided sufficient training data. However, given that in practice training data is scarce for all but a small set of problems, a core question is how to incorporate prior knowledge into a model. Here we consider the case of prior procedural knowledge, such as knowing the overall recursive structure of a sequence transduction program or the fact that a program will likely use arithmetic operations on real numbers to solve a task. To this end we present a differentiable interpreter for the programming language Forth. Through a neural implementation of the dual stack machine that underlies Forth, programmers can write program sketches with slots that can be filled with behaviour trained from program input-output data. As the program interpreter is end-to-end differentiable, we can optimize this behaviour directly through gradient descent techniques on user specified objectives, and also integrate the program into any larger neural computation graph. We show empirically that our interpreter is able to effectively leverage different levels of prior program structure and learn complex transduction tasks such as sequence sorting or addition with substantially less data and better generalisation over problem sizes. In addition, we introduce neural program optimisations based on symbolic computation and parallel branching that lead to significant speed improvements. 
",/pdf/2b0c040b545841857e3e9ed4f1f07a97129140c8.pdf,ICLR,2017,"This paper presents the first neural implementation of an abstract machine for an actual language, allowing programmers to inject prior procedural knowledge into neural architectures in a straightforward manner." +H1lIzhC9FX,S1gBPMAcFm,1538090000000.0,1545360000000.0,1268,Learning to remember: Dynamic Generative Memory for Continual Learning,"[""oleksiy.ostapenko@sap.com"", ""mihai.puscas@sap.com"", ""mihaimarian.puscas@unitn.it"", ""tassilo.klein@sap.com"", ""m.nabi@sap.com""]","[""Oleksiy Ostapenko"", ""Mihai Puscas"", ""Tassilo Klein"", ""Moin Nabi""]","[""Continual Learning"", ""Catastrophic Forgetting"", ""Dynamic Network Expansion""]","Continuously trainable models should be able to learn from a stream of data over an undefined period of time. This becomes even more difficult in a strictly incremental context, where data access to previously seen categories is not possible. To that end, we propose making use of a conditional generative adversarial model where the generator is used as a memory module through neural masking to emulate neural plasticity in the human brain. This memory module is further associated with a dynamic capacity expansion mechanism. Taken together, this method facilitates a resource efficient capacity adaption to accommodate new tasks, while retaining previously attained knowledge. 
The proposed approach outperforms state-of-the-art algorithms on publicly available datasets, overcoming catastrophic forgetting.",/pdf/9c4fda8d895581f021c1794f59e916d4b380620e.pdf,ICLR,2019, +O358nrve1W,yZjuf59eJAS,1601310000000.0,1614990000000.0,821,Neurally Guided Genetic Programming for Turing Complete Programming by Example,"[""~Alexander_Newton_Wild1"", ""~Barry_Porter1""]","[""Alexander Newton Wild"", ""Barry Porter""]","[""Code Synthesis"", ""Neural Code Synthesis"", ""Genetic Programming"", ""Programming By Example""]","The ability to synthesise source code from input/output examples allows nonexperts to generate programs, and experts to abstract away a wide range of simple programming tasks. Current research in this area has explored neural synthesis, SMT solvers, and genetic programming; each of these approaches is limited, however, often using highly specialised target languages for synthesis. In this paper we present a novel hybrid approach using neural networks to guide genetic programming (GP), which allows us to successfully synthesise code from just ten I/O examples in a generalised Turing complete target language, up to and including a sorting algorithm. We show that GP by itself is able to synthesise a set of simple programs, and show which hints (suggested lines of code for inclusion) are of most utility to GP in solving harder problems. Using a form of unstructured curriculum learning, we then demonstrate that neural networks can be used to determine when to make use of these high-utility hints for specific I/O problems and thus enable complex functions to be successfully synthesised. 
We apply our approach to two different problem sets: common array-to-array programs (including sorting), and a canvas drawing problem set inspired by So & Oh (2018).",/pdf/b46cc8d846f3dcec4e9dc7ee0741dd27e1c6008e.pdf,ICLR,2021,"This paper demonstrates that the use of genetic programming, guided by neural networks to provide search hints, is able to synthesise complex programs in a Turing-complete language -- up to and including the synthesis of sorting algorithms." +taQNxF9Sj6,umsrqaFT9Z_,1601310000000.0,1614990000000.0,2119, Adding Recurrence to Pretrained Transformers,"[""~Davis_Yoshida1"", ""~Allyson_Ettinger1"", ""~Kevin_Gimpel1""]","[""Davis Yoshida"", ""Allyson Ettinger"", ""Kevin Gimpel""]","[""Language modeling"", ""Transformers"", ""Recurrence"", ""Gradient checkpointing"", ""Pretraining""]","Fine-tuning a pretrained transformer for a downstream task has become a standard method in NLP in the last few years. While the results from these models are impressive, applying them can be extremely computationally expensive, as is pretraining new models with the latest architectures. We present a novel method for applying pretrained transformer language models which lowers their memory requirement both at training and inference time. An additional benefit is that our method removes the fixed context size constraint that most transformer models have, allowing for more flexible use. 
When applied to the GPT-2 language model, we find that our method attains better perplexity than an unmodified GPT-2 model on the PG-19 and WikiText-103 corpora, for a given amount of computation or memory.",/pdf/45a38e73888eed2caa69fe0627e9fad7d5bd8973.pdf,ICLR,2021,Adding a small recurrence module to pretrained transformer language models allows maintaining performance while lowering memory cost +HyGhN2A5tm,SJx5ykFqFm,1538090000000.0,1550800000000.0,1486,Multi-Agent Dual Learning,"[""yiren@illinois.edu"", ""yingce.xia@gmail.com"", ""hetianyu@mail.ustc.edu.cn"", ""fetia@microsoft.com"", ""taoqin@microsoft.com"", ""czhai@illinois.edu"", ""tie-yan.liu@microsoft.com""]","[""Yiren Wang"", ""Yingce Xia"", ""Tianyu He"", ""Fei Tian"", ""Tao Qin"", ""ChengXiang Zhai"", ""Tie-Yan Liu""]","[""Dual Learning"", ""Machine Learning"", ""Neural Machine Translation""]","Dual learning has attracted much attention in machine learning, computer vision and natural language processing communities. The core idea of dual learning is to leverage the duality between the primal task (mapping from domain X to domain Y) and dual task (mapping from domain Y to X) to boost the performances of both tasks. Existing dual learning framework forms a system with two agents (one primal model and one dual model) to utilize such duality. In this paper, we extend this framework by introducing multiple primal and dual models, and propose the multi-agent dual learning framework. Experiments on neural machine translation and image translation tasks demonstrate the effectiveness of the new framework. 
+In particular, we set a new record on IWSLT 2014 German-to-English translation with a 35.44 BLEU score, achieve a 31.03 BLEU score on WMT 2014 English-to-German translation with over 2.6 BLEU improvement over the strong Transformer baseline, and set a new record of 49.61 BLEU score on the recent WMT 2018 English-to-German translation.",/pdf/c89b6a5e32a23ea0b972dde2f97eedbca6ba7c1d.pdf,ICLR,2019, +rJxHcgStwr,Hkea3CxKDB,1569440000000.0,1577170000000.0,2471,Handwritten Amharic Character Recognition System Using Convolutional Neural Networks,"[""afetulhak@yahoo.com""]","[""Fetulhak Abdurahman""]","[""Amharic"", ""Handwritten"", ""Character"", ""Convolutional neural network"", ""Recognition""]","Amharic language is an official language of the federal government of the Federal Democratic Republic of Ethiopia. Accordingly, there is a bulk of handwritten Amharic documents available in libraries, information centres, museums, and offices. Digitization of these documents enables to harness already available language technologies to local information needs and developments. Converting these documents will have a lot of advantages including (i) to preserve and transfer history of the country (ii) to save storage space (ii) proper handling of documents (iv) enhance retrieval of information through internet and other applications. Handwritten Amharic character recognition system becomes a challenging task due to inconsistency of a writer, variability in writing styles of different writers, relatively large number of characters of the script, high interclass similarity, structural complexity and degradation of documents due to different reasons. In order to recognize handwritten Amharic character a novel method based on deep neural networks is used which has recently shown exceptional performance in various pattern recognition and machine learning applications, but has not been endeavoured for Ethiopic script. 
The CNN model is trained and tested our database that contains 132,500 datasets of handwritten Amharic characters. Common machine learning methods usually apply a combination of feature extractor and trainable classifier. The use of CNN leads to significant improvements across different machine-learning classification algorithms. Our proposed CNN model is giving an accuracy of 91.83% on training data and 90.47% on validation data.",/pdf/36b8dc15531953fa6adc97530e20bb2250af7dd8.pdf,ICLR,2020,Recognition of handwritten Amharic characters based on convolutional neural network. +ZHADKD4pl5H,vKJVUXLSNEY,1601310000000.0,1614990000000.0,1399,Wasserstein diffusion on graphs with missing attributes,"[""~Zhixian_Chen1"", ""~Tengfei_Ma1"", ""~Yangqiu_Song1"", ""yangwang@ust.hk""]","[""Zhixian Chen"", ""Tengfei Ma"", ""Yangqiu Song"", ""Yang Wang""]","[""Wasserstein barycenter"", ""graph learning"", ""diffusion"", ""missing features"", ""matrix completion""]","Many real-world graphs are attributed graphs where nodes are associated with non-topological features. While attributes can be missing anywhere in an attributed graph, most of existing node representation learning approaches do not consider such incomplete information. +In this paper, we propose a general non-parametric framework to mitigate this problem. Starting from a decomposition of the attribute matrix, we transform node features into discrete distributions in a lower-dimensional space equipped with the Wasserstein metric. On this Wasserstein space, we propose Wasserstein graph diffusion to smooth the distributional representations of nodes with information from their local neighborhoods. This allows us to reduce the distortion caused by missing attributes and obtain integrated representations expressing information of both topology structures and attributes. We then pull the nodes back to the original space and produce corresponding point representations to facilitate various downstream tasks. 
To show the power of our representation method, we designed two algorithms based on it for node classification (with missing attributes) and matrix completion respectively, and demonstrate their effectiveness in experiments.",/pdf/4e6ae26ce02dc7b95e941c5f3e40b47a4c06d64a.pdf,ICLR,2021,We propose a new graph representation method based on optimal transport for graphs with missing attributes. +H1lVvgHKDr,HyxxdoeYvB,1569440000000.0,1577170000000.0,2357,Knowledge Transfer via Student-Teacher Collaboration,"[""gtx@pku.edu.cn"", ""rqxiong@pku.edu.cn"", ""liu-zh@pku.edu.cn"", ""swma@pku.edu.cn"", ""fengwu@ustc.edu.cn"", ""tjhuang@pku.edu.cn"", ""wgao@pku.edu.cn""]","[""Tianxiao Gao"", ""Ruiqin Xiong"", ""Zhenhua Liu"", ""Siwei ma"", ""Feng Wu"", ""Tiejun Huang"", ""Wen Gao""]","[""Network Compression and Acceleration"", ""Knowledge Transfer"", ""Student-Teacher Collaboration"", ""Deep Learning.""]","Accompanying with the flourish development in various fields, deep neural networks, however, are still facing with the plight of high computational costs and storage. One way to compress these heavy models is knowledge transfer (KT), in which a light student network is trained through absorbing the knowledge from a powerful teacher network. In this paper, we propose a novel knowledge transfer method which employs a Student-Teacher Collaboration (STC) network during the knowledge transfer process. This is done by connecting the front part of the student network to the back part of the teacher network as the STC network. The back part of the teacher network takes the intermediate representation from the front part of the student network as input to make the prediction. The difference between the prediction from the collaboration network and the output tensor from the teacher network is taken into account of the loss during the train process. Through back propagation, the teacher network provides guidance to the student network in a gradient signal manner. 
In this way, our method takes advantage of the knowledge from the entire teacher network, who instructs the student network in learning process. Through plentiful experiments, it is proved that our STC method outperforms other KT methods with conventional strategy.",/pdf/766e9dfd3ffa063ac47fd4f320af23fadcaffeab.pdf,ICLR,2020,We propose a novel knowledge transfer method which employs a student-teacher collaboration network. +5NA1PinlGFu,F9XokM8Wvc5,1601310000000.0,1615110000000.0,3388,Colorization Transformer,"[""~Manoj_Kumar1"", ""~Dirk_Weissenborn1"", ""~Nal_Kalchbrenner1""]","[""Manoj Kumar"", ""Dirk Weissenborn"", ""Nal Kalchbrenner""]",[],"We present the Colorization Transformer, a novel approach for diverse high fidelity image colorization based on self-attention. Given a grayscale image, the colorization proceeds in three steps. We first use a conditional autoregressive transformer to produce a low resolution coarse coloring of the grayscale image. Our architecture adopts conditional transformer layers to effectively condition grayscale input. Two subsequent fully parallel networks upsample the coarse colored low resolution image into a finely colored high resolution image. Sampling from the Colorization Transformer produces diverse colorings whose fidelity outperforms the previous state-of-the-art on colorising ImageNet based on FID results and based on a human evaluation in a Mechanical Turk test. Remarkably, in more than 60\% of cases human evaluators prefer the highest rated among three generated colorings over the ground truth. 
The code and pre-trained checkpoints for Colorization Transformer are publicly available at https://github.com/google-research/google-research/tree/master/coltran",/pdf/f2f5d9057587995de8d113d1ba35dd7d8b98f48e.pdf,ICLR,2021,Self-attention for colorization +uKhGRvM8QNH,xDsQoCh7QDc,1601310000000.0,1615880000000.0,182,Improve Object Detection with Feature-based Knowledge Distillation: Towards Accurate and Efficient Detectors,"[""~Linfeng_Zhang2"", ""~Kaisheng_Ma1""]","[""Linfeng Zhang"", ""Kaisheng Ma""]","[""Knowledge Distillation"", ""Object Detection"", ""Teacher-Student Learning"", ""Non-Local Modules"", ""Attention Modules""]","Knowledge distillation, in which a student model is trained to mimic a teacher model, has been proved as an effective technique for model compression and model accuracy boosting. However, most knowledge distillation methods, designed for image classification, have failed on more challenging tasks, such as object detection. In this paper, we suggest that the failure of knowledge distillation on object detection is mainly caused by two reasons: (1) the imbalance between pixels of foreground and background and (2) lack of distillation on the relation between different pixels. Observing the above reasons, we propose attention-guided distillation and non-local distillation to address the two problems, respectively. Attention-guided distillation is proposed to find the crucial pixels of foreground objects with attention mechanism and then make the students take more effort to learn their features. Non-local distillation is proposed to enable students to learn not only the feature of an individual pixel but also the relation between different pixels captured by non-local modules. Experiments show that our methods achieve excellent AP improvements on both one-stage and two-stage, both anchor-based and anchor-free detectors. 
For example, Faster RCNN (ResNet101 backbone) with our distillation achieves 43.9 AP on COCO2017, which is 4.1 higher than the baseline. Codes have been released on Github.",/pdf/1e6969024e2fab8681c02ff62a2dbfc4feedcff4.pdf,ICLR,2021,We propose two knowledge distillation methods on object detection - attention-guided distillation and non-local distillation which lead to 4.1 AP improvements on Faster RCNN101 in MS COCO2017. +SklzIjActX,rJxtmNxcKX,1538090000000.0,1545360000000.0,153,HIGHLY EFFICIENT 8-BIT LOW PRECISION INFERENCE OF CONVOLUTIONAL NEURAL NETWORKS,"[""haihao.shen@intel.com"", ""jiong.gong@intel.com"", ""xiaoli.liu@intel.com"", ""guoming.zhang@intel.com"", ""ge.jin@intel.com"", ""eric.lin@intel.com""]","[""Haihao Shen"", ""Jiong Gong"", ""Xiaoli Liu"", ""Guoming Zhang"", ""Ge Jin"", ""and Eric Lin""]","[""8-bit low precision inference"", ""convolutional neural networks"", ""statistical accuracy"", ""8-bit Winograd convolution""]","High throughput and low latency inference of deep neural networks are critical for the deployment of deep learning applications. This paper presents a general technique toward 8-bit low precision inference of convolutional neural networks, including 1) channel-wise scale factors of weights, especially for depthwise convolution, 2) Winograd convolution, and 3) topology-wise 8-bit support. We experiment the techniques on top of a widely-used deep learning framework. The 8-bit optimized model is automatically generated with a calibration process from FP32 model without the need of fine-tuning or retraining. We perform a systematical and comprehensive study on 18 widely-used convolutional neural networks and demonstrate the effectiveness of 8-bit low precision inference across a wide range of applications and use cases, including image classification, object detection, image segmentation, and super resolution. 
We show that the inference throughput +and latency are improved by 1.6X and 1.5X respectively with minimal (within 0.6%) to no loss in accuracy from FP32 baseline. We believe the methodology can provide the guidance and reference design of 8-bit low precision inference for other frameworks. All the code and models will be publicly available soon.",/pdf/b9dd890039376e056a3498b29c2612b1fffa81e1.pdf,ICLR,2019,We present a general technique toward 8-bit low precision inference of convolutional neural networks. +rJgYxn09Fm,ryxvKh3qFQ,1538090000000.0,1551480000000.0,1098,Learning Implicitly Recurrent CNNs Through Parameter Sharing,"[""savarese@ttic.edu"", ""mmaire@uchicago.edu""]","[""Pedro Savarese"", ""Michael Maire""]","[""deep learning"", ""architecture search"", ""computer vision""]","We introduce a parameter sharing scheme, in which different layers of a convolutional neural network (CNN) are defined by a learned linear combination of parameter tensors from a global bank of templates. Restricting the number of templates yields a flexible hybridization of traditional CNNs and recurrent networks. Compared to traditional CNNs, we demonstrate substantial parameter savings on standard image classification tasks, while maintaining accuracy. +Our simple parameter sharing scheme, though defined via soft weights, in practice often yields trained networks with near strict recurrent structure; with negligible side effects, they convert into networks with actual loops. Training these networks thus implicitly involves discovery of suitable recurrent architectures. Though considering only the aspect of recurrent links, our trained networks achieve accuracy competitive with those built using state-of-the-art neural architecture search (NAS) procedures. +Our hybridization of recurrent and convolutional networks may also represent a beneficial architectural bias. 
Specifically, on synthetic tasks which are algorithmic in nature, our hybrid networks both train faster and extrapolate better to test examples outside the span of the training set.",/pdf/c0ca6381a4b14e3fd45586e92ef0e243d5f1fd0e.pdf,ICLR,2019,We propose a method that enables CNN folding to create recurrent connections +Hkeh21BKPH,B1g6Ff1Kvr,1569440000000.0,1577170000000.0,1967,Towards Finding Longer Proofs,"[""zombori@renyi.hu"", ""csadrian@renyi.hu"", ""henrykmichalewski@gmail.com"", ""cezary.kaliszyk@uibk.ac.at"", ""josef.urban@gmail.com""]","[""Zsolt Zombori"", ""Adri\u00e1n Csisz\u00e1rik"", ""Henryk Michalewski"", ""Cezary Kaliszyk"", ""Josef Urban""]","[""automated theorem proving"", ""reinforcement learning"", ""curriculum learning"", ""internal guidance""]","We present a reinforcement learning (RL) based guidance system for automated theorem proving geared towards Finding Longer Proofs (FLoP). FLoP focuses on generalizing from short proofs to longer ones of similar structure. To achieve that, FLoP uses state-of-the-art RL approaches that were previously not applied in theorem proving. In particular, we show that curriculum learning significantly outperforms previous learning-based proof guidance on a synthetic dataset of increasingly difficult arithmetic problems.",/pdf/7e15c70ba6b27c894f7c1b8a0180a60ceeccaa2b.pdf,ICLR,2020,"We present FLoP, a reinforcement learning based guidance system for automated theorem proving geared towards Finding Longer Proofs." +K5j7D81ABvt,UPU4oPYmweA,1601310000000.0,1611610000000.0,3399,Disambiguating Symbolic Expressions in Informal Documents,"[""~Dennis_M\u00fcller1"", ""~Cezary_Kaliszyk1""]","[""Dennis M\u00fcller"", ""Cezary Kaliszyk""]",[],"We propose the task of \emph{disambiguating} symbolic expressions in informal STEM documents in the form of \LaTeX files -- that is, determining their precise semantics and abstract syntax tree -- as a neural machine translation task. 
We discuss the distinct challenges involved and present a dataset with roughly 33,000 entries. We evaluated several baseline models on this dataset, which failed to yield even syntactically valid \LaTeX before overfitting. Consequently, we describe a methodology using a \emph{transformer} language model pre-trained on sources obtained from \url{arxiv.org}, which yields promising results despite the small size of the dataset. We evaluate our model using a plurality of dedicated techniques, taking syntax and semantics of symbolic expressions into account.",/pdf/006f5f9df1ed650389c8a89fd0087c3a9cb81605.pdf,ICLR,2021, +4SiMia0kjba,mljNZkk7i0,1601310000000.0,1614990000000.0,1578,Causal Probabilistic Spatio-temporal Fusion Transformers in Two-sided Ride-Hailing Markets,"[""~Shixiang_Wan1"", ""~Shikai_Luo1"", ""~Hongtu_Zhu2""]","[""Shixiang Wan"", ""Shikai Luo"", ""Hongtu Zhu""]","[""Spatio-temporal Prediction"", ""Causal Inference"", ""Efficient Transformers"", ""Two-sided Markets""]","Achieving accurate spatio-temporal predictions in large-scale systems is extremely valuable in many real-world applications, such as weather forecasts, retail forecasting, and urban traffic forecasting. So far, most existing methods for multi-horizon, multi-task and multi-target predictions select important predicting variables via their correlations with responses, and thus it is highly possible that many forecasting models generated from those methods are not causal, leading to poor interpretability. The aim of this paper is to develop a collaborative causal spatio-temporal fusion transformer, named CausalTrans, to establish the collaborative causal effects of predictors on multiple forecasting targets, such as supply and demand in ride-sharing platforms. Specifically, we integrate the causal attention with the Conditional Average Treatment Effect (CATE) estimation method for causal inference. 
Moreover, we propose a novel and fast multi-head attention evolved from Taylor expansion instead of softmax, reducing time complexity from $O(\mathcal{V}^2)$ to $O(\mathcal{V})$, where $\mathcal{V}$ is the number of nodes in a graph. We further design a spatial graph fusion mechanism to significantly reduce the parameters' scale. We conduct a wide range of experiments to demonstrate the interpretability of causal attention, the effectiveness of various model components, and the time efficiency of our CausalTrans. As shown in these experiments, our CausalTrans framework can achieve up to 15$\%$ error reduction compared with various baseline methods. ",/pdf/2e1d95eab956d7894f701146a590095527edb2b9.pdf,ICLR,2021,We develop a novel causal transformer with causal inference and efficient taylor attention to address large scale spatio-temporal predictions. Our method achieves up to 15% error reduction compared with various baseline methods. +8VXvj1QNRl1,H-v7DYhnRs3,1601310000000.0,1615460000000.0,3746,On the Transfer of Disentangled Representations in Realistic Settings,"[""~Andrea_Dittadi1"", ""~Frederik_Tr\u00e4uble1"", ""~Francesco_Locatello1"", ""~Manuel_Wuthrich1"", ""~Vaibhav_Agrawal1"", ""~Ole_Winther1"", ""~Stefan_Bauer1"", ""~Bernhard_Sch\u00f6lkopf1""]","[""Andrea Dittadi"", ""Frederik Tr\u00e4uble"", ""Francesco Locatello"", ""Manuel Wuthrich"", ""Vaibhav Agrawal"", ""Ole Winther"", ""Stefan Bauer"", ""Bernhard Sch\u00f6lkopf""]","[""representation learning"", ""disentanglement"", ""real-world""]","Learning meaningful representations that disentangle the underlying structure of the data generating process is considered to be of key importance in machine learning. While disentangled representations were found to be useful for diverse tasks such as abstract reasoning and fair classification, their scalability and real-world impact remain questionable. 
+We introduce a new high-resolution dataset with 1M simulated images and over 1,800 annotated real-world images of the same setup. In contrast to previous work, this new dataset exhibits correlations, a complex underlying structure, and allows to evaluate transfer to unseen simulated and real-world settings where the encoder i) remains in distribution or ii) is out of distribution. +We propose new architectures in order to scale disentangled representation learning to realistic high-resolution settings and conduct a large-scale empirical study of disentangled representations on this dataset. We observe that disentanglement is a good predictor for out-of-distribution (OOD) task performance.",/pdf/bd4ae699aba426a89c027ef66f9f0cc2dcb0e187.pdf,ICLR,2021,We scale disentangled representation learning to a new realistic dataset and conduct a large-scale empirical study on OOD generalization. +N33d7wjgzde,YUi6aROnbBc,1601310000000.0,1620680000000.0,203,Universal Weakly Supervised Segmentation by Pixel-to-Segment Contrastive Learning,"[""~Tsung-Wei_Ke2"", ""~Jyh-Jing_Hwang1"", ""~Stella_Yu2""]","[""Tsung-Wei Ke"", ""Jyh-Jing Hwang"", ""Stella Yu""]","[""weakly supervised representation learning"", ""representation learning for computer vision"", ""metric learning"", ""semantic segmentation""]","Weakly supervised segmentation requires assigning a label to every pixel based on training instances with partial annotations such as image-level tags, object bounding boxes, labeled points and scribbles. This task is challenging, as coarse annotations (tags, boxes) lack precise pixel localization whereas sparse annotations (points, scribbles) lack broad region coverage. Existing methods tackle these two types of weak supervision differently: Class activation maps are used to localize coarse labels and iteratively refine the segmentation model, whereas conditional random fields are used to propagate sparse labels to the entire image. 
+ +We formulate weakly supervised segmentation as a semi-supervised metric learning problem, where pixels of the same (different) semantics need to be mapped to the same (distinctive) features. We propose 4 types of contrastive relationships between pixels and segments in the feature space, capturing low-level image similarity, semantic annotation, co-occurrence, and feature affinity. They act as priors; the pixel-wise feature can be learned from training images with any partial annotations in a data-driven fashion. In particular, unlabeled pixels in training images participate not only in data-driven grouping within each image, but also in discriminative feature learning within and across images. We deliver a universal weakly supervised segmenter with significant gains on Pascal VOC and DensePose. Our code is publicly available at https://github.com/twke18/SPML.",/pdf/4e7e3f71b7b4a72b0db3a1576e17b7acb8f9e435.pdf,ICLR,2021,We propose a unified pixel-to-segment contrastive learning loss formulation for weakly supervised semantic segmentation with various types of annotations. +SJeuueSYDH,HkgIbaeKvS,1569440000000.0,1577170000000.0,2404,Distributed Training Across the World,"[""ligeng@mit.edu"", ""luyao11175@gmail.com"", ""yujunlin@mit.edu"", ""songhan@mit.edu""]","[""Ligeng Zhu"", ""Yao Lu"", ""Yujun Lin"", ""Song Han""]","[""Distributed Training"", ""Bandwidth""]","Traditional synchronous distributed training is performed inside a cluster, since it requires high bandwidth and low latency network (e.g. 25Gb Ethernet or InfiniBand). However, in many application scenarios, training data are often distributed across many geographic locations, where physical distance is long and latency is high. Traditional synchronous distributed training cannot scale well under such limited network conditions. In this work, we aim to scale distributed learning under high-latency network. 
To achieve this, we propose delayed and temporally sparse (DTS) update that enables synchronous training to tolerate extreme network conditions without compromising accuracy. We benchmark our algorithms on servers deployed across three continents in the world: London (Europe), Tokyo (Asia), Oregon (North America) and Ohio (North America). Under such challenging settings, DTS achieves 90× speedup over traditional methods without loss of accuracy on ImageNet.",/pdf/20bb9f0dd3087e7851a6710e5ce57765087793a1.pdf,ICLR,2020,Conventional distributed learning is only performed inside cluster because of latency requirements. We scale the distributed training across the world under high latency network. +jfPU-u_52Tx,umwwt6sHZTO,1601310000000.0,1614990000000.0,448,Federated Generalized Bayesian Learning via Distributed Stein Variational Gradient Descent,"[""~Rahif_Kassab1"", ""~Osvaldo_Simeone1""]","[""Rahif Kassab"", ""Osvaldo Simeone""]","[""Federated Learning"", ""Distributed Variational Inference""]","This paper introduces Distributed Stein Variational Gradient Descent (DSVGD), a non-parametric generalized Bayesian inference framework for federated learning. DSVGD maintains a number of non-random and interacting particles at a central server to represent the current iterate of the model global posterior. The particles are iteratively downloaded and updated by one of the agents with the end goal of minimizing the global free energy. By varying the number of particles, DSVGD enables a flexible trade-off between per-iteration communication load and number of communication rounds. 
DSVGD is shown to compare favorably to benchmark frequentist and Bayesian federated learning strategies, also scheduling a single device per iteration, in terms of accuracy and scalability with respect to the number of agents, while also providing well-calibrated, and hence trustworthy, predictions.",/pdf/35e66e689a316bff6f017f2f2d345e3a3fa44f9d.pdf,ICLR,2021,This paper introduces a non-parametric generalized Bayesian federated learning framework based on SVGD that allows a flexible trade-off between per-iteration communication load and number of communication rounds. +rcQdycl0zyk,JWwWIiMXGMp,1601310000000.0,1618920000000.0,1018,Beyond Fully-Connected Layers with Quaternions: Parameterization of Hypercomplex Multiplications with $1/n$ Parameters,"[""~Aston_Zhang2"", ""~Yi_Tay1"", ""~SHUAI_Zhang5"", ""~Alvin_Chan1"", ""~Anh_Tuan_Luu2"", ""~Siu_Hui1"", ""~Jie_Fu2""]","[""Aston Zhang"", ""Yi Tay"", ""SHUAI Zhang"", ""Alvin Chan"", ""Anh Tuan Luu"", ""Siu Hui"", ""Jie Fu""]","[""hypercomplex representation learning""]","Recent works have demonstrated reasonable success of representation learning in hypercomplex space. Specifically, “fully-connected layers with quaternions” (quaternions are 4D hypercomplex numbers), which replace real-valued matrix multiplications in fully-connected layers with Hamilton products of quaternions, both enjoy parameter savings with only 1/4 learnable parameters and achieve comparable performance in various applications. However, one key caveat is that hypercomplex space only exists at very few predefined dimensions (4D, 8D, and 16D). This restricts the flexibility of models that leverage hypercomplex multiplications. To this end, we propose parameterizing hypercomplex multiplications, allowing models to learn multiplication rules from data regardless of whether such rules are predefined. 
As a result, our method not only subsumes the Hamilton product, but also learns to operate on any arbitrary $n$D hypercomplex space, providing more architectural flexibility using arbitrarily $1/n$ learnable parameters compared with the fully-connected layer counterpart. Experiments of applications to the LSTM and transformer models on natural language inference, machine translation, text style transfer, and subject verb agreement demonstrate architectural flexibility and effectiveness of the proposed approach.",/pdf/98639a764ded8e038fa188dc104694519947e67c.pdf,ICLR,2021,We propose to parameterize hypercomplex multiplications using arbitrarily $1/n$ learnable parameters compared with the fully-connected layer counterpart. +Bkle6T4YvB,BJxbM_xODr,1569440000000.0,1577170000000.0,805,From English to Foreign Languages: Transferring Pre-trained Language Models,"[""ketranmanh@gmail.com""]","[""Ke Tran""]","[""pretrained language model"", ""zero-shot transfer"", ""parsing"", ""natural language inference""]","Pre-trained models have demonstrated their effectiveness in many downstream natural language processing (NLP) tasks. The availability of multilingual pre-trained models enables zero-shot transfer of NLP tasks from high resource languages to low resource ones. However, recent research in improving pre-trained models focuses heavily on English. While it is possible to train the latest neural architectures for other languages from scratch, it is undesirable due to the required amount of compute. In this work, we tackle the problem of transferring an existing pre-trained model from English to other languages under a limited computational budget. With a single GPU, our approach can obtain a foreign BERT-base model within a day and a foreign BERT-large within two days. 
Furthermore, evaluating our models on six languages, we demonstrate that our models are better than multilingual BERT on two zero-shot tasks: natural language inference and dependency parsing.",/pdf/e8d3036658a184a016b5b819a825f13c93e7c56f.pdf,ICLR,2020,How to train non-English BERT within one day on using a single GPU +ClZ4IcqnFXB,2gLkLMFqQv,1601310000000.0,1614990000000.0,2361,Active Feature Acquisition with Generative Surrogate Models,"[""~Yang_Li19"", ""~Junier_Oliva1""]","[""Yang Li"", ""Junier Oliva""]","[""Reinforcement Learning"", ""Active Feature Acquisition"", ""Feature Selection""]","Many real-world situations allow for the acquisition of additional relevant information when making an assessment with limited or uncertain data. However, traditional ML approaches either require all features to be acquired beforehand or regard part of them as missing data that cannot be acquired. In this work, we propose models that perform active feature acquisition (AFA) to improve the prediction assessments at evaluation time. We formulate the AFA problem as a Markov decision process (MDP) and resolve it using reinforcement learning (RL). The AFA problem yields sparse rewards and contains a high-dimensional complicated action space. Thus, we propose learning a generative surrogate model that captures the complicated dependencies among input features to assess potential information gain from acquisitions. We also leverage the generative surrogate model to provide intermediate rewards and auxiliary information to the agent. Furthermore, we extend AFA in a task we coin active instance recognition (AIR) for the unsupervised case where the target variables are the unobserved features themselves and the goal is to collect information for a particular instance in a cost-efficient way. 
Empirical results demonstrate that our approach achieves considerably better performance than previous state of the art methods on both supervised and unsupervised tasks.",/pdf/8eec9e0691fd947567878d7545b589bc054ff34f.pdf,ICLR,2021,We propose models that actively acquire features at evaluation time to maximize the prediction performance as well as minimize the acquisition cost. +BJlxmAKlg,,1478250000000.0,1483500000000.0,154,ReasoNet: Learning to Stop Reading in Machine Comprehension,"[""yeshen@microsoft.com"", ""pshuang@microsoft.com"", ""jfgao@microsoft.com"", ""wzchen@microsoft.com""]","[""Yelong Shen"", ""Po-Sen Huang"", ""Jianfeng Gao"", ""Weizhu Chen""]","[""Deep learning"", ""Natural language processing""]","Teaching a computer to read a document and answer general questions pertaining to the document is a challenging yet unsolved problem. In this paper, we describe a novel neural network architecture called Reasoning Network ({ReasoNet}) for machine comprehension tasks. ReasoNet makes use of multiple turns to effectively exploit and then reason over the relation among queries, documents, and answers. Different from previous approaches using a fixed number of turns during inference, ReasoNet introduces a termination state to relax this constraint on the reasoning depth. With the use of reinforcement learning, ReasoNet can dynamically determine whether to continue the comprehension process after digesting intermediate results, or to terminate reading when it concludes that existing information is adequate to produce an answer. ReasoNet has achieved state-of-the-art performance in machine comprehension datasets, including unstructured CNN and Daily Mail datasets, and a structured Graph Reachability dataset. 
+",/pdf/a12e2b52fbec4d4bda776f98b3478d66ffef14e1.pdf,ICLR,2017,ReasoNet Reader for machine reading and comprehension +Sm_4MDxPWXf,K1GyntFNOO1,1601310000000.0,1614990000000.0,2160,StructFormer: Joint Unsupervised Induction of Dependency and Constituency Structure from Masked Language Modeling,"[""~Yikang_Shen1"", ""~Yi_Tay1"", ""chezheng@google.com"", ""~Dara_Bahri1"", ""metzler@google.com"", ""~Aaron_Courville3""]","[""Yikang Shen"", ""Yi Tay"", ""Che Zheng"", ""Dara Bahri"", ""Donald Metzler"", ""Aaron Courville""]","[""Unsupervised Dependency Parsing"", ""Unsupervised Constituency Parsing"", ""Masked Language Model""]","There are two major classes of natural language grammars --- the dependency grammar that models one-to-one correspondences between words and the constituency grammar that models the assembly of one or several corresponded words. While previous unsupervised parsing methods mostly focus on only inducing one class of grammars, we introduce a novel model, StructFormer, that can induce dependency and constituency structure at the same time. To achieve this, we propose a new parsing framework that can jointly generates constituency tree and dependency graph. Then we integrate the induced dependency relations into transformer, in a differentiable manner, through a novel dependency-constrained self-attention mechanism. Experimental results show that our model can achieve strong results on unsupervised constituency parsing, unsupervised dependency parsing and masked language modeling at the same time.",/pdf/3d2b8f718aae725f129cf5e31f090fa988a6274a.pdf,ICLR,2021,We propose a novel neural network based model that can do unsupervised dependency and constituency parsing at the same time. 
+Ske9VANKDH,rJgMf88dPB,1569440000000.0,1577170000000.0,1086,An Optimization Principle Of Deep Learning?,"[""u0952128@utah.edu"", ""yang.4972@buckeyemail.osu.edu"", ""yi.zhou@utah.edu""]","[""Cheng Chen"", ""Junjie Yang"", ""Yi Zhou""]",[],"Training deep neural networks (DNNs) has achieved great success in recent years. Modern DNN trainings utilize various types of training techniques that are developed in different aspects, e.g., activation functions for neurons, batch normalization for hidden layers, skip connections for network architecture and stochastic algorithms for optimization. Despite the effectiveness of these techniques, it is still mysterious how they help accelerate DNN trainings in practice. In this paper, we propose an optimization principle that is parameterized by $\gamma>0$ for stochastic algorithms in nonconvex and over-parameterized optimization. The principle guarantees the convergence of stochastic algorithms to a global minimum with a monotonically diminishing parameter distance to the minimizer and leads to a $\mathcal{O}(1/\gamma K)$ sub-linear convergence rate, where $K$ is the number of iterations. Through extensive experiments, we show that DNN trainings consistently obey the $\gamma$-optimization principle and its theoretical implications. In particular, we observe that the trainings that apply the training techniques achieve accelerated convergence and obey the principle with a large $\gamma$, which is consistent with the $\mathcal{O}(1/\gamma K)$ convergence rate result under the optimization principle. 
We think the $\gamma$-optimization principle captures and quantifies the impacts of various DNN training techniques and can be of independent interest from a theoretical perspective.",/pdf/e08ad8677c7ba430ab90dd492d78d0cff750aebb.pdf,ICLR,2020, +H1meywxRW,Hy7xJDgC-,1509090000000.0,1519350000000.0,282,DCN+: Mixed Objective And Deep Residual Coattention for Question Answering,"[""cxiong@salesforce.com"", ""richard@socher.org"", ""victor@victorzhong.com""]","[""Caiming Xiong"", ""Victor Zhong"", ""Richard Socher""]","[""question answering"", ""deep learning"", ""natural language processing"", ""reinforcement learning""]","Traditional models for question answering optimize using cross entropy loss, which encourages exact answers at the cost of penalizing nearby or overlapping answers that are sometimes equally accurate. We propose a mixed objective that combines cross entropy loss with self-critical policy learning, using rewards derived from word overlap to solve the misalignment between evaluation metric and optimization objective. In addition to the mixed objective, we introduce a deep residual coattention encoder that is inspired by recent work in deep self-attention and residual networks. Our proposals improve model performance across question types and input lengths, especially for long questions that requires the ability to capture long-term dependencies. On the Stanford Question Answering Dataset, our model achieves state of the art results with 75.1% exact match accuracy and 83.1% F1, while the ensemble obtains 78.9% exact match accuracy and 86.0% F1.",/pdf/79c9bd64f30bdd26caa8ee47fdc03cacbe24fc7f.pdf,ICLR,2018,"We introduce the DCN+ with deep residual coattention and mixed-objective RL, which achieves state of the art performance on the Stanford Question Answering Dataset." 
+HkpRBFxRb,SyhCrKxCb,1509100000000.0,1518730000000.0,336,Learning to Mix n-Step Returns: Generalizing Lambda-Returns for Deep Reinforcement Learning,"[""sahil@cse.iitm.ac.in"", ""girishraguvir@gmail.com"", ""sriramesh4@gmail.com"", ""ravi@cse.iitm.ac.in""]","[""Sahil Sharma"", ""Girish Raguvir J *"", ""Srivatsan Ramesh *"", ""Balaraman Ravindran""]","[""Reinforcement Learning"", ""Lambda-Returns""]","Reinforcement Learning (RL) can model complex behavior policies for goal-directed sequential decision making tasks. A hallmark of RL algorithms is Temporal Difference (TD) learning: value function for the current state is moved towards a bootstrapped target that is estimated using the next state's value function. lambda-returns define the target of the RL agent as a weighted combination of rewards estimated by using multiple many-step look-aheads. Although mathematically tractable, the use of exponentially decaying weighting of n-step returns based targets in lambda-returns is a rather ad-hoc design choice. Our major contribution is that we propose a generalization of lambda-returns called Confidence-based Autodidactic Returns (CAR), wherein the RL agent learns the weighting of the n-step returns in an end-to-end manner. In contrast to lambda-returns wherein the RL agent is restricted to use an exponentially decaying weighting scheme, CAR allows the agent to learn to decide how much it wants to weigh the n-step returns based targets. Our experiments, in addition to showing the efficacy of CAR, also empirically demonstrate that using sophisticated weighted mixtures of multi-step returns (like CAR and lambda-returns) considerably outperforms the use of n-step returns. We perform our experiments on the Asynchronous Advantage Actor Critic (A3C) algorithm in the Atari 2600 domain.",/pdf/c0e71ac9a9c1270bda65a7ad8f311746974689d3.pdf,ICLR,2018,A novel way to generalize lambda-returns by allowing the RL agent to decide how much it wants to weigh each of the n-step returns. 
+S1zk9iRqF7,BJeVxRbKFm,1538090000000.0,1549660000000.0,498,PATE-GAN: Generating Synthetic Data with Differential Privacy Guarantees,"[""james.jordon@wolfson.ox.ac.uk"", ""jsyoon0823@gmail.com"", ""mihaela.vanderschaar@eng.ox.ac.uk""]","[""James Jordon"", ""Jinsung Yoon"", ""Mihaela van der Schaar""]","[""Synthetic data generation"", ""Differential privacy"", ""Generative adversarial networks"", ""Private Aggregation of Teacher ensembles""]","Machine learning has the potential to assist many communities in using the large datasets that are becoming more and more available. Unfortunately, much of that potential is not being realized because it would require sharing data in a way that compromises privacy. In this paper, we investigate a method for ensuring (differential) privacy of the generator of the Generative Adversarial Nets (GAN) framework. The resulting model can be used for generating synthetic data on which algorithms can be trained and validated, and on which competitions can be conducted, without compromising the privacy of the original dataset. Our method modifies the Private Aggregation of Teacher Ensembles (PATE) framework and applies it to GANs. Our modified framework (which we call PATE-GAN) allows us to tightly bound the influence of any individual sample on the model, resulting in tight differential privacy guarantees and thus an improved performance over models with the same guarantees. We also look at measuring the quality of synthetic data from a new angle; we assert that for the synthetic data to be useful for machine learning researchers, the relative performance of two algorithms (trained and tested) on the synthetic dataset should be the same as their relative performance (when trained and tested) on the original dataset. 
Our experiments, on various datasets, demonstrate that PATE-GAN consistently outperforms the state-of-the-art method with respect to this and other notions of synthetic data quality.",/pdf/9bef793ee88e366d6fb82472f8bb87690c8cf96f.pdf,ICLR,2019, +1-j4VLSHApJ,DUjij9NgeD,1601310000000.0,1614990000000.0,82,Learn2Weight: Weights Transfer Defense against Similar-domain Adversarial Attacks,"[""~Siddhartha_Datta1""]","[""Siddhartha Datta""]","[""adversarial attack"", ""robustness"", ""domain adaptation"", ""privacy-preserving machine learning""]","Recent work in black-box adversarial attacks for NLP systems has attracted attention. Prior black-box attacks assume that attackers can observe output labels from target models based on selected inputs. In this work, inspired by adversarial transferability, we propose a new type of black-box NLP adversarial attack that an attacker can choose a similar domain and transfer the adversarial examples to the target domain and cause poor performance in target model. Based on domain adaptation theory, we then propose a defensive strategy, called Learn2Weight, which trains to predict the weight adjustments for target model in order to defense the attack of similar-domain adversarial examples. Using Amazon multi-domain sentiment classification dataset, we empirically show that Learn2Weight model is effective against the attack compared to standard black-box defense methods such as adversarial training and defense distillation. This work contributes to the growing literature on machine learning safety.",/pdf/c67e3a339818b545ecfacdd280aa62a3cee8c233.pdf,ICLR,2021,"We introduce Learn2Weight, a defense inspired by weights transfer learning, to defend against adversarial attacks that leverage domain similarities." 
+S1lxKlSKPH,Syxl5axYPB,1569440000000.0,1583910000000.0,2422,Consistency Regularization for Generative Adversarial Networks,"[""zhanghan@google.com"", ""zizhaoz@google.com"", ""augustusodena@google.com"", ""honglak@google.com""]","[""Han Zhang"", ""Zizhao Zhang"", ""Augustus Odena"", ""Honglak Lee""]","[""Generative Adversarial Networks"", ""Consistency Regularization"", ""GAN""]","Generative Adversarial Networks (GANs) are known to be difficult to train, despite considerable research effort. Several regularization techniques for stabilizing training have been proposed, but they introduce non-trivial computational overheads and interact poorly with existing techniques like spectral normalization. In this work, we propose a simple, effective training stabilizer based on the notion of consistency regularization—a popular technique in the semi-supervised learning literature. In particular, we augment data passing into the GAN discriminator and penalize the sensitivity of the discriminator to these augmentations. We conduct a series of experiments to demonstrate that consistency regularization works effectively with spectral normalization and various GAN architectures, loss functions and optimizer settings. Our method achieves the best FID scores for unconditional image generation compared to other regularization methods on CIFAR-10 and CelebA. Moreover, our consistency regularized GAN (CR-GAN) improves state-of-the-art FID scores for conditional generation from 14.73 to 11.48 on CIFAR-10 and from 8.73 to 6.66 on ImageNet-2012.",/pdf/c1d47b4f171c43090944d6f855cdc74403239414.pdf,ICLR,2020, +BJx0sjC5FX,BJxwI2CtYm,1538090000000.0,1550790000000.0,670,RNNs implicitly implement tensor-product representations,"[""tom.mccoy@jhu.edu"", ""tal.linzen@jhu.edu"", ""ewan.dunbar@univ-paris-diderot.fr"", ""smolensky@jhu.edu""]","[""R. 
Thomas McCoy"", ""Tal Linzen"", ""Ewan Dunbar"", ""Paul Smolensky""]","[""tensor-product representations"", ""compositionality"", ""neural network interpretability"", ""recurrent neural networks""]","Recurrent neural networks (RNNs) can learn continuous vector representations of symbolic structures such as sequences and sentences; these representations often exhibit linear regularities (analogies). Such regularities motivate our hypothesis that RNNs that show such regularities implicitly compile symbolic structures into tensor product representations (TPRs; Smolensky, 1990), which additively combine tensor products of vectors representing roles (e.g., sequence positions) and vectors representing fillers (e.g., particular words). To test this hypothesis, we introduce Tensor Product Decomposition Networks (TPDNs), which use TPRs to approximate existing vector representations. We demonstrate using synthetic data that TPDNs can successfully approximate linear and tree-based RNN autoencoder representations, suggesting that these representations exhibit interpretable compositional structure; we explore the settings that lead RNNs to induce such structure-sensitive representations. By contrast, further TPDN experiments show that the representations of four models trained to encode naturally-occurring sentences can be largely approximated with a bag of words, with only marginal improvements from more sophisticated structures. 
We conclude that TPDNs provide a powerful method for interpreting vector representations, and that standard RNNs can induce compositional sequence representations that are remarkably well approximated by TPRs; at the same time, existing training tasks for sentence representation learning may not be sufficient for inducing robust structural representations",/pdf/012572946d1f31cec4aa3efcd75f48b5c36e58c9.pdf,ICLR,2019,"RNNs implicitly implement tensor-product representations, a principled and interpretable method for representing symbolic structures in continuous space." +SkgCV205tQ,Hkebfr0qKm,1538090000000.0,1545360000000.0,1498,Accelerating first order optimization algorithms,"[""nyamen_tato.ange_adrienne@courrier.uqam.ca"", ""nkambou.roger@uqam.ca""]","[""Ange tato"", ""Roger nkambou""]","[""Optimization"", ""Optimizer"", ""Adam"", ""Gradient Descent""]","There exist several stochastic optimization algorithms. However in most cases, it is difficult to tell for a particular problem which will be the best optimizer to choose as each of them are good. Thus, we present a simple and intuitive technique, when applied to first order optimization algorithms, is able to improve the speed of convergence and reaches a better minimum for the loss function compared to the original algorithms. The proposed solution modifies the update rule, based on the variation of the direction of the gradient during training. We conducted several tests with Adam and AMSGrad on two different datasets. 
The preliminary results show that the proposed technique improves the performance of existing optimization algorithms and works well in practice.",/pdf/916f8b0b2e74846fbd21f36c7c9e0d6ebfcf339d.pdf,ICLR,2019, +SkglVlSFPS,BJlLTrxYDB,1569440000000.0,1577170000000.0,2236,Uncertainty - sensitive learning and planning with ensembles,"[""pmilos@mimuw.edu.pl"", ""lukasz.kucinski@gmail.com"", ""konrad.czechowski@gmail.com"", ""p.kozakowski@mimuw.edu.pl"", ""maciej.klimek@gmail.com""]","[""Piotr Mi\u0142o\u015b"", ""\u0141ukasz Kuci\u0144ski"", ""Konrad Czechowski"", ""Piotr Kozakowski"", ""Maciej Klimek""]","[""deep reinfocement learning"", ""mcts"", ""ensembles"", ""uncertainty""]","We propose a reinforcement learning framework for discrete environments in which an agent optimizes its behavior on two timescales. For the short one, it uses tree search methods to perform tactical decisions. The long strategic level is handled with an ensemble of value functions learned using $TD$-like backups. Combining these two techniques brings synergies. The planning module performs \textit{what-if} analysis allowing to avoid short-term pitfalls and boost backups of the value function. Notably, our method performs well in environments with sparse rewards where standard $TD(1)$ backups fail. On the other hand, the value functions compensate for inherent short-sightedness of planning. Importantly, we use ensembles to measure the epistemic uncertainty of value functions. This serves two purposes: a) it stabilizes planning, b) it guides exploration. + +We evaluate our methods on discrete environments with sparse rewards: the Deep sea chain environment, toy Montezuma's Revenge, and Sokoban. 
In all the cases, we obtain speed-up of learning and boost to the final performance.",/pdf/adbfde55dab1f41a282f6358c7e859af2f7707b7.pdf,ICLR,2020, +6FtFPKw8aLj,lrR7I-j5qs2,1601310000000.0,1614990000000.0,3089,Systematic Analysis of Cluster Similarity Indices: How to Validate Validation Measures,"[""~Martijn_G\u00f6sgens1"", ""~Liudmila_Prokhorenkova1"", ""~Aleksei_Tikhonov1""]","[""Martijn G\u00f6sgens"", ""Liudmila Prokhorenkova"", ""Aleksei Tikhonov""]","[""cluster similarity indices"", ""cluster validation"", ""clustering"", ""community detection"", ""constant baseline""]","There are many cluster similarity indices used to evaluate clustering algorithms, and choosing the best one for a particular task remains an open problem. We demonstrate that this problem is crucial: there are many disagreements among the indices, these disagreements do affect which algorithms are chosen in applications, and this can lead to degraded performance in real-world systems. We propose a theoretical solution to this problem: we develop a list of desirable properties and theoretically verify which indices satisfy them. This allows for making an informed choice: given a particular application, one can first make a selection of properties that are desirable for a given application and then identify indices satisfying these. We observe that many popular indices have significant drawbacks. Instead, we advocate using other ones that are not so widely adopted but have beneficial properties.",/pdf/3db791b30884341d620dc33001fb7ac836503524.pdf,ICLR,2021,Provide a systematic theoretical analysis of cluster similarity indices: define a number of properties that are desirable across many applications and check them for a number of known indices. 
+BJeRg205Fm,HJeJL44PKQ,1538090000000.0,1545360000000.0,1129,"Neural Network Regression with Beta, Dirichlet, and Dirichlet-Multinomial Outputs","[""peter.sadowski@hawaii.edu"", ""pfbaldi@ics.uci.edu""]","[""Peter Sadowski"", ""Pierre Baldi""]","[""regression"", ""uncertainty"", ""deep learning""]","We propose a method for quantifying uncertainty in neural network regression models when the targets are real values on a $d$-dimensional simplex, such as probabilities. We show that each target can be modeled as a sample from a Dirichlet distribution, where the parameters of the Dirichlet are provided by the output of a neural network, and that the combined model can be trained using the gradient of the data likelihood. This approach provides interpretable predictions in the form of multidimensional distributions, rather than point estimates, from which one can obtain confidence intervals or quantify risk in decision making. Furthermore, we show that the same approach can be used to model targets in the form of empirical counts as samples from the Dirichlet-multinomial compound distribution. In experiments, we verify that our approach provides these benefits without harming the performance of the point estimate predictions on two diverse applications: (1) distilling deep convolutional networks trained on CIFAR-100, and (2) predicting the location of particle collisions in the XENON1T Dark Matter detector.",/pdf/307f854d16a91d35691110152d1995c9dfb8a767.pdf,ICLR,2019,Neural network regression should use Dirichlet output distribution when targets are probabilities in order to quantify uncertainty of predictions. 
+HyWWpw5ex,,1478290000000.0,1484350000000.0,439,Recurrent Coevolutionary Feature Embedding Processes for Recommendation,"[""hanjundai@gatech.edu"", ""yichen.wang@gatech.edu"", ""rstrivedi@gatech.edu"", ""lsong@cc.gatech.edu""]","[""Hanjun Dai*"", ""Yichen Wang*"", ""Rakshit Trivedi"", ""Le Song""]","[""Deep learning"", ""Applications""]","Recommender systems often use latent features to explain the behaviors of users and capture the properties of items. As users interact with different items over time, user and item features can influence each other, evolve and co-evolve over time. To accurately capture the fine grained nonlinear coevolution of these features, we propose a recurrent coevolutionary feature embedding process model, which combines recurrent neural network (RNN) with a multi-dimensional point process model. The RNN learns a nonlinear representation of user and item embeddings which take into account mutual influence between user and item features, and the feature evolution over time. We also develop an efficient stochastic gradient algorithm for learning parameters. Experiments on diverse real-world datasets demonstrate significant improvements in user behavior prediction compared to state-of-the-arts. ",/pdf/cec0f6a192463f414ca58a4d62732dba426b70b9.pdf,ICLR,2017,"Our work combines recurrent neural network with point process models for recommendation, which captures the co-evolution nature of users' and items' latent features." +rJfUCoR5KX,rkxf5YPqYX,1538090000000.0,1550790000000.0,899,An Empirical study of Binary Neural Networks' Optimisation,"[""milad.alizadeh@cs.ox.ac.uk"", ""javier.fernandezmarques@cs.ox.ac.uk"", ""nicholas.lane@cs.ox.ac.uk"", ""yarin.gal@cs.ox.ac.uk""]","[""Milad Alizadeh"", ""Javier Fern\u00e1ndez-Marqu\u00e9s"", ""Nicholas D. 
Lane"", ""Yarin Gal""]","[""binary neural networks"", ""quantized neural networks"", ""straight-through-estimator""]","Binary neural networks using the Straight-Through-Estimator (STE) have been shown to achieve state-of-the-art results, but their training process is not well-founded. This is due to the discrepancy between the evaluated function in the forward path, and the weight updates in the back-propagation, updates which do not correspond to gradients of the forward path. Efficient convergence and accuracy of binary models often rely on careful fine-tuning and various ad-hoc techniques. In this work, we empirically identify and study the effectiveness of the various ad-hoc techniques commonly used in the literature, providing best-practices for efficient training of binary models. We show that adapting learning rates using second moment methods is crucial for the successful use of the STE, and that other optimisers can easily get stuck in local minima. We also find that many of the commonly employed tricks are only effective towards the end of the training, with these methods making early stages of the training considerably slower. Our analysis disambiguates necessary from unnecessary ad-hoc techniques for training of binary neural networks, paving the way for future development of solid theoretical foundations for these. Our newly-found insights further lead to new procedures which make training of existing binary neural networks notably faster.",/pdf/e434d122f2e13a68a20323fd2b5a626d32be27fa.pdf,ICLR,2019, +XI-OJ5yyse,S5L4S8qLzau,1601310000000.0,1622480000000.0,1903,CopulaGNN: Towards Integrating Representational and Correlational Roles of Graphs in Graph Neural Networks,"[""~Jiaqi_Ma1"", ""~Bo_Chang1"", ""~Xuefei_Zhang1"", ""~Qiaozhu_Mei1""]","[""Jiaqi Ma"", ""Bo Chang"", ""Xuefei Zhang"", ""Qiaozhu Mei""]","[""Graph Neural Network"", ""Gaussian Copula"", ""Gaussian Graphical Model""]","Graph-structured data are ubiquitous. 
However, graphs encode diverse types of information and thus play different roles in data representation. In this paper, we distinguish the \textit{representational} and the \textit{correlational} roles played by the graphs in node-level prediction tasks, and we investigate how Graph Neural Network (GNN) models can effectively leverage both types of information. Conceptually, the representational information provides guidance for the model to construct better node features; while the correlational information indicates the correlation between node outcomes conditional on node features. Through a simulation study, we find that many popular GNN models are incapable of effectively utilizing the correlational information. By leveraging the idea of the copula, a principled way to describe the dependence among multivariate random variables, we offer a general solution. The proposed Copula Graph Neural Network (CopulaGNN) can take a wide range of GNN models as base models and utilize both representational and correlational information stored in the graphs. Experimental results on two types of regression tasks verify the effectiveness of the proposed method.",/pdf/598a3ed080b201ac3a97abb2d3a6340a3b4bebb6.pdf,ICLR,2021,"We distinguish the representational and the correlational information encoded by the graphs in node-level prediction tasks, and propose a novel Copula Graph Neural Network to effectively leverage both information." +B1ggosR9Ym,SygVjGl5K7,1538090000000.0,1545360000000.0,590,Using Deep Siamese Neural Networks to Speed up Natural Products Research,"[""n3robert@ucsd.edu"", ""poornavsargoor@gmail.com"", ""vthanvan@eng.ucsd.edu"", ""s2ravich@eng.ucsd.edu"", ""beowulf.zc@gmail.com"", ""wgerwick@ucsd.edu"", ""gary@ucsd.edu""]","[""Nicholas Roberts"", ""Poornav S. Purushothama"", ""Vishal T. Vasudevan"", ""Siddarth Ravichandran"", ""Chen Zhang"", ""William H. Gerwick"", ""Garrison W. 
Cottrell""]","[""clustering"", ""deep learning"", ""application"", ""chemistry"", ""natural products""]","Natural products (NPs, compounds derived from plants and animals) are an important source of novel disease treatments. A bottleneck in the search for new NPs is structure determination. One method is to use 2D Nuclear Magnetic Resonance (NMR) imaging, which indicates bonds between nuclei in the compound, and hence is the ""fingerprint"" of the compound. Computing a similarity score between 2D NMR spectra for a novel compound and a compound whose structure is known helps determine the structure of the novel compound. Standard approaches to this problem do not appear to scale to larger databases of compounds. Here we use deep convolutional Siamese networks to map NMR spectra to a cluster space, where similarity is given by the distance in the space. This approach results in an AUC score that is more than four times better than an approach using Latent Dirichlet Allocation.",/pdf/4b8bc24d0b32e3cad97d2e3a1d51fba9e6b98552.pdf,ICLR,2019,We learn a direct mapping from NMR spectra of small molecules to a molecular structure based cluster space. +vujTf_I8Kmc,Gf9xuZDuaFJ,1601310000000.0,1617520000000.0,2058,Attentional Constellation Nets for Few-Shot Learning,"[""~Weijian_Xu1"", ""~yifan_xu1"", ""~Huaijin_Wang1"", ""~Zhuowen_Tu1""]","[""Weijian Xu"", ""yifan xu"", ""Huaijin Wang"", ""Zhuowen Tu""]","[""few-shot learning"", ""constellation models""]","The success of deep convolutional neural networks builds on top of the learning of effective convolution operations, capturing a hierarchy of structured features via filtering, activation, and pooling. However, the explicit structured features, e.g. object parts, are not expressive in the existing CNN frameworks. 
In this paper, we tackle the few-shot learning problem and make an effort to enhance structured features by expanding CNNs with a constellation model, which performs cell feature clustering and encoding with a dense part representation; the relationships among the cell features are further modeled by an attention mechanism. With the additional constellation branch to increase the awareness of object parts, our method is able to attain the advantages of the CNNs while making the overall internal representations more robust in the few-shot learning setting. Our approach attains a significant improvement over the existing methods in few-shot learning on the CIFAR-FS, FC100, and mini-ImageNet benchmarks.",/pdf/4bfc13fc5e8eadc3b396aa15c6e583195e33ef5e.pdf,ICLR,2021,We tackle the few-shot learning problem by introducing an explicit cell feature clustering procedure with relation learning via self-attention. +7nfCtKep-v,R33a4bv6ENm,1601310000000.0,1614990000000.0,1465,EXPLORING VULNERABILITIES OF BERT-BASED APIS,"[""~Xuanli_He2"", ""~Lingjuan_Lyu1"", ""~Lichao_Sun1"", ""~Xiaojun_Chang3"", ""~Jun_Zhao1""]","[""Xuanli He"", ""Lingjuan Lyu"", ""Lichao Sun"", ""Xiaojun Chang"", ""Jun Zhao""]","[""BERT-based models"", ""vulnerabilities"", ""attribute inference"", ""transferability""]","Natural language processing (NLP) tasks, ranging from text classification to text +generation, have been revolutionised by pretrained BERT models. This allows +corporations to easily build powerful APIs by encapsulating fine-tuned BERT +models. These BERT-based APIs are often designed to not only provide reliable +service but also protect intellectual properties or privacy-sensitive information of +the training data. However, a series of privacy and robustness issues may still exist +when a fine-tuned BERT model is deployed as a service. In this work, we first +present an effective model extraction attack, where the adversary can practically +steal a BERT-based API (the target/victim model). 
We then demonstrate: (1) +how the extracted model can be further exploited to develop effective attribute +inference attack to expose sensitive information of the training data of the victim +model; (2) how the extracted model can lead to highly transferable adversarial +attacks against the victim model. Extensive experiments on multiple benchmark +datasets under various realistic settings validate the potential privacy and adversarial +vulnerabilities of BERT-based APIs.",/pdf/e8ffd95cd74888a30b2b184f4a6a9bcac05ac15b.pdf,ICLR,2021,We demonstrate that the extracted model can be used to enhance the sensitive attribute inference and adversarial transferability. +ryGs6iA5Km,Hyg-Tz65t7,1538090000000.0,1550860000000.0,835,How Powerful are Graph Neural Networks?,"[""keyulu@mit.edu"", ""weihuahu@stanford.edu"", ""jure@cs.stanford.edu"", ""stefje@mit.edu""]","[""Keyulu Xu*"", ""Weihua Hu*"", ""Jure Leskovec"", ""Stefanie Jegelka""]","[""graph neural networks"", ""theory"", ""deep learning"", ""representational power"", ""graph isomorphism"", ""deep multisets""]","Graph Neural Networks (GNNs) are an effective framework for representation learning of graphs. GNNs follow a neighborhood aggregation scheme, where the representation vector of a node is computed by recursively aggregating and transforming representation vectors of its neighboring nodes. Many GNN variants have been proposed and have achieved state-of-the-art results on both node and graph classification tasks. However, despite GNNs revolutionizing graph representation learning, there is limited understanding of their representational properties and limitations. Here, we present a theoretical framework for analyzing the expressive power of GNNs to capture different graph structures. Our results characterize the discriminative power of popular GNN variants, such as Graph Convolutional Networks and GraphSAGE, and show that they cannot learn to distinguish certain simple graph structures. 
We then develop a simple architecture that is provably the most expressive among the class of GNNs and is as powerful as the Weisfeiler-Lehman graph isomorphism test. We empirically validate our theoretical findings on a number of graph classification benchmarks, and demonstrate that our model achieves state-of-the-art performance.",/pdf/8150ad6eb6c1bdd6c4742ba883d742e2b2fe1421.pdf,ICLR,2019,We develop theoretical foundations for the expressive power of GNNs and design a provably most powerful GNN. +B1gn-pEKwH,rye2_EuLDS,1569440000000.0,1577170000000.0,389,"INFERENCE, PREDICTION, AND ENTROPY RATE OF CONTINUOUS-TIME, DISCRETE-EVENT PROCESSES","[""smarzen@cmc.edu"", ""chaos@cse.ucdavis.edu""]","[""Sarah Marzen"", ""James P. Crutchfield""]","[""continuous-time prediction""]","The inference of models, prediction of future symbols, and entropy rate estimation of discrete-time, discrete-event processes is well-worn ground. However, many time series are better conceptualized as continuous-time, discrete-event processes. Here, we provide new methods for inferring models, predicting future symbols, and estimating the entropy rate of continuous-time, discrete-event processes. The methods rely on an extension of Bayesian structural inference that takes advantage of neural network’s universal approximation power. Based on experiments with simple synthetic data, these new methods seem to be competitive with state-of-the-art methods for prediction and entropy rate estimation as long as the correct model is inferred.",/pdf/1df7f7913ba29e9a58ab4edd867cd69590b4f2a6.pdf,ICLR,2020,"A new method for inferring a model of, estimating the entropy rate of, and predicting continuous-time, discrete-event processes." 
+rkHywl-A-,Sk4yvlZCW,1509130000000.0,1519330000000.0,595,Learning Robust Rewards with Adverserial Inverse Reinforcement Learning,"[""justinjfu@eecs.berkeley.edu"", ""katieluo@berkeley.edu"", ""svlevine@eecs.berkeley.edu""]","[""Justin Fu"", ""Katie Luo"", ""Sergey Levine""]","[""inverse reinforcement learning"", ""deep reinforcement learning""]","Reinforcement learning provides a powerful and general framework for decision +making and control, but its application in practice is often hindered by the need +for extensive feature and reward engineering. Deep reinforcement learning methods +can remove the need for explicit engineering of policy or value features, but +still require a manually specified reward function. Inverse reinforcement learning +holds the promise of automatic reward acquisition, but has proven exceptionally +difficult to apply to large, high-dimensional problems with unknown dynamics. In +this work, we propose AIRL, a practical and scalable inverse reinforcement learning +algorithm based on an adversarial reward learning formulation that is competitive +with direct imitation learning algorithms. Additionally, we show that AIRL is +able to recover portable reward functions that are robust to changes in dynamics, +enabling us to learn policies even under significant variation in the environment +seen during training.",/pdf/9c208c6225b248177307013ef386488d07c0b34e.pdf,ICLR,2018,"We propose an adversarial inverse reinforcement learning algorithm capable of learning reward functions which can transfer to new, unseen environments." 
+B1esx6EYvr,ryxR4F4Lwr,1569440000000.0,1583910000000.0,351,"A critical analysis of self-supervision, or what we can learn from a single image","[""yuki@robots.ox.ac.uk"", ""chrisr@robots.ox.ac.uk"", ""vedaldi@robots.ox.ac.uk""]","[""Asano YM."", ""Rupprecht C."", ""Vedaldi A.""]","[""self-supervision"", ""feature representation learning"", ""CNN""]","We look critically at popular self-supervision techniques for learning deep convolutional neural networks without manual labels. We show that three different and representative methods, BiGAN, RotNet and DeepCluster, can learn the first few layers of a convolutional network from a single image as well as using millions of images and manual labels, provided that strong data augmentation is used. However, for deeper layers the gap with manual supervision cannot be closed even if millions of unlabelled images are used for training. +We conclude that: +(1) the weights of the early layers of deep networks contain limited information about the statistics of natural images, that +(2) such low-level statistics can be learned through self-supervision just as well as through strong supervision, and that +(3) the low-level statistics can be captured via synthetic transformations instead of using a large image dataset.",/pdf/83de939dbc3579dae2d5459efeb3e0f683c310d2.pdf,ICLR,2020,We evaluate self-supervised feature learning methods and find that with sufficient data augmentation early layers can be learned using just one image. This is informative about self-supervision and the role of augmentations. +Bk8N0RLxx,,1478060000000.0,1478060000000.0,40,Vocabulary Selection Strategies for Neural Machine Translation,"[""gurvan.lhostis@polytechnique.edu"", ""grangier@fb.com"", ""michaelauli@fb.com""]","[""Gurvan L'Hostis"", ""David Grangier"", ""Michael Auli""]","[""Natural language processing""]",Classical translation models constrain the space of possible outputs by selecting a subset of translation rules based on the input sentence. 
Recent work on improving the efficiency of neural translation models adopted a similar strategy by restricting the output vocabulary to a subset of likely candidates given the source. In this paper we experiment with context and embedding-based selection methods and extend previous work by examining speed and accuracy trade-offs in more detail. We show that decoding time on CPUs can be reduced by up to 90% and training time by 25% on the WMT15 English-German and WMT16 English-Romanian tasks at the same or only negligible change in accuracy. This brings the time to decode with a state of the art neural translation system to just over 140 words per seconds on a single CPU core for English-German.,/pdf/7227c1343250912077aa3080f3198bfc6e10d3ba.pdf,ICLR,2017,Neural machine translation can reach same accuracy with a 10x speedup by pruning the vocabulary prior to decoding. +trPMYEn1FCX,mVImz0-gFe,1601760000000.0,1614990000000.0,3828,GENERATIVE MODEL-ENHANCED HUMAN MOTION PREDICTION,"[""ucabab6@ucl.ac.uk"", ""rrg27@cam.ac.uk"", ""r.gray@ucl.ac.uk"", ""ashwani.jha@ucl.ac.uk"", ""p.nachev@ucl.ac.uk""]","[""Anthony Bourached"", ""Ryan-Rhys Griffiths"", ""Robert Gray"", ""Ashwani Jha"", ""Parashkev Nachev""]",[],"The task of predicting human motion is complicated by the natural heterogeneity and compositionality of actions, necessitating robustness to distributional shifts as far as out-of-distribution (OoD). Here we formulate a new OoD benchmark based on the Human3.6M and CMU motion capture datasets, and introduce a hybrid framework for hardening discriminative architectures to OoD failure by augmenting them with a generative model. When applied to current state-of-the-art discriminative models, we show that the proposed approach improves OoD robustness without sacrificing in-distribution performance, and can theoretically facilitate model interpretability. 
We suggest human motion predictors ought to be constructed with OoD challenges in mind, and provide an extensible general framework for hardening diverse discriminative architectures to extreme distributional shift.",/pdf/2a7974bc06dd078e4cd1115d6cb317c92b4d2f1e.pdf,ICLR,2021, +GH7QRzUDdXG,1tvcd7LLGns,1601310000000.0,1616060000000.0,1446,A Geometric Analysis of Deep Generative Image Models and Its Applications,"[""~Binxu_Wang1"", ""~Carlos_R_Ponce1""]","[""Binxu Wang"", ""Carlos R Ponce""]","[""Deep generative model"", ""Interpretability"", ""GAN"", ""Differential Geometry"", ""Optimization"", ""Model Inversion"", ""Feature Visualization""]","Generative adversarial networks (GANs) have emerged as a powerful unsupervised method to model the statistical patterns of real-world data sets, such as natural images. These networks are trained to map random inputs in their latent space to new samples representative of the learned data. However, the structure of the latent space is hard to intuit due to its high dimensionality and the non-linearity of the generator, which limits the usefulness of the models. Understanding the latent space requires a way to identify input codes for existing real-world images (inversion), and a way to identify directions with known image transformations (interpretability). Here, we use a geometric framework to address both issues simultaneously. We develop an architecture-agnostic method to compute the Riemannian metric of the image manifold created by GANs. The eigen-decomposition of the metric isolates axes that account for different levels of image variability. An empirical analysis of several pretrained GANs shows that image variation around each position is concentrated along surprisingly few major axes (the space is highly anisotropic) and the directions that create this large variation are similar at different positions in the space (the space is homogeneous). 
We show that many of the top eigenvectors correspond to interpretable transforms in the image space, with a substantial part of eigenspace corresponding to minor transforms which could be compressed out. This geometric understanding unifies key previous results related to GAN interpretability. We show that the use of this metric allows for more efficient optimization in the latent space (e.g. GAN inversion) and facilitates unsupervised discovery of interpretable axes. Our results illustrate that defining the geometry of the GAN image manifold can serve as a general framework for understanding GANs. ",/pdf/6c6825f68048a90722aee1790a7d0f3d24ad2d55.pdf,ICLR,2021,"We developed tools to compute the metric tensor of image manifold learnt by GANs, empirically analyzed their geometry, and found this knowledge useful to GAN inversion and finding interpretable axes." +zOGdf9K8aC,uFqH4HGvcR6,1601310000000.0,1614990000000.0,879,Self-Supervised Variational Auto-Encoders,"[""johngatop@gmail.com"", ""~Jakub_Mikolaj_Tomczak1""]","[""Ioannis Gatopoulos"", ""Jakub Mikolaj Tomczak""]","[""generative modeling"", ""deep learning"", ""deep autoencoders""]","Density estimation, compression, and data generation are crucial tasks in artificial intelligence. Variational Auto-Encoders (VAEs) constitute a single framework to achieve these goals. Here, we present a novel class of generative models, called self-supervised Variational Auto-Encoder (selfVAE), that utilizes deterministic and discrete transformations of data. This class of models allows performing both conditional and unconditional sampling while simplifying the objective function. First, we use a single self-supervised transformation as a latent variable, where a transformation is either downscaling or edge detection. Next, we consider a hierarchical architecture, i.e., multiple transformations, and we show its benefits compared to the VAE. 
The flexibility of selfVAE in data reconstruction finds a particularly interesting use case in data compression tasks, where we can trade-off memory for better data quality, and vice-versa. We present the performance of our approach on three benchmark image data (Cifar10, Imagenette64, and CelebA).",/pdf/2899d264a51f9932839d3ccb59b1c5e773044e05.pdf,ICLR,2021,"We present a novel class of generative models, called self-supervised Variational Auto-Encoder, where we improve VAEs by applying deterministic and discrete transformations of data." +SkFqf0lAZ,ryO9fAxRZ,1509120000000.0,1519900000000.0,446,Memory Architectures in Recurrent Neural Network Language Models,"[""dyogatama@google.com"", ""yishu.miao@cs.ox.ac.uk"", ""melisgl@google.com"", ""lingwang@google.com"", ""akuncoro@google.com"", ""cdyer@google.com"", ""pblunsom@google.com""]","[""Dani Yogatama"", ""Yishu Miao"", ""Gabor Melis"", ""Wang Ling"", ""Adhiguna Kuncoro"", ""Chris Dyer"", ""Phil Blunsom""]",[],"We compare and analyze sequential, random access, and stack memory architectures for recurrent neural network language models. Our experiments on the Penn Treebank and Wikitext-2 datasets show that stack-based memory architectures consistently achieve the best performance in terms of held out perplexity. We also propose a generalization to existing continuous stack models (Joulin & Mikolov,2015; Grefenstette et al., 2015) to allow a variable number of pop operations more naturally that further improves performance. We further evaluate these language models in terms of their ability to capture non-local syntactic dependencies on a subject-verb agreement dataset (Linzen et al., 2016) and establish new state of the art results using memory augmented language models. 
Our results demonstrate the value of stack-structured memory for explaining the distribution of words in natural language, in line with linguistic theories claiming a context-free backbone for natural language.",/pdf/8218b6c1408de91e74934504ce1e4e756ff68024.pdf,ICLR,2018, +SJxFWRVKDr,B1xEHL4uwS,1569440000000.0,1577170000000.0,973,Characterizing Missing Information in Deep Networks Using Backpropagated Gradients,"[""gukyeong.kwon@gatech.edu"", ""mohit.p@gatech.edu"", ""cantemel@gatech.edu"", ""alregib@gatech.edu""]","[""Gukyeong Kwon"", ""Mohit Prabhushankar"", ""Dogancan Temel"", ""Ghassan AlRegib""]","[""Representation learning"", ""Missing Information in Deep Networks"", ""Gradient-based Representation""]","Deep networks face challenges of ensuring their robustness against inputs that cannot be effectively represented by information learned from training data. We attribute this vulnerability to the limitations inherent to activation-based representation. To complement the learned information from activation-based representation, we propose utilizing a gradient-based representation that explicitly focuses on missing information. In addition, we propose a directional constraint on the gradients as an objective during training to improve the characterization of missing information. To validate the effectiveness of the proposed approach, we compare the anomaly detection performance of gradient-based and activation-based representations. We show that the gradient-based representation outperforms the activation-based representation by 0.093 in CIFAR-10 and 0.361 in CURE-TSR datasets in terms of AUROC averaged over all classes. Also, we propose an anomaly detection algorithm that uses the gradient-based representation, denoted as GradCon, and validate its performance on three benchmarking datasets. 
The proposed method outperforms the majority of the state-of-the-art algorithms in CIFAR-10, MNIST, and fMNIST datasets with an average AUROC of 0.664, 0.973, and 0.934, respectively.",/pdf/4588a38c2d1d3349e8599a37694aae75797ce6f0.pdf,ICLR,2020,We propose a gradient-based representation for characterizing information that deep networks have not learned. +HJggj3VKPH,rylRt-jevH,1569440000000.0,1577170000000.0,138,On the Dynamics and Convergence of Weight Normalization for Training Neural Networks,"[""ydukler@math.ucla.edu"", ""qgu@cs.ucla.edu"", ""montufar@math.ucla.edu""]","[""Yonatan Dukler"", ""Quanquan Gu"", ""Guido Montufar""]","[""Normalization methods"", ""Weight Normalization"", ""Convergence Theory""]","We present a proof of convergence for ReLU networks trained with weight normalization. In the analysis, we consider over-parameterized 2-layer ReLU networks initialized at random and trained with batch gradient descent and a fixed step size. The proof builds on recent theoretical works that bound the trajectory of parameters from their initialization and monitor the network predictions via the evolution of a ''neural tangent kernel'' (Jacot et al. 2018). We discover that training with weight normalization decomposes such a kernel via the so called ''length-direction decoupling''. This in turn leads to two convergence regimes and can rigorously explain the utility of WeightNorm. From the modified convergence we make a few curious observations including a natural form of ''lazy training'' where the direction of each weight vector remains stationary. ",/pdf/c159ea63fe613799dcfd3259bad9f16ad825ded1.pdf,ICLR,2020,We prove ReLU networks trained with weight normalization converge and analyze distinct behavior of different convergence regimes. 
+Syg7VaNYPB,rkghk3zvPS,1569440000000.0,1577170000000.0,481,Generative Latent Flow,"[""zxiao@uchicago.edu"", ""yanq@uchicago.edu"", ""amit@marx.uchicago.edu""]","[""Zhisheng Xiao"", ""Qing Yan"", ""Yali Amit""]","[""Generative Model"", ""Auto-encoder"", ""Normalizing Flow""]","In this work, we propose the Generative Latent Flow (GLF), an algorithm for generative modeling of the data distribution. GLF uses an Auto-encoder (AE) to learn latent representations of the data, and a normalizing flow to map the distribution of the latent variables to that of simple i.i.d noise. In contrast to some other Auto-encoder based generative models, which use various regularizers that encourage the encoded latent distribution to match the prior distribution, our model explicitly constructs a mapping between these two distributions, leading to better density matching while avoiding over regularizing the latent variables. We compare our model with several related techniques, and show that it has many relative advantages including fast convergence, single stage training and minimal reconstruction trade-off. We also study the relationship between our model and its stochastic counterpart, and show that our model can be viewed as a vanishing noise limit of VAEs with flow prior. Quantitatively, under standardized evaluations, our method achieves state-of-the-art sample quality and diversity among AE based models on commonly used datasets, and is competitive with GANs' benchmarks. ",/pdf/0f05c66caacf515cab4ab7dae0d957f3bbc2c56b.pdf,ICLR,2020,"We propose a generative model that combines deterministic Auto-encoders and normalizing flows, and we show that our model's sample quality greatly outperforms that of other AE based generative models." 
+SylUzpNFDS,r1gVZlsIDH,1569440000000.0,1577170000000.0,414,SoftLoc: Robust Temporal Localization under Label Misalignment,"[""schroeterj1@cardiff.ac.uk"", ""sidorovk@cardiff.ac.uk"", ""marshallad@cardiff.ac.uk""]","[""Julien Schroeter"", ""Kirill Sidorov"", ""Dave Marshall""]","[""deep learning"", ""temporal localization"", ""robustness"", ""label misalignment"", ""music"", ""time series""]","This work addresses the long-standing problem of robust event localization in the presence of temporally misaligned labels in the training data. We propose a novel versatile loss function that generalizes a number of training regimes from standard fully-supervised cross-entropy to count-based weakly-supervised learning. Unlike classical models which are constrained to strictly fit the annotations during training, our soft localization learning approach relaxes the reliance on the exact position of labels instead. Training with this new loss function exhibits strong robustness to temporal misalignment of labels, thus alleviating the burden of precise annotation of temporal sequences. We demonstrate state-of-the-art performance against standard benchmarks in a number of challenging experiments and further show that robustness to label noise is not achieved at the expense of raw performance. ",/pdf/f57c2a79658f39795824539baf23a711eca158e3.pdf,ICLR,2020,This work introduces a novel loss function for the robust training of temporal localization DNN in the presence of misaligned labels. 
+eHg0cXYigrT,NiLf2o2R0iQ,1601310000000.0,1614990000000.0,3025,Conditional Generative Modeling for De Novo Hierarchical Multi-Label Functional Protein Design,"[""~Tim_Kucera1"", ""karsten.borgwardt@bsse.ethz.ch"", ""~Matteo_Togninalli1"", ""~Laetitia_Papaxanthos2""]","[""Tim Kucera"", ""Karsten Michael Borgwardt"", ""Matteo Togninalli"", ""Laetitia Papaxanthos""]","[""protein design"", ""conditional generative adversarial networks"", ""gene ontology"", ""hierarchical multi-label"", ""GO"", ""GAN""]","The availability of vast protein sequence information and rich functional annotations thereof has a large potential for protein design applications in biomedicine and synthetic biology. To this date, there exists no method for the general-purpose design of proteins without any prior knowledge about the protein of interest, such as costly and rare structure information or seed sequence fragments. However, the Gene Ontology (GO) database provides information about the hierarchical organisation of protein functions, and thus could inform generative models about the underlying complex sequence-function relationships, replacing the need for structural data. We therefore propose to use conditional generative adversarial networks (cGANs) on the task of fast de novo hierarchical multi-label protein design. We generate protein sequences exhibiting properties of a large set of molecular functions extracted from the GO database, using a single model and without any prior information. We shed light on efficient conditioning mechanisms and adapted network architectures thanks to a thorough hyperparameter selection process and analysis. We further provide statistically- and biologically-driven evaluation measures for generative models in the context of protein design to assess the quality of the generated sequences and facilitate progress in the field. 
We show that our proposed model, ProteoGAN, outperforms several baselines when designing proteins given a functional label and generates well-formed sequences.",/pdf/9e789dd8d7baf0d6f20fe0ccae7d966bc5993e73.pdf,ICLR,2021,We develop a conditional Generative Adversarial Network for sequence-based protein design with hierarchical multi-label gene ontology annotations. +BkeqO7x0-,Byy5dXxA-,1509080000000.0,1519410000000.0,231,Unsupervised Cipher Cracking Using Discrete GANs,"[""aidan.n.gomez@gmail.com"", ""huang@cs.toronto.edu"", ""ivan@for.ai"", ""bryan@for.ai"", ""osama@for.ai"", ""lukaszkaiser@google.com""]","[""Aidan N. Gomez"", ""Sicong Huang"", ""Ivan Zhang"", ""Bryan M. Li"", ""Muhammad Osama"", ""Lukasz Kaiser""]",[],"This work details CipherGAN, an architecture inspired by CycleGAN used for inferring the underlying cipher mapping given banks of unpaired ciphertext and plaintext. We demonstrate that CipherGAN is capable of cracking language data enciphered using shift and Vigenere ciphers to a high degree of fidelity and for vocabularies much larger than previously achieved. We present how CycleGAN can be made compatible with discrete data and train in a stable way. We then prove that the technique used in CipherGAN avoids the common problem of uninformative discrimination associated with GANs applied to discrete data. +",/pdf/bd215d841896ff3b1a3845369249f90df431b666.pdf,ICLR,2018, +HP-tcf48fT,OtvdSIipEWZ,1601310000000.0,1614990000000.0,370,Learning to Search for Fast Maximum Common Subgraph Detection,"[""~Yunsheng_Bai1"", ""~Derek_Qiang_Xu2"", ""~Yizhou_Sun1"", ""~Wei_Wang13""]","[""Yunsheng Bai"", ""Derek Qiang Xu"", ""Yizhou Sun"", ""Wei Wang""]","[""graph matching"", ""maximum common subgraph"", ""graph neural network"", ""reinforcement learning"", ""search""]","Detecting the Maximum Common Subgraph (MCS) between two input graphs is fundamental for applications in biomedical analysis, malware detection, cloud computing, etc. 
This is especially important in the task of drug design, where the successful extraction of common substructures in compounds can reduce the number of experiments needed to be conducted by humans. However, MCS computation is NP-hard, and state-of-the-art MCS solvers rely on heuristics in search which in practice cannot find good solutions for large graph pairs under a limited search budget. Here we propose GLSearch, a Graph Neural Network based model for MCS detection, which learns to search. Our model uses a state-of-the-art branch and bound algorithm as the backbone search algorithm to extract subgraphs by selecting one node pair at a time. In order to make better node selection decisions at each step, we replace the node selection heuristics with a novel task-specific Deep Q-Network (DQN), allowing the search process to find larger common subgraphs faster. To enhance the training of DQN, we leverage the search process to provide supervision in a pre-training stage and guide our agent during an imitation learning stage. Therefore, our framework allows search and reinforcement learning to mutually benefit each other. Experiments on synthetic and real-world large graph pairs demonstrate that our model outperforms state-of-the-art MCS solvers and neural graph matching network models.",/pdf/b61292c3947145f53e08add9e351a792df0e5d0f.pdf,ICLR,2021,We design a fast RL based approach to maximum common subgraph detection. 
+u_bGm5lrm72,Btw92pL2BhO,1601310000000.0,1614990000000.0,2386,DIET-SNN: A Low-Latency Spiking Neural Network with Direct Input Encoding & Leakage and Threshold Optimization,"[""~Nitin_Rathi1"", ""~Kaushik_Roy1""]","[""Nitin Rathi"", ""Kaushik Roy""]","[""Spiking neural networks"", ""threshold optimization"", ""leak optimization"", ""input encoding"", ""deep convolutional networks""]","Bio-inspired spiking neural networks (SNNs), operating with asynchronous binary signals (or spikes) distributed over time, can potentially lead to greater computational efficiency on event-driven hardware. The state-of-the-art SNNs suffer from high inference latency, resulting from inefficient input encoding, and sub-optimal settings of the neuron parameters (firing threshold, and membrane leak). We propose DIET-SNN, a low latency deep spiking network that is trained with gradient descent to optimize the membrane leak and the firing threshold along with other network parameters (weights). The membrane leak and threshold for each layer of the SNN are optimized with end-to-end backpropagation to achieve competitive accuracy at reduced latency. The analog pixel values of an image are directly applied to the input layer of DIET-SNN without the need to convert to spike-train. The first convolutional layer is trained to convert inputs into spikes where leaky-integrate-and-fire (LIF) neurons integrate the weighted inputs and generate an output spike when the membrane potential crosses the trained firing threshold. The trained membrane leak controls the flow of input information and attenuates irrelevant inputs to increase the activation sparsity in the convolutional and linear layers of the network. The reduced latency combined with high activation sparsity provides large improvements in computational efficiency. We evaluate DIET-SNN on image classification tasks from CIFAR and ImageNet datasets on VGG and ResNet architectures. 
We achieve top-1 accuracy of 69% with 5 timesteps (inference latency) on the ImageNet dataset with 12x less compute energy than an equivalent standard ANN. Additionally, DIET-SNN performs 20-500x faster inference compared to other state-of-the-art SNN models.",/pdf/10f4d10801bc66d08c4ebfb26bd74b975b4a26d2.pdf,ICLR,2021,Gradient based training of threshold and leak in deep spiking networks to achieve high activation sparsity and low inference latency +Hyx4knR9Ym,SygVFzS5Ym,1538090000000.0,1550890000000.0,976,Generalizable Adversarial Training via Spectral Normalization,"[""farnia@stanford.edu"", ""jessez@stanford.edu"", ""dntse@stanford.edu""]","[""Farzan Farnia"", ""Jesse Zhang"", ""David Tse""]","[""Adversarial attacks"", ""adversarial training"", ""spectral normalization"", ""generalization guarantee""]","Deep neural networks (DNNs) have set benchmarks on a wide array of supervised learning tasks. Trained DNNs, however, often lack robustness to minor adversarial perturbations to the input, which undermines their true practicality. Recent works have increased the robustness of DNNs by fitting networks using adversarially-perturbed training samples, but the improved performance can still be far below the performance seen in non-adversarial settings. A significant portion of this gap can be attributed to the decrease in generalization performance due to adversarial training. In this work, we extend the notion of margin loss to adversarial settings and bound the generalization error for DNNs trained under several well-known gradient-based attack schemes, motivating an effective regularization scheme based on spectral normalization of the DNN's weight matrices. We also provide a computationally-efficient method for normalizing the spectral norm of convolutional layers with arbitrary stride and padding schemes in deep convolutional networks. 
We evaluate the power of spectral normalization extensively on combinations of datasets, network architectures, and adversarial training schemes.",/pdf/64740bf76332942ea67a811e70429e97efa94552.pdf,ICLR,2019, +rJNpifWAb,BJC_jf-RZ,1509140000000.0,1521500000000.0,945,Flipout: Efficient Pseudo-Independent Weight Perturbations on Mini-Batches,"[""wenyemin@cs.toronto.edu"", ""pvicol@cs.toronto.edu"", ""jimmy@psi.toronto.edu"", ""trandustin@google.com"", ""rgrosse@cs.toronto.edu""]","[""Yeming Wen"", ""Paul Vicol"", ""Jimmy Ba"", ""Dustin Tran"", ""Roger Grosse""]","[""weight perturbation"", ""reparameterization gradient"", ""gradient variance reduction"", ""evolution strategies"", ""LSTM"", ""regularization"", ""optimization""]","Stochastic neural net weights are used in a variety of contexts, including regularization, Bayesian neural nets, exploration in reinforcement learning, and evolution strategies. Unfortunately, due to the large number of weights, all the examples in a mini-batch typically share the same weight perturbation, thereby limiting the variance reduction effect of large mini-batches. We introduce flipout, an efficient method for decorrelating the gradients within a mini-batch by implicitly sampling pseudo-independent weight perturbations for each example. Empirically, flipout achieves the ideal linear variance reduction for fully connected networks, convolutional networks, and RNNs. We find significant speedups in training neural networks with multiplicative Gaussian perturbations. We show that flipout is effective at regularizing LSTMs, and outperforms previous methods. 
Flipout also enables us to vectorize evolution strategies: in our experiments, a single GPU with flipout can handle the same throughput as at least 40 CPU cores using existing methods, equivalent to a factor-of-4 cost reduction on Amazon Web Services.",/pdf/ed0074b05172e2dba2583bc875e3e72782d22f12.pdf,ICLR,2018,"We introduce flipout, an efficient method for decorrelating the gradients computed by stochastic neural net weights within a mini-batch by implicitly sampling pseudo-independent weight perturbations for each example." +H1V4QhAqYQ,SyeJpwEqtQ,1538090000000.0,1545360000000.0,1352,Augment your batch: better training with larger batches,"[""elad.hoffer@gmail.com"", ""itayhubara@gmail.com"", ""giladiniv@gmail.com"", ""daniel.soudry@gmail.com""]","[""Elad Hoffer"", ""Itay Hubara"", ""Niv Giladi"", ""Daniel Soudry""]","[""Large Batch Training"", ""Augmentation"", ""Deep Learning""]","Recently, there is regained interest in large batch training of neural networks, both of theory and practice. New insights and methods allowed certain models to be trained using large batches with no adverse impact on performance. Most works focused on accelerating wall clock training time by modifying the learning rate schedule, without introducing accuracy degradation. +We propose to use large batch training to boost accuracy and accelerate convergence by combining it with data augmentation. Our method, ""batch augmentation"", suggests using multiple instances of each sample at the same large batch. We show empirically that this simple yet effective method improves convergence and final generalization accuracy. 
We further suggest possible reasons for its success.",/pdf/6b7e1e373d3577d8a2797be714d86fc03a53ace6.pdf,ICLR,2019,Improve accuracy by large batches composed of multiple instances of each sample at the same batch +B1xfElrKPr,r1l0fUeYDr,1569440000000.0,1577170000000.0,2239,Enhancing the Transformer with explicit relational encoding for math problem solving,"[""imanol@idsia.ch"", ""paul.smolensky@gmail.com"", ""rfernand@microsoft.com"", ""jojic@microsoft.com"", ""juergen@idsia.ch"", ""jfgao@microsoft.com""]","[""Imanol Schlag"", ""Paul Smolensky"", ""Roland Fernandez"", ""Nebojsa Jojic"", ""J\u00fcrgen Schmidhuber"", ""Jianfeng Gao""]","[""Tensor Product Representation"", ""Transformer"", ""Mathematics Dataset"", ""Attention""]","We incorporate Tensor-Product Representations within the Transformer in order to better support the explicit representation of relation structure. +Our Tensor-Product Transformer (TP-Transformer) sets a new state of the art on the recently-introduced Mathematics Dataset containing 56 categories of free-form math word-problems. +The essential component of the model is a novel attention mechanism, called TP-Attention, which explicitly encodes the relations between each Transformer cell and the other cells from which values have been retrieved by attention. TP-Attention goes beyond linear combination of retrieved values, strengthening representation-building and resolving ambiguities introduced by multiple layers of regular attention. +The TP-Transformer's attention maps give better insights into how it is capable of solving the Mathematics Dataset's challenging problems. +Pretrained models and code will be made available after publication.",/pdf/ce964236c1868816c6086056643482230225d596.pdf,ICLR,2020,Our Tensor-Product Transformer sets a new state of the art on the recently-introduced Mathematics Dataset containing 56 categories of free-form math word-problems. 
+GafvgJTFkgb,fb2UM9bRUD,1601310000000.0,1614990000000.0,2334,A Technical and Normative Investigation of Social Bias Amplification,"[""~Angelina_Wang1"", ""~Olga_Russakovsky1""]","[""Angelina Wang"", ""Olga Russakovsky""]","[""bias amplification"", ""fairness"", ""societal considerations""]","The conversation around the fairness of machine learning models is growing and evolving. In this work, we focus on the issue of bias amplification: the tendency of models trained from data containing social biases to further amplify these biases. This problem is brought about by the algorithm, on top of the level of bias already present in the data. We make two main contributions regarding its measurement. First, building off of Zhao et al. (2017), we introduce and analyze a new, decoupled metric for measuring bias amplification, $\text{BiasAmp}_{\rightarrow}$, which possesses a number of attractive properties, including the ability to pinpoint the cause of bias amplification. Second, we thoroughly analyze and discuss the normative implications of this metric. We provide suggestions about its measurement by cautioning against predicting sensitive attributes, encouraging the use of confidence intervals due to fluctuations in the fairness of models across runs, and discussing what bias amplification means in the context of domains where labels either don't exist at test time or correspond to uncertain future events. Throughout this paper, we work to provide a deeply interrogative look at the technical measurement of bias amplification, guided by our normative ideas of what we want it to encompass.",/pdf/f56be31be3b87bb3bebb4ee68bbe4f7b588dd87e.pdf,ICLR,2021,"We examine bias amplification from both normative and technical perspectives, including introducing a new metric for measuring bias amplification that mitigates the shortcomings of prior work." 
+HytSvlWRZ,r1YSDlZCZ,1509130000000.0,1518730000000.0,599,Subspace Network: Deep Multi-Task Censored Regression for Modeling Neurodegenerative Diseases,"[""sunmeng2@msu.edu"", ""baytasin@msu.edu"", ""atlaswang@tamu.edu"", ""jiayuz@msu.edu""]","[""Mengying Sun"", ""Inci M. Baytas"", ""Zhangyang Wang"", ""Jiayu Zhou""]","[""subspace"", ""censor"", ""multi-task"", ""deep network""]","Over the past decade a wide spectrum of machine learning models have been developed to model the neurodegenerative diseases, associating biomarkers, especially non-intrusive neuroimaging markers, with key clinical scores measuring the cognitive status of patients. Multi-task learning (MTL) has been extensively explored in these studies to address challenges associated to high dimensionality and small cohort size. However, most existing MTL approaches are based on linear models and suffer from two major limitations: 1) they cannot explicitly consider upper/lower bounds in these clinical scores; 2) they lack the capability to capture complicated non-linear effects among the variables. In this paper, we propose the Subspace Network, an efficient deep modeling approach for non-linear multi-task censored regression. Each layer of the subspace network performs a multi-task censored regression to improve upon the predictions from the last layer via sketching a low-dimensional subspace to perform knowledge transfer among learning tasks. We show that under mild assumptions, for each layer the parametric subspace can be recovered using only one pass of training data. In addition, empirical results demonstrate that the proposed subspace network quickly picks up correct parameter subspaces, and outperforms state-of-the-arts in predicting neurodegenerative clinical scores using information in brain imaging. 
",/pdf/523f895a9b8741be3e63a328364e6cede5b42c86.pdf,ICLR,2018, +HJWHIKqgl,,1478300000000.0,1559850000000.0,494,Generative Models and Model Criticism via Optimized Maximum Mean Discrepancy,"[""dsuth@cs.ubc.ca"", ""htung@cs.cmu.edu"", ""heiko.strathmann@gmail.com"", ""soumyajitde.cse@gmail.com"", ""aramdas@berkeley.edu"", ""alex@smola.org"", ""arthur.gretton@gmail.com""]","[""Danica J. Sutherland"", ""Hsiao-Yu Tung"", ""Heiko Strathmann"", ""Soumyajit De"", ""Aaditya Ramdas"", ""Alex Smola"", ""Arthur Gretton""]","[""Unsupervised Learning""]","We propose a method to optimize the representation and distinguishability of samples from two probability distributions, by maximizing the estimated power of a statistical test based on the maximum mean discrepancy (MMD). This optimized MMD is applied to the setting of unsupervised learning by generative adversarial networks (GAN), in which a model attempts to generate realistic samples, and a discriminator attempts to tell these apart from data samples. In this context, the MMD may be used in two roles: first, as a discriminator, either directly on the samples, or on features of the samples. Second, the MMD can be used to evaluate the performance of a generative model, by testing the model’s samples against a reference data set. 
In the latter role, the optimized MMD is particularly helpful, as it gives an interpretable indication of how the model and data distributions differ, even in cases where individual model samples are not easily distinguished either by eye or by classifier.",/pdf/1451bb4600b139d6f10801e3f1fe3da99348f438.pdf,ICLR,2017,"A way to optimize the power of an MMD test, to use it for evaluating generative models and training GANs" +r1VPNiA5Fm,SylXyMGttm,1538090000000.0,1545360000000.0,11,The Universal Approximation Power of Finite-Width Deep ReLU Networks,"[""pdmytro@nari.ee.ethz.ch"", ""philipp.grohs@univie.ac.at"", ""dennis.elbraechter@univie.ac.at"", ""boelcskei@nari.ee.ethz.ch""]","[""Dmytro Perekrestenko"", ""Philipp Grohs"", ""Dennis Elbr\u00e4chter"", ""Helmut B\u00f6lcskei""]","[""rate-distortion optimality"", ""ReLU"", ""deep learning"", ""approximation theory"", ""Weierstrass function""]","We show that finite-width deep ReLU neural networks yield rate-distortion optimal approximation (Bölcskei et al., 2018) of a wide class of functions, including polynomials, windowed sinusoidal functions, one-dimensional oscillatory textures, and the Weierstrass function, a fractal function which is continuous but nowhere differentiable. Together with the recently established universal approximation result for affine function systems (Bölcskei et al., 2018), this demonstrates that deep neural networks approximate vastly different signal structures generated by the affine group, the Weyl-Heisenberg group, or through warping, and even certain fractals, all with approximation error decaying exponentially in the number of neurons. 
We also prove that in the approximation of sufficiently smooth functions finite-width deep networks require strictly fewer neurons than finite-depth wide networks.",/pdf/5afa865295c8963d1f0e2a15c252b9bc94fea857.pdf,ICLR,2019, +w_haMPbUgWb,CJTEri2EiLZ,1601310000000.0,1614990000000.0,975,Rewriter-Evaluator Framework for Neural Machine Translation,"[""~Yangming_Li1"", ""~Kaisheng_Yao2""]","[""Yangming Li"", ""Kaisheng Yao""]","[""Neural Machine Translation"", ""Post-editing mechanism"", ""Polish Mechanism"", ""Proper Termination Policy""]","Encoder-decoder architecture has been widely used in neural machine translation (NMT). A few methods have been proposed to improve it with multiple passes of decoding. However, their full potential is limited by a lack of appropriate termination policy. To address this issue, we present a novel framework, Rewriter-Evaluator. It consists of a rewriter and an evaluator. Translating a source sentence involves multiple passes. At every pass, the rewriter produces a new translation to improve the past translation and the evaluator estimates the translation quality to decide whether to terminate the rewriting process. We also propose a prioritized gradient descent (PGD) method that facilitates training the rewriter and the evaluator jointly. Though incurring multiple passes of decoding, Rewriter-Evaluator with the proposed PGD method can be trained with similar time to that of training encoder-decoder models. We apply the proposed framework to improve the general NMT models (e.g., Transformer). We conduct extensive experiments on two translation tasks, Chinese-English and English-German, and show that the proposed framework notably improves the performances of NMT models and significantly outperforms previous baselines.",/pdf/58062c7affd6811932a72ea115bdd6f6618842ae.pdf,ICLR,2021,"We propose a novel framework, Rewriter-Evaluator, that achieves proper termination policy for multi-pass decoding." 
+Hkx6hANtwH,HyeDpoY_DS,1569440000000.0,1583910000000.0,1373,LambdaNet: Probabilistic Type Inference using Graph Neural Networks,"[""jiayi@cs.utexas.edu"", ""maruth@utexas.edu"", ""gdurrett@cs.utexas.edu"", ""isil@cs.utexas.edu""]","[""Jiayi Wei"", ""Maruth Goyal"", ""Greg Durrett"", ""Isil Dillig""]","[""Type inference"", ""Graph neural network"", ""Programming languages"", ""Pointer network""]","As gradual typing becomes increasingly popular in languages like Python and TypeScript, there is a growing need to infer type annotations automatically. While type annotations help with tasks like code completion and static error catching, these annotations cannot be fully inferred by compilers and are tedious to annotate by hand. This paper proposes a probabilistic type inference scheme for TypeScript based on a graph neural network. Our approach first uses lightweight source code analysis to generate a program abstraction called a type dependency graph, which links type variables with logical constraints as well as name and usage information. Given this program abstraction, we then use a graph neural network to propagate information between related type variables and eventually make type predictions. Our neural architecture can predict both standard types, like number or string, as well as user-defined types that have not been encountered during training. Our experimental results show that our approach outperforms prior work in this space by 14% (absolute) on library types, while having the ability to make type predictions that are out of scope for existing techniques. ",/pdf/86657ef3180cce2cdbdc9e6f22aeef474f4d60de.pdf,ICLR,2020,"We have presented LambdaNet, a neural architecture for type inference that combines the strength of explicit program analysis with graph neural networks." 
+Sy2ogebAW,Sy3iglZAW,1509130000000.0,1519420000000.0,553,Unsupervised Neural Machine Translation,"[""mikel.artetxe@ehu.eus"", ""gorka.labaka@ehu.eus"", ""e.agirre@ehu.eus"", ""kyunghyun.cho@nyu.edu""]","[""Mikel Artetxe"", ""Gorka Labaka"", ""Eneko Agirre"", ""Kyunghyun Cho""]","[""neural machine translation"", ""unsupervised learning""]","In spite of the recent success of neural machine translation (NMT) in standard benchmarks, the lack of large parallel corpora poses a major practical problem for many language pairs. There have been several proposals to alleviate this issue with, for instance, triangulation and semi-supervised learning techniques, but they still require a strong cross-lingual signal. In this work, we completely remove the need of parallel data and propose a novel method to train an NMT system in a completely unsupervised manner, relying on nothing but monolingual corpora. Our model builds upon the recent work on unsupervised embedding mappings, and consists of a slightly modified attentional encoder-decoder model that can be trained on monolingual corpora alone using a combination of denoising and backtranslation. Despite the simplicity of the approach, our system obtains 15.56 and 10.21 BLEU points in WMT 2014 French-to-English and German-to-English translation. The model can also profit from small parallel corpora, and attains 21.81 and 15.24 points when combined with 100,000 parallel sentences, respectively. 
Our implementation is released as an open source project.",/pdf/a51f2353f6b78b314e36cd3fda7c9609c7158590.pdf,ICLR,2018,"We introduce the first successful method to train neural machine translation in an unsupervised manner, using nothing but monolingual corpora" +Sk03Yi10Z,ryThKjyA-,1509040000000.0,1518730000000.0,159,An Ensemble of Retrieval-Based and Generation-Based Human-Computer Conversation Systems.,"[""songyiping@pku.edu.cn"", ""ruiyan@pku.edu.cn"", ""chengte@mail.ncku.edu.tw"", ""nie@iro.umontreal.ca"", ""mzhang_cs@pku.edu.cn"", ""zhaody@pku.edu.cn""]","[""Yiping Song"", ""Rui Yan"", ""Cheng-Te Li"", ""Jian-Yun Nie"", ""Ming Zhang"", ""Dongyan Zhao""]","[""conversation systems"", ""retrieval method"", ""generation method""]","Human-computer conversation systems have attracted much attention in Natural Language Processing. Conversation systems can be roughly divided into two categories: retrieval-based and generation-based systems. Retrieval systems search a user-issued utterance (namely a query) in a large conversational repository and return a reply that best matches the query. Generative approaches synthesize new replies. Both ways have certain advantages but suffer from their own disadvantages. We propose a novel ensemble of retrieval-based and generation-based conversation system. The retrieved candidates, in addition to the original query, are fed to a reply generator via a neural network, so that the model is aware of more information. The generated reply together with the retrieved ones then participates in a re-ranking process to find the final reply to output. Experimental results show that such an ensemble system outperforms each single module by a large margin. +",/pdf/79ba1e1fef02d9584a289be8067908ef7c1d4525.pdf,ICLR,2018,A novel ensemble of retrieval-based and generation-based for open-domain conversation systems. 
+CLnj31GZ4cI,PlbwKmM-ID,1601310000000.0,1614990000000.0,1534,K-Adapter: Infusing Knowledge into Pre-Trained Models with Adapters,"[""~Ruize_Wang1"", ""~Duyu_Tang1"", ""~Nan_Duan1"", ""~zhongyu_wei1"", ""~Xuanjing_Huang1"", ""jianshuj@microsoft.com"", ""gucao@microsoft.com"", ""djiang@microsoft.com"", ""~Ming_Zhou1""]","[""Ruize Wang"", ""Duyu Tang"", ""Nan Duan"", ""zhongyu wei"", ""Xuanjing Huang"", ""Jianshu Ji"", ""Guihong Cao"", ""Daxin Jiang"", ""Ming Zhou""]",[],"We study the problem of injecting knowledge into large pre-trained models like BERT and RoBERTa. Existing methods typically update the original parameters of pre-trained models when injecting knowledge. However, when multiple kinds of knowledge are injected, they may suffer from catastrophic forgetting. To address this, we propose K-Adapter, which remains the original parameters of the pre-trained model fixed and supports continual knowledge infusion. Taking RoBERTa as the pre-trained model, K-Adapter has a neural adapter for each kind of infused knowledge, like a plug-in connected to RoBERTa. There is no information flow between different adapters, thus different adapters are efficiently trained in a distributed way. We inject two kinds of knowledge, including factual knowledge obtained from automatically aligned text-triplets on Wikipedia and Wikidata, and linguistic knowledge obtained from dependency parsing. Results on three knowledge-driven tasks (total six datasets) including relation classification, entity typing and question answering demonstrate that each adapter improves the performance, and the combination of both adapters brings further improvements. 
Probing experiments further indicate that K-Adapter captures richer factual and commonsense knowledge than RoBERTa.",/pdf/d61bb4c56ead10012fe2e4f86b4c402af4653b6a.pdf,ICLR,2021, +BkV4VS9ll,,1478280000000.0,1480320000000.0,245,The Incredible Shrinking Neural Network: New Perspectives on Learning Representations Through The Lens of Pruning,"[""nwolfe@cs.cmu.edu"", ""adityasharma@cmu.edu"", ""drude@nt.upb.de"", ""bhiksha@cs.cmu.edu""]","[""Nikolas Wolfe"", ""Aditya Sharma"", ""Lukas Drude"", ""Bhiksha Raj""]","[""Theory"", ""Deep learning""]","How much can pruning algorithms teach us about the fundamentals of learning representations in neural networks? A lot, it turns out. Neural network model compression has become a topic of great interest in recent years, and many different techniques have been proposed to address this problem. In general, this is motivated by the idea that smaller models typically lead to better generalization. At the same time, the decision of what to prune and when to prune necessarily forces us to confront our assumptions about how neural networks actually learn to represent patterns in data. In this work we set out to test several long-held hypotheses about neural network learning representations and numerical approaches to pruning. To accomplish this we first reviewed the historical literature and derived a novel algorithm to prune whole neurons (as opposed to the traditional method of pruning weights) from optimally trained networks using a second-order Taylor method. We then set about testing the performance of our algorithm and analyzing the quality of the decisions it made. As a baseline for comparison we used a first-order Taylor method based on the Skeletonization algorithm and an exhaustive brute-force serial pruning algorithm. Our proposed algorithm worked well compared to a first-order method, but not nearly as well as the brute-force method. 
Our error analysis led us to question the validity of many widely-held assumptions behind pruning algorithms in general and the trade-offs we often make in the interest of reducing computational complexity. We discovered that there is a straightforward way, however expensive, to serially prune 40-70\% of the neurons in a trained network with minimal effect on the learning representation and without any re-training. ",/pdf/5ed60af9d389193f9e5e312b42e139e254500ea3.pdf,ICLR,2017,Pruning algorithms reveal fundamental insights into neural network learning representations +Bkel1krKPS,B1ehjoqdwB,1569440000000.0,1577170000000.0,1456,Attention on Abstract Visual Reasoning,"[""l.hahne@stud.uni-goettingen.de"", ""timo.lueddecke@phys.uni-goettingen.de"", ""worgott@gwdg.de"", ""david.kappel@phys.uni-goettingen.de""]","[""Lukas Hahne"", ""Timo L\u00fcddecke"", ""Florentin W\u00f6rg\u00f6tter"", ""David Kappel""]","[""Transformer Networks"", ""Self-Attention"", ""Wild Relation Networks"", ""Procedurally Generated Matrices""]","Attention mechanisms have been boosting the performance of deep learning models on a wide range of applications, ranging from speech understanding to program induction. However, despite experiments from psychology which suggest that attention plays an essential role in visual reasoning, the full potential of attention mechanisms has so far not been explored to solve abstract cognitive tasks on image data. In this work, we propose a hybrid network architecture, grounded on self-attention and relational reasoning. We call this new model Attention Relation Network (ARNe). ARNe combines features from the recently introduced Transformer and the Wild Relation Network (WReN). We test ARNe on the Procedurally Generated Matrices (PGMs) datasets for abstract visual reasoning. ARNe excels the WReN model on this task by 11.28 ppt. 
Relational concepts between objects are efficiently learned demanding only 35% of the training samples to surpass reported accuracy of the base line model. Our proposed hybrid model, represents an alternative on learning abstract relations using self-attention and demonstrates that the Transformer network is also well suited for abstract visual reasoning.",/pdf/0829320eaecb985379f33f6975ab658e5d090ae7.pdf,ICLR,2020,Introducing Attention Relation Network (ARNe) that combines features from WReN and Transformer Networks. +HJYQLb-RW,Syd7IbWAb,1509130000000.0,1518730000000.0,678,On the limitations of first order approximation in GAN dynamics,"[""jerryzli@mit.edu"", ""madry@mit.edu"", ""jpeebles@mit.edu"", ""ludwigs@mit.edu""]","[""Jerry Li"", ""Aleksander Madry"", ""John Peebles"", ""Ludwig Schmidt""]","[""GANs"", ""first order dynamics"", ""convergence"", ""mode collapse""]","Generative Adversarial Networks (GANs) have been proposed as an approach to learning generative models. While GANs have demonstrated promising performance on multiple vision tasks, their learning dynamics are not yet well understood, neither in theory nor in practice. In particular, the work in this domain has been focused so far only on understanding the properties of the stationary solutions that this dynamics might converge to, and of the behavior of that dynamics in this solutions’ immediate neighborhood. + +To address this issue, in this work we take a first step towards a principled study of the GAN dynamics itself. To this end, we propose a model that, on one hand, exhibits several of the common problematic convergence behaviors (e.g., vanishing gradient, mode collapse, diverging or oscillatory behavior), but on the other hand, is sufficiently simple to enable rigorous convergence analysis. 
+ +This methodology enables us to exhibit an interesting phenomena: a GAN with an optimal discriminator provably converges, while guiding the GAN training using only a first order approximation of the discriminator leads to unstable GAN dynamics and mode collapse. This suggests that such usage of the first order approximation of the discriminator, which is a de-facto standard in all the existing GAN dynamics, might be one of the factors that makes GAN training so challenging in practice. Additionally, our convergence result constitutes the first rigorous analysis of a dynamics of a concrete parametric GAN.",/pdf/e324b0acc509730c7415f803dbd18102509b85ae.pdf,ICLR,2018,"To understand GAN training, we define simple GAN dynamics, and show quantitative differences between optimal and first order updates in this model." +Sks9_ajex,,1478380000000.0,1486940000000.0,610,Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer,"[""sergey.zagoruyko@enpc.fr"", ""nikos.komodakis@enpc.fr""]","[""Sergey Zagoruyko"", ""Nikos Komodakis""]","[""Computer vision"", ""Deep learning"", ""Supervised Learning""]","Attention plays a critical role in human visual experience. Furthermore, it has recently been demonstrated that attention can also play an important role in the context of applying artificial neural networks to a variety of tasks from fields such as computer vision and NLP. In this work we show that, by properly defining attention for convolutional neural networks, we can actually use this type of information in order to significantly improve the performance of a student CNN network by forcing it to mimic the attention maps of a powerful teacher network. 
To that end, we propose several novel methods of transferring attention, showing consistent improvement across a variety of datasets and convolutional neural network architectures.",/pdf/db088e768de05a386c4b19d7832062562661d4f7.pdf,ICLR,2017, +SJgNwi09Km,HyenSUDcKX,1538090000000.0,1550820000000.0,253,Learning Latent Superstructures in Variational Autoencoders for Deep Multidimensional Clustering,"[""xlibo@cse.ust.hk"", ""zchenbb@cse.ust.hk"", ""kmpoon@eduhk.hk"", ""lzhang@cse.ust.hk""]","[""Xiaopeng Li"", ""Zhourong Chen"", ""Leonard K. M. Poon"", ""Nevin L. Zhang""]","[""latent tree model"", ""variational autoencoder"", ""deep learning"", ""latent variable model"", ""bayesian network"", ""structure learning"", ""stepwise em"", ""message passing"", ""graphical model"", ""multidimensional clustering"", ""unsupervised learning""]","We investigate a variant of variational autoencoders where there is a superstructure of discrete latent variables on top of the latent features. In general, our superstructure is a tree structure of multiple super latent variables and it is automatically learned from data. When there is only one latent variable in the superstructure, our model reduces to one that assumes the latent features to be generated from a Gaussian mixture model. We call our model the latent tree variational autoencoder (LTVAE). Whereas previous deep learning methods for clustering produce only one partition of data, LTVAE produces multiple partitions of data, each being given by one super latent variable. This is desirable because high dimensional data usually have many different natural facets and can be meaningfully partitioned in multiple ways.",/pdf/ad13cd6b4501e8ea49a21ac9a09f07dcf1c9dcea.pdf,ICLR,2019,We investigate a variant of variational autoencoders where there is a superstructure of discrete latent variables on top of the latent features. 
+T1XmO8ScKim,rkCHde57YTZE,1601310000000.0,1616050000000.0,2769,Probabilistic Numeric Convolutional Neural Networks,"[""~Marc_Anton_Finzi1"", ""~Roberto_Bondesan1"", ""~Max_Welling1""]","[""Marc Anton Finzi"", ""Roberto Bondesan"", ""Max Welling""]","[""probabilistic numerics"", ""gaussian processes"", ""discretization error"", ""pde"", ""superpixel"", ""irregularly spaced time series"", ""misssing data"", ""spatial uncertainty""]","Continuous input signals like images and time series that are irregularly sampled or have missing values are challenging for existing deep learning methods. Coherently defined feature representations must depend on the values in unobserved regions of the input. Drawing from the work in probabilistic numerics, we propose Probabilistic Numeric Convolutional Neural Networks which represent features as Gaussian processes, providing a probabilistic description of discretization error. We then define a convolutional layer as the evolution of a PDE defined on this GP, followed by a nonlinearity. This approach also naturally admits steerable equivariant convolutions under e.g. the rotation group. In experiments we show that our approach yields a $3\times$ reduction of error from the previous state of the art on the SuperPixel-MNIST dataset and competitive performance on the medical time series dataset PhysioNet2012.",/pdf/132819644044c301e530ea14a0a17e7e4d6756d7.pdf,ICLR,2021,We build a neural network which integrates internal discretization error and missing values probabilistically with GPs +BkexaxBKPB,S1gDCeZtDB,1569440000000.0,1577170000000.0,2561,Generative Adversarial Nets for Multiple Text Corpora,"[""d-klabjan@northwestern.edu"", ""baiyang@u.northwestern.edu""]","[""Diego Klabjan"", ""Baiyang Wang""]","[""GAN"", ""NLP"", ""embeddings""]","Generative adversarial nets (GANs) have been successfully applied to the artificial generation of image data. 
In terms of text data, much has been done on the artificial generation of natural language from a single corpus. We consider multiple text corpora as the input data, for which there can be two applications of GANs: (1) the creation of consistent cross-corpus word embeddings given different word embeddings per corpus; (2) the generation of robust bag-of-words document embeddings for each corpora. We demonstrate our GAN models on real-world text data sets from different corpora, and show that embeddings from both models lead to improvements in supervised learning problems.",/pdf/51a0bcd2b5ebbe1b7218b39ef4a66ae50c5fdf00.pdf,ICLR,2020,Constructing robust embeddings by means of GANs from multiple corpora +ByloIiCqYQ,BkgXmERtYX,1538090000000.0,1551160000000.0,203,Maximal Divergence Sequential Autoencoder for Binary Software Vulnerability Detection,"[""tue.le.ict@jvn.edu.vn"", ""nguyenvutuan1995@gmail.com"", ""trunglm@monash.edu"", ""dinh.phung@monash.edu"", ""paul.montague@dst.defence.gov.au"", ""olivier.devel@dst.defence.gov.au"", ""lizhen.qu@data61.csiro.au""]","[""Tue Le"", ""Tuan Nguyen"", ""Trung Le"", ""Dinh Phung"", ""Paul Montague"", ""Olivier De Vel"", ""Lizhen Qu""]","[""Vulnerabilities Detection"", ""Sequential Auto-Encoder"", ""Separable Representation""]","Due to the sharp increase in the severity of the threat imposed by software vulnerabilities, the detection of vulnerabilities in binary code has become an important concern in the software industry, such as the embedded systems industry, and in the field of computer security. However, most of the work in binary code vulnerability detection has relied on handcrafted features which are manually chosen by a select few, knowledgeable domain experts. In this paper, we attempt to alleviate this severe binary vulnerability detection bottleneck by leveraging recent advances in deep learning representations and propose the Maximal Divergence Sequential Auto-Encoder. 
In particular, latent codes representing vulnerable and non-vulnerable binaries are encouraged to be maximally divergent, while still being able to maintain crucial information from the original binaries. We conducted extensive experiments to compare and contrast our proposed methods with the baselines, and the results show that our proposed methods outperform the baselines in all performance measures of interest.",/pdf/0bf56cd54ef7cdb2f04c6e446f9d8aeef964996b.pdf,ICLR,2019,We propose a novel method named Maximal Divergence Sequential Auto-Encoder that leverages Variational AutoEncoder representation for binary code vulnerability detection. +ZVqZIA1GA_,eXSNHXbE0C,1601310000000.0,1614990000000.0,968,Deformable Capsules for Object Detection,"[""~Rodney_LaLonde1"", ""~Naji_Khosravan1"", ""~Ulas_Bagci1""]","[""Rodney LaLonde"", ""Naji Khosravan"", ""Ulas Bagci""]","[""Representation Learning"", ""Capsule Networks"", ""Object Detection""]","Capsule networks promise significant benefits over convolutional networks by storing stronger internal representations, and routing information based on the agreement between intermediate representations' projections. Despite this, their success has been mostly limited to small-scale classification datasets due to their computationally expensive nature. Recent studies have partially overcome this burden by locally-constraining the dynamic routing of features with convolutional capsules. Though memory efficient, convolutional capsules impose geometric constraints which fundamentally limit the ability of capsules to model the pose/deformation of objects. Further, they do not address the bigger memory concern of class-capsules scaling-up to bigger tasks such as detection or large-scale classification. 
In this study, we introduce deformable capsules (DeformCaps), a new capsule structure (SplitCaps), and a novel dynamic routing algorithm (SE-Routing) to balance computational efficiency with the need for modeling a large number of objects and classes. We demonstrate that the proposed methods allow capsules to efficiently scale-up to large-scale computer vision tasks for the first time, and create the first-ever capsule network for object detection in the literature. Our proposed architecture is a one-stage detection framework and obtains results on MS COCO which are on-par with state-of-the-art one-stage CNN-based methods, while producing fewer false positive detections.",/pdf/3c8bd091aa5881ae837e276aa60e78e85835e725.pdf,ICLR,2021,"Introducing deformable capsules, a new capsule detection head structure, and a novel dynamic routing algorithm makes large-scale object detection with capsule neural networks feasible for the first time in the literature." +PmVfnB0nkqr,tGPpSH4EAk,1601310000000.0,1614990000000.0,512,Autonomous Learning of Object-Centric Abstractions for High-Level Planning,"[""~Steven_James1"", ""~Benjamin_Rosman1"", ""~George_Konidaris1""]","[""Steven James"", ""Benjamin Rosman"", ""George Konidaris""]","[""reinforcement learning"", ""planning"", ""PDDL"", ""multitask"", ""transfer"", ""objects""]","We propose a method for autonomously learning an object-centric representation of a continuous and high-dimensional environment that is suitable for planning. Such representations can immediately be transferred between tasks that share the same types of objects, resulting in agents that require fewer samples to learn a model of a new task. We first demonstrate our approach on a simple domain where the agent learns a compact, lifted representation that generalises across objects. 
We then apply it to a series of Minecraft tasks to learn object-centric representations, including object types—directly from pixel data—that can be leveraged to solve new tasks quickly. The resulting learned representations enable the use of a task-level planner, resulting in an agent capable of forming complex, long-term plans with considerably fewer environment interactions.",/pdf/4c27e2c9beddabd13cf77c12a049313b36c80e30.pdf,ICLR,2021,We show how to learn an object-centric representation from pixels that can be used by a classical planner. +SJxHMaEtwB,BygdvD9IwS,1569440000000.0,1577170000000.0,411,Domain-invariant Learning using Adaptive Filter Decomposition,"[""ze.w@duke.edu"", ""xiuyuan.cheng@duke.edu"", ""guillermo.sapiro@duke.edu"", ""qiang.qiu@duke.edu""]","[""Ze Wang"", ""Xiuyuan Cheng"", ""Guillermo Sapiro"", ""Qiang Qiu""]",[],"Domain shifts are frequently encountered in real-world scenarios. In this paper, we consider the problem of domain-invariant deep learning by explicitly modeling domain shifts with only a small amount of domain-specific parameters in a Convolutional Neural Network (CNN). By exploiting the observation that a convolutional filter can be well approximated as a linear combination of a small set of basis elements, we show for the first time, both empirically and theoretically, that domain shifts can be effectively handled by decomposing a regular convolutional layer into a domain-specific basis layer and a domain-shared basis coefficient layer, while both remain convolutional. An input channel will now first convolve spatially only with each respective domain-specific basis to ``absorb"" domain variations, and then output channels are linearly combined using common basis coefficients trained to promote shared semantics across domains. We use toy examples, rigorous analysis, and real-world examples to show the framework's effectiveness in cross-domain performance and domain adaptation. 
With the proposed architecture, we need only a small set of basis elements to model each additional domain, which brings a negligible amount of additional parameters, typically a few hundred.",/pdf/b6e2874ba1ce34797ca4a7f75c35f8bf6e3ceb44.pdf,ICLR,2020, +B1G5ViAqFm,S1eqoiK_tQ,1538090000000.0,1547150000000.0,27,Convolutional Neural Networks on Non-uniform Geometrical Signals Using Euclidean Spectral Transformation,"[""chiyu.jiang@berkeley.edu"", ""dqw@berkeley.edu"", ""jingweih@stanford.edu"", ""pmarcus@me.berkeley.edu"", ""niessner@tum.de""]","[""Chiyu Max Jiang"", ""Dequan Wang"", ""Jingwei Huang"", ""Philip Marcus"", ""Matthias Niessner""]","[""Non-uniform Fourier Transform"", ""3D Learning"", ""CNN"", ""surface reconstruction""]","Convolutional Neural Networks (CNN) have been successful in processing data signals that are uniformly sampled in the spatial domain (e.g., images). However, most data signals do not natively exist on a grid, and in the process of being sampled onto a uniform physical grid suffer significant aliasing error and information loss. Moreover, signals can exist in different topological structures as, for example, points, lines, surfaces and volumes. It has been challenging to analyze signals with mixed topologies (for example, point cloud with surface mesh). To this end, we develop mathematical formulations for Non-Uniform Fourier Transforms (NUFT) to directly, and optimally, sample nonuniform data signals of different topologies defined on a simplex mesh into the spectral domain with no spatial sampling error. The spectral transform is performed in the Euclidean space, which removes the translation ambiguity from works on the graph spectrum. 
Our representation has four distinct advantages: (1) the process causes no spatial sampling error during initial sampling, (2) the generality of this approach provides a unified framework for using CNNs to analyze signals of mixed topologies, (3) it allows us to leverage state-of-the-art backbone CNN architectures for effective learning without having to design a particular architecture for a particular data structure in an ad-hoc fashion, and (4) the representation allows weighted meshes where each element has a different weight (i.e., texture) indicating local properties. We achieve good results on-par with state-of-the-art for 3D shape retrieval task, and new state-of-the-art for point cloud to surface reconstruction task.",/pdf/77e67f8e10b7d6ebf9b6c70c117154a6ede31b56.pdf,ICLR,2019,"We use non-Euclidean Fourier Transformation of shapes defined by a simplicial complex for deep learning, achieving significantly better results than point-based sampling techiques used in current 3D learning literature." +S1eq9yrYvH,SJg8Gi0dvS,1569440000000.0,1577170000000.0,1888,Subjective Reinforcement Learning for Open Complex Environments,"[""yzl18@mails.tsinghua.edu.cn"", ""ghc18@mails.tsinghua.edu.cn"", ""suxin16@mails.tsinghua.edu.cn"", ""gsq15@mails.tsinghua.edu.cn"", ""chenfeng@mail.tsinghua.edu.cn""]","[""Zhile Yang*"", ""Haichuan Gao*"", ""Xin Su"", ""Shangqi Guo"", ""Feng Chen""]","[""reinforcement learning theory"", ""subjective learning""]","Solving tasks in open environments has been one of the long-time pursuits of reinforcement learning researches. We propose that data confusion is the core underlying problem. Although there exist methods that implicitly alleviate it from different perspectives, we argue that their solutions are based on task-specific prior knowledge that is constrained to certain kinds of tasks and lacks theoretical guarantees. 
In this paper, Subjective Reinforcement Learning Framework is proposed to state the problem from a broader and systematic view, and subjective policy is proposed to represent existing related algorithms in general. Theoretical analysis is given about the conditions for the superiority of a subjective policy, and the relationship between model complexity and the overall performance. Results are further applied as guidance for algorithm designing without task-specific prior knowledge about tasks. +",/pdf/c221e2db26dd4fb27c75ce0f463293b592e8c256.pdf,ICLR,2020, +r1lkKn4KDS,Bygcov19Ir,1569440000000.0,1577170000000.0,62,Learning Reusable Options for Multi-Task Reinforcement Learning,"[""fmaxgarcia@gmail.com"", ""cnota@cs.umass.edu"", ""pthomas@cs.umass.edu""]","[""Francisco M. Garcia"", ""Chris Nota"", ""Philip S. Thomas""]","[""Reinforcement Learning"", ""Temporal Abstraction"", ""Options"", ""Multi-Task RL""]","Reinforcement learning (RL) has become an increasingly active area of research in recent years. Although there are many algorithms that allow an agent to solve tasks efficiently, they often ignore the possibility that prior experience related to the task at hand might be available. For many practical applications, it might be unfeasible for an agent to learn how to solve a task from scratch, given that it is generally a computationally expensive process; however, prior experience could be leveraged to make these problems tractable in practice. In this paper, we propose a framework for exploiting existing experience by learning reusable options. 
We show that after an agent learns policies for solving a small number of problems, we are able to use the trajectories generated from those policies to learn reusable options that allow an agent to quickly learn how to solve novel and related problems.",/pdf/33d657badde56cdeb25b94d7634ef46aac6d5444.pdf,ICLR,2020,We discover options for multi-task RL by maximizing the probability of reproducing optimal trajectories while minimizing the number of decisions needed to do so. +B1Gi6LeRZ,BJGoaLx0W,1509090000000.0,1519360000000.0,279,Learning from Between-class Examples for Deep Sound Recognition,"[""tokozume@mi.t.u-tokyo.ac.jp"", ""ushiku@mi.t.u-tokyo.ac.jp"", ""harada@mi.t.u-tokyo.ac.jp""]","[""Yuji Tokozume"", ""Yoshitaka Ushiku"", ""Tatsuya Harada""]","[""sound recognition"", ""supervised learning"", ""feature learning""]","Deep learning methods have achieved high performance in sound recognition tasks. Deciding how to feed the training data is important for further performance improvement. We propose a novel learning method for deep sound recognition: Between-Class learning (BC learning). Our strategy is to learn a discriminative feature space by recognizing the between-class sounds as between-class sounds. We generate between-class sounds by mixing two sounds belonging to different classes with a random ratio. We then input the mixed sound to the model and train the model to output the mixing ratio. The advantages of BC learning are not limited only to the increase in variation of the training data; BC learning leads to an enlargement of Fisher’s criterion in the feature space and a regularization of the positional relationship among the feature distributions of the classes. The experimental results show that BC learning improves the performance on various sound recognition networks, datasets, and data augmentation schemes, in which BC learning proves to be always beneficial. 
Furthermore, we construct a new deep sound recognition network (EnvNet-v2) and train it with BC learning. As a result, we achieved a performance surpasses the human level.",/pdf/635103060ef1a68ee4cf6f8139561a4cec5a360f.pdf,ICLR,2018,We propose an novel learning method for deep sound recognition named BC learning. +nxJ8ugF24q2,Z6TV17WESEI,1601310000000.0,1614990000000.0,498,Learning Disconnected Manifolds: Avoiding The No Gan's Land by Latent Rejection,"[""~Thibaut_Issenhuth1"", ""~Ugo_Tanielian1"", ""~David_Picard1"", ""~Jeremie_Mary1""]","[""Thibaut Issenhuth"", ""Ugo Tanielian"", ""David Picard"", ""Jeremie Mary""]",[],"Standard formulations of GANs, where a continuous function deforms a connected latent space, have been shown to be misspecified when fitting disconnected manifolds. In particular, when covering different classes of images, the generator will necessarily sample some low quality images in between the modes. Rather than modify the learning procedure, a line of works aims at improving the sampling quality from trained generators. Thus, it is now common to introduce a rejection step within the generation procedure. +Building on this, we propose to train an additional network and transform the latent space via an adversarial learning of importance weights. This idea has several advantages: 1) it provides a way to inject disconnectedness on any GAN architecture, 2) the rejection avoids going through both the generator and the discriminator saving computation time, 3) this importance weights formulation provides a principled way to estimate the Wasserstein's distance to the true distribution, enabling its minimization. 
We demonstrate the effectiveness of our method on different datasets, both synthetic and high dimensional, and stress its superiority on highly disconnected data.",/pdf/df5fbdaa6e47d9632abbaf5b63bc74befa0a784d.pdf,ICLR,2021, +BkfhZnC9t7,BkxXl7hcYm,1538090000000.0,1545360000000.0,1210,Zero-shot Learning for Speech Recognition with Universal Phonetic Model,"[""xinjianl@andrew.cmu.edu"", ""sdalmia@cs.cmu.edu"", ""dmortens@cs.cmu.edu"", ""fmetze@cs.cmu.edu"", ""awb@cs.cmu.edu""]","[""Xinjian Li"", ""Siddharth Dalmia"", ""David R. Mortensen"", ""Florian Metze"", ""Alan W Black""]","[""zero-shot learning"", ""speech recognition"", ""acoustic modeling""]","There are more than 7,000 languages in the world, but due to the lack of training sets, only a small number of them have speech recognition systems. Multilingual speech recognition provides a solution if at least some audio training data is available. Often, however, phoneme inventories differ between the training languages and the target language, making this approach infeasible. In this work, we address the problem of building an acoustic model for languages with zero audio resources. Our model is able to recognize unseen phonemes in the target language, if only a small text corpus is available. We adopt the idea of zero-shot learning, and decompose phonemes into corresponding phonetic attributes such as vowel and consonant. Instead of predicting phonemes directly, we first predict distributions over phonetic attributes, and then compute phoneme distributions with a customized acoustic model. 
We extensively evaluate our English-trained model on 20 unseen languages, and find that on average, it achieves 9.9% better phone error rate over a traditional CTC based acoustic model trained on English.",/pdf/5e9a77e615c3c7b55ce9960657cdc9d83c78ebec.pdf,ICLR,2019,We apply zero-shot learning for speech recognition to recognize unseen phonemes +Ip195saXqIX,Wa6shp5Rxk7,1601310000000.0,1614990000000.0,1536,Knowledge Distillation By Sparse Representation Matching,"[""~Dat_Thanh_Tran1"", ""~Moncef_Gabbouj1"", ""~Alexandros_Iosifidis2""]","[""Dat Thanh Tran"", ""Moncef Gabbouj"", ""Alexandros Iosifidis""]","[""Knowledge Distillation"", ""Sparse Representation"", ""Transfer Learning""]","Knowledge Distillation refers to a class of methods that transfers the knowledge from a teacher network to a student network. In this paper, we propose Sparse Representation Matching (SRM), a method to transfer intermediate knowledge obtained from one Convolutional Neural Network (CNN) to another by utilizing sparse representation learning. SRM first extracts sparse representations of the hidden features of the teacher CNN, which are then used to generate both pixel-level and image-level labels for training intermediate feature maps of the student network. We formulate SRM as a neural processing block, which can be efficiently optimized using stochastic gradient descent and integrated into any CNN in a plug-and-play manner. Our experiments demonstrate that SRM is robust to architectural differences between the teacher and student networks, and outperforms other KD techniques across several datasets. ",/pdf/9db4948aa9ad12de0b0b7ed12f8e4af05e3dcea4.pdf,ICLR,2021,A knowledge distillation method that utilizes sparse representation to transfer intermediate knowledge in convolutional neural networks +HJ9rLLcxg,,1478280000000.0,1484960000000.0,297,Dataset Augmentation in Feature Space,"[""terrance@uoguelph.ca"", ""gwtaylor@uoguelph.ca""]","[""Terrance DeVries"", ""Graham W. 
Taylor""]","[""Unsupervised Learning""]","Dataset augmentation, the practice of applying a wide array of domain-specific transformations to synthetically expand a training set, is a standard tool in supervised learning. While effective in tasks such as visual recognition, the set of transformations must be carefully designed, implemented, and tested for every new domain, limiting its re-use and generality. In this paper, we adopt a simpler, domain-agnostic approach to dataset augmentation. We start with existing data points and apply simple transformations such as adding noise, interpolating, or extrapolating between them. Our main insight is to perform the transformation not in input space, but in a learned feature space. A re-kindling of interest in unsupervised representation learning makes this technique timely and more effective. It is a simple proposal, but to-date one that has not been tested empirically. Working in the space of context vectors generated by sequence-to-sequence models, we demonstrate a technique that is effective for both static and sequential data. +",/pdf/e71c24cd3ac439f1182d49e1db0451552b447455.pdf,ICLR,2017,We argue for domain-agnostic data augmentation in feature space by applying simple transformations to seq2seq context vectors. +ryxaSsActQ,SJe_wP2FYm,1538090000000.0,1545360000000.0,128,Dual Skew Divergence Loss for Neural Machine Translation,"[""wuyingting@sjtu.edu.cn"", ""zhaohai@cs.sjtu.edu.cn"", ""wangrui.nlp@gmail.com""]","[""Yingting Wu"", ""Hai Zhao"", ""Rui Wang""]",[],"For neural sequence model training, maximum likelihood (ML) has been commonly adopted to optimize model parameters with respect to the corresponding objective. However, in the case of sequence prediction tasks like neural machine translation (NMT), training with the ML-based cross entropy loss would often lead to models that overgeneralize and plunge into local optima. 
In this paper, we propose an extended loss function called dual skew divergence (DSD), which aims to give a better tradeoff between generalization ability and error avoidance during NMT training. Our empirical study indicates that switching to DSD loss after the convergence of ML training helps the model skip the local optimum and stimulates a stable performance improvement. The evaluations on WMT 2014 English-German and English-French translation tasks demonstrate that the proposed loss indeed helps bring about better translation performance than several baselines.",/pdf/e4094f01a80524a4bc661cc502bf1a5c8aca2710.pdf,ICLR,2019, +xcd5iTC6J-W,os-y8JXpSl,1601310000000.0,1614990000000.0,2616,Hidden Markov models are recurrent neural networks: A disease progression modeling application,"[""~Matthew_Baucum1"", ""~Anahita_Khojandi1"", ""~Theodore_Papamarkou1""]","[""Matthew Baucum"", ""Anahita Khojandi"", ""Theodore Papamarkou""]","[""hidden markov models"", ""recurrent neural networks"", ""disease progression""]","Hidden Markov models (HMMs) are commonly used for disease progression modeling when the true state of a patient is not fully known. Since HMMs may have multiple local optima, performance can be improved by incorporating additional patient covariates to inform parameter estimation. To allow for this, we formulate a special case of recurrent neural networks (RNNs), which we name hidden Markov recurrent neural networks (HMRNNs), and prove that each HMRNN has the same likelihood function as a corresponding discrete-observation HMM. As a neural network, the HMRNN can also be combined with any other predictive neural networks that take patient covariate information as input. We first show that parameter estimates from HMRNNs are numerically close to those obtained from HMMs via the Baum-Welch algorithm, thus empirically validating their theoretical equivalence. 
We then demonstrate how the HMRNN can be combined with other neural networks to improve parameter estimation and prediction, using an Alzheimer's disease dataset. The HMRNN yields parameter estimates that improve disease forecasting performance and offer a novel clinical interpretation compared with a standard HMM.",/pdf/0c00320647513da20d5ef10ff7bc46a47ad3893b.pdf,ICLR,2021,We develop a neural network implementation of hidden Markov models that can be combined with other predictive networks to improve predictive accuracy and parameter solutions. +H1GaLiAcY7,BJeAI6TFFQ,1538090000000.0,1545360000000.0,218,Learning to Separate Domains in Generalized Zero-Shot and Open Set Learning: a probabilistic perspective,"[""hzdong15@fudan.edu.cn"", ""yanweifu@fudan.edu.cn"", ""lsigal@cs.ubc.ca"", ""sjhwang82@kaist.ac.kr"", ""ygj@fudan.edu.cn"", ""xyxue@fudan.edu.cn""]","[""Hanze Dong"", ""Yanwei Fu"", ""Leonid Sigal"", ""SungJu Hwang"", ""Yu-Gang Jiang"", ""Xiangyang Xue""]","[""Generalized zero-shot learning"", ""domain division"", ""bootstrapping"", ""Kolmogorov-Smirnov""]","This paper studies the problem of domain division which aims to segment instances drawn from different probabilistic distributions. This problem exists in many previous recognition tasks, such as Open Set Learning (OSL) and Generalized Zero-Shot Learning (G-ZSL), where the testing instances come from either seen or unseen/novel classes with different probabilistic distributions. Previous works only calibrate the confident prediction of classifiers of seen classes (WSVM Scheirer et al. (2014)) or taking unseen classes as outliers Socher et al. (2013). In contrast, this paper proposes a probabilistic way of directly estimating and fine-tuning the decision boundary between seen and unseen classes. In particular, we propose a domain division algorithm to split the testing instances into known, unknown and uncertain domains, and then conduct recognition tasks in each domain. 
Two statistical tools, namely, bootstrapping and KolmogorovSmirnov (K-S) Test, for the first time, are introduced to uncover and fine-tune the decision boundary of each domain. Critically, the uncertain domain is newly introduced in our framework to adopt those instances whose domain labels cannot be predicted confidently. Extensive experiments demonstrate that our approach achieved the state-of-the-art performance on OSL and G-ZSL benchmarks.",/pdf/b5586db9a8a033a4158a7580f888e81e8d2867bf.pdf,ICLR,2019, This paper studies the problem of domain division by segmenting instances drawn from different probabilistic distributions. +BJl4g0NYvB,B1lCC4Xuwr,1569440000000.0,1577170000000.0,925,Causal Induction from Visual Observations for Goal Directed Tasks,"[""surajn@stanford.edu"", ""yukez@cs.stanford.edu"", ""ssilvio@stanford.edu"", ""feifeili@cs.stanford.edu""]","[""Suraj Nair"", ""Yuke Zhu"", ""Silvio Savarese"", ""Li Fei-Fei""]","[""meta-learning"", ""causal reasoning"", ""policy learning""]","Causal reasoning has been an indispensable capability for humans and other intelligent animals to interact with the physical world. In this work, we propose to endow an artificial agent with the capability of causal reasoning for completing goal-directed tasks. We develop learning-based approaches to inducing causal knowledge in the form of directed acyclic graphs, which can be used to contextualize a learned goal-conditional policy to perform tasks in novel environments with latent causal structures. We leverage attention mechanisms in our causal induction model and goal-conditional policy, enabling us to incrementally generate the causal graph from the agent's visual observations and to selectively use the induced graph for determining actions. 
Our experiments show that our method effectively generalizes towards completing new tasks in novel environments with previously unseen causal structures.",/pdf/5194231b0e7353e6df86d0efb5ed33e7ae0c2495.pdf,ICLR,2020,Meta-learning algorithm for inducing causal structure from visual observations and using it to complete goal conditioned tasks +Hkp3uhxCW,SJ22u3gA-,1509110000000.0,1518730000000.0,396,Revisiting Bayes by Backprop,"[""meirefortunato@google.com"", ""cblundell@google.com"", ""vinyals@google.com""]","[""Meire Fortunato"", ""Charles Blundell"", ""Oriol Vinyals""]","[""Bayesian"", ""Deep Learning"", ""Recurrent Neural Networks"", ""LSTM""]","In this work we explore a straightforward variational Bayes scheme for Recurrent Neural Networks. +Firstly, we show that a simple adaptation of truncated backpropagation through time can yield good quality uncertainty estimates and superior regularisation at only a small extra computational cost during training, also reducing the amount of parameters by 80\%. +Secondly, we demonstrate how a novel kind of posterior approximation yields further improvements to the performance of Bayesian RNNs. We incorporate local gradient information into the approximate posterior to sharpen it around the current batch statistics. We show how this technique is not exclusive to recurrent neural networks and can be applied more widely to train Bayesian neural networks. +We also empirically demonstrate how Bayesian RNNs are superior to traditional RNNs on a language modelling benchmark and an image captioning task, as well as showing how each of these methods improve our model over a variety of other schemes for training them. 
We also introduce a new benchmark for studying uncertainty for language models so future methods can be easily compared.",/pdf/e7fa9985a7554187401ff61b690c0f7a0f892ba3.pdf,ICLR,2018, Variational Bayes scheme for Recurrent Neural Networks +piLPYqxtWuA,sxsAUhuMBjx,1601310000000.0,1614840000000.0,1673,FastSpeech 2: Fast and High-Quality End-to-End Text to Speech,"[""~Yi_Ren2"", ""~Chenxu_Hu1"", ""~Xu_Tan1"", ""~Tao_Qin1"", ""sheng.zhao@microsoft.com"", ""~Zhou_Zhao2"", ""~Tie-Yan_Liu1""]","[""Yi Ren"", ""Chenxu Hu"", ""Xu Tan"", ""Tao Qin"", ""Sheng Zhao"", ""Zhou Zhao"", ""Tie-Yan Liu""]","[""text to speech"", ""speech synthesis"", ""non-autoregressive generation"", ""one-to-many mapping"", ""end-to-end""]","Non-autoregressive text to speech (TTS) models such as FastSpeech can synthesize speech significantly faster than previous autoregressive models with comparable quality. The training of FastSpeech model relies on an autoregressive teacher model for duration prediction (to provide more information as input) and knowledge distillation (to simplify the data distribution in output), which can ease the one-to-many mapping problem (i.e., multiple speech variations correspond to the same text) in TTS. However, FastSpeech has several disadvantages: 1) the teacher-student distillation pipeline is complicated and time-consuming, 2) the duration extracted from the teacher model is not accurate enough, and the target mel-spectrograms distilled from teacher model suffer from information loss due to data simplification, both of which limit the voice quality. In this paper, we propose FastSpeech 2, which addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS by 1) directly training the model with ground-truth target instead of the simplified output from teacher, and 2) introducing more variation information of speech (e.g., pitch, energy and more accurate duration) as conditional inputs. 
Specifically, we extract duration, pitch and energy from speech waveform and directly take them as conditional inputs in training and use predicted values in inference. We further design FastSpeech 2s, which is the first attempt to directly generate speech waveform from text in parallel, enjoying the benefit of fully end-to-end inference. Experimental results show that 1) FastSpeech 2 achieves a 3x training speed-up over FastSpeech, and FastSpeech 2s enjoys even faster inference speed; 2) FastSpeech 2 and 2s outperform FastSpeech in voice quality, and FastSpeech 2 can even surpass autoregressive models. Audio samples are available at https://speechresearch.github.io/fastspeech2/.",/pdf/16d889999a70fcdfe5fbe9165d7d341e47b05e93.pdf,ICLR,2021,We propose a non-autoregressive TTS model named FastSpeech 2 to better solve the one-to-many mapping problem in TTS and surpass autoregressive models in voice quality. +ByOK0rwlx,,1478090000000.0,1478160000000.0,43,Ternary Weight Decomposition and Binary Activation Encoding for Fast and Compact Neural Network,"[""manbai@d-itlab.co.jp"", ""tmatsumoto@d-itlab.co.jp"", ""yamashita@cs.chubu.ac.jp"", ""hf@cs.chubu.ac.jp""]","[""Mitsuru Ambai"", ""Takuya Matsumoto"", ""Takayoshi Yamashita"", ""Hironobu Fujiyoshi""]","[""Deep learning""]","This paper aims to reduce test-time computational load of a deep neural network. Unlike previous methods which factorize a weight matrix into multiple real-valued matrices, our method factorizes both weights and activations into integer and noninteger components. In our method, the real-valued weight matrix is approximated by a multiplication of a ternary matrix and a real-valued co-efficient matrix. Since the ternary matrix consists of three integer values, {-1, 0, +1}, it only consumes 2 bits per element. 
At test-time, an activation vector that passed from a previous layer is also transformed into a weighted sum of binary vectors, {-1, +1}, which enables fast feed-forward propagation based on simple logical operations: AND, XOR, and bit count. This makes it easier to deploy a deep network on low-power CPUs or to design specialized hardware. +In our experiments, we tested our method on three different networks: a CNN for handwritten digits, VGG-16 model for ImageNet classification, and VGG-Face for large-scale face recognition. In particular, when we applied our method to three fully connected layers in the VGG-16, 15x acceleration and memory compression up to 5.2% were achieved with only a 1.43% increase in the top-5 error. Our experiments also revealed that compressing convolutional layers can accelerate inference of the entire network in exchange of slight increase in error.",/pdf/de3d7a3678c6bd1f250b4066a8f2fd0b0b9da868.pdf,ICLR,2017, +#NAME?,Z3Ujic4PhhD,1601310000000.0,1615520000000.0,1447,Meta-learning Symmetries by Reparameterization,"[""~Allan_Zhou1"", ""tknowles@stanford.edu"", ""~Chelsea_Finn1""]","[""Allan Zhou"", ""Tom Knowles"", ""Chelsea Finn""]","[""meta-learning"", ""equivariance"", ""convolution"", ""symmetry""]","Many successful deep learning architectures are equivariant to certain transformations in order to conserve parameters and improve generalization: most famously, convolution layers are equivariant to shifts of the input. This approach only works when practitioners know the symmetries of the task and can manually construct an architecture with the corresponding equivariances. Our goal is an approach for learning equivariances from data, without needing to design custom task-specific architectures. We present a method for learning and encoding equivariances into networks by learning corresponding parameter sharing patterns from data. 
Our method can provably represent equivariance-inducing parameter sharing for any finite group of symmetry transformations. Our experiments suggest that it can automatically learn to encode equivariances to common transformations used in image processing tasks.",/pdf/eeb98a32337d0e4cdf8bb6b8073933c781bbf51e.pdf,ICLR,2021,A method for automatically meta-learning and encoding equivariances into neural networks. +HJx-3grYDB,B1e7Me-tvB,1569440000000.0,1588180000000.0,2526,Learning Nearly Decomposable Value Functions Via Communication Minimization,"[""tonghanwang1996@gmail.com"", ""1040594377@qq.com"", ""chongyeezheng@gmail.com"", ""chongjie@tsinghua.edu.cn""]","[""Tonghan Wang*"", ""Jianhao Wang*"", ""Chongyi Zheng"", ""Chongjie Zhang""]","[""Multi-agent reinforcement learning"", ""Nearly decomposable value function"", ""Minimized communication"", ""Multi-agent systems""]","Reinforcement learning encounters major challenges in multi-agent settings, such as scalability and non-stationarity. Recently, value function factorization learning emerges as a promising way to address these challenges in collaborative multi-agent systems. However, existing methods have been focusing on learning fully decentralized value functions, which are not efficient for tasks requiring communication. To address this limitation, this paper presents a novel framework for learning nearly decomposable Q-functions (NDQ) via communication minimization, with which agents act on their own most of the time but occasionally send messages to other agents in order for effective coordination. This framework hybridizes value function factorization learning and communication learning by introducing two information-theoretic regularizers. These regularizers are maximizing mutual information between agents' action selection and communication messages while minimizing the entropy of messages between agents. 
We show how to optimize these regularizers in a way that is easily integrated with existing value function factorization methods such as QMIX. Finally, we demonstrate that, on the StarCraft unit micromanagement benchmark, our framework significantly outperforms baseline methods and allows us to cut off more than $80\%$ of communication without sacrificing the performance. The videos of our experiments are available at https://sites.google.com/view/ndq.",/pdf/1fd327b39cca477f8ce468645d5c704d3b104e0b.pdf,ICLR,2020, +rJl8viCqKQ,SJlppUX5YQ,1538090000000.0,1545360000000.0,264,Low Latency Privacy Preserving Inference,"[""brutzkus@gmail.com"", ""oren.elisha@microsoft.com"", ""rani.gb@gmail.com""]","[""Alon Brutzkus"", ""Oren Elisha"", ""Ran Gilad-Bachrach""]","[""privacy"", ""classification"", ""homomorphic encryption"", ""neural networks""]","When applying machine learning to sensitive data one has to balance between accuracy, information leakage, and computational-complexity. Recent studies have shown that Homomorphic Encryption (HE) can be used for protecting against information leakage while applying neural networks. However, this comes with the cost of limiting the kind of neural networks that can be used (and hence the accuracy) and with latency of the order of several minutes even for relatively simple networks. In this study we improve on previous results both in the kind of networks that can be applied and in terms of the latency. 
Most of the improvement is achieved by novel ways to represent the data to make better use of the capabilities of the encryption scheme.",/pdf/0fefca3465d09b490f85ad3c3d445b95ce2214fa.pdf,ICLR,2019,"This work presents methods, combining neural-networks and encryptions, to make predictions while preserving the privacy of the data owner with low latency" +SJgwzCEKwH,B1gKW04dwB,1569440000000.0,1588130000000.0,1004,Bridging Mode Connectivity in Loss Landscapes and Adversarial Robustness,"[""zhao.pu@husky.neu.edu"", ""pin-yu.chen@ibm.com"", ""daspa@us.ibm.com"", ""knatesa@us.ibm.com"", ""xue.lin@northeastern.edu""]","[""Pu Zhao"", ""Pin-Yu Chen"", ""Payel Das"", ""Karthikeyan Natesan Ramamurthy"", ""Xue Lin""]","[""mode connectivity"", ""adversarial robustness"", ""backdoor attack"", ""error-injection attack"", ""evasion attacks"", ""loss landscapes""]","Mode connectivity provides novel geometric insights on analyzing loss landscapes and enables building high-accuracy pathways between well-trained neural networks. In this work, we propose to employ mode connectivity in loss landscapes to study the adversarial robustness of deep neural networks, and provide novel methods for improving this robustness. Our experiments cover various types of adversarial attacks applied to different network architectures and datasets. When network models are tampered with backdoor or error-injection attacks, our results demonstrate that the path connection learned using limited amount of bonafide data can effectively mitigate adversarial effects while maintaining the original accuracy on clean data. Therefore, mode connectivity provides users with the power to repair backdoored or error-injected models. We also use mode connectivity to investigate the loss landscapes of regular and robust models against evasion attacks. Experiments show that there exists a barrier in adversarial robustness loss on the path connecting regular and adversarially-trained models. 
A high correlation is observed between the adversarial robustness loss and the largest eigenvalue of the input Hessian matrix, for which theoretical justifications are provided. Our results suggest that mode connectivity offers a holistic tool and practical means for evaluating and improving adversarial robustness.",/pdf/fb8082dd5515e11c88f59b0f4911266f1891fb61.pdf,ICLR,2020,"A novel approach using mode connectivity in loss landscapes to mitigate adversarial effects, repair tampered models, and evaluate adversarial robustness" +HyxzRsR9Y7,S1lsmBj5tm,1538090000000.0,1550880000000.0,872,Learning Self-Imitating Diverse Policies,"[""gangwan2@uiuc.edu"", ""lqiang@cs.utexas.edu"", ""jianpeng@illinois.edu""]","[""Tanmay Gangwani"", ""Qiang Liu"", ""Jian Peng""]","[""Reinforcement-learning"", ""Imitation-learning"", ""Ensemble-training""]","The success of popular algorithms for deep reinforcement learning, such as policy-gradients and Q-learning, relies heavily on the availability of an informative reward signal at each timestep of the sequential decision-making process. When rewards are only sparsely available during an episode, or a rewarding feedback is provided only after episode termination, these algorithms perform sub-optimally due to the difficultly in credit assignment. Alternatively, trajectory-based policy optimization methods, such as cross-entropy method and evolution strategies, do not require per-timestep rewards, but have been found to suffer from high sample complexity by completing forgoing the temporal nature of the problem. Improving the efficiency of RL algorithms in real-world problems with sparse or episodic rewards is therefore a pressing need. In this work, we introduce a self-imitation learning algorithm that exploits and explores well in the sparse and episodic reward settings. We view each policy as a state-action visitation distribution and formulate policy optimization as a divergence minimization problem. 
We show that with Jensen-Shannon divergence, this divergence minimization problem can be reduced into a policy-gradient algorithm with shaped rewards learned from experience replays. Experimental results indicate that our algorithm works comparable to existing algorithms in environments with dense rewards, and significantly better in environments with sparse and episodic rewards. We then discuss limitations of self-imitation learning, and propose to solve them by using Stein variational policy gradient descent with the Jensen-Shannon kernel to learn multiple diverse policies. We demonstrate its effectiveness on a challenging variant of continuous-control MuJoCo locomotion tasks.",/pdf/14e098a5877dfe191af380aa73904233e8c84eaa.pdf,ICLR,2019,Policy optimization by using past good rollouts from the agent; learning shaped rewards via divergence minimization; SVPG with JS-kernel for population-based exploration. +Whq-nTgCbNR,SQxhfszyxPO,1601310000000.0,1614990000000.0,1941,Anomaly detection in dynamical systems from measured time series,"[""~Andrei_Ivanov1"", ""a.golovkina@spbu.ru""]","[""Andrei Ivanov"", ""Anna Golovkina""]","[""dynamical systems"", ""polynomial neural networks"", ""anomaly detection""]","The paper addresses a problem of abnormalities detection in nonlinear processes represented by measured time series. Anomaly detection problem is usually formulated as finding outlier data points relative to some usual signals such as unexpected spikes, drops, or trend changes. In nonlinear dynamical systems, there are cases where a time series does not contain statistical outliers while the process corresponds to an abnormal configuration of the dynamical system. Since the polynomial neural architecture has a strong connection with the theory of differential equations, we use it for the feature extraction that describes the dynamical system itself. 
The paper discusses in both simulations and a practical example with real measurements the applicability of the proposed approach and it's benchmarking with existing methods.",/pdf/b8ef31bf32c05d86b109fb9678da6cfd9103f916.pdf,ICLR,2021,Deep polynomial neural networks for abnormalities detection in nonlinear processes represented by measured time series. +uz5uw6gM0m,pBR6CBQE0m9,1601310000000.0,1616970000000.0,368,One Network Fits All? Modular versus Monolithic Task Formulations in Neural Networks,"[""thetish@google.com"", ""abhidas@google.com"", ""~Brendan_Juba1"", ""~Rina_Panigrahy1"", ""~Vatsal_Sharan1"", ""wanxin@google.com"", ""~Qiuyi_Zhang1""]","[""Atish Agarwala"", ""Abhimanyu Das"", ""Brendan Juba"", ""Rina Panigrahy"", ""Vatsal Sharan"", ""Xin Wang"", ""Qiuyi Zhang""]","[""deep learning theory"", ""multi-task learning""]","Can deep learning solve multiple, very different tasks simultaneously? We investigate how the representations of the underlying tasks affect the ability of a single neural network to learn them jointly. We present theoretical and empirical findings that a single neural network is capable of simultaneously learning multiple tasks from a combined data set, for a variety of methods for representing tasks---for example, when the distinct tasks are encoded by well-separated clusters or decision trees over some task-code attributes. Indeed, more strongly, we present a novel analysis that shows that families of simple programming-like constructs for the codes encoding the tasks are learnable by two-layer neural networks with standard training. We study more generally how the complexity of learning such combined tasks grows with the complexity of the task codes; we find that learning many tasks can be provably hard, even though the individual tasks are easy to learn. 
We provide empirical support for the usefulness of the learning bounds by training networks on clusters, decision trees, and SQL-style aggregation.",/pdf/c3144d0b41029c0a6eb962e25853af28fe75daf2.pdf,ICLR,2021,"Theoretical bounds and experimental results showing that neural networks trained with SGD can provably solve multiple, very different tasks simultaneously." +c_E8kFWfhp0,Kg5xVSxkm-O,1601310000000.0,1616030000000.0,2601,gradSim: Differentiable simulation for system identification and visuomotor control,"[""~J._Krishna_Murthy1"", ""~Miles_Macklin1"", ""~Florian_Golemo1"", ""~Vikram_Voleti1"", ""~Linda_Petrini1"", ""~Martin_Weiss4"", ""~Breandan_Considine2"", ""~J\u00e9r\u00f4me_Parent-L\u00e9vesque2"", ""kevincxie@cs.toronto.edu"", ""kenny@di.ku.dk"", ""~Liam_Paull1"", ""~Florian_Shkurti1"", ""~Derek_Nowrouzezahrai1"", ""~Sanja_Fidler1""]","[""J. Krishna Murthy"", ""Miles Macklin"", ""Florian Golemo"", ""Vikram Voleti"", ""Linda Petrini"", ""Martin Weiss"", ""Breandan Considine"", ""J\u00e9r\u00f4me Parent-L\u00e9vesque"", ""Kevin Xie"", ""Kenny Erleben"", ""Liam Paull"", ""Florian Shkurti"", ""Derek Nowrouzezahrai"", ""Sanja Fidler""]","[""Differentiable simulation"", ""System identification"", ""Physical parameter estimation"", ""3D scene understanding"", ""3D vision"", ""Differentiable rendering"", ""Differentiable physics""]","In this paper, we tackle the problem of estimating object physical properties such as mass, friction, and elasticity directly from video sequences. Such a system identification problem is fundamentally ill-posed due to the loss of information during image formation. Current best solutions to the problem require precise 3D labels which are labor intensive to gather, and infeasible to create for many systems such as deformable solids or cloth. 
In this work we present gradSim, a framework that overcomes the dependence on 3D supervision by combining differentiable multiphysics simulation and differentiable rendering to jointly model the evolution of scene dynamics and image formation. This unique combination enables backpropagation from pixels in a video sequence through to the underlying physical attributes that generated them. Furthermore, our unified computation graph across dynamics and rendering engines enables the learning of challenging visuomotor control tasks, without relying on state-based (3D) supervision, while obtaining performance competitive to/better than techniques that require precise 3D labels.",/pdf/4a6d5a30558be4f1d305beba6c91e7617ddb5c96.pdf,ICLR,2021,Differentiable models of time-varying dynamics and image formation pipelines result in highly accurate physical parameter estimation from video +ZcKPWuhG6wy,K-aP9GWAJw9,1601310000000.0,1616030000000.0,1249,Tradeoffs in Data Augmentation: An Empirical Study,"[""~Raphael_Gontijo-Lopes1"", ""smullin-physics@stanfordalumni.org"", ""~Ekin_Dogus_Cubuk1"", ""~Ethan_Dyer1""]","[""Raphael Gontijo-Lopes"", ""Sylvia Smullin"", ""Ekin Dogus Cubuk"", ""Ethan Dyer""]","[""Generalization"", ""Interpretability"", ""Understanding Data Augmentation""]","Though data augmentation has become a standard component of deep neural network training, the underlying mechanism behind the effectiveness of these techniques remains poorly understood. In practice, augmentation policies are often chosen using heuristics of distribution shift or augmentation diversity. Inspired by these, we conduct an empirical study to quantify how data augmentation improves model generalization. We introduce two interpretable and easy-to-compute measures: Affinity and Diversity. 
We find that augmentation performance is predicted not by either of these alone but by jointly optimizing the two.",/pdf/3d2fc5aa6c81e18581fdf14b971478263093ec81.pdf,ICLR,2021,We quantify mechanisms of how data augmentation works with two metrics we introduce: Affinity and Diversity. +HJeOekHKwr,B1xZ8BiOwr,1569440000000.0,1583910000000.0,1512,Smoothness and Stability in GANs,"[""caseychu@stanford.edu"", ""minami@preferred.jp"", ""fukumizu@ism.ac.jp""]","[""Casey Chu"", ""Kentaro Minami"", ""Kenji Fukumizu""]","[""generative adversarial networks"", ""stability"", ""smoothness"", ""convex conjugate""]","Generative adversarial networks, or GANs, commonly display unstable behavior during training. In this work, we develop a principled theoretical framework for understanding the stability of various types of GANs. In particular, we derive conditions that guarantee eventual stationarity of the generator when it is trained with gradient descent, conditions that must be satisfied by the divergence that is minimized by the GAN and the generator's architecture. We find that existing GAN variants satisfy some, but not all, of these conditions. Using tools from convex analysis, optimal transport, and reproducing kernels, we construct a GAN that fulfills these conditions simultaneously. 
In the process, we explain and clarify the need for various existing GAN stabilization techniques, including Lipschitz constraints, gradient penalties, and smooth activation functions.",/pdf/c5094b4085292576e13c161f612bfa4767425746.pdf,ICLR,2020,We develop a principled theoretical framework for understanding and enforcing the stability of various types of GANs +B1eyA3VFwS,H1gUjfQ4vr,1569440000000.0,1577170000000.0,249,Enforcing Physical Constraints in Neural Neural Networks through Differentiable PDE Layer,"[""chiyu.jiang@berkeley.edu"", ""kkashinath@lbl.gov"", ""prabhat@lbl.gov"", ""pmarcus@me.berkeley.edu""]","[""Chiyu \""Max\"" Jiang"", ""Karthik Kashinath"", ""Prabhat"", ""Philip Marcus""]","[""PDE"", ""Hard Constraints"", ""Turbulence"", ""Super-Resolution"", ""Spectral Methods""]","Recent studies at the intersection of physics and deep learning have illustrated successes in the application of deep neural networks to partially or fully replace costly physics simulations. Enforcing physical constraints to solutions generated +by neural networks remains a challenge, yet it is essential to the accuracy and trustworthiness of such model predictions. Many systems in the physical sciences are governed by Partial Differential Equations (PDEs). Enforcing these as hard +constraints, we show, are inefficient in conventional frameworks due to the high dimensionality of the generated fields. To this end, we propose the use of a novel differentiable spectral projection layer for neural networks that efficiently enforces +spatial PDE constraints using spectral methods, yet is fully differentiable, allowing for its use as a layer in neural networks that supports end-to-end training. We show that its computational cost is cheaper than a regular convolution layer. We apply it to +an important class of physical systems – incompressible turbulent flows, where the divergence-free PDE constraint is required. 
We train a 3D Conditional Generative Adversarial Network (CGAN) for turbulent flow super-resolution efficiently, whilst +guaranteeing the spatial PDE constraint of zero divergence. Furthermore, our empirical results show that the model produces realistic flow fields with more accurate flow statistics when trained with hard constraints imposed via the proposed +novel differentiable spectral projection layer, as compared to soft constrained and unconstrained counterparts.",/pdf/0cb6a7e7202debafe704399062922e4d72f3e447.pdf,ICLR,2020,A novel way of enforcing hard linear constraints within a convolutional neural network using a differentiable PDE layer. +S1eL4kBYwr,rklUh22uvS,1569440000000.0,1577170000000.0,1655,UNITER: Learning UNiversal Image-TExt Representations,"[""yen-chun.chen@microsoft.com"", ""lindsey.li@microsoft.com"", ""licheng.yu@microsoft.com"", ""ahmed.elkholy@microsoft.com"", ""fiahmed@microsoft.com"", ""zhe.gan@microsoft.com"", ""yu.cheng@microsoft.com"", ""jingjl@microsoft.com""]","[""Yen-Chun Chen"", ""Linjie Li"", ""Licheng Yu"", ""Ahmed El Kholy"", ""Faisal Ahmed"", ""Zhe Gan"", ""Yu Cheng"", ""Jingjing Liu""]","[""Self-supervised Representation Learning"", ""Large-scale Pre-training"", ""Vision and Language""]","Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks, where multimodality inputs are jointly processed for visual and textual understanding. In this paper, we introduce UNITER, a UNiversal Image-TExt Representation, learned through large-scale pre-training over four image-text datasets (COCO, Visual Genome, Conceptual Captions, and SBU Captions), which can power heterogeneous downstream V+L tasks with joint multimodal embeddings. We design three pre-training tasks: Masked Language Modeling (MLM), Image-Text Matching (ITM), and Masked Region Modeling (MRM, with three variants). 
Different from concurrent work on multimodal pre-training that apply joint random masking to both modalities, we use Conditioned Masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of image/text). Comprehensive analysis shows that conditioned masking yields better performance than unconditioned masking. We also conduct a thorough ablation study to find an optimal combination of pre-training tasks for UNITER. Extensive experiments show that UNITER achieves new state of the art across six V+L tasks over nine datasets, including Visual Question Answering, Image-Text Retrieval, Referring Expression Comprehension, Visual Commonsense Reasoning, Visual Entailment, and NLVR2. ",/pdf/d6ea888654af03a15e6c9c7276c44721eb1450f7.pdf,ICLR,2020,"We introduce UNITER, a UNiversal Image-TExt Representation, learned through large-scale pre-training over image-text datasets, achieves state-of-the-art results across six Vision-and-Language tasks over nine datasets." +SJg7KhVKPH,BJxnOwnjLB,1569440000000.0,1583910000000.0,73,Depth-Adaptive Transformer,"[""maha.elbayad@inria.fr"", ""thomagram@gmail.com"", ""egrave@fb.com"", ""michael.auli@gmail.com""]","[""Maha Elbayad"", ""Jiatao Gu"", ""Edouard Grave"", ""Michael Auli""]","[""Deep learning"", ""natural language processing"", ""sequence modeling""]","State of the art sequence-to-sequence models for large scale tasks perform a fixed number of computations for each input sequence regardless of whether it is easy or hard to process. In this paper, we train Transformer models which can make output predictions at different stages of the network and we investigate different ways to predict how much computation is required for a particular sequence. Unlike dynamic computation in Universal Transformers, which applies the same set of layers iteratively, we apply different layers at every step to adjust both the amount of computation as well as the model capacity. 
On IWSLT German-English translation our approach matches the accuracy of a well tuned baseline Transformer while using less than a quarter of the decoder layers.",/pdf/6083b51a6c735b02350d9490ad0a5f3340cd90cf.pdf,ICLR,2020,Sequence model that dynamically adjusts the amount of computation for each input. +r1nzLmWAb,ByvzIQZCZ,1509140000000.0,1518730000000.0,1168,Video Action Segmentation with Hybrid Temporal Networks,"[""liding@rochester.edu"", ""chenliang.xu@rochester.edu""]","[""Li Ding"", ""Chenliang Xu""]","[""action segmentation"", ""video labeling"", ""temporal networks""]","Action segmentation as a milestone towards building automatic systems to understand untrimmed videos has received considerable attention in the recent years. It is typically being modeled as a sequence labeling problem but contains intrinsic and sufficient differences than text parsing or speech processing. In this paper, we introduce a novel hybrid temporal convolutional and recurrent network (TricorNet), which has an encoder-decoder architecture: the encoder consists of a hierarchy of temporal convolutional kernels that capture the local motion changes of different actions; the decoder is a hierarchy of recurrent neural networks that are able to learn and memorize long-term action dependencies after the encoding stage. Our model is simple but extremely effective in terms of video sequence labeling. The experimental results on three public action segmentation datasets have shown that the proposed model achieves superior performance over the state of the art.",/pdf/9cb1db4642c01584e6ca3c886e730f3743542a24.pdf,ICLR,2018,We propose a new hybrid temporal network that achieves state-of-the-art performance on video action segmentation on three public datasets. 
+rkTBjG-AZ,r1bXjGZAW,1509140000000.0,1518730000000.0,926,DeepArchitect: Automatically Designing and Training Deep Architectures,"[""negrinho@cs.cmu.edu"", ""ggordon@cs.cmu.edu""]","[""Renato Negrinho"", ""Geoff Gordon""]","[""architecture search"", ""deep learning"", ""hyperparameter tuning""]","In deep learning, performance is strongly affected by the choice of architecture +and hyperparameters. While there has been extensive work on automatic hyperpa- +rameter optimization for simple spaces, complex spaces such as the space of deep +architectures remain largely unexplored. As a result, the choice of architecture is +done manually by the human expert through a slow trial and error process guided +mainly by intuition. In this paper we describe a framework for automatically +designing and training deep models. We propose an extensible and modular lan- +guage that allows the human expert to compactly represent complex search spaces +over architectures and their hyperparameters. The resulting search spaces are tree- +structured and therefore easy to traverse. Models can be automatically compiled to +computational graphs once values for all hyperparameters have been chosen. We +can leverage the structure of the search space to introduce different model search +algorithms, such as random search, Monte Carlo tree search (MCTS), and sequen- +tial model-based optimization (SMBO). We present experiments comparing the +different algorithms on CIFAR-10 and show that MCTS and SMBO outperform +random search. We also present experiments on MNIST, showing that the same +search space achieves near state-of-the-art performance with a few samples. These +experiments show that our framework can be used effectively for model discov- +ery, as it is possible to describe expressive search spaces and discover competitive +models without much effort from the human expert. 
Code for our framework and +experiments has been made publicly available",/pdf/1b4b11e6581fddc7fc91f511087335cef3ad7fb1.pdf,ICLR,2018,We describe a modular and composable language for describing expressive search spaces over architectures and simple model search algorithms applied to these search spaces. +#NAME?,_4kXnVYMKa1,1601310000000.0,1615450000000.0,363,Isometric Propagation Network for Generalized Zero-shot Learning,"[""~Lu_Liu7"", ""~Tianyi_Zhou1"", ""~Guodong_Long2"", ""~Jing_Jiang6"", ""~Xuanyi_Dong1"", ""~Chengqi_Zhang1""]","[""Lu Liu"", ""Tianyi Zhou"", ""Guodong Long"", ""Jing Jiang"", ""Xuanyi Dong"", ""Chengqi Zhang""]","[""Zero-shot learning"", ""isometric"", ""prototype propagation"", ""alignment of semantic and visual space""]","Zero-shot learning (ZSL) aims to classify images of an unseen class only based on a few attributes describing that class but no access to any training sample. A popular strategy is to learn a mapping between the semantic space of class attributes and the visual space of images based on the seen classes and their data. Thus, an unseen class image can be ideally mapped to its corresponding class attributes. The key challenge is how to align the representations in the two spaces. For most ZSL settings, the attributes for each seen/unseen class are only represented by a vector while the seen-class data provide much more information. Thus, the imbalanced supervision from the semantic and the visual space can make the learned mapping easily overfitting to the seen classes. To resolve this problem, we propose Isometric Propagation Network (IPN), which learns to strengthen the relation between classes within each space and align the class dependency in the two spaces. Specifically, IPN learns to propagate the class representations on an auto-generated graph within each space. 
In contrast to only aligning the resulted static representation, we regularize the two dynamic propagation procedures to be isometric in terms of the two graphs' edge weights per step by minimizing a consistency loss between them. IPN achieves state-of-the-art performance on three popular ZSL benchmarks. To evaluate the generalization capability of IPN, we further build two larger benchmarks with more diverse unseen classes and demonstrate the advantages of IPN on them.",/pdf/da2fc73fa15eb07f35399aa7799e950e718cb61e.pdf,ICLR,2021,We improve the current zero-shot learning performance by a dynamic alignment between the semantic space and visual space that encourages the isometry of the class-prototype propagation procedures in the two spaces. +D3PcGLdMx0,5cJ8d_FC9-V,1601310000000.0,1614560000000.0,698,MELR: Meta-Learning via Modeling Episode-Level Relationships for Few-Shot Learning,"[""~Nanyi_Fei1"", ""~Zhiwu_Lu1"", ""~Tao_Xiang1"", ""~Songfang_Huang1""]","[""Nanyi Fei"", ""Zhiwu Lu"", ""Tao Xiang"", ""Songfang Huang""]","[""few-shot learning"", ""episodic training"", ""cross-episode attention""]","Most recent few-shot learning (FSL) approaches are based on episodic training whereby each episode samples few training instances (shots) per class to imitate the test condition. However, this strict adhering to test condition has a negative side effect, that is, the trained model is susceptible to the poor sampling of few shots. In this work, for the first time, this problem is addressed by exploiting inter-episode relationships. Specifically, a novel meta-learning via modeling episode-level relationships (MELR) framework is proposed. By sampling two episodes containing the same set of classes for meta-training, MELR is designed to ensure that the meta-learned model is robust against the presence of poorly-sampled shots in the meta-test stage. 
This is achieved through two key components: (1) a Cross-Episode Attention Module (CEAM) to improve the ability of alleviating the effects of poorly-sampled shots, and (2) a Cross-Episode Consistency Regularization (CECR) to enforce that the two classifiers learned from the two episodes are consistent even when there are unrepresentative instances. Extensive experiments for non-transductive standard FSL on two benchmarks show that our MELR achieves 1.0%-5.0% improvements over the baseline (i.e., ProtoNet) used for FSL in our model and outperforms the latest competitors under the same settings.",/pdf/b13008d5731f5acac5931a7669386147ba3088da.pdf,ICLR,2021,This is the first work on explicitly modeling episode-level relationships for few-shot learning. +B9t708KMr9d,WuzKSAOiCs7,1601310000000.0,1614990000000.0,3122,Masked Label Prediction: Unified Message Passing Model for Semi-Supervised Classification,"[""shiyunsheng01@baidu.com"", ""huangzhengjie@baidu.com"", ""~shikun_feng1"", ""zhonghui03@baidu.com"", ""wangwenjin02@baidu.com"", ""sunyu02@baidu.com""]","[""Yunsheng Shi"", ""Zhengjie Huang"", ""shikun feng"", ""Hui Zhong"", ""Wenjin Wang"", ""Yu Sun""]","[""Unified Message Passing Model"", ""Graph Neural Network"", ""Label Propagation Algorithm"", ""Semi-Supervised Classification.""]","Graph neural network (GNN) and label propagation algorithm (LPA) are both message passing algorithms, which have achieved superior performance in semi-supervised classification. GNN performs \emph{feature propagation} by a neural network to make predictions, while LPA uses \emph{label propagation} across graph adjacency matrix to get results. However, there is still no good way to combine these two kinds of algorithms. 
In this paper, we propose a new {\bf Uni}fied {\bf M}essage {\bf P}assing Model (UniMP) that can incorporate \emph{feature propagation} and \emph{label propagation} with a shared message passing network, providing better performance in semi-supervised classification. First, we adopt a Graph Transformer jointly with label embedding to propagate both the feature and label information. Second, to train UniMP without overfitting in self-loop label information, we propose a masked label prediction strategy, in which some percentage of training labels are simply masked at random, and then predicted. UniMP conceptually unifies feature propagation and label propagation and is empirically powerful. It obtains new state-of-the-art semi-supervised classification results in Open Graph Benchmark (OGB). ",/pdf/4f5ed05ca1a8ffe29bd4dafc8973381b4fabe3d4.pdf,ICLR,2021,"We propose a unified message passing model, incorporating feature propagation and label propagation for getting better performance in semi-supervised classification." +SyVuRiC5K7,r1gNKqKqK7,1538090000000.0,1549620000000.0,911,LEARNING TO PROPAGATE LABELS: TRANSDUCTIVE PROPAGATION NETWORK FOR FEW-SHOT LEARNING,"[""csyanbin@gmail.com"", ""juho.lee@stats.ox.ac.uk"", ""mike_seop@aitrics.com"", ""shkim@aitrics.com"", ""eunhoy@kaist.ac.kr"", ""sjhwang82@kaist.ac.kr"", ""yi.yang@uts.edu.au""]","[""Yanbin Liu"", ""Juho Lee"", ""Minseop Park"", ""Saehoon Kim"", ""Eunho Yang"", ""Sung Ju Hwang"", ""Yi Yang""]","[""few-shot learning"", ""meta-learning"", ""label propagation"", ""manifold learning""]","The goal of few-shot learning is to learn a classifier that generalizes well even when trained with a limited number of training instances per class. The recently introduced meta-learning approaches tackle this problem by learning a generic classifier across a large number of multiclass classification tasks and generalizing the model to a new task. 
Yet, even with such meta-learning, the low-data problem in the novel classification task still remains. In this paper, we propose Transductive Propagation Network (TPN), a novel meta-learning framework for transductive inference that classifies the entire test set at once to alleviate the low-data problem. Specifically, we propose to learn to propagate labels from labeled instances to unlabeled test instances, by learning a graph construction module that exploits the manifold structure in the data. TPN jointly learns both the parameters of feature embedding and the graph construction in an end-to-end manner. We validate TPN on multiple benchmark datasets, on which it largely outperforms existing few-shot learning approaches and achieves the state-of-the-art results. ",/pdf/d0e2ccbb91943186e71e62c25ca5f1bbc994e31c.pdf,ICLR,2019,We propose a novel meta-learning framework for transductive inference that classifies the entire test set at once to alleviate the low-data problem. +rypT3fb0b,rJuF3fWCZ,1509140000000.0,1518730000000.0,973,LEARNING TO SHARE: SIMULTANEOUS PARAMETER TYING AND SPARSIFICATION IN DEEP LEARNING,"[""dejiao@umich.edu"", ""hzwang@umich.edu"", ""mario.figueiredo@lx.it.pt"", ""girasole@umich.edu""]","[""Dejiao Zhang"", ""Haozhu Wang"", ""Mario Figueiredo"", ""Laura Balzano""]","[""Compressing neural network"", ""simultaneously parameter tying and sparsification"", ""group ordered l1 regularization""]","Deep neural networks (DNNs) usually contain millions, maybe billions, of parameters/weights, making both storage and computation very expensive. This has motivated a large body of work to reduce the complexity of the neural network by using sparsity-inducing regularizers. Another well-known approach for controlling the complexity of DNNs is parameter sharing/tying, where certain sets of weights are forced to share a common value. 
Some forms of weight sharing are hard-wired to express certain in- variances, with a notable example being the shift-invariance of convolutional layers. However, there may be other groups of weights that may be tied together during the learning process, thus further re- ducing the complexity of the network. In this paper, we adopt a recently proposed sparsity-inducing regularizer, named GrOWL (group ordered weighted l1), which encourages sparsity and, simulta- neously, learns which groups of parameters should share a common value. GrOWL has been proven effective in linear regression, being able to identify and cope with strongly correlated covariates. Unlike standard sparsity-inducing regularizers (e.g., l1 a.k.a. Lasso), GrOWL not only eliminates unimportant neurons by setting all the corresponding weights to zero, but also explicitly identifies strongly correlated neurons by tying the corresponding weights to a common value. This ability of GrOWL motivates the following two-stage procedure: (i) use GrOWL regularization in the training process to simultaneously identify significant neurons and groups of parameter that should be tied together; (ii) retrain the network, enforcing the structure that was unveiled in the previous phase, i.e., keeping only the significant neurons and enforcing the learned tying structure. We evaluate the proposed approach on several benchmark datasets, showing that it can dramatically compress the network with slight or even no loss on generalization performance. +",/pdf/10f87359c2c4fa8b505a2a40ef3d2a23247a8fda.pdf,ICLR,2018,We have proposed using the recent GrOWL regularizer for simultaneous parameter sparsity and tying in DNN learning. 
+aJLjjpi0Vty,kU1optsDhmB,1601310000000.0,1614990000000.0,3783,Collaborative Filtering with Smooth Reconstruction of the Preference Function,"[""~Ali_Shirali1"", ""~Reza_Kazemi1"", ""~Arash_Amini3""]","[""Ali Shirali"", ""Reza Kazemi"", ""Arash Amini""]","[""collaborative filtering"", ""recommender system"", ""sampling theory""]","The problem of predicting the rating of a set of users to a set of items in a recommender system based on partial knowledge of the ratings is widely known as collaborative filtering. In this paper, we consider a mapping of the items into a vector space and study the prediction problem by assuming an underlying smooth preference function for each user, the quantization at each given vector yields the associated rating. To estimate the preference functions, we implicitly cluster the users with similar ratings to form dominant types. Next, we associate each dominant type with a smooth preference function; i.e., the function values for items with nearby vectors shall be close to each other. +The latter is accomplished by a rich representation learning in a so-called frequency domain. In this framework, we propose two approaches for learning user and item representations. First, we use an alternating optimization method in the spirit of $k$-means to cluster users and map items. We further make this approach less prone to overfitting by a boosting technique. +Second, we present a feedforward neural network architecture consisting of interpretable layers which implicitly clusters the users. The performance of the method is evaluated on two benchmark datasets (ML-100k and ML-1M). Albeit the method benefits from simplicity, it shows a remarkable performance and opens an avenue for future research. All codes are publicly available on the GitLab.",/pdf/7ffc634747a4de99e2404a4817592af42ae488ea.pdf,ICLR,2021,"By putting minimal constraints on the preference function, we reconstruct ratings in two different interpretable ways." 
+BJx7N1SKvB,r1lqBj3dPS,1569440000000.0,1577170000000.0,1647,A Random Matrix Perspective on Mixtures of Nonlinearities in High Dimensions,"[""adlam@google.com"", ""jpennin@google.com"", ""jlev@google.com""]","[""Ben Adlam"", ""Jake Levinson"", ""Jeffrey Pennington""]",[],"One of the distinguishing characteristics of modern deep learning systems is that they typically employ neural network architectures that utilize enormous numbers of parameters, often in the millions and sometimes even in the billions. While this paradigm has inspired significant research on the properties of large networks, relatively little work has been devoted to the fact that these networks are often used to model large complex datasets, which may themselves contain millions or even billions of constraints. In this work, we focus on this high-dimensional regime in which both the dataset size and the number of features tend to infinity. We analyze the performance of a simple regression model trained on the random features $F=f(WX+B)$ for a random weight matrix $W$ and random bias vector $B$, obtaining an exact formula for the asymptotic training error on a noisy autoencoding task. The role of the bias can be understood as parameterizing a distribution over activation functions, and our analysis actually extends to general such distributions, even those not expressible with a traditional additive bias. 
Intriguingly, we find that a mixture of nonlinearities can outperform the best single nonlinearity on the noisy autoencoding task, suggesting that mixtures of nonlinearities might be useful for approximate kernel methods or neural network architecture design.",/pdf/ce43d830186bc4455778324d5c578d134adaf67f.pdf,ICLR,2020, +XLfdzwNKzch,44qLUIgG4x,1601310000000.0,1616050000000.0,1127,SEDONA: Search for Decoupled Neural Networks toward Greedy Block-wise Learning,"[""~Myeongjang_Pyeon1"", ""~Jihwan_Moon2"", ""~Taeyoung_Hahn1"", ""~Gunhee_Kim1""]","[""Myeongjang Pyeon"", ""Jihwan Moon"", ""Taeyoung Hahn"", ""Gunhee Kim""]","[""AutoML"", ""Neural Architecture Search"", ""Greedy Learning"", ""Deep Learning""]","Backward locking and update locking are well-known sources of inefficiency in backpropagation that prevent layers from being updated concurrently. Several works have recently suggested using local error signals to train network blocks asynchronously to overcome these limitations. However, they often require numerous iterations of trial-and-error to find the best configuration for local training, including how to decouple network blocks and which auxiliary networks to use for each block. In this work, we propose a differentiable search algorithm named SEDONA to automate this process. Experimental results show that our algorithm can consistently discover transferable decoupled architectures for VGG and ResNet variants, and significantly outperforms the ones trained with end-to-end backpropagation and other state-of-the-art greedy-learning methods in CIFAR-10, Tiny-ImageNet and ImageNet.",/pdf/0b88f9a4f997ec52f81dce9a0410a6b5646e18a8.pdf,ICLR,2021,"Our approach is the first attempt to automate decoupling neural networks for greedy block-wise learning and outperforms both end-to-end backprop and state-of-the-art greedy-learning methods on CIFAR-10, Tiny-ImageNet and ImageNet classification." 
+vlcVTDaufN,ajTcIgHh9h1,1601310000000.0,1614990000000.0,1783,Differentiable Combinatorial Losses through Generalized Gradients of Linear Programs,"[""gaox2@vcu.edu"", ""~Han_Zhang6"", ""~Aliakbar_Panahi1"", ""~Tom_Arodz1""]","[""Xi Gao"", ""Han Zhang"", ""Aliakbar Panahi"", ""Tom Arodz""]","[""combinatorial optimization"", ""linear programs"", ""generalized gradient""]","Combinatorial problems with linear objective function play a central role in many computer science applications, and efficient algorithms for solving them are well known. However, the solutions to these problems are not differentiable with respect to the parameters specifying the problem instance – for example, shortest distance between two nodes in a graph is not a differentiable function of graph edge weights. Recently, attempts to integrate combinatorial and, more broadly, convex optimization solvers into gradient-trained models resulted in several approaches for differentiating over the solution vector to the optimization problem. However, in many cases, the interest is in differentiating over only the objective value, not the solution vector, and using existing approaches introduces unnecessary overhead. Here, we show how to perform gradient descent directly over the objective value of the solution to combinatorial problems. We demonstrate advantage of the approach in examples involving sequence-to-sequence modeling using differentiable encoder-decoder architecture with softmax or Gumbel-softmax, and in weakly supervised learning involving a convolutional, residual feed-forward network for image classification. +",/pdf/5f56218dc9cfdf7fd981ed0dbc31be519e834c14.pdf,ICLR,2021,"We show how to differentiate over the objective value of the optimal solution to a combinatorial problem, using a single run to a black-box combinatorial solver." 
+Px7xIKHjmMS,#NAME?,1601310000000.0,1614990000000.0,2207,Beyond GNNs: A Sample Efficient Architecture for Graph Problems,"[""~Pranjal_Awasthi3"", ""abhidas@google.com"", ""~Sreenivas_Gollapudi2""]","[""Pranjal Awasthi"", ""Abhimanyu Das"", ""Sreenivas Gollapudi""]","[""Graph Neural Networks"", ""Deep Learning Theory"", ""Graph Connectivity"", ""Minimum Spanning Trees""]","Despite their popularity in learning problems over graph structured data, existing Graph Neural Networks (GNNs) have inherent limitations for fundamental graph problems such as shortest paths, $k$-connectivity, minimum spanning tree and minimum cuts. In all these instances, it is known that one needs GNNs of high depth, scaling at a polynomial rate with the number of nodes $n$, to provably encode the solution space. This in turn affects their statistical efficiency thus requiring a significant amount of training data in order to obtain networks with good performance. In this work we propose a new hybrid architecture to overcome this limitation. Our proposed architecture that we call as GNNplus networks involve a combination of multiple parallel low depth GNNs along with simple pooling layers involving low depth fully connected networks. We provably demonstrate that for many graph problems, the solution space can be encoded by GNNplus networks using depth that scales only poly-logarithmically in the number of nodes. This significantly improves the amount of training data needed that we establish via improved generalization bounds. Finally, we empirically demonstrate the effectiveness of our proposed architecture for a variety of graph problems. +",/pdf/5b1b77ad997fdde546669c8af0e05570b6e03bdb.pdf,ICLR,2021,"We propose a new provably sample efficient GNN architecture for learning many fundamental graph problems, with sample complexity that scales poly-logarithmically in the graph size. 
" +OodqmQT3fir,HNVpJSawP0f,1601310000000.0,1614990000000.0,3371,XLVIN: eXecuted Latent Value Iteration Nets,"[""~Andreea_Deac1"", ""~Petar_Veli\u010dkovi\u01071"", ""ognjen7amg@gmail.com"", ""~Pierre-Luc_Bacon1"", ""~Jian_Tang1"", ""mladennik@gmail.com""]","[""Andreea Deac"", ""Petar Veli\u010dkovi\u0107"", ""Ognjen Milinkovic"", ""Pierre-Luc Bacon"", ""Jian Tang"", ""Mladen Nikolic""]","[""value iteration"", ""graph neural networks"", ""reinforcement learning""]","Value Iteration Networks (VINs) have emerged as a popular method to perform implicit planning within deep reinforcement learning, enabling performance improvements on tasks requiring long-range reasoning and understanding of environment dynamics. This came with several limitations, however: the model is not explicitly incentivised to perform meaningful planning computations, the underlying state space is assumed to be discrete, and the Markov decision process (MDP) is assumed fixed and known. We propose eXecuted Latent Value Iteration Networks (XLVINs), which combine recent developments across contrastive self-supervised learning, graph representation learning and neural algorithmic reasoning to alleviate all of the above limitations, successfully deploying VIN-style models on generic environments. XLVINs match the performance of VIN-like models when the underlying MDP is discrete, fixed and known, and provide significant improvements to model-free baselines across three general MDP setups.",/pdf/b8deb72b99ad0755d9e701b657dfa00df4464939.pdf,ICLR,2021,"We combine contrastive self-supervised learning, graph representation learning and neural algorithm execution to perform value iteration in the latent space, generalising VINs to arbitrary domains." 
+Mub9VkGZoZe,8gnEvcI4lIg,1601310000000.0,1614990000000.0,1597,Identifying Informative Latent Variables Learned by GIN via Mutual Information,"[""~Chen_Zhang6"", ""~Yitong_Sun1"", ""~Mingtian_Zhang1""]","[""Chen Zhang"", ""Yitong Sun"", ""Mingtian Zhang""]",[],"How to learn a good representation of data is one of the most important topics of machine learning. Disentanglement of representations, though believed to be the core feature of good representations, has caused a lot of debates and discussions in recent. Sorrenson et al. (2020), using the techniques developed in nonlinear independent analysis theory, show that general incompressible-flow networks (GIN) can recover the underlying latent variables that generate the data, and thus can provide a compact and disentangled representation. However, in this paper, we point out that the method taken by GIN for informative latent variables identification is not theoretically supported and can be disproved by experiments. We propose to use the mutual information between latent variables and the auxiliary variable to correctly identify informative latent variables. We directly verify the improvement brought by our method in experiments on synthetic data. We further show the advantage of our method on various downstream tasks including classification, outlier detection and adversarial attack defence.",/pdf/6e09f0204ad80de78b3559854ce573146b695c70.pdf,ICLR,2021,Identifying Informative Latent Variables Learned by GIN via Mutual Information +P5RQfyAmrU,Mdf57io_V3aO,1601310000000.0,1614990000000.0,1888, Model-centric data manifold: the data through the eyes of the model,"[""lgrementieri@nextbit.it"", ""~Rita_Fioresi1""]","[""Luca Grementieri"", ""Rita Fioresi""]","[""Deep Learning"", ""Information Geometry"", ""Data Manifold"", ""Fisher matrix""]","We discover that deep ReLU neural network classifiers can see a low-dimensional Riemannian manifold structure on data. 
Such structure comes via the local data matrix, a variation of the Fisher information matrix, where the role of the model parameters is taken by the data variables. We obtain a foliation of the data domain and we show that the dataset on which the model is trained lies on a leaf, the data leaf, whose dimension is bounded by the number of classification labels. We validate our results with some experiments with the MNIST dataset: paths on the data leaf connect valid images, while other leaves cover noisy images. +",/pdf/1f031e168bd5c1774699b588bc76902a86441a15.pdf,ICLR,2021,We discover that deep ReLU neural network classifiers can see a low-dimensional Riemannian manifold structure on data. +vY0bnzBBvtr,IJw-vSLym4e,1601310000000.0,1614990000000.0,2775,Provably More Efficient Q-Learning in the One-Sided-Feedback/Full-Feedback Settings,"[""~Xiao-Yue_Gong1"", ""~David_Simchi-Levi2""]","[""Xiao-Yue Gong"", ""David Simchi-Levi""]","[""Q-learning"", ""episodic MDP"", ""full-feedback"", ""one-sided-feedback"", ""inventory control"", ""inventory""]","Motivated by the episodic version of the classical inventory control problem, we propose a new Q-learning-based algorithm, Elimination-Based Half-Q-Learning (HQL), that enjoys improved efficiency over existing algorithms for a wide variety of problems in the one-sided-feedback setting. We also provide a simpler variant of the algorithm, Full-Q-Learning (FQL), for the full-feedback setting. We establish that HQL incurs $ \tilde{\mathcal{O}}(H^3\sqrt{ T})$ regret and FQL incurs $\tilde{\mathcal{O}}(H^2\sqrt{ T})$ regret, where $H$ is the length of each episode and $T$ is the total length of the horizon. The regret bounds are not affected by the possibly huge state and action space. 
Our numerical experiments demonstrate the superior efficiency of HQL and FQL, and the potential to combine reinforcement learning with richer feedback models.",/pdf/c33e873651d5c90d20e74adf8740b63a1d04ebbd.pdf,ICLR,2021,"We propose a new Q-learning algorithm that is provably more efficient for the one-sided/full feedback settings than existing Q-learning algorithms, showing the potential for adapting reinforcement learning to more varied structures of problems" +HJxrVA4FDS,rkerAmLuvS,1569440000000.0,1583910000000.0,1073,Disentangling neural mechanisms for perceptual grouping,"[""junkyung_kim@brown.edu"", ""drew_linsley@brown.edu"", ""kalpit_thakkar@brown.edu"", ""thomas_serre@brown.edu""]","[""Junkyung Kim*"", ""Drew Linsley*"", ""Kalpit Thakkar"", ""Thomas Serre""]","[""Perceptual grouping"", ""visual cortex"", ""recurrent feedback"", ""horizontal connections"", ""top-down connections""]","Forming perceptual groups and individuating objects in visual scenes is an essential step towards visual intelligence. This ability is thought to arise in the brain from computations implemented by bottom-up, horizontal, and top-down connections between neurons. However, the relative contributions of these connections to perceptual grouping are poorly understood. We address this question by systematically evaluating neural network architectures featuring combinations bottom-up, horizontal, and top-down connections on two synthetic visual tasks, which stress low-level ""Gestalt"" vs. high-level object cues for perceptual grouping. We show that increasing the difficulty of either task strains learning for networks that rely solely on bottom-up connections. Horizontal connections resolve straining on tasks with Gestalt cues by supporting incremental grouping, whereas top-down connections rescue learning on tasks with high-level object cues by modifying coarse predictions about the position of the target object. 
Our findings dissociate the computational roles of bottom-up, horizontal and top-down connectivity, and demonstrate how a model featuring all of these interactions can more flexibly learn to form perceptual groups.",/pdf/867434cbc79e43a38c672dfaa275569e7d763529.pdf,ICLR,2020,Horizontal and top-down feedback connections are responsible for complementary perceptual grouping strategies in biological and recurrent vision systems. +cy0jU8F60Hy,8C7X5ERPA30,1601310000000.0,1614990000000.0,2789,ACT: Asymptotic Conditional Transport,"[""~Huangjie_Zheng1"", ""~Mingyuan_Zhou1""]","[""Huangjie Zheng"", ""Mingyuan Zhou""]","[""Statistical distance"", ""Divergence"", ""Optimal Transport"", ""Implicit Distribution"", ""Deep Generative Models"", ""GANs""]","We propose conditional transport (CT) as a new divergence to measure the difference between two probability distributions. The CT divergence consists of the expected cost of a forward CT, which constructs a navigator to stochastically transport a data point of one distribution to the other distribution, and that of a backward CT which reverses the transport direction. To apply it to the distributions whose probability density functions are unknown but random samples are accessible, we further introduce asymptotic CT (ACT), whose estimation only requires access to mini-batch based discrete empirical distributions. Equipped with two navigators that amortize the computation of conditional transport plans, the ACT divergence comes with unbiased sample gradients that are straightforward to compute, making it amenable to mini-batch stochastic gradient descent based optimization. When applied to train a generative model, the ACT divergence is shown to strike a good balance between mode covering and seeking behaviors and strongly resist mode collapse. 
To model high-dimensional data, we show that it is sufficient to modify the adversarial game of an existing generative adversarial network (GAN) to a game played by a generator, a forward navigator, and a backward navigator, which try to minimize a distribution-to-distribution transport cost by optimizing both the distribution of the generator and conditional transport plans specified by the navigators, versus a critic that does the opposite by inflating the point-to-point transport cost. On a wide variety of benchmark datasets for generative modeling, substituting the default statistical distance of an existing GAN with the ACT divergence is shown to consistently improve the performance. ",/pdf/5555cee53e8558ea6769639ade36077d9e305c8d.pdf,ICLR,2021,We propose asymptotic conditional transport as a new probability-distance measure and apply it to improve existing deep generative models. +BkgtDsCcKQ,B1lB6lvcK7,1538090000000.0,1550480000000.0,282,Function Space Particle Optimization for Bayesian Neural Networks,"[""wzy196@gmail.com"", ""rtz19970824@gmail.com"", ""dcszj@mail.tsinghua.edu.cn"", ""dcszb@mail.tsinghua.edu.cn""]","[""Ziyu Wang"", ""Tongzheng Ren"", ""Jun Zhu"", ""Bo Zhang""]","[""Bayesian neural networks"", ""uncertainty estimation"", ""variational inference""]","While Bayesian neural networks (BNNs) have drawn increasing attention, their posterior inference remains challenging, due to the high-dimensional and over-parameterized nature. To address this issue, several highly flexible and scalable variational inference procedures based on the idea of particle optimization have been proposed. These methods directly optimize a set of particles to approximate the target posterior. However, their application to BNNs often yields sub-optimal performance, as such methods have a particular failure mode on over-parameterized models. In this paper, we propose to solve this issue by performing particle optimization directly in the space of regression functions. 
We demonstrate through extensive experiments that our method successfully overcomes this issue, and outperforms strong baselines in a variety of tasks including prediction, defense against adversarial examples, and reinforcement learning.",/pdf/fef83ae6842b76284e2de4d04977d1575d0efdf4.pdf,ICLR,2019, +SJiHOSeR-,ry5HOBx0Z,1509080000000.0,1518730000000.0,258,Contextual memory bandit for pro-active dialog engagement,"[""julien.perez@naverlabs.com""]","[""julien perez"", ""Tomi Silander""]","[""contextual bandit"", ""memory network"", ""proactive dialog engagement""]","An objective of pro-activity in dialog systems is to enhance the usability of conversational +agents by enabling them to initiate conversation on their own. While +dialog systems have become increasingly popular during the last couple of years, +current task oriented dialog systems are still mainly reactive and users tend to +initiate conversations. In this paper, we propose to introduce the paradigm of contextual +bandits as framework for pro-active dialog systems. Contextual bandits +have been the model of choice for the problem of reward maximization with partial +feedback since they fit well to the task description. As a second contribution, +we introduce and explore the notion of memory into this paradigm. We propose +two differentiable memory models that act as parts of the parametric reward estimation +function. The first one, Convolutional Selective Memory Networks, uses +a selection of past interactions as part of the decision support. The second model, +called Contextual Attentive Memory Network, implements a differentiable attention +mechanism over the past interactions of the agent. The goal is to generalize +the classic model of contextual bandits to settings where temporal information +needs to be incorporated and leveraged in a learnable manner. 
Finally, we illustrate +the usability and performance of our model for building a pro-active mobile +assistant through an extensive set of experiments.",/pdf/b92de1cb62f44060b0e68e2c5df85e796053c91a.pdf,ICLR,2018, +GBjukBaBLXK,F9aGQJsTtZJ,1601310000000.0,1614990000000.0,2130,Conditional Coverage Estimation for High-quality Prediction Intervals,"[""~Ziyi_Huang1"", ""~Henry_Lam2"", ""~Haofeng_Zhang1""]","[""Ziyi Huang"", ""Henry Lam"", ""Haofeng Zhang""]",[],"Deep learning has achieved state-of-the-art performance to generate high-quality prediction intervals (PIs) for uncertainty quantification in regression tasks. The high-quality criterion requires PIs to be as narrow as possible, whilst maintaining a pre-specified level of data (marginal) coverage. However, most existing works for high-quality PIs lack accurate information on conditional coverage, which may cause unreliable predictions if it is significantly smaller than the marginal coverage. To address this problem, we propose a novel end-to-end framework which could output high-quality PIs and simultaneously provide their conditional coverage estimation. In doing so, we design a new loss function that is both easy-to-implement and theoretically justified via an exponential concentration bound. Our evaluation on real-world benchmark datasets and synthetic examples shows that our approach not only outperforms the state-of-the-arts on high-quality PIs in terms of average PI width, but also accurately estimates conditional coverage information that is useful in assessing model uncertainty. 
",/pdf/9c830f56dc7a4c346edaed8e8809dae94134532c.pdf,ICLR,2021, +B1ydPgTpW,SkAvwxTpZ,1508870000000.0,1518730000000.0,66,Predicting Auction Price of Vehicle License Plate with Deep Recurrent Neural Network,"[""vincichow@cuhk.edu.hk""]","[""Vinci Chow""]","[""price predictions"", ""expert system"", ""recurrent neural networks"", ""deep learning"", ""natural language processing""]","In Chinese societies, superstition is of paramount importance, and vehicle license plates with desirable numbers can fetch very high prices in auctions. Unlike other valuable items, license plates are not allocated an estimated price before auction. + +I propose that the task of predicting plate prices can be viewed as a natural language processing (NLP) task, as the value depends on the meaning of each individual character on the plate and its semantics. I construct a deep recurrent neural network (RNN) to predict the prices of vehicle license plates in Hong Kong, based on the characters on a plate. I demonstrate the importance of having a deep network and of retraining. Evaluated on 13 years of historical auction prices, the deep RNN's predictions can explain over 80 percent of price variations, outperforming previous models by a significant margin. I also demonstrate how the model can be extended to become a search engine for plates and to provide estimates of the expected price distribution.",/pdf/9ebb806499d1786c018781d90a2ef3364bfc68aa.pdf,ICLR,2018,"Predicting auction price of vehicle license plates in Hong Kong with deep recurrent neural network, based on the characters on the plates." +Y45i-hDynr,DYVc0SqQtD_,1601310000000.0,1614990000000.0,907,Parameterized Pseudo-Differential Operators for Graph Convolutional Neural Networks,"[""~Kevin_M._Potter1"", ""ssleder@sandia.gov"", ""mdsmith@sandia.gov"", ""jtencer@sandia.gov""]","[""Kevin M. 
Potter"", ""Steven Richard Sleder"", ""Matthew David Smith"", ""John Tencer""]","[""graph convolutional neural network"", ""superpixel"", ""FAUST"", ""differential operators""]","We present a novel graph convolutional layer that is fast, conceptually simple, and provides high accuracy with reduced overfitting. Based on pseudo-differential operators, our layer operates on graphs with relative position information available for each pair of connected nodes. We evaluate our method on a variety of supervised learning tasks, including superpixel image classification using the MNIST, CIFAR10, and CIFAR100 superpixel datasets, node correspondence using the FAUST dataset, and shape classification using the ModelNet10 dataset. The new layer outperforms multiple recent architectures on superpixel image classification tasks using the MNIST and CIFAR100 superpixel datasets and performs comparably with recent results on the CIFAR10 superpixel dataset. We measure test accuracy without bias to the test set by selecting the model with the best training accuracy. The new layer achieves a test error rate of 0.80% on the MNIST superpixel dataset, beating the closest reported rate of 0.95% by a factor of more than 15%. After dropping roughly 70% of the edge connections from the input by performing a Delaunay triangulation, our model still achieves a competitive error rate of 1.04%.",/pdf/a86c6d26e57905eac75b2d2e545fb460ebf4d220.pdf,ICLR,2021,We introduce a differential operator based graph convolutional layer that outperforms other work on superpixel image classification tasks in speed and accuracy. 
+S1zz2i0cY7,S1x8hZacKX,1538090000000.0,1550870000000.0,690,Integer Networks for Data Compression with Latent-Variable Models,"[""jballe@google.com"", ""nickj@google.com"", ""dminnen@google.com""]","[""Johannes Ball\u00e9"", ""Nick Johnston"", ""David Minnen""]","[""data compression"", ""variational models"", ""network quantization""]","We consider the problem of using variational latent-variable models for data compression. For such models to produce a compressed binary sequence, which is the universal data representation in a digital world, the latent representation needs to be subjected to entropy coding. Range coding as an entropy coding technique is optimal, but it can fail catastrophically if the computation of the prior differs even slightly between the sending and the receiving side. Unfortunately, this is a common scenario when floating point math is used and the sender and receiver operate on different hardware or software platforms, as numerical round-off is often platform dependent. We propose using integer networks as a universal solution to this problem, and demonstrate that they enable reliable cross-platform encoding and decoding of images using variational models.",/pdf/0a713a9470745f64dc33d571c435df57adc94b57.pdf,ICLR,2019,We train variational models with quantized networks for computational determinism. This enables using them for cross-platform data compression. 
+gIHd-5X324,aRjL1mC0Qrp,1601310000000.0,1613190000000.0,138,Rethinking Soft Labels for Knowledge Distillation: A Bias–Variance Tradeoff Perspective,"[""~Helong_Zhou1"", ""~Liangchen_Song1"", ""~Jiajie_Chen1"", ""~Ye_Zhou2"", ""~Guoli_Wang2"", ""~Junsong_Yuan2"", ""~Qian_Zhang7""]","[""Helong Zhou"", ""Liangchen Song"", ""Jiajie Chen"", ""Ye Zhou"", ""Guoli Wang"", ""Junsong Yuan"", ""Qian Zhang""]","[""Knowledge distillation"", ""soft labels"", ""teacher-student model""]","Knowledge distillation is an effective approach to leverage a well-trained network or an ensemble of them, named as the teacher, to guide the training of a student network. The outputs from the teacher network are used as soft labels for supervising the training of a new network. Recent studies (M ̈uller et al., 2019; Yuan et al., 2020) revealed an intriguing property of the soft labels that making labels soft serves as a good regularization to the student network. From the perspective of statistical learning, regularization aims to reduce the variance, however how bias and variance change is not clear for training with soft labels. In this paper, we investigate the bias-variance tradeoff brought by distillation with soft labels. Specifically, we observe that during training the bias-variance tradeoff varies sample-wisely. Further, under the same distillation temperature setting, we observe that the distillation performance is negatively associated with the number of some specific samples, which are named as regularization samples since these samples lead to bias increasing and variance decreasing. Nevertheless, we empirically find that completely filtering out regularization samples also deteriorates distillation performance. Our discoveries inspired us to propose the novel weighted soft labels to help the network adaptively handle the sample-wise bias-variance tradeoff. Experiments on standard evaluation benchmarks validate the effectiveness of our method. 
Our code is available in the supplementary.",/pdf/2e59cd77d556e74ed337a3cd2829da3be79dc212.pdf,ICLR,2021,"For knowledge distillation, we analyze the regularization effect introduced by soft labels from a bias-variance perspective and propose weighted soft labels to handle the tradeoff." +rJgE9CEYPS,rylsaKOdPr,1569440000000.0,1577170000000.0,1278,Discriminability Distillation in Group Representation Learning,"[""zhangmanyuan@sensetime.com"", ""songguanglu@sensetime.com"", ""yuliu@ee.cuhk.edu.hk"", ""zhouhang@link.cuhk.edu.hk""]","[""Manyuan Zhang\uff0cGuanglu Song\uff0cYu Liu\uff0cHang Zhou""]",[],"Learning group representation is a commonly concerned issue in tasks where the basic unit is a group, set or sequence. +The computer vision community tries to tackle it by aggregating the elements in a group based on an indicator either defined by human such as the quality or saliency of an element, or generated by a black box such as the attention score or output of a RNN. + +This article provides a more essential and explicable view. +We claim the most significant indicator to show whether the group representation can be benefited from an element is not the quality, or an inexplicable score, but the \textit{discrimiability}. +Our key insight is to explicitly design the \textit{discrimiability} using embedded class centroids on a proxy set, +and show the discrimiability distribution \textit{w.r.t.} the element space can be distilled by a light-weight auxiliary distillation network. +This processing is called \textit{discriminability distillation learning} (DDL). +We show the proposed DDL can be flexibly plugged into many group based recognition tasks without influencing the training procedure of the original tasks. 
Comprehensive experiments on set-to-set face recognition and action recognition valid the advantage of DDL on both accuracy and efficiency, and it pushes forward the state-of-the-art results on these tasks by an impressive margin.",/pdf/aaac35bc50fd15dc9310aad48faa3690d06bdd9c.pdf,ICLR,2020, +H1cWzoxA-,H1YZzjg0b,1509110000000.0,1519370000000.0,366,Bi-Directional Block Self-Attention for Fast and Memory-Efficient Sequence Modeling,"[""tao.shen@student.uts.edu.au"", ""tianyizh@uw.edu"", ""guodong.long@uts.edu.au"", ""jing.jiang@uts.edu.au"", ""chengqi.zhang@uts.edu.au""]","[""Tao Shen"", ""Tianyi Zhou"", ""Guodong Long"", ""Jing Jiang"", ""Chengqi Zhang""]","[""deep learning"", ""attention mechanism"", ""sequence modeling"", ""natural language processing"", ""sentence embedding""]","Recurrent neural networks (RNN), convolutional neural networks (CNN) and self-attention networks (SAN) are commonly used to produce context-aware representations. RNN can capture long-range dependency but is hard to parallelize and not time-efficient. CNN focuses on local dependency but does not perform well on some tasks. SAN can model both such dependencies via highly parallelizable computation, but memory requirement grows rapidly in line with sequence length. In this paper, we propose a model, called ""bi-directional block self-attention network (Bi-BloSAN)"", for RNN/CNN-free sequence encoding. It requires as little memory as RNN but with all the merits of SAN. Bi-BloSAN splits the entire sequence into blocks, and applies an intra-block SAN to each block for modeling local context, then applies an inter-block SAN to the outputs for all blocks to capture long-range dependency. Thus, each SAN only needs to process a short sequence, and only a small amount of memory is required. Additionally, we use feature-level attention to handle the variation of contexts around the same word, and use forward/backward masks to encode temporal order information. 
On nine benchmark datasets for different NLP tasks, Bi-BloSAN achieves or improves upon state-of-the-art accuracy, and shows better efficiency-memory trade-off than existing RNN/CNN/SAN. ",/pdf/0b97f247ebabc1c9e173399eaadb27d0e949901e.pdf,ICLR,2018,"A self-attention network for RNN/CNN-free sequence encoding with small memory consumption, highly parallelizable computation and state-of-the-art performance on several NLP tasks" +S1Jhfftgx,,1478200000000.0,1478620000000.0,79,Enforcing constraints on outputs with unconstrained inference,"[""lee.jayyoon@gmail.com"", ""michael.wick@oracle.com"", ""jean.baptiste.tristan@oracle.com""]","[""Jay Yoon Lee"", ""Michael L. Wick"", ""Jean-Baptiste Tristan""]","[""Natural language processing"", ""Structured prediction"", ""Deep learning""]"," Increasingly, practitioners apply neural networks to complex + problems in natural language processing (NLP), such as syntactic + parsing, that have rich output structures. Many such applications + require deterministic constraints on the output values; for example, + requiring that the sequential outputs encode a valid tree. While + hidden units might capture such properties, the network is not + always able to learn them from the training data alone, and + practitioners must then resort to post-processing. In this paper, we + present an inference method for neural networks that enforces + deterministic constraints on outputs without performing + post-processing or expensive discrete search over the feasible + space. Instead, for each input, we nudge the continuous weights + until the network's unconstrained inference procedure generates an + output that satisfies the constraints. 
We find that our method + reduces the number of violating outputs by up to 81\%, while + improving accuracy.",/pdf/a97734ee064ecfd6fa3c08eb26560950428a7d37.pdf,ICLR,2017,"An inference method for enforcing hard constraints on the outputs of neural networks without combinatorial search, with applications in NLP and structured prediction." +rkgpy3C5tX,HkeTCro9F7,1538090000000.0,1551080000000.0,1024,Amortized Bayesian Meta-Learning,"[""sachinr@princeton.edu"", ""abeatson@cs.princeton.edu""]","[""Sachin Ravi"", ""Alex Beatson""]","[""variational inference"", ""meta-learning"", ""few-shot learning"", ""uncertainty quantification""]","Meta-learning, or learning-to-learn, has proven to be a successful strategy in attacking problems in supervised learning and reinforcement learning that involve small amounts of data. State-of-the-art solutions involve learning an initialization and/or learning algorithm using a set of training episodes so that the meta learner can generalize to an evaluation episode quickly. These methods perform well but often lack good quantification of uncertainty, which can be vital to real-world applications when data is lacking. We propose a meta-learning method which efficiently amortizes hierarchical variational inference across tasks, learning a prior distribution over neural network weights so that a few steps of Bayes by Backprop will produce a good task-specific approximate posterior. We show that our method produces good uncertainty estimates on contextual bandit and few-shot learning benchmarks.",/pdf/8a6cffb9153b7642051354aac96e10151f05ae3d.pdf,ICLR,2019,We propose a meta-learning method which efficiently amortizes hierarchical variational inference across training episodes. 
+rJgJDAVKvB,ryllkrvuDr,1569440000000.0,1583910000000.0,1161,Learning to Plan in High Dimensions via Neural Exploration-Exploitation Trees,"[""binghong@gatech.edu"", ""bodai@google.com"", ""qinjielin2018@u.northwestern.edu"", ""guoye2018@u.northwestern.edu"", ""hanliu@northwestern.edu"", ""lsong@cc.gatech.edu""]","[""Binghong Chen"", ""Bo Dai"", ""Qinjie Lin"", ""Guo Ye"", ""Han Liu"", ""Le Song""]","[""learning to plan"", ""representation learning"", ""learning to design algorithm"", ""reinforcement learning"", ""meta learning""]","We propose a meta path planning algorithm named \emph{Neural Exploration-Exploitation Trees~(NEXT)} for learning from prior experience for solving new path planning problems in high dimensional continuous state and action spaces. Compared to more classical sampling-based methods like RRT, our approach achieves much better sample efficiency in high-dimensions and can benefit from prior experience of planning in similar environments. More specifically, NEXT exploits a novel neural architecture which can learn promising search directions from problem structures. The learned prior is then integrated into a UCB-type algorithm to achieve an online balance between \emph{exploration} and \emph{exploitation} when solving a new problem. We conduct thorough experiments to show that NEXT accomplishes new planning problems with more compact search trees and significantly outperforms state-of-the-art methods on several benchmarks.",/pdf/c45825c9605af935d5e51f065e4b4499bf2b5bde.pdf,ICLR,2020,We propose a meta path planning algorithm which exploits a novel attention-based neural module that can learn generalizable structures from prior experiences to drastically reduce the sample requirement for solving new path planning problems. 
+SygagpEKwB,Skxi8mr8vS,1569440000000.0,1583910000000.0,355,Disentangling Factors of Variations Using Few Labels,"[""flocatello@tuebingen.mpg.de"", ""tschannen@google.com"", ""stefan.bauer@tuebingen.mpg.de"", ""raetsch@inf.ethz.ch"", ""bs@tuebingen.mpg.de"", ""bachem@google.com""]","[""Francesco Locatello"", ""Michael Tschannen"", ""Stefan Bauer"", ""Gunnar R\u00e4tsch"", ""Bernhard Sch\u00f6lkopf"", ""Olivier Bachem""]",[],"Learning disentangled representations is considered a cornerstone problem in representation learning. Recently, Locatello et al. (2019) demonstrated that unsupervised disentanglement learning without inductive biases is theoretically impossible and that existing inductive biases and unsupervised methods do not allow to consistently learn disentangled representations. However, in many practical settings, one might have access to a limited amount of supervision, for example through manual labeling of (some) factors of variation in a few training examples. In this paper, we investigate the impact of such supervision on state-of-the-art disentanglement methods and perform a large scale study, training over 52000 models under well-defined and reproducible experimental conditions. We observe that a small number of labeled examples (0.01--0.5% of the data set), with potentially imprecise and incomplete labels, is sufficient to perform model selection on state-of-the-art unsupervised models. Further, we investigate the benefit of incorporating supervision into the training process. Overall, we empirically validate that with little and imprecise supervision it is possible to reliably learn disentangled representations.",/pdf/dacac03f52b564ce8585074d3c0ffcdab7ef9f1e.pdf,ICLR,2020, +Hkl1iRNFwS,rJxdukYuDS,1569440000000.0,1583910000000.0,1303,The Early Phase of Neural Network Training,"[""jfrankle@mit.edu"", ""dschwab@gc.cuny.edu"", ""arimorcos@gmail.com""]","[""Jonathan Frankle"", ""David J. Schwab"", ""Ari S. 
Morcos""]","[""empirical"", ""learning dynamics"", ""lottery tickets"", ""critical periods"", ""early""]","Recent studies have shown that many important aspects of neural network learning take place within the very earliest iterations or epochs of training. For example, sparse, trainable sub-networks emerge (Frankle et al., 2019), gradient descent moves into a small subspace (Gur-Ari et al., 2018), and the network undergoes a critical period (Achille et al., 2019). Here we examine the changes that deep neural networks undergo during this early phase of training. We perform extensive measurements of the network state and its updates during these early iterations of training, and leverage the framework of Frankle et al. (2019) to quantitatively probe the weight distribution and its reliance on various aspects of the dataset. We find that, within this framework, deep networks are not robust to reinitializing with random weights while maintaining signs, and that weight distributions are highly non-independent even after only a few hundred iterations. Despite this, pre-training with blurred inputs or an auxiliary self-supervised task can approximate the changes in supervised networks, suggesting that these changes are label-agnostic, though labels significantly accelerate this process. Together, these results help to elucidate the network changes occurring during this pivotal initial period of learning.",/pdf/974ac3614f6b53c7b83ae7fc0141a2e6e24118b4.pdf,ICLR,2020,"We thoroughly investigate neural network learning dynamics over the early phase of training, finding that these changes are crucial and difficult to approximate, though extended pretraining can recover them." 
+r1gs9JgRZ,H1kjqyxCZ,1509060000000.0,1518730000000.0,203,Mixed Precision Training,"[""pauliusm@nvidia.com"", ""sharan@baidu.com"", ""alben@nvidia.com"", ""gdiamos@baidu.com"", ""eriche@google.com"", ""dagarcia@nvidia.com"", ""bginsburg@nvidia.com"", ""mhouston@nvidia.com"", ""okuchaiev@nvidia.com"", ""gavenkatesh@nvidia.com"", ""skyw@nvidia.com""]","[""Paulius Micikevicius"", ""Sharan Narang"", ""Jonah Alben"", ""Gregory Diamos"", ""Erich Elsen"", ""David Garcia"", ""Boris Ginsburg"", ""Michael Houston"", ""Oleksii Kuchaiev"", ""Ganesh Venkatesh"", ""Hao Wu""]","[""Half precision"", ""float16"", ""Convolutional neural networks"", ""Recurrent neural networks""]","Increasing the size of a neural network typically improves accuracy but also increases the memory and compute requirements for training the model. We introduce methodology for training deep neural networks using half-precision floating point numbers, without losing model accuracy or having to modify hyper-parameters. This nearly halves memory requirements and, on recent GPUs, speeds up arithmetic. Weights, activations, and gradients are stored in IEEE half-precision format. Since this format has a narrower range than single-precision we propose three techniques for preventing the loss of critical information. Firstly, we recommend maintaining a single-precision copy of weights that accumulates the gradients after each optimizer step (this copy is rounded to half-precision for the forward- and back-propagation). Secondly, we propose loss-scaling to preserve gradient values with small magnitudes. Thirdly, we use half-precision arithmetic that accumulates into single-precision outputs, which are converted to half-precision before storing to memory. 
We demonstrate that the proposed methodology works across a wide variety of tasks and modern large scale (exceeding 100 million parameters) model architectures, trained on large datasets.",/pdf/575abe1d433099e0cd046ec3bbcf10eaa1022ec6.pdf,ICLR,2018, +87ZwsaQNHPZ,BVWoyGozdV3,1601310000000.0,1621670000000.0,2578,CPT: Efficient Deep Neural Network Training via Cyclic Precision,"[""~Yonggan_Fu1"", ""hg31@rice.edu"", ""meng.li@fb.com"", ""xy33@rice.edu"", ""yd31@rice.edu"", ""~Vikas_Chandra2"", ""~Yingyan_Lin1""]","[""Yonggan Fu"", ""Han Guo"", ""Meng Li"", ""Xin Yang"", ""Yining Ding"", ""Vikas Chandra"", ""Yingyan Lin""]","[""Efficient training"", ""low precision training""]","Low-precision deep neural network (DNN) training has gained tremendous attention as reducing precision is one of the most effective knobs for boosting DNNs' training time/energy efficiency. In this paper, we attempt to explore low-precision training from a new perspective as inspired by recent findings in understanding DNN training: we conjecture that DNNs' precision might have a similar effect as the learning rate during DNN training, and advocate dynamic precision along the training trajectory for further boosting the time/energy efficiency of DNN training. Specifically, we propose Cyclic Precision Training (CPT) to cyclically vary the precision between two boundary values which can be identified using a simple precision range test within the first few training epochs. Extensive simulations and ablation studies on five datasets and eleven models demonstrate that CPT's effectiveness is consistent across various models/tasks (including classification and language modeling). 
Furthermore, through experiments and visualization we show that CPT helps to (1) converge to a wider minima with a lower generalization error and (2) reduce training variance which we believe opens up a new design knob for simultaneously improving the optimization and efficiency of DNN training.",/pdf/2f7dc996e355dc20f6d4b64c9b9cf272cfac9117.pdf,ICLR,2021,We propose Cyclic Precision Training towards better accuracy-efficiency trade-offs in DNN training. +S1efAp4YvB,r1lxOw-uvH,1569440000000.0,1577170000000.0,848,Interpreting video features: a comparison of 3D convolutional networks and convolutional LSTM networks,"[""sbroome@kth.se"", ""manttari@kth.se"", ""johnf@kth.se"", ""hedvig@kth.se""]","[""Joonatan M\u00e4ntt\u00e4ri*"", ""Sofia Broom\u00e9*"", ""John Folkesson"", ""Hedvig Kjellstr\u00f6m""]","[""interpretability"", ""spatiotemporal"", ""video"", ""features"", ""saliency"", ""temporal""]","A number of techniques for interpretability have been presented for deep learning +in computer vision, typically with the goal of understanding what it is that the networks +have actually learned underneath a given classification decision. However, +when it comes to deep video architectures, interpretability is still in its infancy and +we do not yet have a clear concept of how we should decode spatiotemporal features. +In this paper, we present a study comparing how 3D convolutional networks +and convolutional LSTM networks respectively learn features across temporally +dependent frames. This is the first comparison of two video models that both +convolve to learn spatial features but that have principally different methods of +modeling time. 
Additionally, we extend the concept of meaningful perturbation +introduced by Fong & Vedaldi (2017) to the temporal dimension to search for the +most meaningful part of a sequence for a classification decision.",/pdf/8df5655e1b0bdbf915f86fdb61028b55cb060643.pdf,ICLR,2020,We investigate what spatiotemporal features are focused on in video data by two models that are principally different in the way that they model temporal dependencies. +SJem8lSFwB,HyGK6FxYPH,1569440000000.0,1583910000000.0,2317,Dynamic Model Pruning with Feedback,"[""tao.lin@epfl.ch"", ""sebastian.stich@epfl.ch"", ""luis.barba@inf.ethz.ch"", ""daniil.dmitriev@epfl.ch"", ""martin.jaggi@epfl.ch""]","[""Tao Lin"", ""Sebastian U. Stich"", ""Luis Barba"", ""Daniil Dmitriev"", ""Martin Jaggi""]","[""network pruning"", ""dynamic reparameterization"", ""model compression""]","Deep neural networks often have millions of parameters. This can hinder their deployment to low-end devices, not only due to high memory requirements but also because of increased latency at inference. We propose a novel model compression method that generates a sparse trained model without additional overhead: by allowing (i) dynamic allocation of the sparsity pattern and (ii) incorporating feedback signal to reactivate prematurely pruned weights we obtain a performant sparse model in one single training pass (retraining is not needed, but can further improve the performance). 
We evaluate the method on CIFAR-10 and ImageNet, and show that the obtained sparse models can reach the state-of-the-art performance of dense models and further that their performance surpasses all previously proposed pruning schemes (that come without feedback mechanisms).",/pdf/7f93209aee403ee4d65eb3d568866a276e5e99db.pdf,ICLR,2020, +0jPp4dKp3PL,#NAME?,1601310000000.0,1614990000000.0,1102,Integrating linguistic knowledge into DNNs: Application to online grooming detection,"[""~Jay_Morgan1"", ""~Adeline_Paiement1"", ""n.lorenzo-dus@swansea.ac.uk"", ""a.l.kinzel@swansea.ac.uk"", ""mdc@infogrep.it""]","[""Jay Morgan"", ""Adeline Paiement"", ""Nuria Lorenzo-Dus"", ""Anina Kinzel"", ""Matteo Di Cristofaro""]","[""Machine Learning"", ""Corpus Linguistics""]","Online grooming (OG) of children is a pervasive issue in an increasingly interconnected world. We explore various complementary methods to incorporate Corpus Linguistics (CL) knowledge into accurate and interpretable Deep Learning (DL) models. They provide an implicit text normalisation that adapts embedding spaces to the groomers' usage of language, and they focus the DNN's attention onto the expressions of OG strategies. We apply these integration to two architecture types and improve on the state-of-the-art on a new OG corpus.",/pdf/3a3947f27fb91e53eb240a6b8cf2588f0afad2f0.pdf,ICLR,2021,Incorporating Corpus Linguistic knowledge in Deep Learning models to create accurate and interpretable models. 
+9p2ekP904Rs,T7RIpCcP3qW,1601310000000.0,1616400000000.0,3652,Representation Learning via Invariant Causal Mechanisms,"[""~Jovana_Mitrovic1"", ""~Brian_McWilliams2"", ""~Jacob_C_Walker1"", ""~Lars_Holger_Buesing1"", ""~Charles_Blundell1""]","[""Jovana Mitrovic"", ""Brian McWilliams"", ""Jacob C Walker"", ""Lars Holger Buesing"", ""Charles Blundell""]","[""Representation Learning"", ""Self-supervised Learning"", ""Contrastive Methods"", ""Causality""]","Self-supervised learning has emerged as a strategy to reduce the reliance on costly supervised signal by pretraining representations only using unlabeled data. These methods combine heuristic proxy classification tasks with data augmentations and have achieved significant success, but our theoretical understanding of this success remains limited. In this paper we analyze self-supervised representation learning using a causal framework. We show how data augmentations can be more effectively utilized through explicit invariance constraints on the proxy classifiers employed during pretraining. Based on this, we propose a novel self-supervised objective, Representation Learning via Invariant Causal Mechanisms (ReLIC), that enforces invariant prediction of proxy targets across augmentations through an invariance regularizer which yields improved generalization guarantees. Further, using causality we generalize contrastive learning, a particular kind of self-supervised method, and provide an alternative theoretical explanation for the success of these methods. 
Empirically, ReLIC significantly outperforms competing methods in terms of robustness and out-of-distribution generalization on ImageNet, while also significantly outperforming these methods on Atari achieving above human-level performance on 51 out of 57 games.",/pdf/34eb2506b5a0b489bced58ab4bb038ff7356ade7.pdf,ICLR,2021,We propose a new self-supervised objective with an explicit invariance regularizer and provide an alternative explanation for the success of contrastive learning using causality; we outperform competing methods on ImageNet and Atari. +r1l9Nj09YQ,B1grA7jdKX,1538090000000.0,1545360000000.0,24,Towards Language Agnostic Universal Representations,"[""araghaja@microsoft.com"", ""xiaso@microsoft.com"", ""satiwary@microsoft.com""]","[""Armen Aghajanyan"", ""Xia Song"", ""Saurabh Tiwary""]","[""universal representations"", ""language agnostic representations"", ""NLP"", ""GAN""]","When a bilingual student learns to solve word problems in math, we expect the student to be able to solve these problem in both languages the student is fluent in, even if the math lessons were only taught in one language. However, current representations in machine learning are language dependent. In this work, we present a method to decouple the language from the problem by learning language agnostic representations and therefore allowing training a model in one language and applying to a different one in a zero shot fashion. We learn these representations by taking inspiration from linguistics, specifically the Universal Grammar hypothesis and learn universal latent representations that are language agnostic (Chomsky, 2014; Montague, 1970). 
We demonstrate the capabilities of these representations by showing that the models trained on a single language using language agnostic representations achieve very similar accuracies in other languages.",/pdf/86a4d6ef164d35a6b3d2174d822e4a8745bc0d49.pdf,ICLR,2019,"By taking inspiration from linguistics, specifically the Universal Grammar hypothesis, we learn language agnostic universal representations which we can utilize to do zero-shot learning across languages." +BJ4BVhRcYX,ryxWWNRctm,1538090000000.0,1545360000000.0,1450,INTERPRETABLE CONVOLUTIONAL FILTER PRUNING,"[""zqin@gmu.edu"", ""fyu2@gmu.edu"", ""chliu@clarkson.edu"", ""xchen26@gmu.edu""]","[""Zhuwei Qin"", ""Fuxun Yu"", ""Chenchen Liu"", ""Xiang Chen""]",[],"The sophisticated structure of Convolutional Neural Network (CNN) allows for +outstanding performance, but at the cost of intensive computation. As significant +redundancies inevitably present in such a structure, many works have been proposed +to prune the convolutional filters for computation cost reduction. Although +extremely effective, most works are based only on quantitative characteristics of +the convolutional filters, and highly overlook the qualitative interpretation of individual +filter’s specific functionality. In this work, we interpreted the functionality +and redundancy of the convolutional filters from different perspectives, and proposed +a functionality-oriented filter pruning method. With extensive experiment +results, we proved the convolutional filters’ qualitative significance regardless of +magnitude, demonstrated significant neural network redundancy due to repetitive +filter functions, and analyzed the filter functionality defection under inappropriate +retraining process. 
Such an interpretable pruning approach not only offers outstanding +computation cost optimization over previous filter pruning methods, but +also interprets filter pruning process.",/pdf/3365464c9f60f06eb097a49f5a0d9eaea5c969f1.pdf,ICLR,2019, +BJeTCAEtDB,H1eV3q5OPr,1569440000000.0,1577170000000.0,1449,Feature Map Transform Coding for Energy-Efficient CNN Inference,"[""brian.chmiel@intel.com"", ""chaimbaskin@cs.technion.ac.il"", ""ron.banner@intel.com"", ""evgeniizh@campus.technion.ac.il"", ""yevgeny_ye@campus.technion.ac.il"", ""alex.k@cs.technion.ac.il"", ""bron@cs.technion.ac.il"", ""avi.mendelson@cs.technion.ac.il""]","[""Brian Chmiel"", ""Chaim Baskin"", ""Ron Banner"", ""Evgenii Zheltonozhskii"", ""Yevgeny Yermolin"", ""Alex Karbachevsky"", ""Alex M. Bronstein"", ""Avi Mendelson""]","[""compression"", ""efficient inference"", ""quantization"", ""memory bandwidth"", ""entropy""]"," Convolutional neural networks (CNNs) achieve state-of-the-art accuracy in a variety of tasks in computer vision and beyond. One of the major obstacles hindering the ubiquitous use of CNNs for inference on low-power edge devices is their high computational complexity and memory bandwidth requirements. The latter often dominates the energy footprint on modern hardware. In this paper, we introduce a lossy transform coding approach, inspired by image and video compression, designed to reduce the memory bandwidth due to the storage of intermediate activation calculation results. Our method does not require fine-tuning the network weights and halves the data transfer volumes to the main memory by compressing feature maps, which are highly correlated, with variable length coding. Our method outperform previous approach in term of the number of bits per value with minor accuracy degradation on ResNet-34 and MobileNetV2. 
We analyze the performance of our approach on a variety of CNN architectures and demonstrate that FPGA implementation of ResNet-18 with our approach results in a reduction of around 40% in the memory energy footprint, compared to quantized network, with negligible impact on accuracy. When allowing accuracy degradation of up to 2%, the reduction of 60% is achieved. A reference implementation accompanies the paper.",/pdf/edd4a25ea2685086faed15224fff4eaf95fcd3ae.pdf,ICLR,2020,Using PCA as decorellation transformation on activations to reduce memory bandwidth and energy footprint of NN accelerators +BJlLdhNFPr,H1elP9RVIS,1569440000000.0,1577170000000.0,41,Explaining A Black-box By Using A Deep Variational Information Bottleneck Approach,"[""seojinb@cs.cmu.edu""]","[""Seojin Bang"", ""Pengtao Xie"", ""Heewook Lee"", ""Wei Wu"", ""Eric Xing""]","[""interpretable machine learning"", ""information bottleneck principle"", ""black-box""]","Interpretable machine learning has gained much attention recently. Briefness and comprehensiveness are necessary in order to provide a large amount of information concisely when explaining a black-box decision system. However, existing interpretable machine learning methods fail to consider briefness and comprehensiveness simultaneously, leading to redundant explanations. We propose the variational information bottleneck for interpretation, VIBI, a system-agnostic interpretable method that provides a brief but comprehensive explanation. VIBI adopts an information theoretic principle, information bottleneck principle, as a criterion for finding such explanations. For each instance, VIBI selects key features that are maximally compressed about an input (briefness), and informative about a decision made by a black-box system on that input (comprehensive). 
We evaluate VIBI on three datasets and compare with state-of-the-art interpretable machine learning methods in terms of both interpretability and fidelity evaluated by human and quantitative metrics.",/pdf/049834d045f32f1e395037ebdbffc68e270af5d1.pdf,ICLR,2020, +-757TnNDwIn,_OJKtuPfaGo,1601310000000.0,1614990000000.0,987,Generative Adversarial Neural Architecture Search with Importance Sampling,"[""~SEYED_SAEED_CHANGIZ_REZAEI1"", ""~Fred_X._Han1"", ""dniu@ualberta.ca"", ""msalameh@ualberta.ca"", ""~Keith_G_Mills1"", ""jui.shangling@huawei.com""]","[""SEYED SAEED CHANGIZ REZAEI"", ""Fred X. Han"", ""Di Niu"", ""Mohammad Salameh"", ""Keith G Mills"", ""Shangling Jui""]","[""Nueral Architecture Search"", ""Deep Learning"", ""Generative Adversarial Network"", ""Graph Neural Network"", ""Computer Vision""]","Despite the empirical success of neural architecture search (NAS) in deep learning applications, the optimality, reproducibility and cost of NAS schemes remain hard to assess. The variation in search spaces adopted has further affected a fair comparison between search strategies. In this paper, we focus on search strategies in NAS and propose Generative Adversarial NAS (GA-NAS), promoting stable and reproducible neural architecture search. GA-NAS is theoretically inspired by importance sampling for rare event simulation, and iteratively refits a generator to previously discovered top architectures, thus increasingly focusing on important parts of the search space. We propose an efficient adversarial learning approach in GA-NAS, where the generator is not trained based on a large number of observations on architecture performance, but based on the relative prediction made by a discriminator, thus significantly reducing the number of evaluations required. +Extensive experiments show that GA-NAS beats the best published results under several cases on the public NAS benchmarks including NAS-Bench-101, NAS-Bench-201, and NAS-Bench-301. 
We further show that GA-NAS can handle ad-hoc search constraints and search spaces. GA-NAS can find new architectures that enhance EfficientNet and ProxylessNAS in terms of ImageNet Top-1 accuracy and/or the number of parameters by searching in their original search spaces.",/pdf/5cbfdc3a0292f8b22c3c49b07442c1ea4116651f.pdf,ICLR,2021,"We propose Generative Adversarial NAS (GA-NAS), as a search strategy for NAS problems, based on a generative adversarial learning framework and importance sampling for rare event simulation." +SJgNkpVFPr,BJgt8QLBvS,1569440000000.0,1577170000000.0,298,VILD: Variational Imitation Learning with Diverse-quality Demonstrations,"[""voot.tangkaratt@riken.jp"", ""bo.han@riken.jp"", ""emtiyaz.khan@riken.jp"", ""sugi@k.u-tokyo.ac.jp""]","[""Voot Tangkaratt"", ""Bo Han"", ""Mohammad Emtiyaz Khan"", ""Masashi Sugiyama""]","[""Imitation learning"", ""inverse reinforcement learning"", ""noisy demonstrations""]","The goal of imitation learning (IL) is to learn a good policy from high-quality demonstrations. However, the quality of demonstrations in reality can be diverse, since it is easier and cheaper to collect demonstrations from a mix of experts and amateurs. IL in such situations can be challenging, especially when the level of demonstrators' expertise is unknown. We propose a new IL paradigm called Variational Imitation Learning with Diverse-quality demonstrations (VILD), where we explicitly model the level of demonstrators' expertise with a probabilistic graphical model and estimate it along with a reward function. We show that a naive estimation approach is not suitable to large state and action spaces, and fix this issue by using a variational approach that can be easily implemented using existing reinforcement learning methods. Experiments on continuous-control benchmarks demonstrate that VILD outperforms state-of-the-art methods. 
Our work enables scalable and data-efficient IL under more realistic settings than before.",/pdf/814c7b1f53be5dbc6ca7b161da63021cfd0222b2.pdf,ICLR,2020,We propose an imitation learning method to learn from diverse-quality demonstrations collected by demonstrators with different level of expertise. +EQfpYwF3-b,UlOGWf-mznZ,1601310000000.0,1615800000000.0,2517,Deep Learning meets Projective Clustering,"[""~Alaa_Maalouf1"", ""~Harry_Lang1"", ""~Daniela_Rus1"", ""~Dan_Feldman1""]","[""Alaa Maalouf"", ""Harry Lang"", ""Daniela Rus"", ""Dan Feldman""]","[""Compressing Deep Networks"", ""NLP"", ""Matrix Factorization"", ""SVD""]","A common approach for compressing Natural Language Processing (NLP) networks is to encode the embedding layer as a matrix $A\in\mathbb{R}^{n\times d}$, compute its rank-$j$ approximation $A_j$ via SVD (Singular Value Decomposition), and then factor $A_j$ into a pair of matrices that correspond to smaller fully-connected layers to replace the original embedding layer. Geometrically, the rows of $A$ represent points in $\mathbb{R}^d$, and the rows of $A_j$ represent their projections onto the $j$-dimensional subspace that minimizes the sum of squared distances (``errors'') to the points. +In practice, these rows of $A$ may be spread around $k>1$ subspaces, so factoring $A$ based on a single subspace may lead to large errors that turn into large drops in accuracy. + +Inspired by \emph{projective clustering} from computational geometry, we suggest replacing this subspace by a set of $k$ subspaces, each of dimension $j$, that minimizes the sum of squared distances over every point (row in $A$) to its \emph{closest} subspace. Based on this approach, we provide a novel architecture that replaces the original embedding layer by a set of $k$ small layers that operate in parallel and are then recombined with a single fully-connected layer. 
+ +Extensive experimental results on the GLUE benchmark yield networks that are both more accurate and smaller compared to the standard matrix factorization (SVD). For example, we further compress DistilBERT by reducing the size of the embedding layer by $40\%$ while incurring only a $0.5\%$ average drop in accuracy over all nine GLUE tasks, compared to a $2.8\%$ drop using the existing SVD approach. +On RoBERTa we achieve $43\%$ compression of the embedding layer with less than a $0.8\%$ average drop in accuracy as compared to a $3\%$ drop previously.",/pdf/b30e3cfa2920dfd21d347c92e0226bcb13aab969.pdf,ICLR,2021,We suggest a novel technique for compressing a fully connected layer (or an embedding layer). +XdprrZhBk8,h-2Ql7aDjwt,1601310000000.0,1614990000000.0,1314,On the Predictability of Pruning Across Scales,"[""~Jonathan_S_Rosenfeld1"", ""~Jonathan_Frankle1"", ""~Michael_Carbin1"", ""~Nir_Shavit1""]","[""Jonathan S Rosenfeld"", ""Jonathan Frankle"", ""Michael Carbin"", ""Nir Shavit""]","[""neural networks"", ""deep learning"", ""generalization error"", ""scaling"", ""scalability"", ""pruning""]","We show that the error of iteratively-pruned networks empirically follows a scaling law with interpretable coefficients that depend on the architecture and task. We functionally approximate the error of the pruned networks, showing that it is predictable in terms of an invariant tying width, depth, and pruning level, such that networks of vastly different sparsities are freely interchangeable. We demonstrate the accuracy of this functional approximation over scales spanning orders of magnitude in depth, width, dataset size, and sparsity. We show that the scaling law functional form holds (generalizes) for large scale data (CIFAR-10, ImageNet), architectures (ResNets, VGGs) and iterative pruning algorithms (IMP, SynFlow). 
As neural networks become ever larger and more expensive to train, our findings suggest a framework for reasoning conceptually and analytically about pruning.",/pdf/3d785326479f12ae6506b18ec8041d4b737b6f30.pdf,ICLR,2021,"We show pruning generalization error is predictable, and specify the scaling law predicting it across scales, empirically." +S1gmvyHFDS,SkeXksT_wB,1569440000000.0,1577170000000.0,1760,Provenance detection through learning transformation-resilient watermarking,"[""j.hayes@cs.ucl.ac.uk"", ""dvij@google.com"", ""yutianc@google.com"", ""sedielem@google.com"", ""pushmeet@google.com"", ""ncasagrande@google.com""]","[""Jamie Hayes"", ""Krishnamurthy Dvijotham"", ""Yutian Chen"", ""Sander Dieleman"", ""Pushmeet Kohli"", ""Norman Casagrande""]","[""watermarking"", ""provenance detection""]","Advancements in deep generative models have made it possible to synthesize images, videos and audio signals that are hard to distinguish from natural signals, creating opportunities for potential abuse of these capabilities. This motivates the problem of tracking the provenance of signals, i.e., being able to determine the original source of a signal. Watermarking the signal at the time of signal creation is a potential solution, but current techniques are brittle and watermark detection mechanisms can easily be bypassed by doing some post-processing (cropping images, shifting pitch in the audio etc.). In this paper, we introduce ReSWAT (Resilient Signal Watermarking via Adversarial Training), a framework for learning transformation-resilient watermark detectors that are able to detect a watermark even after a signal has been through several post-processing transformations. Our detection method can be applied to domains with continuous data representations such as images, videos or sound signals. 
Experiments on watermarking image and audio signals show that our method can reliably detect the provenance of a synthetic signal, even if the signal has been through several post-processing transformations, and improve upon related work in this setting. Furthermore, we show that for specific kinds of transformations (perturbations bounded in the $\ell_2$ norm), we can even get formal guarantees on the ability of our model to detect the watermark. We provide qualitative examples of watermarked image and audio samples in the anonymous code submission link.",/pdf/8da5d919cb4e9e00a81f1bd4b375742021d6c27c.pdf,ICLR,2020,Develop a method to detect the provenance of signals that have undergone adversarial transformations. +SJ60SbW0b,S1pASbbA-,1509130000000.0,1518730000000.0,676,Modeling Latent Attention Within Neural Networks,"[""crgrimm@umich.edu"", ""dilip_arumugam@brown.edu"", ""siddharth_karamcheti@brown.edu"", ""david_abel@brown.edu"", ""lsw@brown.edu"", ""mlittman@cs.brown.edu""]","[""Christopher Grimm"", ""Dilip Arumugam"", ""Siddharth Karamcheti"", ""David Abel"", ""Lawson L.S. Wong"", ""Michael L. Littman""]","[""deep learning"", ""neural network"", ""attention"", ""attention mechanism"", ""interpretability"", ""visualization""]","Deep neural networks are able to solve tasks across a variety of domains and modalities of data. Despite many empirical successes, we lack the ability to clearly understand and interpret the learned mechanisms that contribute to such effective behaviors and more critically, failure modes. In this work, we present a general method for visualizing an arbitrary neural network's inner mechanisms and their power and limitations. Our dataset-centric method produces visualizations of how a trained network attends to components of its inputs. The computed ""attention masks"" support improved interpretability by highlighting which input attributes are critical in determining output. 
We demonstrate the effectiveness of our framework on a variety of deep neural network architectures in domains from computer vision and natural language processing. The primary contribution of our approach is an interpretable visualization of attention that provides unique insights into the network's underlying decision-making process irrespective of the data modality.",/pdf/7f1cd7921f4c93578b7e0a50a2105858b328300b.pdf,ICLR,2018,We develop a technique to visualize attention mechanisms in arbitrary neural networks. +B1e5ef-C-,BkTFeGbC-,1509130000000.0,1519450000000.0,772,"A Compressed Sensing View of Unsupervised Text Embeddings, Bag-of-n-Grams, and LSTMs","[""arora@cs.princeton.edu"", ""mkhodak@princeton.edu"", ""nsaunshi@cs.princeton.edu"", ""kiran.vodrahalli@columbia.edu""]","[""Sanjeev Arora"", ""Mikhail Khodak"", ""Nikunj Saunshi"", ""Kiran Vodrahalli""]","[""theory"", ""LSTM"", ""unsupervised learning"", ""word embeddings"", ""compressed sensing"", ""sparse recovery"", ""document representation"", ""text classification""]","Low-dimensional vector embeddings, computed using LSTMs or simpler techniques, are a popular approach for capturing the “meaning” of text and a form of unsupervised learning useful for downstream tasks. However, their power is not theoretically understood. The current paper derives formal understanding by looking at the subcase of linear embedding schemes. Using the theory of compressed sensing we show that representations combining the constituent word vectors are essentially information-preserving linear measurements of Bag-of-n-Grams (BonG) representations of text. This leads to a new theoretical result about LSTMs: low-dimensional embeddings derived from a low-memory LSTM are provably at least as powerful on classification tasks, up to small error, as a linear classifier over BonG vectors, a result that extensive empirical work has thus far been unable to show. 
Our experiments support these theoretical findings and establish strong, simple, and unsupervised baselines on standard benchmarks that in some cases are state of the art among word-level methods. We also show a surprising new property of embeddings such as GloVe and word2vec: they form a good sensing matrix for text that is more efficient than random matrices, the standard sparse recovery tool, which may explain why they lead to better representations in practice.",/pdf/2a40a72cc75752b216ca3ff040b19603ce0c0dbe.pdf,ICLR,2018,We use the theory of compressed sensing to prove that LSTMs can do at least as well on linear text classification as Bag-of-n-Grams. +H1_EDpogx,,1478380000000.0,1478400000000.0,604,Near-Data Processing for Machine Learning,"[""genesis1104@snu.ac.kr"", ""lees231@dsl.snu.ac.kr"", ""godqhr825@snu.ac.kr"", ""pss015@snu.ac.kr"", ""hokiespa@snu.ac.kr"", ""eychung@yonsei.ac.kr"", ""sryoon@snu.ac.kr""]","[""Hyeokjun Choe"", ""Seil Lee"", ""Hyunha Nam"", ""Seongsik Park"", ""Seijoon Kim"", ""Eui-Young Chung"", ""Sungroh Yoon""]",[],"In computer architecture, near-data processing (NDP) refers to augmenting the memory or the storage with processing power so that it can process the data stored therein. By offloading the computational burden of CPU and saving the need for transferring raw data in its entirety, NDP exhibits a great potential for acceleration and power reduction. Despite this potential, specific research activities on NDP have witnessed only limited success until recently, often owing to performance mismatches between logic and memory process technologies that put a limit on the processing capability of memory. Recently, there have been two major changes in the game, igniting the resurgence of NDP with renewed interest. The first is the success of machine learning (ML), which often demands a great deal of computation for training, requiring frequent transfers of big data. 
The second is the advent of NAND flash-based solid-state drives (SSDs) containing multicore processors that can accommodate extra computation for data processing. Sparked by these application needs and technological support, we evaluate the potential of NDP for ML using a new SSD platform that allows us to simulate in-storage processing (ISP) of ML workloads. Our platform (named ISP-ML) is a full-fledged simulator of a realistic multi-channel SSD that can execute various ML algorithms using the data stored in the SSD. For thorough performance analysis and in-depth comparison with alternatives, we focus on a specific algorithm: stochastic gradient decent (SGD), which is the de facto standard for training differentiable learning machines including deep neural networks. We implement and compare three variants of SGD (synchronous, Downpour, and elastic averaging) using ISP-ML, exploiting the multiple NAND channels for parallelizing SGD. In addition, we compare the performance of ISP and that of conventional in-host processing, revealing the advantages of ISP. Based on the advantages and limitations identified through our experiments, we further discuss directions for future research on ISP for accelerating ML.",/pdf/f8b0436ea7370779edd703b7d7d35a9e34752fa9.pdf,ICLR,2017, +SyjjD1WRb,SyssDy-CZ,1509120000000.0,1518730000000.0,484,Evolutionary Expectation Maximization for Generative Models with Binary Latents,"[""enrico.guiraud@cern.ch"", ""jakob.heinrich.drefs@uni-oldenburg.de"", ""joerg.luecke@uni-oldenburg.de""]","[""Enrico Guiraud"", ""Jakob Drefs"", ""Joerg Luecke""]","[""unsupervised"", ""learning"", ""evolutionary"", ""sparse"", ""coding"", ""noisyOR"", ""BSC"", ""EM"", ""expectation-maximization"", ""variational EM"", ""optimization""]","We establish a theoretical link between evolutionary algorithms and variational parameter optimization of probabilistic generative models with binary hidden variables. 
+While the novel approach is independent of the actual generative model, here we use two such models to investigate its applicability and scalability: a noisy-OR Bayes Net (as a standard example of binary data) and Binary Sparse Coding (as a model for continuous data). + +Learning of probabilistic generative models is first formulated as approximate maximum likelihood optimization using variational expectation maximization (EM). +We choose truncated posteriors as variational distributions in which discrete latent states serve as variational parameters. In the variational E-step, +the latent states are then +optimized according to a tractable free-energy objective. Given a data point, we can show that evolutionary algorithms can be used for the variational optimization loop by (A)~considering the bit-vectors of the latent states as genomes of individuals, and by (B)~defining the fitness of the +individuals as the (log) joint probabilities given by the used generative model. + +As a proof of concept, we apply the novel evolutionary EM approach to the optimization of the parameters of noisy-OR Bayes nets and binary sparse coding on artificial and real data (natural image patches). Using point mutations and single-point cross-over for the evolutionary algorithm, we find that scalable variational EM algorithms are obtained which efficiently improve the data likelihood. 
In general we believe that, with the link established here, standard as well as recent results in the field of evolutionary optimization can be leveraged to address the difficult problem of parameter optimization in generative models.",/pdf/9a0e69b40eaaf268ec6baee744a45a4245948f45.pdf,ICLR,2018,We present Evolutionary EM as a novel algorithm for unsupervised training of generative models with binary latent variables that intimately connects variational EM with evolutionary optimization +HylthC4twr,BklF75Y_DS,1569440000000.0,1577170000000.0,1364,Frequency Analysis for Graph Convolution Network,"[""hoang.nguyen.rh@riken.jp"", ""takanori.maehara@riken.jp""]","[""Hoang NT"", ""Takanori Maehara""]","[""graph signal processing"", ""frequency analysis"", ""graph convolution neural network"", ""simplified convolution network"", ""semi-supervised vertex classification""]","In this work, we develop quantitative results on the learnability of a two-layer Graph Convolutional Network (GCN). Instead of analyzing GCN under some classes of functions, our approach provides a quantitative gap between a two-layer GCN and a two-layer MLP model. Our analysis is based on the graph signal processing (GSP) approach, which can provide much more useful insights than the message-passing computational model. Interestingly, based on our analysis, we have been able to empirically demonstrate a few cases in which GCN and other state-of-the-art models cannot learn even when the true vertex features are extremely low-dimensional. To demonstrate our theoretical findings and propose a solution to the aforementioned adversarial cases, we build a proof-of-concept graph neural network model with stacked filters named Graph Filters Neural Network (gfNN). +",/pdf/7b77beae2b58e112aa72ed0dc311abc693ea9a51.pdf,ICLR,2020,"We study the filtering effect of GCN and SGC on benchmark datasets, find that all datasets are low-frequency and state-of-the-art models do not work in high-frequency settings." 
+SJxyCRVKvB,HJewKS9dwr,1569440000000.0,1577170000000.0,1415,Granger Causal Structure Reconstruction from Heterogeneous Multivariate Time Series,"[""yfchu@bupt.edu.cn"", ""daemon.wxw@alibaba-inc.com"", ""cyfeng@bupt.edu.cn"", ""jason.mjx@alibaba-inc.com"", ""jingren.zhou@alibaba-inc.com"", ""yang.yhx@alibaba-inc.com""]","[""Yunfei Chu"", ""Xiaowei Wang"", ""Chunyan Feng"", ""Jianxin Ma"", ""Jingren Zhou"", ""Hongxia Yang""]","[""causal inference"", ""Granger causality"", ""time series"", ""inductive"", ""LSTM"", ""attention""]","Granger causal structure reconstruction is an emerging topic that can uncover causal relationship behind multivariate time series data. In many real-world systems, it is common to encounter a large amount of multivariate time series data collected from heterogeneous individuals with sharing commonalities, however there are ongoing concerns regarding its applicability in such large scale complex scenarios, presenting both challenges and opportunities for Granger causal reconstruction. To bridge this gap, we propose a Granger cAusal StructurE Reconstruction (GASER) framework for inductive Granger causality learning and common causal structure detection on heterogeneous multivariate time series. In particular, we address the problem through a novel attention mechanism, called prototypical Granger causal attention. Extensive experiments, as well as an online A/B test on an E-commercial advertising platform, demonstrate the superior performances of GASER.",/pdf/9ab8bb083f93c3f36dabfb02dee8fa835987265d.pdf,ICLR,2020,We propose a network architecture that inductively reconstructs Granger causality via a prototypical Granger causal attention mechanism. 
+1toB0Fo9CZy,0GDnQ1aobQl,1601310000000.0,1614990000000.0,415,Neural Architecture Search of SPD Manifold Networks,"[""~Rhea_Sanjay_Sukthanker1"", ""~Zhiwu_Huang1"", ""~Suryansh_Kumar1"", ""~Erik_Goron1"", ""~Yan_Wu4"", ""~Luc_Van_Gool1""]","[""Rhea Sanjay Sukthanker"", ""Zhiwu Huang"", ""Suryansh Kumar"", ""Erik Goron"", ""Yan Wu"", ""Luc Van Gool""]","[""Neural Architecture Search"", ""AutoML""]","In this paper, we propose a new neural architecture search (NAS) problem of Symmetric Positive Definite (SPD) manifold networks. Unlike the conventional NAS problem, our problem requires to search for a unique computational cell called the SPD cell. This SPD cell serves as a basic building block of SPD neural architectures. An efficient solution to our problem is important to minimize the extraneous manual effort in the SPD neural architecture design. To accomplish this goal, we first introduce a geometrically rich and diverse SPD neural architecture search space for an efficient SPD cell design. Further, we model our new NAS problem using the supernet strategy, which models the architecture search problem as a one-shot training process of a single supernet. Based on the supernet modeling, we exploit a differentiable NAS algorithm on our relaxed continuous search space for SPD neural architecture search. Statistical evaluation of our method on drone, action, and emotion recognition tasks mostly provides better results than the state-of-the-art SPD networks and NAS algorithms. 
Empirical results show that our algorithm excels in discovering better SPD network design and providing models that are more than 3 times lighter than searched by state-of-the-art NAS algorithms.",/pdf/ebc7331fdf5a246e6a77448ed0c2e26040a068aa.pdf,ICLR,2021, +Byl3K2VtwB,rylWSXDa8H,1569440000000.0,1577170000000.0,94,Unsupervised Learning of Node Embeddings by Detecting Communities,"[""thang.duong@epfl.ch"", ""dungmin97@gmail.com"", ""giangpna98@gmail.com"", ""thanhcls1316@gmail.com"", ""h.yin1@uq.edu.au"", ""matthias.weidlich@hu-berlin.de"", ""quocviethung1@gmail.com"", ""karl.aberer@epfl.ch""]","[""Chi Thang Duong"", ""Dung Hoang"", ""Truong Giang Le Ba"", ""Thanh Le Cong"", ""Hongzhi Yin"", ""Matthias Weidlich"", ""Quoc Viet Hung Nguyen"", ""Karl Aberer""]","[""Unsupervised Learning"", ""Graph Embedding"", ""Community Detection"", ""Mincut"", ""Normalized cut"", ""Deep Learning""]","We present Deep MinCut (DMC), an unsupervised approach to learn node embeddings for graph-structured data. It derives node representations based on their membership in communities. As such, the embeddings directly provide interesting insights into the graph structure, so that the separate node clustering step of existing methods is no longer needed. DMC learns both, node embeddings and communities, simultaneously by minimizing the mincut loss, which captures the number of connections between communities. Striving for high scalability, we also propose a training process for DMC based on minibatches. 
We provide empirical evidence that the communities learned by DMC are meaningful and that the node embeddings are competitive in different node classification benchmarks.",/pdf/0d016dc74c3f07ca2df564481b337ea353d6265f.pdf,ICLR,2020,"A neural network approach for unsupervised learning of node embeddings of a graph, while at the same time learning structural characteristics in terms of communities of nodes" +ni_nys-C9D6,DVK6Nelc5Q,1601310000000.0,1614990000000.0,1047,Differentiate Everything with a Reversible Domain-Specific Language,"[""~JinGuo_Liu1"", ""thaut@logic.cs.tsukuba.ac.jp""]","[""JinGuo Liu"", ""Taine Zhao""]","[""Reversible computing"", ""automatic differentiation"", ""Julia""]","Reverse-mode automatic differentiation (AD) suffers from the issue of having too much space overhead to trace back intermediate computational states for backpropagation. +The traditional method to trace back states is called checkpointing that stores intermediate states into a global stack and restore state through either stack pop or re-computing. +The overhead of stack manipulations and re-computing makes the general purposed (or not tensor-based) AD engines unable to meet many industrial needs. +Instead of checkpointing, we propose to use reverse computing to trace back states by designing and implementing a reversible programming eDSL, where a program can be executed bi-directionally without implicit stack operations. The absence of implicit stack operations makes the program compatible with existing compiler features, including utilizing existing optimization passes and compiling the code as GPU kernels. +We implement AD for sparse matrix operations and some machine learning applications to show that our framework has state-of-the-art performance.",/pdf/8641e76bb2a8d2e7fcdd3376f1a332094cbc48fc.pdf,ICLR,2021,Design a reversible eDSL in Julia with native high performance AD support. 
+lXoWPoi_40,KWtVYupwY7R,1601310000000.0,1614990000000.0,3157,Wide-minima Density Hypothesis and the Explore-Exploit Learning Rate Schedule,"[""~Nikhil_Iyer1"", ""thejasvenkatesh97@gmail.com"", ""~Nipun_Kwatra1"", ""~Ramachandran_Ramjee1"", ""~Muthian_Sivathanu1""]","[""Nikhil Iyer"", ""V Thejas"", ""Nipun Kwatra"", ""Ramachandran Ramjee"", ""Muthian Sivathanu""]","[""deep learning"", ""learning rate"", ""generalization""]","Several papers argue that wide minima generalize better than narrow minima. In this paper, through detailed experiments that not only corroborate the generalization properties of wide minima, we also provide empirical evidence for a new hypothesis that the density of wide minima is likely lower than the density of narrow minima. Further, motivated by this hypothesis, we design a novel explore-exploit learning rate schedule. On a variety of image and natural language datasets, compared to their original hand-tuned learning rate baselines, we show that our explore-exploit schedule can result in either up to 0.84% higher absolute accuracy using the original training budget or up to 57% reduced training time while achieving the original reported accuracy. For example, we achieve state-of-the-art (SOTA) accuracy for IWSLT'14 (DE-EN) and WMT'14 (DE-EN) datasets by just modifying the learning rate schedule of a high performing model.",/pdf/f86a4ca22696a4ab7784eb47049311ba2825e8f2.pdf,ICLR,2021,"We present a hypothesis on the density of wide and narrow minima in deep learning landscapes, which also motivates a principled explore-exploit learning rate schedule." 
+S1sRrN-CW,BysijzbCb,1509140000000.0,1518730000000.0,951,Revisiting Knowledge Base Embedding as Tensor Decomposition,"[""xptree@gmail.com"", ""haoma@microsoft.com"", ""yuxdong@microsoft.com"", ""kuansanw@microsoft.com"", ""jietang@tsinghua.edu.cn""]","[""Jiezhong Qiu"", ""Hao Ma"", ""Yuxiao Dong"", ""Kuansan Wang"", ""Jie Tang""]","[""Knowledge base embedding""]","We study the problem of knowledge base (KB) embedding, which is usually addressed through two frameworks---neural KB embedding and tensor decomposition. In this work, we theoretically analyze the neural embedding framework and subsequently connect it with tensor based embedding. Specifically, we show that in neural KB embedding the two commonly adopted optimization solutions---margin-based and negative sampling losses---are closely related to each other. We also reach the closed-form tensor that is implicitly approximated by popular neural KB approaches, revealing the underlying connection between neural and tensor based KB embedding models. Grounded in the theoretical results, we further present a tensor decomposition based framework KBTD to directly approximate the derived closed form tensor. Under this framework, the neural KB embedding models, such as NTN, TransE, Bilinear, and DISTMULT, are unified into a general tensor optimization architecture. Finally, we conduct experiments on the link prediction task in WordNet and Freebase, empirically demonstrating the effectiveness of the KBTD framework. 
+",/pdf/4e9e3d851b60e8aa75b53c344e0ed3988c5300fa.pdf,ICLR,2018, +uFHwB6YTxXz,t_lYDdmIai,1601310000000.0,1614990000000.0,3255,Distribution-Based Invariant Deep Networks for Learning Meta-Features,"[""~Gwendoline_de_Bie1"", ""~Herilalaina_Rakotoarison1"", ""~Gabriel_Peyr\u00e92"", ""~Mich\u00e8le_Sebag1""]","[""Gwendoline de Bie"", ""Herilalaina Rakotoarison"", ""Gabriel Peyr\u00e9"", ""Mich\u00e8le Sebag""]","[""invariant neural networks"", ""universal approximation"", ""meta-feature learning""]","Recent advances in deep learning from probability distributions successfully achieve classification or regression from distribution samples, thus invariant under permutation of the samples. The first contribution of the paper is to extend these neural architectures to achieve invariance under permutation of the features, too. The proposed architecture, called Dida, inherits the NN properties of universal approximation, and its robustness with respect to Lipschitz-bounded transformations of the input distribution is established. The second contribution is to empirically and comparatively demonstrate the merits of the approach on two tasks defined at the dataset level. On both tasks, Dida learns meta-features supporting the characterization of a (labelled) dataset. The first task consists of predicting whether two dataset patches are extracted from the same initial dataset. The second task consists of predicting whether the learning performance achieved by a hyper-parameter configuration under a fixed algorithm (ranging in k-NN, SVM, logistic regression and linear SGD) dominates that of another configuration, for a dataset extracted from the OpenML benchmarking suite. On both tasks, Dida outperforms the state of the art: DSS and Dataset2Vec architectures, as well as the models based on the hand-crafted meta-features of the literature. 
 ",/pdf/cdb82b75ec8870140f4784ed4f830c5b9edbe47c.pdf,ICLR,2021,"Existing distribution-based neural networks are extended to achieve invariance under permutation of the features, with theoretical guarantees of universal approximation and robustness, suitable for learning dataset meta-features." +UQz4_jo70Ci,mVuK-epH1Tu,1601310000000.0,1614990000000.0,3316,SiamCAN:Simple yet Effective Method to enhance Siamese Short-Term Tracking,"[""~Yue_Zhao10"", ""~Zhibin_Yu3""]","[""Yue Zhao"", ""Zhibin Yu""]","[""Siamese trackers"", ""cross-attention"", ""light structure"", ""anchor-free""]","Most traditional Siamese trackers regard the location of the maximum of the response map as the center of the target. However, it is difficult for these traditional methods to calculate the response value accurately when facing similar objects, deformation, background clutter and other challenges. So how to get a reliable response map is the key to improving tracking performance. Accordingly, a simple yet effective short-term tracking framework (called SiamCAN), which bridges the information flow between the search branch and the template branch, is proposed in this paper to solve the above problem. Moreover, in order to get more accurate target estimation, an anchor-free mechanism and a specialized training strategy are applied to narrow the gap between the predicted bounding box and the ground truth. The proposed method achieves state-of-the-art performance on four visual tracking benchmarks including UAV123, OTB100, VOT2018 and VOT2019, outperforming the strong baseline, SiamBAN, by 0.327 $\displaystyle \rightarrow$ 0.331 on VOT2019 and 0.631 $\displaystyle \rightarrow$ 0.638 success score, 0.833 $\displaystyle \rightarrow$ 0.850 precision score on UAV123.",/pdf/e26f1c11a33745ac2c29b61ea63bf8191976bbd7.pdf,ICLR,2021,"We propose a series of simple yet effective methods for Siamese trackers to address the challenges of similar distractors, scale variation and background clutter." 
+iAmZUo0DxC0,TrJaIOrfMte,1601310000000.0,1614210000000.0,1169,Unlearnable Examples: Making Personal Data Unexploitable,"[""~Hanxun_Huang1"", ""~Xingjun_Ma1"", ""~Sarah_Monazam_Erfani1"", ""~James_Bailey1"", ""~Yisen_Wang1""]","[""Hanxun Huang"", ""Xingjun Ma"", ""Sarah Monazam Erfani"", ""James Bailey"", ""Yisen Wang""]","[""Unlearnable Examples"", ""Data Protection"", ""Adversarial Machine Learning""]","The volume of ""free"" data on the internet has been key to the current success of deep learning. However, it also raises privacy concerns about the unauthorized exploitation of personal data for training commercial models. It is thus crucial to develop methods to prevent unauthorized data exploitation. This paper raises the question: can data be made unlearnable for deep learning models? We present a type of error-minimizing noise that can indeed make training examples unlearnable. Error-minimizing noise is intentionally generated to reduce the error of one or more of the training example(s) close to zero, which can trick the model into believing there is ""nothing"" to learn from these example(s). The noise is restricted to be imperceptible to human eyes, and thus does not affect normal data utility. We empirically verify the effectiveness of error-minimizing noise in both sample-wise and class-wise forms. We also demonstrate its flexibility under extensive experimental settings and practicability in a case study of face recognition. Our work establishes an important first step towards making personal data unexploitable to deep learning models.",/pdf/eb123b0f1c20d0c5d47b33fa7feca81748e02666.pdf,ICLR,2021,We present a type of error-minimizing noise that can make training examples unlearnable to deep learning. 
+rJleKgrKwS,HJlqqpgKvB,1569440000000.0,1583910000000.0,2423,Differentiable learning of numerical rules in knowledge graphs,"[""poweiw@cs.cmu.edu"", ""daria.stepanova@de.bosch.com"", ""csaba.domokos@de.bosch.com"", ""zkolter@cs.cmu.edu""]","[""Po-Wei Wang"", ""Daria Stepanova"", ""Csaba Domokos"", ""J. Zico Kolter""]","[""knowledge graphs"", ""rule learning"", ""differentiable neural logic""]","Rules over a knowledge graph (KG) capture interpretable patterns in data and can be used for KG cleaning and completion. Inspired by the TensorLog differentiable logic framework, which compiles rule inference into a sequence of differentiable operations, recently a method called Neural LP has been proposed for learning the parameters as well as the structure of rules. However, it is limited with respect to the treatment of numerical features like age, weight or scientific measurements. We address this limitation by extending Neural LP to learn rules with numerical values, e.g., ”People younger than 18 typically live with their parents“. We demonstrate how dynamic programming and cumulative sum operations can be exploited to ensure efficiency of such extension. 
Our novel approach allows us to extract more expressive rules with aggregates, which are of higher quality and yield more accurate predictions compared to rules learned by the state-of-the-art methods, as shown by our experiments on synthetic and real-world datasets.",/pdf/d66021fe991280447daf0d6a7e0f577b9a11b7ed.pdf,ICLR,2020,We present an efficient approach to integrating numerical comparisons into differentiable rule learning in knowledge graphs +zmgJIjyWSOw,YUOqFUKSApB,1601310000000.0,1614990000000.0,2873,UserBERT: Self-supervised User Representation Learning,"[""~Tianyu_Li2"", ""ali.cevahir@rakuten.com"", ""derek.cho@rakuten.com"", ""hao.gong@rakuten.com"", ""duykhuong.nguyen@rakuten.com"", ""bjorn.stenger@rakuten.com""]","[""Tianyu Li"", ""Ali Cevahir"", ""Derek Cho"", ""Hao Gong"", ""DuyKhuong Nguyen"", ""Bjorn Stenger""]","[""user representations"", ""representation learning"", ""self-supervised learning"", ""pretraining"", ""transfer learning""]","This paper extends the BERT model to user data for pretraining user representations in a self-supervised way. By viewing actions (e.g., purchases and clicks) in behavior sequences (i.e., usage history) in an analogous way to words in sentences, we propose methods for the tokenization, the generation of input representation vectors and a novel pretext task to enable the pretraining model to learn from its own input, omitting the burden of collecting additional data. Further, our model adopts a unified structure to simultaneously learn from long-term and short-term user behavior as well as user profiles. Extensive experiments demonstrate that the learned representations result in significant improvements when transferred to three different real-world tasks, particularly in comparison with task-specific modeling and representations obtained from multi-task learning. 
",/pdf/feb947372309661cb7c05de634f971edcaefa2e5.pdf,ICLR,2021,On pretraining user representations via self-supervision +TwkEGci1Y-,JsC14UhQEX,1601310000000.0,1614990000000.0,3649,On the Role of Pre-training for Meta Few-Shot Learning,"[""~Chia-You_Chen1"", ""~Hsuan-Tien_Lin1"", ""~Gang_Niu1"", ""~Masashi_Sugiyama1""]","[""Chia-You Chen"", ""Hsuan-Tien Lin"", ""Gang Niu"", ""Masashi Sugiyama""]","[""Meta-Learning"", ""Episodic Training"", ""Pre-training"", ""Disentanglement""]","Few-shot learning aims to classify unknown classes of examples with a few new examples per class. There are two key routes for few-shot learning. One is to (pre-)train a classifier with examples from known classes, and then transfer the pre-trained classifier to unknown classes using the new examples. The other, called meta few-shot learning, is to couple pre-training with episodic training, which contains episodes of few-shot learning tasks simulated from the known classes. Pre-training is known to play a crucial role for the transfer route, but the role of pre-training for the episodic route is less clear. In this work, we study the role of pre-training for the episodic route. We find that pre-training serves a major role of disentangling representations of known classes, which makes the resulting learning tasks easier for episodic training. The finding allows us to shift the huge simulation burden of episodic learning to a simpler pre-training stage. We justify such a benefit of shift by designing a new disentanglement-based pre-training model, which helps episodic learning achieve competitive performance more efficiently. 
",/pdf/26e329ed90409c1790e110996365121fc828f220.pdf,ICLR,2021, +SJgw_sRqFQ,Hkgkn2KcKm,1538090000000.0,1550900000000.0,361,The Unusual Effectiveness of Averaging in GAN Training,"[""yasin001@e.ntu.edu.sg"", ""foocs@i2r.a-star.edu.sg"", ""stefan.winkler@adsc-create.edu.sg"", ""ekhyap@ntu.edu.sg"", ""georgios@sutd.edu.sg"", ""vijay@i2r.a-star.edu.sg""]","[""Yasin Yaz{\\i}c{\\i}"", ""Chuan-Sheng Foo"", ""Stefan Winkler"", ""Kim-Hui Yap"", ""Georgios Piliouras"", ""Vijay Chandrasekhar""]","[""Generative Adversarial Networks (GANs)"", ""Moving Average"", ""Exponential Moving Average"", ""Convergence"", ""Limit Cycles""]","We examine two different techniques for parameter averaging in GAN training. Moving Average (MA) computes the time-average of parameters, whereas Exponential Moving Average (EMA) computes an exponentially discounted sum. Whilst MA is known to lead to convergence in bilinear settings, we provide the -- to our knowledge -- first theoretical arguments in support of EMA. We show that EMA converges to limit cycles around the equilibrium with vanishing amplitude as the discount parameter approaches one for simple bilinear games and also enhances the stability of general GAN training. We establish experimentally that both techniques are strikingly effective in the non-convex-concave GAN setting as well. Both improve inception and FID scores on different architectures and for different GAN objectives. We provide comprehensive experimental results across a range of datasets -- mixture of Gaussians, CIFAR-10, STL-10, CelebA and ImageNet -- to demonstrate its effectiveness. 
We achieve state-of-the-art results on CIFAR-10 and produce clean CelebA face images.\footnote{~The code is available at \url{https://github.com/yasinyazici/EMA_GAN}}",/pdf/11defe52225dd3afbb727c2dad6fd51e5e9f856f.pdf,ICLR,2019, +B1gXYR4YDH,HJeGk7udvr,1569440000000.0,1577170000000.0,1241,DSReg: Using Distant Supervision as a Regularizer,"[""yuxian_meng@shannonai.com"", ""muyu_li@shannonai.com"", ""xiaoya_li@shannonai.com"", ""wei_wu@shannonai.com"", ""wufei@zju.edu.cn"", ""jiwei_li@shannonai.com""]","[""Yuxian Meng"", ""Muyu Li"", ""Xiaoya Li"", ""Wei Wu"", ""Fei Wu"", ""Jiwei Li""]",[],"In this paper, we aim at tackling a general issue in NLP tasks where some of the negative examples are highly similar to the positive examples (i.e., hard-negative examples). We propose the distant supervision as a regularizer (DSReg) approach to tackle this issue. We convert the original task to a multi-task learning problem, in which we first utilize the idea of distant supervision to retrieve hard-negative examples. The obtained hard-negative examples are then used as a regularizer, and we jointly optimize the original target objective of distinguishing positive examples from negative examples along with the auxiliary task objective of distinguishing softened positive examples (comprising positive examples and hard-negative examples) from easy-negative examples. In the neural context, this can be done by feeding the final token representations to different output layers. Using this unbelievably simple strategy, we improve the performance of a range of different NLP tasks, including text classification, sequence labeling and reading comprehension. 
 ",/pdf/dc311d99d631764f3da837dac2d31a6841ea7181.pdf,ICLR,2020, +rJxok1BYPr,HyxXYyi_vH,1569440000000.0,1577170000000.0,1482,Black Box Recursive Translations for Molecular Optimization,"[""farhand7@gmail.com"", ""vishnu.sresht@pfizer.com"", ""stephen.ra@pfizer.com""]","[""Farhan Damani"", ""Vishnu Sresht"", ""Stephen Ra""]","[""molecules"", ""chemistry"", ""drug design"", ""generative models"", ""application"", ""translation""]","Machine learning algorithms for generating molecular structures offer a promising new approach to drug discovery. We cast molecular optimization as a translation problem, where the goal is to map an input compound to a target compound with improved biochemical properties. Remarkably, we observe that when generated molecules are iteratively fed back into the translator, molecular compound attributes improve with each step. We show that this finding is invariant to the choice of translation model, making this a ""black box"" algorithm. We call this method Black Box Recursive Translation (BBRT), a new inference method for molecular property optimization. This simple, powerful technique operates strictly on the inputs and outputs of any translation model. We obtain new state-of-the-art results for molecular property optimization tasks using our simple drop-in replacement with well-known sequence and graph-based models. Our method provides a significant boost in performance relative to its non-recursive peers with just a simple ""for"" loop. Further, BBRT is highly interpretable, allowing users to map the evolution of newly discovered compounds from known starting points. ",/pdf/c2b95a257381caa62992063791aa48b0da46f74e.pdf,ICLR,2020,We introduce a black box algorithm for repeated optimization of compounds using a translation framework. 
+BJge3TNKwH,H1xmNhkODB,1569440000000.0,1583910000000.0,769,Sliced Cramer Synaptic Consolidation for Preserving Deeply Learned Representations,"[""skolouri@hrl.com"", ""naketz@hrl.com"", ""a.soltoggio@lboro.ac.uk"", ""pkpilly@hrl.com""]","[""Soheil Kolouri"", ""Nicholas A. Ketz"", ""Andrea Soltoggio"", ""Praveen K. Pilly""]","[""selective plasticity"", ""catastrophic forgetting"", ""intransigence""]","Deep neural networks suffer from the inability to preserve the learned data representation (i.e., catastrophic forgetting) in domains where the input data distribution is non-stationary, and it changes during training. Various selective synaptic plasticity approaches have been recently proposed to preserve network parameters, which are crucial for previously learned tasks while learning new tasks. We explore such selective synaptic plasticity approaches through a unifying lens of memory replay and show the close relationship between methods like Elastic Weight Consolidation (EWC) and Memory-Aware-Synapses (MAS). We then propose a fundamentally different class of preservation methods that aim at preserving the distribution of internal neural representations for previous tasks while learning a new one. We propose the sliced Cram\'{e}r distance as a suitable choice for such preservation and evaluate our Sliced Cramer Preservation (SCP) algorithm through extensive empirical investigations on various network architectures in both supervised and unsupervised learning settings. 
We show that SCP consistently utilizes the learning capacity of the network better than online-EWC and MAS methods on various incremental learning tasks.",/pdf/94dc267f1eb2462a0e0706040c156c440b3a27ae.pdf,ICLR,2020,"""A novel framework for overcoming catastrophic forgetting by preserving the distribution of the network's output at an arbitrary layer.""" +g21u6nlbPzn,o974SqnLLK-,1601310000000.0,1613420000000.0,948,VA-RED$^2$: Video Adaptive Redundancy Reduction,"[""~Bowen_Pan2"", ""~Rameswar_Panda1"", ""~Camilo_Luciano_Fosco1"", ""~Chung-Ching_Lin2"", ""~Alex_J_Andonian1"", ""~Yue_Meng1"", ""~Kate_Saenko1"", ""~Aude_Oliva1"", ""~Rogerio_Feris1""]","[""Bowen Pan"", ""Rameswar Panda"", ""Camilo Luciano Fosco"", ""Chung-Ching Lin"", ""Alex J Andonian"", ""Yue Meng"", ""Kate Saenko"", ""Aude Oliva"", ""Rogerio Feris""]",[],"Performing inference on deep learning models for videos remains a challenge due to the large amount of computational resources required to achieve robust recognition. An inherent property of real-world videos is the high correlation of information across frames which can translate into redundancy in either temporal or spatial feature maps of the models, or both. The type of redundant features depends on the dynamics and type of events in the video: static videos have more temporal redundancy while videos focusing on objects tend to have more channel redundancy. Here we present a redundancy reduction framework, termed VA-RED$^2$, which is input-dependent. Specifically, our VA-RED$^2$ framework uses an input-dependent policy to decide how many features need to be computed for temporal and channel dimensions. To keep the capacity of the original model, after fully computing the necessary features, we reconstruct the remaining redundant features from those using cheap linear operations. We learn the adaptive policy jointly with the network weights in a differentiable way with a shared-weight mechanism, making it highly efficient. 
Extensive experiments on multiple video datasets and different visual tasks show that our framework achieves $20\% - 40\%$ reduction in computation (FLOPs) when compared to state-of-the-art methods without any performance loss. Project page: http://people.csail.mit.edu/bpan/va-red/.",/pdf/474acfaf4c205394a38367bb1c69399a31b1264e.pdf,ICLR,2021, +Bkp_y7qxe,,1478270000000.0,1484810000000.0,186,Unsupervised Deep Learning of State Representation Using Robotic Priors ,"[""timothee.lesort@ensta-paristech.fr"", ""david.filliat@ensta-paristech.fr""]","[""Timothee LESORT"", ""David FILLIAT""]","[""Deep learning"", ""Computer vision"", ""Unsupervised Learning""]","Our understanding of the world depends highly on how we represent it. Using background knowledge about its complex underlying physical rules, our brain can produce intuitive and simplified representations which it can easily use to solve problems. The approach of this paper aims to reproduce this simplification process using a neural network to produce a simple low dimensional state representation of the world from images acquired by a robot. As proposed in Jonschkowski & Brock (2015), we train the neural network in an unsupervised way, using the ""a priori"" knowledge we have about the world as loss functions called ""robotic priors"" that we implemented through a siamese network. This approach has been used to learn a one-dimensional representation of a Baxter head position from raw images. The experiment resulted in a 97.7% correlation between the learned representation and the ground truth, and shows that relevant visual features from the environment are learned.",/pdf/f96e9d307271e9f77d2a1df049bd9f0b8198987a.pdf,ICLR,2017,This paper introduces a method for training a deep neural network to learn a representation of a robot's environment state using a priori knowledge. 
+HJgBA2VYwH,rkgfFOo4PH,1569440000000.0,1588330000000.0,263,FSPool: Learning Set Representations with Featurewise Sort Pooling,"[""yz5n12@ecs.soton.ac.uk"", ""jsh2@ecs.soton.ac.uk"", ""apb@ecs.soton.ac.uk""]","[""Yan Zhang"", ""Jonathon Hare"", ""Adam Pr\u00fcgel-Bennett""]","[""set auto-encoder"", ""set encoder"", ""pooling""]","Traditional set prediction models can struggle with simple datasets due to an issue we call the responsibility problem. We introduce a pooling method for sets of feature vectors based on sorting features across elements of the set. This can be used to construct a permutation-equivariant auto-encoder that avoids this responsibility problem. On a toy dataset of polygons and a set version of MNIST, we show that such an auto-encoder produces considerably better reconstructions and representations. Replacing the pooling function in existing set encoders with FSPool improves accuracy and convergence speed on a variety of datasets.",/pdf/f07fcc4c3ba0265ce70db205212cfd3ad65f3647.pdf,ICLR,2020,Sort in encoder and undo sorting in decoder to avoid responsibility problem in set auto-encoders +Bkl7bREtDr,ryg_c-4OPB,1569440000000.0,1583910000000.0,959,AMRL: Aggregated Memory For Reinforcement Learning,"[""jacob_beck@alumni.brown.edu"", ""kamil.ciosek@microsoft.com"", ""sam.devlin@microsoft.com"", ""sebastian.tschiatschek@microsoft.com"", ""cheng.zhang@microsoft.com"", ""katja.hofmann@microsoft.com""]","[""Jacob Beck"", ""Kamil Ciosek"", ""Sam Devlin"", ""Sebastian Tschiatschek"", ""Cheng Zhang"", ""Katja Hofmann""]","[""deep learning"", ""reinforcement learning"", ""rl"", ""memory"", ""noise"", ""machine learning""]","In many partially observable scenarios, Reinforcement Learning (RL) agents must rely on long-term memory in order to learn an optimal policy. We demonstrate that using techniques from NLP and supervised learning fails at RL tasks due to stochasticity from the environment and from exploration. 
Utilizing our insights on the limitations of traditional memory methods in RL, we propose AMRL, a class of models that can learn better policies with greater sample efficiency and are resilient to noisy inputs. Specifically, our models use a standard memory module to summarize short-term context, and then aggregate all prior states from the standard model without respect to order. We show that this provides advantages both in terms of gradient decay and signal-to-noise ratio over time. Evaluating in Minecraft and maze environments that test long-term memory, we find that our model improves average return by 19% over a baseline that has the same number of parameters and by 9% over a stronger baseline that has far more parameters.",/pdf/de1a9c01e9b3fac27abfcf1fc8d89b3da4d61570.pdf,ICLR,2020,"In Deep RL, order-invariant functions can be used in conjunction with standard memory modules to improve gradient decay and resilience to noise." +PiKUvDj5jyN,a66Djf3XNR_,1601310000000.0,1614990000000.0,244,Relational Learning with Variational Bayes,"[""~Kuang-Hung_Liu1""]","[""Kuang-Hung Liu""]","[""Relational learning"", ""unsupervised learning"", ""variational inference"", ""probabilistic graphical model""]","In psychology, relational learning refers to the ability to recognize and respond to +relationship among objects irrespective of the nature of those objects. Relational +learning has long been recognized as a hallmark of human cognition and a key +question in artificial intelligence research. In this work, we propose an unsupervised +learning method for addressing the relational learning problem where we +learn the underlying relationship between a pair of data irrespective of the nature +of those data. 
The central idea of the proposed method is to encapsulate the relational +learning problem with a probabilistic graphical model in which we perform +inference to learn about data relationships and other relational processing tasks.",/pdf/e081e29af7c90758b53df2a3c9fb1512387e5408.pdf,ICLR,2021,We propose an unsupervised learning method for addressing the relational learning problem where we learn the underlying relationship between a pair of data irrespective of the nature of those data. +r1AMITFaW,ByRG8TFpW,1508660000000.0,1518730000000.0,39,Dependent Bidirectional RNN with Extended-long Short-term Memory,"[""yuanhans@usc.edu"", ""yuzhongh@usc.edu"", ""cckuo@sipi.usc.edu""]","[""Yuanhang Su"", ""Yuzhong Huang"", ""C.-C. Jay Kuo""]","[""RNN"", ""memory"", ""LSTM"", ""GRU"", ""BRNN"", ""encoder-decoder"", ""Natural language processing""]","In this work, we first conduct mathematical analysis on the memory, which is +defined as a function that maps an element in a sequence to the current output, +of three RNN cells; namely, the simple recurrent neural network (SRN), the long +short-term memory (LSTM) and the gated recurrent unit (GRU). Based on the +analysis, we propose a new design, called the extended-long short-term memory +(ELSTM), to extend the memory length of a cell. Next, we present a multi-task +RNN model that is robust to previous erroneous predictions, called the dependent +bidirectional recurrent neural network (DBRNN), for the sequence-in-sequence-out +(SISO) problem. 
Finally, the performance of the DBRNN model with the +ELSTM cell is demonstrated by experimental results.",/pdf/6491fab95f7411c6b20c5ee7035e313915d36419.pdf,ICLR,2018,A recurrent neural network cell with extended-long short-term memory and a multi-task RNN model for sequence-in-sequence-out problems +7R7fAoUygoa,pyvwxs27VA6,1601310000000.0,1615960000000.0,1938,Optimal Regularization can Mitigate Double Descent,"[""~Preetum_Nakkiran1"", ""pvenkat@g.harvard.edu"", ""~Sham_M._Kakade1"", ""~Tengyu_Ma1""]","[""Preetum Nakkiran"", ""Prayaag Venkat"", ""Sham M. Kakade"", ""Tengyu Ma""]","[""double descent"", ""generalization"", ""regularization"", ""regression"", ""monotonicity""]","Recent empirical and theoretical studies have shown that many learning algorithms -- from linear regression to neural networks -- can have test performance that is non-monotonic in quantities such as the sample size and model size. This striking phenomenon, often referred to as ""double descent"", has raised the question of whether we need to re-think our current understanding of generalization. In this work, we study whether the double-descent phenomenon can be avoided by using optimal regularization. Theoretically, we prove that for certain linear regression models with isotropic data distribution, optimally-tuned $\ell_2$ regularization achieves monotonic test performance as we grow either the sample size or the model size. +We also demonstrate empirically that optimally-tuned $\ell_2$ regularization can mitigate double descent for more general models, including neural networks. +Our results suggest that it may also be informative to study the test risk scalings of various algorithms in the context of appropriately tuned regularization.",/pdf/3424b7750a87532de0707e6ced4dd6f62ee9ca29.pdf,ICLR,2021,Optimal regularization can provably avoid double-descent in certain settings. 
+TaUJl6Kt3rW,zIrHnngqoJ,1601310000000.0,1614990000000.0,3221,SkillBERT: “Skilling” the BERT to classify skills!,"[""ambernigam@hsph.harvard.edu"", ""shikha@peoplestrong.com"", ""kuldeep@peoplestrong.com"", ""arpansaxena17may@gmail.com""]","[""Amber Nigam"", ""Shikha Tyagi"", ""Kuldeep Tyagi"", ""Arpan Saxena""]",[],"In the age of digital recruitment, job posts can attract a large number of applications, and screening them manually can become a very tedious task. These recruitment records are stored in the form of tables in our recruitment database (Electronic Recruitment Records, referred to as ERRs). We have released a de-identified ERR dataset to the public domain. We also propose a BERT-based model, SkillBERT, the embeddings of which are used as features for classifying skills present in the ERRs into groups referred to as ""competency groups"". A competency group is a group of similar skills and it is used as matching criteria (instead of matching on skills) for finding the overlap of skills between the candidates and the jobs. This proxy match takes advantage of the BERT's capability of deriving meaning from the structure of competency groups present in the skill dataset. In our experiments, the SkillBERT, which is trained from scratch on the skills present in job requisitions, is shown to be better performing than the pre-trained BERT and the Word2Vec. We have also explored K-means clustering and spectral clustering on SkillBERT embeddings to generate cluster-based features. Both algorithms provide similar performance benefits. Last, we have experimented with different machine learning algorithms like Random Forest, XGBoost, and a deep learning algorithm Bi-LSTM . We did not observe a significant performance difference among the algorithms, although XGBoost and Bi-LSTM perform slightly better than Random Forest. 
The features created using SkillBERT are most predictive in the classification task, which demonstrates that the SkillBERT is able to capture information about the skills' ontology from the data. We have made the source code and the trained models of our experiments publicly available.",/pdf/f727cad7b5b29867a465a8b52d819f1ab45a50d4.pdf,ICLR,2021,Learning semantic information from the skill-information of both candidates and jobs that could make hiring an efficient and intuitive process +Tp7kI90Htd,p4qsaIoVsSz,1601310000000.0,1615980000000.0,146,Generalization in data-driven models of primary visual cortex,"[""~Konstantin-Klemens_Lurz1"", ""mohammad.bashiri@uni-tuebingen.de"", ""konstantin-friedrich.willeke@uni-tuebingen.de"", ""akshay-kumar.jagadish@student.uni-tuebingen.de"", ""eric.wang2@bcm.edu"", ""~Edgar_Y._Walker1"", ""~Santiago_A_Cadena1"", ""taliah.muhammad@bcm.edu"", ""~Erick_Cobos1"", ""~Andreas_S._Tolias1"", ""~Alexander_S_Ecker1"", ""~Fabian_H._Sinz1""]","[""Konstantin-Klemens Lurz"", ""Mohammad Bashiri"", ""Konstantin Willeke"", ""Akshay Jagadish"", ""Eric Wang"", ""Edgar Y. Walker"", ""Santiago A Cadena"", ""Taliah Muhammad"", ""Erick Cobos"", ""Andreas S. Tolias"", ""Alexander S Ecker"", ""Fabian H. Sinz""]","[""neuroscience"", ""cognitive science"", ""multitask learning"", ""transfer learning"", ""representation learning"", ""network architecture"", ""computational biology"", ""visual perception""]","Deep neural networks (DNN) have set new standards at predicting responses of neural populations to visual input. Most such DNNs consist of a convolutional network (core) shared across all neurons which learns a representation of neural computation in visual cortex and a neuron-specific readout that linearly combines the relevant features in this representation. The goal of this paper is to test whether such a representation is indeed generally characteristic for visual cortex, i.e. 
generalizes between animals of a species, and what factors contribute to obtaining such a generalizing core. To push all non-linear computations into the core where the generalizing cortical features should be learned, we devise a novel readout that reduces the number of parameters per neuron in the readout by up to two orders of magnitude compared to the previous state-of-the-art. It does so by taking advantage of retinotopy and learns a Gaussian distribution over the neuron’s receptive field position. With this new readout we train our network on neural responses from mouse primary visual cortex (V1) and obtain a gain in performance of 7% compared to the previous state-of-the-art network. We then investigate whether the convolutional core indeed captures general cortical features by using the core in transfer learning to a different animal. When transferring a core trained on thousands of neurons from various animals and scans we exceed the performance of training directly on that animal by 12%, and outperform a commonly used VGG16 core pre-trained on imagenet by 33%. In addition, transfer learning with our data-driven core is more data-efficient than direct training, achieving the same performance with only 40% of the data. Our model with its novel readout thus sets a new state-of-the-art for neural response prediction in mouse visual cortex from natural images, generalizes between animals, and captures better characteristic cortical features than current task-driven pre-training approaches such as VGG16.",/pdf/a10fba1a4a77e56503923a67cf4e95e82d6f9b59.pdf,ICLR,2021,We introduce a novel network architecture which sets a new state of the art at predicting neural responses to visual input and successfully learns generalizing features of mouse visual cortex (V1). 
+DdGCxq9C_Gr,31l8EpDReU,1601310000000.0,1614990000000.0,1921,Dropout's Dream Land: Generalization from Learned Simulators to Reality,"[""~Zac_Wellmer1"", ""~James_Kwok1""]","[""Zac Wellmer"", ""James Kwok""]","[""Reinforcement Learning""]","A World Model is a generative model used to simulate an environment. World Models have proven capable of learning spatial and temporal representations of Reinforcement Learning environments. In some cases, a World Model offers an agent the opportunity to learn entirely inside of its own dream environment. In this work we explore improving the generalization capabilities from dream environments to reality (Dream2Real). We present a general approach to improve a controller's ability to transfer from a neural network dream environment to reality at little additional cost. These improvements are gained by drawing on inspiration from domain randomization, where the basic idea is to randomize as much of a simulator as possible without fundamentally changing the task at hand. Generally, domain randomization assumes access to a pre-built simulator with configurable parameters but oftentimes this is not available. By training the World Model using dropout, the dream environment is capable of creating a nearly infinite number of \textit{different} dream environments. Our experimental results show that Dropout's Dream Land is an effective technique to bridge the reality gap between dream environments and reality. Furthermore, we additionally perform an extensive set of ablation studies. 
",/pdf/bafb6e74c184e43eb874a44cffdcc49055d0f5ec.pdf,ICLR,2021,A simple technique to bridge the sim2real gap between learned artificial neural network simulators and reality +H1vCXOe0b,B1vC7dxA-,1509090000000.0,1518730000000.0,306,Interpreting Deep Classification Models With Bayesian Inference,"[""eleyanh@nus.edu.sg"", ""elefjia@nus.edu.sg""]","[""Hanshu Yan"", ""Jiashi Feng""]",[],"In this paper, we propose a novel approach to interpret a well-trained classification model through systematically investigating effects of its hidden units on prediction making. We search for the core hidden units responsible for predicting inputs as the class of interest under the generative Bayesian inference framework. We model such a process of unit selection as an Indian Buffet Process, and derive a simplified objective function via the MAP asymptotic technique. The induced binary optimization problem is efficiently solved with a continuous relaxation method by attaching a Switch Gate layer to the hidden layers of interest. The resulted interpreter model is thus end-to-end optimized via standard gradient back-propagation. Experiments are conducted with two popular deep convolutional classifiers, respectively well-trained on the MNIST dataset and the CI- FAR10 dataset. The results demonstrate that the proposed interpreter successfully finds the core hidden units most responsible for prediction making. The modified model, only with the selected units activated, can hold correct predictions at a high rate. Besides, this interpreter model is also able to extract the most informative pixels in the images by connecting a Switch Gate layer to the input layer. 
+",/pdf/eaca65a30731ce94393b37dbaa0fa7f2ebd226e9.pdf,ICLR,2018, +rJxgknCcK7,rkgWtPhcYX,1538090000000.0,1550790000000.0,954,FFJORD: Free-Form Continuous Dynamics for Scalable Reversible Generative Models,"[""wgrathwohl@cs.toronto.edu"", ""rtqichen@cs.toronto.edu"", ""jessebett@cs.toronto.edu"", ""ilyasu@openai.com"", ""duvenaud@cs.toronto.edu""]","[""Will Grathwohl"", ""Ricky T. Q. Chen"", ""Jesse Bettencourt"", ""Ilya Sutskever"", ""David Duvenaud""]","[""generative models"", ""density estimation"", ""approximate inference"", ""ordinary differential equations""]","A promising class of generative models maps points from a simple distribution to a complex distribution through an invertible neural network. Likelihood-based training of these models requires restricting their architectures to allow cheap computation of Jacobian determinants. Alternatively, the Jacobian trace can be used if the transformation is specified by an ordinary differential equation. In this paper, we use Hutchinson’s trace estimator to give a scalable unbiased estimate of the log-density. The result is a continuous-time invertible generative model with unbiased density estimation and one-pass sampling, while allowing unrestricted neural network architectures. We demonstrate our approach on high-dimensional density estimation, image generation, and variational inference, achieving the state-of-the-art among exact likelihood methods with efficient sampling.",/pdf/769f89fff587ac645343026f52226673a2729e32.pdf,ICLR,2019,We use continuous time dynamics to define a generative model with exact likelihoods and efficient sampling that is parameterized by unrestricted neural networks. 
+H1Nyf7W0Z,rkOoZXbCZ,1509140000000.0,1518730000000.0,1132,Alpha-divergence bridges maximum likelihood and reinforcement learning in neural sequence generation,"[""sotetsu.koyamada@gmail.com""]","[""Sotetsu Koyamada"", ""Yuta Kikuchi"", ""Atsunori Kanemura"", ""Shin-ichi Maeda"", ""Shin Ishii""]","[""neural network"", ""reinforcement learning"", ""natural language processing"", ""machine translation"", ""alpha-divergence""]","Neural sequence generation is commonly approached by using maximum-likelihood (ML) estimation or reinforcement learning (RL). However, it is known that they have their own shortcomings; ML presents training/testing discrepancy, whereas RL suffers from sample inefficiency. We point out that it is difficult to resolve all of the shortcomings simultaneously because of a tradeoff between ML and RL. In order to counteract these problems, we propose an objective function for sequence generation using α-divergence, which leads to an ML-RL integrated method that exploits better parts of ML and RL. We demonstrate that the proposed objective function generalizes ML and RL objective functions because it includes both as its special cases (ML corresponds to α → 0 and RL to α → 1). We provide a proposition stating that the difference between the RL objective function and the proposed one monotonically decreases with increasing α. Experimental results on machine translation tasks show that minimizing the proposed objective function achieves better sequence generation performance than ML-based methods.",/pdf/4122d80b6740caf9641d8bbc9dc1cf00e2259f51.pdf,ICLR,2018,Propose new objective function for neural sequence generation which integrates ML-based and RL-based objective functions. 
+H1lK_lBtvS,B1gMzalKwr,1569440000000.0,1583910000000.0,2406,Classification-Based Anomaly Detection for General Data,"[""liron.bergman@mail.huji.ac.il"", ""yedid@cs.huji.ac.il""]","[""Liron Bergman"", ""Yedid Hoshen""]","[""anomaly detection""]","Anomaly detection, finding patterns that substantially deviate from those seen previously, is one of the fundamental problems of artificial intelligence. Recently, classification-based methods were shown to achieve superior results on this task. In this work, we present a unifying view and propose an open-set method, GOAD, to relax current generalization assumptions. Furthermore, we extend the applicability of transformation-based methods to non-image data using random affine transformations. Our method is shown to obtain state-of-the-art accuracy and is applicable to broad data types. The strong performance of our method is extensively validated on multiple datasets from different domains. ",/pdf/0cc1c7474fb7a5d646d371f84aef56b3e920caa7.pdf,ICLR,2020,"Anomaly detection method that uses: openset techniques for better generalization, random-transformation classification for non-image data." +dyaIRud1zXg,ONoqKgrizmQ,1601310000000.0,1616630000000.0,309,Information Laundering for Model Privacy,"[""wang8740@umn.edu"", ""yu.xiang@utah.edu"", ""0618johnny@gmail.com"", ""~Jie_Ding2""]","[""Xinran Wang"", ""Yu Xiang"", ""Jun Gao"", ""Jie Ding""]","[""Adversarial Attack"", ""Machine Learning"", ""Model privacy"", ""Privacy-utility tradeoff"", ""Security""]","In this work, we propose information laundering, a novel framework for enhancing model privacy. Unlike data privacy that concerns the protection of raw data information, model privacy aims to protect an already-learned model that is to be deployed for public use. The private model can be obtained from general learning methods, and its deployment means that it will return a deterministic or random response for a given input query. 
An information-laundered model consists of probabilistic components that deliberately maneuver the intended input and output for queries of the model, so the model's adversarial acquisition is less likely. Under the proposed framework, we develop an information-theoretic principle to quantify the fundamental tradeoffs between model utility and privacy leakage and derive the optimal design.",/pdf/1ad035bf98810a860bec5ef38d3032842170c0e5.pdf,ICLR,2021,"We propose information laundering, a novel framework for enhancing model privacy." +BJeVklHtPr,B1x-CFyKwr,1569440000000.0,1577170000000.0,2059,Batch Normalization has Multiple Benefits: An Empirical Study on Residual Networks,"[""sohamde@google.com"", ""slsmith@google.com""]","[""Soham De"", ""Samuel L Smith""]","[""batch normalization"", ""residual networks"", ""initialization"", ""batch size"", ""learning rate"", ""ImageNet""]","Many state of the art models rely on two architectural innovations; skip connections and batch normalization. However batch normalization has a number of limitations. It breaks the independence between training examples within a batch, performs poorly when the batch size is too small, and significantly increases the cost of computing a parameter update in some models. This work identifies two practical benefits of batch normalization. First, it improves the final test accuracy. Second, it enables efficient training with larger batches and larger learning rates. However we demonstrate that the increase in the largest stable learning rate does not explain why the final test accuracy is increased under a finite epoch budget. Furthermore, we show that the gap in test accuracy between residual networks with and without batch normalization can be dramatically reduced by improving the initialization scheme. We introduce “ZeroInit”, which trains a 1000 layer deep Wide-ResNet without normalization to 94.3% test accuracy on CIFAR-10 in 200 epochs at batch size 64. 
This initialization scheme outperforms batch normalization when the batch size is very small, and is competitive with batch normalization for batch sizes that are not too large. We also show that ZeroInit matches the validation accuracy of batch normalization when training ResNet-50-V2 on ImageNet at batch size 1024.",/pdf/adcf9566d7becda40f0435aa684392f7c6902397.pdf,ICLR,2020,The multiple benefits of batch normalization can only be understood if one experiments at a range of batch sizes +rklklCVYvB,SyefUZ7_wB,1569440000000.0,1577170000000.0,915,Time2Vec: Learning a Vector Representation of Time,"[""mehran.kazemi@borealisai.com"", ""rishab.goel@borealisai.com"", ""sepehr.eghbali@rbc.com"", ""janahan.ramanan@borealisai.com"", ""jaspreet.sahota@borealisai.com"", ""sttsanjay@gmail.com"", ""stella.wu@borealisai.com"", ""cathal.smyth@rbc.com"", ""pascal.poupart@borealisai.com"", ""marcus.brubaker@borealisai.com""]","[""Seyed Mehran Kazemi"", ""Rishab Goel"", ""Sepehr Eghbali"", ""Janahan Ramanan"", ""Jaspreet Sahota"", ""Sanjay Thakur"", ""Stella Wu"", ""Cathal Smyth"", ""Pascal Poupart"", ""Marcus Brubaker""]",[],"Time is an important feature in many applications involving events that occur synchronously and/or asynchronously. To effectively consume time information, recent studies have focused on designing new architectures. In this paper, we take an orthogonal but complementary approach by providing a model-agnostic vector representation for time, called Time2Vec, that can be easily imported into many existing and future architectures and improve their performances. 
We show on a range of models and problems that replacing the notion of time with its Time2Vec representation improves the performance of the final model.",/pdf/7f65d5be0ae209d58bc5ee91db2a881cb2f18d2e.pdf,ICLR,2020, +OZgVHzdKicb,aN5qXaVpa0p,1601310000000.0,1614990000000.0,2833,Reinforcement Learning with Bayesian Classifiers: Efficient Skill Learning from Outcome Examples,"[""kevintli@berkeley.edu"", ""~Abhishek_Gupta1"", ""~Vitchyr_H._Pong1"", ""~Ashwin_Reddy1"", ""~Aurick_Zhou1"", ""justinvyu@berkeley.edu"", ""~Sergey_Levine1""]","[""Kevin Li"", ""Abhishek Gupta"", ""Vitchyr H. Pong"", ""Ashwin Reddy"", ""Aurick Zhou"", ""Justin Yu"", ""Sergey Levine""]","[""Reinforcement Learning"", ""Goal Reaching"", ""Bayesian Classification"", ""Reward Inference""]","Exploration in reinforcement learning is, in general, a challenging problem. In this work, we study a more tractable class of reinforcement learning problems defined by data that provides examples of successful outcome states. In this case, the reward function can be obtained automatically by training a classifier to classify states as successful or not. We argue that, with appropriate representation and regularization, such a classifier can guide a reinforcement learning algorithm to an effective solution. However, as we will show, this requires the classifier to make uncertainty-aware predictions that are very difficult with standard deep networks. To address this, we propose a novel mechanism for obtaining calibrated uncertainty based on an amortized technique for computing the normalized maximum likelihood distribution. We show that the resulting algorithm has a number of intriguing connections to both count-based exploration methods and prior algorithms for learning reward functions from data, while being able to guide algorithms towards the specified goal more effectively. 
We show how using amortized normalized maximum likelihood for reward inference is able to provide effective reward guidance for solving a number of challenging navigation and robotic manipulation tasks which prove difficult for other algorithms.",/pdf/4cb6111b9f472cc00efdfc95670c7987db5cc420.pdf,ICLR,2021,Bayesian classifiers allow efficient reinforcement learning and reward inference from outcome examples +HylKvyHYwS,HJeo1paODS,1569440000000.0,1577170000000.0,1775,Learning with Protection: Rejection of Suspicious Samples under Adversarial Environment,"[""mkato.csecon@gmail.com"", ""gatheluck@gmail.com"", ""hirokatsu.kataoka@aist.go.jp"", ""shigeo@waseda.jp""]","[""Masahiro Kato"", ""Yoshihiro Fukuhara"", ""Hirokatsu Kataoka"", ""Shigeo Morishima""]","[""Learning with Rejection"", ""Adversarial Examples""]","We propose a novel framework for avoiding the misclassification of data by using a framework of learning with rejection and adversarial examples. Recent developments in machine learning have opened new opportunities for industrial innovations such as self-driving cars. However, many machine learning models are vulnerable to adversarial attacks and industrial practitioners are concerned about accidents arising from misclassification. To avoid critical misclassifications, we define a sample that is likely to be mislabeled as a suspicious sample. Our main idea is to apply a framework of learning with rejection and adversarial examples to assist in the decision making for such suspicious samples. We propose two frameworks, learning with rejection under adversarial attacks and learning with protection. Learning with rejection under adversarial attacks is a naive extension of the learning with rejection framework for handling adversarial examples. Learning with protection is a practical application of learning with rejection under adversarial attacks. 
This algorithm transforms the original multi-class classification problem into a binary classification for a specific class, and we reject suspicious samples to protect a specific label. We demonstrate the effectiveness of the proposed method in experiments.",/pdf/aefbd7043edbf32f6e64a464dc0c1c7f3db98f16.pdf,ICLR,2020, +HJlWXhC5Km,SyxTCZAct7,1538090000000.0,1545360000000.0,1330,Learning to Control Visual Abstractions for Structured Exploration in Deep Reinforcement Learning,"[""cdi@google.com"", ""tkulkarni@google.com"", ""avdnoord@google.com"", ""amnih@google.com"", ""vmnih@google.com""]","[""catalin ionescu"", ""tejas kulkarni"", ""aaron van de oord"", ""andriy mnih"", ""vlad mnih""]","[""exploration"", ""deep reinforcement learning"", ""intrinsic motivation"", ""unsupervised learning""]","Exploration in environments with sparse rewards is a key challenge for reinforcement learning. How do we design agents with generic inductive biases so that they can explore in a consistent manner instead of just using local exploration schemes like epsilon-greedy? We propose an unsupervised reinforcement learning agent which learns a discrete pixel grouping model that preserves spatial geometry of the sensors and implicitly of the environment as well. We use this representation to derive geometric intrinsic reward functions, like centroid coordinates and area, and learn policies to control each one of them with off-policy learning. These policies form a basis set of behaviors (options) which allows us explore in a consistent way and use them in a hierarchical reinforcement learning setup to solve for extrinsically defined rewards. 
We show that our approach can scale to a variety of domains with competitive performance, including navigation in 3D environments and Atari games with sparse rewards.",/pdf/748c4242f1ffa85472f907e610157d4664387154.pdf,ICLR,2019,structured exploration in deep reinforcement learning via unsupervised visual abstraction discovery and control +Rcmk0xxIQV,w4_nX9r_cN3X,1601310000000.0,1616300000000.0,1649,QPLEX: Duplex Dueling Multi-Agent Q-Learning,"[""~Jianhao_Wang1"", ""~Zhizhou_Ren1"", ""~Terry_Liu2"", ""~Yang_Yu5"", ""~Chongjie_Zhang1""]","[""Jianhao Wang"", ""Zhizhou Ren"", ""Terry Liu"", ""Yang Yu"", ""Chongjie Zhang""]","[""Multi-agent reinforcement learning"", ""Value factorization"", ""Dueling structure""]","We explore value-based multi-agent reinforcement learning (MARL) in the popular paradigm of centralized training with decentralized execution (CTDE). CTDE has an important concept, Individual-Global-Max (IGM) principle, which requires the consistency between joint and local action selections to support efficient local decision-making. However, in order to achieve scalability, existing MARL methods either limit representation expressiveness of their value function classes or relax the IGM consistency, which may suffer from instability risk or may not perform well in complex domains. This paper presents a novel MARL approach, called duPLEX dueling multi-agent Q-learning (QPLEX), which takes a duplex dueling network architecture to factorize the joint value function. This duplex dueling structure encodes the IGM principle into the neural network architecture and thus enables efficient value function learning. Theoretical analysis shows that QPLEX achieves a complete IGM function class. 
Empirical experiments on StarCraft II micromanagement tasks demonstrate that QPLEX significantly outperforms state-of-the-art baselines in both online and offline data collection settings, and also reveal that QPLEX achieves high sample efficiency and can benefit from offline datasets without additional online exploration.",/pdf/b346f46b6a8ed100f25b8d16fde7d2500965fd70.pdf,ICLR,2021,A novel multi-agent Q-learning algorithm with a complete IGM (Individual-Global-Max) function class. +S1LVSrcge,,1478280000000.0,1488480000000.0,246,Variable Computation in Recurrent Neural Networks,"[""yacine.jernite@nyu.edu"", ""egrave@fb.com"", ""ajoulin@fb.com"", ""tmikolov@fb.com""]","[""Yacine Jernite"", ""Edouard Grave"", ""Armand Joulin"", ""Tomas Mikolov""]","[""Natural language processing"", ""Deep learning""]","Recurrent neural networks (RNNs) have been used extensively and with increasing success to model various types of sequential data. Much of this progress has been achieved through devising recurrent units and architectures with the flexibility to capture complex statistics in the data, such as long range dependency or localized attention phenomena. However, while many sequential data (such as video, speech or language) can have highly variable information flow, most recurrent models still consume input features at a constant rate and perform a constant number of computations per time step, which can be detrimental to both speed and model capacity. In this paper, we explore a modification to existing recurrent units which allows them to learn to vary the amount of computation they perform at each step, without prior knowledge of the sequence's time structure. 
We show experimentally that not only do our models require fewer operations, they also lead to better performance overall on evaluation tasks.",/pdf/4c681b897b2c27a1c8791eda73ed6815f74132c7.pdf,ICLR,2017,"We show that an RNN can learn to control the amount of computation it does at each time step, leading to better efficiency and performance as well as discovering time patterns of interest." +Syx9ET4YPB,rygozjmwDH,1569440000000.0,1577170000000.0,497,Do Image Classifiers Generalize Across Time?,"[""vaishaal@berkeley.edu"", ""achald@cs.cmu.edu"", ""roelofs@cs.berkely.edu"", ""deva@cs.cmu.edu"", ""brecht@berkeley.edu"", ""ludwigschmidt2@gmail.com""]","[""Vaishaal Shankar"", ""Achal Dave"", ""Rebecca Roelofs"", ""Deva Ramanan"", ""Ben Recht"", ""Ludwig Schmidt""]","[""robustness"", ""image classification"", ""distribution shift""]","We study the robustness of image classifiers to temporal perturbations derived from videos. As part of this study, we construct ImageNet-Vid-Robust and YTBB-Robust, containing a total 57,897 images grouped into 3,139 sets of perceptually similar images. Our datasets were derived from ImageNet-Vid and Youtube-BB respectively and thoroughly re-annotated by human experts for image similarity. We evaluate a diverse array of classifiers pre-trained on ImageNet and show a median classification accuracy drop of 16 and 10 percent on our two datasets. Additionally, we evaluate three detection models and show that natural perturbations induce both classification as well as localization errors, leading to a median drop in detection mAP of 14 points. 
Our analysis demonstrates that perturbations occurring naturally in videos pose a substantial and realistic challenge to deploying convolutional neural networks in environments that require both reliable and low-latency predictions.",/pdf/6a70b8252f8901af3e6b77af6fe6ac6ecc30b2d4.pdf,ICLR,2020,We systematically measure the sensitivity of image classifiers to temporal perturbations by introducing two human-reviewed benchmarks of similar video frames. +mPmCP2CXc7p,b5FCnvp6SY4,1601310000000.0,1614990000000.0,2895,Dynamic Feature Selection for Efficient and Interpretable Human Activity Recognition,"[""~Randy_Ardywibowo1"", ""~Shahin_Boluki1"", ""~Zhangyang_Wang1"", ""~Bobak_J_Mortazavi1"", ""~Shuai_Huang1"", ""~Xiaoning_Qian2""]","[""Randy Ardywibowo"", ""Shahin Boluki"", ""Zhangyang Wang"", ""Bobak J Mortazavi"", ""Shuai Huang"", ""Xiaoning Qian""]","[""dynamic feature selection"", ""human activity recognition"", ""sparse monitoring""]","In many machine learning tasks, input features with varying degrees of predictive capability are usually acquired at some cost. For example, in human activity recognition (HAR) and mobile health (mHealth) applications, monitoring performance should be achieved with a low cost to gather different sensory features, as maintaining sensors incur monetary, computation, and energy cost. We propose an adaptive feature selection method that dynamically selects features for prediction at any given time point. We formulate this problem as an $\ell_0$ minimization problem across time, and cast the combinatorial optimization problem into a stochastic optimization formulation. We then utilize a differentiable relaxation to make the problem amenable to gradient-based optimization. Our evaluations on four activity recognition datasets show that our method achieves a favorable trade-off between performance and the number of features used. 
Moreover, the dynamically selected features of our approach are shown to be interpretable and associated with the actual activity types.",/pdf/90ae614f77bddbf5e9f886151a7fe2c54ddf1cdb.pdf,ICLR,2021,We propose a task-driven dynamic feature selection method to perform human activity recognition efficiently. +B1akgy9xx,,1478250000000.0,1481610000000.0,157,Making Stochastic Neural Networks from Deterministic Ones,"[""kiminlee@kaist.ac.kr"", ""jaehyungkim@kaist.ac.kr"", ""songchong@kaist.edu"", ""jinwoos@kaist.ac.kr""]","[""Kimin Lee"", ""Jaehyung Kim"", ""Song Chong"", ""Jinwoo Shin""]","[""Deep learning"", ""Multi-modal learning"", ""Structured prediction""]","It has been believed that stochastic feedforward neural networks (SFNN) have several advantages beyond deterministic deep neural networks (DNN): they have more expressive power allowing multi-modal mappings and regularize better due to their stochastic nature. However, training SFNN is notoriously harder. In this paper, we aim at developing efficient training methods for large-scale SFNN, in particular using known architectures and pre-trained parameters of DNN. To this end, we propose a new intermediate stochastic model, called Simplified-SFNN, which can be built upon any baseline DNN and approximates certain SFNN by simplifying its upper latent units above stochastic ones. The main novelty of our approach is in establishing the connection between three models, i.e., DNN +-> Simplified-SFNN -> SFNN, which naturally leads to an efficient training procedure of the stochastic models utilizing pre-trained parameters of DNN. Using several popular DNNs, we show how they can be effectively transferred to the corresponding stochastic models for both multi-modal and classification tasks on MNIST, TFD, CIFAR-10, CIFAR-100 and SVHN datasets. 
In particular, our stochastic model built from the wide residual network has 28 layers and 36 million parameters, where the former consistently outperforms the latter for the classification tasks on CIFAR-10 and CIFAR-100 due to its stochastic regularizing effect.",/pdf/7e54b645b9aa1f5d582c664a72efdf8027099b94.pdf,ICLR,2017, +SyxZOsA9tX,BJePSQ-KFX,1538090000000.0,1545360000000.0,329,Accelerated Value Iteration via Anderson Mixing,"[""liyujun145@gmail.com"", ""hzxsncz@pku.edu.cn"", ""smsxgz@pku.edu.cn"", ""yangwenhaosms@pku.edu.cn"", ""zsc@megvii.com"", ""zhzhang@math.pku.edu.cn""]","[""Yujun Li"", ""Chengzhuo Ni"", ""Guangzeng Xie"", ""Wenhao Yang"", ""Shuchang Zhou"", ""Zhihua Zhang""]","[""Reinforcement Learning""]","Acceleration for reinforcement learning methods is an important and challenging theme. We introduce the Anderson acceleration technique into the value iteration, developing an accelerated value iteration algorithm that we call Anderson Accelerated Value Iteration (A2VI). We further apply our method to the Deep Q-learning algorithm, resulting in the Deep Anderson Accelerated Q-learning (DA2Q) algorithm. Our approach can be viewed as an approximation of the policy evaluation by interpolating on historical data. A2VI is more efficient than the modified policy iteration, which is a classical approximate method for policy evaluation. We give a theoretical analysis of our algorithm and conduct experiments on both toy problems and Atari games. Both the theoretical and empirical results show the effectiveness of our algorithm.",/pdf/c6280c4ffd5eb1aaa023024ac8ce55d96e2d94cf.pdf,ICLR,2019, +S1xNb2A9YX,HJgDw6icFQ,1538090000000.0,1549470000000.0,1161,Minimal Images in Deep Neural Networks: Fragile Object Recognition in Natural Images,"[""sanjanas@mit.edu"", ""gby@csail.mit.edu"", ""xboix@mit.edu""]","[""Sanjana Srivastava"", ""Guy Ben-Yosef"", ""Xavier Boix""]",[],"The human ability to recognize objects is impaired when the object is not shown in full. 
""Minimal images"" are the smallest regions of an image that remain recognizable for humans. Ullman et al. (2016) show that a slight modification of the location and size of the visible region of the minimal image produces a sharp drop in human recognition accuracy. In this paper, we demonstrate that such drops in accuracy due to changes of the visible region are a common phenomenon between humans and existing state-of-the-art deep neural networks (DNNs), and are much more prominent in DNNs. We found many cases where DNNs classified one region correctly and the other incorrectly, though they only differed by one row or column of pixels, and were often bigger than the average human minimal image size. We show that this phenomenon is independent from previous works that have reported lack of invariance to minor modifications in object location in DNNs. Our results thus reveal a new failure mode of DNNs that also affects humans to a much lesser degree. They expose how fragile DNN recognition ability is in natural images even without adversarial patterns being introduced. Bringing the robustness of DNNs in natural images to the human level remains an open challenge for the community. ",/pdf/671fe59b5a0eb1fd9e416a6de6266c2cfbfdf809.pdf,ICLR,2019, +RcJHy18g1M,uWLsDLuhyGS,1601310000000.0,1614990000000.0,2788,Outlier Preserving Distribution Mapping Autoencoders ,"[""~Walter_Gerych2"", ""~Elke_Rundensteiner2"", ""~Emmanuel_Agu1""]","[""Walter Gerych"", ""Elke Rundensteiner"", ""Emmanuel Agu""]",[],"State-of-the-art deep outlier detection methods map data into a latent space with the aim of having outliers far away from inliers in this space. Unfortunately, this often fails as the divergence penalty they adopt pushes outliers into the same high-probability regions as inliers. We propose a novel method, OP-DMA, that successfully addresses the above problem. 
OP-DMA succeeds in mapping outliers to low probability regions in the latent space by leveraging a novel Prior-Weighted Loss (PWL) that utilizes the insight that outliers are likely to have a higher reconstruction error than inliers. Building on this insight, OP-DMA weights the reconstruction error of individual points by a multivariate Gaussian probability density function evaluated at each point's latent representation. We demonstrate and provide theoretical proof that this succeeds to map outliers to low-probability regions. Our experimental study shows that OP-DMA consistently outperforms state-of-art methods on a rich variety of outlier detection benchmark datasets.",/pdf/8174a3949997ac415174c0ee19931d3e098c63c4.pdf,ICLR,2021,A novel type of autoencoder that encourages outliers in the feature space to be easily identifiable in the latent space +ByJDAIe0b,S1A8RIlRZ,1509090000000.0,1518730000000.0,281,Integrating Episodic Memory into a Reinforcement Learning Agent Using Reservoir Sampling,"[""kjyoung@ualberta.ca"", ""rsutton@ualberta.ca"", ""s-yan14@mails.tsinghua.edu.cn""]","[""Kenny J. Young"", ""Shuo Yang"", ""Richard S. Sutton""]","[""reinforcement learning"", ""external memory"", ""deep learning"", ""policy gradient"", ""online learning""]","Episodic memory is a psychology term which refers to the ability to recall specific events from the past. We suggest one advantage of this particular type of memory is the ability to easily assign credit to a specific state when remembered information is found to be useful. Inspired by this idea, and the increasing popularity of external memory mechanisms to handle long-term dependencies in deep learning systems, we propose a novel algorithm which uses a reservoir sampling procedure to maintain an external memory consisting of a fixed number of past states. The algorithm allows a deep reinforcement learning agent to learn online to preferentially remember those states which are found to be useful to recall later on. 
Critically this method allows for efficient online computation of gradient estimates with respect to the write process of the external memory. Thus unlike most prior mechanisms for external memory it is feasible to use in an online reinforcement learning setting. +",/pdf/063d2aa84629890443415eaf17b6f5ff56be36f8.pdf,ICLR,2018,External memory for online reinforcement learning based on estimating gradients over a novel reservoir sampling technique. +33TBJachvOX,mFRMRcxjGV0,1601310000000.0,1614990000000.0,622,How to compare adversarial robustness of classifiers from a global perspective,"[""~Niklas_Risse1"", ""jgoepfert@techfak.uni-bielefeld.de"", ""~Christina_G\u00f6pfert1""]","[""Niklas Risse"", ""Jan Philip G\u00f6pfert"", ""Christina G\u00f6pfert""]","[""adversarial robustness"", ""robustness"", ""adversarial defense"", ""adversarial example""]","Adversarial robustness of machine learning models has attracted considerable attention over recent years. Adversarial attacks undermine the reliability of and trust in machine learning models, but the construction of more robust models hinges on a rigorous understanding of adversarial robustness as a property of a given model. Point-wise measures for specific threat models are currently the most popular tool for comparing the robustness of classifiers and are used in most recent publications on adversarial robustness. In this work, we use robustness curves to show that point-wise measures fail to capture important global properties that are essential to reliably compare the robustness of different classifiers. We introduce new ways in which robustness curves can be used to systematically uncover these properties and provide concrete recommendations for researchers and practitioners when assessing and comparing the robustness of trained models. 
Furthermore, we characterize scale as a way to distinguish small and large perturbations, and relate it to inherent properties of data sets, demonstrating that robustness thresholds must be chosen accordingly. We hope that our work contributes to a shift of focus away from point-wise measures of robustness and towards a discussion of the question what kind of robustness could and should reasonably be expected. We release code to reproduce all experiments presented in this paper, which includes a Python module to calculate robustness curves for arbitrary data sets and classifiers, supporting a number of frameworks, including TensorFlow, PyTorch and JAX.",/pdf/71721c020e98b9f935cc3fde6f9390208b253151.pdf,ICLR,2021,"We demonstrate that point-wise measures are insufficient to adequately compare the adversarial robustness of differently trained models, and provide a module for global robustness analysis to reveal individual strengths of competing methods." +ryl71a4YPB,HJeSc_BBDH,1569440000000.0,1577170000000.0,296,A Unified framework for randomized smoothing based certified defenses,"[""th.zheng@mail.utoronto.ca"", ""dwang45@buffalo.edu"", ""bli@ece.toronto.edu"", ""jinhui@buffalo.edu""]","[""Tianhang Zheng"", ""Di Wang"", ""Baochun Li"", ""Jinhui Xu""]","[""Certificated Defense"", ""Randomized Smoothing"", ""A Unified and Self-Contained Framework""]","Randomized smoothing, which was recently proved to be a certified defensive technique, has received considerable attention due to its scalability to large datasets and neural networks. However, several important questions still remain unanswered in the existing frameworks, such as (i) whether Gaussian mechanism is an optimal choice for certifying $\ell_2$-normed robustness, and (ii) whether randomized smoothing can certify $\ell_\infty$-normed robustness (on high-dimensional datasets like ImageNet). 
To answer these questions, we introduce a {\em unified} and {\em self-contained} framework to study randomized smoothing-based certified defenses, where we mainly focus on the two most popular norms in adversarial machine learning, {\em i.e.,} $\ell_2$ and $\ell_\infty$ norm. We answer the above two questions by first demonstrating that Gaussian mechanism and Exponential mechanism are the (near) optimal options to certify the $\ell_2$ and $\ell_\infty$-normed robustness. We further show that the largest $\ell_\infty$ radius certified by randomized smoothing is upper bounded by $O(1/\sqrt{d})$, where $d$ is the dimensionality of the data. This theoretical finding suggests that certifying $\ell_\infty$-normed robustness by randomized smoothing may not be scalable to high-dimensional data. The veracity of our framework and analysis is verified by extensive evaluations on CIFAR10 and ImageNet.",/pdf/e42c60dd3bae5b95cd39ef204478cb4920493611.pdf,ICLR,2020, +H1zriGeCZ,BJWrjGeRb,1509070000000.0,1519270000000.0,227,Hyperparameter optimization: a spectral approach,"[""ehazan@cs.princeton.edu"", ""klivans@cs.utexas.edu"", ""yangyuan@cs.cornell.edu""]","[""Elad Hazan"", ""Adam Klivans"", ""Yang Yuan""]","[""Hyperparameter Optimization"", ""Fourier Analysis"", ""Decision Tree"", ""Compressed Sensing""]","We give a simple, fast algorithm for hyperparameter optimization inspired by techniques from the analysis of Boolean functions. We focus on the high-dimensional regime where the canonical example is training a neural network with a large number of hyperparameters. The algorithm --- an iterative application of compressed sensing techniques for orthogonal polynomials --- requires only uniform sampling of the hyperparameters and is thus easily parallelizable. 
+ +Experiments for training deep neural networks on Cifar-10 show that compared to state-of-the-art tools (e.g., Hyperband and Spearmint), our algorithm finds significantly improved solutions, in some cases better than what is attainable by hand-tuning. In terms of overall running time (i.e., time required to sample various settings of hyperparameters plus additional computation time), we are at least an order of magnitude faster than Hyperband and Bayesian Optimization. We also outperform Random Search $8\times$. + +Our method is inspired by provably-efficient algorithms for learning decision trees using the discrete Fourier transform. We obtain improved sample-complexity bounds for learning decision trees while matching state-of-the-art bounds on running time (polynomial and quasipolynomial, respectively). ",/pdf/a39df6d978fe8dd1e2e97e5619c79845272f5c15.pdf,ICLR,2018,A hyperparameter tuning algorithm using discrete Fourier analysis and compressed sensing +SJiFvr9el,,1478280000000.0,1484320000000.0,258,Linear Time Complexity Deep Fourier Scattering Network and Extension to Nonlinear Invariants,"[""randallbalestriero@gmail.com"", ""glotin@univ-tln.fr""]","[""Randall Balestriero"", ""Herve Glotin""]","[""Unsupervised Learning"", ""Applications"", ""Deep learning""]","In this paper we propose a scalable version of a state-of-the-art deterministic time- +invariant feature extraction approach based on consecutive changes of basis and +nonlinearities, namely, the scattering network. The first focus of the paper is to +extend the scattering network to allow the use of higher order nonlinearities as +well as extracting nonlinear and Fourier based statistics leading to the required in- +variants of any inherently structured input. In order to reach fast convolutions and +to leverage the intrinsic structure of wavelets, we derive our complete model in the +Fourier domain. 
In addition of providing fast computations, we are now able to +exploit sparse matrices due to extremely high sparsity well localized in the Fourier +domain. As a result, we are able to reach a true linear time complexity with in- +puts in the Fourier domain allowing fast and energy efficient solutions to machine +learning tasks. Validation of the features and computational results will be pre- +sented through the use of these invariant coefficients to perform classification on +audio recordings of bird songs captured in multiple different soundscapes. In the +end, the applicability of the presented solutions to deep artificial neural networks +is discussed.",/pdf/ccf6dd518a06d53a282785894acd7544cd1041d6.pdf,ICLR,2017,This paper proposes an extension of the Scattering Network in the Fourier domain and with nonlinear invariant computation for fast and scalable unsupervised representations +KiFeuZu24k,MDtEL_dIdL-,1601310000000.0,1614990000000.0,225,Global Self-Attention Networks for Image Recognition,"[""~Zhuoran_Shen1"", ""~Irwan_Bello1"", ""~Raviteja_Vemulapalli1"", ""~Xuhui_Jia1"", ""~Ching-Hui_Chen2""]","[""Zhuoran Shen"", ""Irwan Bello"", ""Raviteja Vemulapalli"", ""Xuhui Jia"", ""Ching-Hui Chen""]","[""self-attention"", ""neural network architecture"", ""image classification"", ""semantic segmentation""]","Recently, a series of works in computer vision have shown promising results on various image and video understanding tasks using self-attention. However, due to the quadratic computational and memory complexities of self-attention, these works either apply attention only to low-resolution feature maps in later stages of a deep network or restrict the receptive field of attention in each layer to a small local region. To overcome these limitations, this work introduces a new global self-attention module, referred to as the GSA module, which is efficient enough to serve as the backbone component of a deep network. 
This module consists of two parallel layers: a content attention layer that attends to pixels based only on their content and a positional attention layer that attends to pixels based on their spatial locations. The output of this module is the sum of the outputs of the two layers. Based on the proposed GSA module, we introduce new standalone global attention-based deep networks that use GSA modules instead of convolutions to model pixel interactions. Due to the global extent of the proposed GSA module, a GSA network has the ability to model long-range pixel interactions throughout the network. Our experimental results show that GSA networks outperform the corresponding convolution-based networks significantly on the CIFAR-100 and ImageNet datasets while using less number of parameters and computations. The proposed GSA networks also outperform various existing attention-based networks on the ImageNet dataset.",/pdf/ad0c3a0fda0c29c1b8ab6a0d2562acee5f5083fe.pdf,ICLR,2021,A fully-attentional backbone architecture for vision tasks. +nG4Djb4h8Re,oTj0UFjA5HL,1601310000000.0,1614990000000.0,703,MetaPhys: Few-Shot Adaptation for Non-Contact Physiological Measurement,"[""~Xin_Liu8"", ""~Ziheng_Jiang1"", ""~Joshua_Wolff_Fromm1"", ""~Xuhai_Xu1"", ""~Shwetak_Patel1"", ""~Daniel_McDuff1""]","[""Xin Liu"", ""Ziheng Jiang"", ""Joshua Wolff Fromm"", ""Xuhai Xu"", ""Shwetak Patel"", ""Daniel McDuff""]","[""Healthcare"", ""Meta Learning"", ""Computer Vision""]","There are large individual differences in physiological processes, making designing personalized health sensing algorithms challenging. Existing machine learning systems struggle to generalize well to unseen subjects or contexts, especially in video-based physiological measurement. Although fine-tuning for a user might address this issue, it is difficult to collect large sets of training data for specific individuals because supervised algorithms require medical-grade sensors for generating the training target. 
Therefore, learning personalized or customized models from a small number of unlabeled samples is very attractive as it would allow fast calibrations. In this paper, we present a novel meta-learning approach called MetaPhys for learning personalized cardiac signals from 18-seconds of video data. MetaPhys works in both supervised and unsupervised manners. We evaluate our proposed approach on two benchmark datasets and demonstrate superior performance in cross-dataset evaluation with substantial reductions (42% to 44%) in errors compared with state-of-the-art approaches. Visualization of attention maps and ablation experiments reveal how the model adapts to each subject and why our proposed approach leads to these improvements. We have also demonstrated our proposed method significantly helps reduce the bias in skin type. +",/pdf/9db81ac5f75626e42396ae15c95a83ceadce9587.pdf,ICLR,2021,MetaPhys: A novel meta-learning approach for learning personalized cardiovascular signals from 18-seconds of video data. +W1G1JZEIy5_,JYJibHNYwak,1601310000000.0,1611610000000.0,2175,MIROSTAT: A NEURAL TEXT DECODING ALGORITHM THAT DIRECTLY CONTROLS PERPLEXITY,"[""~Sourya_Basu1"", ""gramachandran@salesforce.com"", ""~Nitish_Shirish_Keskar1"", ""~Lav_R._Varshney1""]","[""Sourya Basu"", ""Govardana Sachitanandam Ramachandran"", ""Nitish Shirish Keskar"", ""Lav R. Varshney""]","[""Neural text decoding"", ""sampling algorithms"", ""cross-entropy"", ""repetitions"", ""incoherence""]","Neural text decoding algorithms strongly influence the quality of texts generated using language models, but popular algorithms like top-k, top-p (nucleus), and temperature-based sampling may yield texts that have objectionable repetition or incoherence. Although these methods generate high-quality text after ad hoc parameter tuning that depends on the language model and the length of generated text, not much is known about the control they provide over the statistics of the output. 
This is important, however, since recent reports show that humans prefer when perplexity is neither too much nor too little and since we experimentally show that cross-entropy (log of perplexity) has a near-linear relation with repetition. First, we provide a theoretical analysis of perplexity in top-k, top-p, and temperature sampling, under Zipfian statistics. Then, we use this analysis to design a feedback-based adaptive top-k text decoding algorithm called mirostat that generates text (of any length) with a predetermined target value of perplexity without any tuning. Experiments show that for low values of k and p, perplexity drops significantly with generated text length and leads to excessive repetitions (the boredom trap). Contrarily, for large values of k and p, perplexity increases with generated text length and leads to incoherence (confusion trap). Mirostat avoids both traps. Specifically, we show that setting target perplexity value beyond a threshold yields negligible sentence-level repetitions. Experiments with +human raters for fluency, coherence, and quality further verify our findings.",/pdf/8e680c2eecb056aeaf306b71235d6dc6748d9d33.pdf,ICLR,2021,We provide a new text decoding algorithm that directly controls perplexity and hence several important attributes of generated text. +TgSVWXw22FQ,1WpgS7RVd5p,1601310000000.0,1615950000000.0,1968,Improving Zero-Shot Voice Style Transfer via Disentangled Representation Learning,"[""~Siyang_Yuan1"", ""~Pengyu_Cheng1"", ""~Ruiyi_Zhang3"", ""~Weituo_Hao1"", ""~Zhe_Gan1"", ""~Lawrence_Carin2""]","[""Siyang Yuan"", ""Pengyu Cheng"", ""Ruiyi Zhang"", ""Weituo Hao"", ""Zhe Gan"", ""Lawrence Carin""]","[""Style Transfer"", ""Mutual Information"", ""Zero-shot Learning"", ""Disentanglement""]","Voice style transfer, also called voice conversion, seeks to modify one speaker's voice to generate speech as if it came from another (target) speaker. 
Previous works have made progress on voice conversion with parallel training data and pre-known speakers. However, zero-shot voice style transfer, which learns from non-parallel data and generates voices for previously unseen speakers, remains a challenging problem. In this paper we propose a novel zero-shot voice transfer method via disentangled representation learning. The proposed method first encodes speaker-related style and voice content of each input voice into separate low-dimensional embedding spaces, and then transfers to a new voice by combining the source content embedding and target style embedding through a decoder. With information-theoretic guidance, the style and content embedding spaces are representative and (ideally) independent of each other. On real-world datasets, our method outperforms other baselines and obtains state-of-the-art results in terms of transfer accuracy and voice naturalness.",/pdf/eeeea139c946265a4c136e25c661eb4a6dd24f3d.pdf,ICLR,2021,An information-theoretic disentangled representation learning framework for zero-shot voice style transfer. +aYbCpFNnHdh,#NAME?,1601310000000.0,1614990000000.0,3392,Visual Question Answering From Another Perspective: CLEVR Mental Rotation Tests,"[""~Christopher_Beckham1"", ""~Martin_Weiss4"", ""~Florian_Golemo1"", ""~Sina_Honari1"", ""~Derek_Nowrouzezahrai1"", ""~Christopher_Pal1""]","[""Christopher Beckham"", ""Martin Weiss"", ""Florian Golemo"", ""Sina Honari"", ""Derek Nowrouzezahrai"", ""Christopher Pal""]","[""vqa"", ""clevr"", ""contrastive learning"", ""3d"", ""inverse graphics""]","Different types of \emph{mental rotation tests} have been used extensively in psychology to understand human visual reasoning and perception. Understanding what an object or visual scene would look like from another viewpoint is a challenging problem that is made even harder if it must be performed from a single image. 3D computer vision has a long history of examining related problems. 
However, often what one is most interested in is the answer to a relatively simple question posed in another visual frame of reference -- as opposed to creating a full 3D reconstruction. +Mental rotations tests can also manifest as consequential questions in the real world such as: does the pedestrian that I see, see the car that I am driving? +We explore a controlled setting whereby questions are posed about the properties of a scene if the scene were observed from another viewpoint. To do this we have created a new version of the CLEVR VQA problem setup and dataset that we call CLEVR Mental Rotation Tests or CLEVR-MRT, where the goal is to answer questions about the original CLEVR viewpoint given a single image obtained from a different viewpoint of the same scene. Using CLEVR Mental Rotation Tests we examine standard state of the art methods, show how they fall short, then explore novel neural architectures that involve inferring representations encoded as feature volumes describing a scene. Our new methods use rigid transformations of feature volumes conditioned on the viewpoint camera. We examine the efficacy of different model variants through performing a rigorous ablation study. Furthermore, we examine the use of contrastive learning to infer a volumetric encoder in a self-supervised manner and find that this approach yields the best results of our study using CLEVR-MRT.",/pdf/43a2f708fe7d04df631fd1795a8a11bd586df82d.pdf,ICLR,2021,"We propose a version of CLEVR with the problem of performing VQA under mental rotations, as well as methods that perform well on this task via the use and manipulation of 3D feature volumes." +Hkbd5xZRb,S1JdceW0Z,1509130000000.0,1519570000000.0,615,Spherical CNNs,"[""taco.cohen@gmail.com"", ""geiger.mario@gmail.com"", ""jonas.koehler.ks@gmail.com"", ""m.welling@uva.nl""]","[""Taco S. 
Cohen"", ""Mario Geiger"", ""Jonas K\u00f6hler"", ""Max Welling""]","[""deep learning"", ""equivariance"", ""convolution"", ""group convolution"", ""3D"", ""vision"", ""omnidirectional"", ""shape recognition"", ""molecular energy regression""]","Convolutional Neural Networks (CNNs) have become the method of choice for learning problems involving 2D planar images. However, a number of problems of recent interest have created a demand for models that can analyze spherical images. Examples include omnidirectional vision for drones, robots, and autonomous cars, molecular regression problems, and global weather and climate modelling. A naive application of convolutional networks to a planar projection of the spherical signal is destined to fail, because the space-varying distortions introduced by such a projection will make translational weight sharing ineffective. + +In this paper we introduce the building blocks for constructing spherical CNNs. We propose a definition for the spherical cross-correlation that is both expressive and rotation-equivariant. The spherical correlation satisfies a generalized Fourier theorem, which allows us to compute it efficiently using a generalized (non-commutative) Fast Fourier Transform (FFT) algorithm. We demonstrate the computational efficiency, numerical accuracy, and effectiveness of spherical CNNs applied to 3D model recognition and atomization energy regression.",/pdf/a6e4deea85776d1c66101a27e5a9c6bdfa9e4b00.pdf,ICLR,2018,"We introduce Spherical CNNs, a convolutional network for spherical signals, and apply it to 3D model recognition and molecular energy regression." 
+b4ach0lGuYO,7ssAhzYjg9,1601310000000.0,1614990000000.0,551,Iterative Image Inpainting with Structural Similarity Mask for Anomaly Detection,"[""~Hitoshi_Nakanishi1"", ""~Masahiro_Suzuki1"", ""~Yutaka_Matsuo1""]","[""Hitoshi Nakanishi"", ""Masahiro Suzuki"", ""Yutaka Matsuo""]","[""anomaly detection"", ""unsupervised learning"", ""structural similarity"", ""generative adversarial network"", ""deep learning""]","Autoencoders have emerged as popular methods for unsupervised anomaly detection. Autoencoders trained on the normal data are expected to reconstruct only the normal features, allowing anomaly detection by thresholding reconstruction errors. However, in practice, autoencoders fail to model small detail and yield blurry reconstructions, which makes anomaly detection challenging. Moreover, there is objective mismatching that models are trained to minimize total reconstruction errors while expecting a small deviation on normal pixels and a large deviation on anomalous pixels. To tackle these two issues, we propose the iterative image inpainting method that reconstructs partial regions in an adaptive inpainting mask matrix. This method constructs inpainting masks from the anomaly score of structural similarity. Overlaying inpainting mask on images, each pixel is bypassed or reconstructed based on the anomaly score, enhancing reconstruction quality. The iterative update of inpainted images and masks by turns purifies the anomaly score directly and follows the expected objective at test time. We evaluated the proposed method using the MVTec Anomaly Detection dataset. 
Our method outperformed previous state-of-the-art in several categories and showed remarkable improvement in high-frequency textures.",/pdf/4983d16d37f3124c42c0b3b5f86ad759755ad2b6.pdf,ICLR,2021,We investigated unsupervised anomaly detection method that utilizes inpainting technique iteratively and purifies anomaly score +BkxoglrtvH,HylYMRytPr,1569440000000.0,1577170000000.0,2112,Layerwise Learning Rates for Object Features in Unsupervised and Supervised Neural Networks And Consequent Predictions for the Infant Visual System,"[""cusackrh@tcd.ie"", ""odoherc1@tcd.ie"", ""birbecka@tcd.ie"", ""truzzia@tcd.ie""]","[""Rhodri Cusack"", ""Cliona O'Doherty"", ""Anna Birbeck"", ""Anna Truzzi""]","[""deep learning"", ""unsupervised"", ""supervised"", ""infant learning"", ""age of acquisition"", ""DeepCluster"", ""CORnet"", ""AlexNet""]","To understand how object vision develops in infancy and childhood, it will be necessary to develop testable computational models. Deep neural networks (DNNs) have proven valuable as models of adult vision, but it is not yet clear if they have any value as models of development. As a first model, we measured learning in a DNN designed to mimic the architecture and representational geometry of the visual system (CORnet). We quantified the development of explicit object representations at each level of this network through training by freezing the convolutional layers and training an additional linear decoding layer. We evaluate decoding accuracy on the whole ImageNet validation set, and also for individual visual classes. CORnet, however, uses supervised training and because infants have only extremely impoverished access to labels they must instead learn in an unsupervised manner. We therefore also measured learning in a state-of-the-art unsupervised network (DeepCluster). 
CORnet and DeepCluster differ in both supervision and in the convolutional networks at their heart, thus to isolate the effect of supervision, we ran a control experiment in which we trained the convolutional network from DeepCluster (an AlexNet variant) in a supervised manner. We make predictions on how learning should develop across brain regions in infants. In all three networks, we also tested for a relationship in the order in which infants and machines acquire visual classes, and found only evidence for a counter-intuitive relationship. We discuss the potential reasons for this.",/pdf/2c13ed9f06fafcb8ec0846dc42d158aa4511d928.pdf,ICLR,2020,Unsupervised networks learn from bottom up; machines and infants acquire visual classes in different orders +tADlrawCrVU,OP5wj5wNlcR,1601310000000.0,1614990000000.0,1170,CoLES: Contrastive learning for event sequences with self-supervision,"[""~Dmitrii_Babaev2"", ""~Nikita_Ovsov1"", ""~Ivan_A_Kireev1"", ""~Gleb_Gusev1"", ""~Maria_Ivanova2"", ""~Alexander_Tuzhilin1""]","[""Dmitrii Babaev"", ""Nikita Ovsov"", ""Ivan A Kireev"", ""Gleb Gusev"", ""Maria Ivanova"", ""Alexander Tuzhilin""]","[""representation learning"", ""contrastive learning"", ""neural networks"", ""event sequences""]","We address the problem of self-supervised learning on discrete event sequences generated by real-world users. Self-supervised learning incorporates complex information from the raw data in low-dimensional fixed-length vector representations that could be easily applied in various downstream machine learning tasks. In this paper, we propose a new method CoLES, which adapts contrastive learning, previously used for audio and computer vision domains, to the discrete event sequences domain in a self-supervised setting. Unlike most previous studies, we theoretically justify under mild conditions that the augmentation method underlying CoLES provides representative samples of discrete event sequences. 
We evaluated CoLES on several public datasets and showed that CoLES representations consistently outperform other methods on different downstream tasks. +",/pdf/743085e69db47ad77cf7f0b7ca2301067d6acedf.pdf,ICLR,2021,"We propose a new method CoLES, which adapts self-supervised contrastive learning, to the discrete event sequence domain" +h2EbJ4_wMVq,IDzy4nZwVfP,1601310000000.0,1616180000000.0,2667,CaPC Learning: Confidential and Private Collaborative Learning,"[""~Christopher_A._Choquette-Choo1"", ""~Natalie_Dullerud1"", ""~Adam_Dziedzic1"", ""~Yunxiang_Zhang1"", ""~Somesh_Jha1"", ""~Nicolas_Papernot1"", ""~Xiao_Wang11""]","[""Christopher A. Choquette-Choo"", ""Natalie Dullerud"", ""Adam Dziedzic"", ""Yunxiang Zhang"", ""Somesh Jha"", ""Nicolas Papernot"", ""Xiao Wang""]","[""machine learning"", ""deep learning"", ""privacy"", ""confidentiality"", ""security"", ""homomorphic encryption"", ""mpc"", ""differential privacy""]","Machine learning benefits from large training datasets, which may not always be possible to collect by any single entity, especially when using privacy-sensitive data. In many contexts, such as healthcare and finance, separate parties may wish to collaborate and learn from each other's data but are prevented from doing so due to privacy regulations. Some regulations prevent explicit sharing of data between parties by joining datasets in a central location (confidentiality). Others also limit implicit sharing of data, e.g., through model predictions (privacy). There is currently no method that enables machine learning in such a setting, where both confidentiality and privacy need to be preserved, to prevent both explicit and implicit sharing of data. Federated learning only provides confidentiality, not privacy, since gradients shared still contain private information. Differentially private learning assumes unreasonably large datasets. 
Furthermore, both of these learning paradigms produce a central model whose architecture was previously agreed upon by all parties rather than enabling collaborative learning where each party learns and improves their own local model. We introduce Confidential and Private Collaborative (CaPC) learning, the first method provably achieving both confidentiality and privacy in a collaborative setting. We leverage secure multi-party computation (MPC), homomorphic encryption (HE), and other techniques in combination with privately aggregated teacher models. We demonstrate how CaPC allows participants to collaborate without having to explicitly join their training sets or train a central model. Each party is able to improve the accuracy and fairness of their model, even in settings where each party has a model that performs well on their own dataset or when datasets are not IID and model architectures are heterogeneous across parties. ",/pdf/db02ce664fd72e5ee7ca8809ea8714aa7e6cfdb6.pdf,ICLR,2021,A method that enables parties to improve their own local heterogeneous machine learning models in a collaborative setting where both confidentiality and privacy need to be preserved to prevent both explicit and implicit sharing of private data. +zfO1MwBFu-,XRf5g9w2eJ-,1601310000000.0,1614990000000.0,608,Information Theoretic Regularization for Learning Global Features by Sequential VAE,"[""~Kei_Akuzawa1"", ""~Yusuke_Iwasawa1"", ""~Yutaka_Matsuo1""]","[""Kei Akuzawa"", ""Yusuke Iwasawa"", ""Yutaka Matsuo""]","[""variational autoencoder"", ""VAE"", ""disentanglement"", ""global features"", ""sequential models"", ""representation learning"", ""mutual information""]","Sequential variational autoencoders (VAEs) with global latent variable $z$ have been studied for the purpose of disentangling the global features of data, which is useful in many downstream tasks. 
To assist the sequential VAEs further in obtaining meaningful $z$, an auxiliary loss that maximizes the mutual information (MI) between the observation and $z$ is often employed. However, by analyzing the sequential VAEs from the information theoretic perspective, we can claim that simply maximizing the MI encourages the latent variables to have redundant information and prevents the disentanglement of global and local features. Based on this analysis, we derive a novel regularization method that makes $z$ informative while encouraging the disentanglement. Specifically, the proposed method removes redundant information by minimizing the MI between $z$ and the local features by using adversarial training. In the experiments, we trained state-space and autoregressive model variants using speech and image datasets. The results indicate that the proposed method improves the performance of the downstream classification and data generation tasks, thereby supporting our information theoretic perspective in the learning of global representations.",/pdf/53a473dc9fd8d9104ea8a659b79f162c92af889d.pdf,ICLR,2021,"To assist sequential VAEs with global latent variable, we propose a new information theoretic regularization method for disentangling the global factors." +S1xLZ2R5KQ,Skgp6bC9tm,1538090000000.0,1545360000000.0,1172,Maximum a Posteriori on a Submanifold: a General Image Restoration Method with GAN,"[""fluo1993@gmail.com"", ""xwu510@gmail.com""]","[""Fangzhou Luo"", ""Xiaolin Wu""]",[],"We propose a general method for various image restoration problems, such as denoising, deblurring, super-resolution and inpainting. The problem is formulated as a constrained optimization problem. Its objective is to maximize a posteriori probability of latent variables, and its constraint is that the image generated by these latent variables must be the same as the degraded image. We use a Generative Adversarial Network (GAN) as our density estimation model. 
Convincing results are obtained on MNIST dataset.",/pdf/c37f6720ae7ff4c9e85a0534c3c2fa54c5246703.pdf,ICLR,2019, +HJDBUF5le,,1478300000000.0,1488370000000.0,495,Towards a Neural Statistician,"[""h.l.edwards@sms.ed.ac.uk"", ""amos.storkey@ed.ac.uk""]","[""Harrison Edwards"", ""Amos Storkey""]",[],"An efficient learner is one who reuses what they already know to tackle a new problem. For a machine learner, this means understanding the similarities amongst datasets. In order to do this, one must take seriously the idea of working with datasets, rather than datapoints, as the key objects to model. Towards this goal, we demonstrate an extension of a variational autoencoder that can learn a method for computing representations, or statistics, of datasets in an unsupervised fashion. The network is trained to produce statistics that encapsulate a generative model for each dataset. Hence the network enables efficient learning from new datasets for both unsupervised and supervised tasks. We show that we are able to learn statistics that can be used for: clustering datasets, transferring generative models to new datasets, selecting representative samples of datasets and classifying previously unseen classes. We refer to our model as a neural statistician, and by this we mean a neural network that can learn to compute summary statistics of datasets without supervision.",/pdf/ab975c81ca2637f798440bd40c1d09f3410c8a16.pdf,ICLR,2017,Learning representations of datasets with an extension of VAEs. +mxfRhLgLg_,vBUO_5RHOEh,1601310000000.0,1614990000000.0,3608,Deep Ecological Inference,"[""~Nic_Fishman1"", ""colin@dataforprogress.org""]","[""Nic Fishman"", ""Colin McAuliffe""]","[""ecological inference"", ""representation learning"", ""multi-task learning"", ""bayesian deep learning""]","We introduce an efficient approximation to the loss function for the ecological inference problem, where individual labels are predicted from aggregates. 
This allows us to construct ecological versions of linear models, deep neural networks, and Bayesian neural networks. Using these models we infer probabilities of vote choice for candidates in the Maryland 2018 midterm elections for 2,322,277 voters in 2055 precincts. We show that increased network depth and joint learning of multiple races within an election improves the accuracy of ecological inference when compared to benchmark data from polling. Additionally we leverage data on the joint distribution of ballots (available from ballot images which are public for election administration purposes) to show that joint learning leads to significantly improved recovery of the covariance structure for multi-task ecological inference. Our approach also allows learning latent representations of voters, which we show outperform raw covariates for leave-one-out prediction. ",/pdf/6b1e8dca77737d70a170f7f77b8f910606440926.pdf,ICLR,2021,"We extend ecological inference with an efficient loss function, and build models to infer probabilities of vote choice for candidates in the Maryland 2018 midterm elections." +S1gKA6NtPS,rkx5B6b_DB,1569440000000.0,1577170000000.0,864,Deep symbolic regression,"[""petersen33@llnl.gov""]","[""Brenden K. Petersen""]","[""symbolic regression"", ""reinforcement learning"", ""automated machine learning""]","Discovering the underlying mathematical expressions describing a dataset is a core challenge for artificial intelligence. This is the problem of symbolic regression. Despite recent advances in training neural networks to solve complex tasks, deep learning approaches to symbolic regression are lacking. We propose a framework that combines deep learning with symbolic regression via a simple idea: use a large model to search the space of small models. 
More specifically, we use a recurrent neural network to emit a distribution over tractable mathematical expressions, and employ reinforcement learning to train the network to generate better-fitting expressions. Our algorithm significantly outperforms standard genetic programming-based symbolic regression in its ability to exactly recover symbolic expressions on a series of benchmark problems, both with and without added noise. More broadly, our contributions include a framework that can be applied to optimize hierarchical, variable-length objects under a black-box performance metric, with the ability to incorporate a priori constraints in situ.",/pdf/88e09c4304f4baceca29f5c70aac5e080b2e4589.pdf,ICLR,2020,"A deep learning approach to symbolic regression, in which an autoregressive RNN emits a distribution over expressions that is optimized using reinforcement learning" +B1x62TNtDS,rkevDIl_wB,1569440000000.0,1585020000000.0,799,Understanding the Limitations of Variational Mutual Information Estimators,"[""jiaming.tsong@gmail.com"", ""ermon@cs.stanford.edu""]","[""Jiaming Song"", ""Stefano Ermon""]",[],"Variational approaches based on neural networks are showing promise for estimating mutual information (MI) between high dimensional variables. However, they can be difficult to use in practice due to poorly understood bias/variance tradeoffs. We theoretically show that, under some conditions, estimators such as MINE exhibit variance that could grow exponentially with the true amount of underlying MI. We also empirically demonstrate that existing estimators fail to satisfy basic self-consistency properties of MI, such as data processing and additivity under independence. Based on a unified perspective of variational approaches, we develop a new estimator that focuses on variance reduction. 
Empirical results demonstrate that our proposed estimator exhibits improved bias-variance trade-offs on standard benchmark tasks.",/pdf/381bba14579e1a88d1b1fec45df52d0fa9dd9fc6.pdf,ICLR,2020, +HklyMhCqYQ,rJefbka5t7,1538090000000.0,1545360000000.0,1226,Super-Resolution via Conditional Implicit Maximum Likelihood Estimation,"[""ke.li@eecs.berkeley.edu"", ""shichong.peng@mail.utoronto.ca"", ""malik@eecs.berkeley.edu""]","[""Ke Li*"", ""Shichong Peng*"", ""Jitendra Malik""]","[""super-resolution""]","Single-image super-resolution (SISR) is a canonical problem with diverse applications. Leading methods like SRGAN produce images that contain various artifacts, such as high-frequency noise, hallucinated colours and shape distortions, which adversely affect the realism of the result. In this paper, we propose an alternative approach based on an extension of the method of Implicit Maximum Likelihood Estimation (IMLE). We demonstrate greater effectiveness at noise reduction and preservation of the original colours and shapes, yielding more realistic super-resolved images. ",/pdf/79a8ca4ddd576eef154c8d51b3e2fe710af859ae.pdf,ICLR,2019,We propose a new method for image super-resolution based on IMLE. +Bke6vTVYwH,SyxMkvsDPH,1569440000000.0,1577170000000.0,612,Graph convolutional networks for learning with few clean and many noisy labels,"[""iscen@google.com"", ""giorgos.tolias@cmp.felk.cvut.cz"", ""yannis@avrithis.net"", ""chum@cmp.felk.cvut.cz"", ""cordelias@google.com""]","[""Ahmet Iscen"", ""Giorgos Tolias"", ""Yannis Avrithis"", ""Ondrej Chum"", ""Cordelia Schmid""]",[],"In this work we consider the problem of learning a classifier from noisy labels when a few clean labeled examples are given. The structure of clean and noisy data is modeled by a graph per class and Graph Convolutional Networks (GCN) are used to predict class relevance of noisy examples. 
For each class, the GCN is treated as a binary classifier learning to discriminate clean from noisy examples using a weighted binary cross-entropy loss function, and then the GCN-inferred ""clean"" probability is exploited as a relevance measure. Each noisy example is weighted by its relevance when learning a classifier for the end task. We evaluate our method on an extended version of a few-shot learning problem, where the few clean examples of novel classes are supplemented with additional noisy data. Experimental results show that our GCN-based cleaning process significantly improves the classification accuracy over not cleaning the noisy data and standard few-shot classification where only few clean examples are used. The proposed GCN-based method outperforms the transductive approach (Douze et al., 2018) that is using the same additional data without labels.",/pdf/128d55ded5d893b0eeac73626635a6274b04d1da.pdf,ICLR,2020, +H1ersoRqtm,BkgYV4h5YX,1538090000000.0,1550660000000.0,621,Structured Neural Summarization,"[""t-pafern@microsoft.com"", ""miallama@microsoft.com"", ""mabrocks@microsoft.com""]","[""Patrick Fernandes"", ""Miltiadis Allamanis"", ""Marc Brockschmidt""]","[""Summarization"", ""Graphs"", ""Source Code""]","Summarization of long sequences into a concise statement is a core problem in natural language processing, requiring non-trivial understanding of the input. Based on the promising results of graph neural networks on highly structured data, we develop a framework to extend existing sequence encoders with a graph component that can reason about long-distance relationships in weakly structured data such as text. 
In an extensive evaluation, we show that the resulting hybrid sequence-graph models outperform both pure sequence models as well as pure graph models on a range of summarization tasks.",/pdf/b2f28c5fce7bbcb71813c99a230093c1dc87b202.pdf,ICLR,2019,One simple trick to improve sequence models: Compose them with a graph model +B9nDuDeanHK,RTBBAy8hV9H,1601310000000.0,1614990000000.0,2591,Weights Having Stable Signs Are Important: Finding Primary Subnetworks and Kernels to Compress Binary Weight Networks,"[""~Zhaole_Sun1"", ""~Anbang_Yao1""]","[""Zhaole Sun"", ""Anbang Yao""]",[],"Binary Weight Networks (BWNs) have significantly lower computational and memory costs compared to their full-precision counterparts. To address the non-differentiable issue of BWNs, existing methods usually use the Straight-Through-Estimator (STE). In the optimization, they learn optimal binary weight outputs represented as a combination of scaling factors and weight signs to approximate 32-bit floating-point weight values, usually with a layer-wise quantization scheme. In this paper, we begin with an empirical study of training BWNs with STE under the settings of using common techniques and tricks. We show that in the context of using batch normalization after convolutional layers, adapting scaling factors with either hand-crafted or learnable methods brings marginal or no accuracy gain to final model, while the change of weight signs is crucial in the training of BWNs. Furthermore, we observe two astonishing training phenomena. Firstly, the training of BWNs demonstrates the process of seeking primary binary sub-networks whose weight signs are determined and fixed at the early training stage, which is akin to recent findings on the lottery ticket hypothesis for efficient learning of sparse neural networks. 
Secondly, we find binary kernels in the convolutional layers of final models tend to be centered on a limited number of the most frequent binary kernels, showing binary weight networks may have the potential to be further compressed, which breaks the common wisdom that representing each weight with a single bit puts the quantization to the extreme compression. To test this hypothesis, we additionally propose a binary kernel quantization method, and we call resulting models Quantized Binary-Kernel Networks (QBNs). We hope these new experimental observations would provide new design insights to improve the training and broaden the usages of BWNs.",/pdf/dea3929651b18da796959b9a97d3a5001d8c7b63.pdf,ICLR,2021, +ryHM_fbA-,S1GGdf-AZ,1509140000000.0,1518730000000.0,867,Learning Document Embeddings With CNNs,"[""shunan@layer6.ai"", ""chundi@layer6.ai"", ""maksims.volkovs@gmail.com""]","[""Shunan Zhao"", ""Chundi Lui"", ""Maksims Volkovs""]","[""unsupervised embedding"", ""convolutional neural network""]",This paper proposes a new model for document embedding. Existing approaches either require complex inference or use recurrent neural networks that are difficult to parallelize. We take a different route and use recent advances in language modeling to develop a convolutional neural network embedding model. This allows us to train deeper architectures that are fully parallelizable. Stacking layers together increases the receptive field allowing each successive layer to model increasingly longer range semantic dependences within the document. Empirically we demonstrate superior results on two publicly available benchmarks. Full code will be released with the final version of this paper.,/pdf/7dde4526398d464473538e0cfaaf34af06538728.pdf,ICLR,2018,Convolutional neural network model for unsupervised document embedding. 
+Hkl5aoR5tm,SkedZEqcKX,1538090000000.0,1550770000000.0,828,On Self Modulation for Generative Adversarial Networks,"[""iamtingchen@gmail.com"", ""lucic@google.com"", ""neilhoulsby@google.com"", ""sylvaingelly@google.com""]","[""Ting Chen"", ""Mario Lucic"", ""Neil Houlsby"", ""Sylvain Gelly""]","[""unsupervised learning"", ""generative adversarial networks"", ""deep generative modelling""]","Training Generative Adversarial Networks (GANs) is notoriously challenging. We propose and study an architectural modification, self-modulation, which improves GAN performance across different data sets, architectures, losses, regularizers, and hyperparameter settings. Intuitively, self-modulation allows the intermediate feature maps of a generator to change as a function of the input noise vector. While reminiscent of other conditioning techniques, it requires no labeled data. In a large-scale empirical study we observe a relative decrease of 5%-35% in FID. Furthermore, all else being equal, adding this modification to the generator leads to improved performance in 124/144 (86%) of the studied settings. Self-modulation is a simple architectural change that requires no additional parameter tuning, which suggests that it can be applied readily to any GAN.",/pdf/281d614287b0f9a3d21f1a38f02f0793dcee062f.pdf,ICLR,2019,"A simple GAN modification that improves performance across many losses, architectures, regularization schemes, and datasets. " +DktZb97_Fx,8h7SHSnqiYy,1601310000000.0,1615850000000.0,2251,SenSeI: Sensitive Set Invariance for Enforcing Individual Fairness,"[""~Mikhail_Yurochkin1"", ""~Yuekai_Sun1""]","[""Mikhail Yurochkin"", ""Yuekai Sun""]","[""Algorithmic fairness"", ""invariance""]","In this paper, we cast fair machine learning as invariant machine learning. We first formulate a version of individual fairness that enforces invariance on certain sensitive sets. 
We then design a transport-based regularizer that enforces this version of individual fairness and develop an algorithm to minimize the regularizer efficiently. Our theoretical results guarantee the proposed approach trains certifiably fair ML models. Finally, in the experimental studies we demonstrate improved fairness metrics in comparison to several recent fair training procedures on three ML tasks that are susceptible to algorithmic bias.",/pdf/80d776638f6b356e13bef121dae894df08d5545f.pdf,ICLR,2021,We propose a new invariance-enforcing regularizer for training individually fair ML systems. +H1mCp-ZRZ,Hkf0T-bA-,1509130000000.0,1519370000000.0,736,Action-dependent Control Variates for Policy Optimization via Stein Identity,"[""uestcliuhao@gmail.com"", ""yihao@cs.utexas.edu"", ""maoyi@microsoft.com"", ""dennyzhou@google.com"", ""jianpeng@illinois.edu"", ""lqiang@cs.utexas.edu""]","[""Hao Liu*"", ""Yihao Feng*"", ""Yi Mao"", ""Dengyong Zhou"", ""Jian Peng"", ""Qiang Liu""]","[""reinforcement learning"", ""control variates"", ""sample efficiency"", ""variance reduction""]","Policy gradient methods have achieved remarkable successes in solving challenging reinforcement learning problems. However, it still often suffers from the large variance issue on policy gradient estimation, which leads to poor sample efficiency during training. In this work, we propose a control variate method to effectively reduce variance for policy gradient methods. Motivated by the Stein’s identity, our method extends the previous control variate methods used in REINFORCE and advantage actor-critic by introducing more flexible and general action-dependent baseline functions. Empirical studies show that our method essentially improves the sample efficiency of the state-of-the-art policy gradient approaches. 
+",/pdf/39a4c39fd098126b624c7f203f2298f3c8fe40cd.pdf,ICLR,2018, +uQnJqzkhrmj,H03r-4fU5ns5,1601310000000.0,1614990000000.0,45,Ranking Cost: One-Stage Circuit Routing by Directly Optimizing Global Objective Function,"[""~Shiyu_Huang2"", ""~Bin_Wang12"", ""lidong106@huawei.com"", ""haojianye@huawei.com"", ""~Jun_Zhu2"", ""tingchen@tsinghua.edu.cn""]","[""Shiyu Huang"", ""Bin Wang"", ""Dong Li"", ""Jianye Hao"", ""Jun Zhu"", ""Ting Chen""]","[""Evolution Strategy"", ""Circuit Routing"", ""A*"", ""PCB""]","Circuit routing has been a historically challenging problem in designing electronic systems such as very large-scale integration (VLSI) and printed circuit boards (PCBs). The main challenge is that connecting a large number of electronic components under specific design rules and constraints involves a very large search space, which is proved to be NP-complete. +Early solutions are typically designed with hard-coded heuristics, which suffer from problems of non-optimum solutions and lack of flexibility for new design needs. Although a few learning-based methods have been proposed recently, their methods are cumbersome and hard to extend to large-scale applications. In this work, we propose a new algorithm for circuit routing, named as Ranking Cost (RC), which innovatively combines search-based methods (i.e., A* algorithm) and learning-based methods (i.e., Evolution Strategies) to form an efficient and trainable router under a proper parameterization. Different from two-stage routing methods ( i.e., first global routing and then detailed routing), our method involves a one-stage procedure that directly optimizes the global objective function, thus it can be easy to adapt to new routing rules and constraints. In our method, we introduce a new set of variables called cost maps, which can help the A* router to find out proper paths to achieve the global object. 
We also train a ranking parameter, which can produce the ranking order and further improve the performance of our method. Our algorithm is trained in an end-to-end manner and does not use any artificial data or human demonstration. In the experiments, we compare our method with the sequential A* algorithm and a canonical reinforcement learning approach, and results show that our method outperforms these baselines with higher connectivity rates and better scalability. Our ablation study shows that our trained cost maps can capture the global information and guide the routing result to approach the global optimum.",/pdf/ec8e118c51f87fa20391fc131e45c55b21167147.pdf,ICLR,2021,"We propose a novel algorithm, denoted as Ranking Cost (RC), to solve the challenging circuit routing problem, and our method combines the search-based method and learning-based method which makes it more powerful, flexible and scalable." +rkle3i09K7,Hyef-Tm9Km,1538090000000.0,1545360000000.0,677,Robust Determinantal Generative Classifier for Noisy Labels and Adversarial Attacks,"[""kiminlee@kaist.ac.kr"", ""sm3199@kaist.ac.kr"", ""kibok@umich.edu"", ""honglak@eecs.umich.edu"", ""lxbosky@gmail.com"", ""jinwoos@kaist.ac.kr""]","[""Kimin Lee"", ""Sukmin Yun"", ""Kibok Lee"", ""Honglak Lee"", ""Bo Li"", ""Jinwoo Shin""]","[""Noisy Labels"", ""Adversarial Attacks"", ""Generative Models""]","Large-scale datasets may contain significant proportions of noisy (incorrect) class labels, and it is well-known that modern deep neural networks poorly generalize from such noisy training datasets. In this paper, we propose a novel inference method, Deep Determinantal Generative Classifier (DDGC), which can obtain a more robust decision boundary under any softmax neural classifier pre-trained on noisy datasets. Our main idea is inducing a generative classifier on top of hidden feature spaces of the discriminative deep model. 
By estimating the parameters of generative classifier using the minimum covariance determinant estimator, we significantly improve the classification accuracy, with neither re-training of the deep model nor changing its architectures. In particular, we show that DDGC not only generalizes well from noisy labels, but also is robust against adversarial perturbations due to its large margin property. Finally, we propose the ensemble version of DDGC to improve its performance, by investigating the layer-wise characteristics of generative classifier. Our extensive experimental results demonstrate the superiority of DDGC given different learning models optimized by various training techniques to handle noisy labels or adversarial samples. For instance, on CIFAR-10 dataset containing 45% noisy training labels, we improve the test accuracy of a deep model optimized by the state-of-the-art noise-handling training method from 33.34% to 43.02%.",/pdf/f7e446643b50ca1579ad4babafdd66de443d06f6.pdf,ICLR,2019, +SO73JUgks8,e5_V_JQkLtH,1601310000000.0,1614990000000.0,1091,AUBER: Automated BERT Regularization,"[""~Hyun_Dong_Lee1"", ""~Seongmin_Lee2"", ""~U_Kang1""]","[""Hyun Dong Lee"", ""Seongmin Lee"", ""U Kang""]","[""BERT Regularization"", ""Reinforcement Learning"", ""Automated Regularization""]","How can we effectively regularize BERT? Although BERT proves its effectiveness in various downstream natural language processing tasks, it often overfits when there are only a small number of training instances. A promising direction to regularize BERT is based on pruning its attention heads based on a proxy score for head importance. However, heuristic-based methods are usually suboptimal since they predetermine the order by which attention heads are pruned. In order to overcome such a limitation, we propose AUBER, an effective regularization method that leverages reinforcement learning to automatically prune attention heads from BERT. 
Instead of depending on heuristics or rule-based policies, AUBER learns a pruning policy that determines which attention heads should or should not be pruned for regularization. Experimental results show that AUBER outperforms existing pruning methods by achieving up to 10% better accuracy. In addition, our ablation study empirically demonstrates the effectiveness of our design choices for AUBER.",/pdf/6795b7bb51964109e24e5ada5b74926007ccd593.pdf,ICLR,2021,We propose a method to automatically regularize BERT to improve its accuracy via reinforcement learning. +CNA6ZrpNDar,zpw4sXjT20,1601310000000.0,1614990000000.0,1586,On the Decision Boundaries of Neural Networks. A Tropical Geometry Perspective,"[""~Motasem_Alfarra1"", ""~Adel_Bibi1"", ""~Hasan_Abed_Al_Kader_Hammoud1"", ""~Mohamed_Gaafar1"", ""~Bernard_Ghanem1""]","[""Motasem Alfarra"", ""Adel Bibi"", ""Hasan Abed Al Kader Hammoud"", ""Mohamed Gaafar"", ""Bernard Ghanem""]","[""Tropical Geometry"", ""Decision Boundaries"", ""Neural Networks""]","This work tackles the problem of characterizing and understanding the decision boundaries of neural networks with piecewise linear non-linearity activations. We use tropical geometry, a new development in the area of algebraic geometry, to characterize the decision boundaries of a simple network of the form (Affine, ReLU, Affine). Our main finding is that the decision boundaries are a subset of a tropical hypersurface, which is intimately related to a polytope formed by the convex hull of two zonotopes. The generators of these zonotopes are functions of the network parameters. This geometric characterization provides new perspectives to three tasks. Specifically, we propose a new tropical perspective to the lottery ticket hypothesis, where we view the effect of different initializations on the tropical geometric representation of a network's decision boundaries. 
Moreover, we propose new tropical based optimization problems that directly influence the decision boundaries of the network for the tasks of network pruning (removing network parameters not contributing to the tropical geometric representation of the decision boundaries) and the generation of adversarial attacks.",/pdf/dfb5433a3dbe5e72fd1eb255f2de60ac75e950c6.pdf,ICLR,2021,"This paper characterizes the decision boundaries of neural networks using tropical geometry, and leverages this characterization into several applications." +rk6H0ZbRb,Sknr0-WAW,1509130000000.0,1518730000000.0,742,Intriguing Properties of Adversarial Examples,"[""cubuk@google.com"", ""barretzoph@google.com"", ""schsam@google.com"", ""qvl@google.com""]","[""Ekin Dogus Cubuk"", ""Barret Zoph"", ""Samuel Stern Schoenholz"", ""Quoc V. Le""]","[""adversarial examples"", ""universality"", ""neural architecture search""]","It is becoming increasingly clear that many machine learning classifiers are vulnerable to adversarial examples. In attempting to explain the origin of adversarial examples, previous studies have typically focused on the fact that neural networks operate on high dimensional data, they overfit, or they are too linear. Here we show that distributions of logit differences have a universal functional form. This functional form is independent of architecture, dataset, and training protocol; nor does it change during training. This leads to adversarial error having a universal scaling, as a power-law, with respect to the size of the adversarial perturbation. We show that this universality holds for a broad range of datasets (MNIST, CIFAR10, ImageNet, and random data), models (including state-of-the-art deep networks, linear models, adversarially trained networks, and networks trained on randomly shuffled labels), and attacks (FGSM, step l.l., PGD). Motivated by these results, we study the effects of reducing prediction entropy on adversarial robustness. 
Finally, we study the effect of network architectures on adversarial sensitivity. To do this, we use neural architecture search with reinforcement learning to find adversarially robust architectures on CIFAR10. Our resulting architecture is more robust to white \emph{and} black box attacks compared to previous attempts. +",/pdf/0dc93000a5acf4b206b769dbe0cc6de1231fe944.pdf,ICLR,2018,"Adversarial error has similar power-law form for all datasets and models studied, and architecture matters." +kPheYCFm0Od,WdJxXdEPvuE,1601310000000.0,1614990000000.0,1501,Variational Multi-Task Learning,"[""~Jiayi_Shen3"", ""~Xiantong_Zhen1"", ""~Marcel_Worring1"", ""~Ling_Shao1""]","[""Jiayi Shen"", ""Xiantong Zhen"", ""Marcel Worring"", ""Ling Shao""]","[""multi-task learning"", ""variational Bayesian inference"", ""Gumbel-softmax priors""]","Multi-task learning aims to improve the overall performance of a set of tasks by leveraging their relatedness. When training data is limited using priors is pivotal, but currently this is done in ad-hoc ways. In this paper, we develop variational multi-task learning - VMTL, a general probabilistic inference framework for simultaneously learning multiple related tasks. We cast multi-task learning as a variational Bayesian inference problem, which enables task relatedness to be explored in a principled way by specifying priors. We introduce Gumbel-softmax priors to condition the prior of each task on related tasks. Each prior is represented as a mixture of variational posteriors of other related tasks and the mixing weights are learned in a data-driven manner for each individual task. The posteriors over representations and classifiers are inferred jointly for all tasks and individual tasks are able to improve their performance by using the shared inductive bias. 
Experimental results demonstrate that VMTL is able to tackle challenging multi-task learning with limited training data well, and it achieves state-of-the-art performance on four benchmarks, consistently surpassing previous methods.",/pdf/430d195178c712782255fdb67f64aebc2f861be4.pdf,ICLR,2021,"We develop variational multi-task learning, a general probabilistic inference framework for exploring task relatedness for both representations and classifiers." +rJeQYjRqYX,rye19ryzFX,1538090000000.0,1545360000000.0,428,Effective Path: Know the Unknowns of Neural Network,"[""qiuyuxian@sjtu.edu.cn"", ""leng-jw@sjtu.edu.cn"", ""yzhu@rochester.edu"", ""chen-quan@sjtu.edu.cn"", ""lichao@cs.sjtu.edu.cn"", ""guo-my@cs.sjtu.edu.cn""]","[""Yuxian Qiu"", ""Jingwen Leng"", ""Yuhao Zhu"", ""Quan Chen"", ""Chao Li"", ""Minyi Guo""]",[],"Despite their enormous success, there is still no solid understanding of deep neural network’s working mechanism. As such, researchers have demonstrated DNNs are vulnerable to small input perturbation, i.e., adversarial attacks. This work proposes the effective path as a new approach to exploring DNNs' internal organization. The effective path is an ensemble of synapses and neurons, which is reconstructed from a trained DNN using our activation-based backward algorithm. The per-image effective path can be aggregated to the class-level effective path, through which we observe that adversarial images activate effective path different from normal images. We propose an effective path similarity-based method to detect adversarial images and demonstrate its high accuracy and broad applicability. 
+",/pdf/9bb9c3302a4b7aa271c78443ae3e79172558c35c.pdf,ICLR,2019, +Bk8aOm9xl,,1478270000000.0,1484900000000.0,195,Surprise-Based Intrinsic Motivation for Deep Reinforcement Learning,"[""jachiam@berkeley.edu"", ""sastry@coe.berkeley.edu""]","[""Joshua Achiam"", ""Shankar Sastry""]","[""Reinforcement Learning""]","Exploration in complex domains is a key challenge in reinforcement learning, especially for tasks with very sparse rewards. Recent successes in deep reinforcement learning have been achieved mostly using simple heuristic exploration strategies such as $\epsilon$-greedy action selection or Gaussian control noise, but there are many tasks where these methods are insufficient to make any learning progress. Here, we consider more complex heuristics: efficient and scalable exploration strategies that maximize a notion of an agent's surprise about its experiences via intrinsic motivation. We propose to learn a model of the MDP transition probabilities concurrently with the policy, and to form intrinsic rewards that approximate the KL-divergence of the true transition probabilities from the learned model. One of our approximations results in using surprisal as intrinsic motivation, while the other gives the $k$-step learning progress. We show that our incentives enable agents to succeed in a wide range of environments with high-dimensional state spaces and very sparse rewards, including continuous control tasks and games in the Atari RAM domain, outperforming several other heuristic exploration techniques. +",/pdf/5279b8d42388c4de00cddea9345b1a0a3af617af.pdf,ICLR,2017,Learn a dynamics model and use it to make your agent boldly go where it has not gone before. 
+SkgQBn0cF7,r1g4dF65YX,1538090000000.0,1550900000000.0,1527,Modeling the Long Term Future in Model-Based Reinforcement Learning,"[""rosemary.nan.ke@gmail.com"", ""asg@fb.com"", ""ahmed.touati@umontreal.ca"", ""anirudhgoyal9119@gmail.com"", ""yoshua.umontreal@gmail.com"", ""parikh@gatech.edu"", ""dbatra@gatech.edu""]","[""Nan Rosemary Ke"", ""Amanpreet Singh"", ""Ahmed Touati"", ""Anirudh Goyal"", ""Yoshua Bengio"", ""Devi Parikh"", ""Dhruv Batra""]","[""model-based reinforcement learning"", ""variation inference""]","In model-based reinforcement learning, the agent interleaves between model learning and planning. These two components are inextricably intertwined. If the model is not able to provide sensible long-term prediction, the executed planer would exploit model flaws, which can yield catastrophic failures. This paper focuses on building a model that reasons about the long-term future and demonstrates how to use this for efficient planning and exploration. To this end, we build a latent-variable autoregressive model by leveraging recent ideas in variational inference. We argue that forcing latent variables to carry future information through an auxiliary task substantially improves long-term predictions. Moreover, by planning in the latent space, the planner's solution is ensured to be within regions where the model is valid. An exploration strategy can be devised by searching for unlikely trajectories under the model. Our methods achieves higher reward faster compared to baselines on a variety of tasks and environments in both the imitation learning and model-based reinforcement learning settings. ",/pdf/a334c536d3c607d6b8948ba05cfea80a863db50f.pdf,ICLR,2019,"incorporating, in the model, latent variables that encode future content improves the long-term prediction accuracy, which is critical for better planning in model-based RL." 
+S1ejj64YvS,r1g6YUyuwS,1569440000000.0,1577170000000.0,756,Good Semi-supervised VAE Requires Tighter Evidence Lower Bound,"[""fenghz@zju.edu.cn"", ""kong@cs.umd.edu"", ""zhangtianye1026@zju.edu.cn"", ""3160104527@zju.edu.cn"", ""chenwei@cad.zju.edu.cn""]","[""Haozhe Feng"", ""Kezhi Kong"", ""Tianye Zhang"", ""Siyue Xue"", ""Wei Chen""]","[""VAE"", ""Semi-supervised Learning"", ""ELBO"", ""Generative Model""]","Semi-supervised learning approaches based on generative models have now encountered 3 challenges: (1) The two-stage training strategy is not robust. (2) Good semi-supervised learning results and good generative performance can not be obtained at the same time. (3) Even at the expense of sacrificing generative performance, the semi-supervised classification results are still not satisfactory. To address these problems, we propose One-stage Semi-suPervised Optimal Transport VAE (OSPOT-VAE), a one-stage deep generative model that theoretically unifies the generation and classification loss in one ELBO framework and achieves a tighter ELBO by applying the optimal transport scheme to the distribution of latent variables. We show that with tighter ELBO, our OSPOT-VAE surpasses the best semi-supervised generative models by a large margin across many benchmark datasets. For example, we reduce the error rate from 14.41% to 6.11% on Cifar-10 with 4k labels and achieve state-of-the-art performance with 25.30% on Cifar-100 with 10k labels. We also demonstrate that good generative models and semi-supervised results can be achieved simultaneously by OSPOT-VAE.",/pdf/3df0bcc84eddd1562646ab7158be628feb40fe60.pdf,ICLR,2020,"we propose OSPOT-VAE, a one-stage deep generative model that unifies the generation and classification loss in one ELBO framework and achieves a tighter ELBO." 
+SyxAb30cY7,BJg7Ees5FQ,1538090000000.0,1550880000000.0,1223,Robustness May Be at Odds with Accuracy,"[""tsipras@mit.edu"", ""shibani@mit.edu"", ""engstrom@mit.edu"", ""turneram@mit.edu"", ""madry@mit.edu""]","[""Dimitris Tsipras"", ""Shibani Santurkar"", ""Logan Engstrom"", ""Alexander Turner"", ""Aleksander Madry""]","[""adversarial examples"", ""robust machine learning"", ""robust optimization"", ""deep feature representations""]","We show that there exists an inherent tension between the goal of adversarial robustness and that of standard generalization. +Specifically, training robust models may not only be more resource-consuming, but also lead to a reduction of standard accuracy. We demonstrate that this trade-off between the standard accuracy of a model and its robustness to adversarial perturbations provably exists even in a fairly simple and natural setting. These findings also corroborate a similar phenomenon observed in practice. Further, we argue that this phenomenon is a consequence of robust classifiers learning fundamentally different feature representations than standard classifiers. These differences, in particular, seem to result in unexpected benefits: the features learned by robust models tend to align better with salient data characteristics and human perception.",/pdf/9672620a3cd76af97a26f07b9f1e5a7df627fb18.pdf,ICLR,2019,"We show that adversarial robustness might come at the cost of standard classification performance, but also yields unexpected benefits." 
+SJeLO34KwS,ByxV947B8S,1569440000000.0,1577170000000.0,43,Dimensional Reweighting Graph Convolution Networks,"[""zoux18@mails.tsinghua.edu.cn"", ""jqy@stanford.edu"", ""zhangjianwei.zjw@alibaba-inc.com"", ""ericzhou.zc@alibaba-inc.com"", ""yaozijun@bupt.edu.cn"", ""yang.yhx@alibaba-inc.com"", ""jietang@tsinghua.edu.cn""]","[""Xu Zou"", ""Qiuye Jia"", ""Jianwei Zhang"", ""Chang Zhou"", ""Zijun Yao"", ""Hongxia Yang"", ""Jie Tang""]","[""graph convolutional networks"", ""representation learning"", ""mean field theory"", ""variance reduction"", ""node classification""]","In this paper, we propose a method named Dimensional reweighting Graph Convolutional Networks (DrGCNs), to tackle the problem of variance between dimensional information in the node representations of GCNs. We prove that DrGCNs can reduce the variance of the node representations by connecting our problem to the theory of the mean field. However, practically, we find that the degrees DrGCNs help vary severely on different datasets. We revisit the problem and develop a new measure K to quantify the effect. This measure guides when we should use dimensional reweighting in GCNs and how much it can help. Moreover, it offers insights to explain the improvement obtained by the proposed DrGCNs. The dimensional reweighting block is light-weighted and highly flexible to be built on most of the GCN variants. Carefully designed experiments, including several fixes on duplicates, information leaks, and wrong labels of the well-known node classification benchmark datasets, demonstrate the superior performances of DrGCNs over the existing state-of-the-art approaches. Significant improvements can also be observed on a large scale industrial dataset.",/pdf/527c97fd66dfc8eb44d55309ad1ac34782a861af.pdf,ICLR,2020,"We propose a simple yet effective reweighting scheme for GCNs, theoretically supported by the mean field theory." 
+CHTHamtufWN,f6cCd5U0Hg-,1601310000000.0,1614990000000.0,3136,Continual Invariant Risk Minimization,"[""~Francesco_Alesiani1"", ""~Shujian_Yu1"", ""~Mathias_Niepert1""]","[""Francesco Alesiani"", ""Shujian Yu"", ""Mathias Niepert""]","[""Supervised Learning"", ""Causal Learning"", ""Invariant Risk Minimization"", ""Continual Learning""]","Empirical risk minimization can lead to poor generalization behaviour on unseen environments if the learned model does not capture invariant feature representations. Invariant risk minimization (IRM) is a recent proposal for discovering environment-invariant representations. It was introduced by Arjovsky et al. (2019) and extended by Ahuja et al. (2020). The assumption of IRM is that all environments are available to the learning system at the same time. With this work, we generalize the concept of IRM to scenarios where environments are observed sequentially. We show that existing approaches, including those designed for continual learning, fail to identify the invariant features and models across sequentially presented environments. We extend IRM under a variational Bayesian and bilevel framework, creating a general approach to continual invariant risk minimization. We also describe a strategy to solve the optimization problems using a variant of the alternating direction method of multipliers (ADMM). 
We show empirically using multiple datasets and with multiple sequential environments that the proposed methods outperform or are competitive with prior approaches.",/pdf/df1dd4b3bb815a98877cfa1f602aa565675dd445.pdf,ICLR,2021,We study the extension of Invariant Risk Minimization in sequential environments +HyUNwulC-,SkHEwdgCZ,1509100000000.0,1519280000000.0,310,Parallelizing Linear Recurrent Neural Nets Over Sequence Length,"[""eric@ericmart.in"", ""chris.j.cundy@gmail.com""]","[""Eric Martin"", ""Chris Cundy""]","[""rnn"", ""sequence"", ""parallel"", ""qrnn"", ""sru"", ""gilr"", ""gilr-lstm""]","Recurrent neural networks (RNNs) are widely used to model sequential data but +their non-linear dependencies between sequence elements prevent parallelizing +training over sequence length. We show the training of RNNs with only linear +sequential dependencies can be parallelized over the sequence length using the +parallel scan algorithm, leading to rapid training on long sequences even with +small minibatch size. We develop a parallel linear recurrence CUDA kernel and +show that it can be applied to immediately speed up training and inference of +several state of the art RNN architectures by up to 9x. We abstract recent work +on linear RNNs into a new framework of linear surrogate RNNs and develop a +linear surrogate model for the long short-term memory unit, the GILR-LSTM, that +utilizes parallel linear recurrence. We extend sequence learning to new +extremely long sequence regimes that were previously out of reach by +successfully training a GILR-LSTM on a synthetic sequence classification task +with a one million timestep dependency. +",/pdf/e4ececba5b34febc0f471b48fee5a68affa857f6.pdf,ICLR,2018,use parallel scan to parallelize linear recurrent neural nets. 
train model on length 1 million dependency +ByxF-nAqYX,HyllAkA9F7,1538090000000.0,1545360000000.0,1191,Locally Linear Unsupervised Feature Selection,"[""doquet@lri.fr"", ""sebag@lri.fr""]","[""Guillaume DOQUET"", ""Mich\u00e8le SEBAG""]","[""Unsupervised Learning"", ""Feature Selection"", ""Dimension Reduction""]","The paper, interested in unsupervised feature selection, aims to retain the features best accounting for the local patterns in the data. The proposed approach, called Locally Linear Unsupervised Feature Selection, relies on a dimensionality reduction method to characterize such patterns; each feature is thereafter assessed according to its compliance w.r.t. the local patterns, taking inspiration from Locally Linear Embedding (Roweis and Saul, 2000). The experimental validation of the approach on the scikit-feature benchmark suite demonstrates its effectiveness compared to the state of the art.",/pdf/1bf1de18b66705381e81f187a3dcdaff5d13bae8.pdf,ICLR,2019,Unsupervised feature selection through capturing the local linear structure of the data +ByKWUeWA-,By_WUxZAW,1509130000000.0,1518730000000.0,588,GANITE: Estimation of Individualized Treatment Effects using Generative Adversarial Nets,"[""jsyoon0823@gmail.com"", ""james.jordon@hertford.ox.ac.uk"", ""mihaela.vanderschaar@oxford-man.ox.ac.uk""]","[""Jinsung Yoon"", ""James Jordon"", ""Mihaela van der Schaar""]","[""Individualized Treatment Effects"", ""Counterfactual Estimation"", ""Generative Adversarial Nets""]","Estimating individualized treatment effects (ITE) is a challenging task due to the need for an individual's potential outcomes to be learned from biased data and without having access to the counterfactuals. We propose a novel method for inferring ITE based on the Generative Adversarial Nets (GANs) framework. 
Our method, termed Generative Adversarial Nets for inference of Individualized Treatment Effects (GANITE), is motivated by the possibility that we can capture the uncertainty in the counterfactual distributions by attempting to learn them using a GAN. We generate proxies of the counterfactual outcomes using a counterfactual generator, G, and then pass these proxies to an ITE generator, I, in order to train it. By modeling both of these using the GAN framework, we are able to infer based on the factual data, while still accounting for the unseen counterfactuals. We test our method on three real-world datasets (with both binary and multiple treatments) and show that GANITE outperforms state-of-the-art methods.",/pdf/8b428f92ef641d25b3ac7bfd62c16c0284d17b7d.pdf,ICLR,2018, +Svfh1_hYEtF,_LOY9E6KXIH,1601310000000.0,1614990000000.0,496,Federated Continual Learning with Weighted Inter-client Transfer,"[""~Jaehong_Yoon1"", ""~Wonyong_Jeong1"", ""~Giwoong_Lee1"", ""~Eunho_Yang1"", ""~Sung_Ju_Hwang1""]","[""Jaehong Yoon"", ""Wonyong Jeong"", ""Giwoong Lee"", ""Eunho Yang"", ""Sung Ju Hwang""]","[""Continual Learning"", ""Federated Learning"", ""Deep Learning""]","There has been a surge of interest in continual learning and federated learning, both of which are important in deep neural networks in real-world scenarios. Yet little research has been done regarding the scenario where each client learns on a sequence of tasks from a private local data stream. This problem of federated continual learning poses new challenges to continual learning, such as utilizing knowledge from other clients, while preventing interference from irrelevant knowledge. 
To resolve these issues, we propose a novel federated continual learning framework, Federated Weighted Inter-client Transfer (FedWeIT), which decomposes the network weights into global federated parameters and sparse task-specific parameters, and each client receives selective knowledge from other clients by taking a weighted combination of their task-specific parameters. FedWeIT minimizes interference between incompatible tasks, and also allows positive knowledge transfer across clients during learning. We validate our FedWeIT against existing federated learning and continual learning methods under varying degrees of task similarity across clients, and our model significantly outperforms them with a large reduction in the communication cost.",/pdf/2fe0338a212a03de7d3e6c601e313a9d41cf4227.pdf,ICLR,2021,"We define a problem of federated continual learning and propose a novel federated continual learning framework, Weighted Inter-client Transfer (FedWeIT)." +HkgNdt26Z,HJyNdKnT-,1508840000000.0,1523520000000.0,61,Distributed Fine-tuning of Language Models on Private Data,"[""v.popov@samsung.com"", ""m.kudinov@samsung.com"", ""p.irina@samsung.com"", ""p.vytovtov@partner.samsung.com"", ""a.nevidomsky@samsung.com""]","[""Vadim Popov"", ""Mikhail Kudinov"", ""Irina Piontkovskaya"", ""Petr Vytovtov"", ""Alex Nevidomsky""]","[""distributed training"", ""federated learning"", ""language modeling"", ""differential privacy""]","One of the big challenges in machine learning applications is that training data can be different from the real-world data faced by the algorithm. In language modeling, users’ language (e.g. in private messaging) could change in a year and be completely different from what we observe in publicly available data. At the same time, public data can be used for obtaining general knowledge (i.e. general model of English). 
We study approaches to distributed fine-tuning of a general model on user private data with the additional requirements of maintaining the quality on the general data and minimization of communication costs. We propose a novel technique that significantly improves prediction quality on users’ language compared to a general model and outperforms gradient compression methods in terms of communication efficiency. The proposed procedure is fast and leads to an almost 70% perplexity reduction and 8.7 percentage point improvement in keystroke saving rate on informal English texts. Finally, we propose an experimental framework for evaluating differential privacy of distributed training of language models and show that our approach has good privacy guarantees.",/pdf/de7dad4dfed9bfaa3818301db1b77de5c25a09f8.pdf,ICLR,2018,We propose a method of distributed fine-tuning of language models on user devices without collection of private data +kWSeGEeHvF8,zLdWOa8A_K,1601310000000.0,1616050000000.0,2401,Benchmarks for Deep Off-Policy Evaluation,"[""~Justin_Fu1"", ""~Mohammad_Norouzi1"", ""~Ofir_Nachum1"", ""~George_Tucker1"", ""~ziyu_wang1"", ""~Alexander_Novikov1"", ""~Mengjiao_Yang1"", ""~Michael_R_Zhang1"", ""~Yutian_Chen1"", ""~Aviral_Kumar2"", ""~Cosmin_Paduraru1"", ""~Sergey_Levine1"", ""~Thomas_Paine1""]","[""Justin Fu"", ""Mohammad Norouzi"", ""Ofir Nachum"", ""George Tucker"", ""ziyu wang"", ""Alexander Novikov"", ""Mengjiao Yang"", ""Michael R Zhang"", ""Yutian Chen"", ""Aviral Kumar"", ""Cosmin Paduraru"", ""Sergey Levine"", ""Thomas Paine""]","[""reinforcement learning"", ""off-policy evaluation"", ""benchmarks""]","Off-policy evaluation (OPE) holds the promise of being able to leverage large, offline datasets for both evaluating and selecting complex policies for decision making. 
The ability to learn offline is particularly important in many real-world domains, such as in healthcare, recommender systems, or robotics, where online data collection is an expensive and potentially dangerous process. Being able to accurately evaluate and select high-performing policies without requiring online interaction could yield significant benefits in safety, time, and cost for these applications. While many OPE methods have been proposed in recent years, comparing results between papers is difficult because currently there is a lack of a comprehensive and unified benchmark, and measuring algorithmic progress has been challenging due to the lack of difficult evaluation tasks. In order to address this gap, we present a collection of policies that in conjunction with existing offline datasets can be used for benchmarking off-policy evaluation. Our tasks include a range of challenging high-dimensional continuous control problems, with wide selections of datasets and policies for performing policy selection. The goal of our benchmark is to provide a standardized measure of progress that is motivated from a set of principles designed to challenge and test the limits of existing OPE methods. We perform an evaluation of state-of-the-art algorithms and provide open-source access to our data and code to foster future research in this area. ",/pdf/3a90850ebecc25b81a9534180c75842a2b672812.pdf,ICLR,2021,A benchmark proposal for off-policy evaluation and policy selection. +vQzcqQWIS0q,53_FNATacqP,1601310000000.0,1615450000000.0,1708,Learnable Embedding sizes for Recommender Systems,"[""~Siyi_Liu1"", ""~Chen_Gao3"", ""~Yihong_Chen3"", ""jindp@tsinghua.edu.cn"", ""~Yong_Li3""]","[""Siyi Liu"", ""Chen Gao"", ""Yihong Chen"", ""Depeng Jin"", ""Yong Li""]","[""Recommender Systems"", ""Deep Learning"", ""Embedding Size""]","The embedding-based representation learning is commonly used in deep learning recommendation models to map the raw sparse features to dense vectors. 
The traditional embedding manner that assigns a uniform size to all features has two issues. First, the numerous features inevitably lead to a gigantic embedding table that causes a high memory usage cost. Second, it is likely to cause the over-fitting problem for those features that do not require too large representation capacity. Existing works that try to address the problem always cause a significant drop in recommendation performance or suffers from the limitation of unaffordable training time cost. In this paper, we proposed a novel approach, named PEP (short for Plug-in Embedding Pruning), to reduce the size of the embedding table while avoiding the drop of recommendation accuracy. PEP prunes embedding parameter where the pruning threshold(s) can be adaptively learned from data. Therefore we can automatically obtain a mixed-dimension embedding-scheme by pruning redundant parameters for each feature. PEP is a general framework that can plug in various base recommendation models. Extensive experiments demonstrate it can efficiently cut down embedding parameters and boost the base model's performance. Specifically, it achieves strong recommendation performance while reducing 97-99% parameters. As for the computation cost, PEP only brings an additional 20-30% time cost compare with base models. ",/pdf/57ca06dcb49d0ba8a6c705fcd2a34d58a7cf8c3a.pdf,ICLR,2021,Learning flexible feature-aware embedding sizes effectively and efficiently for recommendation models. 
+BkgeQ1BYwS,BJl-kB3dDH,1569440000000.0,1577170000000.0,1603,Implicit Generative Modeling for Efficient Exploration,"[""ratzlafn@oregonstate.edu"", ""qinxun.bai@horizon.ai"", ""lif@oregonstate.edu"", ""wei.xu@horizon.ai""]","[""Neale Ratzlaff"", ""Qinxun Bai"", ""Li Fuxin"", ""Wei Xu""]","[""Reinforcement Learning"", ""Exploration"", ""Intrinsic Reward"", ""Implicit Generative Models""]","Efficient exploration remains a challenging problem in reinforcement learning, especially for those tasks where rewards from environments are sparse. A commonly used approach for exploring such environments is to introduce some ""intrinsic"" reward. In this work, we focus on model uncertainty estimation as an intrinsic reward for efficient exploration. In particular, we introduce an implicit generative modeling approach to estimate a Bayesian uncertainty of the agent's belief of the environment dynamics. Each random draw from our generative model is a neural network that instantiates the dynamic function, hence multiple draws would approximate the posterior, and the variance in the future prediction based on this posterior is used as an intrinsic reward for exploration. We design a training algorithm for our generative model based on the amortized Stein Variational Gradient Descent. In experiments, we compare our implementation with state-of-the-art intrinsic reward-based exploration approaches, including two recent approaches based on an ensemble of dynamic models. In challenging exploration tasks, our implicit generative model consistently outperforms competing approaches regarding data efficiency in exploration.",/pdf/3c9abf680176559023acfd7c8657c85d1dbbdb14.pdf,ICLR,2020,We efficiently explore by modeling uncertainty in the environment dynamics with an implicit generative model. 
+BJl07ySKvS,HylIYt3dwS,1569440000000.0,1588010000000.0,1636,Guiding Program Synthesis by Learning to Generate Examples,"[""llaich@ethz.ch"", ""pavol.bielik@inf.ethz.ch"", ""martin.vechev@inf.ethz.ch""]","[""Larissa Laich"", ""Pavol Bielik"", ""Martin Vechev""]","[""program synthesis"", ""programming by examples""]","A key challenge of existing program synthesizers is ensuring that the synthesized program generalizes well. This can be difficult to achieve as the specification provided by the end user is often limited, containing as few as one or two input-output examples. In this paper we address this challenge via an iterative approach that finds ambiguities in the provided specification and learns to resolve these by generating additional input-output examples. The main insight is to reduce the problem of selecting which program generalizes well to the simpler task of deciding which output is correct. As a result, to train our probabilistic models, we can take advantage of the large amounts of data in the form of program outputs, which are often much easier to obtain than the corresponding ground-truth programs.",/pdf/e38826be42dfd8f83e0dd46f45c4225c5274bf01.pdf,ICLR,2020, +BkxGAREYwB,HJlUtI5ODB,1569440000000.0,1577170000000.0,1421,Deep Expectation-Maximization in Hidden Markov Models via Simultaneous Perturbation Stochastic Approximation,"[""chongli@uw.edu"", ""dshen@alibaba-inc.com"", ""cjshi@uw.edu"", ""yang.yhx@alibaba-inc.com""]","[""Chong Li"", ""Dan Shen"", ""C.J. Richard Shi"", ""Hongxia Yang""]","[""recommender system"", ""gradient approximation"", ""Hidden Markov Model""]","We propose a novel method to estimate the parameters of a collection of Hidden Markov Models (HMM), each of which corresponds to a set of known features. The observation sequence of an individual HMM is noisy and/or insufficient, making parameter estimation solely based on its corresponding observation sequence a challenging problem. 
The key idea is to combine the classical Expectation-Maximization (EM) algorithm with a neural network, while these two are jointly trained in an end-to-end fashion, mapping the HMM features to its parameters and effectively fusing the information across different HMMs. In order to address the numerical difficulty in computing the gradient of the EM iteration, simultaneous perturbation stochastic approximation (SPSA) is employed to approximate the gradient. We also provide a rigorous proof that the approximated gradient due to SPSA converges to the true gradient almost surely. The efficacy of the proposed method is demonstrated on synthetic data as well as a real-world e-Commerce dataset. ",/pdf/c6589ab53e0441d694d56c34773e4c5084fa4b8a.pdf,ICLR,2020,We rendered Expectation-Maximization iteration as a network layer by approximating its gradient. +r1xGnA4Kvr,HJgHY8tOvB,1569440000000.0,1583910000000.0,1347,Biologically inspired sleep algorithm for increased generalization and adversarial robustness in deep neural networks,"[""tttadros@ucsd.edu"", ""gkrishnan@ucsd.edu"", ""ramyaa.ramyaa@gmail.com"", ""mbazhenov@ucsd.edu""]","[""Timothy Tadros"", ""Giri Krishnan"", ""Ramyaa Ramyaa"", ""Maxim Bazhenov""]","[""Adversarial Robustness"", ""Generalization"", ""Neural Computing"", ""Deep Learning""]","Current artificial neural networks (ANNs) can perform and excel at a variety of tasks ranging from image classification to spam detection through training on large datasets of labeled data. While the trained network may perform well on similar testing data, inputs that differ even slightly from the training data may trigger unpredictable behavior. Due to this limitation, it is possible to design inputs with very small perturbations that can result in misclassification. These adversarial attacks present a security risk to deployed ANNs and indicate a divergence between how ANNs and humans perform classification. 
Humans are robust at behaving in the presence of noise and are capable of correctly classifying objects that are noisy, blurred, or otherwise distorted. It has been hypothesized that sleep promotes generalization of knowledge and improves robustness against noise in animals and humans. In this work, we utilize a biologically inspired sleep phase in ANNs and demonstrate the benefit of sleep on defending against adversarial attacks as well as in increasing ANN classification robustness. We compare the sleep algorithm's performance on various robustness tasks with two previously proposed adversarial defenses - defensive distillation and fine-tuning. We report an increase in robustness after sleep phase to adversarial attacks as well as to general image distortions for three datasets: MNIST, CUB200, and a toy dataset. Overall, these results demonstrate the potential for biologically inspired solutions to solve existing problems in ANNs and guide the development of more robust, human-like ANNs.",/pdf/23bc921edfb8cc42a65a409f7156a6ef2daed92f.pdf,ICLR,2020,We describe a biologically inspired sleep algorithm for increasing an artificial neural network's ability to extract the gist of a training set and exhibit increased robustness to adversarial attacks and general distortions. +qkLMTphG5-h,n8xqTVfFZiR,1601310000000.0,1615810000000.0,831,Repurposing Pretrained Models for Robust Out-of-domain Few-Shot Learning,"[""~Namyeong_Kwon1"", ""~Hwidong_Na1"", ""~Gabriel_Huang1"", ""~Simon_Lacoste-Julien1""]","[""Namyeong Kwon"", ""Hwidong Na"", ""Gabriel Huang"", ""Simon Lacoste-Julien""]","[""Meta-learning"", ""Few-shot learning"", ""Out-of-domain"", ""Uncertainty"", ""Ensemble"", ""Adversarial training"", ""Stepsize optimization""]","Model-agnostic meta-learning (MAML) is a popular method for few-shot learning but assumes that we have access to the meta-training set. 
In practice, training on the meta-training set may not always be an option due to data privacy concerns, intellectual property issues, or merely lack of computing resources. In this paper, we consider the novel problem of repurposing pretrained MAML checkpoints to solve new few-shot classification tasks. Because of the potential distribution mismatch, the original MAML steps may no longer be optimal. Therefore we propose an alternative meta-testing procedure and combine MAML gradient steps with adversarial training and uncertainty-based stepsize adaptation. Our method outperforms ""vanilla"" MAML on same-domain and cross-domains benchmarks using both SGD and Adam optimizers and shows improved robustness to the choice of base stepsize.",/pdf/d7cdb4aa01e48fb3d0a82cabe99b8fe0b1c57f47.pdf,ICLR,2021,We propose an alternative meta-testing procedure and combine MAML gradient steps with adversarial training and uncertainty-based stepsize adaptation. +H1eVlgHKPr,rJxKp2yYPr,1569440000000.0,1577170000000.0,2094,Event Discovery for History Representation in Reinforcement Learning,"[""aleksandr.ermolov@unitn.it"", ""enver.sangineto@unitn.it"", ""niculae.sebe@unitn.it""]","[""Aleksandr Ermolov"", ""Enver Sangineto"", ""Nicu Sebe""]","[""reinforcement learning"", ""self-supervision"", ""POMDP""]","Environments in Reinforcement Learning (RL) are usually only partially observable. To address this problem, a possible solution is to provide the agent with information about past observations. While common methods represent this history using a Recurrent Neural Network (RNN), in this paper we propose an alternative representation which is based on the record of the past events observed in a given episode. Inspired by the human memory, these events describe only important changes in the environment and, in our approach, are automatically discovered using self-supervision. 
+ We evaluate our history representation method using two challenging RL benchmarks: some games of the Atari-57 suite and the 3D environment Obstacle Tower. Using these benchmarks we show the advantage of our solution with respect to common RNN-based approaches.",/pdf/7b83dcafcd4ba091ee057d7fb5f5bc478ee76467.pdf,ICLR,2020,event discovery to represent the history for the agent in RL +9l0K4OM-oXE,_NFh20k-Hrz,1601310000000.0,1614330000000.0,1631,Neural Attention Distillation: Erasing Backdoor Triggers from Deep Neural Networks,"[""~Yige_Li1"", ""xxlv@mail.xidian.edu.cn"", ""~Nodens_Koren1"", ""~Lingjuan_Lyu1"", ""~Bo_Li19"", ""~Xingjun_Ma1""]","[""Yige Li"", ""Xixiang Lyu"", ""Nodens Koren"", ""Lingjuan Lyu"", ""Bo Li"", ""Xingjun Ma""]","[""Backdoor Defense"", ""Deep Neural Networks"", ""Neural Attention Distillation""]","Deep neural networks (DNNs) are known vulnerable to backdoor attacks, a training time attack that injects a trigger pattern into a small proportion of training data so as to control the model's prediction at the test time. Backdoor attacks are notably dangerous since they do not affect the model's performance on clean examples, yet can fool the model to make the incorrect prediction whenever the trigger pattern appears during testing. In this paper, we propose a novel defense framework Neural Attention Distillation (NAD) to erase backdoor triggers from backdoored DNNs. NAD utilizes a teacher network to guide the finetuning of the backdoored student network on a small clean subset of data such that the intermediate-layer attention of the student network aligns with that of the teacher network. The teacher network can be obtained by an independent finetuning process on the same clean subset. We empirically show, against 6 state-of-the-art backdoor attacks, NAD can effectively erase the backdoor triggers using only 5\% clean training data without causing obvious performance degradation on clean examples. 
Our code is available at https://github.com/bboylyg/NAD.",/pdf/42f5786a622e8cdc4ce43d79d5d83ebe8e4feeeb.pdf,ICLR,2021,A simple but effective neural attention distillation method for backdoor defense. +r1xfECEKvr,SJxbaGUdPS,1569440000000.0,1577170000000.0,1068,Analyzing the Role of Model Uncertainty for Electronic Health Records,"[""dusenberrymw@google.com"", ""trandustin@google.com"", ""mp2893@gmail.com"", ""jonasbkemp@google.com"", ""jeremynixon@google.com"", ""ghassen@google.com"", ""kheller@google.com"", ""adai@google.com""]","[""Michael W. Dusenberry"", ""Dustin Tran"", ""Edward Choi"", ""Jonas Kemp"", ""Jeremy Nixon"", ""Ghassen Jerfel"", ""Katherine Heller"", ""Andrew M. Dai""]","[""medicine"", ""uncertainty"", ""neural networks"", ""Bayesian"", ""electronic health records""]","In medicine, both ethical and monetary costs of incorrect predictions can be significant, and the complexity of the problems often necessitates increasingly complex models. Recent work has shown that changing just the random seed is enough for otherwise well-tuned deep neural networks to vary in their individual predicted probabilities. In light of this, we investigate the role of model uncertainty methods in the medical domain. Using RNN ensembles and various Bayesian RNNs, we show that population-level metrics, such as AUC-PR, AUC-ROC, log-likelihood, and calibration error, do not capture model uncertainty. Meanwhile, the presence of significant variability in patient-specific predictions and optimal decisions motivates the need for capturing model uncertainty. Understanding the uncertainty for individual patients is an area with clear clinical impact, such as determining when a model decision is likely to be brittle. 
We further show that RNNs with only Bayesian embeddings can be a more efficient way to capture model uncertainty compared to ensembles, and we analyze how model uncertainty is impacted across individual input features and patient subgroups.",/pdf/8094119adb5f1f4f3ca35ebf5579b2d0b36bddcc.pdf,ICLR,2020,"We investigate the role of model uncertainty methods for domains like medicine, and compare a multitude of Bayesian RNN variants with deterministic RNN ensembles." +B1ecVlrtDr,S1ggWDxFPr,1569440000000.0,1577170000000.0,2258,Symmetric-APL Activations: Training Insights and Robustness to Adversarial Attacks,"[""mohamadt@uci.edu"", ""fagostin@uci.edu"", ""pfbaldi@ics.uci.edu""]","[""Mohammadamin Tavakoli"", ""Forest Agostinelli"", ""Pierre Baldi""]","[""Activation function"", ""Adaptive"", ""Training"", ""Robustness"", ""Adversarial attack""]","Deep neural networks with learnable activation functions have shown superior performance over deep neural networks with fixed activation functions for many different problems. The adaptability of learnable activation functions adds expressive power to the model which results in better performance. Here, we propose a new learnable activation function based on Adaptive Piecewise Linear units (APL), which 1) gives equal expressive power to both the positive and negative halves on the input space and 2) is able to approximate any zero-centered continuous non-linearity in a closed interval. We investigate how the shape of the Symmetric-APL function changes during training and perform ablation studies to gain insight into the reason behind these changes. We hypothesize that these activation functions go through two distinct stages: 1) adding gradient information and 2) adding expressive power. Finally, we show that the use of Symmetric-APL activations can significantly increase the robustness of deep neural networks to adversarial attacks. 
Our experiments on both black-box and open-box adversarial attacks show that commonly-used architectures, namely Lenet, Network-in-Network, and ResNet-18 can be up to 51% more resistant to adversarial fooling by only using the proposed activation functions instead of ReLUs.",/pdf/7dd1ecc866b319ddc1e41f1469497935df49ea6f.pdf,ICLR,2020,Symmetric Adaptive Piecewise Linear activations are proposed as a new activation function with deep explanation on training behavior and robustness to adversarial attacks. +Syl-xpNtwS,rJenKERSPH,1569440000000.0,1577170000000.0,328,Learning Representations in Reinforcement Learning: an Information Bottleneck Approach,"[""peiyingjun4@gmail.com"", ""xwhou@nlpr.ia.ac.cn""]","[""Yingjun Pei"", ""Xinwen Hou""]","[""representation learning"", ""reinforcement learning"", ""information bottleneck""]","The information bottleneck principle is an elegant and useful approach to representation learning. In this paper, we investigate the problem of representation learning in the context of reinforcement learning using the information bottleneck framework, aiming at improving the sample efficiency of the learning algorithms. We analytically derive the optimal conditional distribution of the representation, and provide a variational lower bound. Then, we maximize this lower bound with the Stein variational (SV) gradient method. +We incorporate this framework in the advantage actor-critic algorithm (A2C) and the proximal policy optimization algorithm (PPO). Our experimental results show that our framework can improve the sample efficiency of vanilla A2C and PPO significantly. Finally, we study the information-bottleneck (IB) perspective in deep RL with the algorithm called mutual information neural estimation (MINE). +We experimentally verify that the information extraction-compression process also exists in deep RL and our framework is capable of accelerating this process. 
We also analyze the relationship between MINE and our method; through this relationship, we theoretically derive an algorithm to optimize our IB framework without constructing the lower bound.",/pdf/acfd9ba9d052c933b587c6673668c12fd127d32d.pdf,ICLR,2020,Derive an information bottleneck framework in reinforcement learning and some simple relevant theories and tools. +ryh_8f9lg,,1478270000000.0,1484350000000.0,173,Classless Association using Neural Networks,"[""federico.raue@dfki.de"", ""sebastian.palacio@dfki.de"", ""andreas.dengel@dfki.de"", ""liwicki@cs.uni-kl.de""]","[""Federico Raue"", ""Sebastian Palacio"", ""Andreas Dengel"", ""Marcus Liwicki""]",[],"The goal of this paper is to train a model based on the relation between two instances that represent the same unknown class. This scenario is inspired by the Symbol Grounding Problem and the association learning in infants. We propose a novel model called Classless Association. It has two parallel Multilayer Perceptrons (MLP) that use one network as a target of the other network, and vice versa. In addition, the presented model is trained based on an EM-approach, in which the output vectors are matched against a statistical distribution. We generate four classless datasets based on MNIST, where the input is two different instances of the same digit. In addition, the digits have a uniform distribution. Furthermore, our classless association model is evaluated against two scenarios: totally supervised and totally unsupervised. In the first scenario, our model reaches a good performance in terms of accuracy and the classless constraint. In the second scenario, our model reaches better results than two clustering algorithms. 
+",/pdf/788a4cffe22f661847498b56c09f13aadd311e8c.pdf,ICLR,2017,Learning based on the relation between two instances of the same unknown class +B1l8L6EtDS,rygES-FvDS,1569440000000.0,1583910000000.0,560,Self-Adversarial Learning with Comparative Discrimination for Text Generation,"[""v-waz@microsoft.com"", ""tage@microsoft.com"", ""kexu@nlsde.buaa.edu.cn"", ""fuwei@microsoft.com"", ""mingzhou@microsoft.com""]","[""Wangchunshu Zhou"", ""Tao Ge"", ""Ke Xu"", ""Furu Wei"", ""Ming Zhou""]","[""adversarial learning"", ""text generation""]","Conventional Generative Adversarial Networks (GANs) for text generation tend to have issues of reward sparsity and mode collapse that affect the quality and diversity of generated samples. To address the issues, we propose a novel self-adversarial learning (SAL) paradigm for improving GANs' performance in text generation. In contrast to standard GANs that use a binary classifier as its discriminator to predict whether a sample is real or generated, SAL employs a comparative discriminator which is a pairwise classifier for comparing the text quality between a pair of samples. During training, SAL rewards the generator when its currently generated sentence is found to be better than its previously generated samples. This self-improvement reward mechanism allows the model to receive credits more easily and avoid collapsing towards the limited number of real samples, which not only helps alleviate the reward sparsity issue but also reduces the risk of mode collapse. Experiments on text generation benchmark datasets show that our proposed approach substantially improves both the quality and the diversity, and yields more stable performance compared to the previous GANs for text generation.",/pdf/d5957de3f1cc9bca5cee3d926654b7d35292c8c4.pdf,ICLR,2020,We propose a self-adversarial learning (SAL) paradigm which improves the generator in a self-play fashion for improving GANs' performance in text generation. 
+SJtfOEn6-,rJ_M_Eh6b,1508820000000.0,1518730000000.0,56,ResBinNet: Residual Binary Neural Network,"[""mghasemzadeh@ucsd.edu"", ""msamragh@ucsd.edu"", ""farinaz@ucsd.edu""]","[""Mohammad Ghasemzadeh"", ""Mohammad Samragh"", ""Farinaz Koushanfar""]","[""Binary Neural Networks"", ""Residual Binarization"", ""Deep Learning""]","Recent efforts on training light-weight binary neural networks offer promising execution/memory efficiency. This paper introduces ResBinNet, which is a composition of two interlinked methodologies aiming to address the slow convergence speed and limited accuracy of binary convolutional neural networks. The first method, called residual binarization, learns a multi-level binary representation for the features within a certain neural network layer. The second method, called temperature adjustment, gradually binarizes the weights of a particular layer. The two methods jointly learn a set of soft-binarized parameters that improve the convergence rate and accuracy of binary neural networks. We corroborate the applicability and scalability of ResBinNet by implementing a prototype hardware accelerator. The accelerator is reconfigurable in terms of the numerical precision of the binarized features, offering a trade-off between runtime and inference accuracy. +",/pdf/6f52eca88529fe24950cd3bf6d416c131a1f1058.pdf,ICLR,2018,Residual Binary Neural Networks significantly improve the convergence rate and inference accuracy of the binary neural networks. +ByYPLJA6W,S1uPUyCaZ,1508930000000.0,1518730000000.0,80,Distribution Regression Network,"[""koukl@comp.nus.edu.sg"", ""leehk@bii.a-star.edu.sg"", ""ngtk@comp.nus.edu.sg""]","[""Connie Kou"", ""Hwee Kuan Lee"", ""Teck Khim Ng""]","[""distribution regression"", ""supervised learning"", ""regression analysis""]","We introduce our Distribution Regression Network (DRN) which performs regression from input probability distributions to output probability distributions. 
Compared to existing methods, DRN learns with fewer model parameters and easily extends to multiple input and multiple output distributions. On synthetic and real-world datasets, DRN performs similarly or better than the state-of-the-art. Furthermore, DRN generalizes the conventional multilayer perceptron (MLP). In the framework of MLP, each node encodes a real number, whereas in DRN, each node encodes a probability distribution. ",/pdf/4494945530898996fba4f5ba5569d676fd172d6a.pdf,ICLR,2018,A learning network which generalizes the MLP framework to perform distribution-to-distribution regression +B17JTOe0-,rkfJpdeA-,1509100000000.0,1519530000000.0,321,Emergence of grid-like representations by training recurrent neural networks to perform spatial localization,"[""ccueva@gmail.com"", ""weixxpku@gmail.com""]","[""Christopher J. Cueva"", ""Xue-Xin Wei""]","[""recurrent neural network"", ""grid cell"", ""neural representation of space""]","Decades of research on the neural code underlying spatial navigation have revealed a diverse set of neural response properties. The Entorhinal Cortex (EC) of the mammalian brain contains a rich set of spatial correlates, including grid cells which encode space using tessellating patterns. However, the mechanisms and functional significance of these spatial representations remain largely mysterious. As a new way to understand these neural representations, we trained recurrent neural networks (RNNs) to perform navigation tasks in 2D arenas based on velocity inputs. Surprisingly, we find that grid-like spatial response patterns emerge in trained networks, along with units that exhibit other spatial correlates, including border cells and band-like cells. All these different functional types of neurons have been observed experimentally. The order of the emergence of grid-like and border cells is also consistent with observations from developmental studies. 
Together, our results suggest that grid cells, border cells and others as observed in EC may be a natural solution for representing space efficiently given the predominant recurrent connections in the neural circuits. +",/pdf/742beeb37fd932c9da9d178b741d2ab0093ccba7.pdf,ICLR,2018,"To our knowledge, this is the first study to show how neural representations of space, including grid-like cells and border cells as observed in the brain, could emerge from training a recurrent neural network to perform navigation tasks." +HJxK5pEYvr,B1xacgRvwS,1569440000000.0,1583910000000.0,715,Tree-Structured Attention with Hierarchical Accumulation,"[""nxphi47@gmail.com"", ""sjoty@salesforce.com""]","[""Xuan-Phi Nguyen"", ""Shafiq Joty"", ""Steven Hoi"", ""Richard Socher""]","[""Tree"", ""Constituency Tree"", ""Hierarchical Accumulation"", ""Machine Translation"", ""NMT"", ""WMT"", ""IWSLT"", ""Text Classification"", ""Sentiment Analysis""]","Incorporating hierarchical structures like constituency trees has been shown to be effective for various natural language processing (NLP) tasks. However, it is evident that state-of-the-art (SOTA) sequence-based models like the Transformer struggle to encode such structures inherently. On the other hand, dedicated models like the Tree-LSTM, while explicitly modeling hierarchical structures, do not perform as efficiently as the Transformer. In this paper, we attempt to bridge this gap with Hierarchical Accumulation to encode parse tree structures into self-attention at constant time complexity. Our approach outperforms SOTA methods in four IWSLT translation tasks and the WMT'14 English-German task. It also yields improvements over Transformer and Tree-LSTM on three text classification tasks. 
We further demonstrate that using hierarchical priors can compensate for data shortage, and that our model prefers phrase-level attentions over token-level attentions.",/pdf/55798e850eb51021c8c65ef727968718b41a91d4.pdf,ICLR,2020, +8X2eaSZxTP,KYbb-CxoDlI,1601310000000.0,1615280000000.0,1636,PC2WF: 3D Wireframe Reconstruction from Raw Point Clouds,"[""~Yujia_Liu3"", ""~Stefano_D'Aronco1"", ""~Konrad_Schindler1"", ""~Jan_Dirk_Wegner1""]","[""Yujia Liu"", ""Stefano D'Aronco"", ""Konrad Schindler"", ""Jan Dirk Wegner""]","[""deep neural network"", ""3d point cloud"", ""wireframe model""]","We introduce PC2WF, the first end-to-end trainable deep network architecture to convert a 3D point cloud into a wireframe model. The network takes as input an unordered set of 3D points sampled from the surface of some object, and outputs a wireframe of that object, i.e., a sparse set of corner points linked by line segments. Recovering the wireframe is a challenging task, where the numbers of both vertices and edges are different for every instance, and a-priori unknown. Our architecture gradually builds up the model: It starts by encoding the points into feature vectors. Based on those features, it identifies a pool of candidate vertices, then prunes those candidates to a final set of corner vertices and refines their locations. Next, the corners are linked with an exhaustive set of candidate edges, which is again pruned to obtain the final wireframe. All steps are trainable, and errors can be backpropagated through the entire sequence. We validate the proposed model on a publicly available synthetic dataset, for which the ground truth wireframes are accessible, as well as on a new real-world dataset. Our model produces wireframe abstractions of good quality and outperforms several baselines.",/pdf/3de76ce33a27b6e85c1414aa9fa6edfa54803af0.pdf,ICLR,2021,An end-to-end trainable deep neural network for converting a 3D point cloud into a wireframe model. 
+SyGjjsC5tQ,Hygo2bK9F7,1538090000000.0,1548600000000.0,652,Stable Opponent Shaping in Differentiable Games,"[""ahp.letcher@gmail.com"", ""jakobfoerster@gmail.com"", ""dbalduzzi@google.com"", ""tim.rocktaeschel@gmail.com"", ""shimon.whiteson@cs.ox.ac.uk""]","[""Alistair Letcher"", ""Jakob Foerster"", ""David Balduzzi"", ""Tim Rockt\u00e4schel"", ""Shimon Whiteson""]","[""multi-agent learning"", ""multiple interacting losses"", ""opponent shaping"", ""exploitation"", ""convergence""]","A growing number of learning methods are actually differentiable games whose players optimise multiple, interdependent objectives in parallel – from GANs and intrinsic curiosity to multi-agent RL. Opponent shaping is a powerful approach to improve learning dynamics in these games, accounting for player influence on others’ updates. Learning with Opponent-Learning Awareness (LOLA) is a recent algorithm that exploits this response and leads to cooperation in settings like the Iterated Prisoner’s Dilemma. Although experimentally successful, we show that LOLA agents can exhibit ‘arrogant’ behaviour directly at odds with convergence. In fact, remarkably few algorithms have theoretical guarantees applying across all (n-player, non-convex) games. In this paper we present Stable Opponent Shaping (SOS), a new method that interpolates between LOLA and a stable variant named LookAhead. We prove that LookAhead converges locally to equilibria and avoids strict saddles in all differentiable games. SOS inherits these essential guarantees, while also shaping the learning of opponents and consistently either matching or outperforming LOLA experimentally.",/pdf/70c1f902a0584a62f83691cdcb6cd15b62433c8b.pdf,ICLR,2019,Opponent shaping is a powerful approach to multi-agent learning but can prevent convergence; our SOS algorithm fixes this with strong guarantees in all differentiable games. 
+BJgRDjR9tQ,HJecFTGYKX,1538090000000.0,1550720000000.0,314,ROBUST ESTIMATION VIA GENERATIVE ADVERSARIAL NETWORKS,"[""chaogao@galton.uchicago.edu"", ""jiyi.liu@yale.edu"", ""yuany@ust.hk"", ""wzhuai@connect.ust.hk""]","[""Chao GAO"", ""jiyi LIU"", ""Yuan YAO"", ""Weizhi ZHU""]","[""robust statistics"", ""neural networks"", ""minimax rate"", ""data depth"", ""contamination model"", ""Tukey median"", ""GAN""]","Robust estimation under Huber's $\epsilon$-contamination model has become an important topic in statistics and theoretical computer science. Rate-optimal procedures such as Tukey's median and other estimators based on statistical depth functions are impractical because of their computational intractability. In this paper, we establish an intriguing connection between f-GANs and various depth functions through the lens of f-Learning. Similar to the derivation of f-GAN, we show that these depth functions that lead to rate-optimal robust estimators can all be viewed as variational lower bounds of the total variation distance in the framework of f-Learning. This connection opens the door of computing robust estimators using tools developed for training GANs. In particular, we show that a JS-GAN that uses a neural network discriminator with at least one hidden layer is able to achieve the minimax rate of robust mean estimation under Huber's $\epsilon$-contamination model. Interestingly, the hidden layers of the neural net structure in the discriminator class are shown to be necessary for robust estimation.",/pdf/f25cdd52eefdd04a6afe5fdc9034c0c69b8ec517.pdf,ICLR,2019,GANs are shown to provide us a new effective robust mean estimate against agnostic contaminations with both statistical optimality and practical tractability. 
+rJIN_4lA-,ryrVONg0Z,1509080000000.0,1518730000000.0,241,Maintaining cooperation in complex social dilemmas using deep reinforcement learning,"[""alex.peys@gmail.com"", ""alerer@fb.com""]","[""Alexander Peysakhovich"", ""Adam Lerer""]","[""reinforcement learning"", ""cooperation"", ""social dilemmas"", ""game theory""]","Social dilemmas are situations where individuals face a temptation to increase their payoffs at a cost to total welfare. Building artificially intelligent agents that achieve good outcomes in these situations is important because many real world interactions include a tension between selfish interests and the welfare of others. We show how to modify modern reinforcement learning methods to construct agents that act in ways that are simple to understand, nice (begin by cooperating), provokable (try to avoid being exploited), and forgiving (try to return to mutual cooperation). We show both theoretically and experimentally that such agents can maintain cooperation in Markov social dilemmas. Our construction does not require training methods beyond a modification of self-play, thus if an environment is such that good strategies can be constructed in the zero-sum case (eg. Atari) then we can construct agents that solve social dilemmas in this environment. ",/pdf/64312fdb688b166b53125ef25d290de2dc0d65a5.pdf,ICLR,2018,How can we build artificial agents that solve social dilemmas (situations where individuals face a temptation to increase their payoffs at a cost to total welfare)? 
+SJgndT4KwB,H1gh5UnPwH,1569440000000.0,1583910000000.0,646,Finite Depth and Width Corrections to the Neural Tangent Kernel,"[""bhanin@math.tamu.edu"", ""mnica@math.utoronto.ca""]","[""Boris Hanin"", ""Mihai Nica""]","[""Neural Tangent Kernel"", ""Finite Width Corrections"", ""Random ReLU Net"", ""Wide Networks"", ""Deep Networks""]","We prove the precise scaling, at finite depth and width, for the mean and variance of the neural tangent kernel (NTK) in a randomly initialized ReLU network. The standard deviation is exponential in the ratio of network depth to width. Thus, even in the limit of infinite overparameterization, the NTK is not deterministic if depth and width simultaneously tend to infinity. Moreover, we prove that for such deep and wide networks, the NTK has a non-trivial evolution during training by showing that the mean of its first SGD update is also exponential in the ratio of network depth to width. This is in sharp contrast to the regime where depth is fixed and network width is very large. Our results suggest that, unlike relatively shallow and wide networks, deep and wide ReLU networks are capable of learning data-dependent features even in the so-called lazy training regime. ",/pdf/d2542e25b48b0fdb9e0c496f3bb41b3592be559d.pdf,ICLR,2020,The neural tangent kernel in a randomly initialized ReLU net has non-trivial fluctuations as long as the depth and width are comparable. +Hkl6i0EFPH,BJlbw7Yuvr,1569440000000.0,1577170000000.0,1336,Scalable Differentially Private Data Generation via Private Aggregation of Teacher Ensembles,"[""ylong4@illinois.edu"", ""linsuxin28@gmail.com"", ""lucas110550@sjtu.edu.cn"", ""cgunter@illinois.edu"", ""hanliu@northwestern.edu"", ""lbo@illinois.edu""]","[""Yunhui Long"", ""Suxin Lin"", ""Zhuolin Yang"", ""Carl A. Gunter"", ""Han Liu"", ""Bo Li""]",[],"We present a novel approach named G-PATE for training a differentially private data generator. 
The generator can be used to produce synthetic datasets with strong privacy guarantee while preserving high data utility. Our approach leverages generative adversarial nets to generate data and exploits the PATE (Private Aggregation of Teacher Ensembles) framework to protect data privacy. Compared to existing methods, our approach significantly improves the use of privacy budget. This is possible since we only need to ensure differential privacy for the generator, which is the part of the model that actually needs to be published for private data generation. To achieve this, we connect a student generator with an ensemble of teacher discriminators and propose a private gradient aggregation mechanism to ensure differential privacy on all the information that flows from the teacher discriminators to the student generator. Theoretically, we prove that our algorithm ensures differential privacy for the generator. Empirically, we provide thorough experiments to demonstrate the superiority of our method over prior work on both image and non-image datasets.",/pdf/8f9acbae7780d669e8e6511046eb4e017d835f38.pdf,ICLR,2020, +H1DJFybC-,HJ8kYJWCb,1509120000000.0,1518730000000.0,498,Learning to Infer Graphics Programs from Hand-Drawn Images,"[""ellisk@mit.edu"", ""daniel_richie@brown.edu"", ""asolar@csail.mit.edu"", ""jbt@mit.edu""]","[""Kevin Ellis"", ""Daniel Ritchie"", ""Armando Solar-Lezama"", ""Joshua B. Tenenbaum""]","[""program induction"", ""HCI"", ""deep learning""]"," We introduce a model that learns to convert simple hand drawings + into graphics programs written in a subset of \LaTeX.~The model + combines techniques from deep learning and program synthesis. We + learn a convolutional neural network that proposes plausible drawing + primitives that explain an image. These drawing primitives are like + a trace of the set of primitive commands issued by a graphics + program. 
We learn a model that uses program synthesis techniques to + recover a graphics program from that trace. These programs have + constructs like variable bindings, iterative loops, or simple kinds + of conditionals. With a graphics program in hand, we can correct + errors made by the deep network and extrapolate drawings. Taken + together these results are a step towards agents that induce useful, + human-readable programs from perceptual input.",/pdf/40a05da6d518dae6d3f36ec4b4ea232e06443cd3.pdf,ICLR,2018,Learn to convert a hand drawn sketch into a high-level program +H1eNleBYwr,B1g1GTytPr,1569440000000.0,1577170000000.0,2095,GENN: Predicting Correlated Drug-drug Interactions with Graph Energy Neural Networks,"[""tengfei.ma1@ibm.com"", ""sjy1203@pku.edu.cn"", ""cao.xiao@iqvia.com"", ""sun@cc.gatech.edu""]","[""Tengfei Ma"", ""Junyuan Shang"", ""Cao Xiao"", ""Jimeng Sun""]","[""graph neural networks"", ""energy model"", ""structure prediction"", ""drug-drug-interaction""]","Gaining more comprehensive knowledge about drug-drug interactions (DDIs) is one of the most important tasks in drug development and medical practice. Recently graph neural networks have achieved great success in this task by modeling drugs as nodes and drug-drug interactions as links and casting DDI predictions as link prediction problems. However, correlations between link labels (e.g., DDI types) were rarely considered in existing works. + We propose the graph energy neural network (\mname) to explicitly model link type correlations. We formulate the DDI prediction task as a structure prediction problem and introduce a new energy-based model where the energy function is defined by graph neural networks. Experiments on two real-world DDI datasets demonstrated that \mname is superior to many baselines without consideration of link type correlations and achieved $13.77\%$ and $5.01\%$ PR-AUC improvement on the two datasets, respectively. 
We also present a case study in which \mname can better capture meaningful DDI correlations compared with baseline models.",/pdf/9dea0bdf7aca16c7335106e64d67adfc7c80eeca.pdf,ICLR,2020, +S1ly2grtvB,H1gWxeZKwr,1569440000000.0,1577170000000.0,2522,IS THE LABEL TRUSTFUL: TRAINING BETTER DEEP LEARNING MODEL VIA UNCERTAINTY MINING NET,"[""yang.sun1@ibm.com"", ""abhishek.kolagunda@ibm.com"", ""steven.eliuk@ibm.com"", ""visionxiaolong@gmail.com""]","[""Yang Sun"", ""Abhishek Kolagunda"", ""Steven Eliuk"", ""Xiaolong Wang""]","[""Semi-supervised Learning"", ""Robust Learning"", ""Deep Generative Model""]","In this work, we consider a new problem of training deep neural network on partially labeled data with label noise. As far as we know, +there have been very few efforts to tackle such problems. +We present a novel end-to-end deep generative pipeline for improving classifier performance when dealing with such data problems. We call it +Uncertainty Mining Net (UMN). + During the training stage, we utilize all the available data (labeled and unlabeled) to train the classifier via a semi-supervised generative framework. + During training, UMN estimates the uncertainly of the labels’ to focus on clean data for learning. More precisely, UMN applies the sample-wise label uncertainty estimation scheme. 
+ Extensive experiments and comparisons against state-of-the-art methods on several popular benchmark datasets demonstrate that UMN can reduce the effects of label noise and significantly improve classifier performance.",/pdf/dc49301820d49a5eff0e591fb5b8c6e5356474d2.pdf,ICLR,2020, +S1xJ4JHFvS,rkg1292_wS,1569440000000.0,1577170000000.0,1639,Acutum: When Generalization Meets Adaptability,"[""huangxunpeng@bytedance.com"", ""liuzhengyang.lozycs@bytedance.com"", ""wang.10982@osu.edu"", ""yuyue.elaine@bytedance.com"", ""lilei.02@bytedance.com""]","[""Xunpeng Huang"", ""Zhengyang Liu"", ""Zhe Wang"", ""Yue Yu"", ""Lei Li""]","[""optimization"", ""momentum"", ""adaptive gradient methods""]","In spite of the slow convergence, stochastic gradient descent (SGD) is still the most practical optimization method due to its outstanding generalization ability and simplicity. On the other hand, adaptive methods have attracted much more attention of optimization and machine learning communities, both for the leverage of life-long information and for the deep and fundamental mathematical theory. Taking the best of both worlds is the most exciting and challenging question in the field of optimization for machine learning. + +In this paper, we take a small step towards such ultimate goal. We revisit existing adaptive methods from a novel point of view, which reveals a fresh understanding of momentum. Our new intuition empowers us to remove the second moments in Adam without the loss of performance. Based on our view, we propose a new method, named acute adaptive momentum (Acutum). To the best of our knowledge, Acutum is the first adaptive gradient method without second moments. Experimentally, we demonstrate that our method has a faster convergence rate than Adam/Amsgrad, and generalizes as well as SGD with momentum. We also provide a convergence analysis of our proposed method to complement our intuition. 
",/pdf/e445cc3c50ea648e756d9723f7fb6ca82639387b.pdf,ICLR,2020, +Bk8BvDqex,,1478290000000.0,1487900000000.0,392,Metacontrol for Adaptive Imagination-Based Optimization,"[""jhamrick@berkeley.edu"", ""aybd@google.com"", ""razp@google.com"", ""vinyals@google.com"", ""heess@google.com"", ""peterbattaglia@google.com""]","[""Jessica B. Hamrick"", ""Andrew J. Ballard"", ""Razvan Pascanu"", ""Oriol Vinyals"", ""Nicolas Heess"", ""Peter W. Battaglia""]","[""Deep learning"", ""Reinforcement Learning"", ""Optimization""]","Many machine learning systems are built to solve the hardest examples of a particular task, which often makes them large and expensive to run---especially with respect to the easier examples, which might require much less computation. For an agent with a limited computational budget, this ""one-size-fits-all"" approach may result in the agent wasting valuable computation on easy examples, while not spending enough on hard examples. Rather than learning a single, fixed policy for solving all instances of a task, we introduce a metacontroller which learns to optimize a sequence of ""imagined"" internal simulations over predictive models of the world in order to construct a more informed, and more economical, solution. The metacontroller component is a model-free reinforcement learning agent, which decides both how many iterations of the optimization procedure to run, as well as which model to consult on each iteration. The models (which we call ""experts"") can be state transition models, action-value functions, or any other mechanism that provides information useful for solving the task, and can be learned on-policy or off-policy in parallel with the metacontroller. When the metacontroller, controller, and experts were trained with ""interaction networks"" (Battaglia et al., 2016) as expert models, our approach was able to solve a challenging decision-making problem under complex non-linear dynamics. 
The metacontroller learned to adapt the amount of computation it performed to the difficulty of the task, and learned how to choose which experts to consult by factoring in both their reliability and individual computational resource costs. This allowed the metacontroller to achieve a lower overall cost (task loss plus computational cost) than more traditional fixed policy approaches. These results demonstrate that our approach is a powerful framework for using rich forward models for efficient model-based reinforcement learning.",/pdf/da8a506167be0f352375d2aeb90379f95feebe76.pdf,ICLR,2017,"We present a ""metacontroller"" neural architecture which can adaptively decide how long to run an model-based online optimization procedure for, and which models to use during the optimization." +rJlHIo09KQ,SkgX6Af5tm,1538090000000.0,1545360000000.0,172,Gradient-based Training of Slow Feature Analysis by Differentiable Approximate Whitening,"[""merlin.schueler@ini.rub.de"", ""hlynur.hlynsson@ini.rub.de"", ""laurenz.wiskott@ini.rub.de""]","[""Merlin Sch\u00fcler"", ""Hlynur Dav\u00ed\u00f0 Hlynsson"", ""Laurenz Wiskott""]","[""Slow Feature Analysis"", ""Deep Learning"", ""Spectral Embedding"", ""Temporal Coherence""]","We propose Power Slow Feature Analysis, a gradient-based method to extract temporally slow features from a high-dimensional input stream that varies on a faster time-scale, as a variant of Slow Feature Analysis (SFA). While displaying performance comparable to hierarchical extensions to the SFA algorithm, such as Hierarchical Slow Feature Analysis, for a small number of output-features, our algorithm allows fully differentiable end-to-end training of arbitrary differentiable approximators (e.g., deep neural networks). 
We provide experimental evidence that PowerSFA is able to extract meaningful and informative low-dimensional features in the case of (a) synthetic low-dimensional data, (b) visual data, and also for (c) a general dataset for which symmetric non-temporal relations between points can be defined.",/pdf/bd9fec0c2c4930a203098e51f978ef175a3bdba3.pdf,ICLR,2019,We propose a way to train Slow Feature Analysis with stochastic gradient descent eliminating the need for greedy layer-wise training. +HkeO104tPB,BJxisqMuwH,1569440000000.0,1577170000000.0,899,Reinforcement Learning without Ground-Truth State,"[""xlin3@cs.cmu.edu"", ""dheld@andrew.cmu.edu"", ""harjatis@andrew.cmu.edu""]","[""Xingyu Lin"", ""Harjatin Singh Baweja"", ""David Held""]","[""Self-supervised"", ""goal-conditioned reinforcement learning""]","To perform robot manipulation tasks, a low-dimensional state of the environment typically needs to be estimated. However, designing a state estimator can sometimes be difficult, especially in environments with deformable objects. An alternative is to learn an end-to-end policy that maps directly from high-dimensional sensor inputs to actions. However, if this policy is trained with reinforcement learning, then without a state estimator, it is hard to specify a reward function based on high-dimensional observations. To meet this challenge, we propose a simple indicator reward function for goal-conditioned reinforcement learning: we only give a positive reward when the robot's observation exactly matches a target goal observation. We show that by relabeling the original goal with the achieved goal to obtain positive rewards (Andrychowicz et al., 2017), we can learn with the indicator reward function even in continuous state spaces. We propose two methods to further speed up convergence with indicator rewards: reward balancing and reward filtering. We show comparable performance between our method and an oracle which uses the ground-truth state for computing rewards. 
We show that our method can perform complex tasks in continuous state spaces such as rope manipulation from RGB-D images, without knowledge of the ground-truth state.",/pdf/38fe5b8adc9198e513f13fdd909d4cc50afb2f7b.pdf,ICLR,2020,"This paper proposes to use an indicator function for specifying reward in goal-conditioned reinforcement learning, eliminating the need for reward engineering." +MmcywoW7PbJ,RZ5ZYRyPqce,1601310000000.0,1614990000000.0,2587,Learn Goal-Conditioned Policy with Intrinsic Motivation for Deep Reinforcement Learning,"[""~Jinxin_Liu1"", ""~Donglin_Wang1"", ""~Qiangxing_Tian1"", ""~Zhengyu_Chen2""]","[""Jinxin Liu"", ""Donglin Wang"", ""Qiangxing Tian"", ""Zhengyu Chen""]","[""unsupervised reinforcement learning"", ""goal-conditioned policy"", ""intrinsic reward""]","It is of significance for an agent to learn a widely applicable and general-purpose policy that can achieve diverse goals including images and text descriptions. Considering such perceptually-specific goals, the frontier of deep reinforcement learning research is to learn a goal-conditioned policy without hand-crafted rewards. To learn this kind of policy, recent works usually take as the reward the non-parametric distance to a given goal in an explicit embedding space. From a different viewpoint, we propose a novel unsupervised learning approach named goal-conditioned policy with intrinsic motivation (GPIM), which jointly learns both an abstract-level policy and a goal-conditioned policy. The abstract-level policy is conditioned on a latent variable to optimize a discriminator and discovers diverse states that are further rendered into perceptually-specific goals for the goal-conditioned policy. The learned discriminator serves as an intrinsic reward function for the goal-conditioned policy to imitate the trajectory induced by the abstract-level policy. 
Experiments on various robotic tasks demonstrate the effectiveness and efficiency of our proposed GPIM method which substantially outperforms prior techniques. ",/pdf/f9a32e8e7adb00c44d88852c836f45ea7e997f45.pdf,ICLR,2021,We learn the goal-conditioned policy in an unsupervised manner. +HyiAuyb0b,r1iCOyWAb,1509120000000.0,1518800000000.0,497,TD or not TD: Analyzing the Role of Temporal Differencing in Deep Reinforcement Learning,"[""amiranas@cs.uni-freiburg.de"", ""adosovitskiy@gmail.com"", ""vkoltun@gmail.com"", ""brox@cs.uni-freiburg.de""]","[""Artemij Amiranashvili"", ""Alexey Dosovitskiy"", ""Vladlen Koltun"", ""Thomas Brox""]","[""deep learning"", ""reinforcement learning"", ""temporal difference""]","Our understanding of reinforcement learning (RL) has been shaped by theoretical and empirical results that were obtained decades ago using tabular representations and linear function approximators. These results suggest that RL methods that use temporal differencing (TD) are superior to direct Monte Carlo estimation (MC). How do these results hold up in deep RL, which deals with perceptually complex environments and deep nonlinear models? In this paper, we re-examine the role of TD in modern deep RL, using specially designed environments that control for specific factors that affect performance, such as reward sparsity, reward delay, and the perceptual complexity of the task. When comparing TD with infinite-horizon MC, we are able to reproduce classic results in modern settings. Yet we also find that finite-horizon MC is not inferior to TD, even when rewards are sparse or delayed. This makes MC a viable alternative to TD in deep RL.",/pdf/3a0350f13e4cdc8ca28d0c1915dd4ac2b734622d.pdf,ICLR,2018, +BJxbYoC9FQ,SygZ-BGKtm,1538090000000.0,1571430000000.0,418,Classifier-agnostic saliency map extraction,"[""konrad.zolna@gmail.com"", ""k.j.geras@nyu.edu"", ""kyunghyun.cho@nyu.edu""]","[""Konrad Zolna"", ""Krzysztof J. 
Geras"", ""Kyunghyun Cho""]","[""saliency maps"", ""explainable AI"", ""convolutional neural networks"", ""generative adversarial training"", ""classification""]","Extracting saliency maps, which indicate parts of the image important to classification, requires many tricks to achieve satisfactory performance when using classifier-dependent methods. Instead, we propose classifier-agnostic saliency map extraction, which finds all parts of the image that any classifier could use, not just one given in advance. We observe that the proposed approach extracts higher quality saliency maps and outperforms existing weakly-supervised localization techniques, setting the new state of the art result on the ImageNet dataset.",/pdf/aea03de50b0d227049d9038e47b8198cc915a476.pdf,ICLR,2019,We propose a new saliency map extraction method which results in extracting higher quality maps. +SyQq185lg,,1478280000000.0,1486480000000.0,279,Latent Sequence Decompositions,"[""williamchan@cmu.edu"", ""yzhang87@mit.edu"", ""qvl@google.com"", ""ndjaitly@google.com""]","[""William Chan"", ""Yu Zhang"", ""Quoc Le"", ""Navdeep Jaitly""]","[""Speech"", ""Applications"", ""Natural language processing"", ""Deep learning""]","Sequence-to-sequence models rely on a fixed decomposition of the target sequences into a sequence of tokens that may be words, word-pieces or characters. The choice of these tokens and the decomposition of the target sequences into a sequence of tokens is often static, and independent of the input, output data domains. This can potentially lead to a sub-optimal choice of token dictionaries, as the decomposition is not informed by the particular problem being solved. In this paper we present Latent Sequence Decompositions (LSD), a framework in which the decomposition of sequences into constituent tokens is learnt during the training of the model. The decomposition depends both on the input sequence and on the output sequence. 
In LSD, during training, the model samples decompositions incrementally, from left to right by locally sampling between valid extensions. We experiment with the Wall Street Journal speech recognition task. Our LSD model achieves 12.9% WER compared to a character baseline of 14.8% WER. When combined with a convolutional network on the encoder, we achieve a WER of 9.6%. +",/pdf/7dd9ea27d813cb4a72d10469a49427fe81f55fb4.pdf,ICLR,2017, +iMKvxHlrZb3,Qr-vKTns4YO,1601310000000.0,1614990000000.0,955,Scalable Graph Neural Networks for Heterogeneous Graphs,"[""~Lingfan_Yu1"", ""~Jiajun_Shen1"", ""~Jinyang_Li1"", ""~Adam_Lerer1""]","[""Lingfan Yu"", ""Jiajun Shen"", ""Jinyang Li"", ""Adam Lerer""]","[""Graph Neural Networks"", ""Large Graphs"", ""Heterogeneous Graphs""]","Graph neural networks (GNNs) are a popular class of parametric model for learning over graph-structured data. Recent work has argued that GNNs primarily use the graph for feature smoothing, and have shown competitive results on benchmark tasks by simply operating on graph-smoothed node features, rather than using end-to-end learned feature hierarchies that are challenging to scale to large graphs. In this work, we ask whether these results can be extended to heterogeneous graphs, which encode multiple types of relationship between different entities. We propose Neighbor Averaging over Relation Subgraphs (NARS), which trains a classifier on neighbor-averaged features for randomly-sampled subgraphs of the ‘metagraph‘ of relations. We describe optimizations to allow these sets of node features to be computed in a memory-efficient way, both at training and inference time. 
NARS achieves a new state of the art accuracy on several benchmark datasets, outperforming more expensive GNN-based methods.",/pdf/6629950ac7502dd98951eca805fe36efada80845.pdf,ICLR,2021, +rkeNqkBFPB,Hyle9YROPr,1569440000000.0,1577170000000.0,1874,Deep automodulators,"[""ari.heljakka@aalto.fi"", ""yuxin.hou@aalto.fi"", ""arno.solin@aalto.fi"", ""juho.kannala@aalto.fi""]","[""Ari Heljakka"", ""Yuxin Hou"", ""Juho Kannala"", ""Arno Solin""]","[""unsupervised learning"", ""generative models"", ""autoencoders"", ""disentanglement"", ""style transfer""]","We introduce a novel autoencoder model that deviates from traditional autoencoders by using the full latent vector to independently modulate each layer in the decoder. We demonstrate how such an 'automodulator' allows for a principled approach to enforce latent space disentanglement, mixing of latent codes, and a straightforward way to utilize prior information that can be construed as a scale-specific invariance. Unlike GANs, autoencoder models can directly operate on new real input samples. This makes our model directly suitable for applications involving real-world inputs. As the architectural backbone, we extend recent generative autoencoder models that retain input identity and image sharpness at high resolutions better than VAEs. We show that our model achieves state-of-the-art latent space disentanglement and achieves high quality and diversity of output samples, as well as faithfulness of reconstructions.",/pdf/347eba633bf19d94b630d05c5fca4cd33f53861b.pdf,ICLR,2020,"A novel autoencoder model that supports mutually independent decoder layers, enabling e.g. style mixing." 
+rJgDT04twH,HyeC1McdDB,1569440000000.0,1577170000000.0,1396,Deep Reinforcement Learning with Implicit Human Feedback,"[""dxu3016@gatech.edu"", ""me.agmohit@gatech.edu"", ""siva@ece.gatech.edu"", ""faramarz.fekri@ece.gatech.edu""]","[""Duo Xu"", ""Mohit Agarwal"", ""Raghupathy Sivakumar"", ""Faramarz Fekri""]","[""Error-Potentials"", ""Implicit Human Feedback"", ""Deep Reinforcement Learning"", ""Human-assistance""]","We consider the following central question in the field of Deep Reinforcement Learning (DRL): How can we use implicit human feedback to accelerate and optimize the training of a DRL algorithm? State-of-the-art methods rely on any human feedback to be provided explicitly, requiring the active participation of humans (e.g., expert labeling, demonstrations, etc.). In this work, we investigate an alternative paradigm, where non-expert humans are silently observing (and assessing) the agent interacting with the environment. The human's intrinsic reactions to the agent's behavior is sensed as implicit feedback by placing electrodes on the human scalp and monitoring what are known as event-related electric potentials. The implicit feedback is then used to augment the agent's learning in the RL tasks. We develop a system to obtain and accurately decode the implicit human feedback (specifically error-related event potentials) for state-action pairs in an Atari-type environment. As a baseline contribution, we demonstrate the feasibility of capturing error-potentials of a human observer watching an agent learning to play several different Atari-games using an electroencephalogram (EEG) cap, and then decoding the signals appropriately and using them as an auxiliary reward function to a DRL algorithm with the intent of accelerating its learning of the game. 
Building atop the baseline, we then make the following novel contributions in our work: +(i) We argue that the definition of error-potentials is generalizable across different environments; specifically we show that error-potentials of an observer can be learned for a specific game, and the definition used as-is for another game without requiring re-learning of the error-potentials. +(ii) We propose two different frameworks to combine recent advances in DRL into the error-potential based feedback system in a sample-efficient manner, allowing humans to provide implicit feedback while training in the loop, or prior to the training of the RL agent. +(iii) Finally, we scale the implicit human feedback (via ErrP) based RL to reasonably complex environments (games) and demonstrate the significance of our approach through synthetic and real user experiments. +",/pdf/44cda98831ea9ff4ea352a0b5970ebf2ff94b609.pdf,ICLR,2020,"We use implicit human feedback (via error-potentials, EEG) to accelerate and optimize the training of a DRL algorithm, in a practical manner." +Eql5b1_hTE4,yvvEEq4opfk,1601310000000.0,1615780000000.0,1387,Robust early-learning: Hindering the memorization of noisy labels,"[""~Xiaobo_Xia1"", ""~Tongliang_Liu1"", ""~Bo_Han1"", ""~Chen_Gong5"", ""~Nannan_Wang1"", ""~Zongyuan_Ge1"", ""yichang@jlu.edu.cn""]","[""Xiaobo Xia"", ""Tongliang Liu"", ""Bo Han"", ""Chen Gong"", ""Nannan Wang"", ""Zongyuan Ge"", ""Yi Chang""]",[],"The \textit{memorization effects} of deep networks show that they will first memorize training data with clean labels and then those with noisy labels. The \textit{early stopping} method therefore can be exploited for learning with noisy labels. However, the side effect brought by noisy labels will influence the memorization of clean labels before early stopping. 
In this paper, motivated by the \textit{lottery ticket hypothesis} which shows that only partial parameters are important for generalization, we find that only partial parameters are important for fitting clean labels and generalize well, which we term as \textit{critical parameters}; while the other parameters tend to fit noisy labels and cannot generalize well, which we term as \textit{non-critical parameters}. Based on this, we propose \textit{robust early-learning} to reduce the side effect of noisy labels before early stopping and thus enhance the memorization of clean labels. Specifically, in each iteration, we divide all parameters into the critical and non-critical ones, and then perform different update rules for different types of parameters. Extensive experiments on benchmark-simulated and real-world label-noise datasets demonstrate the superiority of the proposed method over the state-of-the-art label-noise learning methods.",/pdf/8bcbd9a8ffef76580768ad0f329dcc1cae3be97e.pdf,ICLR,2021, +HJSA_e1AW,rkEAdek0b,1509000000000.0,1518730000000.0,115,Normalized Direction-preserving Adam,"[""zijun.zhang@ucalgary.ca"", ""linmawhu@gmail.com"", ""zongpeng@ucalgary.ca"", ""cwu@cs.hku.hk""]","[""Zijun Zhang"", ""Lin Ma"", ""Zongpeng Li"", ""Chuan Wu""]","[""optimization"", ""generalization"", ""Adam"", ""SGD""]","Optimization algorithms for training deep models not only affects the convergence rate and stability of the training process, but are also highly related to the generalization performance of trained models. While adaptive algorithms, such as Adam and RMSprop, have shown better optimization performance than stochastic gradient descent (SGD) in many scenarios, they often lead to worse generalization performance than SGD, when used for training deep neural networks (DNNs). In this work, we identify two problems regarding the direction and step size for updating the weight vectors of hidden units, which may degrade the generalization performance of Adam. 
As a solution, we propose the normalized direction-preserving Adam (ND-Adam) algorithm, which controls the update direction and step size more precisely, and thus bridges the generalization gap between Adam and SGD. Following a similar rationale, we further improve the generalization performance in classification tasks by regularizing the softmax logits. By bridging the gap between SGD and Adam, we also shed some light on why certain optimization algorithms generalize better than others.",/pdf/a54f255162cc8f9322d001098aa86fc6e8732622.pdf,ICLR,2018,"A tailored version of Adam for training DNNs, which bridges the generalization gap between Adam and SGD." +rygvFyrKwH,r1eUS8ROPH,1569440000000.0,1577170000000.0,1844,Adversarial Robustness as a Prior for Learned Representations,"[""engstrom@mit.edu"", ""ailyas@mit.edu"", ""shibani@mit.edu"", ""tsipras@mit.edu"", ""btran115@mit.edu"", ""madry@mit.edu""]","[""Logan Engstrom"", ""Andrew Ilyas"", ""Shibani Santurkar"", ""Dimitris Tsipras"", ""Brandon Tran"", ""Aleksander Madry""]","[""adversarial robustness"", ""adversarial examples"", ""robust optimization"", ""representation learning"", ""feature visualization""]","An important goal in deep learning is to learn versatile, high-level feature representations of input data. However, standard networks' representations seem to possess shortcomings that, as we illustrate, prevent them from fully realizing this goal. In this work, we show that robust optimization can be re-cast as a tool for enforcing priors on the features learned by deep neural networks. It turns out that representations learned by robust models address the aforementioned shortcomings and make significant progress towards learning a high-level encoding of inputs. In particular, these representations are approximately invertible, while allowing for direct visualization and manipulation of salient input features. 
More broadly, our results indicate adversarial robustness as a promising avenue for improving learned representations.",/pdf/059de1824bc4e6c6377b1ed00f64b18a502fbb9e.pdf,ICLR,2020,"Representations learned by robust neural networks align better with our idealization of representations as high-level feature extractors, and thus allow for representation inversion, as well as direct feature visualization and manipulation." +bWqodw-mFi1,0tIRcjzdvA,1601310000000.0,1614990000000.0,1502,Explicit homography estimation improves contrastive self-supervised learning,"[""~David_Torpey1"", ""~Richard_Klein1""]","[""David Torpey"", ""Richard Klein""]",[],"The typical contrastive self-supervised algorithm uses a similarity measure in latent space as the supervision signal by contrasting positive and negative images directly or indirectly. Although the utility of self-supervised algorithms has improved recently, there are still bottlenecks hindering their widespread use, such as the compute needed. In this paper, we propose a module that serves as an additional objective in the self-supervised contrastive learning paradigm. We show how the inclusion of this module to regress the parameters of an affine transformation or homography, in addition to the original contrastive objective, improves both performance and rate of learning. Importantly, we ensure that this module does not enforce invariance to the various components of the affine transform, as this is not always ideal. We demonstrate the effectiveness of the additional objective on two recent, popular self-supervised algorithms. We perform an extensive experimental analysis of the proposed method and show an improvement in performance for all considered datasets. 
Further, we find that although both the general homography and affine transformation are sufficient to improve performance and convergence, the affine transformation performs better in all cases.",/pdf/5d773534ff8067050015baa0e31582613488982c.pdf,ICLR,2021,Explicit homography estimation improves contrastive self-supervised learning +rygjN3C9F7,Hkl7x1RqYm,1538090000000.0,1545360000000.0,1483,The Variational Deficiency Bottleneck,"[""pradeep@mis.mpg.de"", ""montufar@math.ucla.edu""]","[""Pradeep Kr. Banerjee"", ""Guido Montufar""]","[""Variational Information Bottleneck"", ""Blackwell Sufficiency"", ""Le Cam Deficiency"", ""Information Channel""]","We introduce a bottleneck method for learning data representations based on channel deficiency, rather than the more traditional information sufficiency. A variational upper bound allows us to implement this method efficiently. The bound itself is bounded above by the variational information bottleneck objective, and the two methods coincide in the regime of single-shot Monte Carlo approximations. The notion of deficiency provides a principled way of approximating complicated channels by relatively simpler ones. The deficiency of one channel w.r.t. another has an operational interpretation in terms of the optimal risk gap of decision problems, capturing classification as a special case. Unsupervised generalizations are possible, such as the deficiency autoencoder, which can also be formulated in a variational form. Experiments demonstrate that the deficiency bottleneck can provide advantages in terms of minimal sufficiency as measured by information bottleneck curves, while retaining a good test performance in classification and reconstruction tasks. ",/pdf/a6354bc66cc502d14f80d5c7ac1d18a14db47876.pdf,ICLR,2019,We develop a new bottleneck method based on channel deficiency. 
+Bklrea4KwS,rylY2318wS,1569440000000.0,1577170000000.0,338,Deep Multiple Instance Learning with Gaussian Weighting,"[""basura.fernando@anu.edu.au"", ""hbilen@ed.ac.uk""]","[""Basura Fernando"", ""Hakan Bilen""]","[""Multiple instance learning"", ""deep learning""]","In this paper we present a deep Multiple Instance Learning (MIL) method that can be trained end-to-end to perform classification from weak supervision. Our MIL method is implemented as a two stream neural network, specialized in tasks of instance classification and weighting. Our instance weighting stream makes use of Gaussian radial basis function to normalize the instance weights by comparing instances locally within the bag and globally across bags. The final classification score of the bag is an aggregate of all instance classification scores. The instance representation is shared by both instance classification and weighting streams. The Gaussian instance weighting allows us to regularize the representation learning of instances such that all positive instances to be closer to each other w.r.t. the instance weighting function. We evaluate our method on five standard MIL datasets and show that our method outperforms other MIL methods. We also evaluate our model on two datasets where all models are trained end-to-end. Our method obtain better bag-classification and instance classification results on these datasets. 
We conduct extensive experiments to investigate the robustness of the proposed model and obtain interesting insights.",/pdf/9edbb42fb131a3f85441b00d8a15ffb0efa12750.pdf,ICLR,2020, +B1x1MerYPB,rJegPWgFvB,1569440000000.0,1577170000000.0,2158,Putting Machine Translation in Context with the Noisy Channel Model,"[""leiyu@google.com"", ""lsartran@google.com"", ""wstokowiec@google.com"", ""lingwang@google.com"", ""lingpenk@google.com"", ""pblunsom@google.com"", ""cdyer@google.com""]","[""Lei Yu"", ""Laurent Sartran"", ""Wojciech Stokowiec"", ""Wang Ling"", ""Lingpeng Kong"", ""Phil Blunsom"", ""Chris Dyer""]","[""machine translation"", ""context-aware machine translation"", ""bayes rule""]","We show that Bayes' rule provides a compelling mechanism for controlling unconditional document language models, using the long-standing challenge of effectively leveraging document context in machine translation. In our formulation, we estimate the probability of a candidate translation as the product of the unconditional probability of the candidate output document and the ``reverse translation probability'' of translating the candidate output back into the input source language document---the so-called ``noisy channel'' decomposition. A particular advantage of our model is that it requires only parallel sentences to train, rather than parallel documents, which are not always available. Using a new beam search reranking approximation to solve the decoding problem, we find that document language models outperform language models that assume independence between sentences, and that using either a document or sentence language model outperform comparable models that directly estimate the translation probability. We obtain the best-published results on the NIST Chinese--English translation task, a standard task for evaluating document translation. 
Our model also outperforms the benchmark Transformer model by approximately 2.5 BLEU on the WMT19 Chinese--English translation task.",/pdf/c20527d2d5d628b86a30e752404bfcc02044223e.pdf,ICLR,2020, +HygS7n0cFQ,BJgsSv6qFm,1538090000000.0,1545360000000.0,1357,Fast Exploration with Simplified Models and Approximately Optimistic Planning in Model Based Reinforcement Learning,"[""keramati@stanford.edu"", ""jaywhang@cs.stanford.edu"", ""patcho@cs.stanford.edu"", ""ebrun@cs.stanford.edu""]","[""Ramtin Keramati"", ""Jay Whang"", ""Patrick Cho"", ""Emma Brunskill""]","[""Reinforcement Learning"", ""Strategic Exploration"", ""Model Based Reinforcement Learning""]","Humans learn to play video games significantly faster than the state-of-the-art reinforcement learning (RL) algorithms. People seem to build simple models that are easy to learn to support planning and strategic exploration. Inspired by this, we investigate two issues in leveraging model-based RL for sample efficiency. First we investigate how to perform strategic exploration when exact planning is not feasible and empirically show that optimistic Monte Carlo Tree Search outperforms posterior sampling methods. Second we show how to learn simple deterministic models to support fast learning using object representation. We illustrate the benefit of these ideas by introducing a novel algorithm, Strategic Object Oriented Reinforcement Learning (SOORL), that outperforms state-of-the-art algorithms in the game of Pitfall! in less than 50 episodes.",/pdf/55f6170d303ecb98c0eb5a6bded7145f616e4227.pdf,ICLR,2019,We studied exploration with imperfect planning and used object representation to learn simple models and introduced a new sample efficient RL algorithm that achieves state of the art results on Pitfall! 
+rJ5C67-C-,B1XRpXW0b,1509140000000.0,1518730000000.0,1173,Hyperedge2vec: Distributed Representations for Hyperedges,"[""sharm170@umn.edu"", ""srjoty@ntu.edu.sg"", ""himanshukharkwal765@gmail.com"", ""srivasta@umn.edu""]","[""Ankit Sharma"", ""Shafiq Joty"", ""Himanshu Kharkwal"", ""Jaideep Srivastava""]","[""hypergraph"", ""representation learning"", ""tensors""]","Data structured in the form of overlapping or non-overlapping sets is found in a variety of domains, sometimes explicitly but often subtly. For example, teams, which are of prime importance in social science studies are \enquote{sets of individuals}; \enquote{item sets} in pattern mining are sets; and for various types of analysis in language studies a sentence can be considered as a \enquote{set or bag of words}. Although building models and inference algorithms for structured data has been an important task in the fields of machine learning and statistics, research on \enquote{set-like} data still remains less explored. Relationships between pairs of elements can be modeled as edges in a graph. However, for modeling relationships that involve all members of a set, a hyperedge is a more natural representation for the set. In this work, we focus on the problem of embedding hyperedges in a hypergraph (a network of overlapping sets) to a low dimensional vector space. We propose a probabilistic deep-learning based method as well as a tensor-based algebraic model, both of which capture the hypergraph structure in a principled manner without losing set-level information. Our central focus is to highlight the connection between hypergraphs (topology), tensors (algebra) and probabilistic models. We present a number of interesting baselines, some of which adapt existing node-level embedding models to the hyperedge-level, as well as sequence based language techniques which are adapted for set structured hypergraph topology. The performance is evaluated with a network of social groups and a network of word phrases. 
Our experiments show that accuracy-wise our methods perform similar to those of baselines which are not designed for hypergraphs. Moreover, our tensor based method is quite efficient as compared to deep-learning based auto-encoder method. We therefore argue that we have proposed more general methods which are suited for hypergraphs (and therefore also for graphs) while maintaining accuracy and efficiency. ",/pdf/df0bce76f679daed6584e6c2fa64e70dfeadcbb2.pdf,ICLR,2018, +SJxIkkSKwB,S1lb8TqOwr,1569440000000.0,1577170000000.0,1470,Learning in Confusion: Batch Active Learning with Noisy Oracle,"[""ggaurav@usc.edu"", ""anit.sahu@gmail.com"", ""wan-yi.lin@us.bosch.com""]","[""Gaurav Gupta"", ""Anit Kumar Sahu"", ""Wan-Yi Lin""]","[""Active Learning"", ""Noisy Oracle"", ""Model Uncertainty"", ""Image classification""]","We study the problem of training machine learning models incrementally using active learning with access to imperfect or noisy oracles. We specifically consider the setting of batch active learning, in which multiple samples are selected as opposed to a single sample as in classical settings so as to reduce the training overhead. Our approach bridges between uniform randomness and score based importance sampling of clusters when selecting a batch of new samples. Experiments on +benchmark image classification datasets (MNIST, SVHN, and CIFAR10) show improvement over existing active learning strategies. We introduce an extra denoising layer to deep networks to make active learning robust to label noise and show significant improvements. +",/pdf/8740f822938dd32907511cac542103fd2a1ee79d.pdf,ICLR,2020,We address the active learning in batch setting with noisy oracles and use model uncertainty to encode the decision quality of active learning algorithm during acquisition. 
+6FsCHsZ66Fp,xWJyFfPLT3f,1601310000000.0,1614990000000.0,2907,Towards certifying $\ell_\infty$ robustness using Neural networks with $\ell_\infty$-dist Neurons,"[""zhangbohang@pku.edu.cn"", ""~Zhou_Lu1"", ""~Tianle_Cai1"", ""~Di_He1"", ""~Liwei_Wang1""]","[""Bohang Zhang"", ""Zhou Lu"", ""Tianle Cai"", ""Di He"", ""Liwei Wang""]",[],"It is well-known that standard neural networks, even with a high classification accuracy, are vulnerable to small $\ell_\infty$ perturbations. Many attempts have been tried to learn a network that can resist such adversarial attacks. However, most previous works either can only provide empirical verification of the defense to a particular attack method or can only develop a theoretical guarantee of the model robustness in limited scenarios. In this paper, we develop a theoretically principled neural network that inherently resists $\ell_\infty$ perturbations. In particular, we design a novel neuron that uses $\ell_\infty$ distance as its basic operation, which we call $\ell_\infty$-dist neuron. We show that the $\ell_\infty$-dist neuron is naturally a 1-Lipschitz function with respect to the $\ell_\infty$ norm, and the neural networks constructed with $\ell_\infty$-dist neuron ($\ell_{\infty}$-dist Nets) enjoy the same property. This directly provides a theoretical guarantee of the certified robustness based on the margin of the prediction outputs. We further prove that the $\ell_{\infty}$-dist Nets have enough expressiveness power to approximate any 1-Lipschitz function, and can generalize well as the robust test error can be upper-bounded by the performance of a large margin classifier on the training data. 
Preliminary experiments show that even without the help of adversarial training, the learned networks with high classification accuracy are already provably robust.",/pdf/9b8b95236e675ae664dc48770761d48159b66f9a.pdf,ICLR,2021, +TGFO0DbD_pk,A5O1jBAqKUg,1601310000000.0,1613380000000.0,3101,Genetic Soft Updates for Policy Evolution in Deep Reinforcement Learning,"[""~Enrico_Marchesini1"", ""davide.corsi@univr.it"", ""~Alessandro_Farinelli1""]","[""Enrico Marchesini"", ""Davide Corsi"", ""Alessandro Farinelli""]","[""Deep Reinforcement Learning"", ""Evolutionary Algorithms"", ""Formal Verification"", ""Machine Learning for Robotics""]","The combination of Evolutionary Algorithms (EAs) and Deep Reinforcement Learning (DRL) has been recently proposed to merge the benefits of both solutions. Existing mixed approaches, however, have been successfully applied only to actor-critic methods and present significant overhead. We address these issues by introducing a novel mixed framework that exploits a periodical genetic evaluation to soft update the weights of a DRL agent. The resulting approach is applicable with any DRL method and, in a worst-case scenario, it does not exhibit detrimental behaviours. Experiments in robotic applications and continuous control benchmarks demonstrate the versatility of our approach that significantly outperforms prior DRL, EAs, and mixed approaches. Finally, we employ formal verification to confirm the policy improvement, mitigating the inefficient exploration and hyper-parameter sensitivity of DRL.",/pdf/2a012533ff0b6880941f619b1e03b63abd1414c6.pdf,ICLR,2021,We present a novel mixed framework that combines the benefits of Evolutionary Algorithms and any DRL algorithms (including value-based ones); we support our claims on the beneficial policy improvement using recent formal verification tools. 
+S1RP6GLle,,1478010000000.0,1487680000000.0,29,Amortised MAP Inference for Image Super-resolution,"[""casperkaae@gmail.com"", ""jcaballero@twitter.com"", ""ltheis@twitter.com"", ""wshi@twitter.com"", ""fhuszar@twitter.com""]","[""Casper Kaae S\u00f8nderby"", ""Jose Caballero"", ""Lucas Theis"", ""Wenzhe Shi"", ""Ferenc Husz\u00e1r""]","[""Theory"", ""Computer vision"", ""Deep learning""]","Image super-resolution (SR) is an underdetermined inverse problem, where a large number of plausible high resolution images can explain the same downsampled image. Most current single image SR methods use empirical risk minimisation, often with a pixel-wise mean squared error (MSE) loss. +However, the outputs from such methods tend to be blurry, over-smoothed and generally appear implausible. A more desirable approach would employ Maximum a Posteriori (MAP) inference, preferring solutions that always have a high probability under the image prior, and thus appear more plausible. Direct MAP estimation for SR is non-trivial, as it requires us to build a model for the image prior from samples. Here we introduce new methods for \emph{amortised MAP inference} whereby we calculate the MAP estimate directly using a convolutional neural network. We first introduce a novel neural network architecture that performs a projection to the affine subspace of valid SR solutions ensuring that the high resolution output of the network is always consistent with the low resolution input. We show that, using this architecture, the amortised MAP inference problem reduces to minimising the cross-entropy between two distributions, similar to training generative models. We propose three methods to solve this optimisation problem: (1) Generative Adversarial Networks (GAN) (2) denoiser-guided SR which backpropagates gradient-estimates from denoising to train the network, and (3) a baseline method using a maximum-likelihood-trained image prior. 
Our experiments show that the GAN based approach performs best on real image data. Lastly, we establish a connection between GANs and amortised variational inference as in e.g. variational autoencoders.",/pdf/62cf51f7f8bb2bfb79c850cc2584cb044e7f6ad3.pdf,ICLR,2017,Probabilistically motivated image superresolution using a projection to the subspace of valid solutions +#NAME?,wld72sLx2iL,1601310000000.0,1614990000000.0,3823,A Theory of Self-Supervised Framework for Few-Shot Learning,"[""~Zhong_Cao1"", ""~Jiang_Lu1"", ""~Jian_Liang3"", ""~Changshui_Zhang2""]","[""Zhong Cao"", ""Jiang Lu"", ""Jian Liang"", ""Changshui Zhang""]",[],"Recently, self-supervised learning (SSL) algorithms have been applied to few-shot learning (FSL). FSL aims at distilling transferable knowledge on existing classes with large-scale labeled data to cope with novel classes for which only a few labeled data are available. Due to the limited number of novel classes, the initial embedding network becomes an essential component and can largely affect the performance in practice. But almost no one analyzes why a pre-trained embedding network with self-supervised training can provide representation for downstream FSL tasks in theory. In this paper, we first summarized the supervised FSL methods and explained why SSL is suitable for FSL. Then we further analyzed the main difference between supervised training and self-supervised training on FSL and obtained the bound for the gap between self-supervised loss and supervised loss. Finally, we proposed potential ways to improve the test accuracy under the setting of self-supervised FSL. 
",/pdf/3f730ddfd0ec9ef3031a78a16e0c3403defafa22.pdf,ICLR,2021, +BJg1fgBYwH,r1eHv-eKDr,1569440000000.0,1577170000000.0,2159,SAFE-DNN: A Deep Neural Network with Spike Assisted Feature Extraction for Noise Robust Inference,"[""xshe6@gatech.edu"", ""priyabratasaha@gatech.edu"", ""daehyun.kim@gatech.edu"", ""yunlong@gatech.edu"", ""saibal.mukhopadhyay@ece.gatech.edu""]","[""Xueyuan She"", ""Priyabrata Saha"", ""Daehyun Kim"", ""Yun Long"", ""Saibal Mukhopadhyay""]","[""Noise robust"", ""deep learning"", ""DNN"", ""image classification""]",We present a Deep Neural Network with Spike Assisted Feature Extraction (SAFE-DNN) to improve robustness of classification under stochastic perturbation of inputs. The proposed network augments a DNN with unsupervised learning of low-level features using spiking neuron network (SNN) with Spike-Time-Dependent-Plasticity (STDP). The complete network learns to ignore local perturbation while performing global feature detection and classification. The experimental results on CIFAR-10 and ImageNet subset demonstrate improved noise robustness for multiple DNN architectures without sacrificing accuracy on clean images.,/pdf/7faa4ff4643f80cee7f3aee2ec656e0bb689f5c4.pdf,ICLR,2020,A noise robust deep learning architecture. +PxTIG12RRHS,olrb2OosIhN,1601310000000.0,1612860000000.0,2561,Score-Based Generative Modeling through Stochastic Differential Equations,"[""~Yang_Song1"", ""~Jascha_Sohl-Dickstein2"", ""~Diederik_P_Kingma1"", ""~Abhishek_Kumar1"", ""~Stefano_Ermon1"", ""~Ben_Poole1""]","[""Yang Song"", ""Jascha Sohl-Dickstein"", ""Diederik P Kingma"", ""Abhishek Kumar"", ""Stefano Ermon"", ""Ben Poole""]","[""generative models"", ""score-based generative models"", ""stochastic differential equations"", ""score matching"", ""diffusion""]","Creating noise from data is easy; creating data from noise is generative modeling. 
We present a stochastic differential equation (SDE) that smoothly transforms a complex data distribution to a known prior distribution by slowly injecting noise, and a corresponding reverse-time SDE that transforms the prior distribution back into the data distribution by slowly removing the noise. +Crucially, the reverse-time SDE depends only on the time-dependent gradient field (a.k.a., score) of the perturbed data distribution. By leveraging advances in score-based generative modeling, we can accurately estimate these scores with neural networks, and use numerical SDE solvers to generate samples. We show that this framework encapsulates previous approaches in score-based generative modeling and diffusion probabilistic modeling, allowing for new sampling procedures and new modeling capabilities. In particular, we introduce a predictor-corrector framework to correct errors in the evolution of the discretized reverse-time SDE. We also derive an equivalent neural ODE that samples from the same distribution as the SDE, but additionally enables exact likelihood computation, and improved sampling efficiency. In addition, we provide a new way to solve inverse problems with score-based models, as demonstrated with experiments on class-conditional generation, image inpainting, and colorization. Combined with multiple architectural improvements, we achieve record-breaking performance for unconditional image generation on CIFAR-10 with an Inception score of 9.89 and FID of 2.20, a competitive likelihood of 2.99 bits/dim, and demonstrate high fidelity generation of $1024\times 1024$ images for the first time from a score-based generative model.",/pdf/ef0eadbe07115b0853e964f17aa09d811cd490f1.pdf,ICLR,2021,"A general framework for training and sampling from score-based models that unifies and generalizes previous methods, allows likelihood computation, and enables controllable generation." 
+ry_4vpixl,,1478380000000.0,1478380000000.0,605,Rotation Plane Doubly Orthogonal Recurrent Neural Networks,"[""zmccarthy@berkeley.edu"", ""xiaoyang.bai@berkeley.edu"", ""c.xi@berkeley.edu"", ""pabbeel@berkeley.edu""]","[""Zoe McCarthy"", ""Andrew Bai"", ""Xi Chen"", ""Pieter Abbeel""]","[""Deep learning"", ""Theory""]","Recurrent Neural Networks (RNNs) applied to long sequences suffer from the well known vanishing and exploding gradients problem. The recently proposed Unitary Evolution Recurrent Neural Network (uRNN) alleviates the exploding gradient problem and can learn very long dependencies, but its nonlinearities make it still affected by the vanishing gradient problem and so learning can break down for extremely long dependencies. We propose a new RNN transition architecture where the hidden state is updated multiplicatively by a time invariant orthogonal transformation followed by an input modulated orthogonal transformation. There are no additive interactions and so our architecture exactly preserves forward hidden state activation norm and backwards gradient norm for all time steps, and is provably not affected by vanishing or exploding gradients. We propose using the rotation plane parameterization to represent the orthogonal matrices. We validate our model on a simplified memory copy task and see that our model can learn dependencies as long as 5,000 timesteps.",/pdf/652d9439cff6afd96bc9a6060100a6c12d9982a9.pdf,ICLR,2017,"Recurrent equation for RNNs that uses the composition of two orthogonal transitions, one time invariant and one modulated by input, that doesn't suffer from vanishing or exploding gradients." 
+Syl38yrFwr,rJlYddpOPr,1569440000000.0,1577170000000.0,1743,Near-Zero-Cost Differentially Private Deep Learning with Teacher Ensembles,"[""james.lichao.sun@gmail.com"", ""yingbo.zhou@salesforce.com"", ""jia.li@salesforce.com"", ""rsocher@salesforce.com"", ""psyu@uic.edu"", ""cxiong@salesforce.com""]","[""Lichao Sun"", ""Yingbo Zhou"", ""Jia Li"", ""Richard Socher"", ""Philip S. Yu"", ""Caiming Xiong""]",[],"Ensuring the privacy of sensitive data used to train modern machine learning models is of paramount importance in many areas of practice. One approach to study these concerns is through the lens of differential privacy. In this framework, privacy guarantees are generally obtained by perturbing models in such a way that specifics of data used to train the model are made ambiguous. A particular instance of this approach is through a ``teacher-student'' model, wherein the teacher, who owns the sensitive data, provides the student with useful, but noisy, information, hopefully allowing the student model to perform well on a given task without access to particular features of the sensitive data. Because stronger privacy guarantees generally involve more significant noising on the part of the teacher, deploying existing frameworks fundamentally involves a trade-off between utility and privacy guarantee. One of the most important techniques used in previous work involves an ensemble of teacher models, which return information to a student based on a noisy voting procedure. In this work, we propose a novel voting mechanism, which we call an Immutable Noisy ArgMax, that, under certain conditions, can bear very large random noising from the teacher without affecting the useful information transferred to the student. 
Our mechanisms improve over the state-of-the-art methods on all measures, and scale to larger tasks with both higher utility and stronger privacy ($\epsilon \approx 0$).",/pdf/d315f5b524580c06a1994893543c610fcffce83f.pdf,ICLR,2020, +IFqrg1p5Bc,mrsOses4lW9,1601310000000.0,1615820000000.0,2216,Distance-Based Regularisation of Deep Networks for Fine-Tuning,"[""~Henry_Gouk1"", ""~Timothy_Hospedales1"", ""~massimiliano_pontil1""]","[""Henry Gouk"", ""Timothy Hospedales"", ""massimiliano pontil""]","[""Deep Learning"", ""Transfer Learning"", ""Statistical Learning Theory""]","We investigate approaches to regularisation during fine-tuning of deep neural networks. First we provide a neural network generalisation bound based on Rademacher complexity that uses the distance the weights have moved from their initial values. This bound has no direct dependence on the number of weights and compares favourably to other bounds when applied to convolutional networks. Our bound is highly relevant for fine-tuning, because providing a network with a good initialisation based on transfer learning means that learning can modify the weights less, and hence achieve tighter generalisation. Inspired by this, we develop a simple yet effective fine-tuning algorithm that constrains the hypothesis class to a small sphere centred on the initial pre-trained weights, thus obtaining provably better generalisation performance than conventional transfer learning. Empirical evaluation shows that our algorithm works well, corroborating our theoretical results. It outperforms both state of the art fine-tuning competitors, and penalty-based alternatives that we show do not directly constrain the radius of the search space.",/pdf/8758dc3fcea289b116ec58cc0b6feae810915b43.pdf,ICLR,2021,"We derive generalisation bounds applicable to fine-tuning, then demonstrate an algorithm that regularises these bounds improves fine-tuning performance." 
+BygFVAEKDH,Byxf3HLuvB,1569440000000.0,1583910000000.0,1083,Understanding Knowledge Distillation in Non-autoregressive Machine Translation,"[""chuntinz@andrew.cmu.edu"", ""jgu@fb.com"", ""gneubig@cs.cmu.edu""]","[""Chunting Zhou"", ""Jiatao Gu"", ""Graham Neubig""]","[""knowledge distillation"", ""non-autoregressive neural machine translation""]","Non-autoregressive machine translation (NAT) systems predict a sequence of output tokens in parallel, achieving substantial improvements in generation speed compared to autoregressive models. Existing NAT models usually rely on the technique of knowledge distillation, which creates the training data from a pretrained autoregressive model for better performance. Knowledge distillation is empirically useful, leading to large gains in accuracy for NAT models, but the reason for this success has, as of yet, been unclear. In this paper, we first design systematic experiments to investigate why knowledge distillation is crucial to NAT training. We find that knowledge distillation can reduce the complexity of data sets and help NAT to model the variations in the output data. Furthermore, a strong correlation is observed between the capacity of an NAT model and the optimal complexity of the distilled data for the best translation quality. Based on these findings, we further propose several approaches that can alter the complexity of data sets to improve the performance of NAT models. We achieve the state-of-the-art performance for the NAT-based models, and close the gap with the autoregressive baseline on WMT14 En-De benchmark.",/pdf/a057d93f5526ce4ceb6751adc88766291d440cd7.pdf,ICLR,2020,"We systematically examine why knowledge distillation is crucial to the training of non-autoregressive translation (NAT) models, and propose methods to further improve the distilled data to best match the capacity of an NAT model." 
+B1gjs6EtDr,H1l0eD1dPB,1569440000000.0,1577170000000.0,757,Efficient Content-Based Sparse Attention with Routing Transformers,"[""aurkor@google.com"", ""msaffar@google.com"", ""grangier@google.com"", ""avaswani@google.com""]","[""Aurko Roy*"", ""Mohammad Taghi Saffar*"", ""David Grangier"", ""Ashish Vaswani""]","[""Sparse attention"", ""autoregressive"", ""generative models""]","Self-attention has recently been adopted for a wide range of sequence modeling +problems. Despite its effectiveness, self-attention suffers quadratic compute and +memory requirements with respect to sequence length. Successful approaches to +reduce this complexity focused on attention to local sliding windows or a small +set of locations independent of content. Our work proposes to learn dynamic +sparse attention patterns that avoid allocating computation and memory to attend +to content unrelated to the query of interest. This work builds upon two lines of +research: it combines the modeling flexibility of prior work on content-based sparse +attention with the efficiency gains from approaches based on local, temporal sparse +attention. Our model, the Routing Transformer, endows self-attention with a sparse +routing module based on online k-means while reducing the overall complexity of +attention to O(n^{1.5}d) from O(n^2d) for sequence length n and hidden dimension +d. We show that our model outperforms comparable sparse attention models on +language modeling on Wikitext-103 (15.8 vs 18.3 perplexity) as well as on +image generation on ImageNet-64 (3.43 vs 3.44 bits/dim) while using fewer self-attention layers. +Code will be open-sourced on acceptance.",/pdf/682d907048338aedb4a6043e69e7792cea3a48ad.pdf,ICLR,2020,We propose a content-based sparse attention model and show improvements on language modeling and image generation. 
+rygVV205KQ,SylVIb39Fm,1538090000000.0,1545360000000.0,1444,Visual Imitation with a Minimal Adversary,"[""reedscot@google.com"", ""yusufaytar@google.com"", ""ziyu@google.com"", ""tpaine@google.com"", ""avdnoord@google.com"", ""tpfaff@google.com"", ""sergomez@google.com"", ""anovikov@google.com"", ""budden@google.com"", ""vinyals@google.com""]","[""Scott Reed"", ""Yusuf Aytar"", ""Ziyu Wang"", ""Tom Paine"", ""A\u00e4ron van den Oord"", ""Tobias Pfaff"", ""Sergio Gomez"", ""Alexander Novikov"", ""David Budden"", ""Oriol Vinyals""]","[""imitation"", ""from pixels"", ""adversarial""]","High-dimensional sparse reward tasks present major challenges for reinforcement learning agents. In this work we use imitation learning to address two of these challenges: how to learn a useful representation of the world e.g. from pixels, and how to explore efficiently given the rarity of a reward signal? We show that adversarial imitation can work well even in this high dimensional observation space. Surprisingly the adversary itself, acting as the learned reward function, can be tiny, comprising as few as 128 parameters, and can be easily trained using the most basic GAN formulation. Our approach removes limitations present in most contemporary imitation approaches: requiring no demonstrator actions (only video), no special initial conditions or warm starts, and no explicit tracking of any single demo. The proposed agent can solve a challenging robot manipulation task of block stacking from only video demonstrations and sparse reward, in which the non-imitating agents fail to learn completely. Furthermore, our agent learns much faster than competing approaches that depend on hand-crafted, staged dense reward functions, and also better compared to standard GAIL baselines. 
Finally, we develop a new adversarial goal recognizer that in some cases allows the agent to learn stacking without any task reward, purely from imitation.",/pdf/d2433f5e54c14a167852d8a7ac32028e53de92d0.pdf,ICLR,2019,"Imitation from pixels, with sparse or no reward, using off-policy RL and a tiny adversarially-learned reward function." +r1GbfhRqF7,B1lhu3wUFm,1538090000000.0,1547510000000.0,1239,Kernel Change-point Detection with Auxiliary Deep Generative Models,"[""wchang2@cs.cmu.edu"", ""chunlial@cs.cmu.edu"", ""yiming@cs.cmu.edu"", ""bapoczos@cs.cmu.edu""]","[""Wei-Cheng Chang"", ""Chun-Liang Li"", ""Yiming Yang"", ""Barnab\u00e1s P\u00f3czos""]","[""deep kernel learning"", ""generative models"", ""kernel two-sample test"", ""time series change-point detection""]","Detecting the emergence of abrupt property changes in time series is a challenging problem. Kernel two-sample test has been studied for this task which makes fewer assumptions on the distributions than traditional parametric approaches. However, selecting kernels is non-trivial in practice. Although kernel selection for the two-sample test has been studied, the insufficient samples in change point detection problem hinder the success of those developed kernel selection algorithms. In this paper, we propose KL-CPD, a novel kernel learning framework for time series CPD that optimizes a lower bound of test power via an auxiliary generative model. With deep kernel parameterization, KL-CPD endows kernel two-sample test with the data-driven kernel to detect different types of change-points in real-world applications. 
The proposed approach significantly outperformed other state-of-the-art methods in our comparative evaluation of benchmark datasets and simulation studies.",/pdf/e96b952c18a888cce887a9ca5f19d108a6730b45.pdf,ICLR,2019,"In this paper, we propose KL-CPD, a novel kernel learning framework for time series CPD that optimizes a lower bound of test power via an auxiliary generative model as a surrogate to the abnormal distribution. " +B1bgpzZAZ,Sk1ChMWR-,1509140000000.0,1518730000000.0,990,ElimiNet: A Model for Eliminating Options for Reading Comprehension with Multiple Choice Questions,"[""sohamp@cse.iitm.ac.in"", ""ananyasb@cse.iitm.ac.in"", ""preksha@cse.iitm.ac.in"", ""miteshk@cse.iitm.ac.in""]","[""Soham Parikh"", ""Ananya Sai"", ""Preksha Nema"", ""Mitesh M Khapra""]","[""Reading Comprehension"", ""Answering Multiple Choice Questions""]","The task of Reading Comprehension with Multiple Choice Questions, requires a human (or machine) to read a given \{\textit{passage, question}\} pair and select one of the $n$ given options. The current state of the art model for this task first computes a query-aware representation for the passage and then \textit{selects} the option which has the maximum similarity with this representation. However, when humans perform this task they do not just focus on option selection but use a combination of \textit{elimination} and \textit{selection}. Specifically, a human would first try to eliminate the most irrelevant option and then read the document again in the light of this new information (and perhaps ignore portions corresponding to the eliminated option). This process could be repeated multiple times till the reader is finally ready to select the correct option. We propose \textit{ElimiNet}, a neural network based model which tries to mimic this process. 
Specifically, it has gates which decide whether an option can be eliminated given the \{\textit{document, question}\} pair and if so it tries to make the document representation orthogonal to this eliminated option (akin to ignoring portions of the document corresponding to the eliminated option). The model makes multiple rounds of partial elimination to refine the document representation and finally uses a selection module to pick the best option. We evaluate our model on the recently released large scale RACE dataset and show that it outperforms the current state of the art model on 7 out of the 13 question types in this dataset. Further we show that taking an ensemble of our \textit{elimination-selection} based method with a \textit{selection} based method gives us an improvement of 7\% (relative) over the best reported performance on this dataset. +",/pdf/ec9693345d75f670ab35c40974afc583f0f4d12f.pdf,ICLR,2018,A model combining elimination and selection for answering multiple choice questions +rklaWn0qK7,HJlk5EpcYQ,1538090000000.0,1556180000000.0,1214,Learning Neural PDE Solvers with Convergence Guarantees,"[""junting@stanford.edu"", ""sjzhao@stanford.edu"", ""seismann@stanford.edu"", ""lucia.mirabella@siemens.com"", ""ermon@cs.stanford.edu""]","[""Jun-Ting Hsieh"", ""Shengjia Zhao"", ""Stephan Eismann"", ""Lucia Mirabella"", ""Stefano Ermon""]","[""Partial differential equation"", ""deep learning""]","Partial differential equations (PDEs) are widely used across the physical and computational sciences. Decades of research and engineering went into designing fast iterative solution methods. Existing solvers are general purpose, but may be sub-optimal for specific classes of problems. In contrast to existing hand-crafted solutions, we propose an approach to learn a fast iterative solver tailored to a specific domain. We achieve this goal by learning to modify the updates of an existing solver using a deep neural network. 
Crucially, our approach is proven to preserve strong correctness and convergence guarantees. After training on a single geometry, our model generalizes to a wide variety of geometries and boundary conditions, and achieves 2-3 times speedup compared to state-of-the-art solvers.",/pdf/998c5af0a4f8baea22609e1db83dba99a65a0508.pdf,ICLR,2019,We learn a fast neural solver for PDEs that has convergence guarantees. +SkgpBJrtvS,Skx797TuPr,1569440000000.0,1583910000000.0,1708,Contrastive Representation Distillation,"[""yonglong@mit.edu"", ""dilipkay@google.com"", ""phillipi@mit.edu""]","[""Yonglong Tian"", ""Dilip Krishnan"", ""Phillip Isola""]","[""Knowledge Distillation"", ""Representation Learning"", ""Contrastive Learning"", ""Mutual Information""]"," Often we wish to transfer representational knowledge from one neural network to another. Examples include distilling a large network into a smaller one, transferring knowledge from one sensory modality to a second, or ensembling a collection of models into a single estimator. Knowledge distillation, the standard approach to these problems, minimizes the KL divergence between the probabilistic outputs of a teacher and student network. We demonstrate that this objective ignores important structural knowledge of the teacher network. This motivates an alternative objective by which we train a student to capture significantly more information in the teacher's representation of the data. We formulate this objective as contrastive learning. Experiments demonstrate that our resulting new objective outperforms knowledge distillation on a variety of knowledge transfer tasks, including single model compression, ensemble distillation, and cross-modal transfer. 
When combined with knowledge distillation, our method sets a state of the art in many transfer tasks, sometimes even outperforming the teacher network.",/pdf/7c944bf8a4951d953b5c3fe5cc92bddf3eb4e40f.pdf,ICLR,2020,Representation/knowledge distillation by maximizing mutual information between teacher and student +HkxWXkStDB,HyeuHBn_DB,1569440000000.0,1577170000000.0,1606,Improving Robustness Without Sacrificing Accuracy with Patch Gaussian Augmentation,"[""iraphael@google.com"", ""dongyin@berkeley.edu"", ""pooleb@google.com"", ""gilmer@google.com"", ""cubuk@google.com""]","[""Raphael Gontijo Lopes"", ""Dong Yin"", ""Ben Poole"", ""Justin Gilmer"", ""Ekin D. Cubuk""]","[""Data Augmentation"", ""Out-of-distribution"", ""Robustness"", ""Generalization"", ""Computer Vision"", ""Corruption""]","Deploying machine learning systems in the real world requires both high accuracy on clean data and robustness to naturally occurring corruptions. While architectural advances have led to improved accuracy, building robust models remains challenging, involving major changes in training procedure and datasets. Prior work has argued that there is an inherent trade-off between robustness and accuracy, as exemplified by standard data augmentation techniques such as Cutout, which improves clean accuracy but not robustness, and additive Gaussian noise, which improves robustness but hurts accuracy. We introduce Patch Gaussian, a simple augmentation scheme that adds noise to randomly selected patches in an input image. Models trained with Patch Gaussian achieve state of the art on the CIFAR-10 and ImageNet Common Corruptions benchmarks while also maintaining accuracy on clean data. We find that this augmentation leads to reduced sensitivity to high frequency noise (similar to Gaussian) while retaining the ability to take advantage of relevant high frequency information in the image (similar to Cutout). 
We show it can be used in conjunction with other regularization methods and data augmentation policies such as AutoAugment. Finally, we find that the idea of restricting perturbations to patches can also be useful in the context of adversarial learning, yielding models without the loss in accuracy that is found with unconstrained adversarial training.",/pdf/26fc9b8228b82566d2cd3f3b879c34e3dd52792c.pdf,ICLR,2020,Simple augmentation method overcomes robustness/accuracy trade-off observed in literature and opens questions about the effect of training distribution on out-of-distribution generalization. +rkeqCoA5tX,Byxs-Aa5Y7,1538090000000.0,1545360000000.0,919,LEARNING GENERATIVE MODELS FOR DEMIXING OF STRUCTURED SIGNALS FROM THEIR SUPERPOSITION USING GANS,"[""msoltani@iastate.edu"", ""swayambhoo.jain@technicolor.com"", ""samba014@umn.edu""]","[""Mohammadreza Soltani"", ""Swayambhoo Jain"", ""Abhinav V. Sambasivan""]","[""Generative Models"", ""GANs"", ""Denoising"", ""Demixing"", ""Structured Recovery""]","Recently, Generative Adversarial Networks (GANs) have emerged as a popular alternative for modeling complex high dimensional distributions. Most of the existing works implicitly assume that the clean samples from the target distribution are easily available. However, in many applications, this assumption is violated. In this paper, we consider the problem of learning GANs under the observation setting when the samples from target distribution are given by the superposition of two structured components. We propose two novel frameworks: denoising-GAN and demixing-GAN. The denoising-GAN assumes access to clean samples from the second component and tries to learn the other distribution, whereas demixing-GAN learns the distribution of the components at the same time. 
Through comprehensive numerical experiments, we demonstrate that proposed frameworks can generate clean samples from unknown distributions, and provide competitive performance in tasks such as denoising, demixing, and compressive sensing.",/pdf/0a3e3ba0a1bef72a5324ff486bea2fdff1af56ea.pdf,ICLR,2019, +rJe2syrtvS,SyeOUyJFDB,1569440000000.0,1587940000000.0,1930,The Ingredients of Real World Robotic Reinforcement Learning,"[""henryzhu@berkeley.edu"", ""justinvyu@berkeley.edu"", ""abhigupta@berkeley.edu"", ""shah@eecs.berkeley.edu"", ""kristian.hartikainen@gmail.com"", ""avisingh@cs.berkeley.edu"", ""vikashplus@gmail.com"", ""svlevine@eecs.berkeley.edu""]","[""Henry Zhu"", ""Justin Yu"", ""Abhishek Gupta"", ""Dhruv Shah"", ""Kristian Hartikainen"", ""Avi Singh"", ""Vikash Kumar"", ""Sergey Levine""]","[""Reinforcement Learning"", ""Robotics""]","The success of reinforcement learning in the real world has been limited to instrumented laboratory scenarios, often requiring arduous human supervision to enable continuous learning. In this work, we discuss the required elements of a robotic system that can continually and autonomously improve with data collected in the real world, and propose a particular instantiation of such a system. Subsequently, we investigate a number of challenges of learning without instrumentation -- including the lack of episodic resets, state estimation, and hand-engineered rewards -- and propose simple, scalable solutions to these challenges. 
We demonstrate the efficacy of our proposed system on dexterous robotic manipulation tasks in simulation and the real world, and also provide an insightful analysis and ablation study of the challenges associated with this learning paradigm.",/pdf/c9e6956612c7e18c323ae939026d8457431406cb.pdf,ICLR,2020,System to learn robotic tasks in the real world with reinforcement learning without instrumentation +By14kuqxx,,1478290000000.0,1486250000000.0,450,Bit-Pragmatic Deep Neural Network Computing,"[""jorge.albericio@gmail.com"", ""judd@ece.utoronto.ca"", ""delmas1@ece.utoronto.ca"", ""sayeh@ece.utoronto.ca"", ""moshovos@ece.utoronto.ca""]","[""Jorge Albericio"", ""Patrick Judd"", ""Alberto Delmas"", ""Sayeh Sharify"", ""Andreas Moshovos""]","[""Deep learning"", ""Applications""]","We quantify a source of ineffectual computations when processing the multiplications of the convolutional layers in Deep Neural Networks (DNNs) and propose Pragmatic (PRA), an architecture that exploits it improving performance and energy efficiency. +The source of these ineffectual computations is best understood in the context of conventional multipliers which generate internally multiple terms, that is, products of the multiplicand and powers of two, which added together produce the final product. At runtime, many of these terms are zero as they are generated when the multiplicand is combined with the zero-bits of the multiplicator. While conventional bit-parallel multipliers calculate all terms in parallel to reduce individual product latency, PRA calculates only the non-zero terms resulting in a design whose execution time for convolutional layers is ideally proportional to the number of activation bits that are 1. Measurements demonstrate that for the convolutional layers on Convolutional Neural Networks and during inference, PRA improves performance by 4.3x over the DaDianNao (DaDN) accelerator and by 4.5x when DaDN uses an 8-bit quantized representation. 
DaDN was reported to be 300x faster than commodity graphics processors. + +",/pdf/e1f60d5888bde86192ea809184924b0af7947d5c.pdf,ICLR,2017,A hardware accelerator for DNNs whose execution time for convolutional layers is proportional to the number of activation *bits* that are 1. +BkedwoC5t7,Hke1cQ_qtQ,1538090000000.0,1545360000000.0,277,Formal Limitations on the Measurement of Mutual Information,"[""mcallester@ttic.edu"", ""stratos@ttic.edu""]","[""David McAllester"", ""Karl Stratos""]","[""mutual information"", ""predictive coding"", ""unsupervised learning"", ""predictive learning"", ""generalization bounds"", ""MINE"", ""DIM"", ""contrastive predictive coding""]","Motivated by applications to unsupervised learning, we consider the problem of measuring mutual information. Recent analysis has shown that naive kNN estimators of mutual information have serious statistical limitations motivating more refined methods. In this paper we prove that serious statistical limitations are inherent to any measurement method. More specifically, we show that any distribution-free high-confidence lower bound on mutual information cannot be larger than $O(\ln N)$ where $N$ is the size of the data sample. We also analyze the Donsker-Varadhan lower bound on KL divergence in particular and show that, when simple statistical considerations are taken into account, this bound can never produce a high-confidence value larger than $\ln N$. While large high-confidence lower bounds are impossible, in practice one can use estimators without formal guarantees. We suggest expressing mutual information as a difference of entropies and using cross entropy as an entropy estimator. We observe that, although cross entropy is only an upper bound on entropy, cross-entropy estimates converge to the true cross entropy at the rate of $1/\sqrt{N}$.",/pdf/bc72ba1d365853a69f91190c9264e375f6d460ce.pdf,ICLR,2019,We give a theoretical analysis of the measurement and optimization of mutual information. 
+9ITXiTrAoT,Oiue_JdLc7d,1601310000000.0,1616010000000.0,2792,Multi-timescale Representation Learning in LSTM Language Models,"[""shivangi@utexas.edu"", ""vy.vo@intel.com"", ""~Javier_S._Turek1"", ""~Alexander_Huth1""]","[""Shivangi Mahto"", ""Vy Ai Vo"", ""Javier S. Turek"", ""Alexander Huth""]","[""Language Model"", ""LSTM"", ""timescales""]","Language models must capture statistical dependencies between words at timescales ranging from very short to very long. Earlier work has demonstrated that dependencies in natural language tend to decay with distance between words according to a power law. However, it is unclear how this knowledge can be used for analyzing or designing neural network language models. In this work, we derived a theory for how the memory gating mechanism in long short-term memory (LSTM) language models can capture power law decay. We found that unit timescales within an LSTM, which are determined by the forget gate bias, should follow an Inverse Gamma distribution. Experiments then showed that LSTM language models trained on natural English text learn to approximate this theoretical distribution. Further, we found that explicitly imposing the theoretical distribution upon the model during training yielded better language model perplexity overall, with particular improvements for predicting low-frequency (rare) words. Moreover, the explicit multi-timescale model selectively routes information about different types of words through units with different timescales, potentially improving model interpretability. These results demonstrate the importance of careful, theoretically-motivated analysis of memory and timescale in language models.",/pdf/6faff0f37219bcee41b257a3d80d7eeb3df0e2d6.pdf,ICLR,2021,This work presents a theoretically-motivated analysis of memory and timescale in LSTM language models. 
+SyMWn05F7,r1era9T5F7,1538090000000.0,1551130000000.0,1147,Learning Exploration Policies for Navigation,"[""taoc1@andrew.cmu.edu"", ""sgupta@eecs.berkeley.edu"", ""abhinavg@cs.cmu.edu""]","[""Tao Chen"", ""Saurabh Gupta"", ""Abhinav Gupta""]","[""Exploration"", ""navigation"", ""reinforcement learning""]","Numerous past works have tackled the problem of task-driven navigation. But, how to effectively explore a new environment to enable a variety of down-stream tasks has received much less attention. In this work, we study how agents can autonomously explore realistic and complex 3D environments without the context of task-rewards. We propose a learning-based approach and investigate different policy architectures, reward functions, and training paradigms. We find that use of policies with spatial memory that are bootstrapped with imitation learning and finally finetuned with coverage rewards derived purely from on-board sensors can be effective at exploring novel environments. We show that our learned exploration policies can explore better than classical approaches based on geometry alone and generic learning-based exploration techniques. Finally, we also show how such task-agnostic exploration can be used for down-stream tasks. Videos are available at https://sites.google.com/view/exploration-for-nav/.",/pdf/594683f9f034106169e1df832730b19b82779d22.pdf,ICLR,2019, +B1lgUkBFwr,rJeMYN6uPr,1569440000000.0,1577170000000.0,1716,Unsupervised domain adaptation with imputation,"[""m.kirchmeyer@criteo.com"", ""patrick.gallinari@lip6.fr"", ""a.rakotomamonjy@criteo.com"", ""a.mantrach@criteo.com""]","[""Matthieu Kirchmeyer"", ""Patrick Gallinari"", ""Alain Rakotomamonjy"", ""Amin Mantrach""]","[""domain adaptation"", ""imputation"", ""missing data"", ""advertising""]","Motivated by practical applications, we consider unsupervised domain adaptation for classification problems, in the presence of missing data in the target domain. 
More precisely, we focus on the case where there is a domain shift between source and target domains, while some components of the target data are systematically absent. We propose a way to impute non-stochastic missing data for a classification task by leveraging supervision from a complete source domain through domain adaptation. We introduce a single model performing joint domain adaptation, imputation and classification which is shown to perform well under various representative divergence families (H-divergence, Optimal Transport). We perform experiments on two families of datasets: a classical digit classification benchmark commonly used in domain adaptation papers and real world digital advertising datasets, on which we evaluate our model’s classification performance in an unsupervised setting. We analyze its behavior showing the benefit of explicitly imputing non-stochastic missing data jointly with domain adaptation.",/pdf/03ca5b14535e841ec2d9398c1af8a013816b5a59.pdf,ICLR,2020,We propose a way to jointly tackle unsupervised domain adaptation and non-stochastic missing data in a target domain using distant supervision from a complete source domain. +om1guSP_ray,kyj_FXGz2h,1601310000000.0,1614990000000.0,645,Graph Pooling by Edge Cut,"[""~Alexis_Galland1"", ""~marc_lelarge1""]","[""Alexis Galland"", ""marc lelarge""]","[""graph"", ""deep"", ""learning"", ""pooling""]","Graph neural networks (GNNs) are very efficient at solving several tasks in graphs such as node classification or graph classification. They come from an adaptation of convolutional neural networks on images to graph structured data. These models are very effective at finding patterns in images that can discriminate images from each other. Another aspect leading to their success is their ability to uncover hierarchical structures. This comes from the pooling operation that produces different versions of the input image at different scales. 
The same way, we want to identify patterns at different scales in graphs in order to improve the classification accuracy. Compared to the case of images, it is not trivial to develop a pooling layer on graphs. This is mainly due to the fact that in graphs nodes are not ordered and have irregular neighborhoods. To alleviate this issue, we propose a pooling layer based on edge cuts in graphs. This pooling layer works by computing edge scores that correspond to the importance of edges in the process of information propagation of the GNN. Moreover, we define a regularization function that aims at producing edge scores that minimize the minCUT problem. Finally, through extensive experiments we show that this architecture can compete with state-of-the-art methods.",/pdf/d893a32b0837da6ed0d81a5e6b8da9fc042c3486.pdf,ICLR,2021,A pooling layer for graph neural networks based on edge cuts. +S1jmAotxg,,1478240000000.0,1491260000000.0,129,Stick-Breaking Variational Autoencoders,"[""enalisni@uci.edu"", ""smyth@ics.uci.edu""]","[""Eric Nalisnick"", ""Padhraic Smyth""]","[""Deep learning"", ""Unsupervised Learning"", ""Semi-Supervised Learning""]","We extend Stochastic Gradient Variational Bayes to perform posterior inference for the weights of Stick-Breaking processes. This development allows us to define a Stick-Breaking Variational Autoencoder (SB-VAE), a Bayesian nonparametric version of the variational autoencoder that has a latent representation with stochastic dimensionality. We experimentally demonstrate that the SB-VAE, and a semi-supervised variant, learn highly discriminative latent representations that often outperform the Gaussian VAE’s.",/pdf/7763c4fd20502cd6801d446a1d63c59eb9adab1d.pdf,ICLR,2017,We define a variational autoencoder variant with stick-breaking latent variables thereby giving it adaptive width. +HJlF3h4FvB,HklFhrvGPr,1569440000000.0,1577170000000.0,197,Distillation $\approx$ Early Stopping? 
Harvesting Dark Knowledge Utilizing Anisotropic Information Retrieval For Overparameterized NN,"[""dongbin@math.pku.edu.cn"", ""houjikai@pku.edu.cn"", ""yplu@stanford.edu"", ""zhzhang@math.pku.edu.cn""]","[""Bin Dong"", ""Jikai Hou"", ""Yiping Lu"", ""Zhihua Zhang""]","[""Distillation"", ""Learning Theory"", ""Corrupted Label""]","Distillation is a method to transfer knowledge from one model to another and often achieves higher accuracy with the same capacity. In this paper, we aim to provide a theoretical understanding on what mainly helps with the distillation. Our answer is ""early stopping"". Assuming that the teacher network is overparameterized, we argue that the teacher network is essentially harvesting dark knowledge from the data via early stopping. This can be justified by a new concept, Anisotropic Information Retrieval (AIR), which means that the neural network tends to fit the informative information first and the non-informative information (including noise) later. Motivated by the recent development on theoretically analyzing overparameterized neural networks, we can characterize AIR by the eigenspace of the Neural Tangent Kernel (NTK). AIR facilitates a new understanding of distillation. With that, we further utilize distillation to refine noisy labels. We propose a self-distillation algorithm to sequentially distill knowledge from the network in the previous training epoch to avoid memorizing the wrong labels. We also demonstrate, both theoretically and empirically, that self-distillation can benefit from more than just early stopping. Theoretically, we prove convergence of the proposed algorithm to the ground truth labels for randomly initialized overparameterized neural networks in terms of l2 distance, while the previous result was on convergence in 0-1 loss. The theoretical result ensures the learned neural network enjoys a margin on the training data which leads to better generalization. 
Empirically, we achieve better testing accuracy and entirely avoid early stopping which makes the algorithm more user-friendly. +",/pdf/79c837a0c05725a8a2fbc735ac6fe81ed93ac69f.pdf,ICLR,2020,"theoretically understand the regularization effect of distillation. We show that early stopping is essential in this process. From this perspective, we developed a distillation method for learning with corrupted Label with theoretical guarantees." +rygk9oA9Ym,BkxRDHo5tm,1538090000000.0,1545360000000.0,497,3D-RelNet: Joint Object and Relational Network for 3D Prediction,"[""nileshk@cs.cmu.edu"", ""ishan@cmu.edu"", ""shubhtuls@fb.com"", ""abhinavg@cs.cmu.edu""]","[""Nilesh Kulkarni"", ""Ishan Misra"", ""Shubham Tulsiani"", ""Abhinav Gupta""]","[""3D Reconstruction"", ""3D Scene Understanding"", ""Relative Prediction""]","We propose an approach to predict the 3D shape and pose for the objects present in a scene. Existing learning based methods that pursue this goal make independent predictions per object, and do not leverage the relationships amongst them. We argue that reasoning about these relationships is crucial, and present an approach to incorporate these in a 3D prediction framework. In addition to independent per-object predictions, we predict pairwise relations in the form of relative 3D pose, and demonstrate that these can be easily incorporated to improve object level estimates. 
We report performance across different datasets (SUNCG, NYUv2), and show that our approach significantly improves over independent prediction approaches while also outperforming alternate implicit reasoning methods.",/pdf/1e15304388dfa26382448362f4f5a1d57c4d277b.pdf,ICLR,2019,We reason about relative spatial relationships between the objects in a scene to produce better 3D predictions +H1xsSjC9Ym,H1xvAyiKF7,1538090000000.0,1550250000000.0,116,Learning to Understand Goal Specifications by Modelling Reward,"[""dimabgv@gmail.com"", ""felixhill@google.com"", ""leike@google.com"", ""edwardhughes@google.com"", ""seyedarian.hosseini@umontreal.ca"", ""pushmeet@google.com"", ""etg@google.com""]","[""Dzmitry Bahdanau"", ""Felix Hill"", ""Jan Leike"", ""Edward Hughes"", ""Arian Hosseini"", ""Pushmeet Kohli"", ""Edward Grefenstette""]","[""instruction following"", ""reward modelling"", ""language understanding""]","Recent work has shown that deep reinforcement-learning agents can learn to follow language-like instructions from infrequent environment rewards. However, this places on environment designers the onus of designing language-conditional reward functions which may not be easily or tractably implemented as the complexity of the environment and the language scales. To overcome this limitation, we present a framework within which instruction-conditional RL agents are trained using rewards obtained not from the environment, but from reward models which are jointly trained from expert examples. As reward models improve, they learn to accurately reward agents for completing tasks for environment configurations---and for instructions---not present amongst the expert data. This framework effectively separates the representation of what instructions require from how they can be executed. +In a simple grid world, it enables an agent to learn a range of commands requiring interaction with blocks and understanding of spatial relations and underspecified abstract arrangements. 
We further show the method allows our agent to adapt to changes in the environment without requiring new expert examples.",/pdf/02befcff74b4bc496cdf3fb1d52cb7a29597fade.pdf,ICLR,2019,"We propose AGILE, a framework for training agents to perform instructions from examples of respective goal-states." +jxdXSW9Doc,jRI8o2Ygmv,1601310000000.0,1613830000000.0,1362,Effective Distributed Learning with Random Features: Improved Bounds and Algorithms,"[""liuyonggsai@ruc.edu.cn"", ""liujiankun@iie.ac.cn"", ""sq.wang@siat.ac.cn""]","[""Yong Liu"", ""Jiankun Liu"", ""Shuqiang Wang""]","[""Risk bound"", ""statistical learning theory"", ""kernel methods""]","In this paper, we study the statistical properties of distributed kernel ridge regression together with random features (DKRR-RF), and obtain optimal generalization bounds under the basic setting, which can substantially relax the restriction on the number of local machines in the existing state-of-the-art bounds. Specifically, we first show that the simple combination of divide-and-conquer technique and random features can achieve the same statistical accuracy as the exact KRR in expectation requiring only $\mathcal{O}(|\mathcal{D}|)$ memory and $\mathcal{O}(|\mathcal{D}|^{1.5})$ time. Then, beyond the generalization bounds in expectation that demonstrate the average information for multiple trials, we derive generalization bounds in probability to capture the learning performance for a single trial. 
Finally, we propose an effective communication strategy to further improve the performance of DKRR-RF, and validate the theoretical bounds via numerical experiments.",/pdf/79a6a49d47bc7c0250c2c92272851fefc0880cb6.pdf,ICLR,2021,This paper focuses on the studies of the statistical properties of distributed KRR together with random features +q_S44KLQ_Aa,-D-6TdXkz1k,1601310000000.0,1616760000000.0,1614,Neurally Augmented ALISTA,"[""~Freya_Behrens1"", ""~Jonathan_Sauder2"", ""~Peter_Jung2""]","[""Freya Behrens"", ""Jonathan Sauder"", ""Peter Jung""]","[""compressed sensing"", ""sparse reconstruction"", ""unrolled algorithms"", ""learned ISTA""]"," It is well-established that many iterative sparse reconstruction algorithms can be unrolled to yield a learnable neural network for improved empirical performance. A prime example is learned ISTA (LISTA) where weights, step sizes and thresholds are learned from training data. Recently, Analytic LISTA (ALISTA) has been introduced, combining the strong empirical performance of a fully learned approach like LISTA, while retaining theoretical guarantees of classical compressed sensing algorithms and significantly reducing the number of parameters to learn. However, these parameters are trained to work in expectation, often leading to suboptimal reconstruction of individual targets. In this work we therefore introduce Neurally Augmented ALISTA, in which an LSTM network is used to compute step sizes and thresholds individually for each target vector during reconstruction. This adaptive approach is theoretically motivated by revisiting the recovery guarantees of ALISTA. 
We show that our approach further improves empirical performance in sparse reconstruction, in particular outperforming existing algorithms by an increasing margin as the compression ratio becomes more challenging.",/pdf/47516abe23cbccb7a1e7ed8520cf57ee2f68581f.pdf,ICLR,2021,"We introduce Neurally Augmented ALISTA, extending ALISTA to compute adaptive parameters to achieve improved recovery of individual sparse target vectors." +rkzUYjCcFm,BJlFMkjcFQ,1538090000000.0,1545360000000.0,449,FAST OBJECT LOCALIZATION VIA SENSITIVITY ANALYSIS,"[""mebrahimpour@ucmerced.edu"", ""dnoelle@ucmerced.edu""]","[""Mohammad K. Ebrahimpour"", ""David C. Noelle""]","[""Internal Representations"", ""Sensitivity Analysis"", ""Object Detection""]","Deep Convolutional Neural Networks (CNNs) have been repeatedly shown to perform well on image classification tasks, successfully recognizing a broad array of objects when given sufficient training data. Methods for object localization, however, are still in need of substantial improvement. Common approaches to this problem involve the use of a sliding window, sometimes at multiple scales, providing input to a deep CNN trained to classify the contents of the window. In general, these approaches are time consuming, requiring many classification calculations. In this paper, we offer a fundamentally different approach to the localization of recognized objects in images. Our method is predicated on the idea that a deep CNN capable of recognizing an object must implicitly contain knowledge about object location in its connection weights. We provide a simple method to interpret classifier weights in the context of individual classified images. 
This method involves the calculation of the derivative of network generated activation patterns, such as the activation of output class label units, with regard to each input pixel, performing a sensitivity analysis that identifies the pixels that, in a local sense, have the greatest influence on internal representations and object recognition. These derivatives can be efficiently computed using a single backward pass through the deep CNN classifier, producing a sensitivity map of the image. We demonstrate that a simple linear mapping can be learned from sensitivity maps to bounding box coordinates, localizing the recognized object. Our experimental results, using real-world data sets for which ground truth localization information is known, reveal competitive accuracy from our fast technique.",/pdf/b67e00576a54d07d3da7d6beaf0177f6cada1cca.pdf,ICLR,2019,Proposing a novel object localization (detection) approach based on interpreting the deep CNN using internal representation and network's thoughts +SkOb1Fl0Z,rkvZkKx0W,1509100000000.0,1518730000000.0,323,A Flexible Approach to Automated RNN Architecture Generation,"[""msch@mit.edu"", ""smerity@smerity.com"", ""james.bradbury@salesforce.com"", ""richard@socher.org""]","[""Martin Schrimpf"", ""Stephen Merity"", ""James Bradbury"", ""Richard Socher""]","[""reinforcement learning"", ""architecture search"", ""ranking function"", ""recurrent neural networks"", ""recursive neural networks""]","The process of designing neural architectures requires expert knowledge and extensive trial and error. +While automated architecture search may simplify these requirements, the recurrent neural network (RNN) architectures generated by existing methods are limited in both flexibility and components. +We propose a domain-specific language (DSL) for use in automated architecture search which can produce novel RNNs of arbitrary depth and width. 
+The DSL is flexible enough to define standard architectures such as the Gated Recurrent Unit and Long Short Term Memory and allows the introduction of non-standard RNN components such as trigonometric curves and layer normalization. Using two different candidate generation techniques, random search with a ranking function and reinforcement learning, +we explore the novel architectures produced by the RNN DSL for language modeling and machine translation domains. +The resulting architectures do not follow human intuition yet perform well on their targeted tasks, suggesting the space of usable RNN architectures is far larger than previously assumed.",/pdf/9873717312898f8fe2f78ebc1a27b40cf8105a8e.pdf,ICLR,2018,"We define a flexible DSL for RNN architecture generation that allows RNNs of varying size and complexity and propose a ranking function that represents RNNs as recursive neural networks, simulating their performance to decide on the most promising architectures." +HJflg30qKX,r1lide0cKQ,1538090000000.0,1550990000000.0,1043,Gradient descent aligns the layers of deep linear networks,"[""ziweiji2@illinois.edu"", ""mjt@illinois.edu""]","[""Ziwei Ji"", ""Matus Telgarsky""]","[""implicit regularization"", ""alignment of layers"", ""deep linear networks"", ""gradient descent"", ""separable data""]","This paper establishes risk convergence and asymptotic weight matrix alignment --- a form of implicit regularization --- of gradient flow and gradient descent when applied to deep linear networks on linearly separable data. In more detail, for gradient flow applied to strictly decreasing loss functions (with similar results for gradient descent with particular decreasing step sizes): +(i) the risk converges to 0; +(ii) the normalized i-th weight matrix asymptotically equals its rank-1 approximation u_iv_i^T; +(iii) these rank-1 matrices are aligned across layers, meaning |v_{i+1}^T u_i| -> 1. 
+In the case of the logistic loss (binary cross entropy), more can be said: the linear function induced by the network --- the product of its weight matrices --- converges to the same direction as the maximum margin solution. This last property was identified in prior work, but only under assumptions on gradient descent which here are implied by the alignment phenomenon.",/pdf/951ccaeab553a8fc388a416d5d825e6f19fa43ed.pdf,ICLR,2019, +S1lVniC5Y7,B1xgz165tm,1538090000000.0,1545360000000.0,704,From Nodes to Networks: Evolving Recurrent Neural Networks,"[""aditya@cs.utexas.edu"", ""jasonzliang@utexas.edu"", ""risto@cs.utexas.edu""]","[""Aditya Rawal"", ""Jason Liang"", ""Risto Miikkulainen""]","[""Recurrent neural networks"", ""evolutionary algorithms"", ""genetic programming""]","Gated recurrent networks such as those composed of Long Short-Term Memory +(LSTM) nodes have recently been used to improve state of the art in many sequential +processing tasks such as speech recognition and machine translation. However, +the basic structure of the LSTM node is essentially the same as when it was +first conceived 25 years ago. Recently, evolutionary and reinforcement learning +mechanisms have been employed to create new variations of this structure. This +paper proposes a new method, evolution of a tree-based encoding of the gated +memory nodes, and shows that it makes it possible to explore new variations more +effectively than other methods. The method discovers nodes with multiple recurrent +paths and multiple memory cells, which lead to significant improvement in the +standard language modeling benchmark task. Remarkably, this node did not perform +well in another task, music modeling, but it was possible to evolve a different +node that did, demonstrating that the approach discovers customized structure for +each task. 
The paper also shows how the search process can be sped up by +training an LSTM network to estimate performance of candidate structures, and +by encouraging exploration of novel solutions. Thus, evolutionary design of complex +neural network structures promises to improve performance of deep learning +architectures beyond human ability to do so.",/pdf/5fa938d04617bb6f945d2558a6b1a7a17bd9264a.pdf,ICLR,2019,Genetic programming to evolve new recurrent nodes for language and music. Uses an LSTM model to predict the performance of the recurrent node. +GzMUD_GGvJN,KDYH2loFbKX,1601310000000.0,1614990000000.0,3285,On the Importance of Distraction-Robust Representations for Robot Learning,"[""~Andy_Wang1"", ""~Antoine_Cully1""]","[""Andy Wang"", ""Antoine Cully""]","[""Unsupervised Representation Learning"", ""Robot Control"", ""Quality-Diversity""]","Representation Learning methods can allow the application of Reinforcement Learning algorithms when a high dimensionality in a robot's perceptions would otherwise prove prohibitive. Consequently, unsupervised Representation Learning components often feature in robot control algorithms that assume high-dimensional camera images as the principal source of information. +In their design and performance, these algorithms often benefit from the controlled nature of the simulation or laboratory conditions they are evaluated in. However, these settings fail to acknowledge the stochasticity of most real-world environments. +In this work, we introduce the concept of Distraction-Robust Representation Learning. We argue that environment noise and other distractions require learned representations to encode the robot's expected perceptions rather than the observed ones. Our experimental evaluations demonstrate that representations learned with a traditional dimensionality reduction algorithm are strongly susceptible to distractions in a robot's environment. 
+We propose an Encoder-Decoder architecture that produces representations that allow the learning outcomes of robot control tasks to remain unaffected by these distractions.",/pdf/aa70415fcd438f06e2a1918998ef1a4be2ba2629.pdf,ICLR,2021,This paper introduces the concept of Distraction-Robust Representation Learning and proposes a simple and effective architecture that can be applied to robot learning algorithms. +B1xGxgSYvH,rJe3qhJKDH,1569440000000.0,1577170000000.0,2091,Domain-Invariant Representations: A Look on Compression and Weights,"[""vbouvier@sidetrade.com"", ""celine.hudelot@centralesupelec.fr"", ""cchastagnol@sidetrade.com"", ""pveryranchet@gmail.com"", ""myriam.tami@centralesupelec.fr""]","[""Victor Bouvier"", ""C\u00e9line Hudelot"", ""Cl\u00e9ment Chastagnol"", ""Philippe Very"", ""Myriam Tami""]","[""Domain Adaptation"", ""Invariant Representation"", ""Compression"", ""Machine Learning Theory""]"," Learning Invariant Representations to adapt deep classifiers of a source domain to a new target domain has recently attracted much attention. In this paper, we show that the search for invariance favors the compression of representations. We point out this may have a bad impact on adaptability of representations expressed as a minimal combined domain error. By considering the risk of compression, we show that weighting representations can align representation distributions without impacting their adaptability. This supports the claim that representation invariance is too strict a constraint. First, we introduce a new bound on the target risk that reveals a trade-off between compression and invariance of learned representations. More precisely, our results show that the adaptability of a representation can be better controlled when the compression risk is taken into account. In contrast, preserving adaptability may overestimate the risk of compression that makes the bound impracticable. 
We support these statements with a theoretical analysis illustrated on a standard domain adaptation benchmark. Second, we show that learning weighted representations plays a key role in relaxing the constraint of invariance and then preserving the risk of compression. Taking advantage of this trade-off may open up promising directions for the design of new adaptation methods.",/pdf/31abfe196eb69ac5c8adfa457db74ac1a913f6fb.pdf,ICLR,2020,We introduce a new theoretical bound of the target risk for domain invariant representation which emphasizes both the role of compression and weights. +r1IRctqxg,,1478300000000.0,1485200000000.0,518,Sample Importance in Training Deep Neural Networks,"[""tgao@cs.unc.edu"", ""vjojic@cs.unc.edu""]","[""Tianxiang Gao"", ""Vladimir Jojic""]","[""Deep learning"", ""Supervised Learning""]","The contribution of each sample during model training varies across training iterations and the model's parameters. We define the concept of sample importance as the change in parameters induced by a sample. In this paper, we explored the sample importance in training deep neural networks using stochastic gradient descent. We found that ""easy"" samples -- samples that are correctly and confidently classified at the end of the training -- shape parameters closer to the output, while the ""hard"" samples impact parameters closer to the input to the network. Further, ""easy"" samples are relevant in the early training stages, and ""hard"" in the late training stage. Further, we show that constructing batches which contain samples of comparable difficulties tends to be a poor strategy compared to maintaining a mix of both hard and easy samples in all of the batches. 
Interestingly, this contradicts some of the results on curriculum learning which suggest that ordering training examples in terms of difficulty can lead to better performance.",/pdf/1c0e83b5b70d7190adac5254c0d44fc2e498dd55.pdf,ICLR,2017, +H1g6kaVKvH,H1xng_iHDS,1569440000000.0,1577170000000.0,318,Learning with Long-term Remembering: Following the Lead of Mixed Stochastic Gradient,"[""yug185@eng.ucsd.edu"", ""mingrui-liu@uiowa.edu"", ""tianbao-yang@uiowa.edu"", ""tajana@ucsd.edu""]","[""Yunhui Guo"", ""Mingrui Liu"", ""Tianbao Yang"", ""Tajana Rosing""]","[""lifelong learning"", ""continual learning""]","Current deep neural networks can achieve remarkable performance on a single task. However, when the deep neural network is continually trained on a sequence of tasks, it seems to gradually forget the previous learned knowledge. This phenomenon is referred to as catastrophic forgetting and motivates the field called lifelong learning. The central question in lifelong learning is how to enable deep neural networks to maintain performance on old tasks while learning a new task. In this paper, we introduce a novel and effective lifelong learning algorithm, called MixEd stochastic GrAdient (MEGA), which allows deep neural networks to acquire the ability of retaining performance on old tasks while learning new tasks. MEGA modulates the balance between old tasks and the new task by integrating the current gradient with the gradient computed on a small reference episodic memory. Extensive experimental results show that the proposed MEGA algorithm significantly advances the state-of-the-art on all four commonly used life-long learning benchmarks, reducing the error by up to 18%.",/pdf/88e73de8e94a7b9e4c474cb98a3b01488736fbd1.pdf,ICLR,2020,A novel and effective lifelong learning algorithm which achieves the state-of-the-art results on several benchmarks. 
+SkeyppEFvS,rygHguxuwS,1569440000000.0,1583910000000.0,804,CoPhy: Counterfactual Learning of Physical Dynamics,"[""fabien.baradel@insa-lyon.fr"", ""nneverova@fb.com"", ""julien.mille@insa-cvl.fr"", ""mori@cs.sfu.ca"", ""christian.wolf@insa-lyon.fr""]","[""Fabien Baradel"", ""Natalia Neverova"", ""Julien Mille"", ""Greg Mori"", ""Christian Wolf""]","[""intuitive physics"", ""visual reasoning""]","Understanding causes and effects in mechanical systems is an essential component of reasoning in the physical world. This work poses a new problem of counterfactual learning of object mechanics from visual input. We develop the CoPhy benchmark to assess the capacity of the state-of-the-art models for causal physical reasoning in a synthetic 3D environment and propose a model for learning the physical dynamics in a counterfactual setting. Having observed a mechanical experiment that involves, for example, a falling tower of blocks, a set of bouncing balls or colliding objects, we learn to predict how its outcome is affected by an arbitrary intervention on its initial conditions, such as displacing one of the objects in the scene. The alternative future is predicted given the altered past and a latent representation of the confounders learned by the model in an end-to-end fashion with no supervision. 
We compare against feedforward video prediction baselines and show how observing alternative experiences allows the network to capture latent physical properties of the environment, which results in significantly more accurate predictions at the level of super human performance.",/pdf/f00a0eeedb2a7f893adea8ba02e04d6836be26b0.pdf,ICLR,2020, +yoVo1fThmS1,4Ag0bzumDF,1601310000000.0,1614990000000.0,2886,Novelty Detection via Robust Variational Autoencoding,"[""~Chieh-Hsin_Lai1"", ""~Dongmian_Zou1"", ""~Gilad_Lerman1""]","[""Chieh-Hsin Lai"", ""Dongmian Zou"", ""Gilad Lerman""]","[""novelty detection"", ""variational autoencoding"", ""robustness"", ""Wasserstein metric"", ""one-class classification"", ""semi-supervised anomaly detection""]","We propose a new method for novelty detection that can tolerate high corruption of the training points, whereas previous works assumed either no or very low corruption. Our method trains a robust variational autoencoder (VAE), which aims to generate a model for the uncorrupted training points. To gain robustness to high corruption, we incorporate the following four changes to the common VAE: 1. Extracting crucial features of the latent code by a carefully designed dimension reduction component for distributions; 2. Modeling the latent distribution as a mixture of Gaussian low-rank inliers and full-rank outliers, where the testing only uses the inlier model; 3. Applying the Wasserstein-1 metric for regularization, instead of the Kullback-Leibler (KL) divergence; and 4. Using a least absolute deviation error for reconstruction. We establish both robustness to outliers and suitability to low-rank modeling of the Wasserstein metric as opposed to the KL divergence. 
We illustrate state-of-the-art results on standard benchmarks for novelty detection.",/pdf/f04c03a02f1ef3dd2a4a0162767a43ec05621390.pdf,ICLR,2021,A novel method for novelty detection which allows high corruption of the training set +Syx7A3NFvH,HyeL69F4wS,1569440000000.0,1583910000000.0,259,Multi-agent Reinforcement Learning for Networked System Control,"[""cts198859@hotmail.com"", ""csandeep@stanford.edu"", ""skatti@stanford.edu""]","[""Tianshu Chu"", ""Sandeep Chinchali"", ""Sachin Katti""]","[""deep reinforcement learning"", ""multi-agent reinforcement learning"", ""decision and control""]","This paper considers multi-agent reinforcement learning (MARL) in networked system control. Specifically, each agent learns a decentralized control policy based on local observations and messages from connected neighbors. We formulate such a networked MARL (NMARL) problem as a spatiotemporal Markov decision process and introduce a spatial discount factor to stabilize the training of each local agent. Further, we propose a new differentiable communication protocol, called NeurComm, to reduce information loss and non-stationarity in NMARL. Based on experiments in realistic NMARL scenarios of adaptive traffic signal control and cooperative adaptive cruise control, an appropriate spatial discount factor effectively enhances the learning curves of non-communicative MARL algorithms, while NeurComm outperforms existing communication protocols in both learning efficiency and control performance.",/pdf/c055354b9e1fe7ae23b2afc99bdc0065e561e62e.pdf,ICLR,2020,This paper proposes a new formulation and a new communication protocol for networked multi-agent control problems +rkgt0REKwS,r1elcY5uDS,1569440000000.0,1583910000000.0,1438,Curriculum Loss: Robust Learning and Generalization against Label Corruption,"[""lv_yueming@outlook.com"", ""ivor.tsang@uts.edu.au""]","[""Yueming Lyu"", ""Ivor W. 
Tsang""]","[""Curriculum Learning"", ""deep learning""]","Deep neural networks (DNNs) have great expressive power, which can even memorize samples with wrong labels. It is vitally important to reiterate robustness and generalization in DNNs against label corruption. To this end, this paper studies the 0-1 loss, which has a monotonic relationship between empirical adversary (reweighted) risk (Hu et al. 2018). Although the 0-1 loss is robust to outliers, it is also difficult to optimize. To efficiently optimize the 0-1 loss while keeping its robust properties, we propose a very simple and efficient loss, i.e. curriculum loss (CL). Our CL is a tighter upper bound of the 0-1 loss compared with conventional summation based surrogate losses. Moreover, CL can adaptively select samples for stagewise training. As a result, our loss can be deemed as a novel perspective of curriculum sample selection strategy, which bridges a connection between curriculum learning and robust learning. Experimental results on noisy MNIST, CIFAR10 and CIFAR100 dataset validate the robustness of the proposed loss.",/pdf/800966a39eedb870f2172779e97ac57edbff69e5.pdf,ICLR,2020,A novel loss bridges curriculum learning and robust learning +rkg6PhNKDr,S1gDAasnBS,1569440000000.0,1577170000000.0,21,HOW IMPORTANT ARE NETWORK WEIGHTS? TO WHAT EXTENT DO THEY NEED AN UPDATE?,"[""fawaz.sammani@aol.com"", ""elsayedmahmoud@aol.com"", ""abdelsalam.h.a.a@gmail.com""]","[""Fawaz Sammani"", ""Mahmoud Elsayed"", ""Abdelsalam Hamdi""]","[""weights update"", ""weights importance"", ""weight freezing""]","In the context of optimization, a gradient of a neural network indicates the amount a specific weight should change with respect to the loss. Therefore, small gradients indicate a good value of the weight that requires no change and can be kept frozen during training. This paper provides an experimental study on the importance of a neural network weights, and to which extent do they need to be updated. 
We wish to show that starting from the third epoch, freezing weights which have no informative gradient and are less likely to be changed during training, results in a very slight drop in the overall accuracy (and sometimes an improvement). We experiment on the MNIST, CIFAR10 and Flickr8k datasets using several architectures (VGG19, +ResNet-110 and DenseNet-121). On CIFAR10, we show that freezing 80% of the VGG19 network parameters from the third epoch onwards results in a 0.24% drop in accuracy, while freezing 50% of ResNet-110 parameters results in a 0.9% drop in accuracy and finally freezing 70% of DenseNet-121 parameters results in a 0.57% drop in accuracy. Furthermore, to experiment with real-life applications, we train an image captioning model with attention mechanism on the Flickr8k dataset using LSTM networks, freezing 60% of the parameters from the third epoch onwards, resulting in a better BLEU-4 score than the fully trained model. Our source code can be found in the appendix.",/pdf/815e43289650dde28f32a3def294fe96622c38a8.pdf,ICLR,2020,"An experimental paper that proves the amount of redundant weights that can be frozen from the third epoch only, with only a very slight drop in accuracy." +rJgCOySYwH,rJlhmQ0_DH,1569440000000.0,1577170000000.0,1824,Function Feature Learning of Neural Networks,"[""wanggc3@mail2.sysu.edu.cn"", ""stsljh@mail.sysu.edu.cn"", ""wanggrun@mail2.sysu.edu.cn"", ""liangwq8@mail2.sysu.edu.cn""]","[""Guangcong Wang"", ""Jianhuang Lai"", ""Guangrun Wang"", ""Wenqi Liang""]",[],"We present a Function Feature Learning (FFL) method that can measure the similarity of non-convex neural networks. The function feature representation provides crucial insights into the understanding of the relations between different local solutions of identical neural networks. 
Unlike existing methods that use neuron activation vectors over a given dataset as neural network representation, FFL aligns weights of neural networks and projects them into a common function feature space by introducing a chain alignment rule. We investigate the function feature representation on Multi-Layer Perceptron (MLP), Convolutional Neural Network (CNN), and Recurrent Neural Network (RNN), finding that identical neural networks trained with different random initializations on different learning tasks by the Stochastic Gradient Descent (SGD) algorithm can be projected into different fixed points. This finding demonstrates the strong connection between different local solutions of identical neural networks and the equivalence of projected local solutions. With FFL, we also find that the semantics are often presented in a bottom-up way. Besides, FFL provides more insights into the structure of local solutions. Experiments on CIFAR-100, NameData, and tiny ImageNet datasets validate the effectiveness of the proposed method.",/pdf/62846941f62c2f6a5c5b5fa31287ae314584d411.pdf,ICLR,2020, +SyegvgHtwr,SyeNWieFvB,1569440000000.0,1577170000000.0,2348,Localised Generative Flows,"[""rcornish@robots.ox.ac.uk"", ""anthony.caterini@stats.ox.ac.uk"", ""deligian@stats.ox.ac.uk"", ""doucet@stats.ox.ac.uk""]","[""Rob Cornish"", ""Anthony Caterini"", ""George Deligiannidis"", ""Arnaud Doucet""]","[""Deep generative models"", ""normalizing flows"", ""variational inference""]","We argue that flow-based density models based on continuous bijections are limited in their ability to learn target distributions with complicated topologies, and propose localised generative flows (LGFs) to address this problem. LGFs are composed of stacked continuous mixtures of bijections, which enables each bijection to learn a local region of the target rather than its entirety. 
Our method is a generalisation of existing flow-based methods, which can be used without modification as the basis for an LGF model. Unlike normalising flows, LGFs do not permit exact computation of log likelihoods, but we propose a simple variational scheme that performs well in practice. We show empirically that LGFs yield improved performance across a variety of common density estimation tasks.",/pdf/25aae4bf66a9800822f1385cffcd1374214482c4.pdf,ICLR,2020,We use a deep continuous mixture of bijections to improve normalising flows for density estimation. +SJxSDxrKDr,rkeRuieYDr,1569440000000.0,1583910000000.0,2360,Adversarial Training and Provable Defenses: Bridging the Gap,"[""bmislav@student.ethz.ch"", ""martin.vechev@inf.ethz.ch""]","[""Mislav Balunovic"", ""Martin Vechev""]","[""adversarial examples"", ""adversarial training"", ""provable defense"", ""convex relaxations"", ""deep learning""]","We present COLT, a new method to train neural networks based on a novel combination of adversarial training and provable defenses. The key idea is to model neural network training as a procedure which includes both, the verifier and the adversary. In every iteration, the verifier aims to certify the network using convex relaxation while the adversary tries to find inputs inside that convex relaxation which cause verification to fail. We experimentally show that this training method, named convex layerwise adversarial training (COLT), is promising and achieves the best of both worlds -- it produces a state-of-the-art neural network with certified robustness of 60.5% and accuracy of 78.4% on the challenging CIFAR-10 dataset with a 2/255 L-infinity perturbation. This significantly improves over the best concurrent results of 54.0% certified robustness and 71.5% accuracy. 
+ +",/pdf/644155cb461f31f0b1bf353e1b2e1f32bf16e4c4.pdf,ICLR,2020,We propose a novel combination of adversarial training and provable defenses which produces a model with state-of-the-art accuracy and certified robustness on CIFAR-10. +SklOUpEYvB,r1l_1dtwvS,1569440000000.0,1583910000000.0,566,Identifying through Flows for Recovering Latent Representations,"[""maths.shenli@gmail.com"", ""bhooi@comp.nus.edu.sg"", ""dcslgh@nus.edu.sg""]","[""Shen Li"", ""Bryan Hooi"", ""Gim Hee Lee""]","[""Representation learning"", ""identifiable generative models"", ""nonlinear-ICA""]","Identifiability, or recovery of the true latent representations from which the observed data originates, is de facto a fundamental goal of representation learning. Yet, most deep generative models do not address the question of identifiability, and thus fail to deliver on the promise of the recovery of the true latent sources that generate the observations. Recent work proposed identifiable generative modelling using variational autoencoders (iVAE) with a theory of identifiability. Due to the intractability of KL divergence between variational approximate posterior and the true posterior, however, iVAE has to maximize the evidence lower bound (ELBO) of the marginal likelihood, leading to suboptimal solutions in both theory and practice. In contrast, we propose an identifiable framework for estimating latent representations using a flow-based model (iFlow). Our approach directly maximizes the marginal likelihood, allowing for theoretical guarantees on identifiability, thereby dispensing with variational approximations. We derive its optimization objective in analytical form, making it possible to train iFlow in an end-to-end manner. 
Simulations on synthetic data validate the correctness and effectiveness of our proposed method and demonstrate its practical advantages over other existing methods.",/pdf/86068293e74eae321daeca98487beb99215c5a4f.pdf,ICLR,2020, +rkEtzzWAb,HkmKzG-Ab,1509130000000.0,1518730000000.0,794,Parametric Adversarial Divergences are Good Task Losses for Generative Modeling,"[""gbxhuang@gmail.com"", ""berard.hugo@gmail.com"", ""ahmed.touati@umontreal.ca"", ""gauthier.gidel@inria.fr"", ""pascal.vincent@umontreal.ca"", ""slacoste@iro.umontreal.ca""]","[""Gabriel Huang"", ""Hugo Berard"", ""Ahmed Touati"", ""Gauthier Gidel"", ""Pascal Vincent"", ""Simon Lacoste-Julien""]","[""parametric"", ""adversarial"", ""divergence"", ""generative"", ""modeling"", ""gan"", ""neural"", ""network"", ""task"", ""loss"", ""structured"", ""prediction""]","Generative modeling of high dimensional data like images is a notoriously difficult and ill-defined problem. In particular, how to evaluate a learned generative model is unclear. +In this paper, we argue that *adversarial learning*, pioneered with generative adversarial networks (GANs), provides an interesting framework to implicitly define more meaningful task losses for unsupervised tasks, such as for generating ""visually realistic"" images. By relating GANs and structured prediction under the framework of statistical decision theory, we put into light links between recent advances in structured prediction theory and the choice of the divergence in GANs. We argue that the insights about the notions of ""hard"" and ""easy"" to learn losses can be analogously extended to adversarial divergences. 
We also discuss the attractive properties of parametric adversarial divergences for generative modeling, and perform experiments to show the importance of choosing a divergence that reflects the final task.",/pdf/bc8ada16c9bc4edf85fbc821d165d0da82dd4fce.pdf,ICLR,2018,"Parametric adversarial divergences implicitly define more meaningful task losses for generative modeling, we make parallels with structured prediction to study the properties of these divergences and their ability to encode the task of interest." +S1gUVjCqKm,HyenhvCqKX,1538090000000.0,1545360000000.0,2,Unsupervised classification into unknown number of classes,"[""syhan@cml.snu.ac.kr"", ""kimdy7@snu.ac.kr"", ""junglee@snu.ac.kr""]","[""Sungyeob Han"", ""Daeyoung Kim"", ""Jungwoo Lee""]","[""unsupervised learning""]","We propose a novel unsupervised classification method based on graph Laplacian. Unlike the widely used classification method, this architecture does not require the labels of data and the number of classes. Our key idea is to introduce a approximate linear map and a spectral clustering theory on the dimension reduced spaces into generative adversarial networks. Inspired by the human visual recognition system, the proposed framework can classify and also generate images as the human brains do. We build an approximate linear connector network $C$ analogous to the cerebral cortex, between the discriminator $D$ and the generator $G$. The connector network allows us to estimate the unknown number of classes. Estimating the number of classes is one of the challenging researches in the unsupervised learning, especially in spectral clustering. The proposed method can also classify the images by using the estimated number of classes. Therefore, we define our method as an unsupervised classification method.",/pdf/055d89d7d88f3d1126774fe6825d3329457cda7a.pdf,ICLR,2019, +BkesJ3R9YX,SkeUVTnqtm,1538090000000.0,1545360000000.0,1017,Where and when to look? 
Spatial-temporal attention for action recognition in videos,"[""lilimeng1103@gmail.com"", ""bzhao03@cs.ubc.ca"", ""bchang@stat.ubc.ca"", ""gh349@cornell.edu"", ""ftung@sfu.ca"", ""lsigal@cs.ubc.ca""]","[""Lili Meng"", ""Bo Zhao"", ""Bo Chang"", ""Gao Huang"", ""Frederick Tung"", ""Leonid Sigal""]","[""visual attention"", ""video action recognition"", ""network interpretability""]","Inspired by the observation that humans are able to process videos efficiently by only paying attention when and where it is needed, we propose a novel spatial-temporal attention mechanism for video-based action recognition. For spatial attention, we learn a saliency mask to allow the model to focus on the most salient parts of the feature maps. +For temporal attention, we employ a soft temporal attention mechanism to identify the most relevant frames from an input video. Further, we propose a set of regularizers that ensure that our attention mechanism attends to coherent regions in space and time. Our model is efficient, as it proposes a separable spatio-temporal mechanism for video attention, while being able to identify important parts of the video both spatially and temporally. We demonstrate the efficacy of our approach on three public video action recognition datasets. The proposed approach leads to state-of-the-art performance on all of them, including the new large-scale Moments in Time dataset. Furthermore, we quantitatively and qualitatively evaluate our model's ability to accurately localize discriminative regions spatially and critical frames temporally. This is despite our model only being trained with per video classification labels. 
",/pdf/be82786554c2d5e6afe50bb7bc6f84b2635a7134.pdf,ICLR,2019, +rkg6FgrtPB,SkeKLAetwr,1569440000000.0,1577170000000.0,2451,Biologically Plausible Neural Networks via Evolutionary Dynamics and Dopaminergic Plasticity,"[""sruthi@comp.nus.edu.sg"", ""anandl@iisc.ac.in"", ""christos@columbia.edu"", ""vempala@gatech.edu"", ""y.naganand@gmail.com""]","[""Sruthi Gorantla"", ""Anand Louis"", ""Christos H. Papadimitriou"", ""Santosh Vempala"", ""Naganand Yadati""]","[""Biological plausibility"", ""dopaminergic plasticity"", ""allele frequency"", ""neural net evolution""]","Artificial neural networks (ANNs) lack in biological plausibility, chiefly because backpropagation requires a variant of plasticity (precise changes of the synaptic weights informed by neural events that occur downstream in the neural circuit) that is profoundly incompatible with the current understanding of the animal brain. Here we propose that backpropagation can happen in evolutionary time, instead of lifetime, in what we call neural net evolution (NNE). In NNE the weights of the links of the neural net are sparse linear functions of the animal's genes, where each gene has two alleles, 0 and 1. In each generation, a population is generated at random based on current allele frequencies, and it is tested in the learning task. The relative performance of the two alleles of each gene over the whole population is determined, and the allele frequencies are updated via the standard population genetics equations for the weak selection regime. We prove that, under assumptions, NNE succeeds in learning simple labeling functions with high probability, and with polynomially many generations and individuals per generation. We test the NNE concept, with only one hidden layer, on MNIST with encouraging results. 
Finally, we explore a further version of biologically plausible ANNs inspired by the recent discovery in animals of dopaminergic plasticity: the increase of the strength of a synapse that fired if dopamine was released soon after the firing.",/pdf/7587b1d8e81138b84d713d63e3663a0107201238.pdf,ICLR,2020, +S1gOpsCctm,SyeVDlV9KX,1538090000000.0,1550830000000.0,814,Learning Finite State Representations of Recurrent Policy Networks,"[""koula@oregonstate.edu"", ""alan.fern@oregonstate.edu"", ""sgrey@google.com""]","[""Anurag Koul"", ""Alan Fern"", ""Sam Greydanus""]","[""recurrent neural networks"", ""finite state machine"", ""quantization"", ""interpretability"", ""autoencoder"", ""moore machine"", ""reinforcement learning"", ""imitation learning"", ""representation"", ""Atari"", ""Tomita""]","Recurrent neural networks (RNNs) are an effective representation of control policies for a wide range of reinforcement and imitation learning problems. RNN policies, however, are particularly difficult to explain, understand, and analyze due to their use of continuous-valued memory vectors and observation features. In this paper, we introduce a new technique, Quantized Bottleneck Insertion, to learn finite representations of these vectors and features. The result is a quantized representation of the RNN that can be analyzed to improve our understanding of memory use and general behavior. We present results of this approach on synthetic environments and six Atari games. The resulting finite representations are surprisingly small in some cases, using as few as 3 discrete memory states and 10 observations for a perfect Pong policy. We also show that these finite policy representations lead to improved interpretability. ",/pdf/95121c16cada4c5fd725165b1396161d4a445ce2.pdf,ICLR,2019,Extracting a finite state machine from a recurrent neural network via quantization for the purpose of interpretability with experiments on Atari. 
+HkfwpiA9KX,HJx5SVpqKQ,1538090000000.0,1545360000000.0,812,Automata Guided Skill Composition,"[""xli87@bu.edu"", ""yaoma@bu.edu"", ""cbelta@bu.edu""]","[""Xiao Li"", ""Yao Ma"", ""Calin Belta""]","[""Skill composition"", ""temporal logic"", ""finite state automata""]","Skills learned through (deep) reinforcement learning often generalizes poorly +across tasks and re-training is necessary when presented with a new task. We +present a framework that combines techniques in formal methods with reinforcement +learning (RL) that allows for the convenient specification of complex temporal +dependent tasks with logical expressions and construction of new skills from existing +ones with no additional exploration. We provide theoretical results for our +composition technique and evaluate on a simple grid world simulation as well as +a robotic manipulation task.",/pdf/26a3f8873bc0bfc0cdb73c3398b919c75853568b.pdf,ICLR,2019,A formal method's approach to skill composition in reinforcement learning tasks +r1g1LoAcFm,rylrEoxuKm,1538090000000.0,1545360000000.0,136,Using Ontologies To Improve Performance In Massively Multi-label Prediction,"[""ethan.steinberg@gmail.com"", ""peterjliu@google.com""]","[""Ethan Steinberg"", ""Peter J. Liu""]","[""multi-label"", ""Bayesian network"", ""ontology""]","Massively multi-label prediction/classification problems arise in environments like health-care or biology where it is useful to make very precise predictions. One challenge with massively multi-label problems is that there is often a long-tailed frequency distribution for the labels, resulting in few positive examples for the rare labels. We propose a solution to this problem by modifying the output layer of a neural network to create a Bayesian network of sigmoids which takes advantage of ontology relationships between the labels to help share information between the rare and the more common labels. 
We apply this method to the two massively multi-label tasks of disease prediction (ICD-9 codes) and protein function prediction (Gene Ontology terms) and obtain significant improvements in per-label AUROC and average precision.",/pdf/082aac3ea80659aad9c325b8c6dd61c6a8f75f98.pdf,ICLR,2019, We propose a new method for using ontology information to improve performance on massively multi-label prediction/classification problems. +H1g8p1BYvS,SJxgdEytPS,1569440000000.0,1577170000000.0,1989,Adversarial Filters of Dataset Biases,"[""ronanlb@allenai.org"", ""swabhas@allenai.org"", ""chandrab@allenai.org"", ""rowanz@cs.washington.edu"", ""matthewp@allenai.org"", ""ashishs@allenai.org"", ""yejinc@allenai.org""]","[""Ronan Le Bras"", ""Swabha Swayamdipta"", ""Chandra Bhagavatula"", ""Rowan Zellers"", ""Matthew Peters"", ""Ashish Sabharwal"", ""Yejin Choi""]",[],"Large-scale benchmark datasets have been among the major driving forces in AI, supporting training of models and measuring their progress. The key assumption is that these benchmarks are realistic approximations of the target tasks in the real world. However, while machine performance on these benchmarks advances rapidly --- often surpassing human performance --- it still struggles on the target tasks in the wild. This raises an important question: whether the surreal high performance on existing benchmarks are inflated due to spurious biases in them, and if so, how we can effectively revise these benchmarks to better simulate more realistic problem distributions in the real world.   +In this paper, we posit that while the real world problems consist of a great deal of long-tail problems, existing benchmarks are overly populated with a great deal of similar (thus non-tail) problems, which in turn, leads to a major overestimation of true AI performance. To address this challenge, we present a novel framework of Adversarial Filters to investigate model-based reduction of dataset biases. 
We discuss that the optimum bias reduction via AFOptimum is intractable, thus propose AFLite, an iterative greedy algorithm that adversarially filters out data points to identify a reduced dataset with more realistic problem distributions and considerably less spurious biases. +AFLite is lightweight and can in principle be applied to any task and dataset. We apply it to popular benchmarks that are practically solved --- ImageNet and Natural Language Inference (SNLI, MNLI, QNLI) --- and present filtered counterparts as new challenge datasets where the model performance drops considerably (e.g., from 84% to 24% for ImageNet and from 92% to 62% for SNLI), while human performance remains high. An extensive suite of analysis demonstrates that AFLite effectively reduces measurable dataset biases in both the synthetic and real datasets. Finally, we introduce new measures of dataset biases based on K-nearest-neighbors to help guide future research on dataset developments and bias reduction. ",/pdf/dae9382653439c53c5698a5c93ae5e30e11215d2.pdf,ICLR,2020, +ryGiYoAqt7,B1g1R4c5tm,1538090000000.0,1545360000000.0,476,Learning agents with prioritization and parameter noise in continuous state and action space,"[""rajesh.dm@iiitb.ac.in"", ""gsr@iiitb.ac.in""]","[""Rajesh Devaraddi"", ""G. Srinivasaraghavan""]","[""reinforcement learning"", ""continuous action space"", ""prioritization"", ""parameter"", ""noise"", ""policy gradients""]","Reinforcement Learning (RL) problem can be solved in two different ways - the Value function-based approach and the policy optimization-based approach - to eventually arrive at an optimal policy for the given environment. One of the recent breakthroughs in reinforcement learning is the use of deep neural networks as function approximators to approximate the value function or q-function in a reinforcement learning scheme. This has led to results with agents automatically learning how to play games like alpha-go showing better-than-human performance. 
Deep Q-learning networks (DQN) and Deep Deterministic Policy Gradient (DDPG) are two such methods that have shown state-of-the-art results in recent times. Among the many variants of RL, an important class of problems is where the state and action spaces are continuous --- autonomous robots, autonomous vehicles, optimal control are all examples of such problems that can lend themselves naturally to reinforcement based algorithms, and have continuous state and action spaces. In this paper, we adapt and combine approaches such as DQN and DDPG in novel ways to outperform the earlier results for continuous state and action space problems. We believe these results are a valuable addition to the fast-growing body of results on Reinforcement Learning, more so for continuous state and action space problems.",/pdf/a36561edcadcf5a168252742d99bcc132e87f5c5.pdf,ICLR,2019,Improving the performance of an RL agent in the continuous action and state space domain by using prioritised experience replay and parameter noise. +rJVruWZRW,HkVH_W-AW,1509130000000.0,1518730000000.0,695,Dense Recurrent Neural Network with Attention Gate,"[""yhyoo@rit.kaist.ac.kr"", ""khan@rit.kaist.ac.kr"", ""scho@rit.kaist.ac.kr"", ""kckoh@rit.kaist.ac.kr"", ""johkim@rit.kaist.ac.kr""]","[""Yong-Ho Yoo"", ""Kook Han"", ""Sanghyun Cho"", ""Kyoung-Chul Koh"", ""Jong-Hwan Kim""]","[""recurrent neural network"", ""language modeling"", ""dense connection""]","We propose the dense RNN, which has the fully connections from each hidden state to multiple preceding hidden states of all layers directly. As the density of the connection increases, the number of paths through which the gradient flows can be increased. It increases the magnitude of gradients, which help to prevent the vanishing gradient problem in time. Larger gradients, however, can also cause exploding gradient problem. To complement the trade-off between two problems, we propose an attention gate, which controls the amounts of gradient flows. 
We describe the relation between the attention gate and the gradient flows by approximation. The experiment on the language modeling using Penn Treebank corpus shows dense connections with the attention gate improve the model’s performance.",/pdf/57a201147ed46a8ee5a3230754e3f6673f21e7e7.pdf,ICLR,2018,Dense RNN that has fully connections from each hidden state to multiple preceding hidden states of all layers directly. +ByxKo04tvr,S1gW-QFOwH,1569440000000.0,1577170000000.0,1328,Multigrid Neural Memory,"[""trihuynh@uchicago.edu"", ""mmaire@uchicago.edu"", ""mwalter@ttic.edu""]","[""Tri Huynh"", ""Michael Maire"", ""Matthew R. Walter""]","[""multigrid architecture"", ""memory network"", ""convolutional neural network""]","We introduce a novel architecture that integrates a large addressable memory space into the core functionality of a deep neural network. Our design distributes both memory addressing operations and storage capacity over many network layers. Distinct from strategies that connect neural networks to external memory banks, our approach co-locates memory with computation throughout the network structure. Mirroring recent architectural innovations in convolutional networks, we organize memory into a multiresolution hierarchy, whose internal connectivity enables learning of dynamic information routing strategies and data-dependent read/write operations. This multigrid spatial layout permits parameter-efficient scaling of memory size, allowing us to experiment with memories substantially larger than those in prior work. We demonstrate this capability on synthetic exploration and mapping tasks, where the network is able to self-organize and retain long-term memory for trajectories of thousands of time steps. 
On tasks decoupled from any notion of spatial geometry, such as sorting or associative recall, our design functions as a truly generic memory and yields results competitive with those of the recently proposed Differentiable Neural Computer.",/pdf/106690bd64ea9a501b0e546f2cbe55ecac4deb8f.pdf,ICLR,2020,"A novel neural memory architecture that co-locates memory and computation throughout the network structure, providing addressable, scalable, long-term and large capacity neural memory." +SJefGpEtDB,HklpMitUDr,1569440000000.0,1577170000000.0,404,A Dynamic Approach to Accelerate Deep Learning Training,"[""john.osorio@bsc.es"", ""adria.armejach@bsc.es"", ""eric.petit@intel.com"", ""marc.casas@bsc.es""]","[""John Osorio"", ""Adri\u00e0 Armejach"", ""Eric Petit"", ""Marc Casas""]","[""reduced precision"", ""bfloat16"", ""CNN"", ""DNN"", ""dynamic precision"", ""mixed precision""]","Mixed-precision arithmetic combining both single- and half-precision operands in the same operation have been successfully applied to train deep neural networks. Despite the advantages of mixed-precision arithmetic in terms of reducing the need for key resources like memory bandwidth or register file size, it has a limited capacity for diminishing computing costs and requires 32 bits to represent its output operands. This paper proposes two approaches to replace mixed-precision for half-precision arithmetic during a large portion of the training. The first approach achieves accuracy ratios slightly slower than the state-of-the-art by using half-precision arithmetic during more than 99% of training. The second approach reaches the same accuracy as the state-of-the-art by dynamically switching between half- and mixed-precision arithmetic during training. It uses half-precision during more than 94% of the training process. 
This paper is the first in demonstrating that half-precision can be used for a very large portion of DNNs training and still reach state-of-the-art accuracy.",/pdf/439a2864a64790adaa12b89c24cde56c9c37f030.pdf,ICLR,2020,Dynamic precision technique to train deep neural networks +HJXyS7bRb,ByCAV7bA-,1509140000000.0,1518730000000.0,1164,A Goal-oriented Neural Conversation Model by Self-Play,"[""wewei@google.com"", ""adai@google.com"", ""qvl@google.com"", ""lijiali@google.com""]","[""Wei Wei"", ""Quoc V. Le"", ""Andrew M. Dai"", ""Li-Jia Li""]","[""conversation model"", ""seq2seq"", ""self-play"", ""reinforcement learning""]","Building chatbots that can accomplish goals such as booking a flight ticket is an unsolved problem in natural language understanding. Much progress has been made to build conversation models using techniques such as sequence2sequence modeling. One challenge in applying such techniques to building goal-oriented conversation models is that maximum likelihood-based models are not optimized toward accomplishing goals. Recently, many methods have been proposed to address this issue by optimizing a reward that contains task status or outcome. However, adding the reward optimization on the fly usually provides little guidance for language construction and the conversation model soon becomes decoupled from the language model. In this paper, we propose a new setting in goal-oriented dialogue system to tighten the gap between these two aspects by enforcing model level information isolation on individual models between two agents. Language construction now becomes an important part in reward optimization since it is the only way information can be exchanged. We experimented our models using self-play and results showed that our method not only beat the baseline sequence2sequence model in rewards but can also generate human-readable meaningful conversations of comparable quality. 
",/pdf/40fc8cdd76f4aba7cb8069509d9e5ddf2523ad35.pdf,ICLR,2018,A Goal-oriented Neural Conversation Model by Self-Play +SkxWnkStvS,r1xL5x1KwS,1569440000000.0,1577170000000.0,1941,Searching for Stage-wise Neural Graphs In the Limit,"[""chow459@gmail.com"", ""doudejing@baidu.com"", ""libo0001@gmail.com""]","[""Xin Zhou"", ""Dejing Dou"", ""Boyang Li""]","[""neural architecture search"", ""graphon"", ""random graphs""]","Search space is a key consideration for neural architecture search. Recently, Xie et al. (2019a) found that randomly generated networks from the same distribution perform similarly, which suggest we should search for random graph distributions instead of graphs. We propose graphon as a new search space. A graphon is the limit of Cauchy sequence of graphs and a scale-free probabilistic distribution, from which graphs of different number of vertices can be drawn. This property enables us to perform NAS using fast, low-capacity models and scale the found models up when necessary. We develop an algorithm for NAS in the space of graphons and empirically demonstrate that it can find stage-wise graphs that outperform DenseNet and other baselines on ImageNet. ",/pdf/bf6daea9d9c7b272d16bea44974e39d8cb755c3b.pdf,ICLR,2020,Graphon is a good search space for neural architecture search and empirically produces good networks. +E8fmaZwzEj,zfMiG1tzjqA,1601310000000.0,1614990000000.0,974,Defective Convolutional Networks,"[""~Tiange_Luo1"", ""~Tianle_Cai1"", ""~Mengxiao_Zhang2"", ""~Siyu_Chen1"", ""~Di_He1"", ""~Liwei_Wang1""]","[""Tiange Luo"", ""Tianle Cai"", ""Mengxiao Zhang"", ""Siyu Chen"", ""Di He"", ""Liwei Wang""]","[""Representation Learning"", ""Robustness""]","Robustness of convolutional neural networks (CNNs) has gained in importance on account of adversarial examples, i.e., inputs added as well-designed perturbations that are imperceptible to humans but can cause the model to predict incorrectly. 
Recent research suggests that the noise in adversarial examples breaks the textural structure, which eventually leads to wrong predictions. To mitigate the threat of such adversarial attacks, we propose defective convolutional networks that make predictions relying less on textural information but more on shape information by properly integrating defective convolutional layers into standard CNNs. The defective convolutional layers contain defective neurons whose activations are set to be a constant function. As defective neurons contain no information and are far different from standard neurons in its spatial neighborhood, the textural features cannot be accurately extracted, and so the model has to seek other features for classification, such as the shape. We show extensive evidence to justify our proposal and demonstrate that defective CNNs can defend against black-box attacks better than standard CNNs. In particular, they achieve state-of-the-art performance against transfer-based attacks without any adversarial training being applied. + +",/pdf/4d31b0d2817a4bbe0c6161897e0a29a022d97dc3.pdf,ICLR,2021,"A new kind of CNNs that makes predictions relying less on textural information but more on shape information. Compared to standard CNNs, the proposed ones show better defense performance against black-box attacks." +BylKL1SKvr,Hkx6YD6dvB,1569440000000.0,1577170000000.0,1736,Towards Understanding the Transferability of Deep Representations,"[""h-l17@mails.tsinghua.edu.cn"", ""mingsheng@tsinghua.edu.cn"", ""jimwang@tsinghua.edu.cn"", ""jordan@cs.berkeley.edu""]","[""Hong Liu"", ""Mingsheng Long"", ""Jianmin Wang"", ""Michael I. Jordan""]","[""Transfer Learning"", ""Fine-tuning"", ""Deep Neural Networks""]","Deep neural networks trained on a wide range of datasets demonstrate impressive transferability. Deep features appear general in that they are applicable to many datasets and tasks. Such property is in prevalent use in real-world applications. 
A neural network pretrained on large datasets, such as ImageNet, can significantly boost generalization and accelerate training if fine-tuned to a smaller target dataset. Despite its pervasiveness, few effort has been devoted to uncovering the reason of transferability in deep feature representations. This paper tries to understand transferability from the perspectives of improved generalization, optimization and the feasibility of transferability. We demonstrate that 1) Transferred models tend to find flatter minima, since their weight matrices stay close to the original flat region of pretrained parameters when transferred to a similar target dataset; 2) Transferred representations make the loss landscape more favorable with improved Lipschitzness, which accelerates and stabilizes training substantially. The improvement largely attributes to the fact that the principal component of gradient is suppressed in the pretrained parameters, thus stabilizing the magnitude of gradient in back-propagation. 3) The feasibility of transferability is related to the similarity of both input and label. And a surprising discovery is that the feasibility is also impacted by the training stages in that the transferability first increases during training, and then declines. We further provide a theoretical analysis to verify our observations.",/pdf/ea6f3674360c3efc6c5a77f24f95d82b7d64d1b1.pdf,ICLR,2020,"Understand transferability from the perspectives of improved generalization, optimization and the feasibility of transferability." +ByeaXeBFvH,rJlWHHeYwH,1569440000000.0,1577170000000.0,2227,Hydra: Preserving Ensemble Diversity for Model Distillation,"[""linh.tran@imperial.ac.uk"", ""basveeling@gmail.com"", ""kevin.roth@inf.ethz.ch"", ""kuba.swiatkowski@gmail.com"", ""jvdillon@google.com"", ""jaspersnoek@gmail.com"", ""stephan.mandt@gmail.com"", ""salimans@google.com"", ""nowozin@google.com"", ""rjenatton@google.com""]","[""Linh Tran"", ""Bastiaan S. 
Veeling"", ""Kevin Roth"", ""Jakub \u015awi\u0105tkowski"", ""Joshua V. Dillon"", ""Jasper Snoek"", ""Stephan Mandt"", ""Tim Salimans"", ""Sebastian Nowozin"", ""Rodolphe Jenatton""]","[""model distillation"", ""ensemble models""]","Ensembles of models have been empirically shown to improve predictive performance and to yield robust measures of uncertainty. However, they are expensive in computation and memory. Therefore, recent research has focused on distilling ensembles into a single compact model, reducing the computational and memory burden of the ensemble while trying to preserve its predictive behavior. Most existing distillation formulations summarize the ensemble by capturing its average predictions. As a result, the diversity of the ensemble predictions, stemming from each individual member, is lost. Thus the distilled model cannot provide a measure of uncertainty comparable to that of the original ensemble. To retain more faithfully the diversity of the ensemble, we propose a distillation method based on a single multi-headed neural network, which we refer to as Hydra. The shared body network learns a joint feature representation that enables each head to capture the predictive behavior of each ensemble member. We demonstrate that with a slight increase in parameter count, Hydra improves distillation performance on classification and regression settings while capturing the uncertainty behaviour of the original ensemble over both in-domain and out-of-distribution tasks.",/pdf/233f2b7690e503f2445e6868dd19a4430e900093.pdf,ICLR,2020,"We distill ensemble models using a shared body network and many heads, preserving ensemble diversity." 
+GvqjmSwUxkY,KvHvCxo06o,1601310000000.0,1614990000000.0,560,Rethinking the Truly Unsupervised Image-to-Image Translation,"[""~Kyungjune_Baek1"", ""~Yunjey_Choi3"", ""~Youngjung_Uh2"", ""~Jaejun_Yoo1"", ""~Hyunjung_Shim1""]","[""Kyungjune Baek"", ""Yunjey Choi"", ""Youngjung Uh"", ""Jaejun Yoo"", ""Hyunjung Shim""]","[""unsupervised approach"", ""image-to-image translation"", ""representation learning""]","Every recent image-to-image translation model uses either image-level (i.e. input-output pairs) or set-level (i.e. domain labels) supervision at a minimum. However, even the set-level supervision can be a serious bottleneck for data collection in practice. In this paper, we tackle image-to-image translation in a fully unsupervised setting, i.e., neither paired images nor domain labels. To this end, we propose a truly unsupervised image-to-image translation model (TUNIT) that simultaneously learns to separate image domains and translate input images into the estimated domains. +Experimental results show that our model achieves comparable or even better performance than the set-level supervised model trained with full labels, generalizes well on various datasets, and is robust against the choice of hyperparameters (e.g. the preset number of pseudo domains). In addition, TUNIT extends well to the semi-supervised scenario with various amount of labels provided. ",/pdf/c3521a0d6dd1ef044939d758940385957eeba5d5.pdf,ICLR,2021,We propose a truly unsupervised image-to-image translation model even without set-level supervisions. 
+SylpBgrKPH,SylVWFeKwS,1569440000000.0,1577170000000.0,2302,MissDeepCausal: causal inference from incomplete data using deep latent variable models,"[""julie.josse@polytechnique.edu"", ""imke.mayer@polytechnique.edu"", ""jpvert@google.com""]","[""Julie Josse"", ""Imke Mayer"", ""Jean-Philippe Vert""]","[""treatment effect estimation"", ""missing values"", ""variational autoencoders"", ""importance sampling"", ""double robustness""]","Inferring causal effects of a treatment, intervention or policy from observational data is central to many applications. However, state-of-the-art methods for causal inference seldom consider the possibility that covariates have missing values, which is ubiquitous in many real-world analyses. Missing data greatly complicate causal inference procedures as they require an adapted unconfoundedness hypothesis which can be difficult to justify in practice. We circumvent this issue by considering latent confounders whose distribution is learned through variational autoencoders adapted to missing values. They can be used either as a pre-processing step prior to causal inference but we also suggest to embed them in a multiple imputation strategy to take into account the variability due to missing values. Numerical experiments demonstrate the effectiveness of the proposed methodology especially for non-linear models compared to competitors.",/pdf/e5327e8b9e5a2c8c429329f224d48ce511a2953d.pdf,ICLR,2020, +BJe1hsCcYQ,SyxQSepqYm,1538090000000.0,1545360000000.0,673,Lorentzian Distance Learning,"[""law@cs.toronto.edu"", ""jsnell@cs.toronto.edu"", ""zemel@cs.toronto.edu""]","[""Marc T Law"", ""Jake Snell"", ""Richard S Zemel""]","[""distance learning"", ""metric learning"", ""hyperbolic geometry"", ""hierarchy tree""]","This paper introduces an approach to learn representations based on the Lorentzian distance in hyperbolic geometry. Hyperbolic geometry is especially suited to hierarchically-structured datasets, which are prevalent in the real world. 
Current hyperbolic representation learning methods compare examples with the Poincar\'e distance metric. They formulate the problem as minimizing the distance of each node in a hierarchy with its descendants while maximizing its distance with other nodes. This formulation produces node representations close to the centroid of their descendants. We exploit the fact that the centroid w.r.t the squared Lorentzian distance can be written in closed-form. We show that the Euclidean norm of such a centroid decreases as the curvature of the hyperbolic space decreases. This property makes it appropriate to represent hierarchies where parent nodes minimize the distances to their descendants and have smaller Euclidean norm than their children. Our approach obtains state-of-the-art results in retrieval and classification tasks on different datasets. ",/pdf/a0042f7237a5a5ec98da2ed9a3e57929d1f4a8a1.pdf,ICLR,2019,A distance learning approach to learn hyperbolic representations +r1Ue8Hcxg,,1478280000000.0,1494010000000.0,251,Neural Architecture Search with Reinforcement Learning,"[""barretzoph@google.com"", ""qvl@google.com""]","[""Barret Zoph"", ""Quoc Le""]",[],"Neural networks are powerful and flexible models that work well for many difficult learning tasks in image, speech and natural language understanding. Despite their success, neural networks are still hard to design. In this paper, we use a recurrent network to generate the model descriptions of neural networks and train this RNN with reinforcement learning to maximize the expected accuracy of the generated architectures on a validation set. On the CIFAR-10 dataset, our method, starting from scratch, can design a novel network architecture that rivals the best human-invented architecture in terms of test set accuracy. Our CIFAR-10 model achieves a test error rate of 3.65, which is 0.09 percent better and 1.05x faster than the previous state-of-the-art model that used a similar architectural scheme. 
On the Penn Treebank dataset, our model can compose a novel recurrent cell that outperforms the widely-used LSTM cell, and other state-of-the-art baselines. Our cell achieves a test set perplexity of 62.4 on the Penn Treebank, which is 3.6 perplexity better than the previous state-of-the-art model. The cell can also be transferred to the character language modeling task on PTB and achieves a state-of-the-art perplexity of 1.214.",/pdf/a737e298ef4f25808b2a4b464c913678234b1a5d.pdf,ICLR,2017, +ByvJuTigl,,1478380000000.0,1483540000000.0,607,End-to-End Learnable Histogram Filters,"[""rico.jonschkowski@tu-berlin.de"", ""oliver.brock@tu-berlin.de""]","[""Rico Jonschkowski"", ""Oliver Brock""]","[""Deep learning"", ""Unsupervised Learning""]","Problem-specific algorithms and generic machine learning approaches have complementary strengths and weaknesses, trading-off data efficiency and generality. To find the right balance between these, we propose to use problem-specific information encoded in algorithms together with the ability to learn details about the problem-instance from data. We demonstrate this approach in the context of state estimation in robotics, where we propose end-to-end learnable histogram filters---a differentiable implementation of histogram filters that encodes the structure of recursive state estimation using prediction and measurement update but allows the specific models to be learned end-to-end, i.e. 
in such a way that they optimize the performance of the filter, using either supervised or unsupervised learning.",/pdf/e074094d321fa15bcb3ac8fc42954ef29afd269f.pdf,ICLR,2017,a way to combine the algorithmic structure of Bayes filters with the end-to-end learnability of neural networks +l35SB-_raSQ,A9PpQPfwWAk,1601310000000.0,1616120000000.0,2185,A Hypergradient Approach to Robust Regression without Correspondence,"[""~Yujia_Xie1"", ""956986044myx@gmail.com"", ""~Simiao_Zuo1"", ""~Hongteng_Xu1"", ""~Xiaojing_Ye1"", ""~Tuo_Zhao1"", ""~Hongyuan_Zha1""]","[""Yujia Xie"", ""Yixiu Mao"", ""Simiao Zuo"", ""Hongteng Xu"", ""Xiaojing Ye"", ""Tuo Zhao"", ""Hongyuan Zha""]","[""Regression without correspondence"", ""differentiable programming"", ""first-order optimization"", ""Sinkhorn algorithm""]","We consider a regression problem, where the correspondence between the input and output data is not available. Such shuffled data are commonly observed in many real world problems. Take flow cytometry as an example: the measuring instruments are unable to preserve the correspondence between the samples and the measurements. Due to the combinatorial nature of the problem, most of the existing methods are only applicable when the sample size is small, and are limited to linear regression models. To overcome such bottlenecks, we propose a new computational framework --- ROBOT --- for the shuffled regression problem, which is applicable to large data and complex models. Specifically, we propose to formulate regression without correspondence as a continuous optimization problem. Then by exploiting the interaction between the regression model and the data correspondence, we propose to develop a hypergradient approach based on differentiable programming techniques. 
Such a hypergradient approach essentially views the data correspondence as an operator of the regression model, and therefore it allows us to find a better descent direction for the model parameters by differentiating through the data correspondence. ROBOT is quite general, and can be further extended to an inexact correspondence setting, where the input and output data are not necessarily exactly aligned. Thorough numerical experiments show that ROBOT achieves better performance than existing methods in both linear and nonlinear regression tasks, including real-world applications such as flow cytometry and multi-object tracking. ",/pdf/01f4f60af211bcbeb9691824fbd8f92211e50098.pdf,ICLR,2021,We propose a differentiable programming framework for the regression without correspondence problem. +jNhWDHdjVi4,w7kmCgWUWwv,1601310000000.0,1614990000000.0,872,Learning Consistent Deep Generative Models from Sparse Data via Prediction Constraints,"[""~Gabriel_Hope1"", ""~Madina_Abdrakhmanova1"", ""xiaoyic6@uci.edu"", ""~Michael_C_Hughes1"", ""~Erik_B_Sudderth1""]","[""Gabriel Hope"", ""Madina Abdrakhmanova"", ""Xiaoyin Chen"", ""Michael C Hughes"", ""Erik B Sudderth""]","[""Semisupervised learning"", ""deep generative models"", ""variational autoencoders""]","We develop a new framework for learning variational autoencoders and other deep generative models that balances generative and discriminative goals. Our framework optimizes model parameters to maximize a variational lower bound on the likelihood of observed data, subject to a task-specific prediction constraint that prevents model misspecification from leading to inaccurate predictions. We further enforce a consistency constraint, derived naturally from the generative model, that requires predictions on reconstructed data to match those on the original data. 
We show that these two contributions -- prediction constraints and consistency constraints -- lead to promising image classification performance, especially in the semi-supervised scenario where category labels are sparse but unlabeled data is plentiful. Our approach enables advances in generative modeling to directly boost semi-supervised classification performance, an ability we demonstrate by augmenting deep generative models with latent variables capturing spatial transformations. ",/pdf/e1da6d2b84b257c1c437abc793d28c110b58428d.pdf,ICLR,2021,We develop a new framework for learning variational autoencoders and other deep generative models that balances generative and discriminative goals. +4artD3N3xB0,cTAI6Wj0As,1601310000000.0,1614990000000.0,2289,Bayesian Learning to Optimize: Quantifying the Optimizer Uncertainty,"[""~Yue_Cao4"", ""~Tianlong_Chen1"", ""~Zhangyang_Wang1"", ""~Yang_Shen4""]","[""Yue Cao"", ""Tianlong Chen"", ""Zhangyang Wang"", ""Yang Shen""]","[""Optimizer Uncertainty"", ""Optimization"", ""Uncertainty Quantification""]","Optimizing an objective function with uncertainty awareness is well-known to improve the accuracy and confidence of optimization solutions. Meanwhile, another relevant but very different question remains yet open: how to model and quantify the uncertainty of an optimization algorithm itself? To close such a gap, the prerequisite is to consider the optimizers as sampled from a distribution, rather than a few pre-defined and fixed update rules. We first take the novel angle to consider the algorithmic space of optimizers, each being parameterized by a neural network. We then propose a Boltzmann-shaped posterior over this optimizer space, and approximate the posterior locally as Gaussian distributions through variational inference. Our novel model, Bayesian learning to optimize (BL2O) is the first study to recognize and quantify the uncertainty of the optimization algorithm. 
Our experiments on optimizing test functions, energy functions in protein-protein interactions and loss functions in image classification and data privacy attack demonstrate that, compared to state-of-the-art methods, BL2O improves optimization and uncertainty quantification (UQ) in aforementioned problems as well as calibration and out-of-domain detection in image classification.",/pdf/3dbb4401a0167d28f647d75c55f4cd4fb513e101.pdf,ICLR,2021,We develop the first method that quantifies optimizer uncertainty and shows superior performance of optimization and uncertainty quantification on extensive applications against state-of-the-art methods. +7FNqrcPtieT,4tmYCJJxlNg,1601310000000.0,1611610000000.0,1052,On Data-Augmentation and Consistency-Based Semi-Supervised Learning,"[""~Atin_Ghosh1"", ""~Alexandre_H._Thiery1""]","[""Atin Ghosh"", ""Alexandre H. Thiery""]","[""Semi-Supervised Learning"", ""Regularization"", ""Data augmentation""]","Recently proposed consistency-based Semi-Supervised Learning (SSL) methods such as the Pi-model, temporal ensembling, the mean teacher, or the virtual adversarial training, achieve the state of the art results in several SSL tasks. These methods can typically reach performances that are comparable to their fully supervised counterparts while using only a fraction of labelled examples. Despite these methodological advances, the understanding of these methods is still relatively limited. To make progress, we analyse (variations of) the Pi-model in settings where analytically tractable results can be obtained. We establish links with Manifold Tangent Classifiers and demonstrate that the quality of the perturbations is key to obtaining reasonable SSL performances. 
Furthermore, we propose a simple extension of the Hidden Manifold Model that naturally incorporates data-augmentation schemes and offers a tractable framework for understanding SSL methods.",/pdf/b336e63b29c715b76018ba42cb29f2592569ab56.pdf,ICLR,2021,We propose a simple and natural framework leveraging the Hidden Manifold Model to study modern SSL methods. +H1fevoAcKX,BJePdFI9tQ,1538090000000.0,1545360000000.0,234,Globally Soft Filter Pruning For Efficient Convolutional Neural Networks,"[""17112071@bjtu.edu.cn"", ""16120304@bjtu.edu.cn"", ""16120347@bjtu.edu.cn"", ""16112065@bjtu.edu.cn"", ""wangdong@bjtu.edu.cn""]","[""Ke Xu"", ""Xiaoyun Wang"", ""Qun Jia"", ""Jianjing An"", ""Dong Wang""]","[""Filter Pruning"", ""Model Compression"", ""Efficient Convolutional Neural Networks""]","This paper proposes a cumulative saliency based Globally Soft Filter Pruning (GSFP) scheme to prune redundant filters of Convolutional Neural Networks (CNNs). Specifically, the GSFP adopts a robust pruning method, which measures the global redundancy of the filter in the whole model by using the soft pruning strategy. In addition, in the model recovery process after pruning, we use the cumulative saliency strategy to improve the accuracy of pruning. GSFP has two advantages over previous works: (1) More accurate pruning guidance. For a pre-trained CNN model, the saliency of the filter varies with different input data. Therefore, accumulating the saliency of the filter over the entire data set can provide more accurate guidance for pruning. On the other hand, pruning from a global perspective is more accurate than local pruning. (2) More robust pruning strategy. 
We propose a reasonable normalization formula to prevent certain layers of filters in the network from being completely clipped due to excessive pruning rate.",/pdf/5172e96f7fe405e99da42b97d1bdbbb031a175fc.pdf,ICLR,2019, +SJlRF04YwB,BJxEvvd_DB,1569440000000.0,1577170000000.0,1266,Generating Semantic Adversarial Examples with Differentiable Rendering,"[""lakshya.jain@berkeley.edu"", ""scchen@berkeley.edu"", ""wilswu@berkeley.edu"", ""wjang@cs.wisc.edu"", ""vchandrasek4@wisc.edu"", ""sseshia@eecs.berkeley.edu"", ""jha@cs.wisc.edu""]","[""Lakshya Jain"", ""Steven Chen"", ""Wilson Wu"", ""Uyeong Jang"", ""Varun Chandrasekaran"", ""Sanjit Seshia"", ""Somesh Jha""]","[""semantic adversarial examples"", ""inverse graphics"", ""differentiable rendering""]","Machine learning (ML) algorithms, especially deep neural networks, have demonstrated success in several domains. However, several types of attacks have raised concerns about deploying ML in safety-critical domains, such as autonomous driving and security. An attacker perturbs a data point slightly in the pixel space and causes the ML algorithm to misclassify (e.g. a perturbed stop sign is classified as a yield sign). These perturbed data points are called adversarial examples, and there are numerous algorithms in the literature for constructing adversarial examples and defending against them. In this paper we explore semantic adversarial examples (SAEs) where an attacker creates perturbations in the semantic space. For example, an attacker can change the background of the image to be cloudier to cause misclassification. We present an algorithm for constructing SAEs that uses recent advances in differential rendering and inverse graphics. 
",/pdf/9cb5ffa72890cbbde136ee2d532ed21d6e2c1046.pdf,ICLR,2020,Generating Semantic Adversarial Examples with Differentiable Rendering +HkzRQhR9YX,Skx58XActm,1538090000000.0,1550880000000.0,1408,Tree-Structured Recurrent Switching Linear Dynamical Systems for Multi-Scale Modeling,"[""josue.nassar@stonybrook.edu"", ""scott.linderman@columbia.edu"", ""monica.bugallo@stonybrook.edu"", ""memming.park@stonybrook.edu""]","[""Josue Nassar"", ""Scott Linderman"", ""Monica Bugallo"", ""Il Memming Park""]","[""machine learning"", ""bayesian statistics"", ""dynamical systems""]","Many real-world systems studied are governed by complex, nonlinear dynamics. By modeling these dynamics, we can gain insight into how these systems work, make predictions about how they will behave, and develop strategies for controlling them. While there are many methods for modeling nonlinear dynamical systems, existing techniques face a trade off between offering interpretable descriptions and making accurate predictions. Here, we develop a class of models that aims to achieve both simultaneously, smoothly interpolating between simple descriptions and more complex, yet also more accurate models. Our probabilistic model achieves this multi-scale property through of a hierarchy of locally linear dynamics that jointly approximate global nonlinear dynamics. We call it the tree-structured recurrent switching linear dynamical system. To fit this model, we present a fully-Bayesian sampling procedure using Polya-Gamma data augmentation to allow for fast and conjugate Gibbs sampling. 
Through a variety of synthetic and real examples, we show how these models outperform existing methods in both interpretability and predictive capability.",/pdf/1bf5bccde3cae7be771c38f8abbe8331d2a5ce74.pdf,ICLR,2019, +FUtMxDTJ_h,8q6Pi29zild,1601310000000.0,1614990000000.0,3181,Symmetry Control Neural Networks,"[""~Marc_Syvaeri1"", ""~Sven_Krippendorf1""]","[""Marc Syvaeri"", ""Sven Krippendorf""]","[""Inductive (symmetry) Bias"", ""Predictive Models"", ""Hamiltonian Dynamics"", ""Physics""]","This paper continues the quest for designing the optimal physics bias for neural networks predicting the dynamics of systems when the underlying dynamics shall be inferred from the data directly. The description of physical systems is greatly simplified when the underlying symmetries of the system are taken into account. In classical systems described via Hamiltonian dynamics this is achieved by using appropriate coordinates, so-called cyclic coordinates, which reveal conserved quantities directly. Without changing the Hamiltonian, these coordinates can be obtained via canonical transformations. We show that such coordinates can be searched for automatically with appropriate loss functions which naturally arise from Hamiltonian dynamics. As a proof of principle, we test our method on standard classical physics systems using synthetic and experimental data where our network identifies the conserved quantities in an unsupervised way and find improved performance on predicting the dynamics of the system compared to networks biasing just to the Hamiltonian. Effectively, these new coordinates guarantee that motion takes place on symmetry orbits in phase space, i.e.~appropriate lower dimensional sub-spaces of phase space. 
By fitting analytic formulae we recover that our networks are utilising conserved quantities such as (angular) momentum.",/pdf/7461225e9d09ec1f61dc5d7413a1b0b38a28e89d.pdf,ICLR,2021,We present a framework for neural networks to learn (unknown) symmetries and to use these symmetries for improved performance on predicting the time evolution of the system. +tu29GQT0JFy,Obq0p5EvHSR,1601310000000.0,1616070000000.0,1124,not-MIWAE: Deep Generative Modelling with Missing not at Random Data,"[""~Niels_Bruun_Ipsen1"", ""~Pierre-Alexandre_Mattei3"", ""~Jes_Frellsen1""]","[""Niels Bruun Ipsen"", ""Pierre-Alexandre Mattei"", ""Jes Frellsen""]",[],"When a missing process depends on the missing values themselves, it needs to be explicitly modelled and taken into account while doing likelihood-based inference. We present an approach for building and fitting deep latent variable models (DLVMs) in cases where the missing process is dependent on the missing data. Specifically, a deep neural network enables us to flexibly model the conditional distribution of the missingness pattern given the data. This allows for incorporating prior information about the type of missingness (e.g.~self-censoring) into the model. Our inference technique, based on importance-weighted variational inference, involves maximising a lower bound of the joint likelihood. Stochastic gradients of the bound are obtained by using the reparameterisation trick both in latent space and data space. We show on various kinds of data sets and missingness patterns that explicitly modelling the missing process can be invaluable.",/pdf/cd42c1e09d98ef209caa6b63d5a67a5273108128.pdf,ICLR,2021,We present an approach for building and fitting deep latent variable models (DLVMs) in cases where the missing process is dependent on the missing data. 
+SklR_iCcYm,SJg27I9cFQ,1538090000000.0,1545360000000.0,404,Faster Training by Selecting Samples Using Embeddings,"[""slgonzalez@utexas.edu"", ""jland@cs.utexas.edu"", ""risto@cs.utexas.edu""]","[""Santiago Gonzalez"", ""Joshua Landgraf"", ""Risto Miikkulainen""]","[""Machine Learning"", ""Embeddings"", ""Training Time"", ""Optimization"", ""Autoencoders""]","Long training times have increasingly become a burden for researchers by slowing down the pace of innovation, with some models taking days or weeks to train. In this paper, a new, general technique is presented that aims to speed up the training process by using a thinned-down training dataset. By leveraging autoencoders and the unique properties of embedding spaces, we are able to filter training datasets to include only those samples that matter the most. Through evaluation on a standard CIFAR-10 image classification task, this technique is shown to be effective. With this technique, training times can be reduced with a minimal loss in accuracy. Conversely, given a fixed training time budget, the technique was shown to improve accuracy by over 50%. This technique is a practical tool for achieving better results with large datasets and limited computational budgets.",/pdf/fc9bbe659f1844daeb44394d8f15efb061870d4a.pdf,ICLR,2019,Training is sped up by using a dataset that has been subsampled through embedding analysis. 
+Skw0n-W0Z,HkLRnW-CW,1509130000000.0,1519590000000.0,723,Temporal Difference Models: Model-Free Deep RL for Model-Based Control,"[""vitchyr@berkeley.edu"", ""sg717@cam.ac.uk"", ""mdalal@berkeley.edu"", ""svlevine@eecs.berkeley.edu""]","[""Vitchyr Pong*"", ""Shixiang Gu*"", ""Murtaza Dalal"", ""Sergey Levine""]","[""model-based reinforcement learning"", ""model-free reinforcement learning"", ""temporal difference learning"", ""predictive learning"", ""predictive models"", ""optimal control"", ""off-policy reinforcement learning"", ""deep learning"", ""deep reinforcement learning"", ""q learning""]","Model-free reinforcement learning (RL) has been proven to be a powerful, general tool for learning complex behaviors. However, its sample efficiency is often impractically large for solving challenging real-world problems, even for off-policy algorithms such as Q-learning. A limiting factor in classic model-free RL is that the learning signal consists only of scalar rewards, ignoring much of the rich information contained in state transition tuples. Model-based RL uses this information, by training a predictive model, but often does not achieve the same asymptotic performance as model-free RL due to model bias. We introduce temporal difference models (TDMs), a family of goal-conditioned value functions that can be trained with model-free learning and used for model-based control. TDMs combine the benefits of model-free and model-based RL: they leverage the rich information in state transitions to learn very efficiently, while still attaining asymptotic performance that exceeds that of direct model-based RL methods. 
Our experimental results show that, on a range of continuous control tasks, TDMs provide a substantial improvement in efficiency compared to state-of-the-art model-based and model-free methods.",/pdf/1cd2d10bace855efc82d202d0897cd536a84b4ee.pdf,ICLR,2018,"We show that a special goal-condition value function trained with model free methods can be used within model-based control, resulting in substantially better sample efficiency and performance." +Hyg53gSYPB,HkeP_ebFvB,1569440000000.0,1577170000000.0,2546,Defense against Adversarial Examples by Encoder-Assisted Search in the Latent Coding Space,"[""huangwenjing@sjtu.edu.cn"", ""tushikui@sjtu.edu.cn""]","[""Wenjing Huang"", ""Shikui Tu"", ""Lei Xu""]","[""Adversarial Defense"", ""Auto-encoder"", ""Adversarial Attack"", ""GAN""]","Deep neural networks were shown to be vulnerable to crafted adversarial perturbations, and thus bring serious safety problems. To solve this problem, we proposed $\text{AE-GAN}_\text{+sr}$, a framework for purifying input images by searching a closest natural reconstruction with little computation. We first build a reconstruction network AE-GAN, which adapted auto-encoder by introducing adversarial loss to the objective function. In this way, we can enhance the generative ability of decoder and preserve the abstraction ability of encoder to form a self-organized latent space. In the inference time, when given an input, we will start a search process in the latent space which aims to find the closest reconstruction to the given image on the distribution of normal data. The encoder can provide a good start point for the searching process, which saves much computation cost. 
Experiments show that our method is robust against various attacks and can reach comparable even better performance to similar methods with much fewer computations.",/pdf/a7d75f8068b0376887d74426dd8b81b135cdd17e.pdf,ICLR,2020, +tc5qisoB-C,80tRsF20r4p,1601310000000.0,1615950000000.0,1334,C-Learning: Learning to Achieve Goals via Recursive Classification,"[""~Benjamin_Eysenbach1"", ""~Ruslan_Salakhutdinov1"", ""~Sergey_Levine1""]","[""Benjamin Eysenbach"", ""Ruslan Salakhutdinov"", ""Sergey Levine""]","[""reinforcement learning"", ""goal reaching"", ""density estimation"", ""Q-learning"", ""hindsight relabeling""]","We study the problem of predicting and controlling the future state distribution of an autonomous agent. This problem, which can be viewed as a reframing of goal-conditioned reinforcement learning (RL), is centered around learning a conditional probability density function over future states. Instead of directly estimating this density function, we indirectly estimate this density function by training a classifier to predict whether an observation comes from the future. Via Bayes' rule, predictions from our classifier can be transformed into predictions over future states. Importantly, an off-policy variant of our algorithm allows us to predict the future state distribution of a new policy, without collecting new experience. This variant allows us to optimize functionals of a policy's future state distribution, such as the density of reaching a particular goal state. While conceptually similar to Q-learning, our work lays a principled foundation for goal-conditioned RL as density estimation, providing justification for goal-conditioned methods used in prior work. This foundation makes hypotheses about Q-learning, including the optimal goal-sampling ratio, which we confirm experimentally. 
Moreover, our proposed method is competitive with prior goal-conditioned RL methods.",/pdf/6a06ad37cef81666dc0ffbc9cffba623fcb34843.pdf,ICLR,2021,"We reframe the goal-conditioned RL problem as one of predicting and controlling the future state of the world, and derive a principled algorithm to solve this problem. " +u15gHPQViL,R3Ud0Hsgbj6,1601310000000.0,1614990000000.0,2412,Zero-Shot Recognition through Image-Guided Semantic Classification,"[""~Mei-Chen_Yeh1"", ""~Fang_Li5"", ""a0917251699@gmail.com""]","[""Mei-Chen Yeh"", ""Fang Li"", ""Bo-Heng Li""]","[""zero-shot learning"", ""visual-semantic embedding"", ""deep learning""]","We present a new visual-semantic embedding method for generalized zero-shot learning. Existing embedding-based methods aim to learn the correspondence between an image classifier (visual representation) and its class prototype (semantic representation) for each class. Inspired by the binary relevance method for multi-label classification, we learn the mapping between an image and its semantic classifier. Given an input image, the proposed Image-Guided Semantic Classification (IGSC) method creates a label classifier, being applied to all label embeddings to determine whether a label belongs to the input image. Therefore, a semantic classifier is image conditioned and is generated during inference. We also show that IGSC is a unifying framework for two state-of-the-art deep-embedding methods. We validate our approach with four standard benchmark datasets. 
+ +",/pdf/1cbbbe8244990488eb4cd71454e5cbfc32b4646c.pdf,ICLR,2021, +kHSu4ebxFXY,VBa8TimIWx,1601310000000.0,1616060000000.0,3014,MARS: Markov Molecular Sampling for Multi-objective Drug Discovery,"[""~Yutong_Xie3"", ""~Chence_Shi1"", ""zhouhao.nlp@bytedance.com"", ""yuwei.yang@bytedance.com"", ""~Weinan_Zhang1"", ""~Yong_Yu1"", ""~Lei_Li11""]","[""Yutong Xie"", ""Chence Shi"", ""Hao Zhou"", ""Yuwei Yang"", ""Weinan Zhang"", ""Yong Yu"", ""Lei Li""]","[""drug discovery"", ""molecular graph generation"", ""MCMC sampling""]","Searching for novel molecules with desired chemical properties is crucial in drug discovery. Existing work focuses on developing neural models to generate either molecular sequences or chemical graphs. However, it remains a big challenge to find novel and diverse compounds satisfying several properties. In this paper, we propose MARS, a method for multi-objective drug molecule discovery. MARS is based on the idea of generating the chemical candidates by iteratively editing fragments of molecular graphs. To search for high-quality candidates, it employs Markov chain Monte Carlo sampling (MCMC) on molecules with an annealing scheme and an adaptive proposal. To further improve sample efficiency, MARS uses a graph neural network (GNN) to represent and select candidate edits, where the GNN is trained on-the-fly with samples from MCMC. Experiments show that MARS achieves state-of-the-art performance in various multi-objective settings where molecular bio-activity, drug-likeness, and synthesizability are considered. Remarkably, in the most challenging setting where all four objectives are simultaneously optimized, our approach outperforms previous methods significantly in comprehensive evaluations. 
The code is available at https://github.com/yutxie/mars.",/pdf/4f5a198cf9191eebd5788dea1fd15fcb151d8ef9.pdf,ICLR,2021,"In this paper, we propose a self-adaptive MCMC sampling method (MARS) to generate molecules targeting multiple objectives for drug discovery for multi-objective drug discovery." +ypJS_nyu-I,PtzwNfB2tvf,1601310000000.0,1614990000000.0,366,A Deeper Look at Discounting Mismatch in Actor-Critic Algorithms,"[""~Shangtong_Zhang1"", ""~Romain_Laroche1"", ""~Harm_van_Seijen1"", ""~Shimon_Whiteson1"", ""~Remi_Tachet_des_Combes1""]","[""Shangtong Zhang"", ""Romain Laroche"", ""Harm van Seijen"", ""Shimon Whiteson"", ""Remi Tachet des Combes""]",[],"We investigate the discounting mismatch in actor-critic algorithm implementations from a representation learning perspective. Theoretically, actor-critic algorithms usually have discounting for both actor and critic, i.e., there is a $\gamma^t$ term in the actor update for the transition observed at time $t$ in a trajectory and the critic is a discounted value function. Practitioners, however, usually ignore the discounting ($\gamma^t$) for the actor while using a discounted critic. We investigate this mismatch in two scenarios. In the first scenario, we consider optimizing an undiscounted objective $(\gamma = 1)$ where $\gamma^t$ disappears naturally $(1^t = 1)$. We then propose to interpret the discounting in critic in terms of a bias-variance-representation trade-off and provide supporting empirical results. 
In the second scenario, we consider optimizing a discounted objective ($\gamma < 1$) and propose to interpret the omission of the discounting in the actor update from an auxiliary task perspective and provide supporting empirical results.",/pdf/c143d158079b2554d07ec12fd2d16e506b521a35.pdf,ICLR,2021, +x1uGDeV6ter,rhI9M4OqoFP,1601310000000.0,1614990000000.0,2580,Adaptive Automotive Radar data Acquisition,"[""~Madhumitha_Sakthi1"", ""~Ahmed_Tewfik1""]","[""Madhumitha Sakthi"", ""Ahmed Tewfik""]","[""Compressed Sensing"", ""Adaptive acquisition"", ""object detection""]","In an autonomous driving scenario, it is vital to acquire and efficiently process data from various sensors to obtain a complete and robust perspective of the surroundings. Many studies have shown the importance of having radar data in addition to images since radar improves object detection performance. We develop a novel algorithm motivated by the hypothesis that with a limited sampling budget, allocating more sampling budget to areas with the object as opposed to a uniform sampling budget ultimately improves relevant object detection and classification. In order to identify the areas with objects, we develop an algorithm to process the object detection results from the Faster R-CNN object detection algorithm and the previous radar frame and use these as prior information to adaptively allocate more bits to areas in the scene that may contain relevant objects. We use previous radar frame information to mitigate the potential information loss of an object missed by the image or the object detection network. Also, in our algorithm, the error of missing relevant information in the current frame due to the limited budget sampling of the previous radar frame did not propagate across frames. We also develop an end-to-end transformer-based 2D object detection network using the NuScenes radar and image data. 
Finally, we compare the performance of our algorithm against that of standard CS and adaptive CS using radar on the Oxford Radar RobotCar dataset.",/pdf/5c2e49d2bb0d47b81396f0627b1e6f5454c5355e.pdf,ICLR,2021,Adaptive automotive radar data acquisition using prior image and radar data. +SkymMAxAb,ry0MMAeRW,1509120000000.0,1518730000000.0,442,AirNet: a machine learning dataset for air quality forecasting,"[""gfgkmn@gmail.com"", ""yuan@caiyunapp.com"", ""xiaoda99@gmail.com"", ""littletree@caiyunapp.com"", ""joeyzhouyuanli@caiyunapp.com""]","[""Songgang Zhao"", ""Xingyuan Yuan"", ""Da Xiao"", ""Jianyuan Zhang"", ""Zhouyuan Li""]",[],"In the past decade, many urban areas in China have suffered from serious air pollution problems, making air quality forecast a hot spot. Conventional approaches rely on numerical methods to estimate the pollutant concentration and require lots of computing power. To solve this problem, we applied the widely used deep learning methods. Deep learning requires large-scale datasets to train an effective model. In this paper, we introduced a new dataset, entitled as AirNet, containing the 0.25 degree resolution grid map of mainland China, with more than two years of continued air quality measurement and meteorological data. We published this dataset as an open resource for machine learning researches and set up a baseline of a 5-day air pollution forecast. 
The results of experiments demonstrated that this dataset could facilitate the development of new algorithms on the air quality forecast.",/pdf/e9812ba01ffc25d39cd4b0238c5560e3cab01db9.pdf,ICLR,2018, +Bx05YH2W8bE,lyCT-7nX_u2,1601310000000.0,1614990000000.0,760,DyHCN: Dynamic Hypergraph Convolutional Networks,"[""~Nan_Yin1"", ""zgluo@nudt.edu.cn"", ""wenjiewang96@gmail.com"", ""~Fuli_Feng1"", ""~Xiang_Zhang7""]","[""Nan Yin"", ""zhigang luo"", ""wenjie wang"", ""Fuli Feng"", ""Xiang Zhang""]",[],"Hypergraph Convolutional Network (HCN) has become a default choice for capturing high-order relations among nodes, \emph{i.e., } encoding the structure of a hypergraph. However, existing HCN models ignore the dynamic evolution of hypergraphs in the real-world scenarios, \emph{i.e., } nodes and hyperedges in a hypergraph change dynamically over time. To capture the evolution of high-order relations and facilitate relevant analytic tasks, we formulate dynamic hypergraph and devise the Dynamic Hypergraph Convolution Networks (DyHCN). In general, DyHCN consists of a Hypergraph Convolution (HC) to encode the hypergraph structure at a time point and a Temporal Evolution module (TE) to capture the varying of the relations. The HC is delicately designed by equipping inner attention and outer attention, which adaptively aggregate nodes' features to hyperedge and estimate the importance of each hyperedge connected to the centroid node, respectively. 
Extensive experiments on the Tiigo and Stocktwits datasets show that DyHCN achieves superior performance over existing methods, which implies the effectiveness of capturing the property of dynamic hypergraphs by HC and TE modules.",/pdf/23f3e220feb3f887e55275c922b6b84bd8055c70.pdf,ICLR,2021, +SklfY6EFDH,HyxSmn3DDr,1569440000000.0,1577170000000.0,660,Representation Quality Explain Adversarial Attacks,"[""vargas@inf.kyushu-u.ac.jp"", ""shashankkotyan@gmail.com"", ""matsuki.sousisu@gmail.com""]","[""Danilo Vasconcellos Vargas"", ""Shashank Kotyan"", ""Moe Matsuki""]","[""Representation Metrics"", ""Adversarial Machine Learning"", ""One-Pixel Attack"", ""DeepFool"", ""CapsNet""]","Neural networks have been shown vulnerable to adversarial samples. Slightly perturbed input images are able to change the classification of accurate models, showing that the representation learned is not as good as previously thought. To aid the development of better neural networks, it would be important to evaluate to what extent are current neural networks' representations capturing the existing features. Here we propose a way to evaluate the representation quality of neural networks using a novel type of zero-shot test, entitled Raw Zero-Shot. The main idea lies in the fact that some features are present on unknown classes and that unknown classes can be defined as a combination of previous learned features without representation bias (a bias towards representation that maps only current set of input-outputs and their boundary). To evaluate the soft-labels of unknown classes, two metrics are proposed. One is based on clustering validation techniques (Davies-Bouldin Index) and the other is based on soft-label distance of a given correct soft-label. +Experiments show that such metrics are in accordance with the robustness to adversarial attacks and might serve as a guidance to build better models as well as be used in loss functions to create new types of neural networks. 
Interestingly, the results suggest that dynamic routing networks such as CapsNet have better representation while current deeper DNNs are trading off representation quality for accuracy.",/pdf/0243086c701181aeb1367afb513b390272628621.pdf,ICLR,2020, +XEyElxd9zji,8L0MjcPpIcv,1601310000000.0,1614990000000.0,2189,Learning with Plasticity Rules: Generalization and Robustness,"[""~Rares_C_Cristian1"", ""~Max_Dabagia1"", ""~Christos_Papadimitriou2"", ""~Santosh_Vempala1""]","[""Rares C Cristian"", ""Max Dabagia"", ""Christos Papadimitriou"", ""Santosh Vempala""]","[""meta learning"", ""plasticity"", ""local learning"", ""deep learning"", ""machine learning"", ""neural networks"", ""RNNs"", ""backpropagation"", ""perceptron"", ""evolution"", ""adversarial examples""]"," Brains learn robustly, and generalize effortlessly between different learning tasks; in contrast, robustness and generalization across tasks are well known weaknesses of artificial neural nets (ANNs). How can we use our accelerating understanding of the brain to improve these and other aspects of ANNs? Here we hypothesize that (a) Brains employ synaptic plasticity rules that serve as proxies for GD; (b) These rules themselves can be learned by GD on the rule parameters; and (c) This process may be a missing ingredient for the development of ANNs that generalize well and are robust to adversarial perturbations. We provide both empirical and theoretical evidence for this hypothesis. In our experiments, plasticity rules for the synaptic weights of recurrent neural nets (RNNs) are learned through GD and are found to perform reasonably well (with no backpropagation). We find that plasticity rules learned by this process generalize from one type of data/classifier to others (e.g., rules learned on synthetic data work well on MNIST/Fashion MNIST) and converge with fewer updates. Moreover, the classifiers learned using plasticity rules exhibit surprising levels of tolerance to adversarial perturbations. 
In the special case of the last layer of a classification network, we show analytically that GD on the plasticity rule recovers (and improves upon) the perceptron algorithm and the multiplicative weights method. Finally, we argue that applying GD to learning rules is biologically plausible, in the sense that it can be learned over evolutionary time: we describe a genetic setting where natural selection of a numerical parameter over a sequence of generations provably simulates a simple variant of GD.",/pdf/a4be1ff74d6832766a291e6879f833418f2f0346.pdf,ICLR,2021,"Local plasticity rules, based only on the activations of the transmitting neurons, are used to update the weights of a neural network, leading to better generalization and robustness to adversarial examples." +BJe7h34YDS,r1gxYUxzwB,1569440000000.0,1577170000000.0,183,Understanding and Stabilizing GANs' Training Dynamics with Control Theory,"[""kunxu.thu@gmail.com"", ""chongxuanli1991@gmail.com"", ""weihuanshu94@hotmail.com"", ""dcszj@mail.tsinghua.edu.cn"", ""dcszb@mail.tsinghua.edu.cn""]","[""Kun Xu"", ""Chongxuan Li"", ""Huanshu Wei"", ""Jun Zhu"", ""Bo Zhang""]","[""Generative Adversarial Nets"", ""Stability Analysis"", ""Control Theory""]","Generative adversarial networks~(GANs) have made significant progress on realistic image generation but often suffer from instability during the training process. Most previous analyses mainly focus on the equilibrium that GANs achieve, whereas a gap exists between such theoretical analyses and practical implementations, where it is the training dynamics that plays a vital role in the convergence and stability of GANs. In this paper, we directly model the dynamics of GANs and adopt the control theory to understand and stabilize it. Specifically, we interpret the training process of various GANs as certain types of dynamics in a unified perspective of control theory which enables us to model the stability and convergence easily. 
Borrowed from control theory, we adopt the widely-used negative feedback control to stabilize the training dynamics, which can be considered as an $L2$ regularization on the output of the discriminator. We empirically verify our method on both synthetic data and natural image datasets. The results demonstrate that our method can stabilize the training dynamics as well as converge better than baselines.",/pdf/1bd50f512462bc7c8ec88d5eca9d3088aec98c98.pdf,ICLR,2020,We adopt the control theory to understand and stabilize the dynamics of GANs. +VwU1lyi5nzb,18PrWPG27A,1601310000000.0,1614990000000.0,1193,MULTI-SPAN QUESTION ANSWERING USING SPAN-IMAGE NETWORK,"[""aricit@amazon.com"", ""~Hayreddin_Ceker1"", ""ismailt@amazon.com""]","[""Tarik Arici"", ""Hayreddin Ceker"", ""Ismail Baha Tutar""]","[""BERT"", ""deep learning"", ""multi-span answer"", ""question-answering"", ""SQuAD"", ""transformers""]","Question-answering (QA) models aim to find an answer given a question and con- text. Language models like BERT are used to associate question and context to find an answer span. Prior art on QA focuses on finding the best answer. There is a need for multi-span QA models to output the top-K likely answers to questions such as ""Which companies Elon Musk started?"" or ""What factors cause global warming?"" In this work, we introduce Span-Image architecture that can learn to identify multiple answers in a context for a given question. This architecture can incorporate prior information about the span length distribution or valid span patterns (e.g., end index has to be larger than start index), thus eliminating the need for post-processing. Span-Image architecture outperforms the state-of-the-art in top-K answer accuracy on SQuAD dataset and in multi-span answer accuracy on an Amazon internal dataset. 
+",/pdf/70e8f9b5c609c97d049eb13dd42127ca42720d4e.pdf,ICLR,2021,We build multi-span question-answering models to output the top-N likely answers to questions instead of one answer using SQuAD dataset and Amazon internal dataset. +lvXLfNeCQdK,xFpHFpBJbZ,1601310000000.0,1614990000000.0,1590,Loss Landscape Matters: Training Certifiably Robust Models with Favorable Loss Landscape,"[""~Sungyoon_Lee1"", ""~Woojin_Lee1"", ""~Jinseong_Park1"", ""~Jaewook_Lee1""]","[""Sungyoon Lee"", ""Woojin Lee"", ""Jinseong Park"", ""Jaewook Lee""]","[""Adversarial Examples"", ""Certifiable Robustness"", ""Certifiable Training"", ""Loss Landscape"", ""Deep Learning"", ""Security""]","In this paper, we study the problem of training certifiably robust models. Certifiable training minimizes an upper bound on the worst-case loss over the allowed perturbation, and thus the tightness of the upper bound is an important factor in building certifiably robust models. However, many studies have shown that Interval Bound Propagation (IBP) training uses much looser bounds but outperforms other models that use tighter bounds. We identify another key factor that influences the performance of certifiable training: \textit{smoothness of the loss landscape}. We consider linear relaxation based methods and find significant differences in the loss landscape across these methods. Based on this analysis, we propose a certifiable training method that utilizes a tighter upper bound and has a landscape with favorable properties. The proposed method achieves performance comparable to state-of-the-art methods under a wide range of perturbations.",/pdf/c5dc63b434d2a9ca4f5f921ae39f5e7804eddcf2.pdf,ICLR,2021,We identify smoothness of the loss landscape as an important factor in building certifiably robust model and propose a method that achieves performance comparable to state-of-the-art certifiable training methods under a wide range of perturbations. 
+Bkf1tjR9KQ,r1gL1rYYFm,1538090000000.0,1545360000000.0,407,DVOLVER: Efficient Pareto-Optimal Neural Network Architecture Search,"[""guillaume.michel@netatmo.com"", ""mohammed-amine.alaoui@netatmo.com"", ""alice.lebois@netatmo.com"", ""amal.feriani@netatmo.com"", ""mehdi.felhi@netatmo.com""]","[""Guillaume Michel"", ""Mohammed Amine Alaoui"", ""Alice Lebois"", ""Amal Feriani"", ""Mehdi Felhi""]","[""architecture search"", ""Pareto optimality"", ""multi-objective"", ""optimization"", ""cnn"", ""deep learning""]","Automatic search of neural network architectures is a standing research topic. In addition to the fact that it presents a faster alternative to hand-designed architectures, it can improve their efficiency and for instance generate Convolutional Neural Networks (CNN) adapted for mobile devices. In this paper, we present a multi-objective neural architecture search method to find a family of CNN models with the best accuracy and computational resources tradeoffs, in a search space inspired by the state-of-the-art findings in neural search. Our work, called Dvolver, evolves a population of architectures and iteratively improves an approximation of the optimal Pareto front. Applying Dvolver on the model accuracy and on the number of floating points operations as objective functions, we are able to find, in only 2.5 days 1 , a set of competitive mobile models on ImageNet. Amongst these models one architecture has the same Top-1 accuracy on ImageNet as NASNet-A mobile with 8% less floating point operations and another one has a Top-1 accuracy of 75.28% on ImageNet exceeding by 0.28% the best MobileNetV2 model for the same computational resources.",/pdf/9914d02e0fdb0558096e5868bf823948591b74c9.pdf,ICLR,2019,Multi-objective Neural architecture search as an efficient way to find fast and accurate architecture for mobile devices. 
+Hkls_yBKDB,S1e2pMCdvr,1569440000000.0,1577170000000.0,1817,Neural Program Synthesis By Self-Learning,"[""yix081@ucsd.edu"", ""dldaisy@mail.ustc.edu.cn"", ""u1singh@ucsd.edu"", ""kez040@ucsd.edu"", ""ztu@ucsd.edu""]","[""Yifan Xu"", ""Lu Dai"", ""Udaikaran Singh"", ""Kening Zhang"", ""Zhuowen Tu""]","[""Neural Program Synthesis"", ""Reinforcement Learning"", ""Deep learning"", ""Self-Learning""]","Neural inductive program synthesis is a task generating instructions that can produce desired outputs from given inputs. In this paper, we focus on the generation of a chunk of assembly code that can be executed to match a state change inside the CPU. We develop a neural program synthesis algorithm, AutoAssemblet, learned via self-learning reinforcement learning that explores the large code space efficiently. Policy networks and value networks are learned to reduce the breadth and depth of the Monte Carlo Tree Search, resulting in better synthesis performance. We also propose an effective multi-entropy policy sampling technique to alleviate online update correlations. We apply AutoAssemblet to basic programming tasks and show significant higher success rates compared to several competing baselines.",/pdf/c5ee3a13a4a7ab972ee9c54cd5f830a6c79b44df.pdf,ICLR,2020,"We develop a neural program synthesis algorithm,AutoAssemblet, to explore the large-scale code space efficiently via self-learning under the reinforcement learning (RL) framework." 
+SyedHyBFwS,rJgULzpdvB,1569440000000.0,1577170000000.0,1697,Relative Pixel Prediction For Autoregressive Image Generation,"[""lingwang@google.com"", ""cdyer@google.com"", ""leiyu@google.com"", ""lingpenk@google.com"", ""dyogatama@google.com"", ""susannahy@google.com""]","[""Wang Ling"", ""Chris Dyer"", ""Lei Yu"", ""Lingpeng Kong"", ""Dani Yogatama"", ""Susannah Young""]","[""Image Generation"", ""Autoregressive""]","In natural images, transitions between adjacent pixels tend to be smooth and gradual, a fact that has long been exploited in image compression models based on predictive coding. In contrast, existing neural autoregressive image generation models predict the absolute pixel intensities at each position, which is a more challenging problem. In this paper, we propose to predict pixels relatively, by predicting new pixels relative to previously generated pixels (or pixels from the conditioning context, when available). We show that this form of prediction fare favorably to its absolute counterpart when used independently, but their coordination under an unified probabilistic model yields optimal performance, as the model learns to predict sharp transitions using the absolute predictor, while generating smooth transitions using the relative predictor. +Experiments on multiple benchmarks for unconditional image generation, image colorization, and super-resolution indicate that our presented mechanism leads to improvements in terms of likelihood compared to the absolute prediction counterparts. 
",/pdf/3ea163163e2371ad13d13f296a926bc7f7bfbd43.pdf,ICLR,2020, +rEaz5uTcL6Q,NxpNwC-Sm_a,1601310000000.0,1614990000000.0,533,Neural spatio-temporal reasoning with object-centric self-supervised learning,"[""~David_Ding2"", ""~Felix_Hill1"", ""~Adam_Santoro1"", ""~Matthew_Botvinick1""]","[""David Ding"", ""Felix Hill"", ""Adam Santoro"", ""Matthew Botvinick""]","[""self-attention"", ""object representations"", ""visual reasoning"", ""dynamics"", ""visual question answering""]","Transformer-based language models have proved capable of rudimentary symbolic reasoning, underlining the effectiveness of applying self-attention computations to sets of discrete entities. In this work, we apply this lesson to videos of physical interaction between objects. We show that self-attention-based models operating on discrete, learned, object-centric representations perform well on spatio-temporal reasoning tasks which were expressly designed to trouble traditional neural network models and to require higher-level cognitive processes such as causal reasoning and understanding of intuitive physics and narrative structure. We achieve state of the art results on two datasets, CLEVRER and CATER, significantly outperforming leading hybrid neuro-symbolic models. Moreover, we find that techniques from language modelling, such as BERT-style semi-supervised predictive losses, allow our model to surpass neuro-symbolic approaches while using 40% less labelled data. Our results corroborate the idea that neural networks can reason about the causal, dynamic structure of visual data and attain understanding of intuitive physics, which counters the popular claim that they are only effective at perceptual pattern-recognition and not reasoning per se.",/pdf/f0d3521d5d1abcc8afb742cf625c91ddb5e90b6f.pdf,ICLR,2021,Self-attention with object representations can significantly improve upon state of the art for video and physical dynamics based causal reasoning tasks. 
+rkej86VYvB,SylSfTYDPr,1569440000000.0,1577170000000.0,572,Temporal Difference Weighted Ensemble For Reinforcement Learning,"[""seno@ailab.ics.keio.ac.jp"", ""michita@ailab.ics.keio.ac.jp""]","[""Takuma Seno"", ""Michita Imai""]","[""reinforcement learning"", ""ensemble"", ""deep q-network""]","Combining multiple function approximators in machine learning models typically leads to better performance and robustness compared with a single function. In reinforcement learning, ensemble algorithms such as an averaging method and a majority voting method are not always optimal, because each function can learn fundamentally different optimal trajectories from exploration. In this paper, we propose a Temporal Difference Weighted (TDW) algorithm, an ensemble method that adjusts weights of each contribution based on accumulated temporal difference errors. The advantage of this algorithm is that it improves ensemble performance by reducing weights of Q-functions unfamiliar with current trajectories. We provide experimental results for Gridworld tasks and Atari tasks that show significant performance improvements compared with baseline algorithms.",/pdf/f74bfaf0109146c773fc1e78b5125311e06f58c1.pdf,ICLR,2020,Ensemble method for reinforcement learning that weights Q-functions based on accumulated TD errors. +GFsU8a0sGB,EiZzNbYUuZ,1601310000000.0,1614860000000.0,1852,Federated Learning via Posterior Averaging: A New Perspective and Practical Algorithms,"[""~Maruan_Al-Shedivat1"", ""~Jennifer_Gillenwater1"", ""~Eric_Xing1"", ""~Afshin_Rostamizadeh1""]","[""Maruan Al-Shedivat"", ""Jennifer Gillenwater"", ""Eric Xing"", ""Afshin Rostamizadeh""]","[""federated learning"", ""posterior inference"", ""MCMC""]","Federated learning is typically approached as an optimization problem, where the goal is to minimize a global loss function by distributing computation across client devices that possess local data and specify different parts of the global objective. 
We present an alternative perspective and formulate federated learning as a posterior inference problem, where the goal is to infer a global posterior distribution by having client devices each infer the posterior of their local data. While exact inference is often intractable, this perspective provides a principled way to search for global optima in federated settings. Further, starting with the analysis of federated quadratic objectives, we develop a computation- and communication-efficient approximate posterior inference algorithm—federated posterior averaging (FedPA). Our algorithm uses MCMC for approximate inference of local posteriors on the clients and efficiently communicates their statistics to the server, where the latter uses them to refine a global estimate of the posterior mode. Finally, we show that FedPA generalizes federated averaging (FedAvg), can similarly benefit from adaptive optimizers, and yields state-of-the-art results on four realistic and challenging benchmarks, converging faster, to better optima.",/pdf/3c19f2476503b117ff059dbfc938d5efd211ed0f.pdf,ICLR,2021,"A new approach to federated learning that generalizes federated optimization, combines local MCMC-based sampling with global optimization-based posterior inference, and achieves competitive results on challenging benchmarks." +B1eWOJHKvB,HyelMkCuvB,1569440000000.0,1583910000000.0,1794,Kernel of CycleGAN as a principal homogeneous space,"[""nikita.moriakov@radboudumc.nl"", ""jonasadl@kth.se"", ""jonas.teuwen@radboudumc.nl""]","[""Nikita Moriakov"", ""Jonas Adler"", ""Jonas Teuwen""]","[""Generative models"", ""CycleGAN""]","Unpaired image-to-image translation has attracted significant interest due to the invention of CycleGAN, a method which utilizes a combination of adversarial and cycle consistency losses to avoid the need for paired data. 
It is known that the CycleGAN problem might admit multiple solutions, and our goal in this paper is to analyze the space of exact solutions and to give perturbation bounds for approximate solutions. We show theoretically that the exact solution space is invariant with respect to automorphisms of the underlying probability spaces, and, furthermore, that the group of automorphisms acts freely and transitively on the space of exact solutions. We examine the case of zero pure CycleGAN loss first in its generality, and, subsequently, expand our analysis to approximate solutions for extended CycleGAN loss where identity loss term is included. In order to demonstrate that these results are applicable, we show that under mild conditions nontrivial smooth automorphisms exist. Furthermore, we provide empirical evidence that neural networks can learn these automorphisms with unexpected and unwanted results. We conclude that finding optimal solutions to the CycleGAN loss does not necessarily lead to the envisioned result in image-to-image translation tasks and that underlying hidden symmetries can render the result useless.",/pdf/8790b57a87025087d771a181a86afbd4b4282d3d.pdf,ICLR,2020,"The space of approximate solutions of CycleGAN admits a lot of symmetry, and an identity loss does not fix this." +S1xHfxHtPr,ryx3MGeFDB,1569440000000.0,1577170000000.0,2173,Online Learned Continual Compression with Stacked Quantization Modules,"[""lucas.page-caccia@mail.mcgill.ca"", ""belilovsky.eugene@gmail.com"", ""massimo.p.caccia@gmail.com"", ""jpineau@cs.mcgill.ca""]","[""Lucas Caccia"", ""Eugene Belilovsky"", ""Massimo Caccia"", ""Joelle Pineau""]","[""continual learning"", ""lifelong learning""]","We introduce and study the problem of Online Continual Compression, where one attempts to learn to compress and store a representative dataset from a non i.i.d data stream, while only observing each sample once. 
This problem is highly relevant for downstream online continual learning tasks, as well as standard learning methods under resource constrained data collection. We propose a new architecture which stacks Quantization Modules (SQM), consisting of a series of discrete autoencoders, each equipped with their own memory. Every added module is trained to reconstruct the latent space of the previous module using fewer bits, allowing the learned representation to become more compact as training progresses. This modularity has several advantages: 1) moderate compressions are quickly available early in training, which is crucial for remembering the early tasks, 2) as more data needs to be stored, earlier data becomes more compressed, freeing memory, 3) unlike previous methods, our approach does not require pretraining, even on challenging datasets. We show several potential applications of this method. We first replace the episodic memory used in Experience Replay with SQM, leading to significant gains on standard continual learning benchmarks using a fixed memory budget. We then apply our method to compressing larger images like those from Imagenet, and show that it is also effective with other modalities, such as LiDAR data.",/pdf/4b0f5de5794f526b8506ec3175662418cf7ab854.pdf,ICLR,2020,We propose an approach for learning to compress online from a non-iid data stream. We argue for the relevance of this problem and show promising results in downstream applications +ryl0cAVtPH,SyeRW1YOPS,1569440000000.0,1577170000000.0,1301,On The Difficulty of Warm-Starting Neural Network Training,"[""jordanta@cs.princeton.edu"", ""rpa@princeton.edu""]","[""Jordan T. Ash"", ""Ryan P. Adams""]","[""deep learning"", ""neural networks""]","In many real-world deployments of machine learning systems, data arrive piecemeal. 
These learning scenarios may be passive, where data arrive incrementally due to structural properties of the problem (e.g., daily financial data) or active, where samples are selected according to a measure of their quality (e.g., experimental design). In both of these cases, we are building a sequence of models that incorporate an increasing amount of data. We would like each of these models in the sequence to be performant and take advantage of all the data that are available to that point. Conventional intuition suggests that when solving a sequence of related optimization problems of this form, it should be possible to initialize using the solution of the previous iterate---to ""warm start'' the optimization rather than initialize from scratch---and see reductions in wall-clock time. However, in practice this warm-starting seems to yield poorer generalization performance than models that have fresh random initializations, even though the final training losses are similar. While it appears that some hyperparameter settings allow a practitioner to close this generalization gap, they seem to only do so in regimes that damage the wall-clock gains of the warm start. Nevertheless, it is highly desirable to be able to warm-start neural network training, as it would dramatically reduce the resource usage associated with the construction of performant deep learning systems. In this work, we take a closer look at this empirical phenomenon and try to understand when and how it occurs. Although the present investigation did not lead to a solution, we hope that a thorough articulation of the problem will spur new research that may lead to improved methods that consume fewer resources during training.",/pdf/afe002c6182a66c96ece21a42d56da91efa2a521.pdf,ICLR,2020,We empirically study the gap in generalization between warm-started and randomly-initialized neural networks. 
+sRA5rLNpmQc,9dQMlcSccG,1601310000000.0,1616040000000.0,2380,Provably robust classification of adversarial examples with detection,"[""~Fatemeh_Sheikholeslami1"", ""~Ali_Lotfi1"", ""~J_Zico_Kolter1""]","[""Fatemeh Sheikholeslami"", ""Ali Lotfi"", ""J Zico Kolter""]","[""Adversarial robustness"", ""robust deep learning""]","Adversarial attacks against deep networks can be defended against either by building robust classifiers or, by creating classifiers that can \emph{detect} the presence of adversarial perturbations. Although it may intuitively seem easier to simply detect attacks rather than build a robust classifier, this has not bourne out in practice even empirically, as most detection methods have subsequently been broken by adaptive attacks, thus necessitating \emph{verifiable} performance for detection mechanisms. In this paper, we propose a new method for jointly training a provably robust classifier and detector. Specifically, we show that by introducing an additional ""abstain/detection"" into a classifier, we can modify existing certified defense mechanisms to allow the classifier to either robustly classify \emph{or} detect adversarial attacks. We extend the common interval bound propagation (IBP) method for certified robustness under $\ell_\infty$ perturbations to account for our new robust objective, and show that the method outperforms traditional IBP used in isolation, especially for large perturbation sizes. Specifically, tests on MNIST and CIFAR-10 datasets exhibit promising results, for example with provable robust error less than $63.63\%$ and $67.92\%$, for $55.6\%$ and $66.37\%$ natural error, for $\epsilon=8/255$ and $16/255$ on the CIFAR-10 dataset, respectively. +",/pdf/f8635fcc4d33b492dbd371448f02d31878d69223.pdf,ICLR,2021,We propose a joint classifier/detector training scheme with provable performance guarantees against adversarial perturbations. 
+SyVVJ85lg,,1478280000000.0,1486680000000.0,277,Paleo: A Performance Model for Deep Neural Networks,"[""hangqi@cs.ucla.edu"", ""sparks@cs.berkeley.edu"", ""ameet@cs.ucla.edu""]","[""Hang Qi"", ""Evan R. Sparks"", ""Ameet Talwalkar""]","[""Deep learning""]","Although various scalable deep learning software packages have been proposed, it remains unclear how to best leverage parallel and distributed computing infrastructure to accelerate their training and deployment. Moreover, the effectiveness of existing parallel and distributed systems varies widely based on the neural network architecture and dataset under consideration. In order to efficiently explore the space of scalable deep learning systems and quickly diagnose their effectiveness for a given problem instance, we introduce an analytical performance model called Paleo. Our key observation is that a neural network architecture carries with it a declarative specification of the computational requirements associated with its training and evaluation. By extracting these requirements from a given architecture and mapping them to a specific point within the design space of software, hardware and communication strategies, Paleo can efficiently and accurately model the expected scalability and performance of a putative deep learning system. We show that Paleo is robust to the choice of network architecture, hardware, software, communication schemes, and parallelization strategies. We further demonstrate its ability to accurately model various recently published scalability results for CNNs such as NiN, Inception and AlexNet.",/pdf/c30790984e960f282cd1534f65f5f26444ec9a19.pdf,ICLR,2017,Paleo: An analytical performance model for exploring the space of scalable deep learning systems and quickly diagnosing their effectiveness for a given problem instance. 
+H1-oTz-Cb,SyrKafZR-,1509140000000.0,1518730000000.0,1022,Parametrizing filters of a CNN with a GAN,"[""yannic.kilcher@inf.ethz.ch"", ""gary.becigneul@inf.ethz.ch"", ""thomas.hofmann@inf.ethz.ch""]","[""Yannic Kilcher"", ""Gary Becigneul"", ""Thomas Hofmann""]","[""invariance"", ""cnn"", ""gan"", ""infogan"", ""transformation""]","It is commonly agreed that the use of relevant invariances as a good statistical bias is important in machine-learning. However, most approaches that explicitely incorporate invariances into a model architecture only make use of very simple transformations, such as translations and rotations. Hence, there is a need for methods to model and extract richer transformations that capture much higher-level invariances. To that end, we introduce a tool allowing to parametrize the set of filters of a trained convolutional neural network with the latent space of a generative adversarial network. We then show that the method can capture highly non-linear invariances of the data by visualizing their effect in the data space.",/pdf/1417b939577e252c1f8c08b510a5c3268c28f4d1.pdf,ICLR,2018, +Sklv5iRqYX,S1l5otVOKX,1538090000000.0,1550780000000.0,541,Multi-Domain Adversarial Learning,"[""alice.schoenauer@polytechnique.org"", ""louise.heinrich@ucsf.edu"", ""marc.schoenauer@inria.fr"", ""sebag@lri.fr"", ""lani.wu@ucsf.edu"", ""steven.altschuler@ucsf.edu""]","[""Alice Schoenauer-Sebag"", ""Louise Heinrich"", ""Marc Schoenauer"", ""Michele Sebag"", ""Lani F. Wu"", ""Steve J. Altschuler""]","[""multi-domain learning"", ""domain adaptation"", ""adversarial learning"", ""H-divergence"", ""deep representation learning"", ""high-content microscopy""]","Multi-domain learning (MDL) aims at obtaining a model with minimal average risk across multiple domains. Our empirical motivation is automated microscopy data, where cultured cells are imaged after being exposed to known and unknown chemical perturbations, and each dataset displays significant experimental bias. 
This paper presents a multi-domain adversarial learning approach, MuLANN, to leverage multiple datasets with overlapping but distinct class sets, in a semi-supervised setting. Our contributions include: i) a bound on the average- and worst-domain risk in MDL, obtained using the H-divergence; ii) a new loss to accommodate semi-supervised multi-domain learning and domain adaptation; iii) the experimental validation of the approach, improving on the state of the art on two standard image benchmarks, and a novel bioimage dataset, Cell.",/pdf/5d9acbed727d88dcc8ac8a57ff06e1a1a19948fe.pdf,ICLR,2019,Adversarial Domain adaptation and Multi-domain learning: a new loss to handle multi- and single-domain classes in the semi-supervised setting. +HyxehhNtvS,ryglnG_WvH,1569440000000.0,1577170000000.0,176,Why Learning of Large-Scale Neural Networks Behaves Like Convex Optimization,"[""hj@cse.yorku.ca""]","[""Hui Jiang""]","[""function space"", ""canonical space"", ""neural networks"", ""stochastic gradient descent"", ""disparity matrix""]","In this paper, we present some theoretical work to explain why simple gradient descent methods are so successful in solving non-convex optimization problems in learning large-scale neural networks (NN). After introducing a mathematical tool called canonical space, we have proved that the objective functions in learning NNs are convex in the canonical model space. We further elucidate that the gradients between the original NN model space and the canonical space are related by a pointwise linear transformation, which is represented by the so-called disparity matrix. Furthermore, we have proved that gradient descent methods surely converge to a global minimum of zero loss provided that the disparity matrices maintain full rank. If this full-rank condition holds, the learning of NNs behaves in the same way as normal convex optimization. At last, we have shown that the chance to have singular disparity matrices is extremely slim in large NNs. 
In particular, when over-parameterized NNs are randomly initialized, the gradient decent algorithms converge to a global minimum of zero loss in probability. ",/pdf/71a15a221e19f7e4386c720c5615ae503fc7b577.pdf,ICLR,2020,Some theoretical work on why learning of large neural networks converges to a global minimum in probability one +PpshD0AXfA,GNPTBYKpBUw,1601310000000.0,1616040000000.0,1980,Generative Time-series Modeling with Fourier Flows,"[""~Ahmed_Alaa1"", ""~Alex_James_Chan1"", ""~Mihaela_van_der_Schaar2""]","[""Ahmed Alaa"", ""Alex James Chan"", ""Mihaela van der Schaar""]",[],"Generating synthetic time-series data is crucial in various application domains, such as medical prognosis, wherein research is hamstrung by the lack of access to data due to concerns over privacy. Most of the recently proposed methods for generating synthetic time-series rely on implicit likelihood modeling using generative adversarial networks (GANs)—but such models can be difficult to train, and may jeopardize privacy by “memorizing” temporal patterns in training data. In this paper, we propose an explicit likelihood model based on a novel class of normalizing flows that view time-series data in the frequency-domain rather than the time-domain. The proposed flow, dubbed a Fourier flow, uses a discrete Fourier transform (DFT) to convert variable-length time-series with arbitrary sampling periods into fixed-length spectral representations, then applies a (data-dependent) spectral filter to the frequency-transformed time-series. We show that, by virtue of the DFT analytic properties, the Jacobian determinants and inverse mapping for the Fourier flow can be computed efficiently in linearithmic time, without imposing explicit structural constraints as in existing flows such as NICE (Dinh et al. (2014)), RealNVP (Dinh et al. (2016)) and GLOW (Kingma & Dhariwal (2018)). 
Experiments show that Fourier flows perform competitively compared to state-of-the-art baselines.",/pdf/7aad31a541edaadc936aa88af4d48ddc836b7344.pdf,ICLR,2021, +dnKsslWzLNY,JCySo76hRvO,1601310000000.0,1614990000000.0,1517,On the Universal Approximability and Complexity Bounds of Deep Learning in Hybrid Quantum-Classical Computing,"[""~Weiwen_Jiang1"", ""~Yukun_Ding1"", ""~Yiyu_Shi1""]","[""Weiwen Jiang"", ""Yukun Ding"", ""Yiyu Shi""]","[""deep learning"", ""hybrid quantum-classical computing"", ""universal approximability""]","With the continuously increasing number of quantum bits in quantum computers, there are growing interests in exploring applications that can harvest the power of them. Recently, several attempts were made to implement neural networks, known to be computationally intensive, in hybrid quantum-classical scheme computing. While encouraging results are shown, two fundamental questions need to be answered: (1) whether neural networks in hybrid quantum-classical computing can leverage quantum power and meanwhile approximate any function within a given error bound, i.e., universal approximability; (2) how do these neural networks compare with ones on a classical computer in terms of representation power? This work sheds light on these two questions from a theoretical perspective.",/pdf/884968f81a10499778d74b41376cb12924ab1239.pdf,ICLR,2021,This paper proves the universal approximability of neural networks on a quantum computer for a wide class of functions as well as the associated bounds. 
+rklPITVKvS,SyeQPmYwwH,1569440000000.0,1577170000000.0,563,BRIDGING ADVERSARIAL SAMPLES AND ADVERSARIAL NETWORKS,"[""lfq18@mails.tsinghua.edu.cn"", ""xmk18@mails.tsinghua.edu.cn"", ""liguoqi@mail.tsinghua.edu.cn"", ""peij@mail.tsinghua.edu.cn"", ""lpshi@mail.tsinghua.edu.cn""]","[""Faqiang Liu"", ""Mingkun Xu"", ""Guoqi Li"", ""Jing Pei"", ""Luping Shi""]","[""ADVERSARIAL SAMPLES"", ""ADVERSARIAL NETWORKS""]","Generative adversarial networks have achieved remarkable performance on various tasks but suffer from sensitivity to hyper-parameters, training instability, and mode collapse. We find that this is partly due to gradient given by non-robust discriminator containing non-informative adversarial noise, which can hinder generator from catching the pattern of real samples. Inspired by defense against adversarial samples, we introduce adversarial training of discriminator on real samples that does not exist in classic GANs framework to make adversarial training symmetric, which can balance min-max game and make discriminator more robust. Robust discriminator can give more informative gradient with less adversarial noise, which can stabilize training and accelerate convergence. We validate the proposed method on image generation tasks with varied network architectures quantitatively. Experiments show that training stability, perceptual quality, and diversity of generated samples are consistently improved with small additional training computation cost.",/pdf/8caf970a168bc127b73b97459466ad55939b8e1c.pdf,ICLR,2020,"We introduce adversarial training on real samples that does not exist in standard GANs to make discriminator more robust, which can stabilize training, accelerate convergence, and achieve better performance." 
+r1VVsebAZ,SJnQox-C-,1509130000000.0,1519060000000.0,622,Synthesizing realistic neural population activity patterns using Generative Adversarial Networks,"[""manuel.molano@iit.it"", ""aonken@inf.ed.ac.uk"", ""epiasini@sas.upenn.edu"", ""stefano.panzeri@iit.it""]","[""Manuel Molano-Mazon"", ""Arno Onken"", ""Eugenio Piasini*"", ""Stefano Panzeri*""]","[""GANs"", ""Wasserstein-GANs"", ""convolutional networks"", ""neuroscience"", ""spike train patterns"", ""spike train analysis""]","The ability to synthesize realistic patterns of neural activity is crucial for studying neural information processing. Here we used the Generative Adversarial Networks (GANs) framework to simulate the concerted activity of a population of neurons. +We adapted the Wasserstein-GAN variant to facilitate the generation of unconstrained neural population activity patterns while still benefiting from parameter sharing in the temporal domain. +We demonstrate that our proposed GAN, which we termed Spike-GAN, generates spike trains that match accurately the first- and second-order statistics of datasets of tens of neurons and also approximates well their higher-order statistics. We applied Spike-GAN to a real dataset recorded from salamander retina and showed that it performs as well as state-of-the-art approaches based on the maximum entropy and the dichotomized Gaussian frameworks. Importantly, Spike-GAN does not require to specify a priori the statistics to be matched by the model, and so constitutes a more flexible method than these alternative approaches. +Finally, we show how to exploit a trained Spike-GAN to construct 'importance maps' to detect the most relevant statistical structures present in a spike train. +Spike-GAN provides a powerful, easy-to-use technique for generating realistic spiking neural activity and for describing the most relevant features of the large-scale neural population recordings studied in modern systems neuroscience. 
+",/pdf/ccd4e045a6627d2c194fa41930a50141aa5bcafd.pdf,ICLR,2018,Using Wasserstein-GANs to generate realistic neural activity and to detect the most relevant features present in neural population patterns. +HJ0UKP9ge,,1478290000000.0,1488590000000.0,411,Bidirectional Attention Flow for Machine Comprehension,"[""minjoon@cs.washington.edu"", ""anik@allenai.org"", ""alif@allenai.org"", ""hannaneh@cs.washington.edu""]","[""Minjoon Seo"", ""Aniruddha Kembhavi"", ""Ali Farhadi"", ""Hannaneh Hajishirzi""]","[""Natural language processing"", ""Deep learning""]","Machine comprehension (MC), answering a query about a given context paragraph, requires modeling complex interactions between the context and the query. Recently, attention mechanisms have been successfully extended to MC. Typically these methods use attention to focus on a small portion of the context and summarize it with a fixed-size vector, couple attentions temporally, and/or often form a uni-directional attention. In this paper we introduce the Bi-Directional Attention Flow (BIDAF) network, a multi-stage hierarchical process that represents the context at different levels of granularity and uses bi-directional attention flow mechanism to obtain a query-aware context representation without early summarization. 
Our experimental evaluations show that our model achieves the state-of-the-art results in Stanford Question Answering Dataset (SQuAD) and CNN/DailyMail cloze test.",https://arxiv.org/pdf/1611.01603.pdf,ICLR,2017, +ByxtHCVKwB,Hyxh1TUOvH,1569440000000.0,1577170000000.0,1119,Targeted sampling of enlarged neighborhood via Monte Carlo tree search for TSP,"[""fuzhanghua@cuhk.edu.cn"", ""20150008030@m.scnu.edu.cn"", ""qiumeng.sz@gmail.com"", ""zhahy@cuhk.edu.cn""]","[""Zhang-Hua Fu"", ""Kai-Bin Qiu"", ""Meng Qiu"", ""Hongyuan Zha""]","[""Travelling salesman problem"", ""Monte Carlo tree search"", ""Reinforcement learning"", ""Variable neighborhood search""]","The travelling salesman problem (TSP) is a well-known combinatorial optimization problem with a variety of real-life applications. We tackle TSP by incorporating machine learning methodology and leveraging the variable neighborhood search strategy. More precisely, the search process is considered as a Markov decision process (MDP), where a 2-opt local search is used to search within a small neighborhood, while a Monte Carlo tree search (MCTS) method (which iterates through simulation, selection and back-propagation steps), is used to sample a number of targeted actions within an enlarged neighborhood. This new paradigm clearly distinguishes itself from the existing machine learning (ML) based paradigms for solving the TSP, which either uses an end-to-end ML model, or simply applies traditional techniques after ML for post optimization. Experiments based on two public data sets show that, our approach clearly dominates all the existing learning based TSP algorithms in terms of performance, demonstrating its high potential on the TSP. 
More importantly, as a general framework without complicated hand-crafted rules, it can be readily extended to many other combinatorial optimization problems.",/pdf/7512e3f79bc9eabfcf24d241a3517e75cb44486f.pdf,ICLR,2020,This paper combines Monte Carlo tree search with 2-opt local search in a variable neighborhood mode to solve the TSP effectively. +I4pQCAhSu62,Elif_0x6oBpS,1601310000000.0,1614990000000.0,634,Balancing Robustness and Sensitivity using Feature Contrastive Learning,"[""~Seungyeon_Kim1"", ""~Daniel_Glasner2"", ""~Srikumar_Ramalingam2"", ""~Cho-Jui_Hsieh1"", ""papineni@google.com"", ""~Sanjiv_Kumar1""]","[""Seungyeon Kim"", ""Daniel Glasner"", ""Srikumar Ramalingam"", ""Cho-Jui Hsieh"", ""Kishore Papineni"", ""Sanjiv Kumar""]","[""deep learning"", ""non-adversarial robustness"", ""sensitivity"", ""input perturbation"", ""contextual feature utility"", ""contextual feature sensitivity.""]","It is generally believed that robust training of extremely large networks is critical to their success in real-world applications. However, when taken to the extreme, methods that promote robustness can hurt the model’s sensitivity to rare or underrepresented patterns. In this paper, we discuss this trade-off between robustness and sensitivity by introducing two notions: contextual feature utility and contextual feature sensitivity. We propose Feature Contrastive Learning (FCL) that encourages the model to be more sensitive to the features that have higher contextual utility. Empirical results demonstrate that models trained with FCL achieve a better balance of robustness and sensitivity, leading to improved generalization in the presence of noise.",/pdf/1c3f624e56375d32298d8085dcac9fbc6e890b1c.pdf,ICLR,2021,"Taken to the extreme, robustness can hurt sensitivity, we propose a balance by contrasting feature perturbations with high and low contextual utility." 
+S1xCcpNYPr,HylJ18RDvB,1569440000000.0,1577170000000.0,726,Cost-Effective Testing of a Deep Learning Model through Input Reduction,"[""zhoujianyi@pku.edu.cn"", ""lifeng2014@pku.edu.cn"", ""xdu_jhdong@163.com"", ""hongyu.zhang@newcastle.edu.au"", ""haod@sei.pku.edu.cn""]","[""Jianyi Zhou"", ""Feng Li"", ""Jinhao Dong"", ""Hongyu Zhang"", ""Dan Hao""]","[""Software Testing"", ""Deep Learning"", ""Input Data Reduction""]","With the increasing adoption of Deep Learning (DL) models in various applications, testing DL models is vitally important. However, testing DL models is costly and expensive, especially when developers explore alternative designs of DL models and tune the hyperparameters. To reduce testing cost, we propose to use only a selected subset of testing data, which is small but representative enough for quick estimation of the performance of DL models. Our approach, called DeepReduce, adopts a two-phase strategy. At first, our approach selects testing data for the purpose of satisfying testing adequacy. Then, it selects more testing data in order to approximate the distribution between the whole testing data and the selected data leveraging relative entropy minimization. +Experiments with various DL models and datasets show that our approach can reduce the whole testing data to 4.6\% on average, and can reliably estimate the performance of DL models. Our approach significantly outperforms the random approach, and is more stable and reliable than the state-of-the-art approach.",/pdf/35e7ab9240d2b4e5d0476dd7e7c0db9b27c66345.pdf,ICLR,2020,"we propose DeepReduce, a software engineering approach to cost-effective testing of Deep Learning models." 
+B1l6y0VFPr,ByexCAfuDH,1569440000000.0,1583910000000.0,909,Identity Crisis: Memorization and Generalization Under Extreme Overparameterization,"[""pluskid@gmail.com"", ""bengio@google.com"", ""moritzhardt@gmail.com"", ""mcmozer@google.com"", ""y.s@cs.princeton.edu""]","[""Chiyuan Zhang"", ""Samy Bengio"", ""Moritz Hardt"", ""Michael C. Mozer"", ""Yoram Singer""]","[""Generalization"", ""Memorization"", ""Understanding"", ""Inductive Bias""]","We study the interplay between memorization and generalization of +overparameterized networks in the extreme case of a single training example and an identity-mapping task. We examine fully-connected and convolutional networks (FCN and CNN), both linear and nonlinear, initialized randomly and then trained to minimize the reconstruction error. The trained networks stereotypically take one of two forms: the constant function (memorization) and the identity function (generalization). +We formally characterize generalization in single-layer FCNs and CNNs. +We show empirically that different architectures exhibit strikingly different inductive biases. +For example, CNNs of up to 10 layers are able to generalize +from a single example, whereas FCNs cannot learn the identity function reliably from 60k examples. Deeper CNNs often fail, but nonetheless do astonishing work to memorize the training output: because CNN biases are location invariant, the model must progressively grow an output pattern from the image boundaries via the coordination of many layers. Our work helps to quantify and visualize the sensitivity of inductive biases to architectural choices such as depth, kernel width, and number of channels. 
+",/pdf/cabbbd0e36a514055f84f92c88975788671de490.pdf,ICLR,2020, +rkcya1ZAW,H1xyTJ-RW,1509130000000.0,1518730000000.0,522,Continuous-Time Flows for Efficient Inference and Density Estimation,"[""cchangyou@gmail.com"", ""chunyuan.li@duke.edu"", ""lc267@duke.edu"", ""wenlin.wang@duke.edu"", ""yunchen.pu@duke.edu"", ""lcarin@duke.edu""]","[""Changyou Chen"", ""Chunyuan Li"", ""Liqun Chen"", ""Wenlin Wang"", ""Yunchen Pu"", ""Lawrence Carin""]","[""continuous-time flows"", ""efficient inference"", ""density estimation"", ""deep generative models""]","Two fundamental problems in unsupervised learning are efficient inference for latent-variable models and robust density estimation based on large amounts of unlabeled data. For efficient inference, normalizing flows have been recently developed to approximate a target distribution arbitrarily well. In practice, however, normalizing flows only consist of a finite number of deterministic transformations, and thus they possess no guarantee on the approximation accuracy. For density estimation, the generative adversarial network (GAN) has been advanced as an appealing model, due to its often excellent performance in generating samples. In this paper, we propose the concept of {\em continuous-time flows} (CTFs), a family of diffusion-based methods that are able to asymptotically approach a target distribution. Distinct from normalizing flows and GANs, CTFs can be adopted to achieve the above two goals in one framework, with theoretical guarantees. Our framework includes distilling knowledge from a CTF for efficient inference, and learning an explicit energy-based distribution with CTFs for density estimation. 
Experiments on various tasks demonstrate promising performance of the proposed CTF framework, compared to related techniques.",/pdf/b62e8abe24c071df14013e00b04d1d382336a798.pdf,ICLR,2018, +HJNJws0cF7,SkxbIly5K7,1538090000000.0,1545360000000.0,229,Convolutional Neural Networks combined with Runge-Kutta Methods,"[""zhumai@stumail.neu.edu.cn"", ""bchang@stat.ubc.ca"", ""fuchong@mail.neu.edu.cn""]","[""Mai Zhu"", ""Bo Chang"", ""Chong Fu""]",[],"A convolutional neural network for image classification can be constructed mathematically since it can be regarded as a multi-period dynamical system. In this paper, a novel approach is proposed to construct network models from the dynamical systems view. Since a pre-activation residual network can be deemed an approximation of a time-dependent dynamical system using the forward Euler method, higher order Runge-Kutta methods (RK methods) can be utilized to build network models in order to achieve higher accuracy. The model constructed in such a way is referred to as the Runge-Kutta Convolutional Neural Network (RKNet). RK methods also provide an interpretation of Dense Convolutional Networks (DenseNets) and Convolutional Neural Networks with Alternately Updated Clique (CliqueNets) from the dynamical systems view. The proposed methods are evaluated on benchmark datasets: CIFAR-10/100, SVHN and ImageNet. The experimental results are consistent with the theoretical properties of RK methods and support the dynamical systems interpretation. 
Moreover, the experimental results show that the RKNets are superior to the state-of-the-art network models on CIFAR-10 and on par on CIFAR-100, SVHN and ImageNet.",/pdf/b0007bda6112e4ef6029e8930d8d94cdda3abd56.pdf,ICLR,2019, +cTbIjyrUVwJ,3SUHA3Z5_fb,1601310000000.0,1614220000000.0,733,Learning Accurate Entropy Model with Global Reference for Image Compression,"[""~Yichen_Qian1"", ""zhiyu.tzy@alibaba-inc.com"", ""~Xiuyu_Sun1"", ""~Ming_Lin4"", ""yingtian.ldy@alibaba-inc.com"", ""zhenhong.szh@alibaba-inc.com"", ""~Li_Hao1"", ""~Rong_Jin1""]","[""Yichen Qian"", ""Zhiyu Tan"", ""Xiuyu Sun"", ""Ming Lin"", ""Dongyang Li"", ""Zhenhong Sun"", ""Li Hao"", ""Rong Jin""]","[""Image compression"", ""Entropy Model"", ""Global Reference""]","In recent deep image compression neural networks, the entropy model plays a critical role in estimating the prior distribution of deep image encodings. Existing methods combine hyperprior with local context in the entropy estimation function. This greatly limits their performance due to the absence of a global vision. In this work, we propose a novel Global Reference Model for image compression to effectively leverage both the local and the global context information, leading to an enhanced compression rate. The proposed method scans decoded latents and then finds the most relevant latent to assist the distribution estimating of the current latent. A by-product of this work is the innovation of a mean-shifting GDN module that further improves the performance. Experimental results demonstrate that the proposed model outperforms the rate-distortion performance of most of the state-of-the-art methods in the industry.",/pdf/06784d9497e2c81c4f81b487b90f789b97d82af0.pdf,ICLR,2021,"In this paper, we propose a novel Reference-based Model for image compression to effectively leverage both the local and global context information, which yields an enhanced compression performance." 
+H1x5wRVtvS,SyerMFwOvr,1569440000000.0,1583910000000.0,1185,Variational Hetero-Encoder Randomized GANs for Joint Image-Text Modeling,"[""zhanghao_xidian@163.com"", ""bchen@mail.xidian.edu.cn"", ""tianlong_xidian@163.com"", ""zhengjuewang@163.com"", ""mingyuan.zhou@mccombs.utexas.edu""]","[""Hao Zhang"", ""Bo Chen"", ""Long Tian"", ""Zhengjue Wang"", ""Mingyuan Zhou""]","[""Deep topic model"", ""image generation"", ""text generation"", ""raster-scan-GAN"", ""zero-shot learning""]","For bidirectional joint image-text modeling, we develop variational hetero-encoder (VHE) randomized generative adversarial network (GAN), a versatile deep generative model that integrates a probabilistic text decoder, probabilistic image encoder, and GAN into a coherent end-to-end multi-modality learning framework. VHE randomized GAN (VHE-GAN) encodes an image to decode its associated text, and feeds the variational posterior as the source of randomness into the GAN image generator. We plug three off-the-shelf modules, including a deep topic model, a ladder-structured image encoder, and StackGAN++, into VHE-GAN, which already achieves competitive performance. This further motivates the development of VHE-raster-scan-GAN that generates photo-realistic images in not only a multi-scale low-to-high-resolution manner, but also a hierarchical-semantic coarse-to-fine fashion. By capturing and relating hierarchical semantic and visual concepts with end-to-end training, VHE-raster-scan-GAN achieves state-of-the-art performance in a wide variety of image-text multi-modality learning and generation tasks. ",/pdf/f976e5408d643dda99fee9162ce5a356b02ce0f6.pdf,ICLR,2020,"A novel Bayesian deep learning framework that captures and relates hierarchical semantic and visual concepts, performing well on a variety of image and text modeling and generation tasks." 
+ryserbZR-,H19xS-WCb,1509130000000.0,1518730000000.0,670,Classification and Disease Localization in Histopathology Using Only Global Labels: A Weakly-Supervised Approach,"[""pierre.courtiol@owkin.com"", ""eric.tramel@owkin.com"", ""marc.sanselme@owkin.com"", ""gilles.wainrib@owkin.com""]","[""Pierre Courtiol"", ""Eric W. Tramel"", ""Marc Sanselme"", ""Gilles Wainrib""]","[""Weakly Supervised Learning"", ""Medical Imaging"", ""Histopathology"", ""Deep Feature Extraction""]","Analysis of histopathology slides is a critical step for many diagnoses, and in particular in oncology where it defines the gold standard. In the case of digital histopathological analysis, highly trained pathologists must review vast whole-slide-images of extreme digital resolution (100,000^2 pixels) across multiple zoom levels in order to locate abnormal regions of cells, or in some cases single cells, out of millions. The application of deep learning to this problem is hampered not only by small sample sizes, as typical datasets contain only a few hundred samples, but also by the generation of ground-truth localized annotations for training interpretable classification and segmentation models. We propose a method for disease available during training. Even without pixel-level annotations, we are able to demonstrate performance comparable with models trained with strong annotations on the Camelyon-16 lymph node metastases detection challenge. We accomplish this through the use of pre-trained deep convolutional networks, feature embedding, as well as learning via top instances and negative evidence, a multiple instance learning technique from the field of semantic segmentation and object detection.",/pdf/012fa6e5f65cad627fe2f27021148514eea04636.pdf,ICLR,2018,We propose a weakly supervised learning method for the classification and localization of cancers in extremely high resolution histopathology whole slide images using only image-wide labels. 
+rJBiunlAW,H1SoOheAW,1509110000000.0,1518730000000.0,395,Training RNNs as Fast as CNNs,"[""tao@asapp.com"", ""yzhang87@csail.mit.edu"", ""yoav@cs.cornell.edu""]","[""Tao Lei"", ""Yu Zhang"", ""Yoav Artzi""]","[""recurrent neural networks"", ""natural language processing""]","Common recurrent neural network architectures scale poorly due to the intrinsic difficulty in parallelizing their state computations. In this work, we propose the Simple Recurrent Unit (SRU) architecture, a recurrent unit that simplifies the computation and exposes more parallelism. In SRU, the majority of computation for each step is independent of the recurrence and can be easily parallelized. SRU is as fast as a convolutional layer and 5-10x faster than an optimized LSTM implementation. We study SRUs on a wide range of applications, including classification, question answering, language modeling, translation and speech recognition. Our experiments demonstrate the effectiveness of SRU and the trade-off it enables between speed and performance. ",/pdf/0ef2f95f0c2fa3811d5d871749a766e6344f94b3.pdf,ICLR,2018, +B1lLw6EYwB,r1edLgjPDr,1569440000000.0,1583910000000.0,597,Gap-Aware Mitigation of Gradient Staleness,"[""saarbarkai@gmail.com"", ""idohakimi@gmail.com"", ""assaf@cs.technion.ac.il""]","[""Saar Barkai"", ""Ido Hakimi"", ""Assaf Schuster""]","[""distributed"", ""asynchronous"", ""large scale"", ""gradient staleness"", ""staleness penalization"", ""sgd"", ""deep learning"", ""neural networks"", ""optimization""]","Cloud computing is becoming increasingly popular as a platform for distributed training of deep neural networks. Synchronous stochastic gradient descent (SSGD) suffers from substantial slowdowns due to stragglers if the environment is non-dedicated, as is common in cloud computing. Asynchronous SGD (ASGD) methods are immune to these slowdowns but are scarcely used due to gradient staleness, which encumbers the convergence process. 
Recent techniques have had limited success mitigating the gradient staleness when scaling up to many workers (computing nodes). In this paper we define the Gap as a measure of gradient staleness and propose Gap-Aware (GA), a novel asynchronous-distributed method that penalizes stale gradients linearly to the Gap and performs well even when scaling to large numbers of workers. Our evaluation on the CIFAR, ImageNet, and WikiText-103 datasets shows that GA outperforms the currently acceptable gradient penalization method, in final test accuracy. We also provide convergence rate proof for GA. Despite prior beliefs, we show that if GA is applied, momentum becomes beneficial in asynchronous environments, even when the number of workers scales up.",/pdf/55a4070120074571a7669811775bfb58717bc1ed.pdf,ICLR,2020,"A new distributed, asynchronous, SGD-based algorithm, which achieves state-of-the-art accuracy on existing architectures using staleness penalization without having to re-tune the hyperparameters." +ryxGuJrFvS,Sye3Hk0dwS,1569440000000.0,1585810000000.0,1796,Distributionally Robust Neural Networks,"[""ssagawa@cs.stanford.edu"", ""koh.pangwei@gmail.com"", ""thashim@stanford.edu"", ""pliang@cs.stanford.edu""]","[""Shiori Sagawa*"", ""Pang Wei Koh*"", ""Tatsunori B. Hashimoto"", ""Percy Liang""]","[""distributionally robust optimization"", ""deep learning"", ""robustness"", ""generalization"", ""regularization""]","Overparameterized neural networks can be highly accurate on average on an i.i.d. test set, yet consistently fail on atypical groups of the data (e.g., by learning spurious correlations that hold on average but not in such groups). Distributionally robust optimization (DRO) allows us to learn models that instead minimize the worst-case training loss over a set of pre-defined groups. 
However, we find that naively applying group DRO to overparameterized neural networks fails: these models can perfectly fit the training data, and any model with vanishing average training loss also already has vanishing worst-case training loss. Instead, the poor worst-case performance arises from poor generalization on some groups. By coupling group DRO models with increased regularization---stronger-than-typical L2 regularization or early stopping---we achieve substantially higher worst-group accuracies, with 10-40 percentage point improvements on a natural language inference task and two image tasks, while maintaining high average accuracies. Our results suggest that regularization is important for worst-group generalization in the overparameterized regime, even if it is not needed for average generalization. Finally, we introduce a stochastic optimization algorithm for the group DRO setting and provide convergence guarantees for the new algorithm. +",/pdf/c8dbfe42f7468c524de61f120586bbc0e332cb00.pdf,ICLR,2020,"Overparameterized neural networks can be distributionally robust, but only when you account for generalization. " +MmCRswl1UYl,yRJUNfGzdsT,1601310000000.0,1612920000000.0,662,Open Question Answering over Tables and Text,"[""~Wenhu_Chen3"", ""~Ming-Wei_Chang3"", ""~Eva_Schlinger2"", ""~William_Yang_Wang2"", ""~William_W._Cohen2""]","[""Wenhu Chen"", ""Ming-Wei Chang"", ""Eva Schlinger"", ""William Yang Wang"", ""William W. Cohen""]","[""Question Answering"", ""Tabular Data"", ""Open-domain"", ""Retrieval""]","In open question answering (QA), the answer to a question is produced by retrieving and then analyzing documents that might contain answers to the question. Most open QA systems have considered only retrieving information from unstructured text. 
Here we consider for the first time open QA over {\em both} tabular and textual data and present a new large-scale dataset \emph{Open Table-and-Text Question Answering} (OTT-QA) to evaluate performance on this task. Most questions in OTT-QA require multi-hop inference across tabular data and unstructured text, and the evidence required to answer a question can be distributed in different ways over these two types of input, making evidence retrieval challenging---our baseline model using an iterative retriever and BERT-based reader achieves an exact match score less than 10\%. We then propose two novel techniques to address the challenge of retrieving and aggregating evidence for OTT-QA. The first technique is to use ``early fusion'' to group multiple highly relevant tabular and textual units into a fused block, which provides more context for the retriever to search for. The second technique is to use a cross-block reader to model the cross-dependency between multiple retrieved evidence with global-local sparse attention. Combining these two techniques improves the score significantly, to above 27\%.",/pdf/6efd9eab0db73088a48f58c2b76aff5b828c7471.pdf,ICLR,2021,We propose the new task of answering open-domain questions answering over web tables and text and design new techniques: 1) fused retrieval 2) cross-block reader to resolve the challenges posed in the new task. +SkgQwpVYwH,HkxDjocPPB,1569440000000.0,1577170000000.0,590,"Credible Sample Elicitation by Deep Learning, for Deep Learning","[""yangliu@ucsc.edu"", ""zuyuefu2022@u.northwestern.edu"", ""zy6@princeton.edu"", ""zhaoranwang@gmail.com""]","[""Yang Liu"", ""Zuyue Fu"", ""Zhuoran Yang"", ""Zhaoran Wang""]",[],"It is important to collect credible training samples $(x,y)$ for building data-intensive learning systems (e.g., a deep learning system). In the literature, there is a line of studies on eliciting distributional information from self-interested agents who hold a relevant information. 
Asking people to report complex distribution $p(x)$, though theoretically viable, is challenging in practice. This is primarily due to the heavy cognitive loads required for human agents to reason and report this high dimensional information. Consider the example where we are interested in building an image classifier via first collecting a certain category of high-dimensional image data. While classical elicitation results apply to eliciting a complex and generative (and continuous) distribution $p(x)$ for this image data, we are interested in eliciting samples $x_i \sim p(x)$ from agents. This paper introduces a deep learning aided method to incentivize credible sample contributions from selfish and rational agents. The challenge to do so is to design an incentive-compatible score function to score each reported sample to induce truthful reports, instead of an arbitrary or even adversarial one. We show that with accurate estimation of a certain $f$-divergence function we are able to achieve approximate incentive compatibility in eliciting truthful samples. We then present an efficient estimator with theoretical guarantee via studying the variational forms of $f$-divergence function. Our work complements the literature of information elicitation via introducing the problem of \emph{sample elicitation}. We also show a connection between this sample elicitation problem and $f$-GAN, and how this connection can help reconstruct an estimator of the distribution based on collected samples.",/pdf/75470e160d0934494068371f70527f64a36eb4ea.pdf,ICLR,2020,This paper proposes a deep learning aided method to elicit credible samples from self-interested agents. 
+lU5Rs_wCweN,xBP1mFoOhGK,1601310000000.0,1615740000000.0,416,Taking Notes on the Fly Helps Language Pre-Training,"[""~Qiyu_Wu1"", ""~Chen_Xing2"", ""yatli@microsoft.com"", ""~Guolin_Ke3"", ""~Di_He1"", ""~Tie-Yan_Liu1""]","[""Qiyu Wu"", ""Chen Xing"", ""Yatao Li"", ""Guolin Ke"", ""Di He"", ""Tie-Yan Liu""]","[""Natural Language Processing"", ""Pre-training""]","How to make unsupervised language pre-training more efficient and less resource-intensive is an important research direction in NLP. In this paper, we focus on improving the efficiency of language pre-training methods through providing better data utilization. It is well-known that in language data corpus, words follow a heavy-tail distribution. A large proportion of words appear only very few times and the embeddings of rare words are usually poorly optimized. We argue that such embeddings carry inadequate semantic signals, which could make the data utilization inefficient and slow down the pre-training of the entire model. To mitigate this problem, we propose Taking Notes on the Fly (TNF), which takes notes for rare words on the fly during pre-training to help the model understand them when they occur next time. Specifically, TNF maintains a note dictionary and saves a rare word's contextual information in it as notes when the rare word occurs in a sentence. When the same rare word occurs again during training, the note information saved beforehand can be employed to enhance the semantics of the current sentence. By doing so, TNF provides a better data utilization since cross-sentence information is employed to cover the inadequate semantics caused by rare words in the sentences. We implement TNF on both BERT and ELECTRA to check its efficiency and effectiveness. Experimental results show that TNF's training time is 60% less than its backbone pre-training models when reaching the same performance. 
When trained with same number of iterations, TNF outperforms its backbone methods on most of downstream tasks and the average GLUE score. Code is attached in the supplementary material.",/pdf/954ca2d8ae9cd134cd7cb0003ecd87b3e6f3bf4e.pdf,ICLR,2021,We improve the efficiency of language pre-training methods through providing better data utilization. +BkePneStwH,B1xWvlWYDH,1569440000000.0,1577170000000.0,2541,XD: Cross-lingual Knowledge Distillation for Polyglot Sentence Embeddings,"[""max.del.edu@gmail.com"", ""fishel@ut.ee""]","[""Maksym Del"", ""Mark Fishel""]","[""cross-lingual transfer"", ""sentence embeddings"", ""polyglot language models"", ""knowledge distillation"", ""natural language inference"", ""embedding alignment"", ""embedding mapping""]","Current state-of-the-art results in multilingual natural language inference (NLI) are based on tuning XLM (a pre-trained polyglot language model) separately for each language involved, resulting in multiple models. We reach significantly higher NLI results with a single model for all languages via multilingual tuning. Furthermore, we introduce cross-lingual knowledge distillation (XD), where the same polyglot model is used both as teacher and student across languages to improve its sentence representations without using the end-task labels. When used alone, XD beats multilingual tuning for some languages and the combination of them both results in a new state-of-the-art of 79.2% on the XNLI dataset, surpassing the previous result by absolute 2.5%. 
The models and code for reproducing our experiments will be made publicly available after de-anonymization.",/pdf/39ab0112530dbc00d00e18757b8d859a2d0fe563.pdf,ICLR,2020,Knowledge distillation for cross-lingual language model alignment with state-of-the-art results on XNLI +Hkg1YiAcK7,Byg8hfmct7,1538090000000.0,1545360000000.0,410,Learning Implicit Generative Models by Teaching Explicit Ones,"[""duchao0726@gmail.com"", ""kunxu.thu@gmail.com"", ""chongxuanli1991@gmail.com"", ""dcszj@tsinghua.edu.cn"", ""dcszb@tsinghua.edu.cn""]","[""Chao Du"", ""Kun Xu"", ""Chongxuan Li"", ""Jun Zhu"", ""Bo Zhang""]",[],"Implicit generative models are difficult to train as no explicit probability density functions are defined. Generative adversarial nets (GANs) propose a minimax framework to train such models, which suffer from mode collapse in practice due to the nature of the JS-divergence. In contrast, we propose a learning by teaching (LBT) framework to learn implicit models, which intrinsically avoid the mode collapse problem because of using the KL-divergence. In LBT, an auxiliary explicit model is introduced to learn the distribution defined by the implicit model while the latter one's goal is to teach the explicit model to match the data distribution. LBT is formulated as a bilevel optimization problem, whose optimum implies that we obtain the maximum likelihood estimation of the implicit model. We adopt an unrolling approach to solve the challenging learning problem. 
Experimental results demonstrate the effectiveness of our method.",/pdf/3c10a26ba9cc6ed120af425ea8d4a8f0ee7f1ab4.pdf,ICLR,2019, +SJUX_MWCZ,SkrQdzWCW,1509140000000.0,1518730000000.0,869,Predict Responsibly: Increasing Fairness by Learning to Defer,"[""david.madras@mail.utoronto.ca"", ""zemel@cs.toronto.edu"", ""toni@cs.toronto.edu""]","[""David Madras"", ""Toniann Pitassi"", ""Richard Zemel""]","[""Fairness"", ""IDK"", ""Calibration"", ""Automated decision-making"", ""Transparency"", ""Accountability""]","When machine learning models are used for high-stakes decisions, they should predict accurately, fairly, and responsibly. To fulfill these three requirements, a model must be able to output a reject option (i.e. say ""I Don't Know"") when it is not qualified to make a prediction. In this work, we propose learning to defer, a method by which a model can defer judgment to a downstream decision-maker such as a human user. We show that learning to defer generalizes the rejection learning framework in two ways: by considering the effect of other agents in the decision-making process, and by allowing for optimization of complex objectives. We propose a learning algorithm which accounts for potential biases held by decision-makers later in a pipeline. Experiments on real-world datasets demonstrate that learning +to defer can make a model not only more accurate but also less biased. Even when +operated by highly biased users, we show that +deferring models can still greatly improve the fairness of the entire pipeline.",/pdf/d3c6975dd33d3ce19c7a3a1379b4148b0dd37583.pdf,ICLR,2018,"Incorporating the ability to say I-don't-know can improve the fairness of a classifier without sacrificing too much accuracy, and this improvement magnifies when the classifier has insight into downstream decision-making." 
+r1xI-gHFDH,rygfDleKPB,1569440000000.0,1577170000000.0,2137,How can we generalise learning distributed representations of graphs?,"[""pms69@cam.ac.uk"", ""pl219@cam.ac.uk""]","[""Paul M Scherer"", ""Pietro Lio""]","[""graphs"", ""distributed representations"", ""similarity learning""]",We propose a general framework to construct unsupervised models capable of learning distributed representations of discrete structures such as graphs based on R-Convolution kernels and distributed semantics research. Our framework combines the insights and observations of Deep Graph Kernels and Graph2Vec towards a unified methodology for performing similarity learning on graphs of arbitrary size. This is exemplified by our own instance G2DR which extends Graph2Vec from labelled graphs towards unlabelled graphs and tackles issues of diagonal dominance through pruning of the subgraph vocabulary composing graphs. These changes produce new state of the art results in the downstream application of G2DR embeddings in graph classification tasks over datasets with small labelled graphs in binary classification to multi-class classification on large unlabelled graphs using an off-the-shelf support vector machine. ,/pdf/bc2489ef3429cb51829b11e40e089e6f99493b8b.pdf,ICLR,2020,We propose a general framework for building models that can learn distributed representations of discrete structures and test this on graphs. +r1gIdySFPH,HklG5g0_PB,1569440000000.0,1577170000000.0,1805,Skew-Fit: State-Covering Self-Supervised Reinforcement Learning,"[""vitchyr@berkeley.edu"", ""mdalal@berkeley.edu"", ""stevenlin598@berkeley.edu"", ""anair17@berkeley.edu"", ""shikharbahl@berkeley.edu"", ""svlevine@eecs.berkeley.edu""]","[""Vitchyr H. 
Pong"", ""Murtaza Dalal"", ""Steven Lin"", ""Ashvin Nair"", ""Shikhar Bahl"", ""Sergey Levine""]","[""deep reinforcement learning"", ""goal space"", ""goal conditioned reinforcement learning"", ""self-supervised reinforcement learning"", ""goal sampling"", ""reinforcement learning""]","Autonomous agents that must exhibit flexible and broad capabilities will need to be equipped with large repertoires of skills. Defining each skill with a manually-designed reward function limits this repertoire and imposes a manual engineering burden. Self-supervised agents that set their own goals can automate this process, but designing appropriate goal setting objectives can be difficult, and often involves heuristic design decisions. In this paper, we propose a formal exploration objective for goal-reaching policies that maximizes state coverage. We show that this objective is equivalent to maximizing the entropy of the goal distribution together with goal reaching performance, where goals correspond to full state observations. To instantiate this principle, we present an algorithm called Skew-Fit for learning maximum-entropy goal distributions. Skew-Fit enables self-supervised agents to autonomously choose and practice reaching diverse goals. We show that, under certain regularity conditions, our method converges to a uniform distribution over the set of valid states, even when we do not know this set beforehand. Our experiments show that it can learn a variety of manipulation tasks from images, including opening a door with a real robot, entirely from scratch and without any manually-designed reward function.",/pdf/75b54ec2afc748a7f962d69b6e1ebd3d40762570.pdf,ICLR,2020,"We propose a principled objective for autonomous goal-setting in high-dimensional, unknown goal spaces and present a method that theoretically and empirically learns the optimal goal distribution." 
+jYkO_0z2TAr,o-bjcNAyiE,1601310000000.0,1614990000000.0,547,Zero-Shot Learning with Common Sense Knowledge Graphs,"[""~Nihal_Nayak1"", ""~Stephen_Bach1""]","[""Nihal Nayak"", ""Stephen Bach""]","[""Zero-Shot Learning"", ""Common Sense Knowledge Graphs"", ""Graph Neural Networks""]","Zero-shot learning relies on semantic class representations such as hand-engineered attributes or learned embeddings to predict classes without any labeled examples. We propose to learn class representations from common sense knowledge graphs. Common sense knowledge graphs are an untapped source of explicit high-level knowledge that requires little human effort to apply to a range of tasks. To capture the knowledge in the graph, we introduce ZSL-KG, a general-purpose framework with a novel transformer graph convolutional network (TrGCN) to generate class representations. Our proposed TrGCN architecture computes non-linear combinations of the node neighbourhood and leads to significant improvements on zero-shot learning tasks. We report new state-of-the-art accuracies on six zero-shot benchmark datasets in object classification, intent classification, and fine-grained entity typing tasks. ZSL-KG outperforms the specialized state-of-the-art method for each task by an average 1.7 accuracy points and outperforms the general-purpose method with the best average accuracy by 5.3 points. Our ablation study on ZSL-KG with alternate graph neural networks shows that our transformer-based aggregator adds up to 2.8 accuracy points improvement on these tasks.",/pdf/7fd9386440b5eaa508e66239fd9effc15dbfd256.pdf,ICLR,2021,"Our paper introduces ZSL-KG, a general-purpose zero-shot learning framework with a novel transformer graph convolutional network (TrGCN) to learn class representation from common sense knowledge graphs." 
+r1kNDlbCb,Sy07vl-0-,1509130000000.0,1518730000000.0,598,Learning to Encode Text as Human-Readable Summaries using Generative Adversarial Networks,"[""king6101@gmail.com"", ""tlkagkb93901106@gmail.com""]","[""Yau-Shian Wang"", ""Hung-Yi Lee""]","[""unsupervised learning"", ""text summarization"", ""adversarial training""]","Auto-encoders compress input data into a latent-space representation and reconstruct the original data from the representation. This latent representation is not easily interpreted by humans. In this paper, we propose training an auto-encoder that encodes input text into human-readable sentences. The auto-encoder is composed of a generator and a reconstructor. The generator encodes the input text into a shorter word sequence, and the reconstructor recovers the generator input from the generator output. +To make the generator output human-readable, a discriminator restricts the output of the generator to resemble human-written sentences. By taking the generator output as the summary of the input text, abstractive summarization is achieved without document-summary pairs as training data. Promising results are shown on both English and Chinese corpora.",/pdf/bc024d100fea1617f3f3977df0a78b39acbb6bf5.pdf,ICLR,2018, +_lV1OrJIgiG,20xJJaY_60f,1601310000000.0,1614990000000.0,387,Model-based Navigation in Environments with Novel Layouts Using Abstract $2$-D Maps,"[""~Linfeng_Zhao1"", ""~Lawson_L._S._Wong1""]","[""Linfeng Zhao"", ""Lawson L. S. Wong""]",[],"Efficiently training agents with planning capabilities has long been one of the major challenges in decision-making. In this work, we focus on zero-shot navigation ability on a given abstract 2-D occupancy map, like human navigation by reading a paper map, by treating it as an image. To learn this ability, we need to efficiently train an agent on environments with a small proportion of training maps and share knowledge effectively across the environments. 
We hypothesize that model-based navigation can better adapt an agent's behaviors to a task, since it disentangles the variations in map layout and goal location and enables longer-term planning ability on novel locations compared to reactive policies. We propose to learn a hypermodel that can understand patterns from a limited number of abstract maps and goal locations, to maximize alignment between the hypermodel predictions and real trajectories to extract information from multi-task off-policy experiences, and to construct denser feedback for planners by $n$-step goal relabelling. We train our approach on DeepMind Lab environments with layouts from different maps, and demonstrate superior performance on zero-shot transfer to novel maps and goals.",/pdf/c00e6e25426e69df6991307513a5a64da045a57e.pdf,ICLR,2021, +SJl9PTNYDS,rJe2fEswDr,1569440000000.0,1577170000000.0,607,NPTC-net: Narrow-Band Parallel Transport Convolutional Neural Network on Point Clouds,"[""jinpf@pku.edu.cn"", ""howeverlth@pku.edu.cn"", ""lair@rpi.edu"", ""dongbin@math.pku.edu.cn""]","[""Pengfei Jin"", ""Tianhao Lai"", ""Rongjie Lai"", ""Bin Dong""]","[""geometric convolution"", ""point cloud"", ""parallel transport""]","Convolution plays a crucial role in various applications in signal and image processing, analysis and recognition. It is also the main building block of convolution neural networks (CNNs). Designing appropriate convolution neural networks on manifold-structured point clouds can inherit and empower recent advances of CNNs to analyzing and processing point cloud data. However, one of the major challenges is to define a proper way to ""sweep"" filters through the point cloud as a natural generalization of the planar convolution and to reflect the point cloud's geometry at the same time. In this paper, we consider generalizing convolution by adapting parallel transport on the point cloud. 
Inspired by a triangulated surface based method \cite{DBLP:journals/corr/abs-1805-07857}, we propose the Narrow-Band Parallel Transport Convolution (NPTC) using a specifically defined connection on a voxelized narrow-band approximation of point cloud data. With that, we further propose a deep convolutional neural network based on NPTC (called NPTC-net) for point cloud classification and segmentation. Comprehensive experiments show that the proposed NPTC-net achieves similar or better results than current state-of-the-art methods on point clouds classification and segmentation.",/pdf/fa336be1cc058b5f8e7d2b06cd0be6d180a3b8a5.pdf,ICLR,2020,We propose the Narrow-Band Parallel Transport Convolution (NPTC) using a specifically defined connection on a voxelized narrow-band approximation of point cloud data and further propose a deep convolutional neural network based on NPTC . +SJJySbbAZ,r1p04WZCW,1509130000000.0,1519450000000.0,669,Training GANs with Optimism,"[""costis@mit.edu"", ""ailyas@mit.edu"", ""vasy@microsoft.com"", ""haoyangz@mit.edu""]","[""Constantinos Daskalakis"", ""Andrew Ilyas"", ""Vasilis Syrgkanis"", ""Haoyang Zeng""]","[""GANs"", ""Optimistic Mirror Decent"", ""Cycling"", ""Last Iterate Convergence"", ""Optimistic Adam""]","We address the issue of limit cycling behavior in training Generative Adversarial Networks and propose the use of Optimistic Mirror Decent (OMD) for training Wasserstein GANs. Recent theoretical results have shown that optimistic mirror decent (OMD) can enjoy faster regret rates in the context of zero-sum games. WGANs is exactly a context of solving a zero-sum game with simultaneous no-regret dynamics. Moreover, we show that optimistic mirror decent addresses the limit cycling problem in training WGANs. We formally show that in the case of bi-linear zero-sum games the last iterate of OMD dynamics converges to an equilibrium, in contrast to GD dynamics which are bound to cycle. 
We also portray the huge qualitative difference between GD and OMD dynamics with toy examples, even when GD is modified with many adaptations proposed in the recent literature, such as gradient penalty or momentum. We apply OMD WGAN training to a bioinformatics problem of generating DNA sequences. We observe that models trained with OMD achieve consistently smaller KL divergence with respect to the true underlying distribution, than models trained with GD variants. Finally, we introduce a new algorithm, Optimistic Adam, which is an optimistic variant of Adam. We apply it to WGAN training on CIFAR10 and observe improved performance in terms of inception score as compared to Adam.",/pdf/09c12e21ce7e1c0e1f34b1aa7ef62042de2b0ddd.pdf,ICLR,2018,We propose the use of optimistic mirror decent to address cycling problems in the training of GANs. We also introduce the Optimistic Adam algorithm +HkanP0lRW,S183D0gC-,1509120000000.0,1518730000000.0,453,Data-driven Feature Sampling for Deep Hyperspectral Classification and Segmentation,"[""wmsever@sandia.gov"", ""jatimli@sandia.gov"", ""skholwadwala@gmail.com"", ""cdjame@sandia.gov"", ""jbaimon@sandia.gov""]","[""William M. Severa"", ""Jerilyn A. Timlin"", ""Suraj Kholwadwala"", ""Conrad D. James"", ""James B. Aimone""]","[""Applied deep learning"", ""Image segmentation"", ""Hyperspectral Imaging"", ""Feature sampling""]","The high dimensionality of hyperspectral imaging forces unique challenges in scope, size and processing requirements. Motivated by the potential for an in-the-field cell sorting detector, we examine a Synechocystis sp. PCC 6803 dataset wherein cells are grown alternatively in nitrogen rich or deplete cultures. We use deep learning techniques to both successfully classify cells and generate a mask segmenting the cells/condition from the background. 
Further, we use the classification accuracy to guide a data-driven, iterative feature selection method, allowing the design of neural networks requiring 90% fewer input features with little accuracy degradation.",/pdf/1f0b60e7281c27ad045c8e74fc9d3f22fc533085.pdf,ICLR,2018,We applied deep learning techniques to hyperspectral image segmentation and iterative feature sampling. +pAj7zLJK05U,VUWPQtIEpb,1601310000000.0,1614990000000.0,2915,AttackDist: Characterizing Zero-day Adversarial Samples by Counter Attack,"[""~Simin_Chen1"", ""~Zihe_Song1"", ""~Lei_Ma1"", ""~Cong_Liu2"", ""wei.yang@utdallas.edu""]","[""Simin Chen"", ""Zihe Song"", ""Lei Ma"", ""Cong Liu"", ""Wei Yang""]",[],"Deep Neural Networks (DNNs) have been shown vulnerable to adversarial attacks, which could produce adversarial samples that easily fool the state-of-the-art DNNs. The harmfulness of adversarial attacks calls for the defense mechanisms under fire. However, the relationship between adversarial attacks and defenses is like spear and shield. Whenever a defense method is proposed, a new attack would be followed to bypass the defense immediately. Devising a definitive defense against new attacks (zero-day attacks) is proven to be challenging. We tackle this challenge by characterizing the intrinsic properties of adversarial samples, via measuring the norm of the perturbation after a counterattack. Our method is based on the idea that, from an optimization perspective, adversarial samples would be closer to the decision boundary; thus the perturbation to counterattack adversarial samples would be significantly smaller than normal cases. Motivated by this, we propose AttackDist, an attack-agnostic property to characterize adversarial samples. We first theoretically clarify under which condition AttackDist can provide a certified detecting performance, then show that a potential application of AttackDist is distinguishing zero-day adversarial examples without knowing the mechanisms of new attacks. 
As a proof-of-concept, we evaluate AttackDist on two widely used benchmarks. The evaluation results show that AttackDist can outperform the state-of-the-art detection measures by large margins in detecting zero-day adversarial attacks.",/pdf/44b02f034a3259917a79f05861ebc5bd6d5f8570.pdf,ICLR,2021, +Cn706AbJaKW,0b6aUiO42bm,1601310000000.0,1614990000000.0,2675,An Open Review of OpenReview: A Critical Analysis of the Machine Learning Conference Review Process,"[""~David_Tran1"", ""~Alexander_V_Valtchanov2"", ""~Keshav_R_Ganapathy1"", ""~Raymond_Feng1"", ""slud@umd.edu"", ""~Micah_Goldblum1"", ""~Tom_Goldstein1""]","[""David Tran"", ""Alexander V Valtchanov"", ""Keshav R Ganapathy"", ""Raymond Feng"", ""Eric Victor Slud"", ""Micah Goldblum"", ""Tom Goldstein""]","[""Conference Review"", ""OpenReview"", ""Gender"", ""Bias"", ""Reproducibility"", ""Fairness""]","Mainstream machine learning conferences have seen a dramatic increase in the number of participants, along with a growing range of perspectives, in recent years. Members of the machine learning community are likely to overhear allegations ranging from randomness of acceptance decisions to institutional bias. In this work, we critically analyze the review process through a comprehensive study of papers submitted to ICLR between 2017 and 2020. We quantify reproducibility/randomness in review scores and acceptance decisions, and examine whether scores correlate with paper impact. Our findings suggest strong institutional bias in accept/reject decisions, even after controlling for paper quality. Furthermore, we find evidence for a gender gap, with female authors receiving lower scores, lower acceptance rates, and fewer citations per paper than their male counterparts. We conclude our work with recommendations for future conference organizers. ",/pdf/05f1be4f15e4f9e1f37f1f1761ceb1cf367ac4d8.pdf,ICLR,2021,We study the conference review process to quantify reproducibility and bias. 
+r1osyr_xg,,1478150000000.0,1484820000000.0,56,Fuzzy paraphrases in learning word representations with a lexicon,"[""enshika8811.a6@keio.jp"", ""hagiwara@keio.jp""]","[""Yuanzhi Ke"", ""Masafumi Hagiwara""]","[""Natural language processing"", ""Unsupervised Learning""]","A synonym of a polysemous word is usually only the paraphrase of one sense among many. When lexicons are used to improve vector-space word representations, such paraphrases are unreliable and bring noise to the vector-space. The prior works use a coefficient to adjust the overall learning of the lexicons. They regard the paraphrases equally. +In this paper, we propose a novel approach that regards the paraphrases diversely to alleviate the adverse effects of polysemy. We annotate each paraphrase with a degree of reliability. The paraphrases are randomly eliminated according to the degrees when our model learns word representations. In this way, our approach drops the unreliable paraphrases, keeping more reliable paraphrases at the same time. The experimental results show that the proposed method improves the word vectors. +Our approach is an attempt to address the polysemy problem keeping one vector per word. It makes the approach easier to use than the conventional methods that estimate multiple vectors for a word. Our approach also outperforms the prior works in the experiments.",/pdf/28e7b976b8ca9d854ce0c725c74051773df282a9.pdf,ICLR,2017,We propose a novel idea to address polysemy problem by annotating paraphrases with a degree of reliability like a member of a fuzzy set. +r1wEFyWCW,ryLNK1WR-,1509120000000.0,1519320000000.0,499,Few-shot Autoregressive Density Estimation: Towards Learning to Learn Distributions,"[""reedscot@google.com"", ""yutianc@google.com"", ""tpaine@google.com"", ""avdnoord@google.com"", ""aeslami@google.com"", ""danilor@google.com"", ""vinyals@google.com"", ""nandodefreitas@google.com""]","[""Scott Reed"", ""Yutian Chen"", ""Thomas Paine"", ""A\u00e4ron van den Oord"", ""S. 
M. Ali Eslami"", ""Danilo Rezende"", ""Oriol Vinyals"", ""Nando de Freitas""]","[""few-shot learning"", ""density models"", ""meta learning""]","Deep autoregressive models have shown state-of-the-art performance in density estimation for natural images on large-scale datasets such as ImageNet. However, such models require many thousands of gradient-based weight updates and unique image examples for training. Ideally, the models would rapidly learn visual concepts from only a handful of examples, similar to the manner in which humans learn across many vision tasks. In this paper, we show how 1) neural attention and 2) meta learning techniques can be used in combination with autoregressive models to enable effective few-shot density estimation. Our proposed modifications to PixelCNN result in state-of-the-art few-shot density estimation on the Omniglot dataset. Furthermore, we visualize the learned attention policy and find that it learns intuitive algorithms for simple tasks such as image mirroring on ImageNet and handwriting on Omniglot without supervision. Finally, we extend the model to natural images and demonstrate few-shot image generation on the Stanford Online Products dataset.",/pdf/e28ae6e6977da3b13d1c15287710bc92ace11d99.pdf,ICLR,2018,Few-shot learning PixelCNN +UH-cmocLJC,_88KWZFfavA,1601310000000.0,1614730000000.0,700,How Neural Networks Extrapolate: From Feedforward to Graph Neural Networks,"[""~Keyulu_Xu1"", ""~Mozhi_Zhang1"", ""~Jingling_Li1"", ""~Simon_Shaolei_Du1"", ""~Ken-Ichi_Kawarabayashi1"", ""~Stefanie_Jegelka3""]","[""Keyulu Xu"", ""Mozhi Zhang"", ""Jingling Li"", ""Simon Shaolei Du"", ""Ken-Ichi Kawarabayashi"", ""Stefanie Jegelka""]","[""extrapolation"", ""deep learning"", ""out-of-distribution"", ""graph neural networks"", ""deep learning theory""]","We study how neural networks trained by gradient descent extrapolate, i.e., what they learn outside the support of the training distribution. 
Previous works report mixed empirical results when extrapolating with neural networks: while feedforward neural networks, a.k.a. multilayer perceptrons (MLPs), do not extrapolate well in certain simple tasks, Graph Neural Networks (GNNs) -- structured networks with MLP modules -- have shown some success in more complex tasks. Working towards a theoretical explanation, we identify conditions under which MLPs and GNNs extrapolate well. First, we quantify the observation that ReLU MLPs quickly converge to linear functions along any direction from the origin, which implies that ReLU MLPs do not extrapolate most nonlinear functions. But, they can provably learn a linear target function when the training distribution is sufficiently diverse. Second, in connection to analyzing the successes and limitations of GNNs, these results suggest a hypothesis for which we provide theoretical and empirical evidence: the success of GNNs in extrapolating algorithmic tasks to new data (e.g., larger graphs or edge weights) relies on encoding task-specific non-linearities in the architecture or features. Our theoretical analysis builds on a connection of over-parameterized networks to the neural tangent kernel. Empirically, our theory holds across different training settings.",/pdf/eb776544f1a0835881289e4dad2436f38f88269d.pdf,ICLR,2021,We study how neural networks trained by gradient descent extrapolate. +MMXhHXbNsa-,GQBqARxfFi,1601310000000.0,1614990000000.0,2440,Blind Pareto Fairness and Subgroup Robustness,"[""~Natalia_Martinez1"", ""martin.a.bertran@gmail.com"", ""~Afroditi_Papadaki1"", ""~Miguel_R._D._Rodrigues1"", ""~Guillermo_Sapiro1""]","[""Natalia Martinez"", ""Martin Bertran"", ""Afroditi Papadaki"", ""Miguel R. D. 
Rodrigues"", ""Guillermo Sapiro""]","[""fairness"", ""fairness in machine learning"", ""fairness without demographics"", ""robustness"", ""subgroup robustness"", ""blind fairness"", ""pareto fairness""]","With the wide adoption of machine learning algorithms across various application domains, there is a growing interest in the fairness properties of such algorithms. The vast majority of the activity in the field of group fairness addresses disparities between predefined groups based on protected features such as gender, age, and race, which need to be available at train, and often also at test, time. These approaches are static and retrospective, since algorithms designed to protect groups identified a priori cannot anticipate and protect the needs of different at-risk groups in the future. In this work we analyze the space of solutions for worst-case fairness beyond demographics, and propose Blind Pareto Fairness (BPF), a method that leverages no-regret dynamics to recover a fair minimax classifier that reduces worst-case risk of any potential subgroup of sufficient size, and guarantees that the remaining population receives the best possible level of service. BPF addresses fairness beyond demographics, that is, it does not rely on predefined notions of at-risk groups, neither at train nor at test time. Our experimental results show that the proposed framework improves worst-case risk in multiple standard datasets, while simultaneously providing better levels of service for the remaining population, in comparison to competing methods.",/pdf/6fe40433443114bc2c892683832d52416321de54.pdf,ICLR,2021,"We analyze worst-case fairness beyond demographics, and propose Blind Pareto Fairness, a method that reduces worst-case risk of any subgroup of sufficient size, and guarantees that the remaining population receives the best possible level of service." 
+rJeEqiC5KQ,S1ermRS9FX,1538090000000.0,1545360000000.0,526,ON THE USE OF CONVOLUTIONAL AUTO-ENCODER FOR INCREMENTAL CLASSIFIER LEARNING IN CONTEXT AWARE ADVERTISEMENT,"[""tlnma@i2r.a-star.edu.sg"", ""xie_shudong@i2r.a-star.edu.sg"", ""e0267605@u.nus.edu"", ""yqli@i2r.a-star.edu.sg"", ""joohwee@i2r.a-star.edu.sg""]","[""Tin Lay Nwe"", ""Shudong Xie"", ""Balaji Nataraj"", ""Yiqun Li"", ""Joo-Hwee Lim""]","[""Incremental learning"", ""deep learning"", ""autoencoder"", ""privacy"", ""convolutional neural network""]","Context Aware Advertisement (CAA) is a type of advertisement +appearing on websites or mobile apps. The advertisement is targeted +on specific group of users and/or the content displayed on the +websites or apps. This paper focuses on classifying images displayed +on the websites by incremental learning classifier with Deep +Convolutional Neural Network (DCNN) especially for Context Aware +Advertisement (CAA) framework. Incrementally learning new knowledge +with DCNN leads to catastrophic forgetting as previously stored +information is replaced with new information. To prevent +catastrophic forgetting, part of previously learned knowledge should +be stored for the life time of incremental classifier. Storing +information for life time involves privacy and legal concerns +especially in context aware advertising framework. Here, we propose +an incremental classifier learning method which addresses privacy +and legal concerns while taking care of catastrophic forgetting +problem. We conduct experiments on different datasets including +CIFAR-100. 
Experimental results show that proposed system achieves +relatively high performance compared to the state-of-the-art +incremental learning methods.",/pdf/de32809720997cc7dbd94749cc66085c1f425415.pdf,ICLR,2019,Human brain inspired incremental learning system +SVP44gujOBL,hvXAfafNts,1601310000000.0,1614990000000.0,3564,A Simple Approach To Define Curricula For Training Neural Networks,"[""~Vinu_Sankar_Sadasivan1"", ""~Anirban_Dasgupta1""]","[""Vinu Sankar Sadasivan"", ""Anirban Dasgupta""]","[""Curriculum learning"", ""neural networks""]","In practice, sequence of mini-batches generated by uniform sampling of examples from the entire data is used for training neural networks. Curriculum learning is a training strategy that sorts the training examples by their difficulty and gradually exposes them to the learner. In this work, we propose two novel curriculum learning algorithms and empirically show their improvements in performance with convolutional and fully-connected neural networks on multiple real image datasets. Our dynamic curriculum learning algorithm tries to reduce the distance between the network weight and an optimal weight at any training step by greedily sampling examples with gradients that are directed towards the optimal weight. The curriculum ordering determined by our dynamic algorithm achieves a training speedup of $\sim 45\%$ in our experiments. We also introduce a new task-specific curriculum learning strategy that uses statistical measures such as standard deviation and entropy values to score the difficulty of data points in natural image datasets. We show that this new approach yields a mean training speedup of $\sim 43\%$ in the experiments we perform. Further, we also use our algorithms to learn why curriculum learning works. 
Based on our study, we argue that curriculum learning removes noisy examples from the initial phases of training, and gradually exposes them to the learner acting like a regularizer that helps in improving the generalization ability of the learner.",/pdf/b9d83283b657b55874ffcc9c616ba64a0ffbe559.pdf,ICLR,2021,"We introduce a simple, unsupervised approach to score the difficulty of examples using statistical measures for curriculum learning, and analyze it with the help of a dynamic curriculum learning framework that we design." +BJg1f6EFDB,rJlGWVF8DB,1569440000000.0,1583910000000.0,396,On Identifiability in Transformers,"[""brunnegi@ethz.ch"", ""liu.yang@alumni.ethz.ch"", ""dpascual@ethz.ch"", ""richtero@ethz.ch"", ""massi@google.com"", ""wattenhofer@ethz.ch""]","[""Gino Brunner"", ""Yang Liu"", ""Damian Pascual"", ""Oliver Richter"", ""Massimiliano Ciaramita"", ""Roger Wattenhofer""]","[""Self-attention"", ""interpretability"", ""identifiability"", ""BERT"", ""Transformer"", ""NLP"", ""explanation"", ""gradient attribution""]","In this paper we delve deep in the Transformer architecture by investigating two of its core components: self-attention and contextual embeddings. In particular, we study the identifiability of attention weights and token embeddings, and the aggregation of context into hidden tokens. We show that, for sequences longer than the attention head dimension, attention weights are not identifiable. We propose effective attention as a complementary tool for improving explanatory interpretations based on attention. Furthermore, we show that input tokens retain to a large degree their identity across the model. We also find evidence suggesting that identity information is mainly encoded in the angle of the embeddings and gradually decreases with depth. Finally, we demonstrate strong mixing of input information in the generation of contextual embeddings by means of a novel quantification method based on gradient attribution. 
Overall, we show that self-attention distributions are not directly interpretable and present tools to better understand and further investigate Transformer models. ",/pdf/93b48ea21212ada64c99298c6f32e8503e99d0e2.pdf,ICLR,2020,We investigate the identifiability and interpretability of attention distributions and tokens within contextual embeddings in the self-attention based BERT model. +BJxhLAuxg,,1478190000000.0,1478300000000.0,69,A Deep Learning Approach for Joint Video Frame and Reward Prediction in Atari Games,"[""felix.leibfried@gmail.com"", ""nkushman@microsoft.com"", ""katja.hofmann@microsoft.com""]","[""Felix Leibfried"", ""Nate Kushman"", ""Katja Hofmann""]",[],"Reinforcement learning is concerned with learning to interact with environments that are initially unknown. State-of-the-art reinforcement learning approaches, such as DQN, are model-free and learn to act effectively across a wide range of environments such as Atari games, but require huge amounts of data. Model-based techniques are more data-efficient, but need to acquire explicit knowledge about the environment dynamics or the reward structure. + +In this paper we take a step towards using model-based techniques in environments with high-dimensional visual state space when system dynamics and the reward structure are both unknown and need to be learned, by demonstrating that it is possible to learn both jointly. +Empirical evaluation on five Atari games demonstrates accurate cumulative reward prediction of up to 200 frames. 
We consider these positive results as opening up important directions for model-based RL in complex, initially unknown environments.",/pdf/6211011f3f06ace82437c764fc1d57ca176241e6.pdf,ICLR,2017, +SJexHkSFPS,rJxaFgpOPr,1569440000000.0,1583910000000.0,1679,Thinking While Moving: Deep Reinforcement Learning with Concurrent Control,"[""tedxiao@google.com"", ""ejang@google.com"", ""dkalashnikov@google.com"", ""slevine@google.com"", ""julianibarz@google.com"", ""karolhausman@google.com"", ""alexherzog@google.com""]","[""Ted Xiao"", ""Eric Jang"", ""Dmitry Kalashnikov"", ""Sergey Levine"", ""Julian Ibarz"", ""Karol Hausman"", ""Alexander Herzog""]","[""deep reinforcement learning"", ""continuous-time"", ""robotics""]","We study reinforcement learning in settings where sampling an action from the policy must be done concurrently with the time evolution of the controlled system, such as when a robot must decide on the next action while still performing the previous action. Much like a person or an animal, the robot must think and move at the same time, deciding on its next action before the previous one has completed. In order to develop an algorithmic framework for such concurrent control problems, we start with a continuous-time formulation of the Bellman equations, and then discretize them in a way that is aware of system delays. We instantiate this new class of approximate dynamic programming methods via a simple architectural extension to existing value-based deep reinforcement learning algorithms. We evaluate our methods on simulated benchmark tasks and a large-scale robotic grasping task where the robot must ""think while moving.""",/pdf/0b01c152c145c317d0f64e522bc8badcfd6d2f29.pdf,ICLR,2020,"Reinforcement learning formulation that allows agents to think and act at the same time, demonstrated on real-world robotic grasping." 
+XZDeL25T12l,Jyvtw4FNvCZ,1601310000000.0,1614990000000.0,707,Can Students Outperform Teachers in Knowledge Distillation based Model Compression?,"[""~Xiang_Deng1"", ""~Zhongfei_Zhang1""]","[""Xiang Deng"", ""Zhongfei Zhang""]","[""Knowledge Distillation"", ""Deep Learning"", ""Supervised Learning"", ""Model Compression""]","Knowledge distillation (KD) is an effective technique to compress a large model (teacher) to a compact one (student) by knowledge transfer. The ideal case is that the teacher is compressed to the small student without any performance dropping. However, even for the state-of-the-art (SOTA) distillation approaches, there is still an obvious performance gap between the student and the teacher. The existing literature usually attributes this to model capacity differences between them. However, model capacity differences are unavoidable in model compression. In this work, we systematically study this question. By designing exploratory experiments, we find that model capacity differences are not necessarily the root reason, and the distillation data matters when the student capacity is greater than a threshold. In light of this, we propose to go beyond in-distribution distillation and accordingly develop KD+. KD+ is superior to the original KD as it outperforms KD and the other SOTA approaches substantially and is more compatible with the existing approaches to further improve their performances significantly.",/pdf/1f267450c9b97448b6c7111e79c03346754aac13.pdf,ICLR,2021,This work is toward understanding why students underperform teachers in knowledge distillation based model compression from a data perspective. 
+S1CChZ-CZ,HJTRh-ZAW,1509130000000.0,1519410000000.0,725,Ask the Right Questions: Active Question Reformulation with Reinforcement Learning,"[""cbuck@google.com"", ""jbulian@google.com"", ""massi@google.com"", ""wgaj@google.com"", ""agesmundo@google.com"", ""neilhoulsby@google.com"", ""wangwe@google.com""]","[""Christian Buck"", ""Jannis Bulian"", ""Massimiliano Ciaramita"", ""Wojciech Gajewski"", ""Andrea Gesmundo"", ""Neil Houlsby"", ""Wei Wang.""]","[""machine translation"", ""paraphrasing"", ""question answering"", ""reinforcement learning"", ""agents""]","We frame Question Answering (QA) as a Reinforcement Learning task, an approach that we call Active Question Answering. + +We propose an agent that sits between the user and a black box QA system and learns to reformulate questions to elicit the best possible answers. The agent probes the system with, potentially many, natural language reformulations of an initial question and aggregates the returned evidence to yield the best answer. + +The reformulation system is trained end-to-end to maximize answer quality using policy gradient. We evaluate on SearchQA, a dataset of complex questions extracted from Jeopardy!. The agent outperforms a state-of-the-art base model, playing the role of the environment, and other benchmarks. + +We also analyze the language that the agent has learned while interacting with the question answering system. We find that successful question reformulations look quite different from natural language paraphrases. 
The agent is able to discover non-trivial reformulation strategies that resemble classic information retrieval techniques such as term re-weighting (tf-idf) and stemming.",/pdf/0b37c5f5a4af31a4530fd15654af20969fc0c1ed.pdf,ICLR,2018,We propose an agent that sits between the user and a black box question-answering system and which learns to reformulate questions to elicit the best possible answers +rylUOn4Yvr,rJl1mTMr8H,1569440000000.0,1577170000000.0,42,ROBUST DISCRIMINATIVE REPRESENTATION LEARNING VIA GRADIENT RESCALING: AN EMPHASIS REGULARISATION PERSPECTIVE,"[""xwang39@qub.ac.uk"", ""y.hua@qub.ac.uk"", ""elyor@anyvision.co"", ""n.robertson@qub.ac.uk""]","[""Xinshao Wang"", ""Yang Hua"", ""Elyor Kodirov"", ""Neil M. Robertson""]","[""examples weighting"", ""emphasis regularisation"", ""gradient scaling"", ""abnormal training examples""]","It is fundamental and challenging to train robust and accurate Deep Neural Networks (DNNs) when semantically abnormal examples exist. Although great progress has been made, there is still one crucial research question which is not thoroughly explored yet: What training examples should be focused and how much more should they be emphasised to achieve robust learning? In this work, we study this question and propose gradient rescaling (GR) to solve it. GR modifies the magnitude of logit vector’s gradient to emphasise on relatively easier training data points when noise becomes more severe, which functions as explicit emphasis regularisation to improve the generalisation performance of DNNs. Apart from regularisation, we connect GR to examples weighting and designing robust loss functions. We empirically demonstrate that GR is highly anomaly-robust and outperforms the state-of-the-art by a large margin, e.g., increasing 7% on CIFAR100 with 40% noisy labels. It is also significantly superior to standard regularisers in both clean and abnormal settings. 
Furthermore, we present comprehensive ablation studies to explore the behaviours of GR under different cases, which is informative for applying GR in real-world scenarios.",/pdf/5fecb7e054b65fe4a80d5ed507144c216fbf56fb.pdf,ICLR,2020,ROBUST DISCRIMINATIVE REPRESENTATION LEARNING VIA GRADIENT RESCALING: AN EMPHASIS REGULARISATION PERSPECTIVE +rJe7FW-Cb,rykmKW-C-,1509130000000.0,1518730000000.0,701,A Painless Attention Mechanism for Convolutional Neural Networks,"[""pau.rodriguez@cvc.uab.es"", ""gcucurull@cvc.uab.cat"", ""poal@cvc.uab.cat"", ""xavir@cvc.uab.es""]","[""Pau Rodr\u00edguez"", ""Guillem Cucurull"", ""Jordi Gonz\u00e0lez"", ""Josep M. Gonfaus"", ""Xavier Roca""]","[""computer vision"", ""deep learning"", ""convolutional neural networks"", ""attention""]","We propose a novel attention mechanism to enhance Convolutional Neural Networks for fine-grained recognition. The proposed mechanism reuses CNN feature activations to find the most informative parts of the image at different depths with the help of gating mechanisms and without part annotations. Thus, it can be used to augment any layer of a CNN to extract low- and high-level local information to be more discriminative. + +Differently, from other approaches, the mechanism we propose just needs a single pass through the input and it can be trained end-to-end through SGD. As a consequence, the proposed mechanism is modular, architecture-independent, easy to implement, and faster than iterative approaches. + +Experiments show that, when augmented with our approach, Wide Residual Networks systematically achieve superior performance on each of five different fine-grained recognition datasets: the Adience age and gender recognition benchmark, Caltech-UCSD Birds-200-2011, Stanford Dogs, Stanford Cars, and UEC Food-100, obtaining competitive and state-of-the-art scores.",/pdf/f567a7427678ea1e282a25a03b4d7c78faa2a3bd.pdf,ICLR,2018,We enhance CNNs with a novel attention mechanism for fine-grained recognition. 
Superior performance is obtained on 5 datasets. +Bkga90VKDB,Bylph0O_PB,1569440000000.0,1577170000000.0,1299,Distilled embedding: non-linear embedding factorization using knowledge distillation,"[""vasileios.lioutas@carleton.ca"", ""ahmad.rashid@huawei.com"", ""krtin.kumar@huawei.com"", ""md.akmal.haidar@huawei.com"", ""mehdi.rezagholizadeh@huawei.com""]","[""Vasileios Lioutas"", ""Ahmad Rashid"", ""Krtin Kumar"", ""Md Akmal Haidar"", ""Mehdi Rezagholizadeh""]","[""Model Compression"", ""Embedding Compression"", ""Low Rank Approximation"", ""Machine Translation"", ""Natural Language Processing"", ""Deep Learning""]","Word-embeddings are a vital component of Natural Language Processing (NLP) systems and have been extensively researched. Better representations of words have come at the cost of huge memory footprints, which has made deploying NLP models on edge-devices challenging due to memory limitations. Compressing embedding matrices without sacrificing model performance is essential for successful commercial edge deployment. In this paper, we propose Distilled Embedding, an (input/output) embedding compression method based on low-rank matrix decomposition with an added non-linearity. First, we initialize the weights of our decomposition by learning to reconstruct the full word-embedding and then fine-tune on the downstream task employing knowledge distillation on the factorized embedding. We conduct extensive experimentation with various compression rates on machine translation, using different data-sets with a shared word-embedding matrix for both embedding and vocabulary projection matrices. We show that the proposed technique outperforms conventional low-rank matrix factorization, and other recently proposed word-embedding matrix compression methods. 
+",/pdf/da6abb4122fef835800a2e21f70fceaae43c0133.pdf,ICLR,2020,We present an embedding decomposition and distillation technique for NLP model compression which is state-of-the-art in machine translation and simpler than existing methods +Byxpfh0cFm,r1gFusacKQ,1538090000000.0,1550960000000.0,1306,Efficient Augmentation via Data Subsampling,"[""mkuchnik@andrew.cmu.edu"", ""smithv@cmu.edu""]","[""Michael Kuchnik"", ""Virginia Smith""]","[""data augmentation"", ""invariance"", ""subsampling"", ""influence""]","Data augmentation is commonly used to encode invariances in learning methods. However, this process is often performed in an inefficient manner, as artificial examples are created by applying a number of transformations to all points in the training set. The resulting explosion of the dataset size can be an issue in terms of storage and training costs, as well as in selecting and tuning the optimal set of transformations to apply. In this work, we demonstrate that it is possible to significantly reduce the number of data points included in data augmentation while realizing the same accuracy and invariance benefits of augmenting the entire dataset. We propose a novel set of subsampling policies, based on model influence and loss, that can achieve a 90% reduction in augmentation set size while maintaining the accuracy gains of standard data augmentation.",/pdf/78fe660c51df9990135afc8da65faf20e904f239.pdf,ICLR,2019,Selectively augmenting difficult to classify points results in efficient training. +SJxhNTNYwB,ryxPAfEvvH,1569440000000.0,1583910000000.0,501,Black-Box Adversarial Attack with Transferable Model-based Embedding,"[""zhuangbx@connect.ust.hk"", ""tongzhang@tongzhang-ml.org""]","[""Zhichao Huang"", ""Tong Zhang""]","[""adversarial examples"", ""black-box attack"", ""embedding""]","We present a new method for black-box adversarial attack. 
Unlike previous methods that combined transfer-based and scored-based methods by using the gradient or initialization of a surrogate white-box model, this new method tries to learn a low-dimensional embedding using a pretrained model, and then performs efficient search within the embedding space to attack an unknown target network. The method produces adversarial perturbations with high level semantic patterns that are easily transferable. We show that this approach can greatly improve the query efficiency of black-box adversarial attack across different target network architectures. We evaluate our approach on MNIST, ImageNet and Google Cloud Vision API, resulting in a significant reduction on the number of queries. We also attack adversarially defended networks on CIFAR10 and ImageNet, where our method not only reduces the number of queries, but also improves the attack success rate.",/pdf/beaabb5f70003b94e2d05e8fa1a90859cd962b3c.pdf,ICLR,2020,"We present a new method that combines transfer-based and scored black-box adversarial attack, improving the success rate and query efficiency of black-box adversarial attack across different network architectures." +OGg9XnKxFAH,aicLVslxPqK,1601310000000.0,1615990000000.0,3498,Training independent subnetworks for robust prediction,"[""~Marton_Havasi1"", ""~Rodolphe_Jenatton3"", ""~Stanislav_Fort1"", ""~Jeremiah_Zhe_Liu1"", ""~Jasper_Snoek1"", ""~Balaji_Lakshminarayanan1"", ""~Andrew_Mingbo_Dai1"", ""~Dustin_Tran1""]","[""Marton Havasi"", ""Rodolphe Jenatton"", ""Stanislav Fort"", ""Jeremiah Zhe Liu"", ""Jasper Snoek"", ""Balaji Lakshminarayanan"", ""Andrew Mingbo Dai"", ""Dustin Tran""]","[""Efficient ensembles"", ""robustness""]","Recent approaches to efficiently ensemble neural networks have shown that strong robustness and uncertainty performance can be achieved with a negligible gain in parameters over the original network. 
However, these methods still require multiple forward passes for prediction, leading to a significant runtime cost. In this work, we show a surprising result: +the benefits of using multiple predictions can be achieved 'for free' under a single model's forward pass. In particular, we show that, using a multi-input multi-output (MIMO) configuration, one can utilize a single model's capacity to train multiple subnetworks that independently learn the task at hand. By ensembling the predictions made by the subnetworks, we improve model robustness without increasing compute. We observe a significant improvement in negative log-likelihood, accuracy, and calibration error on CIFAR10, CIFAR100, ImageNet, and their out-of-distribution variants compared to previous methods.",/pdf/685ca0ecb292949cd9ede19a51143b8505e9b854.pdf,ICLR,2021,"We show that a deep neural network can be trained to give multiple independent predictions simultaneously, which results in a computationally efficient ensemble model." +HklCk1BtwS,Sye1bljdDS,1569440000000.0,1577170000000.0,1488,Word embedding re-examined: is the symmetrical factorization optimal?,"[""zchan@se.cuhk.edu.hk"", ""lijia@se.cuhk.edu.hk"", ""xuli@se.cuhk.edu.hk"", ""hcheng@se.cuhk.edu.hk""]","[""Zhichao Han"", ""Jia Li"", ""Xu Li"", ""Hong Cheng""]","[""word embedding"", ""matrix factorization"", ""linear transformation"", ""neighborhood structure""]","As observed in previous works, many word embedding methods exhibit two interesting properties: (1) words having similar semantic meanings are embedded closely; (2) analogy structure exists in the embedding space, such that ''\emph{Paris} is to \emph{France} as \emph{Berlin} is to \emph{Germany}''. We theoretically analyze the inner mechanism leading to these nice properties. Specifically, the embedding can be viewed as a linear transformation from the word-context co-occurrence space to the embedding space. 
We reveal how the relative distances between nodes change during this transforming process. Such linear transformation will result in these good properties. Based on the analysis, we also provide the answer to a question whether the symmetrical factorization (e.g., \texttt{word2vec}) is better than traditional SVD method. We propose a method to improve the embedding further. The experiments on real datasets verify our analysis.",/pdf/111be2881133225eec43e73a9b954cf31454fb8d.pdf,ICLR,2020, +a3wKPZpGtCF,Y6b_wWpkns7,1601310000000.0,1615840000000.0,2387,Chaos of Learning Beyond Zero-sum and Coordination via Game Decompositions,"[""~Yun_Kuen_Cheung1"", ""yt851@nyu.edu""]","[""Yun Kuen Cheung"", ""Yixin Tao""]","[""Learning in Games"", ""Lyapunov Chaos"", ""Game Decomposition"", ""Multiplicative Weights Update"", ""Follow-the-Regularized-Leader"", ""Volume Analysis"", ""Dynamical Systems""]","It is of primary interest for ML to understand how agents learn and interact dynamically in competitive environments and games (e.g. GANs). But this has been a difficult task, as irregular behaviors are commonly observed in such systems. This can be explained theoretically, for instance, by the works of Cheung and Piliouras (COLT 2019; NeurIPS 2020), which showed that in two-person zero-sum games, if agents employ one of the most well-known learning algorithms, Multiplicative Weights Update (MWU), then Lyapunov chaos occurs everywhere in the payoff space. In this paper, we study how persistent chaos can occur in the more general normal game settings, where the agents might have the motivation to coordinate (which is not true for zero-sum games) and the number of agents can be arbitrary. + +We characterize bimatrix games where MWU, its optimistic variant (OMWU) or Follow-the-Regularized-Leader (FTRL) algorithms are Lyapunov chaotic almost everywhere in the payoff space. 
Technically, our characterization is derived by extending the volume-expansion argument of Cheung and Piliouras via the canonical game decomposition into zero-sum and coordination components. Interestingly, the two components induce opposite volume-changing behaviors, so the overall behavior can be analyzed by comparing the strengths of the components against each other. The comparison is done via our new notion of ""matrix domination"" or via a linear program. For multi-player games, we present a local equivalence of volume change between general games and graphical games, which is used to perform volume and chaos analyses of MWU and OMWU in potential games.",/pdf/eb21a8cbc05cb76a2135d38e48ebb0d0192bb4d5.pdf,ICLR,2021,We characterize games in which popular learning algorithms exhibit Lyapunov chaos. +EGVxmJKLC2L,KpDKUS0ZSS,1601310000000.0,1614990000000.0,3102,Learning not to learn: Nature versus nurture in silico,"[""~Robert_Tjarko_Lange1"", ""h.sprekeler@tu-berlin.de""]","[""Robert Tjarko Lange"", ""Henning Sprekeler""]","[""Meta-Learning"", ""Reinforcement Learning""]","Animals are equipped with a rich innate repertoire of sensory, behavioral and motor skills, which allows them to interact with the world immediately after birth. At the same time, many behaviors are highly adaptive and can be tailored to specific environments by means of learning. In this work, we use mathematical analysis and the framework of meta-learning (or 'learning to learn') to answer when it is beneficial to learn such an adaptive strategy and when to hard-code a heuristic behavior. We find that the interplay of ecological uncertainty, task complexity and the agents' lifetime has crucial effects on the meta-learned amortized Bayesian inference performed by an agent. There exist two regimes: One in which meta-learning yields a learning algorithm that implements task-dependent information-integration and a second regime in which meta-learning imprints a heuristic or 'hard-coded' behavior. 
Further analysis reveals that non-adaptive behaviors are not only optimal for aspects of the environment that are stable across individuals, but also in situations where an adaptation to the environment would in fact be highly beneficial, but could not be done quickly enough to be exploited within the remaining lifetime. Hard-coded behaviors should hence not only be those that always work, but also those that are too complex to be learned within a reasonable time frame.",/pdf/2cd9a294b23ab1b044e9e2a1195238b4e2f3ec0b.pdf,ICLR,2021,We show that meta-learning provides an answer to when it is beneficial to learn an adaptive strategy & when to hard-code a heuristic behavior. +pzpytjk3Xb2,SZi2MP9DVF,1601310000000.0,1616230000000.0,102,Policy-Driven Attack: Learning to Query for Hard-label Black-box Adversarial Examples,"[""~Ziang_Yan1"", ""~Yiwen_Guo1"", ""~Jian_Liang3"", ""~Changshui_Zhang1""]","[""Ziang Yan"", ""Yiwen Guo"", ""Jian Liang"", ""Changshui Zhang""]","[""hard-label attack"", ""black-box attack"", ""adversarial attack"", ""reinforcement learning""]","To craft black-box adversarial examples, adversaries need to query the victim model and take proper advantage of its feedback. Existing black-box attacks generally suffer from high query complexity, especially when only the top-1 decision (i.e., the hard-label prediction) of the victim model is available. In this paper, we propose a novel hard-label black-box attack named Policy-Driven Attack, to reduce the query complexity. Our core idea is to learn promising search directions of the adversarial examples using a well-designed policy network in a novel reinforcement learning formulation, in which the queries become more sensible. Experimental results demonstrate that our method can significantly reduce the query complexity in comparison with existing state-of-the-art hard-label black-box attacks on various image classification benchmark datasets. 
Code and models for reproducing our results are available at https://github.com/ZiangYan/pda.pytorch",/pdf/7095c4811250b0fb464739f87a3151931d53f718.pdf,ICLR,2021,A novel hard-label black-box adversarial attack that introduces a reinforcement learning based formulation with a pre-trained policy network +QfEssgaXpm,XMXeFGaX0a1,1601310000000.0,1614990000000.0,3193,Reinforcement Learning for Control with Probabilistic Stability Guarantee,"[""~Minghao_Han2"", ""~Zhipeng_Zhou3"", ""lixianzhang@hit.edu.cn"", ""~Jun_Wang2"", ""~Wei_Pan2""]","[""Minghao Han"", ""Zhipeng Zhou"", ""Lixian Zhang"", ""Jun Wang"", ""Wei Pan""]","[""control"", ""Lyapunov stability"", ""REINFORCE"", ""finite-sample bounds""]","Reinforcement learning is promising to control dynamical systems for which the traditional control methods are hardly applicable. However, in control theory, the stability of a closed-loop system can be hardly guaranteed using the policy/controller learned solely from samples. In this paper, we will combine Lyapunov's method in control theory and stochastic analysis to analyze the mean square stability of MDP in a model-free manner. Furthermore, the finite sample bounds on the probability of stability are derived as a function of the number M and length T of the sampled trajectories. And we show that there is a lower bound on T and the probability is much more demanding for M than T. Based on the theoretical results, a REINFORCE like algorithm is proposed to learn the controller and the Lyapunov function simultaneously. ",/pdf/8864e380755262c5e55267e7f10e9bff5e4d817f.pdf,ICLR,2021,Sample-based stability condition and the associated finite sample bound for reinforcement learning control. 
+SJgs8TVtvr,SylPGTKwwH,1569440000000.0,1577170000000.0,573,Mixture-of-Experts Variational Autoencoder for clustering and generating from similarity-based representations,"[""akopf@ethz.ch"", ""fortuin@inf.ethz.ch"", ""vsomnath@student.ethz.ch"", ""mclaassen@ethz.ch""]","[""Andreas Kopf"", ""Vincent Fortuin"", ""Vignesh Ram Somnath"", ""Manfred Claassen""]","[""Variational Autoencoder"", ""Clustering"", ""Generative model""]","Clustering high-dimensional data, such as images or biological measurements, is a long-standing problem and has been studied extensively. Recently, Deep Clustering gained popularity due to the non-linearity of neural networks, which allows for flexibility in fitting the specific peculiarities of complex data. Here we introduce the Mixture-of-Experts Similarity Variational Autoencoder (MoE-Sim-VAE), a novel generative clustering model. The model can learn multi-modal distributions of high-dimensional data and use these to generate realistic data with high efficacy and efficiency. MoE-Sim-VAE is based on a Variational Autoencoder (VAE), where the decoder consists of a Mixture-of-Experts (MoE) architecture. This specific architecture allows for various modes of the data to be automatically learned by means of the experts. Additionally, we encourage the latent representation of our model to follow a Gaussian mixture distribution and to accurately represent the similarities between the data points. We assess the performance of our model on synthetic data, the MNIST benchmark data set, and a challenging real-world task of defining cell subpopulations from mass cytometry (CyTOF) measurements on hundreds of different datasets. 
MoE-Sim-VAE exhibits superior clustering performance on all these tasks in comparison to the baselines and we show that the MoE architecture in the decoder reduces the computational cost of sampling specific data modes with high fidelity.",/pdf/e9f65742aa83266dd94be945c0538a040ac39237.pdf,ICLR,2020, +SJgn464tPB,HkljjlVDvB,1569440000000.0,1577170000000.0,500,Stabilizing Off-Policy Reinforcement Learning with Conservative Policy Gradients,"[""chen.tessler@gmail.com"", ""merlis.nadav@gmail.com"", ""shiemannor@gmail.com""]","[""Chen Tessler"", ""Nadav Merlis"", ""Shie Mannor""]","[""Deep Reinforcement Learning"", ""Variance Reduction"", ""Policy Gradient""]","In recent years, advances in deep learning have enabled the application of reinforcement learning algorithms in complex domains. However, they lack the theoretical guarantees which are present in the tabular setting and suffer from many stability and reproducibility problems \citep{henderson2018deep}. In this work, we suggest a simple approach for improving stability and providing probabilistic performance guarantees in off-policy actor-critic deep reinforcement learning regimes. Experiments on continuous action spaces, in the MuJoCo control suite, show that our proposed method reduces the variance of the process and improves the overall performance.",/pdf/19fcc2b4faa26f027c9b98f2f768ca8d825b816e.pdf,ICLR,2020,"We propose a conservative update rule for off-policy policy-gradient methods (e.g., DDPG) in order to reduce the variance of the training regime." 
+lEZIPgMIB1,gT6VnnpCjX,1601310000000.0,1614990000000.0,669,Parametric UMAP: learning embeddings with deep neural networks for representation and semi-supervised learning,"[""~Tim_Sainburg1"", ""leland.mcinnes@gmail.com"", ""tgenter@ucsd.edu""]","[""Tim Sainburg"", ""Leland McInnes"", ""Timothy Q Gentner""]","[""unsupervised learning"", ""representation learning"", ""dimensionality reduction"", ""UMAP"", ""semi-supervised learning""]","We propose Parametric UMAP, a parametric variation of the UMAP (Uniform Manifold Approximation and Projection) algorithm. UMAP is a non-parametric graph-based dimensionality reduction algorithm using applied Riemannian geometry and algebraic topology to find low-dimensional embeddings of structured data. The UMAP algorithm consists of two steps: (1) Compute a graphical representation of a dataset (fuzzy simplicial complex), and (2) Through stochastic gradient descent, optimize a low-dimensional embedding of the graph. Here, we replace the second step of UMAP with a deep neural network that learns a parametric relationship between data and embedding. We demonstrate that our method performs similarly to its non-parametric counterpart while conferring the benefit of a learned parametric mapping (e.g. fast online embeddings for new data). We then show that UMAP loss can be extended to arbitrary deep learning applications, for example constraining the latent distribution of autoencoders, and improving classifier accuracy for semi-supervised learning by capturing structure in unlabeled data.",/pdf/71f51dddc6f5f5c19777968e0c40e77e1a9e1ec9.pdf,ICLR,2021,We propose a parametric variant of UMAP and applications in representation and semi-supervised learning. 
+uHjLW-0tsCu,QF0QtGPtTQ,1601310000000.0,1614990000000.0,1548,Exploring the Potential of Low-Bit Training of Convolutional Neural Networks,"[""~Kai_Zhong2"", ""~Xuefei_Ning1"", ""~Tianchen_Zhao2"", ""zhuzhenh18@mails.tsinghua.edu.cn"", ""zengsl18@mails.tsinghua.edu.cn"", ""daiguohao@mail.tsinghua.edu.cn"", ""~Yu_Wang3"", ""~Huazhong_Yang1""]","[""Kai Zhong"", ""Xuefei Ning"", ""Tianchen Zhao"", ""Zhenhua Zhu"", ""Shulin Zeng"", ""Guohao Dai"", ""Yu Wang"", ""Huazhong Yang""]","[""CNN"", ""training"", ""quantization"", ""low-bit"", ""energy efficiency""]","In this paper, we propose a low-bit training framework for convolutional neural networks. Our framework focuses on reducing the energy and time consumption of convolution kernels, by quantizing all the convolutional operands (activation, weight, and error) to low bit-width. Specifically, we propose a multi-level scaling (MLS) tensor format, in which the element-wise bit-width can be largely reduced to simplify floating-point computations to nearly fixed-point. Then, we describe the dynamic quantization and the low-bit tensor convolution arithmetic to efficiently leverage the MLS tensor format. Experiments show that our framework achieves a superior trade-off between the accuracy and the bit-width than previous methods. When training ResNet-20 on CIFAR-10, +all convolution operands can be quantized to 1-bit mantissa and 2-bit exponent, while retaining the same accuracy as the full-precision training. When training ResNet-18 on ImageNet, with 4-bit mantissa and 2-bit exponent, our framework can achieve an accuracy loss of less than $1\%$. 
Energy consumption analysis shows that our design can achieve over $6.8\times$ higher energy efficiency than training with floating-point arithmetic.",/pdf/232ca01895ccb6eef5513355a115b6f34ed838a7.pdf,ICLR,2021,"We propose a low-bit training framework with multi-level scaling tensor format, so that the data bit-width for all the convolution inputs in training can be reduced, and the energy efficiency can be improved." +BylDrRNKvH,rklMu38_wH,1569440000000.0,1606270000000.0,1115,Understanding Attention Mechanisms,"[""bul37@psu.edu"", ""yogesh@cs.umd.edu"", ""lzxue@psu.edu"", ""renqiang@nec-labs.com""]","[""Bingyuan Liu"", ""Yogesh Balaji"", ""Lingzhou Xue"", ""Martin Renqiang Min""]","[""Attention"", ""deep learning"", ""sample complexity"", ""self-attention""]","Attention mechanisms have advanced the state of the art in several machine learning tasks. Despite significant empirical gains, there is a lack of theoretical analyses on understanding their effectiveness. In this paper, we address this problem by studying the landscape of population and empirical loss functions of attention-based neural networks. Our results show that, under mild assumptions, every local minimum of a two-layer global attention model has low prediction error, and attention models require lower sample complexity than models not employing attention. We then extend our analyses to the popular self-attention model, proving that they deliver consistent predictions with a more expressive class of functions. Additionally, our theoretical results provide several guidelines for designing attention mechanisms. Our findings are validated with satisfactory experimental results on MNIST and IMDB reviews dataset.",/pdf/20e84e9ff653f2699ae0245a268acdb913da1cef.pdf,ICLR,2020,We analyze the loss landscape of neural networks with attention and explain why attention is helpful in training neural networks to achieve good performance. 
+DiQD7FWL233,Amn3RXrQom,1601310000000.0,1613450000000.0,2560,Improving Relational Regularized Autoencoders with Spherical Sliced Fused Gromov Wasserstein,"[""~Khai_Nguyen1"", ""v.sonnv27@vinai.io"", ""~Nhat_Ho1"", ""v.tungph4@vinai.io"", ""~Hung_Bui1""]","[""Khai Nguyen"", ""Son Nguyen"", ""Nhat Ho"", ""Tung Pham"", ""Hung Bui""]","[""Relational regularized autoencoder"", ""deep generative model"", ""sliced fused Gromov Wasserstein"", ""spherical distributions""]","Relational regularized autoencoder (RAE) is a framework to learn the distribution of data by minimizing a reconstruction loss together with a relational regularization on the prior of latent space. A recent attempt to reduce the inner discrepancy between the prior and aggregated posterior distributions is to incorporate sliced fused Gromov-Wasserstein (SFG) between these distributions. That approach has a weakness since it treats every slicing direction similarly, meanwhile several directions are not useful for the discriminative task. To improve the discrepancy and consequently the relational regularization, we propose a new relational discrepancy, named spherical sliced fused Gromov Wasserstein (SSFG), that can find an important area of projections characterized by a von Mises-Fisher distribution. Then, we introduce two variants of SSFG to improve its performance. The first variant, named mixture spherical sliced fused Gromov Wasserstein (MSSFG), replaces the vMF distribution by a mixture of von Mises-Fisher distributions to capture multiple important areas of directions that are far from each other. The second variant, named power spherical sliced fused Gromov Wasserstein (PSSFG), replaces the vMF distribution by a power spherical distribution to improve the sampling time of the vMF distribution in high dimension settings. We then apply the new discrepancies to the RAE framework to achieve its new variants. 
Finally, we conduct extensive experiments to show that the new autoencoders have favorable performance in learning latent manifold structure, image generation, and reconstruction.",/pdf/d9cff7f7264e7a99f687eeb02f8e8bcf415be338.pdf,ICLR,2021,Improving relational regularized autoencoder by introducing new sliced optimal transport discrepancies between the prior and aggregated posterior distributions. +H5B3lmpO1g,#NAME?,1601310000000.0,1614990000000.0,1342,Goal-Auxiliary Actor-Critic for 6D Robotic Grasping with Point Clouds,"[""~Lirui_Wang1"", ""~Yu_Xiang3"", ""~Dieter_Fox1""]","[""Lirui Wang"", ""Yu Xiang"", ""Dieter Fox""]","[""Robotics"", ""Reinforcement Learning"", ""Learning from Demonstration""]","6D robotic grasping beyond top-down bin-picking scenarios is a challenging task. Previous solutions based on 6D grasp synthesis with robot motion planning usually operate in an open-loop setting without considering perception feedback and dynamics and contacts of objects, which makes them sensitive to grasp synthesis errors. In this work, we propose a novel method for learning closed-loop control policies for 6D robotic grasping using point clouds from an egocentric camera. We combine imitation learning and reinforcement learning in order to grasp unseen objects and handle the continuous 6D action space, where expert demonstrations are obtained from a joint motion and grasp planner. We introduce a goal-auxiliary actor-critic algorithm, which uses grasping goal prediction as an auxiliary task to facilitate policy learning. The supervision on grasping goals can be obtained from the expert planner for known objects or from hindsight goals for unknown objects. Overall, our learned closed-loop policy achieves over $90\%$ success rates on grasping various ShapeNet objects and YCB objects in simulation. 
The policy also transfers well to the real world with only one failure among grasping of ten different unseen objects in the presence of perception noises.",/pdf/69fd6d047da1731cc93c91ed42fc0b8c351c6b85.pdf,ICLR,2021,"We propose to augment reinforcement learning with demonstrations and goal auxiliary tasks, for learning closed-loop control policies in 6D robotic grasping using point clouds, which achieves over 90% success rates on grasping various unseen objects." +O7ms4LFdsX,a9o6cX6MBjc,1601310000000.0,1615920000000.0,2714,Disentangled Recurrent Wasserstein Autoencoder ,"[""~Jun_Han4"", ""~Martin_Renqiang_Min1"", ""~Ligong_Han1"", ""~Li_Erran_Li1"", ""~Xuan_Zhang3""]","[""Jun Han"", ""Martin Renqiang Min"", ""Ligong Han"", ""Li Erran Li"", ""Xuan Zhang""]","[""Sequential Representation Learning"", ""Disentanglement"", ""Recurrent Generative Model""]","Learning disentangled representations leads to interpretable models and facilitates data generation with style transfer, which has been extensively studied on static data such as images in an unsupervised learning framework. However, only a few works have explored unsupervised disentangled sequential representation learning due to challenges of generating sequential data. In this paper, we propose recurrent Wasserstein Autoencoder (R-WAE), a new framework for generative modeling of sequential data. R-WAE disentangles the representation of an input sequence into static and dynamic factors (i.e., time-invariant and time-varying parts). Our theoretical analysis shows that, R-WAE minimizes an upper bound of a penalized form of the Wasserstein distance between model distribution and sequential data distribution, and simultaneously maximizes the mutual information between input data and different disentangled latent factors, respectively. This is superior to (recurrent) VAE which does not explicitly enforce mutual information maximization between input data and disentangled latent representations. 
When the number of actions in sequential data is available as weak supervision information, R-WAE is extended to learn a categorical latent representation of actions to improve its disentanglement. Experiments on a variety of datasets show that our models outperform other baselines with the same settings in terms of disentanglement and unconditional video generation both quantitatively and qualitatively.",/pdf/e8de83dfa5c5849037a74c858c899119d1dc5f33.pdf,ICLR,2021,We propose the first recurrent Wasserstein Autoencoder for learning disentangled representations of sequential data with theoretical analysis. +r1geR1BKPr,Hyg15I1YDr,1569440000000.0,1577170000000.0,2011,MULTI-STAGE INFLUENCE FUNCTION,"[""chenhg@mit.edu"", ""sisidaisy@google.com"", ""liyang@google.com"", ""ciprianchelba@google.com"", ""sanjivk@google.com"", ""boning@mtl.mit.edu"", ""chohsieh@cs.ucla.edu""]","[""Hongge Chen"", ""Si Si"", ""Yang Li"", ""Ciprian Chelba"", ""Sanjiv Kumar"", ""Duane Boning"", ""Cho-Jui Hsieh""]","[""influence function"", ""multistage training"", ""pretrained model""]","Multi-stage training and knowledge transfer from a large-scale pretrain task to various fine-tune end tasks have revolutionized natural language processing (NLP) and computer vision (CV), with state-of-the-art performances constantly being improved. In this paper, we develop a multi-stage influence function score to track predictions from a finetune model all the way back to the pretrain data. With this score, we can identify the pretrain examples in the pretrain task that contribute most to a prediction in the fine-tune task. The proposed multi-stage influence function generalizes the original influence function for a single model in Koh et al 2017, thereby enabling influence computation through both pretrain and fine-tune models. 
We test our proposed method in various experiments to show its effectiveness and potential applications.",/pdf/3a3c3d235a327a9e8bc765edb22e233939040bbd.pdf,ICLR,2020,We proposed an influence function for multi-stage training +HyenWc5gx,,1478300000000.0,1483770000000.0,529,Representation Stability as a Regularizer for Improved Text Analytics Transfer Learning,"[""mdriemer@us.ibm.com"", ""ekhabiri@us.ibm.com"", ""rgoodwin@us.ibm.com""]","[""Matthew Riemer"", ""Elham Khabiri"", ""Richard Goodwin""]","[""Deep learning"", ""Transfer Learning"", ""Natural language processing""]","Although neural networks are well suited for sequential transfer learning tasks, the catastrophic forgetting problem hinders proper integration of prior knowledge. In this work, we propose a solution to this problem by using a multi-task objective based on the idea of distillation and a mechanism that directly penalizes forgetting at the shared representation layer during the knowledge integration phase of training. We demonstrate our approach on a Twitter domain sentiment analysis task with sequential knowledge transfer from four related tasks. We show that our technique outperforms networks fine-tuned to the target task. Additionally, we show both through empirical evidence and examples that it does not forget useful knowledge from the source task that is forgotten during standard fine-tuning. Surprisingly, we find that first distilling a human-made rule-based sentiment engine into a recurrent neural network and then integrating the knowledge with the target task data leads to a substantial gain in generalization performance. Our experiments demonstrate the power of multi-source transfer techniques in practical text analytics problems when paired with distillation. 
In particular, for the SemEval 2016 Task 4 Subtask A (Nakov et al., 2016) dataset we surpass the state of the art established during the competition with a comparatively simple model architecture that is not even competitive when trained on only the labeled task specific data.",/pdf/cf0ce483971ea7bb386bdb8027b9880e89625086.pdf,ICLR,2017,We propose a novel general purpose regularizer to address catastrophic forgetting in neural network sequential transfer learning. +Sy4lojC9tm,rkeBXKc5Y7,1538090000000.0,1545360000000.0,593,Dataset Distillation,"[""tongzhou.wang.1994@gmail.com"", ""junyanz@mit.edu"", ""torralba@mit.edu"", ""efros@eecs.berkeley.edu""]","[""Tongzhou Wang"", ""Jun-Yan Zhu"", ""Antonio Torralba"", ""Alexei A. Efros""]","[""knowledge distillation"", ""deep learning"", ""few-shot learning"", ""adversarial attack""]","Model distillation aims to distill the knowledge of a complex model into a simpler one. In this paper, we consider an alternative formulation called {\em dataset distillation}: we keep the model fixed and instead attempt to distill the knowledge from a large training dataset into a small one. The idea is to {\em synthesize} a small number of data points that do not need to come from the correct data distribution, but will, when given to the learning algorithm as training data, approximate the model trained on the original data. For example, we show that it is possible to compress $60,000$ MNIST training images into just $10$ synthetic {\em distilled images} (one per class) and achieve close to original performance with only a few steps of gradient descent, given a particular fixed network initialization. We evaluate our method in a wide range of initialization settings and with different learning objectives. Experiments on multiple datasets show the advantage of our approach compared to alternative methods in most settings. 
",/pdf/aa73c6e929567faba0258f4d0baaedc9931a57ad.pdf,ICLR,2019,"We propose to distill a large dataset into a small set of synthetic data , so networks can achieve close to original performance when trained on these data." +e6hMkY6MFcU,6LsfZerxGXk,1601310000000.0,1614990000000.0,3461,WordsWorth Scores for Attacking CNNs and LSTMs for Text Classification,"[""~Nimrah_Shakeel1""]","[""Nimrah Shakeel""]",[],"Black box attacks on traditional deep learning models trained for text classifica- tion target important words in a piece of text, in order to change model prediction. Current approaches towards highlighting important features are time consuming and require large number of model queries. We present a simple yet novel method to calculate word importance scores, based on model predictions on single words. These scores, which we call WordsWorth scores, need to be calculated only once for the training vocabulary. They can be used to speed up any attack method that requires word importance, with negligible loss of attack performance. We run ex- periments on a number of datasets trained on word-level CNNs and LSTMs, for sentiment analysis and topic classification and compare to state-of-the-art base- lines. Our results show the effectiveness of our method in attacking these models with success rates that are close to the original baselines. We argue that global importance scores act as a very good proxy for word importance in a local context because words are a highly informative form of data. This aligns with the manner in which humans interpret language, with individual words having well- defined meaning and powerful connotations. We further show that these scores can be used as a debugging tool to interpret a trained model by highlighting rele- vant words for each class. 
Additionally, we demonstrate the effect of overtraining on word importance, compare the robustness of CNNs and LSTMs, and explain the transferability of adversarial examples across a CNN and an LSTM using these scores. We highlight the fact that neural networks make highly informative predictions on single words.",/pdf/629508abfb5889d43d35b90093466cb163851e2e.pdf,ICLR,2021,An efficient method for computing word importance scores for CNNs and LSTMs +S1eEmn05tQ,SkxfTq65FX,1538090000000.0,1545360000000.0,1348,Uncertainty in Multitask Transfer Learning,"[""alex.lacoste.shmu@gmail.com"", ""boris@elementai.com"", ""wonchang@elementai.com"", ""thomas@elementai.com"", ""negar.rostamzadeh@gmail.com"", ""david.scott.krueger@gmail.com""]","[""Alexandre Lacoste"", ""Boris Oreshkin"", ""Wonchang Chung"", ""Thomas Boquet"", ""Negar Rostamzadeh"", ""David Krueger""]","[""Multi Task"", ""Transfer Learning"", ""Hierarchical Bayes"", ""Variational Bayes"", ""Meta Learning"", ""Few Shot learning""]","Using variational Bayes neural networks, we develop an algorithm capable of accumulating knowledge into a prior from multiple different tasks. This results in a rich prior capable of few-shot learning on new tasks. The posterior can go beyond the mean field approximation and yields good uncertainty on the performed experiments. Analysis on toy tasks show that it can learn from significantly different tasks while finding similarities among them. Experiments on Mini-Imagenet reach state of the art with 74.5% accuracy on 5 shot learning. Finally, we provide two new benchmarks, each showing a failure mode of existing meta learning algorithms such as MAML and prototypical Networks.",/pdf/3753e43fcf24f6bd6a8ceb9aeac24e1a148d5e64.pdf,ICLR,2019,A scalable method for learning an expressive prior over neural networks across multiple tasks. 
+S1lTg3RcFm,S1xzkZC9FQ,1538090000000.0,1545360000000.0,1121,Perception-Aware Point-Based Value Iteration for Partially Observable Markov Decision Processes,"[""mahsa.ghasemi@utexas.edu"", ""utopcu@utexas.edu""]","[""Mahsa Ghasemi"", ""Ufuk Topcu""]","[""partially observable Markov decision processes"", ""active perception"", ""submodular optimization"", ""point-based value iteration"", ""reinforcement learning""]","Partially observable Markov decision processes (POMDPs) are a widely-used framework to model decision-making with uncertainty about the environment and under stochastic outcome. In conventional POMDP models, the observations that the agent receives originate from fixed known distribution. However, in a variety of real-world scenarios the agent has an active role in its perception by selecting which observations to receive. Due to combinatorial nature of such selection process, it is computationally intractable to integrate the perception decision with the planning decision. To prevent such expansion of the action space, we propose a greedy strategy for observation selection that aims to minimize the uncertainty in state. +We develop a novel point-based value iteration algorithm that incorporates the greedy strategy to achieve near-optimal uncertainty reduction for sampled belief points. This in turn enables the solver to efficiently approximate the reachable subspace of belief simplex by essentially separating computations related to perception from planning. +Lastly, we implement the proposed solver and demonstrate its performance and computational advantage in a range of robotic scenarios where the robot simultaneously performs active perception and planning.",/pdf/9b6ff6e208b5dd8781bddaea26acc23ace7fd9bd.pdf,ICLR,2019,We develop a point-based value iteration solver for POMDPs with active perception and planning tasks. 
+gYbimGJAENn,xp2ai7QyPSD,1601310000000.0,1614990000000.0,2904,Powers of layers for image-to-image translation,"[""~Hugo_Touvron1"", ""~Matthijs_Douze1"", ""~Matthieu_Cord1"", ""~Herve_Jegou1""]","[""Hugo Touvron"", ""Matthijs Douze"", ""Matthieu Cord"", ""Herve Jegou""]",[],"We propose a simple architecture to address unpaired image-to-image translation tasks: style or class transfer, denoising, deblurring, deblocking, etc. +We start from an image autoencoder architecture with fixed weights. +For each task we learn a residual block operating in the latent space, which is iteratively called until the target domain is reached. +A specific training schedule is required to alleviate the exponentiation effect of the iterations. +At test time, it offers several advantages: the number of weight parameters is limited and the compositional design allows one to modulate the strength of the transformation with the number of iterations. +This is useful, for instance, when the type or amount of noise to suppress is not known in advance. +Experimentally, we show that the performance of our model is comparable or better than CycleGAN and Nice-GAN with fewer parameters.",/pdf/055075c122a5fdbeafc12c7792f595ac27a45b1c.pdf,ICLR,2021, +HkepKG-Rb,SkViYzbCZ,1509140000000.0,1518730000000.0,891,A Semantic Loss Function for Deep Learning with Symbolic Knowledge,"[""jixu@g.ucla.edu"", ""zhangzilu@pku.edu.cn"", ""tal@cs.ucla.edu"", ""yliang@cs.ucla.edu"", ""guyvdb@cs.ucla.edu""]","[""Jingyi Xu"", ""Zilu Zhang"", ""Tal Friedman"", ""Yitao Liang"", ""Guy Van den Broeck""]","[""deep learning"", ""symbolic knowledge"", ""semi-supervised learning"", ""constraints""]","This paper develops a novel methodology for using symbolic knowledge in deep learning. From first principles, we derive a semantic loss function that bridges between neural output vectors and logical constraints. This loss function captures how close the neural network is to satisfying the constraints on its output. 
An experimental evaluation shows that our semantic loss function effectively guides the learner to achieve (near-)state-of-the-art results on semi-supervised multi-class classification. Moreover, it significantly increases the ability of the neural network to predict structured objects, such as rankings and shortest paths. These discrete concepts are tremendously difficult to learn, and benefit from a tight integration of deep learning and symbolic reasoning methods.",/pdf/42b300f44989dce3f44c6d9911c6904d299cc42b.pdf,ICLR,2018, +PhV-qfEi3Mr,f6tDhk3CuqO,1601310000000.0,1614990000000.0,3029,Improving the accuracy of neural networks in analog computing-in-memory systems by a generalized quantization method,"[""~Lingjun_Dai1"", ""~Qingtian_Zhang1"", ""~Huaqiang_Wu1""]","[""Lingjun Dai"", ""Qingtian Zhang"", ""Huaqiang Wu""]","[""analog computing-in-memory"", ""quantization algorithm"", ""deep neural networks""]","Crossbar-enabled analog computing-in-memory (CACIM) systems can significantly improve the computation speed and energy efficiency of deep neural networks (DNNs). However, the transition of DNN from the digital systems to CACIM systems usually reduces its accuracy. The major issue is that the weights of DNN are stored and calculated directly on analog quantities in CACIM systems. The variation and programming overhead of the analog weight limit the precision. +Therefore, a suitable quantization algorithm is important when deploying a DNN into CACIM systems to obtain less accuracy loss. The analog weight has its unique advantages when doing quantization. Because there is no encoding and decoding process, the set of quanta will not affect the computing process. Therefore, a generalized quantization method that does not constrain the range of quanta and can obtain less quantization error will be effective in CACIM systems. 
For the first time, we introduced a generalized quantization method into CACIM systems and showed superior performance on a series of computer vision tasks, such as image classification, object detection, and semantic segmentation. Using the generalized quantization method, the DNN with 8-level analog weights can outperform the 32-bit networks. With fewer levels, the generalized quantization method can obtain less accuracy loss than other uniform quantization methods.",/pdf/1a42f56d9468ab746f77dac15cf9794c4e95cdfb.pdf,ICLR,2021,We improve the accuracy of neural networks in analog computing-in-memory systems by a generalized quantization method +SygD-hCcF7,SkxhYeAqKX,1538090000000.0,1556170000000.0,1181,Dimensionality Reduction for Representing the Knowledge of Probabilistic Models,"[""law@cs.toronto.edu"", ""jsnell@cs.toronto.edu"", ""farahmand@vectorinstitute.ai"", ""urtasun@cs.toronto.edu"", ""zemel@cs.toronto.edu""]","[""Marc T Law"", ""Jake Snell"", ""Amir-massoud Farahmand"", ""Raquel Urtasun"", ""Richard S Zemel""]","[""metric learning"", ""distance learning"", ""dimensionality reduction"", ""bound guarantees""]","Most deep learning models rely on expressive high-dimensional representations to achieve good performance on tasks such as classification. However, the high dimensionality of these representations makes them difficult to interpret and prone to over-fitting. We propose a simple, intuitive and scalable dimension reduction framework that takes into account the soft probabilistic interpretation of standard deep models for classification. When applying our framework to visualization, our representations more accurately reflect inter-class distances than standard visualization techniques such as t-SNE. We show experimentally that our framework improves generalization performance to unseen categories in zero-shot learning. 
We also provide a finite sample error upper bound guarantee for the method.",/pdf/181b02cd9b42cb6ee4d7f72afceaa5d58802227a.pdf,ICLR,2019,dimensionality reduction for cases where examples can be represented as soft probability distributions +rygixkHKDH,H1xKeIs_vS,1569440000000.0,1583910000000.0,1519,Geometric Analysis of Nonconvex Optimization Landscapes for Overcomplete Learning,"[""qingqu1006@gmail.com"", ""ysz@berkeley.edu"", ""xli@ee.cuhk.edu.hk"", ""yqz.zhang@gmail.com"", ""zzhu29@jhu.edu""]","[""Qing Qu"", ""Yuexiang Zhai"", ""Xiao Li"", ""Yuqian Zhang"", ""Zhihui Zhu""]","[""dictionary learning"", ""sparse representations"", ""nonconvex optimization""]","Learning overcomplete representations finds many applications in machine learning and data analytics. In the past decade, despite the empirical success of heuristic methods, theoretical understandings and explanations of these algorithms are still far from satisfactory. In this work, we provide new theoretical insights for several important representation learning problems: learning (i) sparsely used overcomplete dictionaries and (ii) convolutional dictionaries. We formulate these problems as $\ell^4$-norm optimization problems over the sphere and study the geometric properties of their nonconvex optimization landscapes. For both problems, we show the nonconvex objective has benign (global) geometric structures, which enable the development of efficient optimization methods finding the target solutions. Finally, our theoretical results are justified by numerical simulations. 
+",/pdf/319b5a598511c90365a7055bb855148c63303f98.pdf,ICLR,2020, +ryfz73C9KQ,SyxBGV9cKX,1538090000000.0,1545360000000.0,1339,Neural Predictive Belief Representations,"[""z.daniel.guo@gmail.com"", ""mazar@google.com"", ""piot@google.com"", ""bavilapires@google.com"", ""munos@google.com""]","[""Zhaohan Daniel Guo"", ""Mohammad Gheshlaghi Azar"", ""Bilal Piot"", ""Bernardo Avila Pires"", ""R\u00e9mi Munos""]","[""belief states"", ""representation learning"", ""contrastive predictive coding"", ""reinforcement learning"", ""predictive state representations"", ""deep reinforcement learning""]","Unsupervised representation learning has succeeded with excellent results in many applications. It is an especially powerful tool to learn a good representation of environments with partial or noisy observations. In partially observable domains it is important for the representation to encode a belief state---a sufficient statistic of the observations seen so far. In this paper, we investigate whether it is possible to learn such a belief representation using modern neural architectures. Specifically, we focus on one-step frame prediction and two variants of contrastive predictive coding (CPC) as the objective functions to learn the representations. To evaluate these learned representations, we test how well they can predict various pieces of information about the underlying state of the environment, e.g., position of the agent in a 3D maze. We show that all three methods are able to learn belief representations of the environment---they encode not only the state information, but also its uncertainty, a crucial aspect of belief states. We also find that for CPC multi-step predictions and action-conditioning are critical for accurate belief representations in visually complex environments. 
The ability of neural representations to capture the belief information has the potential to spur new advances for learning and planning in partially observable domains, where leveraging uncertainty is essential for optimal decision making.",/pdf/a5f6da99e93822f1add1608078362da082a5a915.pdf,ICLR,2019,We investigate the quality of belief state representations of partially observable dynamic environments learned with modern neural architectures. +ryxW804FPH,ryxkDALuwr,1569440000000.0,1577170000000.0,1128,ADAPTING PRETRAINED LANGUAGE MODELS FOR LONG DOCUMENT CLASSIFICATION,"[""olsomatt@oregonstate.edu"", ""lisa.zhang@nokia-bell-labs.com"", ""cnyu@cs.cornell.edu""]","[""Matthew Lyle Olson"", ""Lisa Zhang"", ""Chun-Nam Yu""]","[""NLP"", ""Deep Learning"", ""Language Models"", ""Long Document""]","Pretrained language models (LMs) have shown excellent results in achieving human-like performance on many language tasks. However, the most powerful LMs have one significant drawback: a fixed-sized input. With this constraint, these LMs are unable to utilize the full input of long documents. In this paper, we introduce a new framework to handle documents of arbitrary lengths. We investigate the addition of a recurrent mechanism to extend the input size and utilizing attention to identify the most discriminating segment of the input. We perform extensive validating experiments on patent and Arxiv datasets, both of which have long text. We demonstrate our method significantly outperforms state-of-the-art results reported in recent literature.",/pdf/8d9bc39c3bd1f7daaa26b461701474f8c76ff62b.pdf,ICLR,2020,We achieve state of the art results on long document classification by combining pretrained language models representations with attention. 
+r1f78iAcFm,Byx8nwy5FX,1538090000000.0,1545360000000.0,163,GRAPH TRANSFORMATION POLICY NETWORK FOR CHEMICAL REACTION PREDICTION,"[""dkdo@deakin.edu.au"", ""truyen.tran@deakin.edu.au"", ""svetha.venkatesh@deakin.edu.au""]","[""Kien Do"", ""Truyen Tran"", ""Svetha Venkatesh""]","[""Chemical Reaction"", ""Graph Transformation"", ""Reinforcement Learning""]","We address a fundamental problem in chemistry known as chemical reaction product prediction. Our main insight is that the input reactant and reagent molecules can be jointly represented as a graph, and the process of generating product molecules from reactant molecules can be formulated as a sequence of graph transformations. To this end, we propose Graph Transformation Policy Network (GTPN) - a novel generic method that combines the strengths of graph neural networks and reinforcement learning to learn the reactions directly from data with minimal chemical knowledge. Compared to previous methods, GTPN has some appealing properties such as: end-to-end learning, and making no assumption about the length or the order of graph transformations. In order to guide model search through the complex discrete space of sets of bond changes effectively, we extend the standard policy gradient loss by adding useful constraints. Evaluation results show that GTPN improves the top-1 accuracy over the current state-of-the-art method by about 3% on the large USPTO dataset. 
Our model's performances and prediction errors are also analyzed carefully in the paper.",/pdf/bca96bd756efaa2439c61334fd82c2d64720071d.pdf,ICLR,2019, +ZsZM-4iMQkH,vBZ-irmZXGS,1601310000000.0,1616540000000.0,3245,A unifying view on implicit bias in training linear neural networks,"[""~Chulhee_Yun1"", ""~Shankar_Krishnan1"", ""~Hossein_Mobahi2""]","[""Chulhee Yun"", ""Shankar Krishnan"", ""Hossein Mobahi""]","[""implicit bias"", ""implicit regularization"", ""convergence"", ""gradient flow"", ""gradient descent""]","We study the implicit bias of gradient flow (i.e., gradient descent with infinitesimal step size) on linear neural network training. We propose a tensor formulation of neural networks that includes fully-connected, diagonal, and convolutional networks as special cases, and investigate the linear version of the formulation called linear tensor networks. With this formulation, we can characterize the convergence direction of the network parameters as singular vectors of a tensor defined by the network. For $L$-layer linear tensor networks that are orthogonally decomposable, we show that gradient flow on separable classification finds a stationary point of the $\ell_{2/L}$ max-margin problem in a ""transformed"" input space defined by the network. For underdetermined regression, we prove that gradient flow finds a global minimum which minimizes a norm-like function that interpolates between weighted $\ell_1$ and $\ell_2$ norms in the transformed input space. Our theorems subsume existing results in the literature while removing standard convergence assumptions. We also provide experiments that corroborate our analysis.",/pdf/7592938b320208bd563349d1ea3385dd9e80cbe6.pdf,ICLR,2021,We propose a unifying framework for analyzing implicit bias of linear networks and show theorems that extend existing results with less convergence assumptions. 
+ZKyd0bkFmom,sfl3GOLm0Kc,1601310000000.0,1614990000000.0,1319,Parametric Copula-GP model for analyzing multidimensional neuronal and behavioral relationships,"[""~Nina_Kudryashova1"", ""t.amvrosiadis@ed.ac.uk"", ""nathalie.dupuy@ed.ac.uk"", ""n.rochefort@ed.ac.uk"", ""~Arno_Onken1""]","[""Nina Kudryashova"", ""Theoklitos Amvrosiadis"", ""Nathalie Dupuy"", ""Nathalie Rochefort"", ""Arno Onken""]","[""hierarchical vine copula model"", ""copula"", ""gaussian process"", ""mutual information"", ""neuroscience"", ""neuronal activity"", ""calcium imaging"", ""visual cortex""]","One of the main challenges in current systems neuroscience is the analysis of high-dimensional neuronal and behavioral data that are characterized by different statistics and timescales of the recorded variables. We propose a parametric copula model which separates the statistics of the individual variables from their dependence structure, and escapes the curse of dimensionality by using vine copula constructions. We use a Bayesian framework with Gaussian Process (GP) priors over copula parameters, conditioned on a continuous task-related variable. We improve the flexibility of this method by 1) using non-parametric conditional (rather than unconditional) marginals; 2) linearly mixing copula elements with qualitatively different tail dependencies. We validate the model on synthetic data and compare its performance in estimating mutual information against the commonly used non-parametric algorithms. Our model provides accurate information estimates when the dependencies in the data match the parametric copulas used in our framework. Moreover, even when the exact density estimation with a parametric model is not possible, our Copula-GP model is still able to provide reasonable information estimates, close to the ground truth and comparable to those obtained with a neural network estimator. Finally, we apply our framework to real neuronal and behavioral recordings obtained in awake mice. 
We demonstrate the ability of our framework to 1) produce accurate and interpretable bivariate models for the analysis of inter-neuronal noise correlations or behavioral modulations; 2) expand to more than 100 dimensions and measure information content in the whole-population statistics. These results demonstrate that the Copula-GP framework is particularly useful for the analysis of complex multidimensional relationships between neuronal, sensory and behavioral data.",/pdf/a997b8cafcdcbe9a4b2b285a56aeaf8c9c3809bb.pdf,ICLR,2021,"Copula-GP provides accurate information estimates (better or similar to MINE and with uncertainty estimates), scales to high dimensions and is invariant to homeomorphic transformations of marginals, making it suitable for application in neuroscience." +rkeT8iR9Y7,rygcOcXqKX,1538090000000.0,1545360000000.0,216,Directional Analysis of Stochastic Gradient Descent via von Mises-Fisher Distributions in Deep Learning,"[""bloodwass@kaist.ac.kr"", ""kyunghyun.cho@nyu.edu"", ""wanmo.kang@kaist.edu""]","[""Cheolhyoung Lee"", ""Kyunghyun Cho"", ""Wanmo Kang""]","[""directional statistics"", ""deep learning"", ""SNR"", ""gradient stochasticity"", ""SGD"", ""stochastic gradient"", ""von Mises-Fisher"", ""angle""]","Although stochastic gradient descent (SGD) is a driving force behind the recent success of deep learning, our understanding of its dynamics in a high-dimensional parameter space is limited. In recent years, some researchers have used the stochasticity of minibatch gradients, or the signal-to-noise ratio, to better characterize the learning dynamics of SGD. Inspired from these work, we here analyze SGD from a geometrical perspective by inspecting the stochasticity of the norms and directions of minibatch gradients. We propose a model of the directional concentration for minibatch gradients through von Mises-Fisher (VMF) distribution, and show that the directional uniformity of minibatch gradients increases over the course of SGD. 
We empirically verify our result using deep convolutional networks and observe a higher correlation between the gradient stochasticity and the proposed directional uniformity than that against the gradient norm stochasticity, suggesting that the directional statistics of minibatch gradients is a major factor behind SGD.",/pdf/6fc562d888ca77893585b7443df394b87b6a4c5c.pdf,ICLR,2019,One of theoretical issues in deep learning +JI2TGOehNT0,0qioept9x_,1601310000000.0,1614990000000.0,3288,Combining Imitation and Reinforcement Learning with Free Energy Principle,"[""~Ryoya_Ogishima1"", ""karino@isi.imi.i.u-tokyo.ac.jp"", ""~Yasuo_Kuniyoshi1""]","[""Ryoya Ogishima"", ""Izumi Karino"", ""Yasuo Kuniyoshi""]","[""Imitation"", ""Reinforcement Learning"", ""Free Energy Principle""]","Imitation Learning (IL) and Reinforcement Learning (RL) from high dimensional sensory inputs are often introduced as separate problems, but a more realistic problem setting is how to merge the techniques so that the agent can reduce exploration costs by partially imitating experts at the same time it maximizes its return. Even when the experts are suboptimal (e.g. Experts learned halfway with other RL methods or human-crafted experts), it is expected that the agent outperforms the suboptimal experts’ performance. In this paper, we propose to address the issue by using and theoretically extending Free Energy Principle, a unified brain theory that explains perception, action and model learning in a Bayesian probabilistic way. We find that both IL and RL can be achieved based on the same free energy objective function. 
Our results show that our approach is promising in visual control tasks especially with sparse-reward environments.",/pdf/cfcf91db3909637376a517eac03f5e0c2dc9488a.pdf,ICLR,2021,Extending Free Energy Principle to achieve imitation reinforcement learning for sparse reward problems with suboptimal experts +ryguP1BFwr,S1gZkT6uwS,1569440000000.0,1577170000000.0,1773,Walking the Tightrope: An Investigation of the Convolutional Autoencoder Bottleneck,"[""ilja.manakov@med.uni-muenchen.de"", ""markus.rohm@med.uni-muenchen.de"", ""volker.tresp@siemens.com""]","[""Ilja Manakov"", ""Markus Rohm"", ""Volker Tresp""]","[""convolutional autoencoder"", ""bottleneck"", ""representation learning""]","In this paper, we present an in-depth investigation of the convolutional autoencoder (CAE) bottleneck. +Autoencoders (AE), and especially their convolutional variants, play a vital role in the current deep learning toolbox. +Researchers and practitioners employ CAEs for a variety of tasks, ranging from outlier detection and compression to transfer and representation learning. +Despite their widespread adoption, we have limited insight into how the bottleneck shape impacts the emergent properties of the CAE. +We demonstrate that increased height and width of the bottleneck drastically improves generalization, which in turn leads to better performance of the latent codes in downstream transfer learning tasks. +The number of channels in the bottleneck, on the other hand, is secondary in importance. +Furthermore, we show empirically, that, contrary to popular belief, CAEs do not learn to copy their input, even when the bottleneck has the same number of neurons as there are pixels in the input. +Copying does not occur, despite training the CAE for 1,000 epochs on a tiny (~ 600 images) dataset. 
+We believe that the findings in this paper are directly applicable and will lead to improvements in models that rely on CAEs.",/pdf/bf844aab5ecb0066cee41aaa7640495f54f0d463.pdf,ICLR,2020,We conduct experiments on how the bottlneck in convolutional autoencoder influences their behavior and find that heigth/widht matters significantly more than number of channels and that complete CAEs do not learn to simply copy their input. +rJg8yhAqKm,r1lw89YqKX,1538090000000.0,1550860000000.0,989,InfoBot: Transfer and Exploration via the Information Bottleneck,"[""anirudhgoyal9119@gmail.com"", ""riashat.islam@mail.mcgill.ca"", ""danieljstrouse@gmail.com"", ""zafarali.ahmed@mail.mcgill.ca"", ""hugolarochelle@google.com"", ""botvinick@google.com"", ""svlevine@eecs.berkeley.edu"", ""yoshua.bengio@mila.quebec""]","[""Anirudh Goyal"", ""Riashat Islam"", ""DJ Strouse"", ""Zafarali Ahmed"", ""Hugo Larochelle"", ""Matthew Botvinick"", ""Yoshua Bengio"", ""Sergey Levine""]","[""Information bottleneck"", ""policy transfer"", ""policy generalization"", ""exploration""]","A central challenge in reinforcement learning is discovering effective policies for tasks where rewards are sparsely distributed. We postulate that in the absence of useful reward signals, an effective exploration strategy should seek out {\it decision states}. These states lie at critical junctions in the state space from where the agent can transition to new, potentially unexplored regions. We propose to learn about decision states from prior experience. By training a goal-conditioned model with an information bottleneck, we can identify decision states by examining where the model accesses the goal state through the bottleneck. We find that this simple mechanism effectively identifies decision states, even in partially observed settings. In effect, the model learns the sensory cues that correlate with potential subgoals. 
In new environments, this model can then identify novel subgoals for further exploration, guiding the agent through a sequence of potential decision states and through new regions of the state space.",/pdf/9bfd1fe1d796c1ebab86fec8129f320412daf1b4.pdf,ICLR,2019,Training agents with goal-policy information bottlenecks promotes transfer and yields a powerful exploration bonus +rkgHY0NYwr,BkeWimddDB,1569440000000.0,1583910000000.0,1246,Discovering Motor Programs by Recomposing Demonstrations,"[""tanmayshankar@fb.com"", ""shubhtuls@fb.com"", ""lerrel.pinto@gmail.com"", ""abhinavg@cs.cmu.edu""]","[""Tanmay Shankar"", ""Shubham Tulsiani"", ""Lerrel Pinto"", ""Abhinav Gupta""]","[""Learning from Demonstration"", ""Imitation Learning"", ""Motor Primitives""]","In this paper, we present an approach to learn recomposable motor primitives across large-scale and diverse manipulation demonstrations. Current approaches to decomposing demonstrations into primitives often assume manually defined primitives and bypass the difficulty of discovering these primitives. On the other hand, approaches in primitive discovery put restrictive assumptions on the complexity of a primitive, which limit applicability to narrow tasks. Our approach attempts to circumvent these challenges by jointly learning both the underlying motor primitives and recomposing these primitives to form the original demonstration. Through constraints on both the parsimony of primitive decomposition and the simplicity of a given primitive, we are able to learn a diverse set of motor primitives, as well as a coherent latent representation for these primitives. We demonstrate both qualitatively and quantitatively, that our learned primitives capture semantically meaningful aspects of a demonstration. This allows us to compose these primitives in a hierarchical reinforcement learning setup to efficiently solve robotic manipulation tasks like reaching and pushing. 
Our results may be viewed at https://sites.google.com/view/discovering-motor-programs. ",/pdf/dc85a396eac86f30ad058c584c9a3e4da898eb49.pdf,ICLR,2020,"We learn a space of motor primitives from unannotated robot demonstrations, and show these primitives are semantically meaningful and can be composed for new robot tasks." +Bi2OvVf1KPn,wDrgblMQodk,1601310000000.0,1614990000000.0,2072,Provable Robust Learning for Deep Neural Networks under Agnostic Corrupted Supervision,"[""~Boyang_Liu1"", ""~Mengying_Sun1"", ""wangdin1@msu.edu"", ""~Pang-Ning_Tan1"", ""~Jiayu_Zhou1""]","[""Boyang Liu"", ""Mengying Sun"", ""Ding Wang"", ""Pang-Ning Tan"", ""Jiayu Zhou""]","[""Noisy Label"", ""Corrupted Supervision"", ""Robustness"", ""Optimization""]","Training deep neural models in the presence of corrupted supervisions is challenging as the corrupted data points may significantly impact the generalization performance. To alleviate this problem, we present an efficient robust algorithm that achieves strong guarantees without any assumption on the type of corruption and provides a unified framework for both classification and regression problems. Different from many existing approaches that quantify the quality of individual data points (e.g., loss values) and filter out data points accordingly, the proposed algorithm focuses on controlling the collective impact of data points on the averaged gradient. Even when a corrupted data point failed to be excluded by the proposed algorithm, the data point will have very limited impacts on the overall loss, as compared with state-of-the-art filtering data points based on loss values. Extensive empirical results on multiple benchmark datasets have demonstrated the robustness of the proposed method under different types of corruption.",/pdf/4ddd5803f48e7fb183d32ed103d6a13482e440de.pdf,ICLR,2021,A provable robust algorithm to defense agnostic label corruptions. 
+wb3wxCObbRT,CLOuvU1IS-a,1601310000000.0,1616020000000.0,2209,Growing Efficient Deep Networks by Structured Continuous Sparsification,"[""~Xin_Yuan5"", ""~Pedro_Henrique_Pamplona_Savarese1"", ""~Michael_Maire1""]","[""Xin Yuan"", ""Pedro Henrique Pamplona Savarese"", ""Michael Maire""]","[""deep learning"", ""computer vision"", ""network pruning"", ""neural architecture search""]","We develop an approach to growing deep network architectures over the course of training, driven by a principled combination of accuracy and sparsity objectives. Unlike existing pruning or architecture search techniques that operate on full-sized models or supernet architectures, our method can start from a small, simple seed architecture and dynamically grow and prune both layers and filters. By combining a continuous relaxation of discrete network structure optimization with a scheme for sampling sparse subnetworks, we produce compact, pruned networks, while also drastically reducing the computational expense of training. For example, we achieve $49.7\%$ inference FLOPs and $47.4\%$ training FLOPs savings compared to a baseline ResNet-50 on ImageNet, while maintaining $75.2\%$ top-1 validation accuracy --- all without any dedicated fine-tuning stage. Experiments across CIFAR, ImageNet, PASCAL VOC, and Penn Treebank, with convolutional networks for image classification and semantic segmentation, and recurrent networks for language modeling, demonstrate that we both train faster and produce more efficient networks than competing architecture pruning or search methods.",/pdf/f0340388ab26e079bb52b2e75a594fa25f418c28.pdf,ICLR,2021,We propose an efficient training method that dynamically grows and prunes neural network architectures. 
+H1GLm2R9Km,S1giBny9YX,1538090000000.0,1545360000000.0,1364,Learning Backpropagation-Free Deep Architectures with Kernels,"[""michaelshiyu3@gmail.com"", ""yusjlcy9011@ufl.edu""]","[""Shiyu Duan"", ""Shujian Yu"", ""Yunmei Chen"", ""Jose Principe""]","[""supervised learning"", ""backpropagation-free deep architecture"", ""kernel method""]","One can substitute each neuron in any neural network with a kernel machine and obtain a counterpart powered by kernel machines. The new network inherits the expressive power and architecture of the original but works in a more intuitive way since each node enjoys the simple interpretation as a hyperplane (in a reproducing kernel Hilbert space). Further, using the kernel multilayer perceptron as an example, we prove that in classification, an optimal representation that minimizes the risk of the network can be characterized for each hidden layer. This result removes the need of backpropagation in learning the model and can be generalized to any feedforward kernel network. Moreover, unlike backpropagation, which turns models into black boxes, the optimal hidden representation enjoys an intuitive geometric interpretation, making the dynamics of learning in a deep kernel network simple to understand. Empirical results are provided to validate our theory.",/pdf/601582b9e7f842100c1dd371f8ef0d90e3263e97.pdf,ICLR,2019,We combine kernel method with connectionist models and show that the resulting deep architectures can be trained layer-wise and have more transparent learning dynamics. +rkxNelrKPB,BylVzpJFDB,1569440000000.0,1577170000000.0,2096,On Stochastic Sign Descent Methods,"[""mher.safaryan@gmail.com"", ""peter.richtarik@kaust.edu.sa""]","[""Mher Safaryan"", ""Peter Richt\u00e1rik""]","[""non-convex optimization"", ""stochastic optimization"", ""gradient compression""]","Various gradient compression schemes have been proposed to mitigate the communication cost in distributed training of large scale machine learning models. 
Sign-based methods, such as signSGD (Bernstein et al., 2018), have recently been gaining popularity because of their simple compression rule and connection to adaptive gradient methods, like ADAM. In this paper, we perform a general analysis of sign-based methods for non-convex optimization. Our analysis is built on intuitive bounds on success probabilities and does not rely on special noise distributions nor on the boundedness of the variance of stochastic gradients. Extending the theory to distributed setting within a parameter server framework, we assure exponentially fast variance reduction with respect to number of nodes, maintaining 1-bit compression in both directions and using small mini-batch sizes. We validate our theoretical findings experimentally.",/pdf/161d991ba94067b79f1e3c7588bb0b06a86b2c8d.pdf,ICLR,2020,"General analysis of sign-based methods (e.g. signSGD) for non-convex optimization, built on intuitive bounds on success probabilities." +ryeRwlSYPH,H1e4Nhltvr,1569440000000.0,1577170000000.0,2380,Learning transitional skills with intrinsic motivation,"[""11821087@zju.edu.cn"", ""liujinxin@westlake.edu.cn"", ""wangdonglin@westlake.edu.cn""]","[""Qiangxing Tian"", ""Jinxin Liu"", ""Donglin Wang""]",[],"By maximizing an information theoretic objective, a few recent methods empower the agent to explore the environment and learn useful skills without supervision. However, when considering to use multiple consecutive skills to complete a specific task, the transition from one to another cannot guarantee the success of the process due to the evident gap between skills. In this paper, we propose to learn transitional skills (LTS) in addition to creating diverse primitive skills without a reward function. By introducing an extra latent variable for transitional skills, our LTS method discovers both primitive and transitional skills by minimizing the difference of mutual information and the similarity of skills. 
By considering various simulated robotic tasks, our results demonstrate the effectiveness of LTS on learning both diverse primitive skills and transitional skills, and show its superiority in smooth transition of skills over the state-of-the-art baseline DIAYN.",/pdf/8f49ef7a87c7fccc0d63ea1d086b5c6d43b4eb45.pdf,ICLR,2020, +H1eA7AEtvS,SJx97xUODH,1569440000000.0,1583910000000.0,1057,ALBERT: A Lite BERT for Self-supervised Learning of Language Representations,"[""lanzhzh@google.com"", ""mchen@ttic.edu"", ""seabass@google.com"", ""kgimpel@ttic.edu"", ""piyushsharma@google.com"", ""rsoricut@google.com""]","[""Zhenzhong Lan"", ""Mingda Chen"", ""Sebastian Goodman"", ""Kevin Gimpel"", ""Piyush Sharma"", ""Radu Soricut""]","[""Natural Language Processing"", ""BERT"", ""Representation Learning""]","Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations and longer training times. To address these problems, we present two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT~\citep{devlin2018bert}. Comprehensive empirical evidence shows that our proposed methods lead to models that scale much better compared to the original BERT. We also use a self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and \squad benchmarks while having fewer parameters compared to BERT-large. 
The code and the pretrained models are available at https://github.com/google-research/ALBERT.",/pdf/ce1860d9372b46ab2700549349420cc19125d478.pdf,ICLR,2020,"A new pretraining method that establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large. " +Au1gNqq4brw,QHNPJTJA5B-,1601310000000.0,1614990000000.0,2874,SEQUENCE-LEVEL FEATURES: HOW GRU AND LSTM CELLS CAPTURE N-GRAMS,"[""~Xiaobing_Sun1"", ""luwei@sutd.edu.sg""]","[""Xiaobing Sun"", ""Wei Lu""]","[""GRU"", ""LSTM"", ""Sequence-level"", ""Features"", ""N-grams""]","Modern recurrent neural networks (RNN) such as Gated Recurrent Units (GRU) and Long Short-term Memory (LSTM) have demonstrated impressive results on tasks involving sequential data in practice. Despite continuous efforts on interpreting their behaviors, the exact mechanism underlying their successes in capturing sequence-level information have not been thoroughly understood. In this work, we present a study on understanding the essential features captured by GRU/LSTM cells by mathematically expanding and unrolling the hidden states. Based on the expanded and unrolled hidden states, we find there was a type of sequence-level representations brought in by the gating mechanism, which enables the cells to encode sequence-level features along with token-level features. Specifically, we show that the cells would consist of such sequence-level features similar to those of N-grams. Based on such a finding, we also found that replacing the hidden states of the standard cells with N-gram representations does not necessarily degrade performance on the sentiment analysis and language modeling tasks, indicating such features may play a significant role for GRU/LSTM cells.",/pdf/952db6c5f6f9c2dc9ce7595510f7b68997645925.pdf,ICLR,2021,We found that the impressive performances of GRU or LSTM cells might be attributed to sequence-level representations brought in by the gating mechanism. 
+hypDstHla7,vpT6K1ArQj,1601310000000.0,1614990000000.0,3484,Neuron Activation Analysis for Multi-Joint Robot Reinforcement Learning,"[""~Benedikt_Feldotto1"", ""heiko.lengenfelder@tum.de"", ""~Alois_Knoll1""]","[""Benedikt Feldotto"", ""Heiko Lengenfelder"", ""Alois Knoll""]","[""Reinforcement Learning"", ""Machine Learning"", ""Robot Motion Learning"", ""DQN"", ""Robot Manipulator"", ""Target Reaching"", ""Network Pruning""]","Recent experiments indicate that pre-training of end-to-end Reinforcement Learning neural networks on general tasks can speed up the training process for specific robotic applications. However, it remains open if these networks form general feature extractors and a hierarchical organization that are reused as apparent e.g. in Convolutional Neural Networks. In this paper we analyze the intrinsic neuron activation in networks trained for target reaching of robot manipulators with increasing joint number in a vertical plane. We analyze the individual neuron activity distribution in the network, introduce a pruning algorithm to reduce network size keeping the performance, and with these dense network representations we spot correlations of neuron activity patterns among networks trained for robot manipulators with different joint number. We show that the input and output network layers have more distinct neuron activation in contrast to inner layers. Our pruning algorithm reduces the network size significantly, increases the distance of neuron activation while keeping a high performance in training and evaluation. Our results demonstrate that neuron activity can be mapped among networks trained for robots with different complexity. 
Hereby, robots with small joint difference show higher layer-wise projection accuracy whereas more different robots mostly show projections to the first layer.",/pdf/dfa95930949d84e2d82cd610b6ad04688cd32bd6.pdf,ICLR,2021,"We analyze the neuron activation in neural networks trained for robot target reaching with Reinforcement Learning, prune the network and highlight correlations between networks trained for robots with different joint count." +B1liraVYwr,HylorlDvwB,1569440000000.0,1577170000000.0,536,LocalGAN: Modeling Local Distributions for Adversarial Response Generation,"[""xuzhenhit@gmail.com"", ""baoxun.wang@gmail.com"", ""zhanghuan123@pku.edu.cn"", ""kq2131@columbia.edu"", ""dyzhang@sau.edu.cn"", ""cjsun@insun.hit.edu.cn""]","[""Zhen Xu"", ""Baoxun Wang"", ""Huan Zhang"", ""Kexin Qiu"", ""Deyuan Zhang"", ""Chengjie Sun""]","[""neural response generation"", ""adversarial learning"", ""local distribution"", ""energy-based distribution modeling""]","This paper presents a new methodology for modeling the local semantic distribution of responses to a given query in the human-conversation corpus, and on this basis, explores a specified adversarial learning mechanism for training Neural Response Generation (NRG) models to build conversational agents. The proposed mechanism aims to address the training instability problem and improve the quality of generated results of Generative Adversarial Nets (GAN) in their utilizations in the response generation scenario. Our investigation begins with the thorough discussions upon the objective function brought by general GAN architectures to NRG models, and the training instability problem is proved to be ascribed to the special local distributions of conversational corpora. Consequently, an energy function is employed to estimate the status of a local area restricted by the query and its responses in the semantic space, and the mathematical approximation of this energy-based distribution is finally found. 
Building on this foundation, a local distribution oriented objective is proposed and combined with the original objective, working as a hybrid loss for the adversarial training of response generation models, named as LocalGAN. Our experimental results demonstrate that the reasonable local distribution modeling of the query-response corpus is of great importance to adversarial NRG, and our proposed LocalGAN is promising for improving both the training stability and the quality of generated results. +",/pdf/892e7cb4ebd375225e114fc6bd0add0007337503.pdf,ICLR,2020,A study on leveraging the local distribution of query-response pairs to adversarial response generation. +BJ4AFsRcFQ,BJefIdoqY7,1538090000000.0,1545360000000.0,495,Total Style Transfer with a Single Feed-Forward Network,"[""tyui592@ynu.ac.kr"", ""pogary@ynu.ac.kr""]","[""Minseong Kim"", ""Hyun-Chul Choi""]","[""Image Style Transfer"", ""Deep Learning"", ""Neural Network""]","Recent image style transferring methods achieved arbitrary stylization with input content and style images. To transfer the style of an arbitrary image to a content image, these methods used a feed-forward network with a lowest-scaled feature transformer or a cascade of the networks with a feature transformer of a corresponding scale. However, their approaches did not consider either multi-scaled style in their single-scale feature transformer or dependency between the transformed feature statistics across the cascade networks. This shortcoming resulted in generating partially and inexactly transferred style in the generated images. +To overcome this limitation of partial style transfer, we propose a total style transferring method which transfers multi-scaled feature statistics through a single feed-forward process. 
First, our method transforms multi-scaled feature maps of a content image into those of a target style image by considering both inter-channel correlations in each single scaled feature map and inter-scale correlations between multi-scaled feature maps. Second, each transformed feature map is inserted into the decoder layer of the corresponding scale using skip-connection. Finally, the skip-connected multi-scaled feature maps are decoded into a stylized image through our trained decoder network.",/pdf/ea3856820e1ebb8047b7f356c98c5bb5ffe34c98.pdf,ICLR,2019,A paper suggesting a method to transform the style of images using deep neural networks. +BylNoaVYPS,S1xeWy1ODH,1569440000000.0,1577170000000.0,739,Variational Autoencoders for Opponent Modeling in Multi-Agent Systems,"[""g.papoudakis@ed.ac.uk"", ""s.albrecht@ed.ac.uk""]","[""Georgios Papoudakis"", ""Stefano V. Albrecht""]","[""reinforcement learning"", ""multi-agent systems"", ""representation learning""]","Multi-agent systems exhibit complex behaviors that emanate from the interactions of multiple agents in a shared environment. In this work, we are interested in controlling one agent in a multi-agent system and successfully learn to interact with the other agents that have fixed policies. Modeling the behavior of other agents (opponents) is essential in understanding the interactions of the agents in the system. By taking advantage of recent advances in unsupervised learning, we propose modeling opponents using variational autoencoders. Additionally, many existing methods in the literature assume that the opponent models have access to opponent's observations and actions during both training and execution. To eliminate this assumption, we propose a modification that attempts to identify the underlying opponent model, using only local information of our agent, such as its observations, actions, and rewards. 
The experiments indicate that our opponent modeling methods achieve equal or greater episodic returns in reinforcement learning tasks against another modeling method.",/pdf/ceb8f954704ee19f3a85a4fcd414a51be35cd432.pdf,ICLR,2020, +uCQfPZwRaUu,0tnopZZW4y6H,1601310000000.0,1616030000000.0,1629,Data-Efficient Reinforcement Learning with Self-Predictive Representations,"[""~Max_Schwarzer1"", ""~Ankesh_Anand1"", ""~Rishab_Goel3"", ""~R_Devon_Hjelm1"", ""~Aaron_Courville3"", ""~Philip_Bachman1""]","[""Max Schwarzer"", ""Ankesh Anand"", ""Rishab Goel"", ""R Devon Hjelm"", ""Aaron Courville"", ""Philip Bachman""]","[""Reinforcement Learning"", ""Self-Supervised Learning"", ""Representation Learning"", ""Sample Efficiency""]","While deep reinforcement learning excels at solving tasks where large amounts of data can be collected through virtually unlimited interaction with the environment, learning from limited interaction remains a key challenge. We posit that an agent can learn more efficiently if we augment reward maximization with self-supervised objectives based on structure in its visual input and sequential interaction with the environment. Our method, Self-Predictive Representations (SPR), trains an agent to predict its own latent state representations multiple steps into the future. We compute target representations for future states using an encoder which is an exponential moving average of the agent’s parameters and we make predictions using a learned transition model. On its own, this future prediction objective outperforms prior methods for sample-efficient deep RL from pixels. We further improve performance by adding data augmentation to the future prediction loss, which forces the agent’s representations to be consistent across multiple views of an observation. 
Our full self-supervised objective, which combines future prediction and data augmentation, achieves a median human-normalized score of 0.415 on Atari in a setting limited to 100k steps of environment interaction, which represents a 55% relative improvement over the previous state-of-the-art. Notably, even in this limited data regime, SPR exceeds expert human scores on 7 out of 26 games. We’ve made the code associated with this work available at https://github.com/mila-iqia/spr.",/pdf/1332dd3bfd157968abcdfda3acf4d4a7499d6143.pdf,ICLR,2021,"We propose a temporal, self-supervised objective for RL agents and show that it significantly improves data efficiency in a setting limited to just 2h of gameplay on Atari. " +GCXq4UHH7h4,juG-f8DsKy1,1601310000000.0,1614990000000.0,361,Selective Sensing: A Data-driven Nonuniform Subsampling Approach for Computation-free On-Sensor Data Dimensionality Reduction,"[""~Zhikang_Zhang1"", ""kaixu@asu.edu"", ""~Fengbo_Ren1""]","[""Zhikang Zhang"", ""Kai Xu"", ""Fengbo Ren""]","[""Compressive sensing"", ""nonuniform subsampling"", ""machine learning""]","Designing an on-sensor data dimensionality reduction scheme for efficient signal sensing has always been a challenging task. Compressive sensing is a state-of-the-art sensing technique used for on-sensor data dimensionality reduction. However, the undesired computational complexity involved in the sensing stage of compressive sensing limits its practical application in resource-constrained sensor devices or high-data-rate sensor devices dealing with high-dimensional signals. In this paper, we propose a selective sensing framework that adopts the novel concept of data-driven nonuniform subsampling to reduce the dimensionality of acquired signals while retaining the information of interest in a computation-free fashion. Selective sensing adopts a co-optimization methodology to co-train a selective sensing operator with a subsequent information decoding neural network. 
We take image as the sensing modality and reconstruction as the information decoding task to demonstrate the 1st proof-of-concept of selective sensing. The experiment results on CIFAR10, Set5 and Set14 datasets show that selective sensing can achieve an average reconstruction accuracy improvement in terms of PSNR/SSIM by 3.73dB/0.07 and 9.43dB/0.16 over compressive sensing and uniform subsampling counterparts across the compression ratios of 4-32x, respectively. Source code is available at https://figshare.com/s/519a923fae8f386d7f5b",/pdf/2b143a9ae6da6f87d7e2c7628621893bf7f8dcc4.pdf,ICLR,2021,We propose a selective sensing framework that adopts the novel concept of data-driven nonuniform subsampling for on-sensor data dimensionality reduction. +BkfxKj09Km,H1xLYt9qFQ,1538090000000.0,1545360000000.0,414,DiffraNet: Automatic Classification of Serial Crystallography Diffraction Patterns,"[""arturluis@dcc.ufmg.br"", ""leob@dcc.ufmg.br"", ""shollatz@slac.stanford.edu"", ""mattfel@stanford.edu"", ""kunle@stanford.edu"", ""jmholton@slac.stanford.edu"", ""acohen@slac.stanford.edu"", ""lnardi@stanford.edu""]","[""Artur Souza"", ""Leonardo B. Oliveira"", ""Sabine Hollatz"", ""Matt Feldman"", ""Kunle Olukotun"", ""James M. Holton"", ""Aina E. Cohen"", ""Luigi Nardi""]","[""Serial Crystallography"", ""Deep Learning"", ""Image Classification""]","Serial crystallography is the field of science that studies the structure and properties of crystals via diffraction patterns. In this paper, we introduce a new serial crystallography dataset generated through the use of a simulator; the synthetic images are labeled and they are both scalable and accurate. The resulting synthetic dataset is called DiffraNet, and it is composed of 25,000 512x512 grayscale labeled images. 
We explore several computer vision approaches for classification on DiffraNet such as standard feature extraction algorithms associated with Random Forests and Support Vector Machines but also an end-to-end CNN topology dubbed DeepFreak tailored to work on this new dataset. All implementations are publicly available and have been fine-tuned using off-the-shelf AutoML optimization tools for a fair comparison. Our best model achieves 98.5% accuracy. We believe that the DiffraNet dataset and its classification methods will have in the long term a positive impact in accelerating discoveries in many disciplines, including chemistry, geology, biology, materials science, metallurgy, and physics.",/pdf/84429597a60b9343f9356b3604723008e35e87cb.pdf,ICLR,2019,We introduce a new synthetic dataset for serial crystallography that can be used to train image classification models and explore computer vision and deep learning approaches to classify them. +r1ln504YvH,SkgnlCd_PB,1569440000000.0,1577170000000.0,1296,Actor-Critic Approach for Temporal Predictive Clustering,"[""chl8856@gmail.com"", ""mihaela@ee.ucla.edu""]","[""Changhee Lee"", ""Mihaela van der Schaar""]","[""Temporal Clustering"", ""Predictive Clustering"", ""Actor-Critic""]","Due to the wider availability of modern electronic health records (EHR), patient care data is often being stored in the form of time-series. Clustering such time-series data is crucial for patient phenotyping, anticipating patients’ prognoses by identifying “similar” patients, and designing treatment guidelines that are tailored to homogeneous patient subgroups. In this paper, we develop a deep learning approach for clustering time-series data, where each cluster comprises patients who share similar future outcomes of interest (e.g., adverse events, the onset of comorbidities, etc.). The clustering is carried out by using our novel loss functions that encourage each cluster to have homogeneous future outcomes. 
We adopt actor-critic models to allow “back-propagation” through the sampling process that is required for assigning clusters to time-series inputs. Experiments on two real-world datasets show that our model achieves superior clustering performance over state-of-the-art benchmarks and identifies meaningful clusters that can be translated into actionable information for clinical decision-making.",/pdf/eea6aee838fa04f13f8dd6014a57ee295fea9fe0.pdf,ICLR,2020, +B1e8CsRctX,S1gnlDT5Y7,1538090000000.0,1545360000000.0,898,Generative Ensembles for Robust Anomaly Detection,"[""hyunsunchoi@kaist.ac.kr"", ""ejang@google.com""]","[""Hyunsun Choi"", ""Eric Jang""]","[""Anomaly Detection"", ""Uncertainty"", ""Out-of-Distribution"", ""Generative Models""]","Deep generative models are capable of learning probability distributions over large, high-dimensional datasets such as images, video and natural language. Generative models trained on samples from p(x) ought to assign low likelihoods to out-of-distribution (OoD) samples from q(x), making them suitable for anomaly detection applications. We show that in practice, likelihood models are themselves susceptible to OoD errors, and even assign large likelihoods to images from other natural datasets. To mitigate these issues, we propose Generative Ensembles, a model-independent technique for OoD detection that combines density-based anomaly detection with uncertainty estimation. Our method outperforms ODIN and VIB baselines on image datasets, and achieves comparable performance to a classification model on the Kaggle Credit Fraud dataset.",/pdf/4d5bb7790da80a5224bb9b17051a26d51e6bc3ed.pdf,ICLR,2019,"We use generative models to perform out-of-distribution detection, and improve their robustness with uncertainty estimation." 
+rkly70EKDH,BkezlErdPS,1569440000000.0,1577170000000.0,1022,Mildly Overparametrized Neural Nets can Memorize Training Data Efficiently,"[""rongge@cs.duke.edu"", ""wrz16@mails.tsinghua.edu.cn"", ""zhaohy16@mails.tsinghua.edu.cn""]","[""Rong Ge"", ""Runzhe Wang"", ""Haoyu Zhao""]","[""nonconvex optimization"", ""optimization landscape"", ""overparametrization""]","It has been observed \citep{zhang2016understanding} that deep neural networks can memorize: they achieve 100\% accuracy on training data. Recent theoretical results explained such behavior in highly overparametrized regimes, where the number of neurons in each layer is larger than the number of training samples. In this paper, we show that neural networks can be trained to memorize training data perfectly in a mildly overparametrized regime, where the number of parameters is just a constant factor more than the number of training samples, and the number of neurons is much smaller.",/pdf/233cebaaec3173904ad3ea99e98924f0b066a940.pdf,ICLR,2020,We show even mildly overparametrized networks (much smaller than existing results) can be trained to perfectly memorize training data. +B1Z3W-b0W,rkehWW-R-,1509130000000.0,1518730000000.0,655,Learning to Infer,"[""jmarino@caltech.edu"", ""yyue@caltech.edu"", ""stephan.mandt@disneyresearch.com""]","[""Joseph Marino"", ""Yisong Yue"", ""Stephan Mandt""]","[""Bayesian Deep Learning"", ""Amortized Inference"", ""Variational Auto-Encoders"", ""Learning to Learn""]","Inference models, which replace an optimization-based inference procedure with a learned model, have been fundamental in advancing Bayesian deep learning, the most notable example being variational auto-encoders (VAEs). In this paper, we propose iterative inference models, which learn how to optimize a variational lower bound through repeatedly encoding gradients. 
Our approach generalizes VAEs under certain conditions, and by viewing VAEs in the context of iterative inference, we provide further insight into several recent empirical findings. We demonstrate the inference optimization capabilities of iterative inference models, explore unique aspects of these models, and show that they outperform standard inference models on typical benchmark data sets.",/pdf/1da3c7e5f9710b8de0b13a93d63d5185827636cd.pdf,ICLR,2018,We propose a new class of inference models that iteratively encode gradients to estimate approximate posterior distributions. +0fqoSxXBwI6,eAXOFV34VOC,1601310000000.0,1614990000000.0,505,Self-Supervised Multi-View Learning via Auto-Encoding 3D Transformations,"[""~Xiang_Gao2"", ""~Wei_Hu6"", ""~Guo-Jun_Qi1""]","[""Xiang Gao"", ""Wei Hu"", ""Guo-Jun Qi""]","[""Self-supervised Learning"", ""Multi-View Learning""]","3D object representation learning is a fundamental challenge in computer vision to draw inferences about the 3D world. Recent advances in deep learning have shown their efficiency in 3D object recognition, among which view-based methods have performed best so far. However, feature learning of multiple views in existing methods is mostly trained in a supervised fashion, which often requires a large amount of data labels with high cost. Hence, it is critical to learn multi-view feature representations in a self-supervised fashion. To this end, we propose a novel self-supervised learning paradigm of Multi-View Transformation Equivariant Representations (MV-TER), exploiting the equivariant transformations of a 3D object and its projected multiple views. Specifically, we perform a 3D transformation on a 3D object, and obtain multiple views before and after transformation via projection. 
Then, we self-train a representation learning module to capture the intrinsic 3D object representation by decoding 3D transformation parameters from the fused feature representations of multiple views before and after transformation. Experimental results demonstrate that the proposed MV-TER significantly outperforms the state-of-the-art view-based approaches in 3D object classification and retrieval tasks.",/pdf/1a21695058da19507367628a927a045e4d01c7f4.pdf,ICLR,2021, +rkl4M3R5K7,HkgoKh6tt7,1538090000000.0,1545360000000.0,1253,Optimal Attacks against Multiple Classifiers,"[""jcperdomo@berkeley.edu"", ""yaron@seas.harvard.edu""]","[""Juan C. Perdomo"", ""Yaron Singer""]","[""online learning"", ""nonconvex optimization"", ""robust optimization""]","We study the problem of designing provably optimal adversarial noise algorithms that induce misclassification in settings where a learner aggregates decisions from multiple classifiers. Given the demonstrated vulnerability of state-of-the-art models to adversarial examples, recent efforts within the field of robust machine learning have focused on the use of ensemble classifiers as a way of boosting the robustness of individual models. In this paper, we design provably optimal attacks against a set of classifiers. We demonstrate how this problem can be framed as finding strategies at equilibrium in a two player, zero sum game between a learner and an adversary and consequently illustrate the need for randomization in adversarial attacks. The main technical challenge we consider is the design of best response oracles that can be implemented in a Multiplicative Weight Updates framework to find equilibrium strategies in the zero-sum game. We develop a series of scalable noise generation algorithms for deep neural networks, and show that it outperforms state-of-the-art attacks on various image classification tasks. 
Although there are generally no guarantees for deep learning, we show this is a well-principled approach in that it is provably optimal for linear classifiers. The main insight is a geometric characterization of the decision space that reduces the problem of designing best response oracles to minimizing a quadratic function over a set of convex polytopes.",/pdf/8b6730a932d8c35e7203ff8c676ba6c763467587.pdf,ICLR,2019,"Paper analyzes the problem of designing adversarial attacks against multiple classifiers, introducing algorithms that are optimal for linear classifiers and which provide state-of-the-art results for deep learning." +B14TlG-RW,B1X6lMZ0b,1509130000000.0,1524560000000.0,775,QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension,"[""weiyu@cs.cmu.edu"", ""ddohan@google.com"", ""thangluong@google.com"", ""rzhao@google.com"", ""kaichen@google.com"", ""mnorouzi@google.com"", ""qvl@google.com""]","[""Adams Wei Yu"", ""David Dohan"", ""Minh-Thang Luong"", ""Rui Zhao"", ""Kai Chen"", ""Mohammad Norouzi"", ""Quoc V. Le""]","[""squad"", ""stanford question answering dataset"", ""reading comprehension"", ""attention"", ""text convolutions"", ""question answering""]"," Current end-to-end machine reading and question answering (Q\&A) models are primarily based on recurrent neural networks (RNNs) with attention. Despite their success, these models are often slow for both training and inference due to the sequential nature of RNNs. We propose a new Q\&A architecture called QANet, which does not require recurrent networks: Its encoder consists exclusively of convolution and self-attention, where convolution models local interactions and self-attention models global interactions. On the SQuAD dataset, our model is 3x to 13x faster in training and 4x to 9x faster in inference, while achieving equivalent accuracy to recurrent models. The speed-up gain allows us to train the model with much more data. 
We hence combine our model with data generated by backtranslation from a neural machine translation model. +On the SQuAD dataset, our single model, trained with augmented data, achieves 84.6 F1 score on the test set, which is significantly better than the best published F1 score of 81.8.",/pdf/f29c4e90006287f3f3ef7b0172258e23dcc512d8.pdf,ICLR,2018,A simple architecture consisting of convolutions and attention achieves results on par with the best documented recurrent models. +HJePno0cYm,H1ll8Y2qKm,1538090000000.0,1545360000000.0,717,Transformer-XL: Language Modeling with Longer-Term Dependency,"[""zander.dai@gmail.com"", ""zhiliny@cs.cmu.edu"", ""yiming@cs.cmu.edu"", ""wcohen@google.com"", ""jgc@cs.cmu.edu"", ""qvl@google.com"", ""rsalakhu@cs.cmu.edu""]","[""Zihang Dai*"", ""Zhilin Yang*"", ""Yiming Yang"", ""William W. Cohen"", ""Jaime Carbonell"", ""Quoc V. Le"", ""Ruslan Salakhutdinov""]","[""Language Modeling"", ""Self-Attention""]","We propose a novel neural architecture, Transformer-XL, for modeling longer-term dependency. To address the limitation of fixed-length contexts, we introduce a notion of recurrence by reusing the representations from the history. Empirically, we show state-of-the-art (SoTA) results on both word-level and character-level language modeling datasets, including WikiText-103, One Billion Word, Penn Treebank, and enwiki8. Notably, we improve the SoTA results from 1.06 to 0.99 in bpc on enwiki8, from 33.0 to 18.9 in perplexity on WikiText-103, and from 28.0 to 23.5 in perplexity on One Billion Word. Performance improves when the attention length increases during evaluation, and our best model attends to up to 1,600 words and 3,800 characters. To quantify the effective length of dependency, we devise a new metric and show that on WikiText-103 Transformer-XL manages to model dependency that is about 80% longer than recurrent networks and 450% longer than Transformer. 
Moreover, Transformer-XL is up to 1,800+ times faster than vanilla Transformer during evaluation.",/pdf/4ccd6a96dea7974a19a037cd6374beb1b5fbfa2d.pdf,ICLR,2019, +TSrvUnWkjGR,uPz1BurC5z,1601310000000.0,1614990000000.0,610,On the Inversion of Deep Generative Models,"[""~Aviad_Aberdam1"", ""~Dror_Simon1"", ""~Michael_Elad1""]","[""Aviad Aberdam"", ""Dror Simon"", ""Michael Elad""]","[""Sparse Representation"", ""Inverse Problem"", ""Deep Generative Models"", ""Compressed Sensing""]","Deep generative models (e.g. GANs and VAEs) have been developed quite extensively in recent years. Lately, there has been an increased interest in the inversion of such a model, i.e. given a (possibly corrupted) signal, we wish to recover the latent vector that generated it. Building upon sparse representation theory, we define conditions that rely only on the cardinalities of the hidden layer and are applicable to any inversion algorithm (gradient descent, deep encoder, etc.), under which such generative models are invertible with a unique solution. Importantly, the proposed analysis is applicable to any trained model, and does not depend on Gaussian i.i.d. weights. Furthermore, we introduce two layer-wise inversion pursuit algorithms for trained generative networks of arbitrary depth, where one of them is accompanied by recovery guarantees. 
Finally, we validate our theoretical results numerically and show that our method outperforms gradient descent when inverting such generators, both for clean and corrupted signals.",/pdf/fa58e95707e357c4869b462ba8a2d34f7553d2a0.pdf,ICLR,2021,We derive theoretical conditions for the invertibility of deep generative models and introduce a layerwise inversion algorithm with provable guarantees +HkgeGeBYDB,rJxe9bgYvH,1569440000000.0,1583910000000.0,2160,RaPP: Novelty Detection with Reconstruction along Projection Pathway,"[""khkim@makinarocks.ai"", ""sangwoo@makinarocks.ai"", ""yongsub@makinarocks.ai"", ""jongseob.jeon@makinarocks.ai"", ""jeongwoo@makinarocks.ai"", ""kbc8894@makinarocks.ai"", ""andre@makinarocks.ai""]","[""Ki Hyun Kim"", ""Sangwoo Shim"", ""Yongsub Lim"", ""Jongseob Jeon"", ""Jeongwoo Choi"", ""Byungchan Kim"", ""Andre S. Yoon""]","[""Novelty Detection"", ""Anomaly Detection"", ""Outlier Detection"", ""Semi-supervised Learning""]","We propose RaPP, a new methodology for novelty detection by utilizing hidden space activation values obtained from a deep autoencoder. +Precisely, RaPP compares input and its autoencoder reconstruction not only in the input space but also in the hidden spaces. +We show that if we feed a reconstructed input to the same autoencoder again, its activated values in a hidden space are equivalent to the corresponding reconstruction in that hidden space given the original input. +In order to aggregate the hidden space activation values, we propose two metrics, which enhance the novelty detection performance. +Through extensive experiments using diverse datasets, we validate that RaPP improves novelty detection performances of autoencoder-based approaches. +Besides, we show that RaPP outperforms recent novelty detection methods evaluated on popular benchmarks. 
+",/pdf/75c618424b6aa0843dfcac31c350b2c953cbeb37.pdf,ICLR,2020,A new methodology for novelty detection by utilizing hidden space activation values obtained from a deep autoencoder. +SylO2yStDr,B1xEkzkFvr,1569440000000.0,1583910000000.0,1958,Reducing Transformer Depth on Demand with Structured Dropout,"[""angelafan@fb.com"", ""egrave@fb.com"", ""ajoulin@fb.com""]","[""Angela Fan"", ""Edouard Grave"", ""Armand Joulin""]","[""reduction"", ""regularization"", ""pruning"", ""dropout"", ""transformer""]","Overparametrized transformer networks have obtained state of the art results in various natural language processing tasks, such as machine translation, language modeling, and question answering. These models contain hundreds of millions of parameters, necessitating a large amount of computation and making them prone to overfitting. In this work, we explore LayerDrop, a form of structured dropout, which has a regularization effect during training and allows for efficient pruning at inference time. In particular, we show that it is possible to select sub-networks of any depth from one large network without having to finetune them and with limited impact on performance. We demonstrate the effectiveness of our approach by improving the state of the art on machine translation, language modeling, summarization, question answering, and language understanding benchmarks. Moreover, we show that our approach leads to small BERT-like models of higher quality than when training from scratch or using distillation.",/pdf/8a43abb0fa69970100c8643bfc1cf9eec21a5a04.pdf,ICLR,2020,"Layerdrop, a form of structured dropout that allows you to train one model at training time and prune to any desired depth at test time. You can also use this to train even deeper models." 
+pGIHq1m7PU,42431R3LZld,1601310000000.0,1617120000000.0,644,Explainable Subgraph Reasoning for Forecasting on Temporal Knowledge Graphs,"[""~Zhen_Han3"", ""peng.chen@tum.de"", ""~Yunpu_Ma1"", ""~Volker_Tresp1""]","[""Zhen Han"", ""Peng Chen"", ""Yunpu Ma"", ""Volker Tresp""]","[""Temporal knowledge graph"", ""future link prediction"", ""graph neural network"", ""subgraph reasoning.""]","Modeling time-evolving knowledge graphs (KGs) has recently gained increasing interest. Here, graph representation learning has become the dominant paradigm for link prediction on temporal KGs. However, the embedding-based approaches largely operate in a black-box fashion, lacking the ability to interpret their predictions. This paper provides a link forecasting framework that reasons over query-relevant subgraphs of temporal KGs and jointly models the structural dependencies and the temporal dynamics. Especially, we propose a temporal relational attention mechanism and a novel reverse representation update scheme to guide the extraction of an enclosing subgraph around the query. The subgraph is expanded by an iterative sampling of temporal neighbors and by attention propagation. Our approach provides human-understandable evidence explaining the forecast. We evaluate our model on four benchmark temporal knowledge graphs for the link forecasting task. While being more explainable, our model obtains a relative improvement of up to 20 $\%$ on Hits@1 compared to the previous best temporal KG forecasting method. We also conduct a survey with 53 respondents, and the results show that the evidence extracted by the model for link forecasting is aligned with human understanding. ",/pdf/0ab0ca1b52f6655da73e49f5bd22facb0665152b.pdf,ICLR,2021,We propose an explainable attention-based reasoning model for predicting future links on temporal knowledge graphs. 
+rJx1Na4Fwr,S1e1zDbPvB,1569440000000.0,1583910000000.0,470,MACER: Attack-free and Scalable Robust Training via Maximizing Certified Radius,"[""zhairuntian@pku.edu.cn"", ""cdan@cs.cmu.edu"", ""dihe@microsoft.com"", ""huan@huan-zhang.com"", ""boqinggo@outlook.com"", ""pradeepr@cs.cmu.edu"", ""chohsieh@cs.ucla.edu"", ""wanglw@cis.pku.edu.cn""]","[""Runtian Zhai"", ""Chen Dan"", ""Di He"", ""Huan Zhang"", ""Boqing Gong"", ""Pradeep Ravikumar"", ""Cho-Jui Hsieh"", ""Liwei Wang""]","[""Adversarial Robustness"", ""Provable Adversarial Defense"", ""Randomized Smoothing"", ""Robustness Certification""]","Adversarial training is one of the most popular ways to learn robust models but is usually attack-dependent and time costly. In this paper, we propose the MACER algorithm, which learns robust models without using adversarial training but performs better than all existing provable l2-defenses. Recent work shows that randomized smoothing can be used to provide a certified l2 radius to smoothed classifiers, and our algorithm trains provably robust smoothed classifiers via MAximizing the CErtified Radius (MACER). The attack-free characteristic makes MACER faster to train and easier to optimize. In our experiments, we show that our method can be applied to modern deep neural networks on a wide range of datasets, including Cifar-10, ImageNet, MNIST, and SVHN. For all tasks, MACER spends less training time than state-of-the-art adversarial training algorithms, and the learned models achieve larger average certified radius.",/pdf/6d10c9ab55c52279cab1dfdb3484046f613e3a2d.pdf,ICLR,2020,We propose MACER: a provable defense algorithm that trains robust models by maximizing the certified radius. It does not use adversarial training but performs better than all existing provable l2-defenses. 
+r1gGpjActQ,Hke0bTt5t7,1538090000000.0,1545360000000.0,781,Hint-based Training for Non-Autoregressive Translation,"[""lizhuohan@pku.edu.cn"", ""dihe@microsoft.com"", ""fetia@microsoft.com"", ""taoqin@microsoft.com"", ""wanglw@cis.pku.edu.cn"", ""tyliu@microsoft.com""]","[""Zhuohan Li"", ""Di He"", ""Fei Tian"", ""Tao Qin"", ""Liwei Wang"", ""Tie-Yan Liu""]","[""Natural Language Processing"", ""Machine Translation"", ""Non-Autoregressive Model""]","Machine translation is an important real-world application, and neural network-based AutoRegressive Translation (ART) models have achieved very promising accuracy. Due to the unparallelizable nature of the autoregressive factorization, ART models have to generate tokens one by one during decoding and thus suffer from high inference latency. Recently, Non-AutoRegressive Translation (NART) models were proposed to reduce the inference time. However, they could only achieve inferior accuracy compared with ART models. To improve the accuracy of NART models, in this paper, we propose to leverage the hints from a well-trained ART model to train the NART model. We define two hints for the machine translation task: hints from hidden states and hints from word alignments, and use such hints to regularize the optimization of NART models. Experimental results show that the NART model trained with hints could achieve significantly better translation performance than previous NART models on several tasks. In particular, for the WMT14 En-De and De-En task, we obtain BLEU scores of 25.20 and 29.52 respectively, which largely outperforms the previous non-autoregressive baselines. 
It is even comparable to a strong LSTM-based ART model (24.60 on WMT14 En-De), but one order of magnitude faster in inference.",/pdf/1b4a07fd6e3d0fe9980093a28d25f5cda6ad8856.pdf,ICLR,2019,"We develop a training algorithm for non-autoregressive machine translation models, achieving comparable accuracy to strong autoregressive baselines, but one order of magnitude faster in inference. " +H1gWyJBFDr,BJxL2squPB,1569440000000.0,1577170000000.0,1458,Fully Convolutional Graph Neural Networks using Bipartite Graph Convolutions,"[""nassar.marcel@gmail.com"", ""caseus.viridis@gmail.com"", ""nervetumer@gmail.com""]","[""Marcel Nassar"", ""Xin Wang"", ""Evren Tumer""]","[""Graph Neural Networks"", ""Graph Convolutional Networks""]","Graph neural networks have been adopted in numerous applications ranging from learning relational representations to modeling data on irregular domains such as point clouds, social graphs, and molecular structures. Though diverse in nature, graph neural network architectures remain limited by the graph convolution operator whose input and output graphs must have the same structure. With this restriction, representational hierarchy can only be built by graph convolution operations followed by non-parameterized pooling or expansion layers. This is very much like early convolutional network architectures, which later have been replaced by more effective parameterized strided and transpose convolution operations in combination with skip connections. In order to bring a similar change to graph convolutional networks, here we introduce the bipartite graph convolution operation, a parameterized transformation between different input and output graphs. Our framework is general enough to subsume conventional graph convolution and pooling as its special cases and supports multi-graph aggregation leading to a class of flexible and adaptable network architectures, termed BiGraphNet. 
By replacing the sequence of graph convolution and pooling in hierarchical architectures with a single parametric bipartite graph convolution, (i) we answer the question of whether graph pooling matters, and (ii) accelerate computations and lower memory requirements in hierarchical networks by eliminating pooling layers. Then, with concrete examples, we demonstrate that the general BiGraphNet formalism (iii) provides the modeling flexibility to build efficient architectures such as graph skip connections, and autoencoders.",/pdf/f3cb71c3f0490504344df3989a674f66cec639cd.pdf,ICLR,2020, +aGmEDl1NWJ-,eKgJqb-_5Tb,1601310000000.0,1614990000000.0,822,Luring of transferable adversarial perturbations in the black-box paradigm,"[""~R\u00e9mi_Bernhard1"", ""~Pierre-Alain_Mo\u00ebllic1"", ""~Jean-Max_Dutertre1""]","[""R\u00e9mi Bernhard"", ""Pierre-Alain Mo\u00ebllic"", ""Jean-Max Dutertre""]","[""Neural Networks"", ""Adversarial Machine Learning"", ""Security""]","The growing interest for adversarial examples, i.e. maliciously modified examples which fool a classifier, has resulted in many defenses intended to detect them, render them inoffensive or make the model more robust against them. In this paper, we pave the way towards a new approach to improve the robustness of a model against black-box transfer attacks. A removable additional neural network is included in the target model and is designed to induce the ""luring effect"", which tricks the adversary into choosing false directions to fool the target model. Training the additional model is achieved thanks to a loss function acting on the logits sequence order. Our deception-based method only needs to have access to the predictions of the target model and does not require a labeled data set. We explain the luring effect thanks to the notion of robust and non-robust useful features and perform experiments on MNIST, SVHN and CIFAR10 to characterize and evaluate this phenomenon. 
Additionally, we discuss two simple prediction schemes, and verify experimentally that our approach can be used as a defense to efficiently thwart an adversary using state-of-the-art attacks and allowed to perform large perturbations.",/pdf/3f355f1983f1570f2e64615f098c4e8fcdc8ffc0.pdf,ICLR,2021,We propose a new approach to improve the robustness of a neural network model against black-box transfer attacks based on a deception method. +Oq79NOiZB1H,KGbEooOr_W8,1601310000000.0,1614990000000.0,3370,On the Importance of Sampling in Training GCNs: Convergence Analysis and Variance Reduction,"[""~Weilin_Cong1"", ""~Morteza_Ramezani1"", ""~Mehrdad_Mahdavi2""]","[""Weilin Cong"", ""Morteza Ramezani"", ""Mehrdad Mahdavi""]","[""Graph neural network"", ""large-scale machine learning"", ""convergence analysis""]","Graph Convolutional Networks (GCNs) have achieved impressive empirical advancement across a wide variety of graph-related applications. Despite their great success, training GCNs on large graphs suffers from computational and memory issues. A potential path to circumvent these obstacles is sampling-based methods, where at each layer a subset of nodes is sampled. Although recent studies have empirically demonstrated the effectiveness of sampling-based methods, these works lack theoretical convergence guarantees under realistic settings and cannot fully leverage the information of evolving parameters during optimization. In this paper, we describe and analyze a general \textbf{\textit{doubly variance reduction}} schema that can accelerate any sampling method under the memory budget. The motivating impetus for the proposed schema is a careful analysis for the variance of sampling methods where it is shown that the induced variance can be decomposed into node embedding approximation variance (\emph{zeroth-order variance}) during forward propagation and layerwise-gradient variance (\emph{first-order variance}) during backward propagation. 
We theoretically analyze the convergence of the proposed schema and show that it enjoys an $\mathcal{O}(1/T)$ convergence rate. We complement our theoretical results by integrating the proposed schema in different sampling methods and applying them to different large real-world graphs.",/pdf/6a70e9058a0fd6319c9245eeec1f9ffcf9532ae7.pdf,ICLR,2021,Provide theoretical analysis on sampling-based GCN training and new algorithms to speed up training process. +H1gzR2VKDH,HyxWEPIVwH,1569440000000.0,1584730000000.0,256,Hierarchical Foresight: Self-Supervised Learning of Long-Horizon Tasks via Visual Subgoal Generation,"[""surajn@stanford.edu"", ""chelseaf@google.com""]","[""Suraj Nair"", ""Chelsea Finn""]","[""video prediction"", ""reinforcement learning"", ""planning""]","Video prediction models combined with planning algorithms have shown promise in enabling robots to learn to perform many vision-based tasks through only self-supervision, reaching novel goals in cluttered scenes with unseen objects. However, due to the compounding uncertainty in long horizon video prediction and poor scalability of sampling-based planning optimizers, one significant limitation of these approaches is the ability to plan over long horizons to reach distant goals. To that end, we propose a framework for subgoal generation and planning, hierarchical visual foresight (HVF), which generates subgoal images conditioned on a goal image, and uses them for planning. The subgoal images are directly optimized to decompose the task into easy to plan segments, and as a result, we observe that the method naturally identifies semantically meaningful states as subgoals. Across three out of four simulated vision-based manipulation tasks, we find that our method achieves more than 20% absolute performance improvement over planning without subgoals and model-free RL approaches. 
Further, our experiments illustrate that our approach extends to real, cluttered visual scenes.",/pdf/aa5541d4bdbf7a204c65f956284a12d430af31bb.pdf,ICLR,2020,"Hierarchical visual foresight learns to generate visual subgoals that break down long-horizon tasks into subtasks, using only self-supervision." +H1lXVJStwB,Skgaoi2_DB,1569440000000.0,1577170000000.0,1649,Dynamic Instance Hardness,"[""tianyizh@uw.edu"", ""wangsj@cs.washington.edu"", ""bilmes@uw.edu""]","[""Tianyi Zhou"", ""Shengjie Wang"", ""Jeff A. Bilmes""]","[""training dynamics"", ""instance hardness"", ""curriculum learning"", ""neural nets memorization""]","We introduce dynamic instance hardness (DIH) to facilitate the training of machine learning models. DIH is a property of each training sample and is computed as the running mean of the sample's instantaneous hardness as measured over the training history. We use DIH to evaluate how well a model retains knowledge about each training sample over time. We find that for deep neural nets (DNNs), the DIH of a sample in relatively early training stages reflects its DIH in later stages and as a result, DIH can be effectively used to reduce the set of training samples in future epochs. Specifically, during each epoch, only samples with high DIH are trained (since they are historically hard) while samples with low DIH can be safely ignored. DIH is updated each epoch only for the selected samples, so it does not require additional computation. Hence, using DIH during training leads to an appreciable speedup. Also, since the model is focused on the historically more challenging samples, resultant models are more accurate. The above, when formulated as an algorithm, can be seen as a form of curriculum learning, so we call our framework DIH curriculum learning (or DIHCL). 
The advantages of DIHCL, compared to other curriculum learning approaches, are: (1) DIHCL does not require additional inference steps over the data not selected by DIHCL in each epoch, (2) the dynamic instance hardness, compared to static instance hardness (e.g., instantaneous loss), is more stable as it integrates information over the entire training history up to the present time. Making certain mathematical assumptions, we formulate the problem of DIHCL as finding a curriculum that maximizes a multi-set function $f(\cdot)$, and derive an approximation bound for a DIH-produced curriculum relative to the optimal curriculum. Empirically, DIHCL-trained DNNs significantly outperform random mini-batch SGD and other recently developed curriculum learning methods in terms of efficiency, early-stage convergence, and final performance, and this is shown in training several state-of-the-art DNNs on 11 modern datasets.",/pdf/ea6e953e80c6ba43965e21841dfd41319b4c4345.pdf,ICLR,2020,New understanding of training dynamics and metrics of memorization hardness lead to efficient and provable curriculum learning. +ByGOuo0cYm,S1xhUvuct7,1538090000000.0,1545360000000.0,368,Meta-Learning with Domain Adaptation for Few-Shot Learning under Domain Shift,"[""doyens@smu.edu.sg"", ""hungle.2018@phdis.smu.edu.sg"", ""chliu@smu.edu.sg"", ""chhoi@smu.edu.sg""]","[""Doyen Sahoo"", ""Hung Le"", ""Chenghao Liu"", ""Steven C. H. Hoi""]","[""Meta-Learning"", ""Few-Shot Learning"", ""Domain Adaptation""]","Few-Shot Learning (learning with limited labeled data) aims to overcome the limitations of traditional machine learning approaches which require thousands of labeled examples to train an effective model. Considered as a hallmark of human intelligence, the community has recently witnessed several contributions on this topic, in particular through meta-learning, where a model learns how to learn an effective model for few-shot learning. 
The main idea is to acquire prior knowledge from a set of training tasks, which is then used to perform (few-shot) test tasks. Most existing work assumes that both training and test tasks are drawn from the same distribution, and a large amount of labeled data is available in the training tasks. This is a very strong assumption which restricts the usage of meta-learning strategies in the real world where ample training tasks following the same distribution as test tasks may not be available. In this paper, we propose a novel meta-learning paradigm wherein a few-shot learning model is learnt, which simultaneously overcomes domain shift between the train and test tasks via adversarial domain adaptation. We demonstrate the efficacy of the proposed method through extensive experiments.",/pdf/6d59180cbf2068f53327d9f97b524ba0ca3cbd7a.pdf,ICLR,2019,Meta Learning for Few Shot learning assumes that training tasks and test tasks are drawn from the same distribution. What do you do if they are not? Meta Learning with task-level Domain Adaptation! +BJf_YjCqYX,Sye52ZWcKm,1538090000000.0,1545360000000.0,460,Identifying Bias in AI using Simulation,"[""damcduff@microsoft.com"", ""rocheng@microsoft.com"", ""akapoor@microsoft.com""]","[""Daniel McDuff"", ""Roger Cheng"", ""Ashish Kapoor""]","[""Bias"", ""Simulation"", ""Optimization"", ""Face Detection""]","Machine learned models exhibit bias, often because the datasets used to train them are biased. This presents a serious problem for the deployment of such technology, as the resulting models might perform poorly on populations that are minorities within the training set and ultimately present higher risks to them. We propose to use high-fidelity computer simulations to interrogate and diagnose biases within ML classifiers. We present a framework that leverages Bayesian parameter search to efficiently characterize the high dimensional feature space and more quickly identify weakness in performance. 
We apply our approach to an example domain, face detection, and show that it can be used to help identify demographic biases in commercial face application programming interfaces (APIs).",/pdf/32ac04bc91ee838d7786842109dc4550371f657f.pdf,ICLR,2019,We present a framework that leverages high-fidelity computer simulations to interrogate and diagnose biases within ML classifiers. +GJkTaYTmzVS,z5PTp0blkfY,1601310000000.0,1614990000000.0,2776,Play to Grade: Grading Interactive Coding Games as Classifying Markov Decision Process,"[""~Allen_Nie1"", ""~Emma_Brunskill2"", ""chrisjpiech@gmail.com""]","[""Allen Nie"", ""Emma Brunskill"", ""Chris Piech""]","[""Deep Reinforcement Learning"", ""Education"", ""Automated Grading"", ""Program Testing""]","Contemporary coding education often present students with the task of developing programs that have user interaction and complex dynamic systems, such as mouse based games. While pedagogically compelling, grading such student programs requires dynamic user inputs, therefore they are difficult to grade by unit tests. In this paper we formalize the challenge of grading interactive programs as a task of classifying Markov Decision Processes (MDPs). Each student's program fully specifies an MDP where the agent needs to operate and decide, under reasonable generalization, if the dynamics and reward model of the input MDP conforms to a set of latent MDPs. We demonstrate that by experiencing a handful of latent MDPs millions of times, we can use the agent to sample trajectories from the input MDP and use a classifier to determine membership. Our method drastically reduces the amount of data needed to train an automatic grading system for interactive code assignments and present a challenge to state-of-the-art reinforcement learning generalization methods. Together with Code.org, we curated a dataset of 700k student submissions, one of the largest dataset of anonymized student submissions to a single assignment. 
This Code.org assignment had no previous solution for automatically providing correctness feedback to students and as such this contribution could lead to meaningful improvement in educational experience.",/pdf/dbe0c37012e609beabd27b0cede18013b7d19f40.pdf,ICLR,2021,We apply deep reinforcement learning to learn an agent that can interact with online coding games and develop classifier to grade them with high accuracy. +rJiNwv9gg,,1478290000000.0,1488390000000.0,390,Lossy Image Compression with Compressive Autoencoders,"[""ltheis@twitter.com"", ""wshi@twitter.com"", ""acunningham@twitter.com"", ""fhuszar@twitter.com""]","[""Lucas Theis"", ""Wenzhe Shi"", ""Andrew Cunningham"", ""Ferenc Husz\u00e1r""]","[""Computer vision"", ""Deep learning"", ""Applications""]","We propose a new approach to the problem of optimizing autoencoders for lossy image compression. New media formats, changing hardware technology, as well as diverse requirements and content types create a need for compression algorithms which are more flexible than existing codecs. Autoencoders have the potential to address this need, but are difficult to optimize directly due to the inherent non-differentiabilty of the compression loss. We here show that minimal changes to the loss are sufficient to train deep autoencoders competitive with JPEG 2000 and outperforming recently proposed approaches based on RNNs. Our network is furthermore computationally efficient thanks to a sub-pixel architecture, which makes it suitable for high-resolution images. This is in contrast to previous work on autoencoders for compression using coarser approximations, shallower architectures, computationally expensive methods, or focusing on small images.",/pdf/a745e9aa53656d123430c65f1b3ab1ac37d11245.pdf,ICLR,2017,A simple approach to train autoencoders to compress images as well or better than JPEG 2000. 
+INhwJdJtxn6,cgVnEUlVMbJ,1601310000000.0,1614990000000.0,3236,Coverage as a Principle for Discovering Transferable Behavior in Reinforcement Learning,"[""~V\u00edctor_Campos1"", ""~Pablo_Sprechmann1"", ""~Steven_Stenberg_Hansen1"", ""~Andre_Barreto1"", ""~Charles_Blundell1"", ""avlife@google.com"", ""~Steven_Kapturowski1"", ""~Adria_Puigdomenech_Badia2""]","[""V\u00edctor Campos"", ""Pablo Sprechmann"", ""Steven Stenberg Hansen"", ""Andre Barreto"", ""Charles Blundell"", ""Alex Vitvitskyi"", ""Steven Kapturowski"", ""Adria Puigdomenech Badia""]","[""deep reinforcement learning"", ""transfer learning"", ""unsupervised learning"", ""exploration""]","Designing agents that acquire knowledge autonomously and use it to solve new tasks efficiently is an important challenge in reinforcement learning. Unsupervised learning provides a useful paradigm for autonomous acquisition of task-agnostic knowledge. In supervised settings, representations discovered through unsupervised pre-training offer important benefits when transferred to downstream tasks. Given the nature of the reinforcement learning problem, we explore how to transfer knowledge through behavior instead of representations. The behavior of pre-trained policies may be used for solving the task at hand (exploitation), as well as for collecting useful data to solve the problem (exploration). We argue that pre-training policies to maximize coverage will result in behavior that is useful for both strategies. When using these policies for both exploitation and exploration, our agents discover solutions that lead to larger returns. 
The largest gains are generally observed in domains requiring structured exploration, including settings where the behavior of the pre-trained policies is misaligned with the downstream task.",/pdf/f61a0bd1bf18a7a4188cd8f8c9b04e5224b515ab.pdf,ICLR,2021,"We pre-train agents to maximize coverage in the absence of reward, and show that the discovered behaviors can be used for transfer to downstream tasks via exploration and exploitation mechanisms." +4I5THWNSjC,PKGk0EVTjm,1601310000000.0,1614990000000.0,1817,BasisNet: Two-stage Model Synthesis for Efficient Inference,"[""~Mingda_Zhang1"", ""~Andrey_Zhmoginov1"", ""~Andrew_G._Howard1"", ""~Brendan_Jou1"", ""~Yukun_Zhu1"", ""zhl@google.com"", ""~Rebecca_Hwa1"", ""~Adriana_Kovashka1""]","[""Mingda Zhang"", ""Andrey Zhmoginov"", ""Andrew G. Howard"", ""Brendan Jou"", ""Yukun Zhu"", ""Li Zhang"", ""Rebecca Hwa"", ""Adriana Kovashka""]",[],"We present BasisNet which combines recent advancements in efficient neural network architectures, conditional computation, and early termination in a simple new form. Our approach uses a lightweight model to preview an image and generate input-dependent combination coefficients, which are later used to control the synthesis of a specialist model for making more accurate final prediction. The two-stage model synthesis strategy can be used with any network architectures and both stages can be jointly trained end to end. We validated BasisNet on ImageNet classification with MobileNets as backbone, and demonstrated clear advantage on accuracy-efficiency trade-off over strong baselines such as EfficientNet (Tan & Le, 2019), FBNetV3 (Dai et al., 2020) and OFA (Cai et al., 2019). Specifically, BasisNet-MobileNetV3 obtained 80.3% top-1 accuracy with only 290M Multiply-Add operations (MAdds), halving the computational cost of previous state-of-the-art without sacrificing accuracy. 
Besides, since the first-stage lightweight model can independently make predictions, inference can be terminated early if the prediction is sufficiently confident. With early termination, the average cost can be further reduced to 198M MAdds while maintaining accuracy of 80.0%.",/pdf/50461d4cc6645b96fb7e976a73d57d82346d40d2.pdf,ICLR,2021,Use two-stage model synthesis to generate input-dependent specialist model for making more accurate predictions on given inputs. +B1MB5oRqtQ,ByxzojoqKm,1538090000000.0,1545360000000.0,531,On-Policy Trust Region Policy Optimisation with Replay Buffers,"[""d.kangin@exeter.ac.uk"", ""n.pugeault@exeter.ac.uk""]","[""Dmitry Kangin"", ""Nicolas Pugeault""]","[""reinforcement learning"", ""on-policy learning"", ""trust region policy optimisation"", ""replay buffer""]","Building upon the recent success of deep reinforcement learning methods, we investigate the possibility of on-policy reinforcement learning improvement by reusing the data from several consecutive policies. On-policy methods bring many benefits, such as ability to evaluate each resulting policy. However, they usually discard all the information about the policies which existed before. In this work, we propose adaptation of the replay buffer concept, borrowed from the off-policy learning setting, to the on-policy algorithms. To achieve this, the proposed algorithm generalises the Q-, value and advantage functions for data from multiple policies. The method uses trust region optimisation, while avoiding some of the common problems of the algorithms such as TRPO or ACKTR: it uses hyperparameters to replace the trust region selection heuristics, as well as the trainable covariance matrix instead of the fixed one. In many cases, the method not only improves the results comparing to the state-of-the-art trust region on-policy learning algorithms such as ACKTR and TRPO, but also with respect to their off-policy counterpart DDPG. 
",/pdf/5045c3c06395d88e720fba1d362290a8287e7dd2.pdf,ICLR,2019,We investigate the theoretical and practical evidence of on-policy reinforcement learning improvement by reusing the data from several consecutive policies. +BkeK-nRcFX,BkxXlBn9Fm,1538090000000.0,1545360000000.0,1194,The Nonlinearity Coefficient - Predicting Generalization in Deep Neural Networks,"[""george.philipp@email.de"", ""jgc@cs.cmu.edu""]","[""George Philipp"", ""Jaime G. Carbonell""]","[""deep learning"", ""neural networks"", ""nonlinearity"", ""activation functions"", ""exploding gradients"", ""vanishing gradients"", ""neural architecture search""]","For a long time, designing neural architectures that exhibit high performance was considered a dark art that required expert hand-tuning. One of the few well-known guidelines for architecture design is the avoidance of exploding or vanishing gradients. However, even this guideline has remained relatively vague and circumstantial, because there exists no well-defined, gradient-based metric that can be computed {\it before} training begins and can robustly predict the performance of the network {\it after} training is complete. + +We introduce what is, to the best of our knowledge, the first such metric: the nonlinearity coefficient (NLC). Via an extensive empirical study, we show that the NLC, computed in the network's randomly initialized state, is a powerful predictor of test error and that attaining a right-sized NLC is essential for attaining an optimal test error, at least in fully-connected feedforward networks. The NLC is also conceptually simple, cheap to compute, and is robust to a range of confounders and architectural design choices that comparable metrics are not necessarily robust to. 
Hence, we argue the NLC is an important tool for architecture search and design, as it can robustly predict poor training outcomes before training even begins.",/pdf/29e1b79ef0a4db3aa16dd726618b428b8403491c.pdf,ICLR,2019,"We introduce the NLC, a metric that is cheap to compute in the networks randomly initialized state and is highly predictive of generalization, at least in fully-connected networks." +BkeMXR4KvS,HyeX7PS_vr,1569440000000.0,1577170000000.0,1030,DASGrad: Double Adaptive Stochastic Gradient,"[""kdgutier@cs.cmu.edu"", ""cchallu@cs.cmu.edu"", ""jinl2@cs.cmu.edu"", ""awd@cs.cmu.edu""]","[""Kin Gutierrez"", ""Cristian Challu"", ""Jin Li"", ""Artur Dubrawski""]","[""stochastic convex optimization"", ""adaptivity"", ""online learning"", ""transfer learning""]","Adaptive moment methods have been remarkably successful for optimization under the presence of high dimensional or sparse gradients, in parallel to this, adaptive sampling probabilities for SGD have allowed optimizers to improve convergence rates by prioritizing examples to learn efficiently. Numerous applications in the past have implicitly combined adaptive moment methods with adaptive probabilities yet the theoretical guarantees of such procedures have not been explored. We formalize double adaptive stochastic gradient methods DASGrad as an optimization technique and analyze its convergence improvements in a stochastic convex optimization setting, we provide empirical validation of our findings with convex and non convex objectives. We observe that the benefits of the method increase with the model complexity and variability of the gradients, and we explore the resulting utility in extensions to transfer learning. 
",/pdf/c0c82438cb73ba6e5946daba79b248bc67923cb7.pdf,ICLR,2020,Stochastic gradient descent with adaptive moments and adaptive probabilities +qoTcTS9-IZ-,Si7Ighn3N2H,1601310000000.0,1614990000000.0,1650,Dynamically Stable Infinite-Width Limits of Neural Classifiers,"[""~Eugene_Golikov1""]","[""Eugene Golikov""]","[""neural tangent kernel"", ""mean field limit""]","Recent research has been focused on two different approaches to studying neural networks training in the limit of infinite width (1) a mean-field (MF) and (2) a constant neural tangent kernel (NTK) approximations. These two approaches have different scaling of hyperparameters with the width of a network layer and as a result, different infinite-width limit models. Restricting ourselves to single hidden layer nets with zero-mean initialization trained for binary classification with SGD, we propose a general framework to study how the limit behavior of neural models depends on the scaling of hyperparameters with network width. Our framework allows us to derive scaling for existing MF and NTK limits, as well as an uncountable number of other scalings that lead to a dynamically stable limit behavior of corresponding models. However, only a finite number of distinct limit models are induced by these scalings. Each distinct limit model corresponds to a unique combination of such properties as boundedness of logits and tangent kernels at initialization or stationarity of tangent kernels. Existing MF and NTK limit models, as well as one novel limit model, satisfy most of the properties demonstrated by finite-width models. We also propose a novel initialization-corrected mean-field limit that satisfies all properties noted above, and its corresponding model is a simple modification for a finite-width model.",/pdf/e6882675c616788700fa89cd6038b3ad9f834515.pdf,ICLR,2021,"A framework that unifies both mean-field and NTK limits, and suggests other important limit models that have not been studied previously." 
+B1lfHhR9tm,SJe_6Cc5tX,1538090000000.0,1545360000000.0,1522,The Natural Language Decathlon: Multitask Learning as Question Answering,"[""bmccann@salesforce.com"", ""nkeskar@salesforce.com"", ""cxiong@salesforce.com"", ""rsocher@salesforce.com""]","[""Bryan McCann"", ""Nitish Shirish Keskar"", ""Caiming Xiong"", ""Richard Socher""]","[""multitask learning"", ""natural language processing"", ""question answering"", ""machine translation"", ""relation extraction"", ""semantic parsing"", ""commensense reasoning"", ""summarization"", ""entailment"", ""sentiment"", ""dialog""]","Deep learning has improved performance on many natural language processing (NLP) tasks individually. +However, general NLP models cannot emerge within a paradigm that focuses on the particularities of a single metric, dataset, and task. +We introduce the Natural Language Decathlon (decaNLP), a challenge that spans ten tasks: +question answering, machine translation, summarization, natural language inference, sentiment analysis, semantic role labeling, relation extraction, goal-oriented dialogue, semantic parsing, and commonsense pronoun resolution. +We cast all tasks as question answering over a context. +Furthermore, we present a new multitask question answering network (MQAN) that jointly learns all tasks in decaNLP without any task-specific modules or parameters more effectively than sequence-to-sequence and reading comprehension baselines. +MQAN shows improvements in transfer learning for machine translation and named entity recognition, domain adaptation for sentiment analysis and natural language inference, and zero-shot capabilities for text classification. +We demonstrate that the MQAN's multi-pointer-generator decoder is key to this success and that performance further improves with an anti-curriculum training strategy. +Though designed for decaNLP, MQAN also achieves state of the art results on the WikiSQL semantic parsing task in the single-task setting. 
+We also release code for procuring and processing data, training and evaluating models, and reproducing all experiments for decaNLP.",/pdf/901feb60fcffbcba4ee393030349abf5c7dde39b.pdf,ICLR,2019,We introduce a multitask learning challenge that spans ten natural language processing tasks and propose a new model that jointly learns them. +a5KvtsZ14ev,bYLRiwUCThF,1601310000000.0,1614990000000.0,904,SLAPS: Self-Supervision Improves Structure Learning for Graph Neural Networks,"[""~Bahare_Fatemi1"", ""~Seyed_Mehran_Kazemi1"", ""~Layla_El_Asri2""]","[""Bahare Fatemi"", ""Seyed Mehran Kazemi"", ""Layla El Asri""]","[""Graph Neural Networks"", ""Graph Representation Learning"", ""Graph Structure Learning"", ""Self-supervision""]","Graph neural networks (GNNs) work well when the graph structure is provided. However, this structure may not always be available in real-world applications. One solution to this problem is to infer the latent structure and then apply a GNN to the inferred graph. Unfortunately, the space of possible graph structures grows super-exponentially with the number of nodes and so the available node labels may be insufficient for learning both the structure and the GNN parameters. In this work, we propose the Simultaneous Learning of Adjacency and GNN Parameters with Self-supervision, or SLAPS, a method that provides more supervision for inferring a graph structure. This approach consists of training a denoising autoencoder GNN in parallel with the task-specific GNN. The autoencoder is trained to reconstruct the initial node features given noisy node features as well as a structure provided by a learnable graph generator. We explore the design space of SLAPS by comparing different graph generation and symmetrization approaches. 
A comprehensive experimental study demonstrates that SLAPS scales to large graphs with hundreds of thousands of nodes and outperforms several models that have been proposed to learn a task-specific graph structure on established benchmarks.",/pdf/55d1f5ba0b72f1d47de809ad4eb8e6b16a607fc7.pdf,ICLR,2021,Self-Supervision Improves Structure Learning for Graph Neural Networks. +ryeN5aEYDH,BJlus5TvwH,1569440000000.0,1577170000000.0,703,"Deep RL for Blood Glucose Control: Lessons, Challenges, and Opportunities","[""ifox@umich.edu"", ""joyclee@med.umich.edu"", ""rpbusui@umich.edu"", ""wiensj@umich.edu""]","[""Ian Fox"", ""Joyce Lee"", ""Rodica Busui"", ""Jenna Wiens""]","[""Deep Reinforcement Learning"", ""Diabetes"", ""Artificial Pancreas"", ""Control""]","Individuals with type 1 diabetes (T1D) lack the ability to produce the insulin their bodies need. As a result, they must continually make decisions about how much insulin to self-administer in order to adequately control their blood glucose levels. Longitudinal data streams captured from wearables, like continuous glucose monitors, can help these individuals manage their health, but currently the majority of the decision burden remains on the user. To relieve this burden, researchers are working on closed-loop solutions that combine a continuous glucose monitor and an insulin pump with a control algorithm in an `artificial pancreas.' Such systems aim to estimate and deliver the appropriate amount of insulin. Here, we develop reinforcement learning (RL) techniques for automated blood glucose control. Through a series of experiments, we compare the performance of different deep RL approaches to non-RL approaches. We highlight the flexibility of RL approaches, demonstrating how they can adapt to new individuals with little additional data. 
On over 21k hours of simulated data across 30 patients, RL approaches outperform baseline control algorithms (increasing time spent in normal glucose range from 71% to 75%) without requiring meal announcements. Moreover, these approaches are adept at leveraging latent behavioral patterns (increasing time in range from 58% to 70%). This work demonstrates the potential of deep RL for controlling complex physiological systems with minimal expert knowledge. ",/pdf/d814d32e297195a9b7c4f368e2e79bc551ca2264.pdf,ICLR,2020,We develop a deep reinforcement learning algorithm to control blood glucose in people with diabetes. +rygqqsA9KX,SkgLGtG_KQ,1538090000000.0,1550760000000.0,556,Learning Factorized Multimodal Representations,"[""yaohungt@cs.cmu.edu"", ""pliang@cs.cmu.edu"", ""abagherz@cs.cmu.edu"", ""morency@cs.cmu.edu"", ""rsalakhu@cs.cmu.edu""]","[""Yao-Hung Hubert Tsai"", ""Paul Pu Liang"", ""Amir Zadeh"", ""Louis-Philippe Morency"", ""Ruslan Salakhutdinov""]","[""multimodal learning"", ""representation learning""]","Learning multimodal representations is a fundamentally complex research problem due to the presence of multiple heterogeneous sources of information. Although the presence of multiple modalities provides additional valuable information, there are two key challenges to address when learning from multimodal data: 1) models must learn the complex intra-modal and cross-modal interactions for prediction and 2) models must be robust to unexpected missing or noisy modalities during testing. In this paper, we propose to optimize for a joint generative-discriminative objective across multimodal data and labels. We introduce a model that factorizes representations into two sets of independent factors: multimodal discriminative and modality-specific generative factors. Multimodal discriminative factors are shared across all modalities and contain joint multimodal features required for discriminative tasks such as sentiment prediction. 
Modality-specific generative factors are unique for each modality and contain the information required for generating data. Experimental results show that our model is able to learn meaningful multimodal representations that achieve state-of-the-art or competitive performance on six multimodal datasets. Our model demonstrates flexible generative capabilities by conditioning on independent factors and can reconstruct missing modalities without significantly impacting performance. Lastly, we interpret our factorized representations to understand the interactions that influence multimodal learning.",/pdf/9790b634500ae4876760aeabbef9e853f4c00342.pdf,ICLR,2019,"We propose a model to learn factorized multimodal representations that are discriminative, generative, and interpretable." +PP4KyAaBoBK,Vn_-JhSepH,1601310000000.0,1614990000000.0,376,Human Perception-based Evaluation Criterion for Ultra-high Resolution Cell Membrane Segmentation,"[""~Ruohua_Shi1"", ""~Wenyao_Wang1"", ""~Zhixuan_Li1"", ""~Liuyuan_He1"", ""~Kaiwen_Sheng1"", ""~Lei_Ma3"", ""~Kai_Du1"", ""~Tingting_Jiang2"", ""~Tiejun_Huang1""]","[""Ruohua Shi"", ""Wenyao Wang"", ""Zhixuan Li"", ""Liuyuan He"", ""Kaiwen Sheng"", ""Lei Ma"", ""Kai Du"", ""Tingting Jiang"", ""Tiejun Huang""]","[""Neuroscience"", ""Connectomics"", ""Human perception"", ""EM dataset"", ""Membrane segmentation"", ""Evaluation criterion""]","Computer vision technology is widely used in biological and medical data analysis and understanding. However, there are still two major bottlenecks in the field of cell membrane segmentation, which seriously hinder further research: lack of sufficient high-quality data and lack of suitable evaluation criteria. In order to solve these two problems, this paper first introduces an Ultra-high Resolution Image Segmentation dataset for the Cell membrane, called U-RISC, the largest annotated EM dataset for the Cell membrane with multiple iterative annotations and uncompressed high-resolution raw data. 
During the analysis process of the U-RISC, we found that the current popular segmentation evaluation criteria are inconsistent with human perception. This interesting phenomenon is confirmed by a subjective experiment involving twenty people. Furthermore, to resolve this inconsistency, we propose a Perceptual Hausdorff Distance (PHD) evaluation criterion to measure the quality of cell membrane segmentation results. Detailed performance comparison and discussion of classic segmentation methods along with two iterative manual annotation results under existing criteria and PHD is given.",/pdf/aec4f250036d0992e59c646ff88bd8c3aac54c25.pdf,ICLR,2021,We established the largest annotated ultra-high resolution EM dataset for the cell membrane with multiple iterative annotations and propose a perceptual-based evaluation criterion to measure the quality of cell membrane segmentation results. +RrSuwzJfMQN,7JlTinQI2Qh,1601310000000.0,1614990000000.0,2405,TOWARDS NATURAL ROBUSTNESS AGAINST ADVERSARIAL EXAMPLES,"[""~Haoyu_Chu1"", ""~Shikui_Wei1"", ""~Yao_Zhao1""]","[""Haoyu Chu"", ""Shikui Wei"", ""Yao Zhao""]",[],"Recent studies have shown that deep neural networks are vulnerable to adversarial examples, but most of the methods proposed to defense adversarial examples cannot solve this problem fundamentally. In this paper, we theoretically prove that there is an upper bound for neural networks with identity mappings to constrain the error caused by adversarial noises. However, in actual computations, this kind of neural network no longer holds any upper bound and is therefore susceptible to adversarial examples. Following similar procedures, we explain why adversarial examples can fool other deep neural networks with skip connections. Furthermore, we demonstrate that a new family of deep neural networks called Neural ODEs (Chen et al., 2018) holds a weaker upper bound. This weaker upper bound prevents the amount of change in the result from being too large. 
Thus, Neural ODEs have natural robustness against adversarial examples. We evaluate the performance of Neural ODEs compared with ResNet under three white-box adversarial attacks (FGSM, PGD, DI2-FGSM) and one black-box adversarial attack (Boundary Attack). Finally, we show that the natural robustness of Neural ODEs is even better than the robustness of neural networks that are trained with adversarial training methods, such as TRADES and YOPO. +",/pdf/e9ec4f6b45b25368661874758e04810ba96f0f37.pdf,ICLR,2021, +rylztAEYvr,H1lopf_Ovr,1569440000000.0,1577170000000.0,1240,Iterative Target Augmentation for Effective Conditional Generation,"[""yangk@berkeley.edu"", ""wengong@csail.mit.edu"", ""swansonk.14@gmail.com"", ""regina@csail.mit.edu"", ""tommi@csail.mit.edu""]","[""Kevin Yang"", ""Wengong Jin"", ""Kyle Swanson"", ""Regina Barzilay"", ""Tommi Jaakkola""]","[""data augmentation"", ""generative models"", ""self-training"", ""molecular optimization"", ""program synthesis""]","Many challenging prediction problems, from molecular optimization to program synthesis, involve creating complex structured objects as outputs. However, available training data may not be sufficient for a generative model to learn all possible complex transformations. By leveraging the idea that evaluation is easier than generation, we show how a simple, broadly applicable, iterative target augmentation scheme can be surprisingly effective in guiding the training and use of such models. Our scheme views the generative model as a prior distribution, and employs a separately trained filter as the likelihood. In each augmentation step, we filter the model's outputs to obtain additional prediction targets for the next training epoch. Our method is applicable in the supervised as well as semi-supervised settings. We demonstrate that our approach yields significant gains over strong baselines both in molecular optimization and program synthesis. 
In particular, our augmented model outperforms the previous state-of-the-art in molecular optimization by over 10% in absolute gain. ",/pdf/11b9b10ad2f1fcefd6bd62f91f4f87aa759b7451.pdf,ICLR,2020,We improve generative models by proposing a meta-algorithm that filters new training data from the model's outputs. +Syl6tjAqKX,rJl7mPQ9FX,1538090000000.0,1545360000000.0,487,BEHAVIOR MODULE IN NEURAL NETWORKS,"[""asakryukin@u.nus.edu"", ""yongkang.wong@nus.edu.sg"", ""mohan@comp.nus.edu.sg""]","[""Andrey Sakryukin"", ""Yongkang Wong"", ""Mohan S. Kankanhalli""]","[""Modular Networks"", ""Reinforcement Learning"", ""Task Separation"", ""Representation Learning"", ""Transfer Learning"", ""Adversarial Transfer""]","Prefrontal cortex (PFC) is a part of the brain which is responsible for behavior repertoire. Inspired by PFC functionality and connectivity, as well as human behavior formation process, we propose a novel modular architecture of neural networks with a Behavioral Module (BM) and corresponding end-to-end training strategy. This approach allows the efficient learning of behaviors and preferences representation. This property is particularly useful for user modeling (as for dialog agents) and recommendation tasks, as allows learning personalized representations of different user states. In the experiment with video games playing, the results show that the proposed method allows separation of main task’s objectives and behaviors between different BMs. The experiments also show network extendability through independent learning of new behavior patterns. Moreover, we demonstrate a strategy for an efficient transfer of newly learned BMs to unseen tasks.",/pdf/d91365d674d7c71e62f89e9dfa7cc48366e48640.pdf,ICLR,2019,Extendable Modular Architecture is proposed for developing of variety of Agent Behaviors in DQN. 
+HJx9EhC9tQ,rJgerzR9Km,1538090000000.0,1546640000000.0,1474, Reasoning About Physical Interactions with Object-Oriented Prediction and Planning,"[""janner@berkeley.edu"", ""svlevine@eecs.berkeley.edu"", ""billf@mit.edu"", ""jbt@mit.edu"", ""cbfinn@eecs.berkeley.edu"", ""jiajunwu@mit.edu""]","[""Michael Janner"", ""Sergey Levine"", ""William T. Freeman"", ""Joshua B. Tenenbaum"", ""Chelsea Finn"", ""Jiajun Wu""]","[""structured scene representation"", ""predictive models"", ""intuitive physics"", ""self-supervised learning""]","Object-based factorizations provide a useful level of abstraction for interacting with the world. Building explicit object representations, however, often requires supervisory signals that are difficult to obtain in practice. We present a paradigm for learning object-centric representations for physical scene understanding without direct supervision of object properties. Our model, Object-Oriented Prediction and Planning (O2P2), jointly learns a perception function to map from image observations to object representations, a pairwise physics interaction function to predict the time evolution of a collection of objects, and a rendering function to map objects back to pixels. For evaluation, we consider not only the accuracy of the physical predictions of the model, but also its utility for downstream tasks that require an actionable representation of intuitive physics. After training our model on an image prediction task, we can use its learned representations to build block towers more complicated than those observed during training.",/pdf/c3d330a568fa726d681ecc56de656382ee31f638.pdf,ICLR,2019,We present a framework for learning object-centric representations suitable for planning in tasks that require an understanding of physics. 
+cmcwUBKeoUH,#NAME?,1601310000000.0,1614990000000.0,2761,Learning Blood Oxygen from Respiration Signals,"[""~Hao_He1"", ""~Ying-Cong_Chen1"", ""~Yuan_Yuan5"", ""~Dina_Katabi1""]","[""Hao He"", ""Ying-Cong Chen"", ""Yuan Yuan"", ""Dina Katabi""]","[""healthcare"", ""medical application""]","Monitoring blood oxygen is critical in a variety of medical conditions. For almost a century, pulse oximetry has been the only non-invasive method for measuring blood oxygen. While highly useful, pulse oximetry has important limitations. It requires wearable sensors, which can be cumbersome for older patients. It is also known to be biased when used for dark-skinned subjects. In this paper, we demonstrate, for the first time, the feasibility of predicting oxygen saturation from breathing. By eliminating the dependency on oximetry, we eliminate bias against skin color. Further, since breathing can be monitored without body contact by analyzing the radio signal in the environment, we show that oxygen too can be monitored without any wearable devices. We introduce a new approach for leveraging auxiliary variables via a switcher-based multi-headed neural network model. Empirical results show that our model achieves good accuracy on multiple medical datasets.",/pdf/d55ed684de91b28eaff2e3043afaabf2dde3aa4b.pdf,ICLR,2021, +BygKZkBtDH,H1xG42juPS,1569440000000.0,1577170000000.0,1551,Balancing Cost and Benefit with Tied-Multi Transformers,"[""raj.dabre@nict.go.jp"", ""raphael.rubino@nict.go.jp"", ""fujita@paraphrasing.org""]","[""Raj Dabre"", ""Raphael Rubino"", ""Atsushi Fujita""]","[""tied models"", ""encoder-decoder"", ""multi-layer softmaxing"", ""depth prediction"", ""model compression""]","This paper proposes a novel procedure for training multiple Transformers with tied parameters which compresses multiple models into one enabling the dynamic choice of the number of encoder and decoder layers during decoding. 
In sequence-to-sequence modeling, typically, the output of the last layer of the N-layer encoder is fed to the M-layer decoder, and the output of the last decoder layer is used to compute loss. Instead, our method computes a single loss consisting of NxM losses, where each loss is computed from the output of one of the M decoder layers connected to one of the N encoder layers. A single model trained by our method subsumes multiple models with different number of encoder and decoder layers, and can be used for decoding with fewer than the maximum number of encoder and decoder layers. We then propose a mechanism to choose a priori the number of encoder and decoder layers for faster decoding, and also explore recurrent stacking of layers and knowledge distillation to enable further parameter reduction. In a case study of neural machine translation, we present a cost-benefit analysis of the proposed approaches and empirically show that they greatly reduce decoding costs while preserving translation quality.",/pdf/05fb4d5a21ba747925e45613fde0f5b1b0a4e7f9.pdf,ICLR,2020,"Training multiple transformers with tied parameters, depth selection, and further compression" +fkhl7lb3aw,sedO72grJ1t,1601310000000.0,1614990000000.0,1413,ROGA: Random Over-sampling Based on Genetic Algorithm,"[""~ZONGDA_HAN1"", ""qiaoxq@bupt.edu.cn"", ""zhanshubo@cincc.cn""]","[""ZONGDA HAN"", ""XIUQUAN QIAO"", ""SHUBO ZHAN""]","[""class imbalance"", ""over-sampling"", ""genetic algorithm""]","When using machine learning to solve practical tasks, we often face the problem of class imbalance. Unbalanced classes will cause the model to generate preferences during the learning process, thereby ignoring classes with fewer samples. The oversampling algorithm achieves the purpose of balancing the difference in quantity by generating a minority of samples. The quality of the artificial samples determines the impact of the oversampling algorithm on model training. 
Therefore, a challenge of the oversampling algorithm is how to find a suitable sample generation space. However, too strong conditional constraints can make the generated samples as non-noise points as possible, but at the same time they also limit the search space of the generated samples, which is not conducive to the discovery of better-quality new samples. Therefore, based on this problem, we propose an oversampling algorithm ROGA based on genetic algorithm. Based on random sampling, new samples are gradually generated and the samples that may become noise are filtered out. ROGA can ensure that the sample generation space is as wide as possible, and it can also reduce the noise samples generated. By verifying on multiple datasets, ROGA can achieve a good result.",/pdf/5994396e56e0aab87dd5bfbe5d22dd85c26a116f.pdf,ICLR,2021, +dmCL033_YwO,XTHkpCnSNTY,1601310000000.0,1614990000000.0,258,DeeperGCN: Training Deeper GCNs with Generalized Aggregation Functions,"[""~Guohao_Li1"", ""chenxin.xiong@kaust.edu.sa"", ""~Ali_Thabet1"", ""~Bernard_Ghanem1""]","[""Guohao Li"", ""Chenxin Xiong"", ""Ali Thabet"", ""Bernard Ghanem""]","[""Graph Neural Networks"", ""Graph Representation Learning""]","Graph Convolutional Networks (GCNs) have been drawing significant attention with the power of representation learning on graphs. Recent works developed frameworks to train deep GCNs. Such works show impressive results in tasks like point cloud classification and segmentation, and protein interaction prediction. In this work, we study the performance of such deep models in large scale graph datasets from the Open Graph Benchmark (OGB). In particular, we look at the effect of adequately choosing an aggregation function, and its effect on final performance. Common choices of aggregation are mean, max, and sum. It has shown that GCNs are sensitive to such aggregations when applied to different datasets. 
We further validate this point and propose to alleviate it by introducing a novel Generalized Aggregation Function. Our new aggregation not only covers all commonly used ones, but also can be tuned to learn customized functions for different tasks. Our generalized aggregation is fully differentiable, and thus its parameters can be learned in an end-to-end fashion. We add our generalized aggregation into a deep GCN framework and show it achieves state-of-the-art results in six benchmarks from OGB.",/pdf/60cf9f613a7702c9a91a65e2f83dbda97742d2d1.pdf,ICLR,2021,"This paper proposes DeeperGCN that is capable of successfully and reliably training very deep GCNs. We define differentiable generalized aggregation functions to unify different message aggregation operations (e.g. mean, max and sum)." +HJTXaw9gx,,1478290000000.0,1484310000000.0,441,Recursive Regression with Neural Networks: Approximating the HJI PDE Solution,"[""vrubies@berkeley.edu"", ""tomlin@berkeley.edu""]","[""Vicen\u00e7 Rubies Royo"", ""Claire Tomlin""]","[""Supervised Learning"", ""Games"", ""Theory""]","Most machine learning applications using neural networks seek to approximate some function g(x) by minimizing some cost criterion. In the simplest case, if one has access to pairs of the form (x, y) where y = g(x), the problem can be framed as a regression problem. Beyond this family of problems, we find many cases where the unavailability of data pairs makes this approach unfeasible. However, similar to what we find in the reinforcement learning literature, if we have some known properties of the function we are seeking to approximate, there is still hope to frame the problem as a regression problem. In this context, we present an algorithm that approximates the solution to a partial differential equation known as the Hamilton-Jacobi-Isaacs PDE and compare it to current state of the art tools. 
This PDE, which is found in the fields of control theory and robotics, is of particular importance in safety critical systems where guarantees of performance are a must.",/pdf/3a69364c022247e1d0a905f593408784f7a72f41.pdf,ICLR,2017,A neural network that learns an approximation to a function by generating its own regression points +PAsd7_vP4_,#NAME?,1601310000000.0,1614990000000.0,956,Adaptive Discretization for Continuous Control using Particle Filtering Policy Network,"[""~Pei_Xu1"", ""~Ioannis_Karamouzas1""]","[""Pei Xu"", ""Ioannis Karamouzas""]","[""Reinforcement Learning"", ""Continuous Control"", ""Action Space Discretization"", ""Policy Gradient""]","Controlling the movements of highly articulated agents and robots has been a long-standing challenge to model-free deep reinforcement learning. In this paper, we propose a simple, yet general, framework for improving the performance of policy gradient algorithms by discretizing the continuous action space. Instead of using a fixed set of predetermined atomic actions, we exploit particle filtering to adaptively discretize actions during training and track the posterior policy represented as a mixture distribution. The resulting policy can replace the original continuous policy of any given policy gradient algorithm without changing its underlying model architecture. We demonstrate the applicability of our approach to state-of-the-art on-policy and off-policy baselines in challenging control tasks. Baselines using our particle-based policies achieve better final performance and speed of convergence as compared to corresponding continuous implementations and implementations that rely on fixed discretization schemes. 
",/pdf/8c7ede4a1d057dc44772b9d9541fd84e538ee63e.pdf,ICLR,2021, +B1VZqjAcYX,S1lk-PFqKm,1538090000000.0,1550820000000.0,512,SNIP: SINGLE-SHOT NETWORK PRUNING BASED ON CONNECTION SENSITIVITY,"[""namhoon@robots.ox.ac.uk"", ""ajanthan@robots.ox.ac.uk"", ""phst@robots.ox.ac.uk""]","[""Namhoon Lee"", ""Thalaiyasingam Ajanthan"", ""Philip Torr""]","[""neural network pruning"", ""connection sensitivity""]","Pruning large neural networks while maintaining their performance is often desirable due to the reduced space and time complexity. In existing methods, pruning is done within an iterative optimization procedure with either heuristically designed pruning schedules or additional hyperparameters, undermining their utility. In this work, we present a new approach that prunes a given network once at initialization prior to training. To achieve this, we introduce a saliency criterion based on connection sensitivity that identifies structurally important connections in the network for the given task. This eliminates the need for both pretraining and the complex pruning schedule while making it robust to architecture variations. After pruning, the sparse network is trained in the standard way. Our method obtains extremely sparse networks with virtually the same accuracy as the reference network on the MNIST, CIFAR-10, and Tiny-ImageNet classification tasks and is broadly applicable to various architectures including convolutional, residual and recurrent networks. Unlike existing methods, our approach enables us to demonstrate that the retained connections are indeed relevant to the given task.",/pdf/40fcaa7594d278d40bfb669c13b0f4ae0fe22272.pdf,ICLR,2019,"We present a new approach, SNIP, that is simple, versatile and interpretable; it prunes irrelevant connections for a given task at single-shot prior to training and is applicable to a variety of neural network models without modifications." 
+KTlJT1nof6d,7ebrGJpT7kO,1601310000000.0,1618800000000.0,1021,Initialization and Regularization of Factorized Neural Layers,"[""~Mikhail_Khodak1"", ""~Neil_A._Tenenholtz1"", ""~Lester_Mackey1"", ""~Nicolo_Fusi1""]","[""Mikhail Khodak"", ""Neil A. Tenenholtz"", ""Lester Mackey"", ""Nicolo Fusi""]","[""model compression"", ""knowledge distillation"", ""multi-head attention"", ""matrix factorization""]","Factorized layers—operations parameterized by products of two or more matrices—occur in a variety of deep learning contexts, including compressed model training, certain types of knowledge distillation, and multi-head self-attention architectures. We study how to initialize and regularize deep nets containing such layers, examining two simple, understudied schemes, spectral initialization and Frobenius decay, for improving their performance. The guiding insight is to design optimization routines for these networks that are as close as possible to that of their well-tuned, non-decomposed counterparts; we back this intuition with an analysis of how the initialization and regularization schemes impact training with gradient descent, drawing on modern attempts to understand the interplay of weight-decay and batch-normalization. Empirically, we highlight the benefits of spectral initialization and Frobenius decay across a variety of settings. In model compression, we show that they enable low-rank methods to significantly outperform both unstructured sparsity and tensor methods on the task of training low-memory residual networks; analogs of the schemes also improve the performance of tensor decomposition techniques. For knowledge distillation, Frobenius decay enables a simple, overcomplete baseline that yields a compact model from over-parameterized training without requiring retraining with or pruning a teacher network. 
Finally, we show how both schemes applied to multi-head attention lead to improved performance on both translation and unsupervised pre-training.",/pdf/d22bc639b3e05ed1c862db4eaa41726d1f24406d.pdf,ICLR,2021,"Principled initialization and regularization of factorized neural layers leads to strong performance in compression, knowledge distillation, and language modeling tasks." +BJg9hTNKPH,HyevZVxuwS,1569440000000.0,1577170000000.0,792,Behavior Regularized Offline Reinforcement Learning,"[""yw4@andrew.cmu.edu"", ""gjt@google.com"", ""ofirnachum@google.com""]","[""Yifan Wu"", ""George Tucker"", ""Ofir Nachum""]","[""reinforcement learning"", ""offline RL"", ""batch RL""]","In reinforcement learning (RL) research, it is common to assume access to direct online interactions with the environment. However in many real-world applications, access to the environment is limited to a fixed offline dataset of logged experience. In such settings, standard RL algorithms have been shown to diverge or otherwise yield poor performance. Accordingly, much recent work has suggested a number of remedies to these issues. In this work, we introduce a general framework, behavior regularized actor critic (BRAC), to empirically evaluate recently proposed methods as well as a number of simple baselines across a variety of offline continuous control tasks. Surprisingly, we find that many of the technical complexities introduced in recent methods are unnecessary to achieve strong performance. 
Additional ablations provide insights into which design choices matter most in the offline RL setting.",/pdf/1a36e0b65fc49a03f961dfc9c01ffc644ec62770.pdf,ICLR,2020, +nkap3LV7t7O,6mFD9HY_cW,1601310000000.0,1614990000000.0,1780,Simple and Effective VAE Training with Calibrated Decoders,"[""~Oleh_Rybkin1"", ""~Kostas_Daniilidis1"", ""~Sergey_Levine1""]","[""Oleh Rybkin"", ""Kostas Daniilidis"", ""Sergey Levine""]","[""variational autoencoders"", ""\u03b2-VAE"", ""representation learning""]","Variational autoencoders (VAEs) provide an effective and simple method for modeling complex distributions. However, training VAEs often requires considerable hyperparameter tuning to determine the optimal amount of information retained by the latent variable. We study the impact of calibrated decoders, which learn the uncertainty of the decoding distribution and can determine this amount of information automatically, on the VAE performance. While many methods for learning calibrated decoders have been proposed, many of the recent papers that employ VAEs rely on heuristic hyperparameters and ad-hoc modifications instead. We perform the first comprehensive comparative analysis of calibrated decoder and provide recommendations for simple and effective VAE training. Our analysis covers a range of datasets and several single-image and sequential VAE models. We further propose a simple but novel modification to the commonly used Gaussian decoder, which computes the prediction variance analytically. We observe empirically that using heuristic modifications is not necessary with our method.",/pdf/b9be409d6542513773a3f2ffdf7eb14574d45a90.pdf,ICLR,2021,We analyze calibrated decoders for VAE training and provide recommendations for simple and effective training without heuristic hyperparameters. 
+Sye7qoC5FQ,Ske4c2m5KX,1538090000000.0,1545360000000.0,519,Adversarial Attacks on Node Embeddings,"[""a.bojchevski@in.tum.de"", ""guennemann@in.tum.de""]","[""Aleksandar Bojchevski"", ""Stephan G\u00fcnnemann""]","[""node embeddings"", ""adversarial attacks""]","The goal of network representation learning is to learn low-dimensional node embeddings that capture the graph structure and are useful for solving downstream tasks. However, despite the proliferation of such methods there is currently no study of their robustness to adversarial attacks. We provide the first adversarial vulnerability analysis on the widely used family of methods based on random walks. We derive efficient adversarial perturbations that poison the network structure and have a negative effect on both the quality of the embeddings and the downstream tasks. We further show that our attacks are transferable since they generalize to many models, and are successful even when the attacker is restricted.",/pdf/16db6611240ec656cbadc38cd186bd0fdaf7796a.pdf,ICLR,2019,Adversarial attacks on unsupervised node embeddings based on eigenvalue perturbation theory. +OMizHuea_HB,0s61BvbDD-M,1601310000000.0,1615960000000.0,249,Active Contrastive Learning of Audio-Visual Video Representations,"[""~Shuang_Ma3"", ""~Zhaoyang_Zeng1"", ""~Daniel_McDuff1"", ""~Yale_Song1""]","[""Shuang Ma"", ""Zhaoyang Zeng"", ""Daniel McDuff"", ""Yale Song""]","[""self-supervised learning"", ""contrastive representation learning"", ""active learning"", ""audio-visual representation"", ""video recognition""]","Contrastive learning has been shown to produce generalizable representations of audio and visual data by maximizing the lower bound on the mutual information (MI) between different views of an instance. However, obtaining a tight lower bound requires a sample size exponential in MI and thus a large set of negative samples. 
We can incorporate more samples by building a large queue-based dictionary, but there are theoretical limits to performance improvements even with a large number of negative samples. We hypothesize that random negative sampling leads to a highly redundant dictionary that results in suboptimal representations for downstream tasks. In this paper, we propose an active contrastive learning approach that builds an actively sampled dictionary with diverse and informative items, which improves the quality of negative samples and improves performances on tasks where there is high mutual information in the data, e.g., video classification. Our model achieves state-of-the-art performance on challenging audio and visual downstream benchmarks including UCF101, HMDB51 and ESC50. ",/pdf/a696a70ba651de1c50d0a72fea6dc1c20d8bb37a.pdf,ICLR,2021,We propose an active learning approach to improve negative sampling for contrastive learning and demonstrate it on learning audio-visual representations from videos. +fStMpzKkjMT,da_yXX6hz5,1601310000000.0,1614990000000.0,1681,Why Does Decentralized Training Outperform Synchronous Training In The Large Batch Setting?,"[""~Wei_Zhang33"", ""~Mingrui_Liu2"", ""~Yu_Feng3"", ""~Brian_Kingsbury1"", ""~Yuhai_Tu1""]","[""Wei Zhang"", ""Mingrui Liu"", ""Yu Feng"", ""Brian Kingsbury"", ""Yuhai Tu""]","[""Decentralized"", ""Distributed Deep Learning"", ""Large Batch""]","Distributed Deep Learning (DDL) is essential for large-scale Deep Learning (DL) training. Using a sufficiently large batch size is critical to achieving DDL runtime speedup. In a large batch setting, the learning rate must be increased to compensate for the reduced number of parameter updates. However, a large batch size may converge to sharp minima with poor generalization, and a large learning rate may harm convergence. Synchronous Stochastic Gradient Descent (SSGD) is the de facto DDL optimization method. 
Recently, Decentralized Parallel SGD (DPSGD) has been proven to achieve a similar convergence rate as SGD and to guarantee linear speedup for non-convex optimization problems. While there was anecdotal evidence that DPSGD outperforms SSGD in the large-batch setting, no systematic study has been conducted to explain why this is the case. Based on a detailed analysis of the DPSGD learning dynamics, we find that DPSGD introduces additional landscape-dependent noise, which has two benefits in the large-batch setting: 1) it automatically adjusts the learning rate to improve convergence; 2) it enhances weight space search by escaping local traps (e.g., saddle points) to find flat minima with better generalization. We conduct extensive studies over 12 state-of-the-art DL models/tasks and demonstrate that DPSGD consistently outperforms SSGD in the large batch setting; + and DPSGD converges in cases where SSGD diverges for large learning rates. Our findings are consistent across different application domains, Computer Vision and Automatic Speech Recognition, and different neural network models, Convolutional Neural Networks and Long Short-Term Memory Recurrent Neural Networks.",/pdf/53d7d372a7bb02c2406d68240acaf97006c81350.pdf,ICLR,2021,The inherent system noise in decentralized distributed training can improve generalization in large batch setting compared to the synchronous training. 
+amRmtfpYgDt,BHFsKiMtYk3F,1601310000000.0,1614990000000.0,2969,Regioned Episodic Reinforcement Learning,"[""~Jiarui_Jin1"", ""chenconglzh@sjtu.edu.cn"", ""~Ming_Zhou2"", ""~Weinan_Zhang1"", ""~Rasool_Fakoor1"", ""~David_Wipf1"", ""~Yong_Yu1"", ""~Jun_Wang2"", ""~Alex_Smola1""]","[""Jiarui Jin"", ""Cong Chen"", ""Ming Zhou"", ""Weinan Zhang"", ""Rasool Fakoor"", ""David Wipf"", ""Yong Yu"", ""Jun Wang"", ""Alex Smola""]","[""Deep Reinforcement Learning"", ""Episodic Memory"", ""Sample Efficiency""]","Goal-oriented reinforcement learning algorithms are often good at exploration, not exploitation, while episodic algorithms excel at exploitation, not exploration. As a result, neither of these approaches alone can lead to a sample efficient algorithm in complex environments with high dimensional state space and delayed rewards. Motivated by these observations and shortcomings, in this paper, we introduce Regioned Episodic Reinforcement Learning (RERL) that combines the episodic and goal-oriented learning strengths and leads to a more sample efficient and effective algorithm. RERL achieves this by decomposing the space into several sub-space regions and constructing regions that lead to more effective exploration and high values trajectories. Extensive experiments on various benchmark tasks show that RERL outperforms existing methods in terms of sample efficiency and final rewards.",/pdf/b572b4bcfce811cd81bcbfb6f92582268f2a3af7.pdf,ICLR,2021,"In this paper, we introduce Regioned Episodic Reinforcement Learning (RERL) that combines the strengths of episodic and goal-oriented learning to effectively solve tasks with delayed feedbacks and high-dimensional observations." 
+4f04RAhMUo6,AVAe6ITfGsg,1601310000000.0,1614990000000.0,2970,PODS: Policy Optimization via Differentiable Simulation,"[""~Miguel_Angel_Zamora_Mora1"", ""mpeychev@ethz.ch"", ""sehoon.ha@gmail.com"", ""~Martin_Vechev1"", ""~Stelian_Coros1""]","[""Miguel Angel Zamora Mora"", ""Momchil Peychev"", ""Sehoon Ha"", ""Martin Vechev"", ""Stelian Coros""]","[""Reinforcement Learning"", ""Decision and Control"", ""Planning"", ""Robotics.""]","Current reinforcement learning (RL) methods use simulation models as simple black-box oracles. In this paper, with the goal of improving the performance exhibited by RL algorithms, we explore a systematic way of leveraging the additional information provided by an emerging class of differentiable simulators. Building on concepts established by Deterministic Policy Gradients (DPG) methods, the neural network policies learned with our approach represent deterministic actions. In a departure from standard methodologies, however, learning these policy does not hinge on approximations of the value function that must be learned concurrently in an actor-critic fashion. Instead, we exploit differentiable simulators to directly compute the analytic gradient of a policy's value function with respect to the actions it outputs. This, in turn, allows us to efficiently perform locally optimal policy improvement iterations. 
Compared against other state-of-the-art RL methods, we show that with minimal hyper-parameter tuning our approach consistently leads to better asymptotic behavior across a set of payload manipulation tasks that demand high precision.",/pdf/cbae60636a77b5ce5c399e9a70540cafdd7d5bfb.pdf,ICLR,2021,Policy Optimization via Differentiable Simulation +BkgUB1SYPS,HJxvMf6uwS,1569440000000.0,1577170000000.0,1692,Interpretable Network Structure for Modeling Contextual Dependency,"[""xindianma@tju.edu.cn"", ""pzhang@tju.edu.cn"", ""xiaoliumao@tju.edu.cn"", ""yehua_zhang@tju.edu.cn"", ""nanduan@microsoft.com"", ""yxhou@tju.edu.cn"", ""mingzhou@microsoft.com""]","[""Xindian Ma"", ""Peng Zhang"", ""Xiaoliu Mao"", ""Yehua Zhang"", ""Nan Duan"", ""Yuexian Hou"", ""Ming Zhou.""]","[""Language Model"", ""Recurrent Neural Network"", ""Separation Rank""]","Neural language models have achieved great success in many NLP tasks, to a large extent, due to the ability to capture contextual dependencies among terms in a text. While many efforts have been devoted to empirically explain the connection between the network hyperparameters and the ability to represent the contextual dependency, the theoretical analysis is relatively insufficient. Inspired by the recent research on the use of tensor space to explain the neural network architecture, we explore the interpretable mechanism for neural language models. Specifically, we define the concept of separation rank in the language modeling process, in order to theoretically measure the degree of contextual dependencies in a sentence. Then, we show that the lower bound of such a separation rank can reveal the quantitative relation between the network structure (e.g. depth/width) and the modeling ability for the contextual dependency. Especially, increasing the depth of the neural network can be more effective to improve the ability of modeling contextual dependency. 
Therefore, it is important to design an adaptive network to compute the adaptive depth in a task. Inspired by Adaptive Computation Time (ACT), we design an adaptive recurrent network based on the separation rank to model contextual dependency. Experiments on various NLP tasks have verified the proposed theoretical analysis. We also test our adaptive recurrent neural network in the sentence classification task, and the experiments show that it can achieve better results than the traditional bidirectional LSTM.",/pdf/ea275408b49d1ca83212d0ca1e93bf607f898f74.pdf,ICLR,2020, +Syl3_2JCZ,BJlnd2yA-,1509050000000.0,1518730000000.0,173,A Self-Organizing Memory Network,"[""callie.federer@ucdenver.edu"", ""joel.zylberberg@ucdenver.edu""]","[""Callie Federer"", ""Joel Zylberberg""]","[""Working Memory"", ""Learning Rules"", ""Stimulus Representations""]","Working memory requires information about external stimuli to be represented in the brain even after those stimuli go away. This information is encoded in the activities of neurons, and neural activities change over timescales of tens of milliseconds. Information in working memory, however, is retained for tens of seconds, suggesting the question of how time-varying neural activities maintain stable representations. Prior work shows that, if the neural dynamics are in the ` null space' of the representation - so that changes to neural activity do not affect the downstream read-out of stimulus information - then information can be retained for periods much longer than the time-scale of individual-neuronal activities. The prior work, however, requires precisely constructed synaptic connectivity matrices, without explaining how this would arise in a biological neural network. To identify mechanisms through which biological networks can self-organize to learn memory function, we derived biologically plausible synaptic plasticity rules that dynamically modify the connectivity matrix to enable information storing. 
Networks implementing this plasticity rule can successfully learn to form memory representations even if only 10% of the synapses are plastic, they are robust to synaptic noise, and they can represent information about multiple stimuli. ",/pdf/0d4bb288527cd99fbf12b84a6db62a365f29fef6.pdf,ICLR,2018,We derived biologically plausible synaptic plasticity learning rules for a recurrent neural network to store stimulus representations. +S0UdquAnr9k,dXhyAR6F2dk,1601310000000.0,1614760000000.0,346,Locally Free Weight Sharing for Network Width Search,"[""~Xiu_Su1"", ""~Shan_You3"", ""~Tao_Huang5"", ""~Fei_Wang9"", ""~Chen_Qian1"", ""~Changshui_Zhang1"", ""~Chang_Xu4""]","[""Xiu Su"", ""Shan You"", ""Tao Huang"", ""Fei Wang"", ""Chen Qian"", ""Changshui Zhang"", ""Chang Xu""]",[],"Searching for network width is an effective way to slim deep neural networks with hardware budgets. With this aim, a one-shot supernet is usually leveraged as a performance evaluator to rank the performance \wrt~different width. Nevertheless, current methods mainly follow a manually fixed weight sharing pattern, which is limited to distinguish the performance gap of different width. In this paper, to better evaluate each width, we propose a locally free weight sharing strategy (CafeNet) accordingly. In CafeNet, weights are more freely shared, and each width is jointly indicated by its base channels and free channels, where free channels are supposed to locate freely in a local zone to better represent each width. Besides, we propose to further reduce the search space by leveraging our introduced FLOPs-sensitive bins. As a result, our CafeNet can be trained stochastically and get optimized within a min-min strategy. Extensive experiments on ImageNet, CIFAR-10, CelebA and MS COCO dataset have verified our superiority comparing to other state-of-the-art baselines. 
For example, our method can further boost the benchmark NAS network EfficientNet-B0 by 0.41\% via searching its width more delicately.",/pdf/72deb2e0a363d37cf758dfbccea8fada27ebf7a8.pdf,ICLR,2021,One-shot locally free weight sharing supernet for searching optimal network width +vYeQQ29Tbvx,s3fWL5UMbPj,1601310000000.0,1615240000000.0,310,Training BatchNorm and Only BatchNorm: On the Expressive Power of Random Features in CNNs,"[""~Jonathan_Frankle1"", ""~David_J._Schwab1"", ""~Ari_S._Morcos1""]","[""Jonathan Frankle"", ""David J. Schwab"", ""Ari S. Morcos""]","[""affine parameters"", ""random features"", ""batchnorm""]","A wide variety of deep learning techniques from style transfer to multitask learning rely on training affine transformations of features. Most prominent among these is the popular feature normalization technique BatchNorm, which normalizes activations and then subsequently applies a learned affine transform. In this paper, we aim to understand the role and expressive power of affine parameters used to transform features in this way. To isolate the contribution of these parameters from that of the learned features they transform, we investigate the performance achieved when training only these parameters in BatchNorm and freezing all weights at their random initializations. Doing so leads to surprisingly high performance considering the significant limitations that this style of training imposes. For example, sufficiently deep ResNets reach 82% (CIFAR-10) and 32% (ImageNet, top-5) accuracy in this configuration, far higher than when training an equivalent number of randomly chosen parameters elsewhere in the network. BatchNorm achieves this performance in part by naturally learning to disable around a third of the random features. 
Not only do these results highlight the expressive power of affine parameters in deep learning, but - in a broader sense - they characterize the expressive power of neural networks constructed simply by shifting and rescaling random features.",/pdf/fd54fc17e45c5cb0f95f9e8ce5c8a3a4eb8f759d.pdf,ICLR,2021,We study the role and expressive power of learned affine parameters that transform features by freezing all weights at their random initializations and training only BatchNorm. +ryloogSKDS,r1gGTk-KwB,1569440000000.0,1583910000000.0,2513,Deep Orientation Uncertainty Learning based on a Bingham Loss,"[""igilitschenski@mit.edu"", ""rsahoo@mit.edu"", ""wilkos@mit.edu"", ""amini@mit.edu"", ""sertac@mit.edu"", ""rus@csail.mit.edu""]","[""Igor Gilitschenski"", ""Roshni Sahoo"", ""Wilko Schwarting"", ""Alexander Amini"", ""Sertac Karaman"", ""Daniela Rus""]","[""Orientation Estimation"", ""Directional Statistics"", ""Bingham Distribution""]","Reasoning about uncertain orientations is one of the core problems in many perception tasks such as object pose estimation or motion estimation. In these scenarios, poor illumination conditions, sensor limitations, or appearance invariance may result in highly uncertain estimates. In this work, we propose a novel learning-based representation for orientation uncertainty. By characterizing uncertainty over unit quaternions with the Bingham distribution, we formulate a loss that naturally captures the antipodal symmetry of the representation. 
We discuss the interpretability of the learned distribution parameters and demonstrate the feasibility of our approach on several challenging real-world pose estimation tasks involving uncertain orientations.",/pdf/9ab4e767edaf0d2a2b9b8cb8ff8b0c140a09a01a.pdf,ICLR,2020,A method for learning to predict uncertainties over orientations using the Bingham Distribution +S1gR2ANFvB,BJgX3ntuwB,1569440000000.0,1577170000000.0,1376,Model Comparison of Beer data classification using an electronic nose,"[""mohammed.munir.abdi@ibm.com"", ""aminat.adebiyi@ibm.com"", ""andrea.fasoli@ibm.com"", ""alberto.mannari@ibm.com"", ""rlabby@us.ibm.com"", ""lbozano@us.ibm.com""]","[""Mohammed Abdi"", ""Aminat Adebiyi"", ""Andrea Fasoli"", ""Alberto Mannari"", ""Ronald Labby"", ""Luisa Bozano""]","[""Electronic Nose"", ""EVA"", ""modular"", ""olfaction"", ""sensitivity"", ""selectivity"", ""analyte"", ""temperature oscillated waveforms"", ""features"", ""fingerprint""]","Olfaction has been and still is an area which is challenging to the research community. Like other senses of the body, there has been a push to replicate the sense of smell to aid in identifying odorous compounds in the form of an electronic nose. At IBM, our team (Cogniscent) has designed a modular sensor board platform based on the artificial olfaction concept we called EVA (Electronic Volatile Analyzer). EVA is an IoT electronic nose device that aims to reproduce olfaction in living begins by integrating an array of partially specific and uniquely selective smell recognition sensors which are directly exposed to the target chemical analyte or the environment. We are exploring a new technique called temperature-controlled oscillation, which gives us virtual array of sensors to represent our signals/ fingerprint. In our study, we run experiments on identifying different types of beers using EVA. 
In order to successfully carry this classification task, the entire process starting from preparation of samples, having a consistent protocol of data collection in place all the way to providing the data to be analyzed and input to a machine learning model is very important. On this paper, we will discuss the process of sniffing volatile organic compounds from liquid beer samples and successfully classifying different brands of beers as a pilot test. We researched on different machine learning models in order to get the best classification accuracy for our Beer samples. The best classification accuracy is achieved by using a multi-level perceptron (MLP) artificial neural network (ANN) model, classification of three different brands of beers after splitting one-week data to a training and testing set yielded an accuracy of 97.334. While using separate weeks of data for training and testing set the model yielded an accuracy of 67.812, this is because of drift playing a role in the overall classification process. Using Random forest, the classification accuracy achieved by the model is 0.923. And Decision Tree achieved 0.911. 
",/pdf/1b29db105a504aef15284d6d7de55412faf17600.pdf,ICLR,2020,On this paper we will discuss the process of sniffing volatile organic compounds from liquid beer samples and exploring various machine learning models +hKps4HGGGx,f2_OEo44QFN,1601310000000.0,1614990000000.0,1638,Improving robustness of softmax corss-entropy loss via inference information,"[""~Bingbing_Song1"", ""hewei@mail.ynu.edu.cn"", ""liurenyang@mail.ynu.edu.cn"", ""~Shui_Yu1"", ""~Ruxin_Wang1"", ""~Mingming_Gong1"", ""~Tongliang_Liu1"", ""zwei@ynu.edu.cn""]","[""Bingbing Song"", ""Wei He"", ""Renyang Liu"", ""Shui Yu"", ""Ruxin Wang"", ""Mingming Gong"", ""Tongliang Liu"", ""Wei Zhou""]","[""Adversarial defense"", ""Loss function"", ""Neural networks robustness""]","Adversarial examples easily mislead the vision systems based on deep neural networks (DNNs) trained with the softmax cross entropy (SCE) loss. Such a vulnerability of DNN comes from the fact that SCE drives DNNs to fit on the training samples, whereas the resultant feature distributions between the training and adversarial examples are unfortunately misaligned. Several state-of-the-arts start from improving the inter-class separability of training samples by modifying loss functions, where we argue that the adversarial examples are ignored and thus limited robustness to adversarial attacks is resulted. In this paper, we exploit inference region which inspires us to involve a margin-like inference information to SCE, resulting in a novel inference-softmax cross entropy (I-SCE) loss, which is intuitively appealing and interpretable. The inference information is a guarantee to both the inter-class separability and the improved generalization to adversarial examples, which is furthermore demonstrated under the min-max framework. 
Extensive experiments show that under strong adaptive attacks, the DNN models trained with the proposed I-SCE loss achieve superior performance and robustness over the state-of-the-arts.",/pdf/edd8c6db7a847da99b8d3251740e82df70bec1f2.pdf,ICLR,2021,"This paper design a novel loss function to improve robustness via inference information, and achieve better performance of robustness than state-of-the-art methods." +SJlHwkBYDH,HyeUXnp_PB,1569440000000.0,1585740000000.0,1766,Nesterov Accelerated Gradient and Scale Invariance for Adversarial Attacks,"[""jdlin@hust.edu.cn"", ""cbsong@hust.edu.cn"", ""brooklet60@hust.edu.cn"", ""wanglw@cis.pku.edu.cn"", ""jeh@cs.cornell.edu""]","[""Jiadong Lin"", ""Chuanbiao Song"", ""Kun He"", ""Liwei Wang"", ""John E. Hopcroft""]","[""adversarial examples"", ""adversarial attack"", ""transferability"", ""Nesterov accelerated gradient"", ""scale invariance""]","Deep learning models are vulnerable to adversarial examples crafted by applying human-imperceptible perturbations on benign inputs. However, under the black-box setting, most existing adversaries often have a poor transferability to attack other defense models. In this work, from the perspective of regarding the adversarial example generation as an optimization process, we propose two new methods to improve the transferability of adversarial examples, namely Nesterov Iterative Fast Gradient Sign Method (NI-FGSM) and Scale-Invariant attack Method (SIM). NI-FGSM aims to adapt Nesterov accelerated gradient into the iterative attacks so as to effectively look ahead and improve the transferability of adversarial examples. While SIM is based on our discovery on the scale-invariant property of deep learning models, for which we leverage to optimize the adversarial perturbations over the scale copies of the input images so as to avoid ""overfitting” on the white-box model being attacked and generate more transferable adversarial examples. 
NI-FGSM and SIM can be naturally integrated to build a robust gradient-based attack to generate more transferable adversarial examples against the defense models. Empirical results on ImageNet dataset demonstrate that our attack methods exhibit higher transferability and achieve higher attack success rates than state-of-the-art gradient-based attacks.",/pdf/1f62c8142bd8b20fe236a38017133e5e938fe043.pdf,ICLR,2020,We proposed a Nesterov Iterative Fast Gradient Sign Method (NI-FGSM) and a Scale-Invariant attack Method (SIM) that can boost the transferability of adversarial examples for image classification. +ryefmpEYPr,Hyetq20LPr,1569440000000.0,1577170000000.0,442,iSparse: Output Informed Sparsification of Neural Networks,"[""ygarg@asu.edu"", ""candan@asu.edu""]","[""Yash Garg"", ""K. Selcuk Candan""]","[""dropout"", ""dropconnect"", ""sparsification"", ""deep learning"", ""neural network""]","Deep neural networks have demonstrated unprecedented success in various knowledge management applications. However, the networks created are often very complex, with large numbers of trainable edges which require extensive computational resources. We note that many successful networks nevertheless often contain large numbers of redundant edges. Moreover, many of these edges may have negligible contributions towards the overall network performance. In this paper, we propose a novel iSparse framework and experimentally show, that we can sparsify the network, by 30-50%, without impacting the network performance. iSparse leverages a novel edge significance score, E, to determine the importance of an edge with respect to the final network output. Furthermore, iSparse can be applied both while training a model or on top of a pre-trained model, making it a retraining-free approach - leading to a minimal computational overhead. 
Comparisons of iSparse against PFEC, NISP, DropConnect, and Retraining-Free on benchmark datasets show that iSparse leads to effective network sparsifications.",/pdf/e16fd8e15b0f08bb2e585763e5d88d1b415d31e6.pdf,ICLR,2020,iSparse eliminates irrelevant or insignificant network edges with minimal impact on network performance by determining edge importance w.r.t. the final network output. +HyxoX6EKvB,rylC-newDS,1569440000000.0,1577170000000.0,462,Reflection-based Word Attribute Transfer,"[""ishibashi.yoichi.ir3@is.naist.jp"", ""sudoh@is.naist.jp"", ""koichiro@is.naist.jp"", ""s-nakamura@is.naist.jp""]","[""Yoichi Ishibashi"", ""Katsuhito Sudoh"", ""Koichiro Yoshino"", ""Satoshi Nakamura""]","[""embedding"", ""representation learning"", ""analogy"", ""geometry""]","We propose a word attribute transfer framework based on reflection to obtain a word vector with an inverted target attribute for a given word in a word embedding space. Word embeddings based on Pointwise Mutual Information (PMI) represent such analogic relations as king - man + woman \approx queen. These relations can be used for changing a word’s attribute from king to queen by changing its gender. This attribute transfer can be performed by subtracting a difference vector man - woman from king when we have explicit knowledge of the gender of given word king. However, this knowledge cannot be developed for various words and attributes in practice. For transferring queen into king in this analogy-based manner, we need to know that queen denotes a female and add the difference vector to it. In this work, we transfer such binary attributes based on an assumption that such transfer mapping will become identity mapping when we apply it twice. We introduce a framework based on reflection mapping that satisfies this property; queen should be transferred back to king with the same mapping as the transfer from king to queen. 
Experimental results show that the proposed method can transfer the word attributes of the given words, and does not change the words that do not have the target attributes.",/pdf/215b8895577d85799fb2b48af4ca9db1fc5f5990.pdf,ICLR,2020,We propose a novel representation learning framework that obtains a vector with an inverted attribute in embedding space without explicit attribute knowledge of the given word. +SyhRVm-Rb,rk1AV7ZRW,1509140000000.0,1518730000000.0,1162,Automatic Goal Generation for Reinforcement Learning Agents,"[""dheld@andrew.cmu.edu"", ""young.geng@berkeley.edu"", ""florensa@berkeley.edu"", ""pabbeel@berkeley.edu""]","[""David Held"", ""Xinyang Geng"", ""Carlos Florensa"", ""Pieter Abbeel""]","[""Reinforcement Learning"", ""Multi-task Learning"", ""Curriculum Learning""]","Reinforcement learning (RL) is a powerful technique to train an agent to perform a task. However, an agent that is trained using RL is only capable of achieving the single task that is specified via its reward function. Such an approach does not scale well to settings in which an agent needs to perform a diverse set of tasks, such as navigating to varying positions in a room or moving objects to varying locations. Instead, we propose a method that allows an agent to automatically discover the range of tasks that it is capable of performing in its environment. We use a generator network to propose tasks for the agent to try to achieve, each task being specified as reaching a certain parametrized subset of the state-space. The generator network is optimized using adversarial training to produce tasks that are always at the appropriate level of difficulty for the agent. Our method thus automatically produces a curriculum of tasks for the agent to learn. 
We show that, by using this framework, an agent can efficiently and automatically learn to perform a wide set of tasks without requiring any prior knowledge of its environment (Videos and code available at: https://sites.google.com/view/goalgeneration4rl). Our method can also learn to achieve tasks with sparse rewards, which pose significant challenges for traditional RL methods.",/pdf/d3bea8d42d0595a7bc8e7dc4edf32c95e8f6a035.pdf,ICLR,2018,We efficiently solve multi-task problems with an automatic curriculum generation algorithm based on a generative model that tracks the learning agent's performance. +HJgVisRqtX,HkeMAO2qt7,1538090000000.0,1545360000000.0,613,SEGEN: SAMPLE-ENSEMBLE GENETIC EVOLUTIONARY NETWORK MODEL,"[""jiawei@ifmlab.org"", ""lmcui932@163.com"", ""fisherbgouza@gmail.com""]","[""Jiawei Zhang"", ""Limeng Cui"", ""Fisher B. Gouza""]","[""Genetic Evolutionary Network"", ""Deep Learning"", ""Genetic Algorithm"", ""Ensemble Learning"", ""Representation Learning""]","Deep learning, a rebranding of deep neural network research works, has achieved a remarkable success in recent years. With multiple hidden layers, deep learning models aim at computing the hierarchical feature representations of the observational data. Meanwhile, due to its severe disadvantages in data consumption, computational resources, parameter tuning costs and the lack of result explainability, deep learning has also suffered from lots of criticism. In this paper, we will introduce a new representation learning model, namely “Sample-Ensemble Genetic Evolutionary Network” (SEGEN), which can serve as an alternative approach to deep learning models. Instead of building one single deep model, based on a set of sampled sub-instances, SEGEN adopts a genetic-evolutionary learning strategy to build a group of unit models generations by generations. 
The unit models incorporated in SEGEN can be either traditional machine learning models or the recent deep learning models with a much “narrower” and “shallower” architecture. The learning results of each instance at the final generation will be effectively combined from each unit model via diffusive propagation and ensemble learning strategies. From the computational perspective, SEGEN requires far less data, fewer computational resources and parameter tuning efforts, but has sound theoretic interpretability of the learning process and results. Extensive experiments have been done on several different real-world benchmark datasets, and the experimental results obtained by SEGEN have demonstrated its advantages over the state-of-the-art representation learning models.",/pdf/72acae32af5730c669efa51298f7d276b3dbc41f.pdf,ICLR,2019,"We introduce a new representation learning model, namely “Sample-Ensemble Genetic Evolutionary Network” (SEGEN), which can serve as an alternative approach to deep learning models." +SJeS16EKPr,Hyxk4rUHvH,1569440000000.0,1577170000000.0,299,Learning relevant features for statistical inference,"[""cedric.beny@gmail.com""]","[""C\u00e9dric B\u00e9ny""]","[""unsupervised learning"", ""non-parametric probabilistic model"", ""singular value decomposition"", ""fisher information metric"", ""chi-squared distance""]","We introduce an new technique to learn correlations between two types of data. +The learned representation can be used to directly compute the expectations of functions over one type of data conditioned on the other, such as Bayesian estimators and their standard deviations. +Specifically, our loss function teaches two neural nets to extract features representing the probability vectors of highest singular value for the stochastic map (set of conditional probabilities) implied by the joint dataset, relative to the inner product defined by the Fisher information metrics evaluated at the marginals. 
+We test the approach using a synthetic dataset, analytical calculations, and inference on occluded MNIST images. +Surprisingly, when applied to supervised learning (one dataset consists of labels), this approach automatically provides regularization and faster convergence compared to the cross-entropy objective. +We also explore using this approach to discover salient independent features of a single dataset. ",/pdf/c55108af100dcf92d20d8c4472961f0360d6ab99.pdf,ICLR,2020,"Given bipartite data and two neural nets, this new objective based on Fisher information teaches them to extract the most correlated features, which can then be used to do inference." +MkrAyYVmt7b,CCNIQ2d-b7_,1601310000000.0,1614990000000.0,1919,Perfect density models cannot guarantee anomaly detection,"[""~Charline_Le_Lan2"", ""~Laurent_Dinh1""]","[""Charline Le Lan"", ""Laurent Dinh""]","[""anomaly detection"", ""out-of-distribution detection"", ""OOD detection"", ""outlier detection"", ""density estimation""]","Thanks to the tractability of their likelihood, some deep generative models show promise for seemingly straightforward but important applications like anomaly detection, uncertainty estimation, and active learning. However, the likelihood values empirically attributed to anomalies conflict with the expectations these proposed applications suggest. In this paper, we take a closer look at the behavior of distribution densities and show that these quantities carry less meaningful information than previously thought, beyond estimation issues or the curse of dimensionality. We conclude that the use of these likelihoods for out-of-distribution detection relies on strong and implicit hypotheses and highlight the necessity of explicitly formulating these assumptions for reliable anomaly detection.",/pdf/ba92d26653bfe81c18c2fa647ed30d0e31e54fb6.pdf,ICLR,2021,Explaining issues of density models for anomaly detection. 
+9WlOIHve8dU,PXUSwBImBF9,1601310000000.0,1614990000000.0,3028,Learning Binary Trees via Sparse Relaxation,"[""~Valentina_Zantedeschi2"", ""~Matt_Kusner1"", ""~Vlad_Niculae2""]","[""Valentina Zantedeschi"", ""Matt Kusner"", ""Vlad Niculae""]","[""optimization"", ""binary trees""]","One of the most classical problems in machine learning is how to learn binary trees that split data into meaningful partitions. From classification/regression via decision trees to hierarchical clustering, binary trees are useful because they (a) are often easy to visualize; (b) make computationally-efficient predictions; and (c) allow for flexible partitioning. Because of this there has been extensive research on how to learn such trees. Optimization generally falls into one of three categories: 1. greedy node-by-node optimization; 2. probabilistic relaxations for differentiability; 3. mixed-integer programming (MIP). Each of these have downsides: greedy can myopically choose poor splits, probabilistic relaxations do not have principled ways to prune trees, MIP methods can be slow on large problems and may not generalize. In this work we derive a novel sparse relaxation for binary tree learning. By sparsely relaxing a new MIP, our approach is able to learn tree splits and tree pruning using state-of-the-art gradient-based approaches. We demonstrate how our approach is easily visualizable, is efficient, and is competitive with current work in classification/regression and hierarchical clustering.",/pdf/ab46ab28494b027bd02472ca30b4e348d7801c67.pdf,ICLR,2021,We present a new sparse differentiable relaxation of mixed-integer programming methods for tree learning. +Hk95PK9le,,1478300000000.0,1488580000000.0,504,Deep Biaffine Attention for Neural Dependency Parsing,"[""tdozat@stanford.edu"", ""manning@stanford.edu""]","[""Timothy Dozat"", ""Christopher D. 
Manning""]","[""Natural language processing"", ""Deep learning""]","This paper builds off recent work from Kiperwasser & Goldberg (2016) using neural attention in a simple graph-based dependency parser. We use a larger but more thoroughly regularized parser than other recent BiLSTM-based approaches, with +biaffine classifiers to predict arcs and labels. Our parser gets state of the art or near state of the art performance on standard treebanks for six different languages, achieving 95.7% UAS and 94.1% LAS on the most popular English PTB dataset. This makes it the highest-performing graph-based parser on this benchmark—outperforming Kiperwasser & Goldberg (2016) by 1.8% and 2.2%—and comparable to the highest performing transition-based parser (Kuncoro et al., 2016), which achieves 95.8% UAS and 94.6% LAS. We also show which hyperparameter choices had a significant effect on parsing accuracy, allowing us to achieve large gains over other graph-based approaches. +",/pdf/e197f067db45fa2483f934adcff5f757205d06bd.pdf,ICLR,2017, +rJxwDTVFDB,BJg2KxiDPr,1569440000000.0,1577170000000.0,599,Pushing the bounds of dropout,"[""melisgl@google.com"", ""cblundell@google.com"", ""tkocisky@google.com"", ""kmh@google.com"", ""cdyer@google.com"", ""pblunsom@google.com""]","[""G\u00e1bor Melis"", ""Charles Blundell"", ""Tom\u00e1\u0161 Ko\u010disk\u00fd"", ""Karl Moritz Hermann"", ""Chris Dyer"", ""Phil Blunsom""]","[""dropout"", ""language""]","We push on the boundaries of our knowledge about dropout by showing theoretically that dropout training can be understood as performing MAP estimation concurrently for an entire family of conditional models whose objectives are themselves lower bounded by the original dropout objective. This discovery allows us to pick any model from this family after training, which leads to a substantial improvement on regularisation-heavy language modelling. 
The family includes models that compute a power mean over the sampled dropout masks, and their less stochastic subvariants with tighter and higher lower bounds than the fully stochastic dropout objective. The deterministic subvariant's bound is equal to its objective, and the highest amongst these models. It also exhibits the best model fit in our experiments. Together, these results suggest that the predominant view of deterministic dropout as a good approximation to MC averaging is misleading. Rather, deterministic dropout is the best available approximation to the true objective.",/pdf/ce199f1065e5cab62a76b2618f84e1cb9b2af539.pdf,ICLR,2020,A new view of dropout training as optimizing lower bound for an entire family of models. +DegtqJSbxo,zv486AUT2p-,1601310000000.0,1614990000000.0,1986,Adversarial and Natural Perturbations for General Robustness,"[""~Sadaf_Gulshad1"", ""~Jan_Hendrik_Metzen1"", ""~Arnold_W.M._Smeulders1""]","[""Sadaf Gulshad"", ""Jan Hendrik Metzen"", ""Arnold W.M. Smeulders""]","[""Robustness"", ""Adversarial Examples"", ""Natural Perturbations"", ""General Robustness""]","In this paper we aim to explore the general robustness of neural network classifiers by utilizing adversarial as well as natural perturbations. Different from previous works which mainly focus on studying the robustness of neural networks against adversarial perturbations, we also evaluate their robustness on natural perturbations before and after robustification. After standardizing the comparison between adversarial and natural perturbations, we demonstrate that although adversarial training improves the performance of the networks against adversarial perturbations, it leads to drop in the performance for naturally perturbed samples besides clean samples. 
In contrast, natural perturbations like elastic deformations, occlusions and wave does not only improve the performance against natural perturbations, but also lead to improvement in the performance for the adversarial perturbations. Additionally they do not drop the accuracy on the clean images.",/pdf/3afd748574db54a7aba2da37875dbaf0a43f6b1b.pdf,ICLR,2021, +biH_IISPxYA,oJZRwakpGxO,1601310000000.0,1614990000000.0,2925,Multi-Level Generative Models for Partial Label Learning with Non-random Label Noise,"[""~Yan_Yan10"", ""~Yuhong_Guo1""]","[""Yan Yan"", ""Yuhong Guo""]",[],"Partial label (PL) learning tackles the problem where each training instance is associated with a set of candidate labels that include both the true label and irrelevant noise labels. In this paper, we propose a novel multi-level generative model for partial label learning (MGPLL), which tackles the PL problem by learning both a label level adversarial generator and a feature level adversarial generator under a bi-directional mapping framework between the label vectors and the data samples. MGPLL uses a conditional noise label generation network to model the non-random noise labels and perform label denoising, and uses a multi-class predictor to map the training instances to the denoised label vectors, while a conditional data feature generator is used to form an inverse mapping from the denoised label vectors to data samples. Both the noise label generator and the data feature generator are learned in an adversarial manner to match the observed candidate labels and data features respectively. We conduct extensive experiments on both synthesized and real-world partial label datasets. The proposed approach demonstrates the state-of-the- art performance for partial label learning.",/pdf/12cc456e89e02e1812f0ddd75004cb3c37ca7dac.pdf,ICLR,2021,This is the first partial label learning method that handles non-random label noise with a consistent multi-level generative model. 
+SJeXSo09FQ,BJe_VKd1KX,1538090000000.0,1549980000000.0,76,Learning Localized Generative Models for 3D Point Clouds via Graph Convolution,"[""diego.valsesia@polito.it"", ""giulia.fracastoro@polito.it"", ""enrico.magli@polito.it""]","[""Diego Valsesia"", ""Giulia Fracastoro"", ""Enrico Magli""]","[""GAN"", ""graph convolution"", ""point clouds""]","Point clouds are an important type of geometric data and have widespread use in computer graphics and vision. However, learning representations for point clouds is particularly challenging due to their nature as being an unordered collection of points irregularly distributed in 3D space. Graph convolution, a generalization of the convolution operation for data defined over graphs, has been recently shown to be very successful at extracting localized features from point clouds in supervised or semi-supervised tasks such as classification or segmentation. This paper studies the unsupervised problem of a generative model exploiting graph convolution. We focus on the generator of a GAN and define methods for graph convolution when the graph is not known in advance as it is the very output of the generator. The proposed architecture learns to generate localized features that approximate graph embeddings of the output geometry. 
We also study the problem of defining an upsampling layer in the graph-convolutional generator, such that it learns to exploit a self-similarity prior on the data distribution to sample more effectively.",/pdf/e9c04a77b67f082867a22c4fba0cf3d0a3c53b47.pdf,ICLR,2019,A GAN using graph convolution operations with dynamically computed graphs from hidden features +BkxfshNYwB,SkxyH5nePS,1569440000000.0,1577170000000.0,143,Mincut Pooling in Graph Neural Networks,"[""fibi@norceresearch.no"", ""grattd@usi.ch"", ""alippc@usi.ch""]","[""Filippo Maria Bianchi"", ""Daniele Grattarola"", ""Cesare Alippi""]","[""Graph Neural Networks"", ""Pooling"", ""Graph Cuts"", ""Spectral Clustering""]","The advance of node pooling operations in Graph Neural Networks (GNNs) has lagged behind the feverish design of new message-passing techniques, and pooling remains an important and challenging endeavor for the design of deep architectures. +In this paper, we propose a pooling operation for GNNs that leverages a differentiable unsupervised loss based on the minCut optimization objective. +For each node, our method learns a soft cluster assignment vector that depends on the node features, the target inference task (e.g., a graph classification loss), and, thanks to the minCut objective, also on the connectivity structure of the graph. +Graph pooling is obtained by applying the matrix of assignment vectors to the adjacency matrix and the node features. +We validate the effectiveness of the proposed pooling method on a variety of supervised and unsupervised tasks.",/pdf/9e4420fa6fb6f26c557ab4bb4d618eb2cf4ece5f.pdf,ICLR,2020,"A new pooling layer for GNNs that learns how to pool nodes, according to their features, the graph connectivity, and the dowstream task objective." 
+SygcCnNKwr,BkgYColSwr,1569440000000.0,1583910000000.0,274,Measuring Compositional Generalization: A Comprehensive Method on Realistic Data,"[""keysers@google.com"", ""schaerli@google.com"", ""nkscales@google.com"", ""hylke@google.com"", ""danielfurrer@google.com"", ""sergik@google.com"", ""nikola@google.com"", ""sinopalnikov@google.com"", ""lukstafi@google.com"", ""ttihon@google.com"", ""tsar@google.com"", ""wangxiao@google.com"", ""marcvanzee@google.com"", ""obousquet@google.com""]","[""Daniel Keysers"", ""Nathanael Sch\u00e4rli"", ""Nathan Scales"", ""Hylke Buisman"", ""Daniel Furrer"", ""Sergii Kashubin"", ""Nikola Momchev"", ""Danila Sinopalnikov"", ""Lukasz Stafiniak"", ""Tibor Tihon"", ""Dmitry Tsarkov"", ""Xiao Wang"", ""Marc van Zee"", ""Olivier Bousquet""]","[""compositionality"", ""generalization"", ""natural language understanding"", ""benchmark"", ""compositional generalization"", ""compositional modeling"", ""semantic parsing"", ""generalization measurement""]","State-of-the-art machine learning methods exhibit limited compositional generalization. At the same time, there is a lack of realistic benchmarks that comprehensively measure this ability, which makes it challenging to find and evaluate improvements. We introduce a novel method to systematically construct such benchmarks by maximizing compound divergence while guaranteeing a small atom divergence between train and test sets, and we quantitatively compare this method to other approaches for creating compositional generalization benchmarks. We present a large and realistic natural language question answering dataset that is constructed according to this method, and we use it to analyze the compositional generalization ability of three machine learning architectures. We find that they fail to generalize compositionally and that there is a surprisingly strong negative correlation between compound divergence and accuracy. 
We also demonstrate how our method can be used to create new compositionality benchmarks on top of the existing SCAN dataset, which confirms these findings. +",/pdf/2631883e2154c39de4275df932c36eebb193e3ce.pdf,ICLR,2020,Benchmark and method to measure compositional generalization by maximizing divergence of compound frequency at small divergence of atom frequency. +412_KkkGjJ4,RxyN5i8_D6L,1601310000000.0,1614990000000.0,1478,Weakly Supervised Scene Graph Grounding,"[""~Yizhou_Zhang3"", ""~Zhaoheng_Zheng1"", ""~Yan_Liu1""]","[""Yizhou Zhang"", ""Zhaoheng Zheng"", ""Yan Liu""]","[""Weakly Supervised Learning"", ""Scene Graph Grounding"", ""Visual Relation"", ""Computer Vision""]"," Recent researches have achieved substantial advances in learning structured representations from images. However, current methods rely heavily on the annotated mapping between the nodes of scene graphs and object bounding boxes inside images. Here, we explore the problem of learning the mapping between scene graph nodes and visual objects under weak supervision. Our proposed method learns a metric among visual objects and scene graph nodes by incorporating information from both object features and relational features. Extensive experiments on Visual Genome (VG) and Visual Relation Detection (VRD) datasets verify that our model post an improvement on scene graph grounding task over current state-of-the-art approaches. Further experiments on scene graph parsing task verify the grounding found by our model can reinforce the performance of the existing method. ",/pdf/2d8af16be0d561548efc890051ec2794d15602a6.pdf,ICLR,2021,We propose the task of weakly supervised scene graph grounding and provide a state-of-the-art solution. 
+rkgl51rKDB,BygeUdCdPr,1569440000000.0,1577170000000.0,1865,Efficient meta reinforcement learning via meta goal generation,"[""haotianfu@tju.edu.cn"", ""bluecontra@tju.edu.cn"", ""jianye.hao@tju.edu.cn""]","[""Haotian Fu"", ""Hongyao Tang"", ""Jianye Hao""]",[],"Meta reinforcement learning (meta-RL) is able to accelerate the acquisition of new tasks by learning from past experience. Current meta-RL methods usually learn to adapt to new tasks by directly optimizing the parameters of policies over primitive actions. However, for complex tasks that require sophisticated control strategies, it would be quite inefficient to directly learn such a meta-policy. Moreover, this problem can become more severe, and learning can even fail, in sparse reward settings, which are quite common in practice. To this end, we propose a new meta-RL algorithm called meta goal-generation for hierarchical RL (MGHRL) by leveraging the hierarchical actor-critic framework. Instead of directly generating policies over primitive actions for new tasks, MGHRL learns to generate high-level meta strategies over subgoals given past experience and leaves the rest of how to achieve subgoals as independent RL subtasks. 
Our empirical results on several challenging simulated robotics environments show that our method enables more efficient and effective meta-learning from past experience and outperforms state-of-the-art meta-RL and Hierarchical-RL methods in sparse reward settings.",/pdf/96dea0af2faf882ea23a25cc37f844df887b6243.pdf,ICLR,2020, +0xdQXkz69x9,CWqKlpWLq2N,1601310000000.0,1614990000000.0,3303,Attacking Few-Shot Classifiers with Adversarial Support Sets,"[""~Elre_Talea_Oldewage1"", ""~John_F_Bronskill1"", ""~Richard_E_Turner1""]","[""Elre Talea Oldewage"", ""John F Bronskill"", ""Richard E Turner""]","[""meta-learning"", ""few-shot learning"", ""adversarial attacks"", ""poisoning""]","Few-shot learning systems, especially those based on meta-learning, have recently made significant advances, and are now being considered for real world problems in healthcare, personalization, and science. In this paper, we examine the robustness of such deployed few-shot learning systems when they are fed an imperceptibly perturbed few-shot dataset, showing that the resulting predictions on test inputs can become worse than chance. This is achieved by developing a novel Adversarial Support Set Attack which crafts a poisoned set of examples. When even a small subset of malicious data points is inserted into the support set of a meta-learner, accuracy is significantly reduced. For example, the average classification accuracy of CNAPs on the Aircraft dataset in the META-DATASET benchmark drops from 69.2% to 9.1% when only 20% of the support set is poisoned by imperceptible perturbations. We evaluate the new attack on a variety of few-shot classification algorithms including MAML, prototypical networks, and CNAPs, on both small scale (miniImageNet) and large scale (META-DATASET) few-shot classification problems. 
Interestingly, adversarial support sets produced by attacking a meta-learning based few-shot classifier can also reduce the accuracy of a fine-tuning based few-shot classifier when both models use similar feature extractors.",/pdf/241a09f79b4c24403777ec183b1398f34ce3d600.pdf,ICLR,2021,"We introduce an effective, novel adversarial attack called an Adversarial Support Set Attack that poisons the support set of trained few-shot learners to cause failure with high probability on unseen queries at test time." +#NAME?,3vNr02njsN9,1601310000000.0,1614990000000.0,2331,A Bayesian-Symbolic Approach to Learning and Reasoning for Intuitive Physics,"[""~Kai_Xu4"", ""~Akash_Srivastava1"", ""~Dan_Gutfreund1"", ""fsosa@fas.harvard.edu"", ""~Tomer_Ullman1"", ""~Joshua_B._Tenenbaum1"", ""~Charles_Sutton1""]","[""Kai Xu"", ""Akash Srivastava"", ""Dan Gutfreund"", ""Felix Sosa"", ""Tomer Ullman"", ""Joshua B. Tenenbaum"", ""Charles Sutton""]","[""physics learning"", ""symbolic regression"", ""intuitive physics""]","Humans are capable of reasoning about physical phenomena by inferring laws of physics from a very limited set of observations. The inferred laws can potentially depend on unobserved properties, such as mass, texture, charge, etc. This sample-efficient physical reasoning is considered a core domain of human common-sense knowledge and hints at the existence of a physics engine in the head. In this paper, we propose a Bayesian symbolic framework for learning sample-efficient models of physical reasoning and prediction, which are of special interests in the field of intuitive physics. In our framework, the environment is represented by a top-down generative model with a collection of entities with some known and unknown properties as latent variables to capture uncertainty. 
The physics engine depends on physical laws which are modeled as interpretable symbolic expressions and are assumed to be functions of the latent properties of the entities interacting under simple Newtonian physics. As such, learning the laws is then reduced to symbolic regression and Bayesian inference methods are used to obtain the distribution of unobserved properties. These inference and regression steps are performed in an iterative manner following the expectation–maximization algorithm to infer the unknown properties and use them to learn the laws from a very small set of observations. We demonstrate that on three physics learning tasks that compared to the existing methods of learning physics, our proposed framework is more data-efficient, accurate and makes joint reasoning and learning possible.",/pdf/557cf30c3639e42f7f093814322779e16c30b26f.pdf,ICLR,2021,A novel computational framework to perform joint learning-reasoning of physics by combining symbolic regression and Bayesian inference. +BkxWJnC9tX,HJgfgkRqKm,1538090000000.0,1550890000000.0,960,Diversity and Depth in Per-Example Routing Models,"[""prajitram@gmail.com"", ""qvl@google.com""]","[""Prajit Ramachandran"", ""Quoc V. Le""]","[""conditional computation"", ""routing models"", ""depth""]","Routing models, a form of conditional computation where examples are routed through a subset of components in a larger network, have shown promising results in recent works. Surprisingly, routing models to date have lacked important properties, such as architectural diversity and large numbers of routing decisions. Both architectural diversity and routing depth can increase the representational power of a routing network. In this work, we address both of these deficiencies. We discuss the significance of architectural diversity in routing models, and explain the tradeoffs between capacity and optimization when increasing routing depth. 
In our experiments, we find that adding architectural diversity to routing models significantly improves performance, cutting the error rates of a strong baseline by 35% on an Omniglot setup. However, when scaling up routing depth, we find that modern routing techniques struggle with optimization. We conclude by discussing both the positive and negative results, and suggest directions for future research.",/pdf/4ea69f6e33a193ebba5bcf9e3444e882ded0a274.pdf,ICLR,2019,"Per-example routing models benefit from architectural diversity, but still struggle to scale to a large number of routing decisions." +Hkg8bDqee,,1478290000000.0,1490160000000.0,360,Introspection:Accelerating Neural Network Training By Learning Weight Evolution,"[""abhishek.sinha94@gmail.com"", ""ahitagnimukherjeeam@gmail.com"", ""msarkar@adobe.com"", ""kbalaji@adobe.com""]","[""Abhishek Sinha"", ""Aahitagni Mukherjee"", ""Mausoom Sarkar"", ""Balaji Krishnamurthy""]","[""Computer vision"", ""Deep learning"", ""Optimization""]","Neural Networks are function approximators that have achieved state-of-the-art accuracy in numerous machine learning tasks. In spite of their great success in terms of accuracy, their large training time makes it difficult to use them for various tasks. In this paper, we explore the idea of learning weight evolution pattern from a simple network for accelerating training of novel neural networks. + +We use a neural network to learn the training pattern from MNIST classification and utilize it to accelerate training of neural networks used for CIFAR-10 and ImageNet classification. Our method has a low memory footprint and is computationally efficient. This method can also be used with other optimizers to give faster convergence. 
The results indicate a general trend in the weight evolution during training of neural networks.",/pdf/f5316305b0560db063525a71f36ca95d1932981e.pdf,ICLR,2017,"Acceleration of training by performing weight updates, using knowledge obtained from training other neural networks." +rkeX-3Rqtm,HJx82b05Km,1538090000000.0,1545360000000.0,1156,Training Hard-Threshold Networks with Combinatorial Search in a Discrete Target Propagation Setting,"[""lnaberga@uwaterloo.ca"", ""wjtoth@uwaterloo.ca"", ""lm2cousi@uwaterloo.ca""]","[""Lukas Nabergall"", ""Justin Toth"", ""Leah Cousins""]","[""hard-threshold network"", ""combinatorial optimization"", ""search"", ""target propagation""]","Learning deep neural networks with hard-threshold activation has recently become an important problem due to the proliferation of resource-constrained computing devices. In order to circumvent the inability to train with backpropagation in the presence of hard-threshold activations, \cite{friesen2017} introduced a discrete target propagation framework for training hard-threshold networks in a layer-by-layer fashion. Rather than using a gradient-based target heuristic, we explore the use of search methods for solving the target setting problem. Building on both traditional combinatorial optimization algorithms and gradient-based techniques, we develop a novel search algorithm, Guided Random Local Search (GRLS). We demonstrate the effectiveness of our algorithm in training small networks on several datasets and evaluate our target-setting algorithm compared to simpler search methods and gradient-based techniques. Our results indicate that combinatorial optimization is a viable method for training hard-threshold networks that may have the potential to eventually surpass gradient-based methods in many settings. 
",/pdf/79de3bbc5cbd1165f3632d65918edc0e158681fa.pdf,ICLR,2019, +BkeYdyHYPS,ryeRlz0dwS,1569440000000.0,1577170000000.0,1813,Evo-NAS: Evolutionary-Neural Hybrid Agent for Architecture Search,"[""krzysztof.s.maziarz@gmail.com"", ""tanmingxing@google.com"", ""akhorlin@google.com"", ""kysc@google.com"", ""agesmundo@google.com""]","[""Krzysztof Maziarz"", ""Mingxing Tan"", ""Andrey Khorlin"", ""Kuang-Yu Samuel Chang"", ""Andrea Gesmundo""]",[],"Neural Architecture Search has shown potential to automate the design of neural networks. Deep Reinforcement Learning based agents can learn complex architectural patterns, as well as explore a vast and compositional search space. On the other hand, evolutionary algorithms offer higher sample efficiency, which is critical for such a resource intensive application. In order to capture the best of both worlds, we propose a class of Evolutionary-Neural hybrid agents (Evo-NAS). We show that the Evo-NAS agent outperforms both neural and evolutionary agents when applied to architecture search for a suite of text and image classification benchmarks. On a high-complexity architecture search space for image classification, the Evo-NAS agent surpasses the accuracy achieved by commonly used agents with only 1/3 of the search cost.",/pdf/c0cf4da0f3f2e272107ca38bb8681f8967aea493.pdf,ICLR,2020, +mQPBmvyAuk,70-GJRb5eN7,1601310000000.0,1615930000000.0,1836,BREEDS: Benchmarks for Subpopulation Shift,"[""~Shibani_Santurkar1"", ""~Dimitris_Tsipras1"", ""~Aleksander_Madry1""]","[""Shibani Santurkar"", ""Dimitris Tsipras"", ""Aleksander Madry""]","[""benchmarks"", ""distribution shift"", ""hierarchy"", ""robustness""]","We develop a methodology for assessing the robustness of models to subpopulation shift---specifically, their ability to generalize to novel data subpopulations that were not observed during training. 
Our approach leverages the class structure underlying existing datasets to control the data subpopulations that comprise the training and test distributions. This enables us to synthesize realistic distribution shifts whose sources can be precisely controlled and characterized, within existing +large-scale datasets. Applying this methodology to the ImageNet dataset, we create a suite of subpopulation shift benchmarks of varying granularity. We then validate that the corresponding shifts are tractable by obtaining human baselines. Finally, we utilize these benchmarks to measure the sensitivity of standard model architectures as well as the effectiveness of existing train-time robustness interventions. ",/pdf/267e1b0387f6edaaa3b1145def1009b6803d55b6.pdf,ICLR,2021,We develop a methodology for constructing large-scale subpopulation shift benchmarks and use them to assess model robustness as well as the effectiveness existing robustness interventions. +uhiF-dV99ir,kfTVvNJU88J,1601310000000.0,1614990000000.0,1828,Visualizing High-Dimensional Trajectories on the Loss-Landscape of ANNs,"[""~Stefan_Horoi1"", ""jiexi.huang@yale.edu"", ""~Guy_Wolf1"", ""~Smita_Krishnaswamy1""]","[""Stefan Horoi"", ""Jessie Huang"", ""Guy Wolf"", ""Smita Krishnaswamy""]",[],"Training artificial neural networks requires the optimization of highly non-convex loss functions. Throughout the years, the scientific community has developed an extensive set of tools and architectures that render this optimization task tractable and a general intuition has been developed for choosing hyper parameters that help the models reach minima that generalize well to unseen data. However, for the most part, the difference in trainability in between architectures, tasks and even the gap in network generalization abilities still remain unexplained. 
Visualization tools have played a key role in uncovering key geometric characteristics of the loss-landscape of ANNs and how they impact trainability and generalization capabilities. However, most visualizations methods proposed so far have been relatively limited in their capabilities since they are of linear nature and only capture features in a limited number of dimensions. We propose the use of the modern dimensionality reduction method PHATE which represents the SOTA in terms of capturing both global and local structures of high-dimensional data. We apply this method to visualize the loss landscape during and after training. Our visualizations reveal differences in training trajectories and generalization capabilities when used to make comparisons between optimization methods, initializations, architectures, and datasets. Given this success we anticipate this method to be used in making informed choices about these aspects of neural networks.",/pdf/08f105a2db07c3233df52c5ef42ff4ad86b4e24b.pdf,ICLR,2021, +H1g2NhC5KQ,B1gHXcTqY7,1538090000000.0,1549870000000.0,1489,Multiple-Attribute Text Rewriting,"[""glample@fb.com"", ""sandeep.subramanian.1@umontreal.ca"", ""ems@fb.com"", ""ludovic.denoyer@lip6.fr"", ""ranzato@fb.com"", ""ylan@fb.com""]","[""Guillaume Lample"", ""Sandeep Subramanian"", ""Eric Smith"", ""Ludovic Denoyer"", ""Marc'Aurelio Ranzato"", ""Y-Lan Boureau""]","[""controllable text generation"", ""generative models"", ""conditional generative models"", ""style transfer""]","The dominant approach to unsupervised ""style transfer'' in text is based on the idea of learning a latent representation, which is independent of the attributes specifying its ""style''. In this paper, we show that this condition is not necessary and is not always met in practice, even with domain adversarial training that explicitly aims at learning such disentangled representations. 
We thus propose a new model that controls several factors of variation in textual data where this condition on disentanglement is replaced with a simpler mechanism based on back-translation. Our method allows control over multiple attributes, like gender, sentiment, product type, etc., and a more fine-grained control on the trade-off between content preservation and change of style with a pooling operator in the latent space. Our experiments demonstrate that the fully entangled model produces better generations, even when tested on new and more challenging benchmarks comprising reviews with multiple sentences and multiple attributes.",/pdf/0061161de6f286e89e19c6626f651cdd77bef70a.pdf,ICLR,2019,A system for rewriting text conditioned on multiple controllable attributes +BkxSmlBFvr,r1xiM4ltPH,1569440000000.0,1583910000000.0,2209,You CAN Teach an Old Dog New Tricks! On Training Knowledge Graph Embeddings,"[""daniel@informatik.uni-mannheim.de"", ""broscheit@informatik.uni-mannheim.de"", ""rgemulla@uni-mannheim.de""]","[""Daniel Ruffinelli"", ""Samuel Broscheit"", ""Rainer Gemulla""]","[""knowledge graph embeddings"", ""hyperparameter optimization""]","Knowledge graph embedding (KGE) models learn algebraic representations of the entities and relations in a knowledge graph. A vast number of KGE techniques for multi-relational link prediction have been proposed in the recent literature, often with state-of-the-art performance. These approaches differ along a number of dimensions, including different model architectures, different training strategies, and different approaches to hyperparameter optimization. In this paper, we take a step back and aim to summarize and quantify empirically the impact of each of these dimensions on model performance. We report on the results of an extensive experimental study with popular model architectures and training strategies across a wide range of hyperparameter settings. 
We found that when trained appropriately, the relative performance differences between various model architectures often shrinks and sometimes even reverses when compared to prior results. For example, RESCAL~\citep{nickel2011three}, one of the first KGE models, showed strong performance when trained with state-of-the-art techniques; it was competitive to or outperformed more recent architectures. We also found that good (and often superior to prior studies) model configurations can be found by exploring relatively few random samples from a large hyperparameter space. Our results suggest that many of the more advanced architectures and techniques proposed in the literature should be revisited to reassess their individual benefits. To foster further reproducible research, we provide all our implementations and experimental results as part of the open source LibKGE framework.",/pdf/d8532341877a4ce6e4fee643e629af2957579771.pdf,ICLR,2020,We study the impact of training strategies on the performance of knowledge graph embeddings. +S1g_t1StDB,H1ldhIA_wH,1569440000000.0,1577170000000.0,1847,Self-Educated Language Agent with Hindsight Experience Replay for Instruction Following,"[""geoffrey.cideron@inria.fr"", ""mathieu.seurin@inria.fr"", ""fstrub@google.com"", ""pietquin@google.com""]","[""Geoffrey Cideron"", ""Mathieu Seurin"", ""Florian Strub"", ""Olivier Pietquin""]","[""Language"", ""reinforcement learning"", ""instruction following"", ""Hindsight Experience Replay""]","Language creates a compact representation of the world and allows the description of unlimited situations and objectives through compositionality. These properties make it a natural fit to guide the training of interactive agents as it could ease recurrent challenges in Reinforcement Learning such as sample complexity, generalization, or multi-tasking. Yet, it remains an open-problem to relate language and RL in even simple instruction following scenarios. 
Current methods rely on expert demonstrations, auxiliary losses, or inductive biases in neural architectures. In this paper, we propose an orthogonal approach called Textual Hindsight Experience Replay (THER) that extends the Hindsight Experience Replay approach to the language setting. Whenever the agent does not fulfill its instruction, THER learns to output a new directive that matches the agent trajectory, and it relabels the episode with a positive reward. To do so, THER learns to map a state into an instruction by using past successful trajectories, which removes the need to have external expert interventions to relabel episodes as in vanilla HER. We observe that this simple idea also initiates a learning synergy between language acquisition and policy learning on instruction following tasks in the BabyAI environment. ",/pdf/7efaac499dc3cf86f7465c9046d13525f452ab7b.pdf,ICLR,2020, +B1ethsR9Ym,B1eKGST9KX,1538090000000.0,1545360000000.0,733,"Look Ma, No GANs! Image Transformation with ModifAE","[""chada@ucsd.edu"", ""b4tam@ucsd.edu""]","[""Chad Atalla"", ""Bartholomew Tam"", ""Amanda Song"", ""Gary Cottrell""]","[""Computer Vision"", ""Deep Learning"", ""Autoencoder"", ""GAN"", ""Image Modification"", ""Social Traits"", ""Social Psychology""]","Existing methods of image to image translation require multiple steps in the training or modification process, and suffer from either an inability to generalize, or long training times. These methods also focus on binary trait modification, ignoring continuous traits. To address these problems, we propose ModifAE: a novel standalone neural network, trained exclusively on an autoencoding task, that implicitly learns to make continuous trait image modifications. As a standalone image modification network, ModifAE requires fewer parameters and less time to train than existing models. 
We empirically show that ModifAE produces significantly more convincing and more consistent continuous face trait modifications than the previous state-of-the-art model.",/pdf/0e0aa9731c95e85abb8abc36e85bc1bc938ee415.pdf,ICLR,2019,"ModifAE is a standalone neural network, trained exclusively on an autoencoding task, that implicitly learns to make image modifications (without GANs)." +1YLJDvSx6J4,bWoeeVA_b_,1601310000000.0,1621130000000.0,1954,Learning from Protein Structure with Geometric Vector Perceptrons,"[""~Bowen_Jing1"", ""~Stephan_Eismann1"", ""psuriana@stanford.edu"", ""~Raphael_John_Lamarre_Townshend1"", ""~Ron_Dror1""]","[""Bowen Jing"", ""Stephan Eismann"", ""Patricia Suriana"", ""Raphael John Lamarre Townshend"", ""Ron Dror""]","[""structural biology"", ""graph neural networks"", ""proteins"", ""geometric deep learning""]","Learning on 3D structures of large biomolecules is emerging as a distinct area in machine learning, but there has yet to emerge a unifying network architecture that simultaneously leverages the geometric and relational aspects of the problem domain. To address this gap, we introduce geometric vector perceptrons, which extend standard dense layers to operate on collections of Euclidean vectors. Graph neural networks equipped with such layers are able to perform both geometric and relational reasoning on efficient representations of macromolecules. We demonstrate our approach on two important problems in learning from protein structure: model quality assessment and computational protein design. Our approach improves over existing classes of architectures on both problems, including state-of-the-art convolutional neural networks and graph neural networks. We release our code at https://github.com/drorlab/gvp.",/pdf/9f3a4c7663dfbd524333bd5b864a86b7840e5808.pdf,ICLR,2021,We introduce a novel graph neural network layer to learn from the structure of macromolecules. 
+BfayGoTV4iQ,2ikgufvDmOc,1601310000000.0,1614990000000.0,754,SketchEmbedNet: Learning Novel Concepts by Imitating Drawings,"[""~Alexander_Wang1"", ""~Mengye_Ren1"", ""~Richard_Zemel1""]","[""Alexander Wang"", ""Mengye Ren"", ""Richard Zemel""]","[""generative"", ""probabilistic"", ""sketch"", ""drawing"", ""few-shot learning"", ""classification"", ""embedding learning""]",Sketch drawings are an intuitive visual domain that appeals to human instinct. Previous work has shown that recurrent neural networks are capable of producing sketch drawings of a single or few classes at a time. In this work we investigate representations developed by training a generative model to produce sketches from pixel images across many classes in a sketch domain. We find that the embeddings learned by this sketching model are extremely informative for visual tasks and infer a unique visual understanding. We then use them to exceed state-of-the-art performance in unsupervised few-shot classification on the Omniglot and mini-ImageNet benchmarks. We also leverage the generative capacity of our model to produce high quality sketches of novel classes based on just a single example. ,/pdf/35b098f60e81801bebdeb4ebb37fe52a7fac10c6.pdf,ICLR,2021,Learning a generative image to sketch model for few-shot classification and novel visual understanding in both embedding and image space. +Skp1ESxRZ,HJnkVHx0Z,1509080000000.0,1521490000000.0,252,Towards Synthesizing Complex Programs From Input-Output Examples,"[""xinyun.chen@berkeley.edu"", ""liuchang@eecs.berkeley.edu"", ""dawnsong.travel@gmail.com""]","[""Xinyun Chen"", ""Chang Liu"", ""Dawn Song""]",[],"In recent years, deep learning techniques have been developed to improve the performance of program synthesis from input-output examples. Albeit its significant progress, the programs that can be synthesized by state-of-the-art approaches are still simple in terms of their complexity. 
In this work, we move a significant step forward along this direction by proposing a new class of challenging tasks in the domain of program synthesis from input-output examples: learning a context-free parser from pairs of input programs and their parse trees. We show that this class of tasks are much more challenging than previously studied tasks, and the test accuracy of existing approaches is almost 0%. + +We tackle the challenges by developing three novel techniques inspired by three novel observations, which reveal the key ingredients of using deep learning to synthesize a complex program. First, the use of a non-differentiable machine is the key to effectively restrict the search space. Thus our proposed approach learns a neural program operating a domain-specific non-differentiable machine. Second, recursion is the key to achieve generalizability. Thus, we bake-in the notion of recursion in the design of our non-differentiable machine. Third, reinforcement learning is the key to learn how to operate the non-differentiable machine, but it is also hard to train the model effectively with existing reinforcement learning algorithms from a cold boot. We develop a novel two-phase reinforcement learning-based search algorithm to overcome this issue. In our evaluation, we show that using our novel approach, neural parsing programs can be learned to achieve 100% test accuracy on test inputs that are 500x longer than the training samples.",/pdf/59dedb5d936deed03f5aa20caeb946ad07d5cee5.pdf,ICLR,2018, +r1e_FpNFDr,BkeHyfaPDr,1569440000000.0,1586320000000.0,675,Generalization bounds for deep convolutional neural networks,"[""plong@google.com"", ""hsedghi@google.com""]","[""Philip M. Long"", ""Hanie Sedghi""]","[""generalization"", ""convolutional networks"", ""statistical learning theory""]","We prove bounds on the generalization error of convolutional networks. 
+The bounds are in terms of the training loss, the number of +parameters, the Lipschitz constant of the loss and the distance from +the weights to the initial weights. They are independent of the +number of pixels in the input, and the height and width of hidden +feature maps. +We present experiments using CIFAR-10 with varying +hyperparameters of a deep convolutional network, comparing our bounds +with practical generalization gaps.",/pdf/dbc3b3017f85ac85fbb19c28da04256fe1826696.pdf,ICLR,2020,We prove generalization bounds for convolutional neural networks that take account of weight-tying +S1gEIerYwH,B1lky5gtPH,1569440000000.0,1583910000000.0,2320,Transferring Optimality Across Data Distributions via Homotopy Methods,"[""gargiani@informatik.uni-freiburg.de"", ""andrea.zanelli@imtek.uni-freiburg.de"", ""quoctd@email.unc.edu"", ""moritz.diehl@imtek.uni-freiburg.de"", ""fh@cs.uni-freiburg.de""]","[""Matilde Gargiani"", ""Andrea Zanelli"", ""Quoc Tran Dinh"", ""Moritz Diehl"", ""Frank Hutter""]","[""deep learning"", ""numerical optimization"", ""transfer learning""]","Homotopy methods, also known as continuation methods, are a powerful mathematical tool to efficiently solve various problems in numerical analysis, including complex non-convex optimization problems where no or only little prior knowledge regarding the localization of the solutions is available. +In this work, we propose a novel homotopy-based numerical method that can be used to transfer knowledge regarding the localization of an optimum across different task distributions in deep learning applications. We validate the proposed methodology with some empirical evaluations in the regression and classification scenarios, where it shows that superior numerical performance can be achieved in popular deep learning benchmarks, i.e. FashionMNIST, CIFAR-10, and draw connections with the widely used fine-tuning heuristic. 
In addition, we give more insights on the properties of a general homotopy method when used in combination with Stochastic Gradient Descent by conducting a general local theoretical analysis in a simplified setting. ",/pdf/933d30e76e8498d1b5bbb76e00b17ef92b4e9981.pdf,ICLR,2020,"We propose a new homotopy-based method to transfer ""optimality knowledge"" across different data distributions in order to speed up training of deep models. " +ByeWogStDS,r1lqUk-FDH,1569440000000.0,1583910000000.0,2497,Sub-policy Adaptation for Hierarchical Reinforcement Learning,"[""alexli1@berkeley.edu"", ""florensa@berkeley.edu"", ""iclavera@berkeley.edu"", ""pabbeel@berkeley.edu""]","[""Alexander Li"", ""Carlos Florensa"", ""Ignasi Clavera"", ""Pieter Abbeel""]","[""Hierarchical Reinforcement Learning"", ""Transfer"", ""Skill Discovery""]","Hierarchical reinforcement learning is a promising approach to tackle long-horizon decision-making problems with sparse rewards. Unfortunately, most methods still decouple the lower-level skill acquisition process and the training of a higher level that controls the skills in a new task. Leaving the skills fixed can lead to significant sub-optimality in the transfer setting. In this work, we propose a novel algorithm to discover a set of skills, and continuously adapt them along with the higher level even when training on a new task. Our main contributions are two-fold. First, we derive a new hierarchical policy gradient with an unbiased latent-dependent baseline, and we introduce Hierarchical Proximal Policy Optimization (HiPPO), an on-policy method to efficiently train all levels of the hierarchy jointly. Second, we propose a method of training time-abstractions that improves the robustness of the obtained skills to environment changes. 
Code and videos are available at sites.google.com/view/hippo-rl.",/pdf/3182ba9f57df7bad5db6033a3217209b9606dd6e.pdf,ICLR,2020,"We propose HiPPO, a stable Hierarchical Reinforcement Learning algorithm that can train several levels of the hierarchy simultaneously, giving good performance both in skill discovery and adaptation." +HJeRkh05Km,r1x8gPpqF7,1538090000000.0,1551750000000.0,1033,Visual Semantic Navigation using Scene Priors,"[""wyang@ee.cuhk.edu.hk"", ""xiaolonw@cs.cmu.edu"", ""ali@cs.washington.edu"", ""abhinavg@cs.cmu.edu"", ""roozbehm@allenai.org""]","[""Wei Yang"", ""Xiaolong Wang"", ""Ali Farhadi"", ""Abhinav Gupta"", ""Roozbeh Mottaghi""]","[""Visual Navigation"", ""Scene Prior"", ""Knowledge Graph"", ""Graph Convolution Networks"", ""Deep Reinforcement Learning""]","How do humans navigate to target objects in novel scenes? Do we use the semantic/functional priors we have built over years to efficiently search and navigate? For example, to search for mugs, we search cabinets near the coffee machine and for fruits we try the fridge. In this work, we focus on incorporating semantic priors in the task of semantic navigation. We propose to use Graph Convolutional Networks for incorporating the prior knowledge into a deep reinforcement learning framework. The agent uses the features from the knowledge graph to predict the actions. For evaluation, we use the AI2-THOR framework. Our experiments show how semantic knowledge improves the performance significantly. 
More importantly, we show improvement in generalization to unseen scenes and/or objects.",/pdf/289b7fed4ba41a9243344c4dfbbbc985f0ce6d20.pdf,ICLR,2019, +Bylmkh05KX,SyxCUa_cK7,1538090000000.0,1549150000000.0,973,Unsupervised Speech Recognition via Segmental Empirical Output Distribution Matching,"[""cjyeh@cs.cmu.edu"", ""chenjianshu@gmail.com"", ""czyu@tencent.com"", ""dyu@tencent.com""]","[""Chih-Kuan Yeh"", ""Jianshu Chen"", ""Chengzhu Yu"", ""Dong Yu""]","[""Unsupervised speech recognition"", ""unsupervised learning"", ""phoneme classification""]","We consider the problem of training speech recognition systems without using any labeled data, under the assumption that the learner can only access the input utterances and a phoneme language model estimated from a non-overlapping corpus. We propose a fully unsupervised learning algorithm that alternates between solving two sub-problems: (i) learning a phoneme classifier for a given set of phoneme segmentation boundaries, and (ii) refining the phoneme boundaries based on a given classifier. To solve the first sub-problem, we introduce a novel unsupervised cost function named Segmental Empirical Output Distribution Matching, which generalizes the work in (Liu et al., 2017) to segmental structures. For the second sub-problem, we develop an approximate MAP approach to refining the boundaries obtained from Wang et al. (2017). Experimental results on the TIMIT dataset demonstrate the success of this fully unsupervised phoneme recognition system, which achieves a phone error rate (PER) of 41.6%. Although it is still far away from the state-of-the-art supervised systems, we show that with oracle boundaries and matching language model, the PER could be improved to 32.5%. 
This performance approaches the supervised system of the same model architecture, demonstrating the great potential of the proposed method.",/pdf/d6e5aea20fc83505d6a7d5984dd0c36743d71843.pdf,ICLR,2019, +rJg8NertPr,S1eOdIxKvH,1569440000000.0,1577170000000.0,2248,Top-down training for neural networks,"[""s1603602@sms.ed.ac.uk"", ""cong-thanh.do@crl.toshiba.co.uk"", ""rama.doddipatla@crl.toshiba.co.uk"", ""e.loweimi@ed.ac.uk"", ""peter.bell@ed.ac.uk"", ""s.renals@ed.ac.uk""]","[""Shucong Zhang"", ""Cong-Thanh Do"", ""Rama Doddipatla"", ""Erfan Loweimi"", ""Peter Bell"", ""Steve Renals""]","[""Neural network training"", ""speech recognition""]","Vanishing gradients pose a challenge when training deep neural networks, resulting in the top layers (closer to the output) in the network learning faster when compared with lower layers closer to the input. Interpreting the top layers as a classifier and the lower layers a feature extractor, one can hypothesize that unwanted network convergence may occur when the classifier has overfit with respect to the feature extractor. This can lead to the feature extractor being under-trained, possibly failing to learn much about the patterns in the input data. To address this we propose a good classifier hypothesis: given a fixed classifier that partitions the space well, the feature extractor can be further trained to fit that classifier and learn the data patterns well. This alleviates the problem of under-training the feature extractor and enables the network to learn patterns in the data with small partial derivatives. We verify this hypothesis empirically and propose a novel top-down training method. We train all layers jointly, obtaining a good classifier from the top layers, which are then frozen. Following re-initialization, we retrain the bottom layers with respect to the frozen classifier. 
Applying this approach to a set of speech recognition experiments using the Wall Street Journal and noisy CHiME-4 datasets we observe substantial accuracy gains. When combined with dropout, our method enables connectionist temporal classification (CTC) models to outperform joint CTC-attention models, which have more capacity and flexibility. ",/pdf/7cb1d776b42db0ad250e0d61e4379be1c1452611.pdf,ICLR,2020, +Ptaz_zIFbX,uu-i3izo4g3,1601310000000.0,1615560000000.0,1555,Prediction and generalisation over directed actions by grid cells,"[""~Changmin_Yu1"", ""behrens@fmrib.ox.ac.uk"", ""n.burgess@ucl.ac.uk""]","[""Changmin Yu"", ""Timothy Behrens"", ""Neil Burgess""]","[""Computational neuroscience"", ""grid cells"", ""normative models""]","Knowing how the effects of directed actions generalise to new situations (e.g. moving North, South, East and West, or turning left, right, etc.) is key to rapid generalisation across new situations. Markovian tasks can be characterised by a state space and a transition matrix and recent work has proposed that neural grid codes provide an efficient representation of the state space, as eigenvectors of a transition matrix reflecting diffusion across states, that allows efficient prediction of future state distributions. Here we extend the eigenbasis prediction model, utilising tools from Fourier analysis, to prediction over arbitrary translation-invariant directed transition structures (i.e. displacement and diffusion), showing that a single set of eigenvectors can support predictions over arbitrary directed actions via action-specific eigenvalues. We show how to define a ""sense of direction"" to combine actions to reach a target state (ignoring task-specific deviations from translation-invariance), and demonstrate that adding the Fourier representations to a deep Q network aids policy learning in continuous control tasks. 
We show the equivalence between the generalised prediction framework and traditional models of grid cell firing driven by self-motion to perform path integration, either using oscillatory interference (via Fourier components as velocity-controlled oscillators) or continuous attractor networks (via analysis of the update dynamics). We thus provide a unifying framework for the role of the grid system in predictive planning, sense of direction and path integration: supporting generalisable inference over directed actions across different tasks.",/pdf/708e623b918bb762d8cc7249aa89623b0dad19d6.pdf,ICLR,2021,"Extending existing normative prediction models of grid cells to directed transitions, and provide a unifying framework for mechanistic and normative models of grid cells." +SJekyhCctQ,HylvtrgcK7,1538090000000.0,1545360000000.0,947,Detecting Adversarial Examples Via Neural Fingerprinting,"[""sdathath@caltech.edu"", ""st.t.zheng@gmail.com"", ""yyue@caltech.edu"", ""murray@cds.caltech.edu""]","[""Sumanth Dathathri"", ""Stephan Zheng"", ""Yisong Yue"", ""Richard M. Murray""]","[""Adversarial Attacks"", ""Deep Neural Networks""]","Deep neural networks are vulnerable to adversarial examples: input data that has been manipulated to cause dramatic model output errors. To defend against such attacks, we propose NeuralFingerprinting: a simple, yet effective method to detect adversarial examples that verifies whether model behavior is consistent with a set of fingerprints. These fingerprints are encoded into the model response during training and are inspired by the use of biometric and cryptographic signatures. In contrast to previous defenses, our method does not rely on knowledge of the adversary and can scale to large networks and input data. The benefits of our method are that 1) it is fast, 2) it is prohibitively expensive for an attacker to reverse-engineer which fingerprints were used, and 3) it does not assume knowledge of the adversary. 
In this work, we 1) theoretically analyze NeuralFingerprinting for linear models and 2) show that NeuralFingerprinting significantly improves on state-of-the-art detection mechanisms for deep neural networks, by detecting the strongest known adversarial attacks with 98-100% AUC-ROC scores on the MNIST, CIFAR-10 and MiniImagenet (20 classes) datasets. In particular, we consider several threat models, including the most conservative one in which the attacker has full knowledge of the defender's strategy. In all settings, the detection accuracy of NeuralFingerprinting generalizes well to unseen test-data and is robust over a wide range of hyperparameters.",/pdf/48a1cf0e03d017db5d8f1531f19450597c321596.pdf,ICLR,2019,"Novel technique for detecting adversarial examples -- robust across gradient-based and gradient-free attacks, AUC-ROC >95%" +HyfHgI6aW,BJZBgLa6b,1508890000000.0,1518730000000.0,73,Memory Augmented Control Networks,"[""arbaazk@seas.upenn.edu"", ""clarkz@seas.upenn.edu"", ""natanasov@ucsd.edu"", ""konstantinos.karydis@ucr.edu"", ""vijay.kumar@seas.upenn.edu"", ""ddlee@seas.upenn.edu""]","[""Arbaaz Khan"", ""Clark Zhang"", ""Nikolay Atanasov"", ""Konstantinos Karydis"", ""Vijay Kumar"", ""Daniel D. Lee""]","[""planning"", ""memory networks"", ""deep learning"", ""robotics""]","Planning problems in partially observable environments cannot be solved directly with convolutional networks and require some form of memory. But, even memory networks with sophisticated addressing schemes are unable to learn intelligent reasoning satisfactorily due to the complexity of simultaneously learning to access memory and plan. To mitigate these challenges we propose the Memory Augmented Control Network (MACN). The network splits planning into a hierarchical process. At a lower level, it learns to plan in a locally observed space. 
At a higher level, it uses a collection of policies computed on locally observed spaces to learn an optimal plan in the global environment it is operating in. The performance of the network is evaluated on path planning tasks in environments in the presence of simple and complex obstacles and in addition, is tested for its ability to generalize to new environments not seen in the training set.",/pdf/158d5dbddc51e8efbd2fa9c8c9a6e1701aa9030e.pdf,ICLR,2018,Memory Augmented Network to plan in partially observable environments. +7K0UUL9y9lE,z3thW2YUWen,1601310000000.0,1614990000000.0,2731,You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling,"[""~Zhanpeng_Zeng1"", ""~Yunyang_Xiong2"", ""~Sathya_N._Ravi1"", ""sachary1@amfam.com"", ""~Glenn_Fung2"", ""~Vikas_Singh1""]","[""Zhanpeng Zeng"", ""Yunyang Xiong"", ""Sathya N. Ravi"", ""Shailesh Acharya"", ""Glenn Fung"", ""Vikas Singh""]","[""self-attention"", ""efficient"", ""linear complexity"", ""language model"", ""transformer"", ""BERT""]","Transformer-based models have come to dominate the landscape in a wide range of natural language processing (NLP) applications. The heart of the transformer model is the self-attention mechanism, which captures the interactions of token pairs in the input sequences and consequently, depends quadratically on the input sequence length. It is known that training such models on longer sequences is quite expensive, and often, prohibitively so. We show that a Bernoulli sampling attention mechanism based on Locality Sensitive Hashing (LSH), decreases the quadratic complexity to linear. We bypass the quadratic cost by considering self-attention as a sum of individual tokens associated with Bernoulli random variables that can, in principle, be sampled at once by a single hash (although in practice, this number may be a small constant). 
This leads to an efficient sampling scheme to estimate self-attention which relies on specific modifications of LSH (based on feasibility of deployment on GPU architectures). We evaluate our proposed algorithm on the GLUE benchmark with standard 512 sequence length and our method achieves comparable or even slightly better performance than a standard pretrained Transformer. To evaluate whether our method can indeed handle longer sequences, we conduct experiments on long sequence (4096) language model pretraining and achieve consistent results as standard self-attention, while observing sizable inference speed-ups and memory savings.",/pdf/8d20671e798bdd1831d34dd1e0d8c57126006ab5.pdf,ICLR,2021, +Jf24xdaAwF9,1FtBPJJQH_Y,1601310000000.0,1614990000000.0,2033,Self-Activating Neural Ensembles for Continual Reinforcement Learning,"[""~Sam_Powers1"", ""~Abhinav_Gupta1""]","[""Sam Powers"", ""Abhinav Gupta""]","[""continual reinforcement learning"", ""lifelong learning"", ""deep reinforcement learning""]","The ability for an agent to continuously learn new skills without catastrophically forgetting existing knowledge is of critical importance for the development of generally intelligent agents. Most methods devised to address this problem depend heavily on well-defined task boundaries which simplify the problem considerably. Our task-agnostic method, Self-Activating Neural Ensembles (SANE), uses a hierarchical modular architecture designed to avoid catastrophic forgetting without making any such assumptions. At each timestep a path through the SANE tree is activated; during training only activated nodes are updated, ensuring that unused nodes do not undergo catastrophic forgetting. Additionally, new nodes are created as needed, allowing the system to leverage and retain old skills while growing and learning new ones. 
We demonstrate our approach on MNIST and a set of grid world environments, demonstrating that SANE does not undergo catastrophic forgetting where existing methods do.",/pdf/d73271e7608216ed24dbf3c32dd23dfb632639b6.pdf,ICLR,2021,We present a novel tree-structured neural architecture that enables the learning of tasks sequentially. +SJxE3jlA-,rJR7hsgC-,1509110000000.0,1518730000000.0,375,Now I Remember! Episodic Memory For Reinforcement Learning,"[""riloynd@microsoft.com"", ""mahauskn@microsoft.com"", ""lihongli.cs@gmail.com"", ""l.deng@ieee.org""]","[""Ricky Loynd"", ""Matthew Hausknecht"", ""Lihong Li"", ""Li Deng""]","[""Reinforcement learning"", ""Deep learning"", ""Episodic memory""]","Humans rely on episodic memory constantly, in remembering the name of someone they met 10 minutes ago, the plot of a movie as it unfolds, or where they parked the car. Endowing reinforcement learning agents with episodic memory is a key step on the path toward replicating human-like general intelligence. We analyze why standard RL agents lack episodic memory today, and why existing RL tasks don't require it. We design a new form of external memory called Masked Experience Memory, or MEM, modeled after key features of human episodic memory. To evaluate episodic memory we define an RL task based on the common children's game of Concentration. We find that a MEM RL agent leverages episodic memory effectively to master Concentration, unlike the baseline agents we tested.",/pdf/cf582ceb1d72518cf66be6e3061f028332357fa4.pdf,ICLR,2018,Implementing and evaluating episodic memory for RL. 
+fTeb_adw5y4,h8sYev-T5oo,1601310000000.0,1614990000000.0,2946,Improving Calibration through the Relationship with Adversarial Robustness,"[""~Yao_Qin1"", ""~Xuezhi_Wang3"", ""~Alex_Beutel1"", ""~Ed_Chi1""]","[""Yao Qin"", ""Xuezhi Wang"", ""Alex Beutel"", ""Ed Chi""]","[""Calibration"", ""Uncertainty Estimates"", ""Adversarial Robustness""]","Neural networks lack adversarial robustness -- they are vulnerable to adversarial examples that through small perturbations to inputs cause incorrect predictions. Further, trust is undermined when models give miscalibrated uncertainty estimates, i.e. the predicted probability is not a good indicator of how much we should trust our model. In this paper, we study the connection between adversarial robustness and calibration on four classification networks and datasets. We find that the inputs for which the model is sensitive to small perturbations (are easily attacked) are more likely to have poorly calibrated predictions. Based on this insight, we examine if calibration can be improved by addressing those adversarially unrobust inputs. To this end, we propose Adversarial Robustness based Adaptive Label Smoothing (AR-AdaLS) that integrates the correlations of adversarial robustness and uncertainty into training by adaptively softening labels for an example based on how easily it can be attacked by an adversary. We find that our method, taking the adversarial robustness of the in-distribution data into consideration, leads to better calibration over the model even under distributional shifts. 
In addition, AR-AdaLS can also be applied to an ensemble model to further improve model's calibration.",/pdf/cdcef30fc5bc532a48436f64878059dd115ecd96.pdf,ICLR,2021, +Hkla1eHFvS,rJloqsJFDH,1569440000000.0,1577170000000.0,2079,Efficient Exploration via State Marginal Matching,"[""lslee@cs.cmu.edu"", ""beysenba@cs.cmu.edu"", ""eparisot@cs.cmu.edu"", ""epxing@cs.cmu.edu"", ""svlevine@eecs.berkeley.edu"", ""rsalakhu@cs.cmu.edu""]","[""Lisa Lee"", ""Benjain Eysenbach"", ""Emilio Parisotto"", ""Erix Xing"", ""Sergey Levine"", ""Ruslan Salakhutdinov""]","[""reinforcement learning"", ""exploration"", ""distribution matching"", ""robotics""]","Reinforcement learning agents need to explore their unknown environments to solve the tasks given to them. The Bayes optimal solution to exploration is intractable for complex environments, and while several exploration methods have been proposed as approximations, it remains unclear what underlying objective is being optimized by existing exploration methods, or how they can be altered to incorporate prior knowledge about the task. Moreover, it is unclear how to acquire a single exploration strategy that will be useful for solving multiple downstream tasks. We address these shortcomings by learning a single exploration policy that can quickly solve a suite of downstream tasks in a multi-task setting, amortizing the cost of learning to explore. We recast exploration as a problem of State Marginal Matching (SMM), where we aim to learn a policy for which the state marginal distribution matches a given target state distribution, which can incorporate prior knowledge about the task. We optimize the objective by reducing it to a two-player, zero-sum game between a state density model and a parametric policy. 
Our theoretical analysis of this approach suggests that prior exploration methods do not learn a policy that does distribution matching, but acquire a replay buffer that performs distribution matching, an observation that potentially explains these prior methods' success in single-task settings. On both simulated and real-world tasks, we demonstrate that our algorithm explores faster and adapts more quickly than prior methods.",/pdf/810f9cfebb4d95d282380ce7d64cc0e75ff43f27.pdf,ICLR,2020,We view exploration in RL as a problem of matching a marginal distribution over states. +73WTGs96kho,xOz5yPVePCv,1601520000000.0,1616540000000.0,1574,Net-DNF: Effective Deep Modeling of Tabular Data,"[""~Liran_Katzir1"", ""~Gal_Elidan1"", ""~Ran_El-Yaniv1""]","[""Liran Katzir"", ""Gal Elidan"", ""Ran El-Yaniv""]","[""Neural Networks"", ""Architectures"", ""Tabular Data"", ""Predictive Modeling""]","A challenging open question in deep learning is how to handle tabular data. Unlike domains such as image and natural language processing, where deep architectures prevail, there is still no widely accepted neural architecture that dominates tabular data. As a step toward bridging this gap, we present Net-DNF, a novel generic architecture whose inductive bias elicits models whose structure corresponds to logical Boolean formulas in disjunctive normal form (DNF) over affine soft-threshold decision terms. Net-DNFs also promote localized decisions that are taken over small subsets of the features. We present extensive experiments showing that Net-DNFs significantly and consistently outperform fully connected networks over tabular data. With relatively few hyperparameters, Net-DNFs open the door to practical end-to-end handling of tabular data using neural networks. We present ablation studies, which justify the design choices of Net-DNF including the inductive bias elements, namely, Boolean formulation, locality, and feature selection. 
+",/pdf/0faa3d246b873caa0ff0a42200c685558260733c.pdf,ICLR,2021,Neural network architecture for tabular data +SkB-_mcel,,1478270000000.0,1486540000000.0,194,Central Moment Discrepancy (CMD) for Domain-Invariant Representation Learning,"[""werner.zellinger@jku.at"", ""thomas.grubinger@scch.at"", ""edwin.lughofer@jku.at"", ""thomas.natschlaeger@scch.at"", ""susanne.saminger-platz@jku.at""]","[""Werner Zellinger"", ""Thomas Grubinger"", ""Edwin Lughofer"", ""Thomas Natschl\u00e4ger"", ""Susanne Saminger-Platz""]","[""Transfer Learning"", ""Deep learning"", ""Computer vision""]","The learning of domain-invariant representations in the context of domain adaptation with neural networks is considered. We propose a new regularization method that minimizes the domain-specific latent feature representations directly in the hidden activation space. Although some standard distribution matching approaches exist that can be interpreted as the matching of weighted sums of moments, e.g. Maximum Mean Discrepancy (MMD), an explicit order-wise matching of higher order moments has not been considered before. +We propose to match the higher order central moments of probability distributions by means of order-wise moment differences. Our model does not require computationally expensive distance and kernel matrix computations. We utilize the equivalent representation of probability distributions by moment sequences to define a new distance function, called Central Moment Discrepancy (CMD). We prove that CMD is a metric on the set of probability distributions on a compact interval. We further prove that convergence of probability distributions on compact intervals w.r.t. the new metric implies convergence in distribution of the respective random variables. +We test our approach on two different benchmark data sets for object recognition (Office) and sentiment analysis of product reviews (Amazon reviews). 
CMD achieves a new state-of-the-art performance on most domain adaptation tasks of Office and outperforms networks trained with MMD, Variational Fair Autoencoders and Domain Adversarial Neural Networks on Amazon reviews. In addition, a post-hoc parameter sensitivity analysis shows that the new approach is stable w. r. t. parameter changes in a certain interval. The source code of the experiments is publicly available.",/pdf/d6c8afc9a2a8dac0751a9a51a4f452ddd3b55bb9.pdf,ICLR,2017,A new method for hidden activation distribution matching in the context of domain adaptation. +SkexNpNFwS,ryeAB2-PvH,1569440000000.0,1577170000000.0,473,Potential Flow Generator with $L_2$ Optimal Transport Regularity for Generative Models,"[""liu_yang@brown.edu"", ""george_karniadakis@brown.edu""]","[""Liu Yang"", ""George Em Karniadakis""]","[""generative models"", ""optimal transport"", ""GANs"", ""flow-based models""]","We propose a potential flow generator with $L_2$ optimal transport regularity, which can be easily integrated into a wide range of generative models including different versions of GANs and flow-based models. With up to a slight augmentation of the original generator loss functions, our generator is not only a transport map from the input distribution to the target one, but also the one with minimum $L_2$ transport cost. We show the correctness and robustness of the potential flow generator in several 2D problems, and illustrate the concept of ``proximity'' due to the $L_2$ optimal transport regularity. Subsequently, we demonstrate the effectiveness of the potential flow generator in image translation tasks with unpaired training data from the MNIST dataset and the CelebA dataset. ",/pdf/25a7b5dbd2a8a4633471178ae89dbb79f3011c07.pdf,ICLR,2020,"We propose a special generator with $L_2$ optimal transport regularity, which can be easily integrated into a wide range of generative models." 
+Hke-WTVtwr,S1xpuz8UwH,1569440000000.0,1583910000000.0,364,Encoding word order in complex embeddings,"[""wang@dei.unipd.it"", ""zhaodh@tju.edu.cn"", ""chrh@di.ku.dk"", ""qiuchili@dei.unipd.it"", ""pzhang@tju.edu.cn"", ""simonsen@di.ku.dk""]","[""Benyou Wang"", ""Donghao Zhao"", ""Christina Lioma"", ""Qiuchi Li"", ""Peng Zhang"", ""Jakob Grue Simonsen""]","[""word embedding"", ""complex-valued neural network"", ""position embedding""]","Sequential word order is important when processing text. Currently, neural networks (NNs) address this by modeling word position using position embeddings. The problem is that position embeddings capture the position of individual words, but not the ordered relationship (e.g., adjacency or precedence) between individual word positions. We present a novel and principled solution for modeling both the global absolute positions of words and their order relationships. Our solution generalizes word embeddings, previously defined as independent vectors, to continuous word functions over a variable (position). The benefit of continuous functions over variable positions is that word representations shift smoothly with increasing positions. Hence, word representations in different positions can correlate with each other in a continuous function. The general solution of these functions can be extended to complex-valued variants. We extend CNN, RNN and Transformer NNs to complex-valued versions to incorporate our complex embedding (we make all code available). Experiments on text classification, machine translation and language modeling show gains over both classical word embeddings and position-enriched word embeddings. 
To our knowledge, this is the first work in NLP to link imaginary numbers in complex-valued representations to concrete meanings (i.e., word order).",/pdf/be07209dc3a935c61ba2a7f215feb20a760e70e5.pdf,ICLR,2020, +B1eyO1BFPr,B1xxcRp_PS,1569440000000.0,1583910000000.0,1789,"Don't Use Large Mini-batches, Use Local SGD","[""tao.lin@epfl.ch"", ""sebastian.stich@epfl.ch"", ""kumarkshitijpatel@gmail.com"", ""martin.jaggi@epfl.ch""]","[""Tao Lin"", ""Sebastian U. Stich"", ""Kumar Kshitij Patel"", ""Martin Jaggi""]",[],"Mini-batch stochastic gradient methods (SGD) are state of the art for distributed training of deep neural networks. +Drastic increases in the mini-batch sizes have led to key efficiency and scalability gains in recent years. +However, progress faces a major roadblock, as models trained with large batches often do not generalize well, i.e. they do not show good accuracy on new data. +As a remedy, we propose a \emph{post-local} SGD and show that it significantly improves the generalization performance compared to large-batch training on standard benchmarks while enjoying the same efficiency (time-to-accuracy) and scalability. We further provide an extensive study of the communication efficiency vs. performance trade-offs associated with a host of \emph{local SGD} variants. +",/pdf/f32f07a66ae73c7caacc58ac5f8ea643d5fa4d48.pdf,ICLR,2020, +r1l4eQW0Z,H1Oflmb0Z,1509140000000.0,1519400000000.0,1106,Kernel Implicit Variational Inference,"[""shijx15@mails.tsinghua.edu.cn"", ""ssy@cs.toronto.edu"", ""dcszj@tsinghua.edu.cn""]","[""Jiaxin Shi"", ""Shengyang Sun"", ""Jun Zhu""]","[""Variational inference"", ""Bayesian neural networks"", ""Implicit distribution""]","Recent progress in variational inference has paid much attention to the flexibility of variational posteriors. One promising direction is to use implicit distributions, i.e., distributions without tractable densities as the variational posterior. 
However, existing methods on implicit posteriors still face challenges of noisy estimation and computational infeasibility when applied to models with high-dimensional latent variables. In this paper, we present a new approach named Kernel Implicit Variational Inference that addresses these challenges. As far as we know, for the first time implicit variational inference is successfully applied to Bayesian neural networks, which shows promising results on both regression and classification tasks.",/pdf/e39028af7ac06a671852fe2ef1c1d2413599c3f8.pdf,ICLR,2018, +HJg2b0VYDr,ryeuEv4_wr,1569440000000.0,1603760000000.0,979,Selection via Proxy: Efficient Data Selection for Deep Learning,"[""cody@cs.stanford.edu"", ""chrisyeh@stanford.edu"", ""mussmann@stanford.edu"", ""baharanm@stanford.edu"", ""pbailis@cs.stanford.edu"", ""pliang@cs.stanford.edu"", ""jure@cs.stanford.edu"", ""matei@cs.stanford.edu""]","[""Cody Coleman"", ""Christopher Yeh"", ""Stephen Mussmann"", ""Baharan Mirzasoleiman"", ""Peter Bailis"", ""Percy Liang"", ""Jure Leskovec"", ""Matei Zaharia""]","[""data selection"", ""active-learning"", ""core-set selection"", ""deep learning"", ""uncertainty sampling""]","Data selection methods, such as active learning and core-set selection, are useful tools for machine learning on large datasets. However, they can be prohibitively expensive to apply in deep learning because they depend on feature representations that need to be learned. In this work, we show that we can greatly improve the computational efficiency by using a small proxy model to perform data selection (e.g., selecting data points to label for active learning). By removing hidden layers from the target model, using smaller architectures, and training for fewer epochs, we create proxies that are an order of magnitude faster to train. Although these small proxy models have higher error rates, we find that they empirically provide useful signals for data selection. 
We evaluate this ""selection via proxy"" (SVP) approach on several data selection tasks across five datasets: CIFAR10, CIFAR100, ImageNet, Amazon Review Polarity, and Amazon Review Full. For active learning, applying SVP can give an order of magnitude improvement in data selection runtime (i.e., the time it takes to repeatedly train and select points) without significantly increasing the final error (often within 0.1%). For core-set selection on CIFAR10, proxies that are over 10× faster to train than their larger, more accurate targets can remove up to 50% of the data without harming the final accuracy of the target, leading to a 1.6× end-to-end training time improvement.",/pdf/5f0eb987f5346cf452f7041740bb0e018fde22c7.pdf,ICLR,2020,we can significantly improve the computational efficiency of data selection in deep learning by using a much smaller proxy model to perform data selection. +Syez3j0cKX,HyerzhcctQ,1538090000000.0,1545360000000.0,688,Dissecting an Adversarial framework for Information Retrieval,"[""cs15b001@cse.iitm.ac.in"", ""miteshk@cse.iitm.ac.in""]","[""Ameet Deshpande"", ""Mitesh M.Khapra""]","[""GAN"", ""Deep Learning"", ""Reinforcement Learning""]","Recent advances in Generative Adversarial Networks facilitated by improvements to the framework and successful application to various problems has resulted in extensions to multiple domains. IRGAN attempts to leverage the framework for Information-Retrieval (IR), a task that can be described as modeling the correct conditional probability distribution p(d|q) over the documents (d), given the query (q). The work that proposes IRGAN claims that optimizing their minimax loss function will result in a generator which can learn the distribution, but their setup and baseline term steer the model away from an exact adversarial formulation, and this work attempts to point out certain inaccuracies in their formulation. 
Analyzing their loss curves gives insight into possible mistakes in the loss functions and better performance can be obtained by using the co-training like setup we propose, where two models are trained in a co-operative rather than an adversarial fashion.",/pdf/367270fe73228cf69556bd5d0d446a5c178bed6f.pdf,ICLR,2019,"Points out problems in loss function used in IRGAN, a recently proposed GAN framework for Information Retrieval. Further, a model motivated by co-training is proposed, which achieves better performance." +D51irFX8UOG,13v8r9lsGqd,1601310000000.0,1614990000000.0,371,HALMA: Humanlike Abstraction Learning Meets Affordance in Rapid Problem Solving,"[""~Sirui_Xie1"", ""~Xiaojian_Ma1"", ""~Peiyu_Yu1"", ""~Yixin_Zhu1"", ""~Ying_Nian_Wu1"", ""~Song-Chun_Zhu1""]","[""Sirui Xie"", ""Xiaojian Ma"", ""Peiyu Yu"", ""Yixin Zhu"", ""Ying Nian Wu"", ""Song-Chun Zhu""]","[""Visual Concept Development"", ""Rapid Problem Solving"", ""Abstract Reasoning""]","Humans learn compositional and causal abstraction, \ie, knowledge, in response to the structure of naturalistic tasks. When presented with a problem-solving task involving some objects, toddlers would first interact with these objects to reckon what they are and what can be done with them. Leveraging these concepts, they could understand the internal structure of this task, without seeing all of the problem instances. Remarkably, they further build cognitively executable strategies to \emph{rapidly} solve novel problems. To empower a learning agent with similar capability, we argue there shall be three levels of generalization in how an agent represents its knowledge: perceptual, conceptual, and algorithmic. In this paper, we devise the very first systematic benchmark that offers joint evaluation covering all three levels. This benchmark is centered around a novel task domain, HALMA, for visual concept development and rapid problem solving. 
Uniquely, HALMA has a minimum yet complete concept space, upon which we introduce a novel paradigm to rigorously diagnose and dissect learning agents' capability in understanding and generalizing complex and structural concepts. We conduct extensive experiments on reinforcement learning agents with various inductive biases and carefully report their proficiency and weakness.",/pdf/c5f67510009e53e1c8676c611f224eaa61975135.pdf,ICLR,2021,"We present a new testbed and benchmark, HALMA, with three levels of generalization in visual concept development and rapid problem solving." +BJxPk2A9Km,BkePw0hqYX,1538090000000.0,1545360000000.0,992,Learning What to Remember: Long-term Episodic Memory Networks for Learning from Streaming Data,"[""hyunwooj@kaist.ac.kr"", ""mshan92@kaist.ac.kr"", ""zzxc1133@kaist.ac.kr"", ""sjhwang82@kaist.ac.kr""]","[""Hyunwoo Jung"", ""Moonsu Han"", ""Minki Kang"", ""Sungju Hwang""]","[""Memory Network"", ""Lifelong Learning""]","Current generation of memory-augmented neural networks has limited scalability as they cannot efficiently process data that are too large to fit in the external memory storage. One example of this is lifelong learning scenario where the model receives unlimited length of data stream as an input which contains vast majority of uninformative entries. We tackle this problem by proposing a memory network fit for long-term lifelong learning scenario, which we refer to as Long-term Episodic Memory Networks (LEMN), that features a RNN-based retention agent that learns to replace less important memory entries based on the retention probability generated on each entry that is learned to identify data instances of generic importance relative to other memory entries, as well as its historical importance. Such learning of retention agent allows our long-term episodic memory network to retain memory entries of generic importance for a given task. 
We validate our model on a path-finding task as well as synthetic and real question answering tasks, on which our model achieves significant improvements over the memory augmented networks with rule-based memory scheduling as well as an RL-based baseline that does not consider relative or historical importance of the memory.",/pdf/a89d3595d2450b94432bfcdd2f8a8faa1b677edf.pdf,ICLR,2019, +SyevYxHtDB,BJl-bClKvH,1569440000000.0,1583910000000.0,2438,Prediction Poisoning: Towards Defenses Against DNN Model Stealing Attacks,"[""orekondy@mpi-inf.mpg.de"", ""schiele@mpi-inf.mpg.de"", ""fritz@cispa.saarland""]","[""Tribhuvanesh Orekondy"", ""Bernt Schiele"", ""Mario Fritz""]","[""model functionality stealing"", ""adversarial machine learning""]","High-performance Deep Neural Networks (DNNs) are increasingly deployed in many real-world applications e.g., cloud prediction APIs. Recent advances in model functionality stealing attacks via black-box access (i.e., inputs in, predictions out) threaten the business model of such applications, which require a lot of time, money, and effort to develop. Existing defenses take a passive role against stealing attacks, such as by truncating predicted information. We find such passive defenses ineffective against DNN stealing attacks. In this paper, we propose the first defense which actively perturbs predictions targeted at poisoning the training objective of the attacker. We find our defense effective across a wide range of challenging datasets and DNN model stealing attacks, and additionally outperforms existing defenses. 
Our defense is the first that can withstand highly accurate model stealing attacks for tens of thousands of queries, amplifying the attacker's error rate up to a factor of 85$\times$ with minimal impact on the utility for benign users.",/pdf/54a327e15441fc15cf63a461ce8a9da3b5e51713.pdf,ICLR,2020,We propose the first approach that can resist DNN model stealing/extraction attacks +in2qzBZ-Vwr,LdSpxsI_4iG,1601310000000.0,1614990000000.0,1421,Cooperating RPN's Improve Few-Shot Object Detection,"[""~Weilin_Zhang1"", ""~Yu-Xiong_Wang1"", ""~David_Forsyth1""]","[""Weilin Zhang"", ""Yu-Xiong Wang"", ""David Forsyth""]","[""Few-shot learning"", ""Object detection""]","Learning to detect an object in an image from very few training examples - few-shot object detection - is challenging, because the classifier that sees proposal boxes has very little training data. A particularly challenging training regime occurs when there are one or two training examples. In this case, if the region proposal network (RPN) misses even one high intersection-over-union (IOU) training box, the classifier's model of how object appearance varies can be severely impacted. We use multiple distinct yet cooperating RPN's. Our RPN's are trained to be different, but not too different; doing so yields significant performance improvements over state of the art for COCO and PASCAL VOC in the very few-shot setting. This effect appears to be independent of the choice of classifier or dataset. +",/pdf/f221ad808df1a6f5e077e386ae73f9036357647f.pdf,ICLR,2021,We identify the proposal neglect effect in few-shot detection and propose cooperating RPN's that yield significant performance improvements over state of the art in the very few-shot setting. 
+r1Bjj8qge,,1478290000000.0,1484170000000.0,321,Encoding and Decoding Representations with Sum- and Max-Product Networks,"[""antonio.vergari@uniba.it"", ""robert.peharz@medunigraz.at"", ""nicola.dimauro@uniba.it"", ""floriana.esposito@uniba.it""]","[""Antonio Vergari"", ""Robert Peharz"", ""Nicola Di Mauro"", ""Floriana Esposito""]",[],"Sum-Product networks (SPNs) are expressive deep architectures for representing probability distributions, yet allowing exact and efficient inference. SPNs have been successfully applied in several domains, however always as black-box distribution estimators. In this paper, we argue that due to their recursive definition, SPNs can also be naturally employed as hierarchical feature extractors and thus for unsupervised representation learning. Moreover, when converted into Max-Product Networks (MPNs), it is possible to decode such representations back into the original input space. In this way, MPNs can be interpreted as a kind of generative autoencoder, even if they were never trained to reconstruct the input data. We show how these learned representations, if visualized, indeed correspond to ""meaningful parts"" of the training data. They also yield a large improvement when used in structured prediction tasks. 
As shown in extensive experiments, SPN and MPN encoding and decoding schemes prove very competitive against the ones employing RBMs and other stacked autoencoder architectures.",/pdf/b47e88100d407f7c6ba87a9d112bc6bc7e949177.pdf,ICLR,2017,"Sum-Product Networks can be effectively employed for unsupervised representation learning, when turned into Max-Product Networks, they can also be used as encoder-decoders" +SJxDKerKDS,SklExAeKDB,1569440000000.0,1577170000000.0,2437,Reinforcement Learning with Structured Hierarchical Grammar Representations of Actions,"[""petros.christodoulou18@imperial.ac.uk"", ""rtl17@ic.ac.uk"", ""a.shafti@imperial.ac.uk"", ""a.faisal@imperial.ac.uk""]","[""Petros Christodoulou"", ""Robert Lange"", ""Ali Shafti"", ""A. Aldo Faisal""]","[""Hierarchical Reinforcement Learning"", ""Action Representations"", ""Macro-Actions"", ""Action Grammars""]","From a young age humans learn to use grammatical principles to hierarchically combine words into sentences. Action grammars is the parallel idea; that there is an underlying set of rules (a ""grammar"") that govern how we hierarchically combine actions to form new, more complex actions. We introduce the Action Grammar Reinforcement Learning (AG-RL) framework which leverages the concept of action grammars to consistently improve the sample efficiency of Reinforcement Learning agents. AG-RL works by using a grammar inference algorithm to infer the “action grammar"" of an agent midway through training, leading to a higher-level action representation. The agent's action space is then augmented with macro-actions identified by the grammar. We apply this framework to Double Deep Q-Learning (AG-DDQN) and a discrete action version of Soft Actor-Critic (AG-SAC) and find that it improves performance in 8 out of 8 tested Atari games (median +31%, max +668%) and 19 out of 20 tested Atari games (median +96%, maximum +3,756%) respectively without substantive hyperparameter tuning. 
We also show that AG-SAC beats the model-free state-of-the-art for sample efficiency in 17 out of the 20 tested Atari games (median +62%, maximum +13,140%), again without substantive hyperparameter tuning.",/pdf/87c1f81bce5a540d6880bd63c776d721ad5567dd.pdf,ICLR,2020,"We use grammar inference techniques to compose primitive actions into temporal abstractions, creating a hierarchical reinforcement learning structure that consistently improves sample efficiency." +Hk91SGWR-,SJYkHG-CZ,1509140000000.0,1518730000000.0,830,Investigating Human Priors for Playing Video Games,"[""rach0012@berkeley.edu"", ""pulkitag@berkeley.edu"", ""pathak@berkeley.edu"", ""tom_griffiths@berkeley.edu"", ""efros@eecs.berkeley.edu""]","[""Rachit Dubey"", ""Pulkit Agrawal"", ""Deepak Pathak"", ""Thomas L. Griffiths"", ""Alexei A. Efros""]","[""Prior knowledge"", ""Reinforcement learning"", ""Cognitive Science""]","What makes humans so good at solving seemingly complex video games? Unlike computers, humans bring in a great deal of prior knowledge about the world, enabling efficient decision making. This paper investigates the role of human priors for solving video games. Given a sample game, we conduct a series of ablation studies to quantify the importance of various priors. We do this by modifying the video game environment to systematically mask different types of visual information that could be used by humans as priors. We find that removal of some prior knowledge causes a drastic degradation in the speed with which human players solve the game, e.g. from 2 minutes to over 20 minutes. Furthermore, our results indicate that general priors, such as the importance of objects and visual consistency, are critical for efficient game-play.",/pdf/818740513b50bab8672e7ccef3d579ecf0797ab1.pdf,ICLR,2018,We investigate the various kinds of prior knowledge that help human learning and find that general priors about objects play the most critical role in guiding human gameplay. 
+St1giarCHLP,vXjql7NHn1,1601310000000.0,1611610000000.0,1080,Denoising Diffusion Implicit Models,"[""~Jiaming_Song1"", ""chenlin@stanford.edu"", ""~Stefano_Ermon1""]","[""Jiaming Song"", ""Chenlin Meng"", ""Stefano Ermon""]","[""generative models"", ""variational autoencoders"", ""denoising score matching"", ""variational inference""]","Denoising diffusion probabilistic models (DDPMs) have achieved high quality image generation without adversarial training, yet they require simulating a Markov chain for many steps in order to produce a sample. To accelerate sampling, we present denoising diffusion implicit models (DDIMs), a more efficient class of iterative implicit probabilistic models with the same training procedure as DDPMs. In DDPMs, the generative process is defined as the reverse of a particular Markovian diffusion process. We generalize DDPMs via a class of non-Markovian diffusion processes that lead to the same training objective. These non-Markovian processes can correspond to generative processes that are deterministic, giving rise to implicit models that produce high quality samples much faster. We empirically demonstrate that DDIMs can produce high quality samples $10 \times$ to $50 \times$ faster in terms of wall-clock time compared to DDPMs, allow us to trade off computation for sample quality, perform semantically meaningful image interpolation directly in the latent space, and reconstruct observations with very low error. ",/pdf/d01650b6472e18084dcf272952f04a11ffbd6065.pdf,ICLR,2021,"We show and justify a GAN-like iterative generative model with relatively fast sampling, high sample quality and without any adversarial training." 
+HkxeThNFPH,Byg7XCemvH,1569440000000.0,1577170000000.0,213,Safe Policy Learning for Continuous Control,"[""yinlamchow@google.com"", ""ofirnachum@google.com"", ""sandrafaust@google.com"", ""duenez@google.com"", ""mgh@fb.com""]","[""Yinlam Chow"", ""Ofir Nachum"", ""Aleksandra Faust"", ""Edgar Duenez-Guzman"", ""Mohammad Ghavamzadeh""]","[""reinforcement learning"", ""policy gradient"", ""safety""]","We study continuous action reinforcement learning problems in which it is crucial that the agent interacts with the environment only through safe policies, i.e.,~policies that keep the agent in desirable situations, both during training and at convergence. We formulate these problems as {\em constrained} Markov decision processes (CMDPs) and present safe policy optimization algorithms that are based on a Lyapunov approach to solve them. Our algorithms can use any standard policy gradient (PG) method, such as deep deterministic policy gradient (DDPG) or proximal policy optimization (PPO), to train a neural network policy, while guaranteeing near-constraint satisfaction for every policy update by projecting either the policy parameter or the selected action onto the set of feasible solutions induced by the state-dependent linearized Lyapunov constraints. Compared to the existing constrained PG algorithms, ours are more data efficient as they are able to utilize both on-policy and off-policy data. Moreover, our action-projection algorithm often leads to less conservative policy updates and allows for natural integration into an end-to-end PG training pipeline. 
We evaluate our algorithms and compare them with the state-of-the-art baselines on several simulated (MuJoCo) tasks, as well as a real-world robot obstacle-avoidance problem, demonstrating their effectiveness in terms of balancing performance and constraint satisfaction.",/pdf/299aee2d05d3f6719b5bf633ff1f5621ec371d21.pdf,ICLR,2020,A general framework for incorporating long-term safety constraints in policy-based reinforcement learning +HW4aTJHx0X,2iKS_ggeOdk,1601310000000.0,1614990000000.0,2668,What's new? Summarizing Contributions in Scientific Literature,"[""~Hiroaki_Hayashi1"", ""~Wojciech_Maciej_Kryscinski1"", ""~Bryan_McCann1"", ""~Nazneen_Rajani1"", ""~Caiming_Xiong1""]","[""Hiroaki Hayashi"", ""Wojciech Maciej Kryscinski"", ""Bryan McCann"", ""Nazneen Rajani"", ""Caiming Xiong""]","[""abstractive summarization"", ""scientific papers""]","With thousands of academic articles shared on a daily basis, it has become increasingly difficult to keep up with the latest scientific findings. To overcome this problem, we introduce a new task of $\textit{disentangled paper summarization}$, which seeks to generate separate summaries for the paper contributions and the context of the work, making it easier to identify the key findings shared in articles. For this purpose, we extend the S2ORC corpus of academic articles, which spans a diverse set of domains ranging from economics to psychology, by adding disentangled ""contribution"" and ""context"" reference labels. Together with the dataset, we introduce and analyze three baseline approaches: 1) a unified model controlled by input code prefixes, 2) a model with separate generation heads specialized in generating the disentangled outputs, and 3) a training strategy that guides the model using additional supervision coming from inbound and outbound citations. 
We also propose a comprehensive automatic evaluation protocol which reports the $\textit{relevance}$, $\textit{novelty}$, and $\textit{disentanglement}$ of generated outputs. Through a human study involving expert annotators, we show that in 79% of cases our new task is considered more helpful than traditional scientific paper summarization. +",/pdf/3266ca956d0793e52a9ee60ddfe386461ea85ca2.pdf,ICLR,2021,"We propose a new task of disentangled paper summarization which aims to summarize contributions and contexts of scientific papers, we show the importance and usefulness of the task through experiments." +Hy-lMNqex,,1478280000000.0,1484980000000.0,210,Tartan: Accelerating Fully-Connected and Convolutional Layers in Deep Learning Networks by Exploiting Numerical Precision Variability,"[""delmasl1@ece.utoronto.ca"", ""sayeh@ece.utoronto.ca"", ""moshovos@ece.utoronto.ca""]","[""Alberto Delm\u00e1s Lascorz"", ""Sayeh Sharify"", ""Patrick Judd"", ""Andreas Moshovos""]","[""Deep learning"", ""Applications""]","Tartan (TRT), a hardware accelerator for inference with Deep Neural Networks (DNNs), is presented and evaluated on Convolutional Neural Networks. TRT exploits the variable per layer precision requirements of DNNs to deliver execution time that is proportional to the precision p in bits used per layer for convolutional and fully-connected layers. Prior art has demonstrated an accelerator with the same execution performance only for convolutional layers. Experiments on image classification CNNs show that on average across all networks studied, TRT outperforms a state-of-the-art bit-parallel accelerator by 1.90x without any loss in accuracy while it is 1.17x more energy efficient. TRT requires no network retraining while it enables trading off accuracy for additional improvements in execution performance and energy efficiency. 
For example, if a 1% relative loss in accuracy is acceptable, TRT is on average 2.04x faster and 1.25x more energy efficient than the bit-parallel accelerator. +This revision includes post-layout results and a better configuration that processes 2 bits at a time, resulting in better efficiency and lower area overhead.",/pdf/344fd70baf30933e69d538ffe9072fab2e1438ab.pdf,ICLR,2017,A hardware accelerator whose execution time for Fully-Connected and Convolutional Layers in CNNs vary inversely proportional with the number of bits used to represent the input activations and/or weights. +SncSswKUse,e32PCXGhCcL,1601310000000.0,1614990000000.0,716,Factorized linear discriminant analysis for phenotype-guided representation learning of neuronal gene expression data,"[""~Mu_Qiao3"", ""~Markus_Meister1""]","[""Mu Qiao"", ""Markus Meister""]","[""factorized linear discriminant analysis"", ""phenotype"", ""gene expression"", ""representation learning""]","A central goal in neurobiology is to relate the expression of genes to the structural and functional properties of neuronal types, collectively called their phenotypes. Single-cell RNA sequencing can measure the expression of thousands of genes in thousands of neurons. How to interpret the data in the context of neuronal phenotypes? We propose a supervised learning approach that factorizes the gene expression data into components corresponding to individual phenotypic characteristics and their interactions. This new method, which we call factorized linear discriminant analysis (FLDA), seeks a linear transformation of gene expressions that varies highly with only one phenotypic factor and minimally with the others. We further leverage our approach with a sparsity-based regularization algorithm, which selects a few genes important to a specific phenotypic feature or feature combination. We applied this approach to a single-cell RNA-Seq dataset of Drosophila T4/T5 neurons, focusing on their dendritic and axonal phenotypes. 
The analysis confirms results obtained by conventional methods but also points to new genes related to the phenotypes and an intriguing hierarchy in the genetic organization of these cells.",/pdf/d7e0cdff4145d2c8c3a08202b744651443ca7097.pdf,ICLR,2021,We describe a novel supervised approach that learns a meaningful representation of neuronal gene expressions based on phenotypes +S1eVe2AqKX,SJlL4lA5tm,1538090000000.0,1545360000000.0,1069,PCNN: Environment Adaptive Model Without Finetuning,"[""boyuan@cs.ucsb.edu"", ""kun@cs.ucsb.edu"", ""shuyang1995@ucsb.edu"", ""yufeiding@cs.ucsb.edu""]","[""Boyuan Feng"", ""Kun Wan"", ""Shu Yang"", ""Yufei Ding""]","[""Class skew"", ""Runtime adaption""]","Convolutional Neural Networks (CNNs) have achieved tremendous success for many computer vision tasks, which shows a promising perspective of deploying CNNs on mobile platforms. An obstacle to this promising perspective is the tension between intensive resource consumption of CNNs and limited resource budget on mobile platforms. Existing works generally utilize a simpler architecture with lower accuracy for a higher energy-efficiency, \textit{i.e.}, trading accuracy for resource consumption. An emerging opportunity to both increasing accuracy and decreasing resource consumption is \textbf{class skew}, \textit{i.e.}, the strong temporal and spatial locality of the appearance of classes. However, it is challenging to efficiently utilize the class skew due to both the frequent switches and the huge number of class skews. Existing works use transfer learning to adapt the model towards the class skew during runtime, which consumes resource intensively. In this paper, we propose \textbf{probability layer}, an \textit{easily-implemented and highly flexible add-on module} to adapt the model efficiently during runtime \textit{without any fine-tuning} and achieving an \textit{equivalent or better} performance than transfer learning. 
Further, both \textit{increasing accuracy} and \textit{decreasing resource consumption} can be achieved during runtime through the combination of probability layer and pruning methods.",/pdf/0433eaef5167e45b9043bdbdd544454d1fba6239.pdf,ICLR,2019, +H1eWGREFvB,H1xLbi4uPH,1569440000000.0,1577170000000.0,993,Stein Self-Repulsive Dynamics: Benefits from Past Samples,"[""lushleaf21@gmail.com"", ""rtz19970824@gmail.com"", ""lqiang@cs.utexas.edu""]","[""Mao Ye"", ""Tongzheng Ren"", ""Qiang Liu""]","[""Approximate Inference"", ""Markov Chain Monte Carlo"", ""Stein Variational Gradient Descent""]","We propose a new Stein self-repulsive dynamics for obtaining diversified samples from intractable un-normalized distributions. Our idea is to introduce Stein variational gradient as a repulsive force to push the samples of Langevin dynamics +away from the past trajectories. This simple idea allows us to significantly decrease the auto-correlation in Langevin dynamics and hence increase the effective sample size. Importantly, as we establish in our theoretical analysis, the asymptotic stationary distribution remains correct even with the addition of the repulsive force, thanks to the special properties of the Stein variational gradient. We perform extensive empirical studies of our new algorithm, showing that our method yields much higher sample efficiency and better uncertainty estimation than vanilla Langevin dynamics.",/pdf/742602b0b30e91244ba7b7450465a68bc8dab21a.pdf,ICLR,2020,We propose a new Stein self-repulsive dynamics for obtaining diversified samples from intractable un-normalized distributions. 
+SkeGvaEtPr,HJgnys5PvS,1569440000000.0,1577170000000.0,589,Neural Markov Logic Networks,"[""g.marra@unifi.it"", ""kuzelo1@gmail.com""]","[""Giuseppe Marra"", ""Ond\u0159ej Ku\u017eelka""]","[""Statistical Relational Learning"", ""Markov Logic Networks""]","We introduce Neural Markov Logic Networks (NMLNs), a statistical relational learning system that borrows ideas from Markov logic. Like Markov Logic Networks (MLNs), NMLNs are an exponential-family model for modelling distributions over possible worlds, but unlike MLNs, they do not rely on explicitly specified first-order logic rules. Instead, NMLNs learn an implicit representation of such rules as a neural network that acts as a potential function on fragments of the relational structure. Interestingly, any MLN can be represented as an NMLN. Similarly to recently proposed Neural theorem provers (NTPs) (Rocktaschel at al. 2017), NMLNs can exploit embeddings of constants but, unlike NTPs, NMLNs work well also in their absence. This is extremely important for predicting in settings other than the transductive one. We showcase the potential of NMLNs on knowledge-base completion tasks and on generation of molecular (graph) data.",/pdf/acc10593e90c41f40a61b1c64456c342e773fd62.pdf,ICLR,2020, We introduce a statistical relational learning system that borrows ideas from Markov logic but learns an implicit representation of rules as a neural network. +Hkxvl0EtDH,r1lQxOXOvH,1569440000000.0,1577170000000.0,931,A Causal View on Robustness of Neural Networks,"[""cheng.zhang@microsoft.com"", ""yingzhen.li@microsoft.com""]","[""Cheng Zhang"", ""Yingzhen Li""]","[""Neural Network Robustness"", ""Variational autoencoder (VAE)"", ""Causality"", ""Deep generative model""]","We present a causal view on the robustness of neural networks against input manipulations, which applies not only to traditional classification tasks but also to general measurement data. 
Based on this view, we design a deep causal manipulation augmented model (deep CAMA) which explicitly models the manipulations of data as a cause to the observed effect variables. We further develop data augmentation and test-time fine-tuning methods to improve deep CAMA's robustness. When compared with discriminative deep neural networks, our proposed model shows superior robustness against unseen manipulations. As a by-product, our model achieves disentangled representation which separates the representation of manipulations from those of other latent causes.",/pdf/06a6232d3297edbe83c190720b31251fd436d7dc.pdf,ICLR,2020, +HJgExaVtwr,SkxhErJIvB,1569440000000.0,1583910000000.0,335,DivideMix: Learning with Noisy Labels as Semi-supervised Learning,"[""junnan.li@salesforce.com"", ""rsocher@salesforce.com"", ""shoi@salesforce.com""]","[""Junnan Li"", ""Richard Socher"", ""Steven C.H. Hoi""]","[""label noise"", ""semi-supervised learning""]","Deep neural networks are known to be annotation-hungry. Numerous efforts have been devoted to reducing the annotation cost when learning with deep networks. Two prominent directions include learning with noisy labels and semi-supervised learning by exploiting unlabeled data. In this work, we propose DivideMix, a novel framework for learning with noisy labels by leveraging semi-supervised learning techniques. In particular, DivideMix models the per-sample loss distribution with a mixture model to dynamically divide the training data into a labeled set with clean samples and an unlabeled set with noisy samples, and trains the model on both the labeled and unlabeled data in a semi-supervised manner. To avoid confirmation bias, we simultaneously train two diverged networks where each network uses the dataset division from the other network. During the semi-supervised training phase, we improve the MixMatch strategy by performing label co-refinement and label co-guessing on labeled and unlabeled samples, respectively. 
Experiments on multiple benchmark datasets demonstrate substantial improvements over state-of-the-art methods. Code is available at https://github.com/LiJunnan1992/DivideMix .",/pdf/b3cf972104156ac812aba28a8c140879fe430148.pdf,ICLR,2020,We propose a novel semi-supervised learning approach with SOTA performance on combating learning with noisy labels. +SygXPaEYvH,r1l2Un9vwS,1569440000000.0,1617090000000.0,591,VL-BERT: Pre-training of Generic Visual-Linguistic Representations,"[""jackroos@mail.ustc.edu.cn"", ""ezra0408@mail.ustc.edu.cn"", ""yuecao@microsoft.com"", ""binli@ustc.edu.cn"", ""lewlu@microsoft.com"", ""fuwei@microsoft.com"", ""jifdai@microsoft.com""]","[""Weijie Su"", ""Xizhou Zhu"", ""Yue Cao"", ""Bin Li"", ""Lewei Lu"", ""Furu Wei"", ""Jifeng Dai""]","[""Visual-Linguistic"", ""Generic Representation"", ""Pre-training""]","We introduce a new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT for short). VL-BERT adopts the simple yet powerful Transformer model as the backbone, and extends it to take both visual and linguistic embedded features as input. In it, each element of the input is either of a word from the input sentence, or a region-of-interest (RoI) from the input image. It is designed to fit for most of the visual-linguistic downstream tasks. To better exploit the generic representation, we pre-train VL-BERT on the massive-scale Conceptual Captions dataset, together with text-only corpus. Extensive empirical analysis demonstrates that the pre-training procedure can better align the visual-linguistic clues and benefit the downstream tasks, such as visual commonsense reasoning, visual question answering and referring expression comprehension. 
It is worth noting that VL-BERT achieved first place among single models on the leaderboard of the VCR benchmark.",/pdf/e81890cf813d756666607d9a5e43f5db490a5c98.pdf,ICLR,2020,"VL-BERT is a simple yet powerful pre-trainable generic representation for visual-linguistic tasks. It is pre-trained on the massive-scale caption dataset and text-only corpus, and can be fine-tuned for various downstream visual-linguistic tasks." +HWX5j6Bv_ih,8tD0_KktPOY,1601310000000.0,1614990000000.0,2203,Cross-Node Federated Graph Neural Network for Spatio-Temporal Data Modeling,"[""~Chuizheng_Meng1"", ""~Sirisha_Rambhatla1"", ""~Yan_Liu1""]","[""Chuizheng Meng"", ""Sirisha Rambhatla"", ""Yan Liu""]","[""Federated Learning"", ""Graph Neural Network"", ""Spatio-Temporal Data Modeling""]","The vast amount of data generated from networks of sensors, wearables, and the Internet of Things (IoT) devices underscores the need for advanced modeling techniques that leverage the spatio-temporal structure of decentralized data due to the need for edge computation and licensing (data access) issues. While federated learning (FL) has emerged as a framework for model training without requiring direct data sharing and exchange, effectively modeling the complex spatio-temporal dependencies to improve forecasting capabilities still remains an open problem. On the other hand, state-of-the-art spatio-temporal forecasting models assume unfettered access to the data, neglecting constraints on data sharing. To bridge this gap, we propose a federated spatio-temporal model -- Cross-Node Federated Graph Neural Network (CNFGNN) -- which explicitly encodes the underlying graph structure using graph neural network (GNN)-based architecture under the constraint of cross-node federated learning, which requires that data in a network of nodes is generated locally on each node and remains decentralized. 
CNFGNN operates by disentangling the temporal dynamics modeling on devices and spatial dynamics on the server, utilizing alternating optimization to reduce the communication cost, facilitating computations on the edge devices. Experiments on the traffic flow forecasting task show that CNFGNN achieves the best forecasting performance in both transductive and inductive learning settings with no extra computation cost on edge devices, while incurring modest communication cost.",/pdf/2005988c5612cefb769b69b9f774f8044ad5bc91.pdf,ICLR,2021,We propose a federated spatio-temporal model which explicitly encodes the underlying graph structure using graph neural network (GNN)-based architecture while ensuring that the data generated locally remains decentralized. +HJNMYceCW,HkQMF9eR-,1509100000000.0,1524510000000.0,353,Residual Loss Prediction: Reinforcement Learning With No Incremental Feedback,"[""hal@umiacs.umd.edu"", ""jl@hunch.net"", ""amr@cs.umd.edu""]","[""Hal Daum\u00e9 III"", ""John Langford"", ""Amr Sharaf""]","[""Reinforcement Learning"", ""Structured Prediction"", ""Contextual Bandits"", ""Learning Reduction""]","We consider reinforcement learning and bandit structured prediction problems with very sparse loss feedback: only at the end of an episode. We introduce a novel algorithm, RESIDUAL LOSS PREDICTION (RESLOPE), that solves such problems by automatically learning an internal representation of a denser reward function. RESLOPE operates as a reduction to contextual bandits, using its learned loss representation to solve the credit assignment problem, and a contextual bandit oracle to trade-off exploration and exploitation. 
RESLOPE enjoys a no-regret reduction-style theoretical guarantee and outperforms state of the art reinforcement learning algorithms in both MDP environments and bandit structured prediction settings.",/pdf/56ab4f2257300654c47e017e7ae3d440ab1150d8.pdf,ICLR,2018,We present a novel algorithm for solving reinforcement learning and bandit structured prediction problems with very sparse loss feedback. +B1CQGfZ0b,SypmMG-0Z,1509130000000.0,1518730000000.0,786,Learning to select examples for program synthesis,"[""yewenpu@mit.edu"", ""zmiranda@mit.edu"", ""asolar@csail.mit.edu"", ""lpk@csail.mit.edu""]","[""Yewen Pu"", ""Zachery Miranda"", ""Armando Solar-Lezama"", ""Leslie Pack Kaelbling""]","[""program synthesis"", ""program induction"", ""example selection""]","Program synthesis is a class of regression problems where one seeks a solution, in the form of a source-code program, that maps the inputs to their corresponding outputs exactly. Due to its precise and combinatorial nature, it is commonly formulated as a constraint satisfaction problem, where input-output examples are expressed constraints, and solved with a constraint solver. A key challenge of this formulation is that of scalability: While constraint solvers work well with few well-chosen examples, constraining the entire set of example constitutes a significant overhead in both time and memory. In this paper we address this challenge by constructing a representative subset of examples that is both small and is able to constrain the solver sufficiently. We build the subset one example at a time, using a trained discriminator to predict the probability of unchosen input-output examples conditioned on the chosen input-output examples, adding the least probable example to the subset. 
Experiments on a diagram-drawing domain show that our approach produces subsets of examples that are small and representative for the constraint solver.",/pdf/6519b33ded0f577bc626a6b19a98fded7e4bffd6.pdf,ICLR,2018,"In a program synthesis context where the input is a set of examples, we reduce the cost by computing a subset of representative examples" +5Spjp0zDYt,m07x0IWW3Hb,1601310000000.0,1614990000000.0,912,Failure Modes of Variational Autoencoders and Their Effects on Downstream Tasks,"[""~Yaniv_Yacoby1"", ""~Weiwei_Pan1"", ""~Finale_Doshi-Velez1""]","[""Yaniv Yacoby"", ""Weiwei Pan"", ""Finale Doshi-Velez""]","[""Variational Autoencoders"", ""Variational Inference"", ""VAE"", ""Approximate Inference"", ""Semi-Supervision""]","Variational Auto-encoders (VAEs) are deep generative latent variable models that are widely used for a number of downstream tasks. While it has been demonstrated that VAE training can suffer from a number of pathologies, existing literature lacks characterizations of exactly when these pathologies occur and how they impact down-stream task performance. In this paper we concretely characterize conditions under which VAE training exhibits pathologies and connect these failure modes to undesirable effects on specific downstream tasks, such as learning compressed and disentangled representations, adversarial robustness and semi-supervised learning.",/pdf/fa72de7f5f701eb32699dc81cf0f1fc66243bec9.pdf,ICLR,2021,We concretely characterize conditions under which VAE training exhibits pathologies and connect these failure modes to undesirable effects on specific downstream tasks. 
+Gj9aQfQEHRS,XCEXuyNxr28,1601310000000.0,1614990000000.0,49,Transformers satisfy,"[""~Feng_Shi1"", ""~CHEN_LI14"", ""~Shijie_Bian1"", ""~Yiqiao_Jin1"", ""~Ziheng_Xu1"", ""~Tian_Han1"", ""~Song-Chun_Zhu1""]","[""Feng Shi"", ""CHEN LI"", ""Shijie Bian"", ""Yiqiao Jin"", ""Ziheng Xu"", ""Tian Han"", ""Song-Chun Zhu""]","[""constraint satisfaction problem"", ""graph attention"", ""transformers""]","The Propositional Satisfiability Problem (SAT), and more generally, the Constraint Satisfaction Problem (CSP), are mathematical questions defined as finding an assignment to a set of objects that satisfies a series of constraints. The modern approach is trending to solve CSP through neural symbolic methods. Most recent works are sequential model-based and adopt neural embedding, i.e., reinforcement learning with neural graph networks, and graph recurrent neural networks. This work proposes a one-shot model derived from the eminent Transformer architecture for factor graph structure to solve the CSP problem. We define the heterogeneous attention mechanism based on meta-paths for the self-attention between literals, the cross-attention based on the bipartite graph links from literal to clauses, or vice versa. This model takes advantage of parallelism. Our model achieves high speed and very high accuracy on the factor graph for CSPs with arbitrary size.",/pdf/f86e9b8f29b252aafb6ef2818d811f5bf0bdf8d7.pdf,ICLR,2021,we propose a Graph Transformer architecture for solving constraint satisfaction problems. +SJldZ2RqFX,r1e7XbAcKm,1538090000000.0,1545360000000.0,1188,D-GAN: Divergent generative adversarial network for positive unlabeled learning and counter-examples generation,"[""florent.chiaroni@vedecom.fr"", ""mohamed.rahal@vedecom.fr"", ""nicolas.hueber@isl.eu"", ""frederic.dufaux@l2s.centralesupelec.fr""]","[""Florent CHIARONI. Mohamed-Cherif RAHAL. Nicolas HUEBER. Fr\u00e9d\u00e9ric DUFAUX.""]","[""Representation learning. Generative Adversarial Network (GAN). 
Positive Unlabeled learning. Image classification""]","Positive Unlabeled (PU) learning consists in learning to distinguish samples of our class of interest, the positive class, from the counter-examples, the negative class, by using positive labeled and unlabeled samples during the training. Recent approaches exploit the GANs abilities to address the PU learning problem by generating relevant counter-examples. In this paper, we propose a new GAN-based PU learning approach named Divergent-GAN (D-GAN). The key idea is to incorporate a standard Positive Unlabeled learning risk inside the GAN discriminator loss function. In this way, the discriminator can ask the generator to converge towards the unlabeled samples distribution while diverging from the positive samples distribution. This enables the generator convergence towards the unlabeled counter-examples distribution without using prior knowledge, while keeping the standard adversarial GAN architecture. In addition, we discuss normalization techniques in the context of the proposed framework. 
Experimental results show that the proposed approach overcomes the issues of previous GAN-based PU learning methods, and it globally outperforms two-stage state-of-the-art PU learning methods in terms of stability and prediction on both simple and complex image datasets.",/pdf/b2c481faf894ebf6143fb4e8a6108dc3546a76d7.pdf,ICLR,2019,A new two-stage positive unlabeled learning approach with GAN +HyxlHsActm,Bkg5Pm9VFQ,1538090000000.0,1545360000000.0,59,Efficient Dictionary Learning with Gradient Descent,"[""dg2893@columbia.edu"", ""sdb2157@columbia.edu"", ""jw2966@columbia.edu""]","[""Dar Gilboa"", ""Sam Buchanan"", ""John Wright""]","[""dictionary learning"", ""nonconvex optimization""]","Randomly initialized first-order optimization algorithms are the method of choice for solving many high-dimensional nonconvex problems in machine learning, yet general theoretical guarantees cannot rule out convergence to critical points of poor objective value. For some highly structured nonconvex problems however, the success of gradient descent can be understood by studying the geometry of the objective. We study one such problem -- complete orthogonal dictionary learning, and provide convergence guarantees for randomly initialized gradient descent to the neighborhood of a global optimum. The resulting rates scale as low order polynomials in the dimension even though the objective possesses an exponential number of saddle points. This efficient convergence can be viewed as a consequence of negative curvature normal to the stable manifolds associated with saddle points, and we provide evidence that this feature is shared by other nonconvex problems of importance as well. ",/pdf/3e45a265d01fd7351c1a3cd99261b0894afae18e.pdf,ICLR,2019,We provide an efficient convergence rate for gradient descent on the complete orthogonal dictionary learning objective based on a geometric analysis. 
+ByeWdiR5Ym,HJlFWEKctm,1538090000000.0,1545360000000.0,330,Adaptive Convolutional Neural Networks,"[""julio.c.zamora.esquivel@intel.com"", ""jesus.a.cruz.vargas@intel.com"", ""omesh.tickoo@intel.com""]","[""Julio Cesar Zamora"", ""Jesus Adan Cruz Vargas"", ""Omesh Tickoo""]","[""Adaptive kernels"", ""Dynamic kernels"", ""Pattern recognition"", ""low memory CNNs""]","The quest for increased visual recognition performance has led to the development of highly complex neural networks with very deep topologies. To avoid high computing resource requirements of such complex networks and to enable operation on devices with limited resources, this paper introduces adaptive kernels for convolutional layers. Motivated by the non-linear perception response in human visual cells, the input image is used to define the weights of a dynamic kernel called Adaptive kernel. This new adaptive kernel is used to perform a second convolution of the input image generating the output pixel. Adaptive kernels enable accurate recognition with lower memory requirements; This is accomplished through reducing the number of kernels and the number of layers needed in the typical CNN configuration, in addition to reducing the memory used, increasing 2X the training speed and the number of activation function evaluations. 
Our experiments show a reduction of 70X in the memory used for MNIST, maintaining 99% accuracy and 16X memory reduction for CIFAR10 with 92.5% accuracy.",/pdf/7af32dbba7c9025be97407e62dd3b176f2e9080c.pdf,ICLR,2019,"An adaptive convolutional kernel, that includes non-linear transformations obtaining similar results as the state of the art algorithms, while yielding a reduction in required memory up to 16x in the CIFAR10" +rJe4ShAcF7,HklwWl09KX,1538090000000.0,1550930000000.0,1531,Music Transformer: Generating Music with Long-Term Structure,"[""chengzhiannahuang@gmail.com"", ""avaswani@google.com"", ""uszkoreit@google.com"", ""iansimon@google.com"", ""fjord@google.com"", ""noam@google.com"", ""adai@google.com"", ""mhoffman@google.com"", ""noms@google.com"", ""deck@google.com""]","[""Cheng-Zhi Anna Huang"", ""Ashish Vaswani"", ""Jakob Uszkoreit"", ""Ian Simon"", ""Curtis Hawthorne"", ""Noam Shazeer"", ""Andrew M. Dai"", ""Matthew D. Hoffman"", ""Monica Dinculescu"", ""Douglas Eck""]","[""music generation""]","Music relies heavily on repetition to build structure and meaning. Self-reference occurs on multiple timescales, from motifs to phrases to reusing of entire sections of music, such as in pieces with ABA structure. The Transformer (Vaswani et al., 2017), a sequence model based on self-attention, has achieved compelling results in many generation tasks that require maintaining long-range coherence. This suggests that self-attention might also be well-suited to modeling music. In musical composition and performance, however, relative timing is critically important. Existing approaches for representing relative positional information in the Transformer modulate attention based on pairwise distance (Shaw et al., 2018). This is impractical for long sequences such as musical compositions since their memory complexity is quadratic in the sequence length. We propose an algorithm that reduces the intermediate memory requirements to linear in the sequence length. 
This enables us to demonstrate that a Transformer with our modified relative attention mechanism can generate minute-long (thousands of steps) compositions with compelling structure, generate continuations that coherently elaborate on a given motif, and in a seq2seq setup generate accompaniments conditioned on melodies. We evaluate the Transformer with our relative attention mechanism on two datasets, JSB Chorales and Piano-e-competition, and obtain state-of-the-art results on the latter.",/pdf/7b81553aa686c32e7cb5199913264282dc6f0b24.pdf,ICLR,2019,We show the first successful use of Transformer in generating music that exhibits long-term structure. +rJg8TeSFDH,Syex7bZFPS,1569440000000.0,1583910000000.0,2575,An Exponential Learning Rate Schedule for Deep Learning,"[""zhiyuanli@cs.princeton.edu"", ""arora@cs.princeton.edu""]","[""Zhiyuan Li"", ""Sanjeev Arora""]","[""batch normalization"", ""weight decay"", ""learning rate"", ""deep learning theory""]","Intriguing empirical evidence exists that deep learning can work well with exotic schedules for varying the learning rate. This paper suggests that the phenomenon may be due to Batch Normalization or BN (Ioffe & Szegedy, 2015), which is ubiquitous and provides benefits in optimization and generalization across all standard architectures. The following new results are shown about BN with weight decay and momentum (in other words, the typical use case which was not considered in earlier theoretical analyses of stand-alone BN (Ioffe & Szegedy, 2015; Santurkar et al., 2018; Arora et al., 2018)): +• Training can be done using SGD with momentum and an exponentially increasing learning rate schedule, i.e., learning rate increases by some (1 + α) factor in every epoch for some α > 0. (Precise statement in the paper.) To the best of our knowledge this is the first time such a rate schedule has been successfully used, let alone for highly successful architectures. 
As expected, such training rapidly blows up network weights, but the net stays well-behaved due to normalization. +• Mathematical explanation of the success of the above rate schedule: a rigorous proof that it is equivalent to the standard setting of BN + SGD + Standard Rate Tuning + Weight Decay + Momentum. This equivalence holds for other normalization layers as well, Group Normalization (Wu & He, 2018), Layer Normalization (Ba et al., 2016), Instance Norm (Ulyanov et al., 2016), etc. +• A worked-out toy example illustrating the above linkage of hyper-parameters. Using either weight decay or BN alone reaches global minimum, but convergence fails when both are used.",/pdf/c0a5feba02a5a8332a346bf69eb4ec194a8c9a24.pdf,ICLR,2020,"We propose an exponentially growing learning rate schedule for networks with BatchNorm, which surprisingly performs well in practice and is provably equivalent to popular LR schedules like Step Decay." +S1gV6AVKwB,S1x6MZ9uvS,1569440000000.0,1577170000000.0,1391,Cross Domain Imitation Learning,"[""khkim@cs.stanford.edu"", ""gyh15@mails.tsinghua.edu.cn"", ""jiaming.tsong@gmail.com"", ""sjzhao@stanford.edu"", ""ermon@cs.stanford.edu""]","[""Kun Ho Kim"", ""Yihong Gu"", ""Jiaming Song"", ""Shengjia Zhao"", ""Stefano Ermon""]","[""Imitation Learning"", ""Domain Adaptation"", ""Reinforcement Learning"", ""Zeroshot Learning"", ""Machine Learning"", ""Artificial Intelligence""]","We study the question of how to imitate tasks across domains with discrepancies such as embodiment and viewpoint mismatch. Many prior works require paired, aligned demonstrations and an additional RL procedure for the task. However, paired, aligned demonstrations are seldom obtainable and RL procedures are expensive. In this work, we formalize the Cross Domain Imitation Learning (CDIL) problem, which encompasses imitation learning in the presence of viewpoint and embodiment mismatch. 
Informally, CDIL is the process of learning how to perform a task optimally, given demonstrations of the task in a distinct domain. We propose a two step approach to CDIL: alignment followed by adaptation. In the alignment step we execute a novel unsupervised MDP alignment algorithm, Generative Adversarial MDP Alignment (GAMA), to learn state and action correspondences from unpaired, unaligned demonstrations. In the adaptation step we leverage the correspondences to zero-shot imitate tasks across domains. To describe when CDIL is feasible via alignment and adaptation, we introduce a theory of MDP alignability. We experimentally evaluate GAMA against baselines in both embodiment and viewpoint mismatch scenarios where aligned demonstrations don’t exist and show the effectiveness of our approach.",/pdf/4657de442dacd38e6329146451a01deb4f6c0a0b.pdf,ICLR,2020,Imitation learning across domains with discrepancies such as embodiment and viewpoint mismatch. +Skl4mRNYDr,H1epZOr_vS,1569440000000.0,1583910000000.0,1034,"Deep Imitative Models for Flexible Inference, Planning, and Control","[""nrhineha@cs.cmu.edu"", ""rmcallister@berkeley.edu"", ""svlevine@eecs.berkeley.edu""]","[""Nicholas Rhinehart"", ""Rowan McAllister"", ""Sergey Levine""]","[""imitation learning"", ""planning"", ""autonomous driving""]","Imitation Learning (IL) is an appealing approach to learn desirable autonomous behavior. However, directing IL to achieve arbitrary goals is difficult. In contrast, planning-based algorithms use dynamics models and reward functions to achieve goals. Yet, reward functions that evoke desirable behavior are often difficult to specify. In this paper, we propose ""Imitative Models"" to combine the benefits of IL and goal-directed planning. Imitative Models are probabilistic predictive models of desirable behavior able to plan interpretable expert-like trajectories to achieve specified goals. 
We derive families of flexible goal objectives, including constrained goal regions, unconstrained goal sets, and energy-based goals. We show that our method can use these objectives to successfully direct behavior. Our method substantially outperforms six IL approaches and a planning-based approach in a dynamic simulated autonomous driving task, and is efficiently learned from expert demonstrations without online data collection. We also show our approach is robust to poorly-specified goals, such as goals on the wrong side of the road.",/pdf/a0ea230e9b98f012c909e5d99c6ccf04797d5447.pdf,ICLR,2020,"In this paper, we propose Imitative Models to combine the benefits of IL and goal-directed planning: probabilistic predictive models of desirable behavior able to plan interpretable expert-like trajectories to achieve specified goals." +sy4Kg_ZQmS7,Q_AsvGwWsy6,1601310000000.0,1616070000000.0,1670,Learning Deep Features in Instrumental Variable Regression,"[""~Liyuan_Xu1"", ""~Yutian_Chen1"", ""sidsrini@cs.washington.edu"", ""~Nando_de_Freitas1"", ""~Arnaud_Doucet2"", ""~Arthur_Gretton1""]","[""Liyuan Xu"", ""Yutian Chen"", ""Siddarth Srinivasan"", ""Nando de Freitas"", ""Arnaud Doucet"", ""Arthur Gretton""]","[""Causal Inference"", ""Instrumental Variable Regression"", ""Deep Learning"", ""Reinforcement Learning""]","Instrumental variable (IV) regression is a standard strategy for learning causal relationships between confounded treatment and outcome variables from observational data by using an instrumental variable, which affects the outcome only through the treatment. In classical IV regression, learning proceeds in two stages: stage 1 performs linear regression from the instrument to the treatment; and stage 2 performs linear regression from the treatment to the outcome, conditioned on the instrument. 
We propose a novel method, deep feature instrumental variable regression (DFIV), to address the case where relations between instruments, treatments, and outcomes may be nonlinear. In this case, deep neural nets are trained to define informative nonlinear features on the instruments and treatments. We propose an alternating training regime for these features to ensure good end-to-end performance when composing stages 1 and 2, thus obtaining highly flexible feature maps in a computationally efficient manner. +DFIV outperforms recent state-of-the-art methods on challenging IV benchmarks, including settings involving high dimensional image data. DFIV also exhibits competitive performance in off-policy policy evaluation for reinforcement learning, which can be understood as an IV regression task.",/pdf/44b63f5cdbd7a7ace70009826587fc8c176ef370.pdf,ICLR,2021,Propose a novel deep learning based method for instrumental variable regression +H1VGkIxRZ,SJNzkIlRZ,1509080000000.0,1519240000000.0,267,Enhancing The Reliability of Out-of-distribution Image Detection in Neural Networks,"[""sliang26@illinois.edu"", ""yli@cs.cornell.edu"", ""rsrikant@illinois.edu"", ""liangshiyu@icloud.com"", ""yl2363@cornell.edu""]","[""Shiyu Liang"", ""Yixuan Li"", ""R. Srikant""]","[""Neural networks"", ""out-of-distribution detection""]","We consider the problem of detecting out-of-distribution images in neural networks. We propose ODIN, a simple and effective method that does not require any change to a pre-trained neural network. Our method is based on the observation that using temperature scaling and adding small perturbations to the input can separate the softmax score distributions of in- and out-of-distribution images, allowing for more effective detection. We show in a series of experiments that ODIN is compatible with diverse network architectures and datasets. 
It consistently outperforms the baseline approach by a large margin, establishing a new state-of-the-art performance on this task. For example, ODIN reduces the false positive rate from the baseline 34.7% to 4.3% on the DenseNet (applied to CIFAR-10 and Tiny-ImageNet) when the true positive rate is 95%.",/pdf/84c8d448ff900100916e176b3628b1f40c532d43.pdf,ICLR,2018, +SkxZFoAqtQ,ByeQL6fcKm,1538090000000.0,1545360000000.0,417,Improving Composition of Sentence Embeddings through the Lens of Statistical Relational Learning,"[""damien.sileo@synapse-fr.com"", ""tim.van-de-cruys@irit.fr"", ""camille.pradel@synapse-fr.com"", ""philippe.muller@irit.fr""]","[""Damien Sileo"", ""Tim Van de Cruys"", ""Camille Pradel"", ""Philippe Muller""]","[""Statistical Relational Learning"", ""Sentence Embedding"", ""Composition functions"", ""Natural Language Inference"", ""InferSent"", ""SentEval"", ""ComplEx""]","Various NLP problems -- such as the prediction of sentence similarity, entailment, and discourse relations -- are all instances of the same general task: the modeling of semantic relations between a pair of textual elements. We call them textual relational problems. A popular model for textual relational problems is to embed sentences into fixed size vectors and use composition functions (e.g. difference or concatenation) of those vectors as features for the prediction. Meanwhile, composition of embeddings has been a main focus within the field of Statistical Relational Learning (SRL) whose goal is to predict relations between entities (typically from knowledge base triples). In this work, we show that textual relational models implicitly use compositions from baseline SRL models. We show that such compositions are not expressive enough for several tasks (e.g. natural language inference). We build on recent SRL models to address textual relational problems, showing that they are more expressive, and can alleviate issues from simpler compositions. 
The resulting models significantly improve the state of the art in both transferable sentence representation learning and relation prediction.",/pdf/a8511c58af84a680ba227567d76f2cdc8424315d.pdf,ICLR,2019,We apply ideas from Statistical Relational Learning to compose sentence embeddings with more expressivity +BJlAzTEKwS,rkg-pGTUvS,1569440000000.0,1577170000000.0,431,Attraction-Repulsion Actor-Critic for Continuous Control Reinforcement Learning,"[""thang.doan@mail.mcgill.ca"", ""bogdan.mazoure@mail.mcgill.ca"", ""audrey.durand@ift.ulaval.ca"", ""jpineau@cs.mcgill.ca"", ""devon.hjelm@microsoft.com""]","[""Thang Doan"", ""Bogdan Mazoure"", ""Audrey Durand"", ""Joelle Pineau"", ""R Devon Hjelm""]","[""reinforcement learning"", ""continuous control"", ""multi-agent"", ""mujoco""]","Continuous control tasks in reinforcement learning are important because they provide an important framework for learning in high-dimensional state spaces with deceptive rewards, where the agent can easily become trapped into suboptimal solutions. +One way to avoid local optima is to use a population of agents to ensure coverage of the policy space, yet learning a population with the ``best"" coverage is still an open problem. In this work, we present a novel approach to population-based RL in continuous control that leverages properties of normalizing flows to perform attractive and repulsive operations between current members of the population and previously observed policies. Empirical results on the MuJoCo suite demonstrate a high performance gain for our algorithm compared to prior work, including Soft-Actor Critic (SAC). 
",/pdf/07860ab227e342f3c456eb1df91bc0466c6594f0.pdf,ICLR,2020, +yu8JOcFCFrE,knuWBMtCRez,1601310000000.0,1614990000000.0,208,Deep Clustering and Representation Learning that Preserves Geometric Structures,"[""~Lirong_Wu1"", ""~Zicheng_Liu2"", ""~Zelin_Zang2"", ""~Jun_Xia1"", ""~Siyuan_Li6"", ""~Stan_Z._Li2""]","[""Lirong Wu"", ""Zicheng Liu"", ""Zelin Zang"", ""Jun Xia"", ""Siyuan Li"", ""Stan Z. Li""]","[""Deep Clustering"", ""Manifold Representation Learning""]","In this paper, we propose a novel framework for Deep Clustering and multimanifold Representation Learning (DCRL) that preserves the geometric structure of data. In the proposed DCRL framework, manifold clustering is done in the latent space guided by a clustering loss. To overcome the problem that clusteringoriented losses may deteriorate the geometric structure of embeddings in the latent space, an isometric loss is proposed for preserving intra-manifold structure locally and a ranking loss for inter-manifold structure globally. Experimental results on various datasets show that the DCRL framework leads to performances comparable to current state-of-the-art deep clustering algorithms, yet exhibits superior performance for manifold representation. Our results also demonstrate the importance and effectiveness of the proposed losses in preserving geometric structure in terms of visualization and performance metrics. The code is provided in the Supplementary Material.",/pdf/1fa4a3f2b021cad8a41c12e56c2831fc08986136.pdf,ICLR,2021,"The proposed framework uses two principles, intra-manifold metric-preserving and inter-manifold metric rank-preserving to solve multi-manifold clustering problem effectively." 
+mb2L9vL-MjI,Imj99AYCyPg,1601310000000.0,1614990000000.0,2383,The Quenching-Activation Behavior of the Gradient Descent Dynamics for Two-layer Neural Network Models,"[""~Chao_Ma8"", ""~Lei_Wu1"", ""~Weinan_E1""]","[""Chao Ma"", ""Lei Wu"", ""Weinan E""]","[""Gradient descent"", ""neural networks"", ""implicit regularization"", ""quenching-activation""]","A numerical and phenomenological study of the gradient descent (GD) algorithm for training two-layer neural network models is carried out for different parameter regimes. It is found that there are two distinctive phases in the GD dynamics in the under-parameterized regime: An early phase in which the GD dynamics follows closely that of the corresponding random feature model, followed by a late phase in which the neurons are divided into two groups: a group of a few (maybe none) “activated” neurons that dominate the dynamics and a group of ``quenched” neurons that support the continued activation and deactivation process. In particular, when the target function can be accurately approximated by a relatively small number of neurons, this quenching-activation process biases GD to picking sparse solutions. This neural network-like behavior is continued into the mildly over-parameterized regime, in which it undergoes a transition to a random feature-like behavior where the inner-layer parameters are effectively frozen during the training process. The quenching process seems to provide a clear mechanism for ``implicit regularization''. This is qualitatively different from the GD dynamics associated with the ``mean-field'' scaling where all neurons participate equally.",/pdf/dcb8090b97dedba1dd82ca18e60a190bca43ab49.pdf,ICLR,2021,The gradient descent dynamics for two-layer neural networks exhibits a quenching-activation behavior. 
+HJV1zP5xg,,1478290000000.0,1483980000000.0,363,Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models,"[""ashwinkv@vt.edu"", ""cogswell@vt.edu"", ""ram21@vt.edu"", ""sunqing@vt.edu"", ""steflee@vt.edu"", ""djcran@indiana.edu"", ""dbatra@vt.edu""]","[""Ashwin K Vijayakumar"", ""Michael Cogswell"", ""Ramprasaath R. Selvaraju"", ""Qing Sun"", ""Stefan Lee"", ""David Crandall"", ""Dhruv Batra""]","[""Deep learning"", ""Computer vision"", ""Natural language processing""]","Neural sequence models are widely used to model time-series data. Equally ubiquitous is the usage of beam search (BS) as an approximate inference algorithm to decode output sequences from these models. BS explores the search space in a greedy left-right fashion retaining only the top B candidates. This tends to result in sequences that differ only slightly from each other. Producing lists of nearly identical sequences is not only computationally wasteful but also typically fails to capture the inherent ambiguity of complex AI tasks. To overcome this problem, we propose Diverse Beam Search (DBS), an alternative to BS that decodes a list of diverse outputs by optimizing a diversity-augmented objective. We observe that our method not only improves diversity but also finds better top 1 solutions by controlling for the exploration and exploitation of the search space. Moreover, these gains are achieved with minimal computational or memory overhead compared to beam search. To demonstrate the broad applicability of our method, we present results on image captioning, machine translation, conversation and visual question generation using both standard quantitative metrics and qualitative human studies. 
We find that our method consistently outperforms BS and previously proposed techniques for diverse decoding from neural sequence models.",/pdf/ed535716902c60d30f7787e7f452aa999165a5e3.pdf,ICLR,2017,"We introduce a novel, diversity promoting beam search algorithm that results in significantly improved diversity between decoded sequences as evaluated on multiple sequence generation tasks." +Sk8csP5ex,,1478290000000.0,1482070000000.0,423,The loss surface of residual networks: Ensembles and the role of batch normalization,"[""etai.littwin@gmail.com"", ""liorwolf@gmail.com""]","[""Etai Littwin"", ""Lior Wolf""]","[""Deep learning"", ""Theory""]","Deep Residual Networks present a premium in performance in comparison to conventional +networks of the same depth and are trainable at extreme depths. It has +recently been shown that Residual Networks behave like ensembles of relatively +shallow networks. We show that these ensemble are dynamic: while initially +the virtual ensemble is mostly at depths lower than half the network’s depth, as +training progresses, it becomes deeper and deeper. The main mechanism that controls +the dynamic ensemble behavior is the scaling introduced, e.g., by the Batch +Normalization technique. We explain this behavior and demonstrate the driving +force behind it. 
As a main tool in our analysis, we employ generalized spin glass +models, which we also use in order to study the number of critical points in the +optimization of Residual Networks.",/pdf/e04f62365b1b59ae7fb3363bf32e28b60fddc225.pdf,ICLR,2017,Residual nets are dynamic ensembles +HygT9oRqFX,ryx0fINqKm,1538090000000.0,1545360000000.0,572,MixFeat: Mix Feature in Latent Space Learns Discriminative Space,"[""yoichi_yaguchi@ot.olympus.co.jp"", ""f_shiratani@ot.olympus.co.jp"", ""h_iwaki@ot.olympus.co.jp""]","[""Yoichi Yaguchi"", ""Fumiyuki Shiratani"", ""Hidekazu Iwaki""]","[""regularization"", ""generalization"", ""image classification"", ""latent space"", ""feature learning""]","Deep learning methods perform well in various tasks. However, the over-fitting problem, which causes the performance to decrease for unknown data, remains. We hence propose a method named MixFeat that directly creates latent spaces in a network that can distinguish classes. MixFeat mixes two feature maps in each latent space in the network and uses unmixed labels for learning. We discuss the difference between a method that mixes only features (MixFeat) and a method that mixes both features and labels (mixup and its family). Mixing features repeatedly is effective in expanding feature diversity, but mixing labels repeatedly makes learning difficult. MixFeat makes it possible to obtain the advantages of repeated mixing by mixing only features. We report improved results obtained using existing network models with MixFeat on CIFAR-10/100 datasets. In addition, we show that MixFeat effectively reduces the over-fitting problem even when the training dataset is small or contains errors. MixFeat is easy to implement and can be added to various network models without additional computational cost in the inference phase.",/pdf/6ef77a53c7ba95bb34cc76db8e94e3adeba12d5d.pdf,ICLR,2019,"We provide a novel method named MixFeat, which directly makes the latent space discriminative." 
+r1lfF2NYvH,BJl0jK4sLH,1569440000000.0,1583910000000.0,71,InfoGraph: Unsupervised and Semi-supervised Graph-Level Representation Learning via Mutual Information Maximization,"[""sunfanyun@gmail.com"", ""jhoffmann@g.harvard.edu"", ""vikasverma.iitm@gmail.com"", ""jian.tang@hec.ca""]","[""Fan-Yun Sun"", ""Jordan Hoffman"", ""Vikas Verma"", ""Jian Tang""]","[""graph-level representation learning"", ""mutual information maximization""]","This paper studies learning the representations of whole graphs in both unsupervised and semi-supervised scenarios. Graph-level representations are critical in a variety of real-world applications such as predicting the properties of molecules and community analysis in social networks. Traditional graph kernel based methods are simple, yet effective for obtaining fixed-length representations for graphs but they suffer from poor generalization due to hand-crafted designs. There are also some recent methods based on language models (e.g. graph2vec) but they tend to only consider certain substructures (e.g. subtrees) as graph representatives. Inspired by recent progress of unsupervised representation learning, in this paper we proposed a novel method called InfoGraph for learning graph-level representations. We maximize the mutual information between the graph-level representation and the representations of substructures of different scales (e.g., nodes, edges, triangles). By doing so, the graph-level representations encode aspects of the data that are shared across different scales of substructures. Furthermore, we further propose InfoGraph*, an extension of InfoGraph for semisupervised scenarios. InfoGraph* maximizes the mutual information between unsupervised graph representations learned by InfoGraph and the representations learned by existing supervised methods. As a result, the supervised encoder learns from unlabeled data while preserving the latent semantic space favored by the current supervised task. 
Experimental results on the tasks of graph classification and molecular property prediction show that InfoGraph is superior to state-of-the-art baselines and InfoGraph* can achieve performance competitive with state-of-the-art semi-supervised models.",/pdf/af171fb8c60fa180c4dcf349ccc51ff006211216.pdf,ICLR,2020, +BkUDvt5gg,,1478300000000.0,1478300000000.0,500,Wav2Letter: an End-to-End ConvNet-based Speech Recognition System,"[""locronan@fb.com"", ""cpuhrsch@fb.com"", ""gab@fb.com""]","[""Ronan Collobert"", ""Christian Puhrsch"", ""Gabriel Synnaeve""]","[""Deep learning"", ""Speech"", ""Structured prediction""]","This paper presents a simple end-to-end model for speech recognition, combining a convolutional network based acoustic model and a graph decoding. It is trained to output letters, with transcribed speech, without the need for force alignment of phonemes. We introduce an automatic segmentation criterion for training from sequence annotation without alignment that is on par with CTC (Graves et al., 2006) while being simpler. We show competitive results in word error rate on the Librispeech corpus (Panayotov et al., 2015) with MFCC features, and promising results from raw waveform.",/pdf/d5a0b5529d3e8ef3cd62319541458c8b929d4bd2.pdf,ICLR,2017,We propose convnet models and new sequence criterions for training end-to-end letter-based speech systems. +B16yEqkCZ,rknkNqk0Z,1509040000000.0,1518730000000.0,145,Avoiding Catastrophic States with Intrinsic Fear,"[""zlipton@cmu.edu"", ""kazizzad@uci.edu"", ""abkumar@ucsd.edu"", ""lihongli.cs@gmail.com"", ""jfgao@microsoft.com"", ""l.deng@ieee.org""]","[""Zachary C. Lipton"", ""Kamyar Azizzadenesheli"", ""Abhishek Kumar"", ""Lihong Li"", ""Jianfeng Gao"", ""Li Deng""]","[""reinforcement learning"", ""safe exploration"", ""dqn""]","Many practical reinforcement learning problems contain catastrophic states that the optimal policy visits infrequently or never. 
Even on toy problems, deep reinforcement learners periodically revisit these states, once they are forgotten under a new policy. In this paper, we introduce intrinsic fear, a learned reward shaping that accelerates deep reinforcement learning and guards oscillating policies against periodic catastrophes. Our approach incorporates a second model trained via supervised learning to predict the probability of imminent catastrophe. This score acts as a penalty on the Q-learning objective. Our theoretical analysis demonstrates that the perturbed objective yields the same average return under strong assumptions and an $\epsilon$-close average return under weaker assumptions. Our analysis also shows robustness to classification errors. Equipped with intrinsic fear, our DQNs solve the toy environments and improve on the Atari games Seaquest, Asteroids, and Freeway.",/pdf/5a09deba4481a8fe20463cc25a32aff1e7457dd2.pdf,ICLR,2018,Shape reward with intrinsic motivation to avoid catastrophic states and mitigate catastrophic forgetting. +_HsKf3YaWpG,cqjeNo5u-Lq,1601310000000.0,1614990000000.0,1682,Uniform Priors for Data-Efficient Transfer,"[""~Samarth_Sinha1"", ""~Karsten_Roth1"", ""~Anirudh_Goyal1"", ""~Marzyeh_Ghassemi1"", ""~Hugo_Larochelle1"", ""~Animesh_Garg1""]","[""Samarth Sinha"", ""Karsten Roth"", ""Anirudh Goyal"", ""Marzyeh Ghassemi"", ""Hugo Larochelle"", ""Animesh Garg""]","[""Meta Learning"", ""Deep Metric Learning"", ""Transfer Learning""]","Deep Neural Networks have shown great promise on a variety of downstream applications; but their ability to adapt and generalize to new data and tasks remains a challenging problem. However, the ability to perform few or zero-shot adaptation to novel tasks is important for the scalability and deployment of machine learning models. It is therefore crucial to understand what makes for good, transferable features in deep networks that best allow for such adaptation. 
In this paper, we shed light on this by showing that features that are most transferable have high uniformity in the embedding space and propose a uniformity regularization scheme that encourages better transfer and feature reuse. We evaluate the regularization on its ability to facilitate adaptation to unseen tasks and data, for which we conduct a thorough experimental study covering four relevant, and distinct domains: few-shot Meta-Learning, Deep Metric Learning, Zero-Shot Domain Adaptation, as well as Out-of-Distribution classification. Across all experiments, we show that uniformity regularization consistently offers benefits over baseline methods and is able to achieve state-of-the-art performance in Deep Metric Learning and Meta-Learning.",/pdf/56dba903a0cf8efcffb726406bf47b861dd084ee.pdf,ICLR,2021,"We observe that the uniformity of embeddings is important for better transfer and adaptation, and propose a simple technique to promote uniformity which improve meta learning, deep metric learning, zero-shot domain adaptation and OOD generalization." +pW--cu2FCHY,2ZXJFCag7vk,1601310000000.0,1614990000000.0,2688,An Attention Free Transformer,"[""~Shuangfei_Zhai3"", ""~Walter_Talbott1"", ""~Nitish_Srivastava1"", ""~Chen_Huang6"", ""~Hanlin_Goh2"", ""~Joshua_M._Susskind1""]","[""Shuangfei Zhai"", ""Walter Talbott"", ""Nitish Srivastava"", ""Chen Huang"", ""Hanlin Goh"", ""Joshua M. Susskind""]","[""Transformers"", ""attention"", ""efficient""]","We introduce Attention Free Transformer (AFT), an efficient variant of Transformers \citep{transformer} that eliminates the need for dot product attention. AFT offers great simplicity and efficiency compared with standard Transformers, where the multi-head attention operation is replaced with the composition of element-wise multiplications/divisions and global/local pooling. During training time, AFT has linear time and space complexity w.r.t. 
both the sequence length and feature dimension; in the autoregressive decoding mode, AFT has constant memory and time complexity per step. We show that, surprisingly, we are able to train AFT effectively on challenging benchmarks, and also to match or surpass the standard Transformer counterparts and other efficient variants. In particular, AFT achieves the state-of-the-art result on CIFAR10 autoregressive modeling with much reduced complexity, and also outperforms several efficient Transformer variants on Enwik8.",/pdf/058a6d9983e0bc22ef82e8dcc08f01b292f4a59c.pdf,ICLR,2021,We propose an efficient Transformer that eliminates attention. +HylA41Btwr,Hyx0tyauwB,1569440000000.0,1577170000000.0,1674,CP-GAN: Towards a Better Global Landscape of GANs,"[""ruoyus@illinois.edu"", ""tf6@illinois.edu"", ""aschwing@illinois.edu""]","[""Ruoyu Sun"", ""Tiantian Fang"", ""Alex Schwing""]","[""GAN"", ""global landscape"", ""non-convex optimization"", ""min-max optimization"", ""dynamics""]","GANs have been very popular in data generation and unsupervised learning, but our understanding of GAN training is still very limited. One major reason is that GANs are often formulated as non-convex-concave min-max optimization. As a result, most recent studies focused on the analysis in the local region around the equilibrium. In this work, we perform a global analysis of GANs from two perspectives: the global landscape of the outer-optimization problem and the global behavior of the gradient descent dynamics. We find that the original GAN has exponentially many bad strict local minima which are perceived as mode-collapse, and the training dynamics (with linear discriminators) cannot escape mode collapse. To address these issues, we propose a simple modification to the original GAN, by coupling the generated samples and the true samples. 
We prove that the new formulation has no bad basins, and its training dynamics (with linear discriminators) has a Lyapunov function that leads to global convergence. Our experiments on standard datasets show that this simple loss outperforms the original GAN and WGAN-GP. ",/pdf/5f32814c2dc7b200071a056beac1d87052979020.pdf,ICLR,2020, +rkgqm0VKwB,Bylzp3HuvH,1569440000000.0,1577170000000.0,1048,End-to-end named entity recognition and relation extraction using pre-trained language models,"[""john.giorgi@utoronto.ca"", ""xindi.wang@uhnresearch.ca"", ""nicola.sahar@mail.utoronto.ca"", ""wonyoung.shin@mail.utoronto.ca"", ""gary.bader@utoronto.ca"", ""bowang@vectorinstitute.ai""]","[""John Giorgi"", ""Xindi Wang"", ""Nicola Sahar"", ""Won Young Shin"", ""Gary Bader"", ""Bo Wang""]","[""named entity recognition"", ""relation extraction"", ""information extraction"", ""information retrival"", ""transfer learning"", ""multi-task learning"", ""BERT"", ""transformers"", ""language models""]","Named entity recognition (NER) and relation extraction (RE) are two important tasks in information extraction and retrieval (IE & IR). Recent work has demonstrated that it is beneficial to learn these tasks jointly, which avoids the propagation of error inherent in pipeline-based systems and improves performance. However, state-of-the-art joint models typically rely on external natural language processing (NLP) tools, such as dependency parsers, limiting their usefulness to domains (e.g. news) where those tools perform well. The few neural, end-to-end models that have been proposed are trained almost completely from scratch. In this paper, we propose a neural, end-to-end model for jointly extracting entities and their relations which does not rely on external NLP tools and which integrates a large, pre-trained language model. Because the bulk of our model's parameters are pre-trained and we eschew recurrence for self-attention, our model is fast to train. 
On 5 datasets across 3 domains, our model matches or exceeds state-of-the-art performance, sometimes by a large margin.",/pdf/1462260f8feb38efc3c3c6a48383eeb9d679c3c9.pdf,ICLR,2020,"A novel, high-performing architecture for end-to-end named entity recognition and relation extraction that is fast to train." +EKV158tSfwv,oVEQaDD9a6Q,1601310000000.0,1614180000000.0,483,Efficient Continual Learning with Modular Networks and Task-Driven Priors,"[""~Tom_Veniat1"", ""~Ludovic_Denoyer1"", ""~MarcAurelio_Ranzato1""]","[""Tom Veniat"", ""Ludovic Denoyer"", ""MarcAurelio Ranzato""]","[""Continual learning"", ""Lifelong learning"", ""Benchmark"", ""Modular network"", ""Neural Network""]","Existing literature in Continual Learning (CL) has focused on overcoming catastrophic forgetting, the inability of the learner to recall how to perform tasks observed in the past. +There are however other desirable properties of a CL system, such as the ability to transfer knowledge from previous tasks and to scale memory and compute sub-linearly with the number of tasks. Since most current benchmarks focus only on forgetting using short streams of tasks, we first propose a new suite of benchmarks to probe CL algorithms across these new axes. +Finally, we introduce a new modular architecture, whose modules represent atomic skills that can be composed to perform a certain task. Learning a task reduces to figuring out which past modules to re-use, and which new modules to instantiate to solve the current task. Our learning algorithm leverages a task-driven prior over the exponential search space of all possible ways to combine modules, enabling efficient learning on long streams of tasks. +Our experiments show that this modular architecture and learning algorithm perform competitively on widely used CL benchmarks while yielding superior performance on the more challenging benchmarks we introduce in this work. 
The Benchmark is publicly available at https://github.com/facebookresearch/CTrLBenchmark.",/pdf/8c3f194bd890ab75d3046245b587fdf9c6393d9b.pdf,ICLR,2021,We propose a new benchmark allowing a detailed analysis of the properties of continual learning alogrithms and a new modular neural network leveraging task-based priors to efficiently learn in the CL setting. +vYVI1CHPaQg,Rs_2WjCSSHs,1601310000000.0,1615640000000.0,2162,A Better Alternative to Error Feedback for Communication-Efficient Distributed Learning,"[""~Samuel_Horv\u00e1th1"", ""~Peter_Richtarik1""]","[""Samuel Horv\u00e1th"", ""Peter Richtarik""]","[""distributed optimization"", ""communication efficiency""]","Modern large-scale machine learning applications require stochastic optimization algorithms to be implemented on distributed computing systems. A key bottleneck of such systems is the communication overhead for exchanging information across the workers, such as stochastic gradients. Among the many techniques proposed to remedy this issue, one of the most successful is the framework of compressed communication with error feedback (EF). EF remains the only known technique that can deal with the error induced by contractive compressors which are not unbiased, such as Top-$K$ or PowerSGD. In this paper, we propose a new and theoretically and practically better alternative to EF for dealing with contractive compressors. In particular, we propose a construction which can transform any contractive compressor into an induced unbiased compressor. Following this transformation, existing methods able to work with unbiased compressors can be applied. We show that our approach leads to vast improvements over EF, including reduced memory requirements, better communication complexity guarantees and fewer assumptions. We further extend our results to federated learning with partial participation following an arbitrary distribution over the nodes and demonstrate the benefits thereof. 
We perform several numerical experiments which validate our theoretical findings.",/pdf/74506361fdee66a12846aa4de625aa16b5745878.pdf,ICLR,2021, +HkghV209tm,rJgogeCqKQ,1538090000000.0,1545360000000.0,1484,Optimistic Acceleration for Optimization,"[""jimwang@gatech.edu"", ""xl374@scarletmail.rutgers.edu"", ""pingli98@gmail.com""]","[""Jun-Kun Wang"", ""Xiaoyun Li"", ""Ping Li""]","[""optimization"", ""Adam"", ""AMSGrad""]","We consider new variants of optimization algorithms. Our algorithms are based on the observation that mini-batch of stochastic gradients in consecutive iterations do not change drastically and consequently may be predictable. Inspired by the similar setting in online learning literature called Optimistic Online learning, we propose two new optimistic algorithms for AMSGrad and Adam, respectively, by exploiting the predictability of gradients. The new algorithms combine the idea of momentum method, adaptive gradient method, and algorithms in Optimistic Online learning, which leads to speed up in training deep neural nets in practice.",/pdf/06002122d11de65c0db50d9107bc5112a302d160.pdf,ICLR,2019,We consider new variants of optimization algorithms for training deep nets. +rJ4km2R5t7,SJl9k0TrYQ,1538090000000.0,1550880000000.0,1323,GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding,"[""alexwang@nyu.edu"", ""amanpreet@nyu.edu"", ""julianjm@cs.washington.edu"", ""felixhill@google.com"", ""omerlevy@cs.washington.edu"", ""bowman@nyu.edu""]","[""Alex Wang"", ""Amanpreet Singh"", ""Julian Michael"", ""Felix Hill"", ""Omer Levy"", ""Samuel R. Bowman""]","[""natural language understanding"", ""multi-task learning"", ""evaluation""]","For natural language understanding (NLU) technology to be maximally useful, it must be able to process language in a way that is not exclusive to a single task, genre, or dataset. 
In pursuit of this objective, we introduce the General Language Understanding Evaluation (GLUE) benchmark, a collection of tools for evaluating the performance of models across a diverse set of existing NLU tasks. By including tasks with limited training data, GLUE is designed to favor and encourage models that share general linguistic knowledge across tasks. GLUE also includes a hand-crafted diagnostic test suite that enables detailed linguistic analysis of models. We evaluate baselines based on current methods for transfer and representation learning and find that multi-task training on all tasks performs better than training a separate model per task. However, the low absolute performance of our best model indicates the need for improved general NLU systems.",/pdf/e661931af788bb41220e35ab989a6e051b9e602b.pdf,ICLR,2019,We present a multi-task benchmark and analysis platform for evaluating generalization in natural language understanding systems. +1TIrbngpW0x,5LJbmLtwsWf,1601310000000.0,1614990000000.0,2385,Transformers with Competitive Ensembles of Independent Mechanisms,"[""~Alex_Lamb1"", ""~Di_He1"", ""~Anirudh_Goyal1"", ""~Guolin_Ke3"", ""~Chien-Feng_Liao1"", ""~Mirco_Ravanelli1"", ""~Yoshua_Bengio1""]","[""Alex Lamb"", ""Di He"", ""Anirudh Goyal"", ""Guolin Ke"", ""Chien-Feng Liao"", ""Mirco Ravanelli"", ""Yoshua Bengio""]","[""transformer"", ""mechanism"", ""modularity"", ""modules"", ""independence""]","An important development in deep learning from the earliest MLPs has been a move towards architectures with structural inductive biases which enable the model to keep distinct sources of information and routes of processing well-separated. This structure is linked to the notion of independent mechanisms from the causality literature, in which a mechanism is able to retain the same processing as irrelevant aspects of the world are changed. 
For example, convnets enable separation over positions, while attention-based architectures (especially Transformers) learn which combination of positions to process dynamically. In this work we explore a way in which the Transformer architecture is deficient: it represents each position with a large monolithic hidden representation and a single set of parameters which are applied over the entire hidden representation. This potentially throws unrelated sources of information together, and limits the Transformer's ability to capture independent mechanisms. To address this, we propose Transformers with Independent Mechanisms (TIM), a new Transformer layer which divides the hidden representation and parameters into multiple mechanisms, which only exchange information through attention. Additionally, we propose a competition mechanism which encourages these mechanisms to specialize over time steps, and thus be more independent. We study TIM on a large scale BERT model, on the Image Transformer, and on speech enhancement and find evidence for semantically meaningful specialization as well as improved performance. ",/pdf/844949cababddd36ef6d17be742269e46f7dd7a4.pdf,ICLR,2021,"Transformers with Independent Mechanisms which have separate parameters, share information through attention, and specialize over positions. " +VJnrYcnRc6,fT5qN7EDXo9,1601310000000.0,1616030000000.0,2818,Conditional Generative Modeling via Learning the Latent Space,"[""~Sameera_Ramasinghe1"", ""~Kanchana_Nisal_Ranasinghe1"", ""~Salman_Khan4"", ""~Nick_Barnes3"", ""~Stephen_Gould1""]","[""Sameera Ramasinghe"", ""Kanchana Nisal Ranasinghe"", ""Salman Khan"", ""Nick Barnes"", ""Stephen Gould""]","[""Multimodal Spaces"", ""Conditional Generation"", ""Generative Modeling""]","Although deep learning has achieved appealing results on several machine learning tasks, most of the models are deterministic at inference, limiting their application to single-modal settings. 
We propose a novel general-purpose framework for conditional generation in multimodal spaces, that uses latent variables to model generalizable learning patterns while minimizing a family of regression cost functions. At inference, the latent variables are optimized to find solutions corresponding to multiple output modes. Compared to existing generative solutions, our approach demonstrates faster and more stable convergence, and can learn better representations for downstream tasks. Importantly, it provides a simple generic model that can perform better than highly engineered pipelines tailored using domain expertise on a variety of tasks, while generating diverse outputs. Code available at https://github.com/samgregoost/cGML.",/pdf/ad10b1238b8c96783d156228bbe0a955123a991c.pdf,ICLR,2021,Conditional generation in continuous multimodal spaces by learning the behavior of latent variables. +#NAME?,7SQkg11f_r,1601310000000.0,1614990000000.0,1748,Convergent Adaptive Gradient Methods in Decentralized Optimization,"[""~Xiangyi_Chen1"", ""~Belhal_Karimi1"", ""weijiezhao@baidu.com"", ""~Ping_Li3""]","[""Xiangyi Chen"", ""Belhal Karimi"", ""Weijie Zhao"", ""Ping Li""]","[""Adam"", ""decentralized optimization"", ""adaptive gradient methods""]","Adaptive gradient methods including Adam, AdaGrad, and their variants have been very successful for training deep learning models, such as neural networks, in the past few years. Meanwhile, given the need for distributed training procedures, distributed optimization algorithms are at the center of attention. With the growth of computing power and the need for using machine learning models on mobile devices, the communication cost of distributed training algorithms needs careful consideration. In that regard, more and more attention is shifted from the traditional parameter server training paradigm to the decentralized one, which usually requires lower communication costs. 
In this paper, we rigorously incorporate adaptive gradient methods into decentralized training procedures and introduce novel convergent decentralized adaptive gradient methods. Specifically, we propose a general algorithmic framework that can convert existing adaptive gradient methods to their decentralized counterparts. In addition, we thoroughly analyze the convergence behavior of the proposed algorithmic framework and show that if a given adaptive gradient method converges, under some specific conditions, then its decentralized counterpart is also convergent. ",/pdf/2bd2d35d9dea15574aa5d8097f5875d23e685404.pdf,ICLR,2021,We analyzed a framework to decentralize adaptive gradient methods and proposed a convergent decentralized adaptive gradient method using the framework. +HklOo0VFDH,SJehCfY_Dr,1569440000000.0,1583910000000.0,1325,Decoding As Dynamic Programming For Recurrent Autoregressive Models,"[""syed.zaidi1@monash.edu"", ""t.cohn@unimelb.edu.au"", ""reza.haffari@gmail.com""]","[""Najam Zaidi"", ""Trevor Cohn"", ""Gholamreza Haffari""]","[""Decoding""]","Decoding in autoregressive models (ARMs) consists of searching for a high scoring output sequence under the trained model. Standard decoding methods, based on unidirectional greedy algorithm or beam search, are suboptimal due to error propagation and myopic decisions which do not account for future steps in the generation process. In this paper we present a novel decoding approach based on the method of auxiliary coordinates (Carreira-Perpinan & Wang, 2014) to address the aforementioned shortcomings. Our method introduces discrete variables for output tokens, and auxiliary continuous variables representing the states of the underlying ARM. The auxiliary variables lead to a factor graph approximation of the ARM, whose maximum a posteriori (MAP) inference is found exactly using dynamic programming. 
The MAP inference is then used to recreate an improved factor graph approximation of the ARM via updated auxiliary variables. We then extend our approach to decode in an ensemble of ARMs, possibly with different generation orders, which is out of reach for the standard unidirectional decoding algorithms. Experiments on the text infilling task over SWAG and Daily Dialogue datasets show that our decoding method is superior to strong unidirectional decoding baselines.",/pdf/cbec9ce6e7b1e593eac93abfff0fd90fd2ee6ed8.pdf,ICLR,2020,Approximate inference using dynamic programming for Autoregressive models. +BJl9ZTVKwB,B1xrlQ_Uwr,1569440000000.0,1577170000000.0,386,MIM: Mutual Information Machine,"[""mlivne@cs.toronto.edu"", ""kswersky@google.com"", ""leet@cs.toronto.edu""]","[""Micha Livne"", ""Kevin Swersky"", ""David J. Fleet""]","[""Mutual Information"", ""Representation Learning"", ""Generative Models"", ""Probability Density Estimator""]"," We introduce the Mutual Information Machine (MIM), an autoencoder framework + for learning joint distributions over observations and latent states. + The model formulation reflects two key design principles: 1) symmetry, to encourage + the encoder and decoder to learn different factorizations of the same + underlying distribution; and 2) mutual information, to encourage the learning + of useful representations for downstream tasks. + The objective comprises the Jensen-Shannon divergence between the encoding and + decoding joint distributions, plus a mutual information regularizer. + We show that this can be bounded by a tractable cross-entropy loss between + the true model and a parameterized approximation, and relate this to + maximum likelihood estimation and variational autoencoders. 
+ Experiments show that MIM is capable of learning a latent representation with high mutual information, + and good unsupervised clustering, while providing NLL comparable to VAE + (with a sufficiently expressive architecture).",/pdf/3c4f50a314fa9e8e0bda89eb8976302981afc6ee.pdf,ICLR,2020,We propose an alternative latent variable modelling framework to variational auto-encoders that encourages the principles of symmetry and high mutual information. +SJlRUkrFPS,SJlVIYTuPS,1569440000000.0,1583910000000.0,1748,Learning transport cost from subset correspondence,"[""ruishan@stanford.edu"", ""akshay7@gmail.com"", ""jamesyzou@gmail.com""]","[""Ruishan Liu"", ""Akshay Balsubramani"", ""James Zou""]",[],"Learning to align multiple datasets is an important problem with many applications, and it is especially useful when we need to integrate multiple experiments or correct for confounding. Optimal transport (OT) is a principled approach to align datasets, but a key challenge in applying OT is that we need to specify a cost function that accurately captures how the two datasets are related. Reliable cost functions are typically not available and practitioners often resort to using hand-crafted or Euclidean cost even if it may not be appropriate. In this work, we investigate how to learn the cost function using a small amount of side information which is often available. The side information we consider captures subset correspondence---i.e. certain subsets of points in the two data sets are known to be related. For example, we may have some images labeled as cars in both datasets; or we may have a common annotated cell type in single-cell data from two batches. We develop an end-to-end optimizer (OT-SI) that differentiates through the Sinkhorn algorithm and effectively learns the suitable cost function from side information. On systematic experiments in images, marriage-matching and single-cell RNA-seq, our method substantially outperform state-of-the-art benchmarks. 
",/pdf/fbc4a415c10a8dfd7099be635ff6c8e3dbf29edf.pdf,ICLR,2020, +ysti0DEWTSo,CYNXXW0WOhN,1601310000000.0,1614990000000.0,1098,Is deeper better? It depends on locality of relevant features,"[""~Takashi_Mori1"", ""ueda@phys.s.u-tokyo.ac.jp""]","[""Takashi Mori"", ""Masahito Ueda""]","[""deep learning"", ""generalization"", ""overparameterization""]","It has been recognized that a heavily overparameterized artificial neural network exhibits surprisingly good generalization performance in various machine-learning tasks. Recent theoretical studies have made attempts to unveil the mystery of the overparameterization. In most of those previous works, the overparameterization is achieved by increasing the width of the network, while the effect of increasing the depth has been less well understood. In this work, we investigate the effect of increasing the depth within an overparameterized regime. To gain an insight into the advantage of depth, we introduce local and global labels as abstract but simple classification rules. It turns out that the locality of the relevant feature for a given classification rule plays an important role; our experimental results suggest that deeper is better for local labels, whereas shallower is better for global labels. We also compare the results of finite networks with those of the neural tangent kernel (NTK), which is equivalent to an infinitely wide network with a proper initialization and an infinitesimal learning rate. It is shown that the NTK does not correctly capture the depth dependence of the generalization performance, which indicates the importance of the feature learning, rather than the lazy learning.",/pdf/3f11d8d7de8244c32a7de84da6a61799d4f50620.pdf,ICLR,2021,It depends on locality of relevant features whether the depth is beneficial in deep learning for classification tasks. 
+BkgVx3A9Km,rkltHvn9FX,1538090000000.0,1545360000000.0,1068,A More Globally Accurate Dimensionality Reduction Method Using Triplets,"[""eamid@ucsc.edu"", ""manfred@ucsc.edu""]","[""Ehsan Amid"", ""Manfred K. Warmuth""]","[""Dimensionality Reduction"", ""Visualization"", ""Triplets"", ""t-SNE"", ""LargeVis""]","We first show that the commonly used dimensionality reduction (DR) methods such as t-SNE and LargeVis +poorly capture the global structure of the data in the low dimensional embedding. We show this via a number of tests for the DR methods that can be easily applied by any practitioner to the dataset at hand. Surprisingly enough, t-SNE performs the best w.r.t. the commonly used measures that reward the local neighborhood accuracy such as precision-recall while having the worst performance in our tests for global structure. We then contrast the performance of these two DR methods +against our new method called TriMap. The main idea behind TriMap is to capture higher orders of structure with triplet information (instead of pairwise information used by t-SNE and LargeVis), and to minimize a robust loss function for satisfying the chosen triplets. We provide compelling experimental evidence on large natural datasets for the clear advantage of the TriMap DR results. 
Like LargeVis, TriMap is fast and provides comparable runtime on large datasets.",/pdf/099568854149750460c3ab76e21cd660fe62875e.pdf,ICLR,2019,A new dimensionality reduction method using triplets which is significantly faster than t-SNE and provides more accurate results globally +Hk8rlUqge,,1478280000000.0,1481720000000.0,284,Joint Multimodal Learning with Deep Generative Models,"[""masa@weblab.t.u-tokyo.ac.jp"", ""k-nakayama@weblab.t.u-tokyo.ac.jp"", ""matsuo@weblab.t.u-tokyo.ac.jp""]","[""Masahiro Suzuki"", ""Kotaro Nakayama"", ""Yutaka Matsuo""]",[],"We investigate deep generative models that can exchange multiple modalities bi-directionally, e.g., generating images from corresponding texts and vice versa. Recently, some studies handle multiple modalities on deep generative models, such as variational autoencoders (VAEs). However, these models typically assume that modalities are forced to have a conditioned relation, i.e., we can only generate modalities in one direction. To achieve our objective, we should extract a joint representation that captures high-level concepts among all modalities and through which we can exchange them bi-directionally. As described herein, we propose a joint multimodal variational autoencoder (JMVAE), in which all modalities are independently conditioned on joint representation. In other words, it models a joint distribution of modalities. Furthermore, to be able to generate missing modalities from the remaining modalities properly, we develop an additional method, JMVAE-kl, that is trained by reducing the divergence between JMVAE's encoder and prepared networks of respective modalities. Our experiments show that our proposed method can obtain appropriate joint representation from multiple modalities and that it can generate and reconstruct them more properly than conventional VAEs. We further demonstrate that JMVAE can generate multiple modalities bi-directionally. 
+",/pdf/cb01f96bb430eeac976c1e4ea3a12095c97881a8.pdf,ICLR,2017, +de11dbHzAMF,NB2ucHzPl5,1601310000000.0,1628170000000.0,367,Conditionally Adaptive Multi-Task Learning: Improving Transfer Learning in NLP Using Fewer Parameters & Less Data,"[""~Jonathan_Pilault1"", ""~Amine_El_hattami1"", ""~Christopher_Pal1""]","[""Jonathan Pilault"", ""Amine El hattami"", ""Christopher Pal""]","[""Multi-Task Learning"", ""Adaptive Learning"", ""Transfer Learning"", ""Natural Language Processing"", ""Hypernetwork""]","Multi-Task Learning (MTL) networks have emerged as a promising method for transferring learned knowledge across different tasks. However, MTL must deal with challenges such as: overfitting to low resource tasks, catastrophic forgetting, and negative task transfer, or learning interference. Often, in Natural Language Processing (NLP), a separate model per task is needed to obtain the best performance. However, many fine-tuning approaches are both parameter inefficient, i.e., potentially involving one new model per task, and highly susceptible to losing knowledge acquired during pretraining. We propose a novel Transformer based Hypernetwork Adapter consisting of a new conditional attention mechanism as well as a set of task-conditioned modules that facilitate weight sharing. Through this construction, we achieve more efficient parameter sharing and mitigate forgetting by keeping half of the weights of a pretrained model fixed. We also use a new multi-task data sampling strategy to mitigate the negative effects of data imbalance across tasks. Using this approach, we are able to surpass single task fine-tuning methods while being parameter and data efficient (using around 66% of the data). Compared to other BERT Large methods on GLUE, our 8-task model surpasses other Adapter methods by 2.8% and our 24-task model outperforms by 0.7-1.0% models that use MTL and single task fine-tuning. 
We show that a larger variant of our single multi-task model approach performs competitively across 26 NLP tasks and yields state-of-the-art results on a number of test and development sets.",/pdf/c3044c4a7c51d46a59a66bf5a93e9d87747fce37.pdf,ICLR,2021,Can multi-task outperform single task fine-tuning? CA-MTL is a new method that shows that it is possible with task conditioned model adaption via a Hypernetwork and uncertainty sampling. +wTWLfuDkvKp,UZLdqLzjyBm,1601310000000.0,1614990000000.0,2077,Should Ensemble Members Be Calibrated?,"[""~Xixin_Wu1"", ""~Mark_Gales1""]","[""Xixin Wu"", ""Mark Gales""]",[],"Underlying the use of statistical approaches for a wide range of applications is the assumption that the probabilities obtained from a statistical model are representative of the “true” probability that event, or outcome, will occur. Unfortunately, for modern deep neural networks this is not the case, they are often observed to be poorly calibrated. Additionally, these deep learning approaches make use of large numbers of model parameters, motivating the use of Bayesian, or ensemble approximation, approaches to handle issues with parameter estimation. This paper explores the application of calibration schemes to deep ensembles from both a theoretical perspective and empirically on a standard image classification task, CIFAR-100. The underlying theoretical requirements for calibration, and associated calibration criteria, are first described. It is shown that well calibrated ensemble members will not necessarily yield a well calibrated ensemble prediction, and if the ensemble prediction is well calibrated its performance cannot exceed that of the average performance of the calibrated ensemble members. On CIFAR-100 the impact of calibration for ensemble prediction, and associated calibration is evaluated. 
Additionally the situation where multiple different topologies are combined together is discussed.",/pdf/a84f99c180a2229b2edddde4757b5b5df3fbe75c.pdf,ICLR,2021, +BJgsN3R9Km,SJgKy0n9tX,1538090000000.0,1545360000000.0,1482,AntMan: Sparse Low-Rank Compression To Accelerate RNN Inference,"[""samyamr@microsoft.com"", ""hshrivastava3@gatech.edu"", ""yuxhe@microsoft.com""]","[""Samyam Rajbhandari"", ""Harsh Shrivastava"", ""Yuxiong He""]","[""model compression"", ""RNN"", ""performance optimization"", ""language model"", ""machine reading comprehension"", ""knowledge distillation"", ""teacher-student""]","Wide adoption of complex RNN based models is hindered by their inference performance, cost and memory requirements. To address this issue, we develop AntMan, combining structured sparsity with low-rank decomposition synergistically, to reduce model computation, size and execution time of RNNs while attaining desired accuracy. AntMan extends knowledge distillation based training to learn the compressed models efficiently. Our evaluation shows that AntMan offers up to 100x computation reduction with less than 1pt accuracy drop for language and machine reading comprehension models. Our evaluation also shows that for a given accuracy target, AntMan produces 5x smaller models than the state-of-the-art. Lastly, we show that AntMan offers super-linear speed gains compared to theoretical speedup, demonstrating its practical value on commodity hardware.",/pdf/fe973ecfa9406ae9b2df179395ce49a7cbc2583a.pdf,ICLR,2019,"Reducing computation and memory complexity of RNN models by up to 100x using sparse low-rank compression modules, trained via knowledge distillation." 
+HJxV-ANKDH,H1gU9GVOvB,1569440000000.0,1583910000000.0,962,Efficient Riemannian Optimization on the Stiefel Manifold via the Cayley Transform,"[""liju2@oregonstate.edu"", ""fuxin.li@oregonstate.edu"", ""sinisa@oregonstate.edu""]","[""Jun Li"", ""Fuxin Li"", ""Sinisa Todorovic""]","[""Orthonormality"", ""Efficient Riemannian Optimization"", ""the Stiefel manifold.""]","Strictly enforcing orthonormality constraints on parameter matrices has been shown advantageous in deep learning. This amounts to Riemannian optimization on the Stiefel manifold, which, however, is computationally expensive. To address this challenge, we present two main contributions: (1) A new efficient retraction map based on an iterative Cayley transform for optimization updates, and (2) An implicit vector transport mechanism based on the combination of a projection of the momentum and the Cayley transform on the Stiefel manifold. We specify two new optimization algorithms: Cayley SGD with momentum, and Cayley ADAM on the Stiefel manifold. Convergence of Cayley SGD is theoretically analyzed. Our experiments for CNN training demonstrate that both algorithms: (a) Use less running time per iteration relative to existing approaches that enforce orthonormality of CNN parameters; and (b) Achieve faster convergence rates than the baseline SGD and ADAM algorithms without compromising the performance of the CNN. Cayley SGD and Cayley ADAM are also shown to reduce the training time for optimizing the unitary transition matrices in RNNs.",/pdf/bd962d83aec39bde1214d36ae0f8e0d4220c9579.pdf,ICLR,2020,This paper is about efficient Riemannian optimization on the Stiefel manifold that enforces the parameter matrices orthonormal. 
+HyTqHL5xg,,1478280000000.0,1488560000000.0,296,Deep Variational Bayes Filters: Unsupervised Learning of State Space Models from Raw Data,"[""karlma@in.tum.de"", ""m.soelch@tum.de"", ""bayer.justin@googlemail.com"", ""smagt@brml.org""]","[""Maximilian Karl"", ""Maximilian Soelch"", ""Justin Bayer"", ""Patrick van der Smagt""]","[""Deep learning"", ""Unsupervised Learning""]","We introduce Deep Variational Bayes Filters (DVBF), a new method for unsupervised learning and identification of latent Markovian state space models. Leveraging recent advances in Stochastic Gradient Variational Bayes, DVBF can overcome intractable inference distributions via variational inference. Thus, it can handle highly nonlinear input data with temporal and spatial dependencies such as image sequences without domain knowledge. Our experiments show that enabling backpropagation through transitions enforces state space assumptions and significantly improves information content of the latent embedding. This also enables realistic long-term prediction. +",/pdf/6734b367969e365aa39c22f7101c259ca0efebbd.pdf,ICLR,2017, +HylNWkHtvB,H1xn-9iuvS,1569440000000.0,1577170000000.0,1540,Domain-Independent Dominance of Adaptive Methods,"[""savarese@ttic.edu"", ""mcallester@ttic.edu"", ""sudarshan@ttic.edu"", ""mmaire@uchicago.edu""]","[""Pedro Savarese"", ""David McAllester"", ""Sudarshan Babu"", ""Michael Maire""]",[],"From a simplified analysis of adaptive methods, we derive AvaGrad, a new optimizer which outperforms SGD on vision tasks when its adaptability is properly tuned. We observe that the power of our method is partially explained by a decoupling of learning rate and adaptability, greatly simplifying hyperparameter search. In light of this observation, we demonstrate that, against conventional wisdom, Adam can also outperform SGD on vision tasks, as long as the coupling between its learning rate and adaptability is taken into account. 
In practice, AvaGrad matches the best results, as measured by generalization accuracy, delivered by any existing optimizer (SGD or adaptive) across image classification (CIFAR, ImageNet) and character-level language modelling (Penn Treebank) tasks. This latter observation, alongside AvaGrad's decoupling of hyperparameters, could make it the preferred optimizer for deep learning, replacing both SGD and Adam.",/pdf/9af302ed6da4c59db377e6e67f134d42c2426975.pdf,ICLR,2020, +4ADnf1HqIw,p3UVawonoLB,1601310000000.0,1614990000000.0,694,Recovering Geometric Information with Learned Texture Perturbations,"[""~Jane_Wu2"", ""yxjin@stanford.edu"", ""zhenglin@stanford.edu"", ""hui.zhou@jd.com"", ""~Ronald_Fedkiw1""]","[""Jane Wu"", ""Yongxu Jin"", ""Zhenglin Geng"", ""Hui Zhou"", ""Ronald Fedkiw""]",[],"Regularization is used to avoid overfitting when training a neural network; unfortunately, this reduces the attainable level of detail hindering the ability to capture high-frequency information present in the training data. Even though various approaches may be used to re-introduce high-frequency detail, it typically does not match the training data and is often not time coherent. In the case of network inferred cloth, these sentiments manifest themselves via either a lack of detailed wrinkles or unnaturally appearing and/or time incoherent surrogate wrinkles. Thus, we propose a general strategy whereby high-frequency information is procedurally embedded into low-frequency data so that when the latter is smeared out by the network the former still retains its high-frequency detail. We illustrate this approach by learning texture coordinates which when smeared do not in turn smear out the high-frequency detail in the texture itself but merely smoothly distort it. 
Notably, we prescribe perturbed texture coordinates that are subsequently used to correct the over-smoothed appearance of inferred cloth, and correcting the appearance from multiple camera views naturally recovers lost geometric information.",/pdf/b27c031ead609dd6c81f4dfe690232f184b8a25d.pdf,ICLR,2021, +0WWj8muw_rj,NMVrL82QJ5R,1601310000000.0,1614990000000.0,2908,Adaptive Gradient Methods Can Be Provably Faster than SGD with Random Shuffling,"[""~Xunpeng_Huang1"", ""~Vicky_Jiaqi_Zhang2"", ""zhouhao.nlp@bytedance.com"", ""~Lei_Li11""]","[""Xunpeng Huang"", ""Vicky Jiaqi Zhang"", ""Hao Zhou"", ""Lei Li""]",[],"Adaptive gradient methods have been shown to outperform SGD in many tasks of training neural networks. However, the acceleration effect is yet to be explained in the non-convex setting since the best convergence rate of adaptive gradient methods is worse than that of SGD in literature. In this paper, we prove that adaptive gradient methods exhibit an $\small\tilde{O}(T^{-1/2})$-convergence rate for finding first-order stationary points under the strong growth condition, which improves previous best convergence results of adaptive gradient methods and random shuffling SGD by factors of $\small O(T^{-1/4})$ and $\small O(T^{-1/6})$, respectively. In particular, we study two variants of AdaGrad with random shuffling for finite sum minimization. Our analysis suggests that the combination of random shuffling and adaptive learning rates gives rise to better convergence.",/pdf/aa3e1d9b91089369ab6f3bbcb7c3f96c846a5d1d.pdf,ICLR,2021, +ryUPiRvge,,1478120000000.0,1483970000000.0,50,Extrapolation and learning equations,"[""gmartius@ist.ac.at"", ""chl@ist.ac.at""]","[""Georg Martius"", ""Christoph H. 
Lampert""]","[""Supervised Learning"", ""Deep learning"", ""Structured prediction""]","In classical machine learning, regression is treated as a black box process of identifying a +suitable function from a hypothesis set without attempting to gain insight into the mechanism connecting inputs and outputs. +In the natural sciences, however, finding an interpretable function for a phenomenon is the prime goal as it allows to understand and generalize results. This paper proposes a novel type of function learning network, called equation learner (EQL), that can learn analytical expressions and is able to extrapolate to unseen domains. It is implemented as an end-to-end differentiable feed-forward network and allows for efficient gradient based training. Due to sparsity regularization concise interpretable expressions can be obtained. Often the true underlying source expression is identified. +",/pdf/bb2864eaf4009f785b839d201c2a27a52510f352.pdf,ICLR,2017,We present the learning of analytical equation from data using a new forward network architecture. +B14uJzW0b,rkQO1MbAb,1509130000000.0,1518730000000.0,755,No Spurious Local Minima in a Two Hidden Unit ReLU Network,"[""wucw14@mails.tsinghua.edu.cn"", ""jiajunlu@usc.edu"", ""jasonlee@marshall.usc.edu""]","[""Chenwei Wu"", ""Jiajun Luo"", ""Jason D. Lee""]","[""Non-convex optimization"", ""Deep Learning""]","Deep learning models can be efficiently optimized via stochastic gradient descent, but there is little theoretical evidence to support this. A key question in optimization is to understand when the optimization landscape of a neural network is amenable to gradient-based optimization. We focus on a simple neural network two-layer ReLU network with two hidden units, and show that all local minimizers are global. This combined with recent work of Lee et al. (2017); Lee et al. 
(2016) show that gradient descent converges to the global minimizer.",/pdf/1d79cc1c095e26c810f84f769c88abf960d0e2af.pdf,ICLR,2018,"Recovery guarantee of stochastic gradient descent with random initialization for learning a two-layer neural network with two hidden nodes, unit-norm weights, ReLU activation functions and Gaussian inputs." +rklOg6EFwS,ryxYhqG8wH,1569440000000.0,1583910000000.0,343,Improving Adversarial Robustness Requires Revisiting Misclassified Examples,"[""eewangyisen@gmail.com"", ""knowzou@ucla.edu"", ""jinfengyi.ustc@gmail.com"", ""baileyj@unimelb.edu.au"", ""xingjun.ma@unimelb.edu.au"", ""qgu@cs.ucla.edu""]","[""Yisen Wang"", ""Difan Zou"", ""Jinfeng Yi"", ""James Bailey"", ""Xingjun Ma"", ""Quanquan Gu""]","[""Robustness"", ""Adversarial Defense"", ""Adversarial Training""]","Deep neural networks (DNNs) are vulnerable to adversarial examples crafted by imperceptible perturbations. A range of defense techniques have been proposed to improve DNN robustness to adversarial examples, among which adversarial training has been demonstrated to be the most effective. Adversarial training is often formulated as a min-max optimization problem, with the inner maximization for generating adversarial examples. However, there exists a simple, yet easily overlooked fact that adversarial examples are only defined on correctly classified (natural) examples, but inevitably, some (natural) examples will be misclassified during training. In this paper, we investigate the distinctive influence of misclassified and correctly classified examples on the final robustness of adversarial training. Specifically, we find that misclassified examples indeed have a significant impact on the final robustness. More surprisingly, we find that different maximization techniques on misclassified examples may have a negligible influence on the final robustness, while different minimization techniques are crucial. 
Motivated by the above discovery, we propose a new defense algorithm called {\em Misclassification Aware adveRsarial Training} (MART), which explicitly differentiates the misclassified and correctly classified examples during the training. We also propose a semi-supervised extension of MART, which can leverage the unlabeled data to further improve the robustness. Experimental results show that MART and its variant could significantly improve the state-of-the-art adversarial robustness.",/pdf/08f03663964e4c8da165be864d32eed9723ef6fa.pdf,ICLR,2020,"By differentiating misclassified and correctly classified data, we propose a new misclassification aware defense that improves the state-of-the-art adversarial robustness." +rJl5rRVFvH,rygOq6Idvr,1569440000000.0,1577170000000.0,1121,Way Off-Policy Batch Deep Reinforcement Learning of Human Preferences in Dialog,"[""jaquesn@mit.edu"", ""asma_gh@mit.edu"", ""judyshen@mit.edu"", ""fergusoc@mit.edu"", ""agata@mit.edu"", ""ncjones@mit.edu"", ""shanegu@google.com"", ""picard@media.mit.edu""]","[""Natasha Jaques"", ""Asma Ghandeharioun"", ""Judy Hanwen Shen"", ""Craig Ferguson"", ""Agata Lapedriza"", ""Noah Jones"", ""Shixiang Gu"", ""Rosalind Picard""]","[""batch reinforcement learning"", ""deep learning"", ""dialog"", ""off-policy"", ""human preferences""]","Most deep reinforcement learning (RL) systems are not able to learn effectively from off-policy data, especially if they cannot explore online in the environment. This is a critical shortcoming for applying RL to real-world problems where collecting data is expensive, and models must be tested offline before being deployed to interact with the environment -- e.g. systems that learn from human interaction. Thus, we develop a novel class of off-policy batch RL algorithms which use KL-control to penalize divergence from a pre-trained prior model of probable actions. 
This KL-constraint reduces extrapolation error, enabling effective offline learning, without exploration, from a fixed batch of data. We also use dropout-based uncertainty estimates to lower bound the target Q-values as a more efficient alternative to Double Q-Learning. This Way Off-Policy (WOP) algorithm is tested on both traditional RL tasks from OpenAI Gym, and on the problem of open-domain dialog generation; a challenging reinforcement learning problem with a 20,000 dimensional action space. WOP allows for the extraction of multiple different reward functions post-hoc from collected human interaction data, and can learn effectively from all of these. We test real-world generalization by deploying dialog models live to converse with humans in an open-domain setting, and demonstrate that WOP achieves significant improvements over state-of-the-art prior methods in batch deep RL. +",/pdf/83567ae322b6b19689c31e643df4419e5eaa3924.pdf,ICLR,2020,"We show that KL-control from a pre-trained prior can allow RL models to learn from a static batch of collected data, without the ability to explore online in the environment." +HkljfjFee,,1478240000000.0,1492140000000.0,122,Support Regularized Sparse Coding and Its Fast Encoder,"[""superyyzg@gmail.com"", ""jyu79@illinois.edu"", ""pkohli@microsoft.com"", ""jianchao.yang@snapchat.com"", ""t-huang1@illinois.edu""]","[""Yingzhen Yang"", ""Jiahui Yu"", ""Pushmeet Kohli"", ""Jianchao Yang"", ""Thomas S. Huang""]",[],"Sparse coding represents a signal by a linear combination of only a few atoms of a learned over-complete dictionary. While sparse coding exhibits compelling performance for various machine learning tasks, the process of obtaining sparse code with fixed dictionary is independent for each data point without considering the geometric information and manifold structure of the entire data. 
We propose Support Regularized Sparse Coding (SRSC) which produces sparse codes that account for the manifold structure of the data by encouraging nearby data in the manifold to choose similar dictionary atoms. In this way, the obtained support regularized sparse codes capture the locally linear structure of the data manifold and enjoy robustness to data noise. We present the optimization algorithm of SRSC with theoretical guarantee for the optimization over the sparse codes. We also propose a feed-forward neural network termed Deep Support Regularized Sparse Coding (Deep-SRSC) as a fast encoder to approximate the sparse codes generated by SRSC. Extensive experimental results demonstrate the effectiveness of SRSC and Deep-SRSC.",/pdf/e086819f940ba0e8b713e3fdd1f6b9d73f1035a4.pdf,ICLR,2017,"We present Support Regularized Sparse Coding (SRSC) to improve the regular sparse coding, and propose a feed-forward neural network termed Deep Support Regularized Sparse Coding (Deep-SRSC) as its fast encoder." +BUlyHkzjgmA,aO0xwPRXYan,1601310000000.0,1615450000000.0,2357,Improved Estimation of Concentration Under $\ell_p$-Norm Distance Metrics Using Half Spaces,"[""jbp2jn@virginia.edu"", ""~Xiao_Zhang2"", ""~David_Evans1""]","[""Jack Prescott"", ""Xiao Zhang"", ""David Evans""]","[""Adversarial Examples"", ""Concentration of Measure"", ""Gaussian Isoperimetric Inequality""]","Concentration of measure has been argued to be the fundamental cause of adversarial vulnerability. Mahloujifar et al. (2019) presented an empirical way to measure the concentration of a data distribution using samples, and employed it to find lower bounds on intrinsic robustness for several benchmark datasets. However, it remains unclear whether these lower bounds are tight enough to provide a useful approximation for the intrinsic robustness of a dataset. 
To gain a deeper understanding of the concentration of measure phenomenon, we first extend the Gaussian Isoperimetric Inequality to non-spherical Gaussian measures and arbitrary $\ell_p$-norms ($p \geq 2$). We leverage these theoretical insights to design a method that uses half-spaces to estimate the concentration of any empirical dataset under $\ell_p$-norm distance metrics. Our proposed algorithm is more efficient than Mahloujifar et al. (2019)'s, and experiments on synthetic datasets and image benchmarks demonstrate that it is able to find much tighter intrinsic robustness bounds. These tighter estimates provide further evidence that rules out intrinsic dataset concentration as a possible explanation for the adversarial vulnerability of state-of-the-art classifiers.",/pdf/5d9950ac35e5e85a527dacf6286c7b9c148005bd.pdf,ICLR,2021,We show that concentration of measure does not prohibit the existence of adversarially robust classifiers using a novel method of empirical concentration estimation. +jDdzh5ul-d,r7hs62sRafj,1601310000000.0,1614290000000.0,1350,Achieving Linear Speedup with Partial Worker Participation in Non-IID Federated Learning,"[""~Haibo_Yang1"", ""myfang@iastate.edu"", ""~Jia_Liu1""]","[""Haibo Yang"", ""Minghong Fang"", ""Jia Liu""]","[""Federated Learning"", ""Linear Speedup"", ""Partial Worker Participation""]","Federated learning (FL) is a distributed machine learning architecture that leverages a large number of workers to jointly learn a model with decentralized data. FL has received increasing attention in recent years thanks to its data privacy protection, communication efficiency and a linear speedup for convergence in training (i.e., convergence performance increases linearly with respect to the number of workers). However, existing studies on linear speedup for convergence are only limited to the assumptions of i.i.d. datasets across workers and/or full worker participation, both of which rarely hold in practice. 
So far, it remains an open question whether or not the linear speedup for convergence is achievable under non-i.i.d. datasets with partial worker participation in FL. In this paper, we show that the answer is affirmative. Specifically, we show that the federated averaging (FedAvg) algorithm (with two-sided learning rates) on non-i.i.d. datasets in non-convex settings achieves a convergence rate $\mathcal{O}(\frac{1}{\sqrt{mKT}} + \frac{1}{T})$ for full worker participation and a convergence rate $\mathcal{O}(\frac{\sqrt{K}}{\sqrt{nT}} + \frac{1}{T})$ for partial worker participation, where $K$ is the number of local steps, $T$ is the number of total communication rounds, $m$ is the total number of workers, and $n$ is the number of workers in one communication round under partial worker participation. Our results also reveal that the local steps in FL could help the convergence and show that the maximum number of local steps can be improved to $T/m$ in full worker participation. We conduct extensive experiments on MNIST and CIFAR-10 to verify our theoretical results.",/pdf/d80668746783bad5c0b3ff7eff505b3c40b825cd.pdf,ICLR,2021, +HJy_5Mcll,,1478270000000.0,1478290000000.0,178,ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation,"[""a.paszke@students.mimuw.edu.pl"", ""aabhish@purdue.edu"", ""sangpilkim@purdue.edu"", ""euge@purdue.edu""]","[""Adam Paszke"", ""Abhishek Chaurasia"", ""Sangpil Kim"", ""Eugenio Culurciello""]","[""Deep learning""]","The ability to perform pixel-wise semantic segmentation in real-time is of paramount importance in practical mobile applications. Recent deep neural networks aimed at this task have the disadvantage of requiring a large number of floating point operations and have long run-times that hinder their usability. In this paper, we propose a novel deep neural network architecture named ENet (efficient neural network), created specifically for tasks requiring low latency operation. 
ENet is up to 18x faster, requires 75x fewer FLOPs, has 79x fewer parameters, and provides similar or better accuracy to existing models. +We have tested it on CamVid, Cityscapes and SUN datasets and report on comparisons with existing state-of-the-art methods, and the trade-offs between accuracy and processing time of a network. We present performance measurements of the proposed architecture on embedded systems and suggest possible software improvements that could make ENet even faster. +",/pdf/19025e9f9e55430fa57f3730bcf86cc5ad1c73b8.pdf,ICLR,2017, +ryeEr0EFvS,HkgZrj8OPB,1569440000000.0,1577170000000.0,1107,A Hierarchy of Graph Neural Networks Based on Learnable Local Features,"[""mlli@mit.edu"", ""mengdong@g.harvard.edu"", ""jzhou02@g.harvard.edu"", ""srush@cornell.edu""]","[""Michael Lingzhi Li"", ""Meng Dong"", ""Jiawei Zhou"", ""Alexander M. Rush""]","[""Graph Neural Networks"", ""Hierarchy"", ""Weisfeiler-Lehman"", ""Discriminative Power""]","Graph neural networks (GNNs) are a powerful tool to learn representations on graphs by iteratively aggregating features from node neighbourhoods. Many variant models have been proposed, but there is limited understanding on both how to compare different architectures and how to construct GNNs systematically. Here, we propose a hierarchy of GNNs based on their aggregation regions. We derive theoretical results about the discriminative power and feature representation capabilities of each class. Then, we show how this framework can be utilized to systematically construct arbitrarily powerful GNNs. As an example, we construct a simple architecture that exceeds the expressiveness of the Weisfeiler-Lehman graph isomorphism test. We empirically validate our theory on both synthetic and real-world benchmarks, and demonstrate our example's theoretical power translates to state-of-the-art results on node classification, graph classification, and graph regression tasks. 
",/pdf/df84c1abe32a8762fc20c9ec5c9622a52a4fbba5.pdf,ICLR,2020,"We developed a theoretically-sound hierarchy of graph neural networks (GNNs) based on aggregation regions, and demonstrated how to create powerful GNNs systematically using such framework." +S1emOTNKvS,BkeeJy2wDS,1569440000000.0,1577170000000.0,625,Robust Graph Representation Learning via Neural Sparsification,"[""chengzheng@cs.ucla.edu"", ""bzong@nec-labs.com"", ""weicheng@nec-labs.com"", ""dsong@nec-labs.com"", ""jni@nec-labs.com"", ""yuwenchao@ucla.edu"", ""haifeng@nec-labs.com"", ""weiwang@cs.ucla.edu""]","[""Cheng Zheng"", ""Bo Zong"", ""Wei Cheng"", ""Dongjin Song"", ""Jingchao Ni"", ""Wenchao Yu"", ""Haifeng Chen"", ""Wei Wang""]",[],"Graph representation learning serves as the core of many important prediction tasks, ranging from product recommendation in online marketing to fraud detection in financial domain. Real-life graphs are usually large with complex local neighborhood, where each node is described by a rich set of features and easily connects to dozens or even hundreds of neighbors. Most existing graph learning techniques rely on neighborhood aggregation, however, the complexity on real-life graphs is usually high, posing non-trivial overfitting risk during model training. In this paper, we present Neural Sparsification (NeuralSparse), a supervised graph sparsification technique that mitigates the overfitting risk by reducing the complexity of input graphs. Our method takes both structural and non-structural information as input, utilizes deep neural networks to parameterize the sparsification process, and optimizes the parameters by feedback signals from downstream tasks. Under the NeuralSparse framework, supervised graph sparsification could seamlessly connect with existing graph neural networks for more robust performance on testing data. 
Experimental results on both benchmark and private datasets show that NeuralSparse can effectively improve testing accuracy and bring up to 7.4% improvement when working with existing graph neural networks on node classification tasks.",/pdf/b0f22078a835fd027b887d742128845fc13e2b39.pdf,ICLR,2020, +BylyV1BtDB,Hye959hdwr,1569440000000.0,1577170000000.0,1638,FR-GAN: Fair and Robust Training,"[""rohyj113@gmail.com"", ""kangwook.lee@wisc.edu"", ""hkj4276@kaist.ac.kr"", ""swhang@kaist.ac.kr"", ""chsuh@kaist.ac.kr""]","[""Yuji Roh"", ""Kangwook Lee"", ""Gyeong Jo Hwang"", ""Steven Euijong Whang"", ""Changho Suh""]","[""generative adversarial networks"", ""model fairness"", ""model robustness""]","We consider the problem of fair and robust model training in the presence of data poisoning. Ensuring fairness usually involves a tradeoff against accuracy, so if the data poisoning is mistakenly viewed as additional bias to be fixed, the accuracy will be sacrificed even more. We demonstrate that this phenomenon indeed holds for state-of-the-art model fairness techniques. We then propose FR-GAN, which holistically performs fair and robust model training using generative adversarial networks (GANs). We first use a generator that attempts to classify examples as accurately as possible. In addition, we deploy two discriminators: (1) a fairness discriminator that predicts the sensitive attribute from classification results and (2) a robustness discriminator that distinguishes examples and predictions from a clean validation set. Our framework respects all the prominent fairness measures: disparate impact, equalized odds, and equal opportunity. Also, FR-GAN optimizes fairness without requiring the knowledge of prior statistics of the sensitive attributes. In our experiments, FR-GAN shows almost no decrease in fairness and accuracy in the presence of data poisoning unlike other state-of-the-art fairness methods, which are vulnerable. 
In addition, FR-GAN can be adjusted using parameters to maintain reasonable accuracy and fairness even if the validation set is too small or unavailable.",/pdf/428fcd11e30f1c5111f9cd720b2415ee5a19ba20.pdf,ICLR,2020,"We propose FR-GAN, which holistically performs fair and robust model training using generative adversarial networks. " +HJtN5K9gx,,1478300000000.0,1484240000000.0,515,Learning Disentangled Representations in Deep Generative Models,"[""nsid@robots.ox.ac.uk"", ""brooks@robots.ox.ac.uk"", ""alban@robots.ox.ac.uk"", ""j.vandemeent@northeastern.edu"", ""fwood@robots.ox.ac.uk"", ""ngoodman@stanford.edu"", ""pkohli@microsoft.com"", ""philip.torr@eng.ox.ac.uk""]","[""N. Siddharth"", ""Brooks Paige"", ""Alban Desmaison"", ""Jan-Willem van de Meent"", ""Frank Wood"", ""Noah D. Goodman"", ""Pushmeet Kohli"", ""Philip H.S. Torr""]","[""Semi-Supervised Learning"", ""Deep learning"", ""Computer vision""]","Deep generative models provide a powerful and flexible means to learn complex distributions over data by incorporating neural networks into latent-variable models. Variational approaches to training such models introduce a probabilistic encoder that casts data, typically unsupervised, into an entangled and unstructured representation space. While unsupervised learning is often desirable, sometimes even necessary, when we lack prior knowledge about what to represent, being able to incorporate domain knowledge in characterising certain aspects of variation in the data can often help learn better disentangled representations. Here, we introduce a new formulation of semi-supervised learning in variational autoencoders that allows precisely this. It permits flexible specification of probabilistic encoders as directed graphical models via a stochastic computation graph, containing both continuous and discrete latent variables, with conditional distributions parametrised by neural networks. 
We demonstrate how the provision of structure, along with a few labelled examples indicating plausible values for some components of the latent space, can help quickly learn disentangled representations. We then evaluate its ability to do so, both qualitatively by exploring its generative capacity, and quantitatively by using the disentangled representation to perform classification, on a variety of models and datasets.",/pdf/eb9acbda978d6821ccbcbf8ee7fb77dc9b506fca.pdf,ICLR,2017, +HyxFF34FPr,HkxTyNRh8r,1569440000000.0,1577170000000.0,86,FoveaBox: Beyound Anchor-based Object Detection,"[""taokongcn@gmail.com"", ""fcsun@tsinghua.edu.cn"", ""hpliu@tsinghua.edu.cn"", ""jiangyuning@bytedance.com"", ""lileilab@bytedance.com"", ""jshi@seas.upenn.edu""]","[""Tao Kong"", ""Fuchun Sun"", ""Huaping Liu"", ""Yuning Jiang"", ""Lei Li"", ""Jianbo Shi""]",[],"We present FoveaBox, an accurate, flexible, and completely anchor-free framework for object detection. While almost all state-of-the-art object detectors utilize predefined anchors to enumerate possible locations, scales and aspect ratios for the search of the objects, their performance and generalization ability are also limited to the design of anchors. Instead, FoveaBox directly learns the object existing possibility and the bounding box coordinates without anchor reference. This is achieved by: (a) predicting category-sensitive semantic maps for the object existing possibility, and (b) producing category-agnostic bounding box for each position that potentially contains an object. The scales of target boxes are naturally associated with feature pyramid representations. We demonstrate its effectiveness on standard benchmarks and report extensive experimental analysis. Without bells and whistles, FoveaBox achieves state-of-the-art single model performance on the standard COCO detection benchmark. 
More importantly, FoveaBox avoids all computation and hyper-parameters related to anchor boxes, which are often sensitive to the final detection performance. We believe the simple and effective approach will serve as a solid baseline and help ease future research for object detection. ",/pdf/c729b3a75ff60601acc3e68bd461d7d017e01970.pdf,ICLR,2020, +ByxAOoR5K7,rJgf8sFqK7,1538090000000.0,1545360000000.0,400,Policy Generalization In Capacity-Limited Reinforcement Learning,"[""lerchr2@rpi.edu"", ""simsc3@rpi.edu""]","[""Rachel A. Lerch"", ""Chris R. Sims""]","[""reinforcement learning"", ""generalization"", ""capacity constraints"", ""information theory""]","Motivated by the study of generalization in biological intelligence, we examine reinforcement learning (RL) in settings where there are information-theoretic constraints placed on the learner's ability to represent a behavioral policy. We first show that the problem of optimizing expected utility within capacity-limited learning agents maps naturally to the mathematical field of rate-distortion (RD) theory. Applying the RD framework to the RL setting, we develop a new online RL algorithm, Capacity-Limited Actor-Critic, that learns a policy that optimizes a tradeoff between utility maximization and information processing costs. Using this algorithm in a 2D gridworld environment, we demonstrate two novel empirical results. First, at high information rates (high channel capacity), the algorithm achieves faster learning and discovers better policies compared to the standard tabular actor-critic algorithm. Second, we demonstrate that agents with capacity-limited policy representations avoid 'overfitting' and exhibit superior transfer to modified environments, compared to policies learned by agents with unlimited information processing resources. 
Our work provides a principled framework for the development of computationally rational RL agents.",/pdf/d8ecb44aa671cf095226cba56db8f864dd49acc1.pdf,ICLR,2019,This paper describes the application of rate-distortion theory to the learning of efficient (capacity limited) policy representations in the reinforcement learning setting. +HkUR_y-RZ,r1U0O1ZCZ,1509120000000.0,1519380000000.0,496,SEARNN: Training RNNs with global-local losses,"[""remi.leblond@inria.fr"", ""jean-baptiste.alayrac@inria.fr"", ""aosokin@hse.ru"", ""slacoste@iro.umontreal.ca""]","[""R\u00e9mi Leblond"", ""Jean-Baptiste Alayrac"", ""Anton Osokin"", ""Simon Lacoste-Julien""]","[""Structured prediction"", ""RNNs""]","We propose SEARNN, a novel training algorithm for recurrent neural networks (RNNs) inspired by the ""learning to search"" (L2S) approach to structured prediction. RNNs have been widely successful in structured prediction applications such as machine translation or parsing, and are commonly trained using maximum likelihood estimation (MLE). Unfortunately, this training loss is not always an appropriate surrogate for the test error: by only maximizing the ground truth probability, it fails to exploit the wealth of information offered by structured losses. Further, it introduces discrepancies between training and predicting (such as exposure bias) that may hurt test performance. Instead, SEARNN leverages test-alike search space exploration to introduce global-local losses that are closer to the test error. We first demonstrate improved performance over MLE on two different tasks: OCR and spelling correction. Then, we propose a subsampling strategy to enable SEARNN to scale to large vocabulary sizes. 
This allows us to validate the benefits of our approach on a machine translation task.",/pdf/15c49b74558c239a4dbfa438d3abe751a51ac298.pdf,ICLR,2018,"We introduce SeaRNN, a novel algorithm for RNN training, inspired by the learning to search approach to structured prediction, in order to avoid the limitations of MLE training." +r1ez_sRcFQ,r1gw-SKctQ,1538090000000.0,1545360000000.0,334,Pixel Redrawn For A Robust Adversarial Defense,"[""ho_jiacang@hotmail.com"", ""dkkang@dongseo.ac.kr""]","[""Jiacang Ho"", ""Dae-Ki Kang""]","[""adversarial machine learning"", ""deep learning"", ""adversarial example""]","Recently, an adversarial example becomes a serious problem to be aware of because it can fool trained neural networks easily. +To prevent the issue, many researchers have proposed several defense techniques such as adversarial training, input transformation, stochastic activation pruning, etc. +In this paper, we propose a novel defense technique, Pixel Redrawn (PR) method, which redraws every pixel of training images to convert them into distorted images. +The motivation for our PR method is from the observation that the adversarial attacks have redrawn some pixels of the original image with the known parameters of the trained neural network. +Mimicking these attacks, our PR method redraws the image without any knowledge of the trained neural network. +This method can be similar to the adversarial training method but our PR method can be used to prevent future attacks. 
+Experimental results on several benchmark datasets indicate our PR method not only relieves the over-fitting issue when we train neural networks with a large number of epochs, but it also boosts the robustness of the neural network.",/pdf/495ecfd024b40fb82e0c9b6ce036513549327cf6.pdf,ICLR,2019, +SyeKf30cFQ,ryej0CJ5Km,1538090000000.0,1545360000000.0,1284,A theoretical framework for deep and locally connected ReLU network,"[""yuandong@fb.com""]","[""Yuandong Tian""]","[""theoretical analysis"", ""deep network"", ""optimization"", ""disentangled representation""]","Understanding theoretical properties of deep and locally connected nonlinear network, such as deep convolutional neural network (DCNN), is still a hard problem despite its empirical success. In this paper, we propose a novel theoretical framework for such networks with ReLU nonlinearity. The framework bridges data distribution with gradient descent rules, favors disentangled representations and is compatible with common regularization techniques such as Batch Norm, after a novel discovery of its projection nature. The framework is built upon teacher-student setting, by projecting the student's forward/backward pass onto the teacher's computational graph. We do not impose unrealistic assumptions (e.g., Gaussian inputs, independence of activation, etc). Our framework could help facilitate theoretical analysis of many practical issues, e.g. disentangled representations in deep networks. 
",/pdf/a46dbffa3b74f4f0bd6621b9c02f44bfe1695001.pdf,ICLR,2019,This paper presents a theoretical framework that models data distribution explicitly for deep and locally connected ReLU network +SyNvti09KQ,B1lE1XZ9tQ,1538090000000.0,1550880000000.0,456,Visceral Machines: Risk-Aversion in Reinforcement Learning with Intrinsic Physiological Rewards,"[""damcduff@microsoft.com"", ""akapoor@microsoft.com""]","[""Daniel McDuff"", ""Ashish Kapoor""]","[""Reinforcement Learning"", ""Simulation"", ""Affective Computing""]"," As people learn to navigate the world, autonomic nervous system (e.g., ``fight or flight) responses provide intrinsic feedback about the potential consequence of action choices (e.g., becoming nervous when close to a cliff edge or driving fast around a bend.) Physiological changes are correlated with these biological preparations to protect one-self from danger. We present a novel approach to reinforcement learning that leverages a task-independent intrinsic reward function trained on peripheral pulse measurements that are correlated with human autonomic nervous system responses. Our hypothesis is that such reward functions can circumvent the challenges associated with sparse and skewed rewards in reinforcement learning settings and can help improve sample efficiency. We test this in a simulated driving environment and show that it can increase the speed of learning and reduce the number of collisions during the learning stage.",/pdf/a319c280c607c2305e442c3be08ab6220accec3a.pdf,ICLR,2019,We present a novel approach to reinforcement learning that leverages a task-independent intrinsic reward function trained on peripheral pulse measurements that are correlated with human autonomic nervous system responses. 
+Syl7OsRqY7,rJeNyTcStm,1538090000000.0,1545790000000.0,340,Coarse-grain Fine-grain Coattention Network for Multi-evidence Question Answering,"[""victor@victorzhong.com"", ""cxiong@salesforce.com"", ""nkeskar@salesforce.com"", ""richard@socher.org""]","[""Victor Zhong"", ""Caiming Xiong"", ""Nitish Shirish Keskar"", ""Richard Socher""]","[""question answering"", ""reading comprehension"", ""nlp"", ""natural language processing"", ""attention"", ""representation learning""]","End-to-end neural models have made significant progress in question answering, however recent studies show that these models implicitly assume that the answer and evidence appear close together in a single document. In this work, we propose the Coarse-grain Fine-grain Coattention Network (CFC), a new question answering model that combines information from evidence across multiple documents. The CFC consists of a coarse-grain module that interprets documents with respect to the query then finds a relevant answer, and a fine-grain module which scores each candidate answer by comparing its occurrences across all of the documents with the query. We design these modules using hierarchies of coattention and self-attention, which learn to emphasize different parts of the input. On the Qangaroo WikiHop multi-evidence question answering task, the CFC obtains a new state-of-the-art result of 70.6% on the blind test set, outperforming the previous best by 3% accuracy despite not using pretrained contextual encoders.",/pdf/5df7a4e2cf855f9416e8dac29a9942d08c4c2e33.pdf,ICLR,2019,A new state-of-the-art model for multi-evidence question answering using coarse-grain fine-grain hierarchical attention. 
+gl3D-xY7wLq,sqBZcbTshUy,1601310000000.0,1615840000000.0,2112,Noise or Signal: The Role of Image Backgrounds in Object Recognition,"[""~Kai_Yuanqing_Xiao1"", ""~Logan_Engstrom1"", ""~Andrew_Ilyas1"", ""~Aleksander_Madry1""]","[""Kai Yuanqing Xiao"", ""Logan Engstrom"", ""Andrew Ilyas"", ""Aleksander Madry""]","[""Backgrounds"", ""Model Biases"", ""Robustness"", ""Computer Vision""]","We assess the tendency of state-of-the-art object recognition models to depend on signals from image backgrounds. We create a toolkit for disentangling foreground and background signal on ImageNet images, and find that (a) models can achieve non-trivial accuracy by relying on the background alone, (b) models often misclassify images even in the presence of correctly classified foregrounds--up to 88% of the time with adversarially chosen backgrounds, and (c) more accurate models tend to depend on backgrounds less. Our analysis of backgrounds brings us closer to understanding which correlations machine learning models use, and how they determine models' out of distribution performance. +",/pdf/16106e84acc8eb97c1a091b0c318d22052d9ab9b.pdf,ICLR,2021,We develop and use a toolkit to investigate models’ use of (and reliance on) image backgrounds. +uxYjVEXx48i,GdLM09I_PGn,1601310000000.0,1614990000000.0,1259,An Examination of Preference-based Reinforcement Learning for Treatment Recommendation,"[""~Nan_Xu2"", ""~Nitin_Kamra1"", ""~Yan_Liu1""]","[""Nan Xu"", ""Nitin Kamra"", ""Yan Liu""]","[""Preference-based Reinforcement Learning"", ""Treatment Recommendation"", ""healthcare""]","Treatment recommendation is a complex multi-faceted problem with many conflicting objectives, e.g., optimizing the survival rate (or expected lifetime), mitigating negative impacts, reducing financial expenses and time costs, avoiding over-treatment, etc. 
While this complicates the hand-engineering of a reward function for learning treatment policies, fortunately, qualitative feedback from human experts is readily available and can be easily exploited. Since direct estimation of rewards via inverse reinforcement learning is a challenging task and requires the existence of an optimal human policy, the field of treatment recommendation has recently witnessed the development of the preference-based Reinforcement Learning (PRL) framework, which infers a reward function from only qualitative and imperfect human feedback to ensure that a human expert’s preferred policy has a higher expected return over a less preferred policy. In this paper, we first present an open simulation platform to model the progression of two diseases, namely Cancer and Sepsis, and the reactions of the affected individuals to the received treatment. Secondly, we investigate important problems in adopting preference-based RL approaches for treatment recommendation, such as advantages of learning from preference over hand-engineered reward, addressing incomparable policies, reward interpretability, and agent design via simulated experiments. 
The designed simulation platform and insights obtained for preference-based RL approaches are beneficial for achieving the right trade-off between various human objectives during treatment recommendation.",/pdf/ab9d9eaf60be6f66ecde42086838f1fc3f3a6542.pdf,ICLR,2021,Develop a simulation platform and investigate preference-based reinforcement learning approaches for treatment recommendation +HkgU3xBtDS,H1gPLl-YPr,1569440000000.0,1577170000000.0,2538,REFINING MONTE CARLO TREE SEARCH AGENTS BY MONTE CARLO TREE SEARCH,"[""katsuki.ohto@gmail.com""]","[""Katsuki Ohto""]","[""Reinforcement Learning"", ""Monte Carlo Tree Search"", ""Alpha Zero""]","Reinforcement learning methods that continuously learn neural networks by episode generation with game tree search have been successful in two-person complete information deterministic games such as chess, shogi, and Go. However, there are only reports of practical cases and there are little evidence to guarantee the stability and the final performance of learning process. In this research, the coordination of episode generation was focused on. By means of regarding the entire system as game tree search, the new method can handle the trade-off between exploitation and exploration during episode generation. 
The experiments with a small problem showed that it had robust performance compared to the existing method, Alpha Zero.",/pdf/905a687ee32441494a1a9f40cae3f0c5e4e815c8.pdf,ICLR,2020,Apply Monte Carlo Tree Search to episode generation in Alpha Zero +HJeYSxHFDS,ByxxTdgKPS,1569440000000.0,1577170000000.0,2293,Gauge Equivariant Spherical CNNs,"[""b.kicanaoglu@uva.nl"", ""pimdehaan@gmail.com"", ""taco.cohen@gmail.com""]","[""Berkay Kicanaoglu"", ""Pim de Haan"", ""Taco Cohen""]","[""deep learning"", ""convolutional networks"", ""equivariance"", ""gauge equivariance"", ""symmetry"", ""geometric deep learning"", ""manifold convolution""]","Spherical CNNs are convolutional neural networks that can process signals on the sphere, such as global climate and weather patterns or omnidirectional images. Over the last few years, a number of spherical convolution methods have been proposed, based on generalized spherical FFTs, graph convolutions, and other ideas. However, none of these methods is simultaneously equivariant to 3D rotations, able to detect anisotropic patterns, computationally efficient, agnostic to the type of sample grid used, and able to deal with signals defined on only a part of the sphere. To address these limitations, we introduce the Gauge Equivariant Spherical CNN. Our method is based on the recently proposed theory of Gauge Equivariant CNNs, which is in principle applicable to signals on any manifold, and which can be computed on any set of local charts covering all of the manifold or only part of it. In this paper we show how this method can be implemented efficiently for the sphere, and show that the resulting method is fast, numerically accurate, and achieves good results on the widely used benchmark problems of climate pattern segmentation and omnidirectional semantic segmentation.",/pdf/b318b186758b46c2a48c6dc4dda0433bf77ffaf1.pdf,ICLR,2020,This paper proposes a scalable equivariant spherical convolution. 
+HylZIT4Yvr,BkgPrMOPDH,1569440000000.0,1577170000000.0,549,Structural Language Models for Any-Code Generation,"[""urialon@cs.technion.ac.il"", ""roysadaka@gmail.com"", ""omerlevy@gmail.com"", ""yahave@cs.technion.ac.il""]","[""Uri Alon"", ""Roy Sadaka"", ""Omer Levy"", ""Eran Yahav""]","[""Program Generation"", ""Structural Language Model"", ""SLM"", ""Generative Model"", ""Code Generation""]","We address the problem of Any-Code Generation (AnyGen) - generating code without any restriction on the vocabulary or structure. The state-of-the-art in this problem is the sequence-to-sequence (seq2seq) approach, which treats code as a sequence and does not leverage any structural information. We introduce a new approach to AnyGen that leverages the strict syntax of programming languages to model a code snippet as tree structural language modeling (SLM). SLM estimates the probability of the program's abstract syntax tree (AST) by decomposing it into a product of conditional probabilities over its nodes. We present a neural model that computes these conditional probabilities by considering all AST paths leading to a target node. Unlike previous structural techniques that have severely restricted the kinds of expressions that can be generated, our approach can generate arbitrary expressions in any programming language. Our model significantly outperforms both seq2seq and a variety of existing structured approaches in generating Java and C# code. 
We make our code, datasets, and models available online.",/pdf/cd2191e3e737ff42e92d21e0faaf6196872a4273.pdf,ICLR,2020,We generate source code using a Structural Language Model over the program's Abstract Syntax Tree +rJx8I1rFwr,Hkg_p8a_DS,1569440000000.0,1577170000000.0,1729,Meta-Learning by Hallucinating Useful Examples,"[""yuxiongw@cs.cmu.edu"", ""braverthan2@gmail.com"", ""hebert@ri.cmu.edu"", ""karteek.alahari@inria.fr""]","[""Yu-Xiong Wang"", ""Yuki Uchiyama"", ""Martial Hebert"", ""Karteek Alahari""]","[""few-shot learning"", ""meta-learning""]","Learning to hallucinate additional examples has recently been shown as a promising direction to address few-shot learning tasks, which aim to learn novel concepts from very few examples. The hallucination process, however, is still far from generating effective samples for learning. In this work, we investigate two important requirements for the hallucinator --- (i) precision: the generated examples should lead to good classifier performance, and (ii) collaboration: both the hallucinator and the classification component need to be trained jointly. By integrating these requirements as novel loss functions into a general meta-learning with hallucination framework, our model-agnostic PrecisE Collaborative hAlluciNator (PECAN) facilitates data hallucination to improve the performance of new classification tasks. 
Extensive experiments demonstrate state-of-the-art performance on competitive miniImageNet and ImageNet based few-shot benchmarks in various scenarios.",/pdf/c5c53a6e6cefd1a495a582b3f2c1f7723d513b8b.pdf,ICLR,2020, +c7rtqjVaWiE,aTmk1UdN7Nw,1601310000000.0,1614990000000.0,1621,Efficient Sampling for Generative Adversarial Networks with Reparameterized Markov Chains,"[""~Yifei_Wang1"", ""~Yisen_Wang1"", ""yjs@math.pku.edu.cn"", ""~Zhouchen_Lin1""]","[""Yifei Wang"", ""Yisen Wang"", ""Jiansheng Yang"", ""Zhouchen Lin""]","[""Generative Adverarial Networks"", ""Sampling"", ""Markov chain Monte Carlo"", ""Reparameterization""]","Recently, sampling methods have been successfully applied to enhance the sample quality of Generative Adversarial Networks (GANs). However, in practice, they typically have poor sample efficiency because of the independent proposal sampling from the generator. In this work, we propose REP-GAN, a novel sampling method that allows general dependent proposals by REParameterizing the Markov chains into the latent space of the generator. Theoretically, we show that our reparameterized proposal admits a closed-form Metropolis-Hastings acceptance ratio. Empirically, extensive experiments on synthetic and real datasets demonstrate that our REP-GAN largely improves the sample efficiency and obtains better sample quality simultaneously.",/pdf/73a28026ce5a250d4131c24ca67f9953a144da87.pdf,ICLR,2021,"We develop a novel sampling method for GANs, called REP-GAN, that enjoys better sample efficiency and sample quality." 
+r1gNLAEFPS,HkxKUxDODS,1569440000000.0,1577170000000.0,1136,Neural ODEs for Image Segmentation with Level Sets,"[""rafaelvalle@nvidia.com"", ""freda@nvidia.com"", ""mshoeybi@nvidia.com"", ""plegresley@nvidia.com"", ""atao@nvidia.com"", ""bcatanzaro@nvidia.com""]","[""Rafael Valle"", ""Fitsum Reda"", ""Mohammad Shoeybi"", ""Patrick Legresley"", ""Andrew Tao"", ""Bryan Catanzaro""]","[""neural odes"", ""level sets"", ""image segmentation""]","We propose a novel approach for image segmentation that combines Neural Ordinary Differential Equations (NODEs) and the Level Set method. Our approach parametrizes the evolution of an initial contour with a NODE that implicitly learns from data a speed function describing the evolution. In addition, for cases where an initial contour is not available and to alleviate the need for careful choice or design of contour embedding functions, we propose a NODE-based method that evolves an image embedding into a dense per-pixel semantic label space. We evaluate our methods on kidney segmentation (KiTS19) and on salient object detection (PASCAL-S, ECSSD and HKU-IS). In addition to improving initial contours provided by deep learning models while using a fraction of their number of parameters, our approach achieves F scores that are higher than several state-of-the-art deep learning algorithms",/pdf/e04274311e894c7e7079cbd952f222d805094d38.pdf,ICLR,2020,We propose a novel approach for image segmentation that combines Neural Ordinary Differential Equations (NODEs) and the Level Set method. 
+rylCP6NFDB,BJlX_diPDB,1569440000000.0,1577170000000.0,614,Hindsight Trust Region Policy Optimization,"[""zhanghanbo163@stu.xjtu.edu.cn"", ""best99317@stu.xjtu.edu.cn"", ""xglan@xjtu.edu.cn"", ""nnzheng@xjtu.edu.cn""]","[""Hanbo Zhang"", ""Site Bai"", ""Xuguang Lan"", ""Nanning Zheng""]","[""Hindsight"", ""Sparse Reward"", ""Reinforcement Learning"", ""Policy Gradients""]","As reinforcement learning continues to drive machine intelligence beyond its conventional boundary, unsubstantial practices in sparse reward environment severely limit further applications in a broader range of advanced fields. Motivated by the demand for an effective deep reinforcement learning algorithm that accommodates sparse reward environment, this paper presents Hindsight Trust Region Policy Optimization (HTRPO), a method that efficiently utilizes interactions in sparse reward conditions to optimize policies within trust region and, in the meantime, maintains learning stability. Firstly, we theoretically adapt the TRPO objective function, in the form of the expected return of the policy, to the distribution of hindsight data generated from the alternative goals. Then, we apply Monte Carlo with importance sampling to estimate KL-divergence between two policies, taking the hindsight data as input. Under the condition that the distributions are sufficiently close, the KL-divergence is approximated by another f-divergence. Such approximation results in the decrease of variance and alleviates the instability during policy update. Experimental results on both discrete and continuous benchmark tasks demonstrate that HTRPO converges significantly faster than previous policy gradient methods. It achieves effective performances and high data-efficiency for training policies in sparse reward environments.",/pdf/502553557152a6aa563737f058aa15c350f1d5cd.pdf,ICLR,2020,This paper proposes an advanced policy optimization method with hindsight experience for sparse reward reinforcement learning. 
+V69LGwJ0lIN,l1rLEavcjTr,1601310000000.0,1616040000000.0,870,OPAL: Offline Primitive Discovery for Accelerating Offline Reinforcement Learning,"[""~Anurag_Ajay1"", ""~Aviral_Kumar2"", ""~Pulkit_Agrawal1"", ""~Sergey_Levine1"", ""~Ofir_Nachum1""]","[""Anurag Ajay"", ""Aviral Kumar"", ""Pulkit Agrawal"", ""Sergey Levine"", ""Ofir Nachum""]","[""Offline Reinforcement Learning"", ""Primitive Discovery"", ""Unsupervised Learning""]","Reinforcement learning (RL) has achieved impressive performance in a variety of online settings in which an agent’s ability to query the environment for transitions and rewards is effectively unlimited. However, in many practical applications, the situation is reversed: an agent may have access to large amounts of undirected offline experience data, while access to the online environment is severely limited. In this work, we focus on this offline setting. Our main insight is that, when presented with offline data composed of a variety of behaviors, an effective way to leverage this data is to extract a continuous space of recurring and temporally extended primitive behaviors before using these primitives for downstream task learning. Primitives extracted in this way serve two purposes: they delineate the behaviors that are supported by the data from those that are not, making them useful for avoiding distributional shift in offline RL; and they provide a degree of temporal abstraction, which reduces the effective horizon yielding better learning in theory, and improved offline RL in practice. In addition to benefiting offline policy optimization, we show that performing offline primitive learning in this way can also be leveraged for improving few-shot imitation learning as well as exploration and transfer in online RL on a variety of benchmark domains. 
Visualizations and code are available at https://sites.google.com/view/opal-iclr",/pdf/b57bd58e05e17b49da003dded33a75506b9ddbac.pdf,ICLR,2021,"An effective way to leverage multimodal offline behavioral data is to extract a continuous space of primitives, and use it for downstream task learning." +ARFshOO1Iu,KxJuZWiNbPQ,1601310000000.0,1614990000000.0,2017,Adaptive Self-training for Neural Sequence Labeling with Few Labels,"[""~Yaqing_Wang1"", ""~Subhabrata_Mukherjee2"", ""haochu@microsoft.com"", ""yuantu@microsoft.com"", ""mingwu@microsoft.com"", ""~Jing_Gao1"", ""~Ahmed_Hassan_Awadallah1""]","[""Yaqing Wang"", ""Subhabrata Mukherjee"", ""Haoda Chu"", ""Yuancheng Tu"", ""Ming Wu"", ""Jing Gao"", ""Ahmed Hassan Awadallah""]","[""Self-training"", ""Neural Sequence Labeling"", ""Meta Learning""]","Neural sequence labeling is an important technique employed for many Natural Language Processing (NLP) tasks, such as Named Entity Recognition (NER), slot tagging for dialog systems and semantic parsing. Large-scale pre-trained language models obtain very good performance on these tasks when fine-tuned on large amounts of task-specific labeled data. However, such large-scale labeled datasets are difficult to obtain for several tasks and domains due to the high cost of human annotation as well as privacy and data access constraints for sensitive user applications. This is exacerbated for sequence labeling tasks requiring such annotations at token-level. In this work, we develop techniques to address the label scarcity challenge for neural sequence labeling models. Specifically, we develop self-training and meta-learning techniques for training neural sequence taggers with few labels. While self-training serves as an effective mechanism to learn from large amounts of unlabeled data -- meta-learning helps in adaptive sample re-weighting to mitigate error propagation from noisy pseudo-labels. 
Extensive experiments on six benchmark datasets including two for massive multilingual NER and four slot tagging datasets for task-oriented dialog systems demonstrate the effectiveness of our method. With only 10 labeled examples for each class for each task, our method obtains 10% improvement over state-of-the-art systems demonstrating its effectiveness for the low-resource setting. ",/pdf/d1fbbbb7041d9863a000f81ce4fe792f52bd77b8.pdf,ICLR,2021,We develop techniques leveraging self-training and meta-learning for few-shot training of neural sequence taggers based on large-scale pre-trained language models. +SkeP3yBFDS,rkgja-1tvr,1569440000000.0,1577170000000.0,1955,Reducing Computation in Recurrent Networks by Selectively Updating State Neurons,"[""twhartvigsen@wpi.edu"", ""csen@wpi.edu"", ""xkong@wpi.edu"", ""rundenst@wpi.edu""]","[""Thomas Hartvigsen"", ""Cansu Sen"", ""Xiangnan Kong"", ""Elke Rundensteiner""]","[""recurrent neural networks"", ""conditional computation"", ""representation learning""]","Recurrent Neural Networks (RNN) are the state-of-the-art approach to sequential learning. However, standard RNNs use the same amount of computation at each timestep, regardless of the input data. As a result, even for high-dimensional hidden states, all dimensions are updated at each timestep regardless of the recurrent memory cell. Reducing this rigid assumption could allow for models with large hidden states to perform inference more quickly. Intuitively, not all hidden state dimensions need to be recomputed from scratch at each timestep. Thus, recent methods have begun studying this problem by imposing mainly a priori-determined patterns for updating the state. In contrast, we now design a fully-learned approach, SA-RNN, that augments any RNN by predicting discrete update patterns at the fine granularity of independent hidden state dimensions through the parameterization of a distribution of update-likelihoods driven entirely by the input data. 
We achieve this without imposing assumptions on the structure of the update pattern. Better yet, our method adapts the update patterns online, allowing different dimensions to be updated conditional to the input. To learn which to update, the model solves a multi-objective optimization problem, maximizing accuracy while minimizing the number of updates based on a unified control. Using publicly-available datasets we demonstrate that our method consistently achieves higher accuracy with fewer updates compared to state-of-the-art alternatives. Additionally, our method can be directly applied to a wide variety of models containing RNN architectures.",/pdf/08cf4380ea9c107d75980378e8b7059b3bb275e3.pdf,ICLR,2020,We show that conditionally computing individual dimensions of an RNN's hidden state depending on input data at each time step from scratch with no assumptions leads to higher accuracy with far fewer computations than state-of-the-art approach. +BJy0fcgRZ,HykAGcxRZ,1509100000000.0,1518730000000.0,349,Capturing Human Category Representations by Sampling in Deep Feature Spaces,"[""peterson.c.joshua@gmail.com"", ""kaghi@berkeley.edu"", ""suchow@berkeley.edu"", ""alexku@berkeley.edu"", ""tom_griffiths@berkeley.edu""]","[""Joshua Peterson"", ""Krishan Aghi"", ""Jordan Suchow"", ""Alexander Ku"", ""Tom Griffiths""]","[""category representations"", ""psychology"", ""cognitive science"", ""deep neural networks""]","Understanding how people represent categories is a core problem in cognitive science, with the flexibility of human learning remaining a gold standard to which modern artificial intelligence and machine learning aspire. Decades of psychological research have yielded a variety of formal theories of categories, yet validating these theories with naturalistic stimuli remains a challenge. 
The problem is that human category representations cannot be directly observed and running informative experiments with naturalistic stimuli such as images requires having a workable representation of these stimuli. Deep neural networks have recently been successful in a range of computer vision tasks and provide a way to represent the features of images. In this paper, we introduce a method for estimating the structure of human categories that draws on ideas from both cognitive science and machine learning, blending human-based algorithms with state-of-the-art deep representation learners. We provide qualitative and quantitative results as a proof of concept for the feasibility of the method. Samples drawn from human distributions rival the quality of current state-of-the-art generative models and outperform alternative methods for estimating the structure of human categories.",/pdf/25517f85b1d4ecfe69c658faadbbf877b2e69a0f.pdf,ICLR,2018,using deep neural networks and clever algorithms to capture human mental visual concepts +H1lj0nNFwB,r1ejteZrwB,1569440000000.0,1583910000000.0,277,The Implicit Bias of Depth: How Incremental Learning Drives Generalization,"[""daniel.gissin@mail.huji.ac.il"", ""shais@cs.huji.ac.il"", ""amit.daniely@mail.huji.ac.il""]","[""Daniel Gissin"", ""Shai Shalev-Shwartz"", ""Amit Daniely""]","[""gradient flow"", ""gradient descent"", ""implicit regularization"", ""implicit bias"", ""generalization"", ""optimization"", ""quadratic network"", ""matrix sensing""]","A leading hypothesis for the surprising generalization of neural networks is that the dynamics of gradient descent bias the model towards simple solutions, by searching through the solution space in an incremental order of complexity. We formally define the notion of incremental learning dynamics and derive the conditions on depth and initialization for which this phenomenon arises in deep linear models. 
Our main theoretical contribution is a dynamical depth separation result, proving that while shallow models can exhibit incremental learning dynamics, they require the initialization to be exponentially small for these dynamics to present themselves. However, once the model becomes deeper, the dependence becomes polynomial and incremental learning can arise in more natural settings. We complement our theoretical findings by experimenting with deep matrix sensing, quadratic neural networks and with binary classification using diagonal and convolutional linear networks, showing all of these models exhibit incremental learning.",/pdf/e4badf9a069cdc9246fc1ed7147b791b50dedb4f.pdf,ICLR,2020,"We study the sparsity-inducing bias of deep models, caused by their learning dynamics." +rJPcZ3txx,,1478240000000.0,1488350000000.0,134,Faster CNNs with Direct Sparse Convolutions and Guided Pruning,"[""jongsoo.park@intel.com"", ""sheng.r.li@intel.com"", ""peter.tang@intel.com"", ""weiwen.web@gmail.com"", ""HAL66@pitt.edu"", ""yic52@pitt.edu"", ""pradeep.dubey@intel.com""]","[""Jongsoo Park"", ""Sheng Li"", ""Wei Wen"", ""Ping Tak Peter Tang"", ""Hai Li"", ""Yiran Chen"", ""Pradeep Dubey""]","[""Deep learning"", ""Optimization""]","Phenomenally successful in practical inference problems, convolutional neural networks (CNN) are widely deployed in mobile devices, data centers, and even supercomputers. +The number of parameters needed in CNNs, however, are often large and undesirable. Consequently, various methods have been developed to prune a CNN once it is trained. +Nevertheless, the resulting CNNs offer limited benefits. While pruning the fully connected layers reduces a CNN's size considerably, it does not improve inference speed noticeably as the compute heavy parts lie in convolutions. Pruning CNNs in a way that increase inference speed often imposes specific sparsity structures, thus limiting the achievable sparsity levels. 
+ +We present a method to realize simultaneously size economy and speed improvement while pruning CNNs. Paramount to our success is an efficient general sparse-with-dense matrix +multiplication implementation that is applicable to convolution of feature maps with kernels of arbitrary sparsity patterns. Complementing this, we developed a performance model that predicts sweet spots of sparsity levels for different layers and on different computer architectures. Together, these two allow us to demonstrate 3.1-7.3x convolution speedups over dense convolution in AlexNet, on Intel Atom, Xeon, and Xeon Phi processors, spanning the spectrum from mobile devices to supercomputers. +",/pdf/b64769c6f7937fa521cbcfceb2bc0af1c24aeb00.pdf,ICLR,2017,"Highly-performance sparse convolution outperforms dense with only 70% sparsity. Performance model that guides training to find useful sparsity range, applied to AlexNet and GoogLeNet" +SJxmfgSYDB,rJgrCWxYvB,1569440000000.0,1577170000000.0,2168,Representing Unordered Data Using Multiset Automata and Complex Numbers,"[""jdebened@nd.edu"", ""dchiang@nd.edu""]","[""Justin DeBenedetto"", ""David Chiang""]","[""sets"", ""multisets"", ""automata"", ""complex numbers"", ""position encodings""]","Unordered, variable-sized inputs arise in many settings across multiple fields. The ability for set- and multiset- oriented neural networks to handle this type of input has been the focus of much work in recent years. We propose to represent multisets using complex-weighted multiset automata and show how the multiset representations of certain existing neural architectures can be viewed as special cases of ours. Namely, (1) we provide a new theoretical and intuitive justification for the Transformer model's representation of positions using sinusoidal functions, and (2) we extend the DeepSets model to use complex numbers, enabling it to outperform the existing model on an extension of one of their tasks. 
+",/pdf/6f237a0c94b692383df4fa1536667a632645481e.pdf,ICLR,2020,Automata for multisets and complex numbers give a new way of thinking about DeepSets and Transformer position encodings. +rJVoEiCqKQ,rken9UgutQ,1538090000000.0,1545360000000.0,32,Deep Perm-Set Net: Learn to predict sets with unknown permutation and cardinality using deep neural networks,"[""hamid.rezatofighi@adelaide.edu.au"", ""roman.kaskman@tum.de"", ""farbod.motlagh@student.adelaide.edu.au"", ""javen.shi@adelaide.edu.au"", ""cremers@tum.de"", ""leal.taixe@tum.de"", ""ian.reid@adelaide.edu.au""]","[""S. Hamid Rezatofighi"", ""Roman Kaskman"", ""Farbod T. Motlagh"", ""Qinfeng Shi"", ""Daniel Cremers"", ""Laura Leal-Taix\u00e9"", ""Ian Reid""]","[""Set learning"", ""Permutation invariant"", ""Object detection"", ""CAPTCHA test""]","Many real-world problems, e.g. object detection, have outputs that are naturally expressed as sets of entities. This creates a challenge for traditional deep neural networks which naturally deal with structured outputs such as vectors, matrices or tensors. We present a novel approach for learning to predict sets with unknown permutation and cardinality using deep neural networks. Specifically, in our formulation we incorporate the permutation as unobservable variable and estimate its distribution during the learning process using alternating optimization. We demonstrate the validity of this new formulation on two relevant vision problems: object detection, for which our formulation outperforms state-of-the-art detectors such as Faster R-CNN and YOLO, and a complex CAPTCHA test, where we observe that, surprisingly, our set based network acquired the ability of mimicking arithmetics without any rules being coded.",/pdf/1a3f4c416c02f44141e22fb24062203ccd658b06.pdf,ICLR,2019,We present a novel approach for learning to predict sets with unknown permutation and cardinality using feed-forward deep neural networks. 
+SyxBgkBFPS,B1gV9EiODB,1569440000000.0,1577170000000.0,1505,Guided Adaptive Credit Assignment for Sample Efficient Policy Optimization,"[""lhao499@gmail.com"", ""rsocher@salesforce.com"", ""cxiong@salesforce.com""]","[""Hao Liu"", ""Richard Socher"", ""Caiming Xiong""]","[""credit assignment"", ""sparse reward"", ""policy optimization"", ""sample efficiency""]","Policy gradient methods have achieved remarkable successes in solving challenging reinforcement learning problems. However, it still often suffers from sparse reward tasks, which leads to poor sample efficiency during training. In this work, we propose a guided adaptive credit assignment method to do effectively credit assignment for policy gradient methods. Motivated by entropy regularized policy optimization, our method extends the previous credit assignment methods by introducing more general guided adaptive credit assignment(GACA). The benefit of GACA is a principled way of utilizing off-policy samples. The effectiveness of proposed algorithm is demonstrated on the challenging \textsc{WikiTableQuestions} and \textsc{WikiSQL} benchmarks and an instruction following environment. The task is generating action sequences or program sequences from natural language questions or instructions, where only final binary success-failure execution feedback is available. 
Empirical studies show that our method significantly improves the sample efficiency of the state-of-the-art policy optimization approaches.",/pdf/63cce53d34840572bda1a6ccd24877368f6c33ba.pdf,ICLR,2020,A new and general credit assignment method for obtaining sample efficiency of policy optimization in sparse reward setting +eNSpdJeR_J,Ihu77EKrkTH,1601310000000.0,1614990000000.0,736,Deep Learning with Data Privacy via Residual Perturbation,"[""twq17@mails.tsinghua.edu.cn"", ""linghm18@mails.tsinghua.edu.cn"", ""~Zuoqiang_Shi1"", ""~Bao_Wang1""]","[""Wenqi Tao"", ""Huaming Ling"", ""Zuoqiang Shi"", ""Bao Wang""]","[""Data Privacy"", ""Residual Perturbation"", ""Deep Learning""]","Protecting data privacy in deep learning (DL) is at its urgency. Several celebrated privacy notions have been established and used for privacy-preserving DL. However, many of the existing mechanisms achieve data privacy at the cost of significant utility degradation. In this paper, we propose a stochastic differential equation principled \emph{residual perturbation} for privacy-preserving DL, which injects Gaussian noise into each residual mapping of ResNets. Theoretically, we prove that residual perturbation guarantees differential privacy (DP) and reduces the generalization gap for DL. Empirically, we show that residual perturbation outperforms the state-of-the-art DP stochastic gradient descent (DPSGD) in both membership privacy protection and maintaining the DL models' utility. For instance, in the process of training ResNet8 for the IDC dataset classification, residual perturbation obtains an accuracy of 85.7\% and protects the perfect membership privacy; in contrast, DPSGD achieves an accuracy of 82.8\% and protects worse membership privacy. ",/pdf/304dc3574b5c25cee112a4d4f6e31a445f4f3d77.pdf,ICLR,2021,We propose a stochastic differential equation principled residual perturbation for privacy-preserving DL. 
+r1xdH3CcKX,rkgR1Sa9Y7,1538090000000.0,1550910000000.0,1557,Stochastic Prediction of Multi-Agent Interactions from Partial Observations,"[""chensun@google.com"", ""perk@google.com"", ""jiajunwu@mit.edu"", ""jbt@mit.edu"", ""kpmurphy@google.com""]","[""Chen Sun"", ""Per Karlsson"", ""Jiajun Wu"", ""Joshua B Tenenbaum"", ""Kevin Murphy""]","[""Dynamics modeling"", ""partial observations"", ""multi-agent interactions"", ""predictive models""]","We present a method which learns to integrate temporal information, from a learned dynamics model, with ambiguous visual information, from a learned vision model, in the context of interacting agents. Our method is based on a graph-structured variational recurrent neural network, which is trained end-to-end to infer the current state of the (partially observed) world, as well as to forecast future states. We show that our method outperforms various baselines on two sports datasets, one based on real basketball trajectories, and one generated by a soccer game engine.",/pdf/fe101282d3869202cd110467d865a2e62985f27a.pdf,ICLR,2019,We present a method which learns to integrate temporal information and ambiguous visual information in the context of interacting agents. +exa2mDqPb5E,m2GjaCIQPwJ,1601310000000.0,1614990000000.0,564,CIGMO: Learning categorical invariant deep generative models from grouped data,"[""~Haruo_Hosoya1""]","[""Haruo Hosoya""]","[""variational autoencoders"", ""disentangling"", ""mixture"", ""clustering""]","Images of general objects are often composed of three hidden factors: category (e.g., car or chair), shape (e.g., particular car form), and view (e.g., 3D orientation). While there have been many disentangling models that can discover either a category or shape factor separately from a view factor, such models typically cannot capture the structure of general objects that the diversity of shapes is much larger across categories than within a category. 
Here, we propose a novel generative model called CIGMO, which can learn to represent the category, shape, and view factors at once only with weak supervision. Concretely, we develop mixture of disentangling deep generative models, where the mixture components correspond to object categories and each component model represents shape and view in a category-specific and mutually invariant manner. We devise a learning method based on variational autoencoders that does not explicitly use label information but uses only grouping information that links together different views of the same object. Using several datasets of 3D objects including ShapeNet, we demonstrate that our model often outperforms previous relevant models including state-of-the-art methods in invariant clustering and one-shot classification tasks, in a manner exposing the importance of categorical invariant representation. +",/pdf/49cb9d3172dc250d7fdb450b1a0b40d7b5d9097f.pdf,ICLR,2021,"A novel VAE-based deep general model that can discover category, shape, and view factors of general object images is presented." +r1efr3C9Ym,S1gVkB6qF7,1538090000000.0,1550880000000.0,1523,Interpolation-Prediction Networks for Irregularly Sampled Time Series,"[""snshukla@cs.umass.edu"", ""marlin@cs.umass.edu""]","[""Satya Narayan Shukla"", ""Benjamin Marlin""]","[""irregular sampling"", ""multivariate time series"", ""supervised learning"", ""interpolation"", ""missing data""]","In this paper, we present a new deep learning architecture for addressing the problem of supervised learning with sparse and irregularly sampled multivariate time series. The architecture is based on the use of a semi-parametric interpolation network followed by the application of a prediction network. The interpolation network allows for information to be shared across multiple dimensions of a multivariate time series during the interpolation stage, while any standard deep learning model can be used for the prediction network. 
This work is motivated by the analysis of physiological time series data in electronic health records, which are sparse, irregularly sampled, and multivariate. We investigate the performance of this architecture on both classification and regression tasks, showing that our approach outperforms a range of baseline and recently proposed models. +",/pdf/5201cabcb77acbccbcd047875c6156afdf096bac.pdf,ICLR,2019,This paper presents a new deep learning architecture for addressing the problem of supervised learning with sparse and irregularly sampled multivariate time series. +H1ls_eSKPH,HklVV6xtwB,1569440000000.0,1577170000000.0,2410,Overcoming Catastrophic Forgetting via Hessian-free Curvature Estimates,"[""butyreld@iis.fraunhofer.de"", ""georgios.kontes@iis.fraunhofer.de"", ""christoffer.loeffler@iis.fraunhofer.de"", ""christopher.mutschler@iis.fraunhofer.de""]","[""Leonid Butyrev"", ""Georgios Kontes"", ""Christoffer L\u00f6ffler"", ""Christopher Mutschler""]","[""catastrophic forgetting"", ""multi-task learning"", ""continual learning""]","Learning neural networks with gradient descent over a long sequence of tasks is problematic as their fine-tuning to new tasks overwrites the network weights that are important for previous tasks. This leads to a poor performance on old tasks – a phenomenon framed as catastrophic forgetting. While early approaches use task rehearsal and growing networks that both limit the scalability of the task sequence orthogonal approaches build on regularization. Based on the Fisher information matrix (FIM) changes to parameters that are relevant to old tasks are penalized, which forces the task to be mapped into the available remaining capacity of the network. This requires to calculate the Hessian around a mode, which makes learning tractable. In this paper, we introduce Hessian-free curvature estimates as an alternative method to actually calculating the Hessian. 
In contrast to previous work, we exploit the fact that most regions in the loss surface are flat and hence only calculate a Hessian-vector-product around the surface that is relevant for the current task. Our experiments show that on a variety of well-known task sequences we either significantly outperform or are en par with previous work.",/pdf/ab15c22442b0c99c538ce4e98ef2c265e2cbf12b.pdf,ICLR,2020,This paper provides an approach to address catastrophic forgetting via Hessian-free curvature estimates +BkpiPMbA-,r1isPGWRb,1509140000000.0,1521760000000.0,863,Decision Boundary Analysis of Adversarial Examples,"[""_w@eecs.berkeley.edu"", ""lxbosky@gmail.com"", ""dawnsong.travel@gmail.com""]","[""Warren He"", ""Bo Li"", ""Dawn Song""]","[""adversarial machine learning"", ""supervised representation learning"", ""decision regions"", ""decision boundaries""]","Deep neural networks (DNNs) are vulnerable to adversarial examples, which are carefully crafted instances aiming to cause prediction errors for DNNs. Recent research on adversarial examples has examined local neighborhoods in the input space of DNN models. However, previous work has limited what regions to consider, focusing either on low-dimensional subspaces or small balls. In this paper, we argue that information from larger neighborhoods, such as from more directions and from greater distances, will better characterize the relationship between adversarial examples and the DNN models. First, we introduce an attack, OPTMARGIN, which generates adversarial examples robust to small perturbations. These examples successfully evade a defense that only considers a small ball around an input instance. Second, we analyze a larger neighborhood around input instances by looking at properties of surrounding decision boundaries, namely the distances to the boundaries and the adjacent classes. We find that the boundaries around these adversarial examples do not resemble the boundaries around benign examples. 
Finally, we show that, under scrutiny of the surrounding decision boundaries, our OPTMARGIN examples do not convincingly mimic benign examples. Although our experiments are limited to a few specific attacks, we hope these findings will motivate new, more evasive attacks and ultimately, effective defenses.",/pdf/07458ec060bb54583b51954ce9377141138be895.pdf,ICLR,2018,Looking at decision boundaries around an input gives you more information than a fixed small neighborhood +r1GkMhAqYm,SkewPbn5F7,1538090000000.0,1545360000000.0,1228,CoDraw: Collaborative Drawing as a Testbed for Grounded Goal-driven Communication,"[""kitaev@cs.berkeley.edu"", ""jnhwkim@gmail.com"", ""xinleic@fb.com"", ""maroffm@gmail.com"", ""yuandong@fb.com"", ""dbatra@gatech.edu"", ""parikh@gatech.edu""]","[""Nikita Kitaev"", ""Jin-Hwa Kim"", ""Xinlei Chen"", ""Marcus Rohrbach"", ""Yuandong Tian"", ""Dhruv Batra"", ""Devi Parikh""]","[""CoDraw"", ""collaborative drawing"", ""grounded language""]","In this work, we propose a goal-driven collaborative task that contains language, vision, and action in a virtual environment as its core components. Specifically, we develop a Collaborative image-Drawing game between two agents, called CoDraw. Our game is grounded in a virtual world that contains movable clip art objects. The game involves two players: a Teller and a Drawer. The Teller sees an abstract scene containing multiple clip art pieces in a semantically meaningful configuration, while the Drawer tries to reconstruct the scene on an empty canvas using available clip art pieces. The two players communicate via two-way communication using natural language. We collect the CoDraw dataset of ~10K dialogs consisting of ~138K messages exchanged between human agents. 
We define protocols and metrics to evaluate the effectiveness of learned agents on this testbed, highlighting the need for a novel ""crosstalk"" condition which pairs agents trained independently on disjoint subsets of the training data for evaluation. We present models for our task, including simple but effective baselines and neural network approaches trained using a combination of imitation learning and goal-driven training. All models are benchmarked using both fully automated evaluation and by playing the game with live human agents.",/pdf/6251ae315133b2de3431dad0d8ee0282660cb5c1.pdf,ICLR,2019,"We introduce a dataset, models, and training + evaluation protocols for a collaborative drawing task that allows studying goal-driven and perceptually + actionably grounded language generation and understanding. " +SJky6Ry0W,ByCChR10W,1509060000000.0,1518730000000.0,197,Learning Independent Causal Mechanisms,"[""gparascandolo@tue.mpg.de"", ""mrojascarulla@gmail.com"", ""nkilbertus@tue.mpg.de"", ""bs@tue.mpg.de""]","[""Giambattista Parascandolo"", ""Mateo Rojas Carulla"", ""Niki Kilbertus"", ""Bernhard Schoelkopf""]",[],"Independent causal mechanisms are a central concept in the study of causality +with implications for machine learning tasks. In this work we develop +an algorithm to recover a set of (inverse) independent mechanisms relating +a distribution transformed by the mechanisms to a reference distribution. +The approach is fully unsupervised and based on a set of experts that compete +for data to specialize and extract the mechanisms. We test and analyze +the proposed method on a series of experiments based on image transformations. +Each expert successfully maps a subset of the transformed data +to the original domain, and the learned mechanisms generalize to other +domains. 
We discuss implications for domain transfer and links to recent +trends in generative modeling.",/pdf/6a8a2ce728fc2f60f3f165fb54dc0e3bfce9802e.pdf,ICLR,2018, +IqVB8e0DlUd,nsX7Zk8muYt,1601310000000.0,1614990000000.0,1528,Fair Differential Privacy Can Mitigate the Disparate Impact on Model Accuracy,"[""~Wenyan_Liu1"", ""~Xiangfeng_Wang1"", ""xjlu@cs.ecnu.edu.cn"", ""jhcheng@stu.ecnu.edu.cn"", ""~Bo_Jin1"", ""xlwang@cs.ecnu.edu.cn"", ""~Hongyuan_Zha1""]","[""Wenyan Liu"", ""Xiangfeng Wang"", ""Xingjian Lu"", ""Junhong Cheng"", ""Bo Jin"", ""Xiaoling Wang"", ""Hongyuan Zha""]",[],"The techniques based on the theory of differential privacy (DP) has become a standard building block in the machine learning community. DP training mechanisms offer strong guarantees that an adversary cannot determine with high confidence about the training data based on analyzing the released model, let alone any details of the instances. However, DP may disproportionately affect the underrepresented and relatively complicated classes. That is, the reduction in utility is unequal for each class. This paper proposes a fair differential privacy algorithm (FairDP) to mitigate the disparate impact on model accuracy for each class. We cast the learning procedure as a two-stage optimization problem, which integrates differential privacy with fairness. FairDP establishes a self-adaptive DP mechanism and dynamically adjusts instance influence in each class depending on the theoretical bias-variance bound. 
Our experimental evaluation shows the effectiveness of FairDP in mitigating the disparate impact on model accuracy among the classes on several benchmark datasets and scenarios ranging from text to vision.",/pdf/8835a8423ad60f884ffc21cdb28fbb71f79eadc7.pdf,ICLR,2021, +B1x996EKPS,BkxKCW0vPB,1569440000000.0,1577170000000.0,717,Fast Machine Learning with Byzantine Workers and Servers,"[""elmahdi.elmhamdi@epfl.ch"", ""rachid.guerraoui@epfl.ch"", ""arsany.guirguis@epfl.ch""]","[""El-Mahdi El-Mhamdi"", ""Rachid Guerraoui"", ""Arsany Guirguis""]","[""Distributed machine learning"", ""Byzantine resilience"", ""Fault tolerance""]","Machine Learning (ML) solutions are nowadays distributed and are prone to various types of component failures, which can be encompassed in so-called Byzantine behavior. This paper introduces LiuBei, a Byzantine-resilient ML algorithm that does not trust any individual component in the network (neither workers nor servers), nor does it induce additional communication rounds (on average), compared to standard non-Byzantine resilient algorithms. LiuBei builds upon gradient aggregation rules (GARs) to tolerate a minority of Byzantine workers. Besides, LiuBei replicates the parameter server on multiple machines instead of trusting it. We introduce a novel filtering mechanism that enables workers to filter out replies from Byzantine server replicas without requiring communication with all servers. Such a filtering mechanism is based on network synchrony, Lipschitz continuity of the loss function, and the GAR used to aggregate workers’ gradients. We also introduce a protocol, scatter/gather, to bound drifts between models on correct servers with a small number of communication messages. We theoretically prove that LiuBei achieves Byzantine resilience to both servers and workers and guarantees convergence. 
We build LiuBei using TensorFlow, and we show that LiuBei tolerates Byzantine behavior with an accuracy loss of around 5% and around 24% convergence overhead compared to vanilla TensorFlow. We moreover show that the throughput gain of LiuBei compared to another state–of–the–art Byzantine–resilient ML algorithm (that assumes network asynchrony) is 70%.",/pdf/89a7a256d6081d76e6b56f054112df6ed1a14de2.pdf,ICLR,2020,We present an algorithm that tolerates not only Byzantine workers but also Byzantine servers in synchronous networks with a low overhead. +B1gkpR4FDB,H1e_ETtuPr,1569440000000.0,1577170000000.0,1378,Statistical Adaptive Stochastic Optimization,"[""penzhan@microsoft.com"", ""hjl@mit.edu"", ""lqiang@cs.utexas.edu"", ""lin.xiao@microsoft.com""]","[""Pengchuan Zhang"", ""Hunter Lang"", ""Qiang Liu"", ""Lin Xiao""]",[],"We investigate statistical methods for automatically scheduling the learning rate (step size) in stochastic optimization. First, we consider a broad family of stochastic optimization methods with constant hyperparameters (including the learning rate and various forms of momentum) and derive a general necessary condition for the resulting dynamics to be stationary. Based on this condition, we develop a simple online statistical test to detect (non-)stationarity and use it to automatically drop the learning rate by a constant factor whenever stationarity is detected. Unlike in prior work, our stationarity condition and our statistical test applies to different algorithms without modification. Finally, we propose a smoothed stochastic line-search method that can be used to warm up the optimization process before the statistical test can be applied effectively. This removes the expensive trial and error for setting a good initial learning rate. 
The combined method is highly autonomous and it attains state-of-the-art training and testing performance in our experiments on several deep learning tasks.",/pdf/1ca2acf7c9a77a798ce15cef8d9820a805b6e77c.pdf,ICLR,2020, +ryY4RhkCZ,SyYNA2kAZ,1509050000000.0,1518730000000.0,176,DEEP DENSITY NETWORKS AND UNCERTAINTY IN RECOMMENDER SYSTEMS,"[""yoel.z@taboola.com"", ""sth@deeplab.ai"", ""efrat.s@taboola.com"", ""aviv.r@taboola.com"", ""gil.c@taboola.com"", ""dan.f@taboola.com""]","[""Yoel Zeldes"", ""Stavros Theodorakis"", ""Efrat Solodnik"", ""Aviv Rotman"", ""Gil Chamiel"", ""Dan Friedman""]","[""deep learning"", ""recommendation system"", ""uncertainty"", ""context-based and collaborative filtering""]","Building robust online content recommendation systems requires learning complex interactions between user preferences and content features. The field has evolved rapidly in recent years from traditional multi-arm bandit and collaborative filtering techniques, with new methods integrating Deep Learning models that enable to capture non-linear feature interactions. Despite progress, the dynamic nature of online recommendations still poses great challenges, such as finding the delicate balance between exploration and exploitation. In this paper we provide a novel method, Deep Density Networks (DDN) which deconvolves measurement and data uncertainty and predicts probability densities of CTR, enabling us to perform more efficient exploration of the feature space. We show the usefulness of using DDN online in a real world content recommendation system that serves billions of recommendations per day, and present online and offline results to evaluate the benefit of using DDN.",/pdf/1e9b725db47e3c9f229dc1a840ad17bd37de1491.pdf,ICLR,2018,"We have introduced Deep Density Network, a unified DNN model to estimate uncertainty for exploration/exploitation in recommendation systems."
+BJ0Ee8cxx,,1478280000000.0,1478450000000.0,283,Hierarchical Memory Networks,"[""apsarathchandar@gmail.com"", ""sjn.ahn@gmail.com"", ""hugo@twitter.com"", ""vincentp@iro.umontreal.ca"", ""gtesauro@us.ibm.com"", ""yoshua.bengio@umontreal.ca""]","[""Sarath Chandar"", ""Sungjin Ahn"", ""Hugo Larochelle"", ""Pascal Vincent"", ""Gerald Tesauro"", ""Yoshua Bengio""]","[""Deep learning"", ""Natural language processing""]","Memory networks are neural networks with an explicit memory component that can be both read and written to by the network. The memory is often addressed in a soft way using a softmax function, making end-to-end training with backpropagation possible. However, this is not computationally scalable for applications which require the network to read from extremely large memories. On the other hand, it is well known that hard attention mechanisms based on reinforcement learning are challenging to train successfully. In this paper, we explore a form of hierarchical memory network, which can be considered as a hybrid between hard and soft attention memory networks. The memory is organized in a hierarchical structure such that reading from it is done with less computation than soft attention over a flat memory, while also being easier to train than hard attention over a flat memory. Specifically, we propose to incorporate Maximum Inner Product Search (MIPS) in the training and inference procedures for our hierarchical memory network. We explore the use of various state-of-the art approximate MIPS techniques and report results on SimpleQuestions, a challenging large scale factoid question answering task. +",/pdf/cc931474b5c9d578c9e5e4543ba03ead784ffd50.pdf,ICLR,2017,We propose a hierarchical memory organization strategy for efficient memory access in memory networks with large memory. 
+r1xHxgrKwr,BylFfayKPB,1569440000000.0,1577170000000.0,2098,Anomaly Detection Based on Unsupervised Disentangled Representation Learning in Combination with Manifold Learning,"[""xli343@uottawa.ca"", ""iluju.kiringa@uottawa.ca"", ""tyeap@uottawa.ca"", ""xiaodan.zhu@queensu.ca"", ""yifeng.li@nrc-cnrc.gc.ca""]","[""Xiaoyan Li"", ""Iluju Kiringa"", ""Tet Yeap"", ""Xiaodan Zhu"", ""Yifeng Li""]","[""anomaly detection"", ""disentangled representation learning"", ""manifold learning""]","Identifying anomalous samples from highly complex and unstructured data is a crucial but challenging task in a variety of intelligent systems. In this paper, we present a novel deep anomaly detection framework named AnoDM (standing for Anomaly detection based on unsupervised Disentangled representation learning and Manifold learning). The disentanglement learning is currently implemented by beta-VAE for automatically discovering interpretable factorized latent representations in a completely unsupervised manner. The manifold learning is realized by t-SNE for projecting the latent representations to a 2D map. We define a new anomaly score function by combining beta-VAE's reconstruction error in the raw feature space and local density estimation in the t-SNE space. 
AnoDM was evaluated on both image and time-series data and achieved better results than models that use just one of the two measures and other deep learning methods.",/pdf/73b24ac9b4f376e22bab42e483e2e71a96dd8e52.pdf,ICLR,2020,We developed anomaly detection framework based on beta-VAE and t-SNE +083vV3utxpC,h05tfHyE7-f,1601310000000.0,1614990000000.0,922,Deep Partial Updating,"[""~Zhongnan_Qu1"", ""~Cong_Liu2"", ""~Junfeng_Guo2"", ""~Lothar_Thiele1""]","[""Zhongnan Qu"", ""Cong Liu"", ""Junfeng Guo"", ""Lothar Thiele""]","[""Partial updating"", ""communication constraints"", ""server-to-edge"", ""deep neural networks""]","Emerging edge intelligence applications require the server to continuously retrain and update deep neural networks deployed on remote edge nodes to leverage newly collected data samples. Unfortunately, it may be impossible in practice to continuously send fully updated weights to these edge nodes due to the highly constrained communication resource. In this paper, we propose the weight-wise deep partial updating paradigm, which smartly selects only a subset of weights to update at each server-to-edge communication round, while achieving a similar performance compared to full updating. Our method is established through analytically upper-bounding the loss difference between partial updating and full updating, and only updates the weights which make the largest contributions to the upper bound. Extensive experimental results demonstrate the efficacy of our partial updating methodology which achieves a high inference accuracy while updating a rather small number of weights.",/pdf/cc5a2d7154866620190f05c837936144c3c99910.pdf,ICLR,2021,"To iteratively improve the deployed deep neural network with newly collected data samples, we propose the server only sends a weight-wise partial update to edge devices to save the communication and computation resources." 
+B1l8iiA9tQ,r1xm69dqKQ,1538090000000.0,1545360000000.0,624,Backdrop: Stochastic Backpropagation,"[""siavash.golkar@gmail.com"", ""kyle.cranmer@nyu.edu""]","[""Siavash Golkar"", ""Kyle Cranmer""]","[""stochastic optimization"", ""multi-scale data analysis"", ""non-decomposable loss"", ""generalization"", ""one-shot learning""]","We introduce backdrop, a flexible and simple-to-implement method, intuitively described as dropout acting only along the backpropagation pipeline. Backdrop is implemented via one or more masking layers which are inserted at specific points along the network. Each backdrop masking layer acts as the identity in the forward pass, but randomly masks parts of the backward gradient propagation. Intuitively, inserting a backdrop layer after any convolutional layer leads to stochastic gradients corresponding to features of that scale. Therefore, backdrop is well suited for problems in which the data have a multi-scale, hierarchical structure. Backdrop can also be applied to problems with non-decomposable loss functions where standard SGD methods are not well suited. We perform a number of experiments and demonstrate that backdrop leads to significant improvements in generalization.",/pdf/83dfc0524fa50888ad55198adbd0377e3f184130.pdf,ICLR,2019,"We introduce backdrop, intuitively described as dropout acting on the backpropagation pipeline and find significant improvements in generalization for problems with non-decomposable losses and problems with multi-scale, hierarchical data structure." 
+H1epaJSYDS,SJeziH1YvS,1569440000000.0,1577170000000.0,2004,Anchor & Transform: Learning Sparse Representations of Discrete Objects,"[""pliang@cs.cmu.edu"", ""manzilzaheer@google.com"", ""yuanwang@google.com"", ""amra@google.com""]","[""Paul Pu Liang"", ""Manzil Zaheer"", ""Yuan Wang"", ""Amr Ahmed""]","[""sparse representation learning"", ""discrete inputs"", ""natural language processing""]","Learning continuous representations of discrete objects such as text, users, and items lies at the heart of many applications including text and user modeling. Unfortunately, traditional methods that embed all objects do not scale to large vocabulary sizes and embedding dimensions. In this paper, we propose a general method, Anchor & Transform (ANT) that learns sparse representations of discrete objects by jointly learning a small set of anchor embeddings and a sparse transformation from anchor objects to all objects. ANT is scalable, flexible, end-to-end trainable, and allows the user to easily incorporate domain knowledge about object relationships (e.g. WordNet, co-occurrence, item clusters). ANT also recovers several task-specific baselines under certain structural assumptions on the anchors and transformation matrices. On text classification and language modeling benchmarks, ANT demonstrates stronger performance with fewer parameters as compared to existing vocabulary selection and embedding compression baselines.",/pdf/e73eb16003af6c20dc39ef5db1fd65f54cba0a42.pdf,ICLR,2020,"We propose a general method to learn sparse representations of discrete objects that is scalable, flexible, end-to-end trainable, and allows the user to easily incorporate domain knowledge about object relationships." 
+BJg7x1HFvB,Hkeyozi_wB,1569440000000.0,1577170000000.0,1499,Well-Read Students Learn Better: On the Importance of Pre-training Compact Models,"[""iuliaturc@google.com"", ""mingweichang@google.com"", ""kentonl@google.com"", ""kristout@google.com""]","[""Iulia Turc"", ""Ming-Wei Chang"", ""Kenton Lee"", ""Kristina Toutanova""]","[""NLP"", ""self-supervised learning"", ""language model pre-training"", ""knowledge distillation"", ""BERT"", ""compact models""]","Recent developments in natural language representations have been accompanied by large and expensive models that leverage vast amounts of general-domain text through self-supervised pre-training. Due to the cost of applying such models to down-stream tasks, several model compression techniques on pre-trained language representations have been proposed (Sun et al., 2019; Sanh, 2019). However, surprisingly, the simple baseline of just pre-training and fine-tuning compact models has been overlooked. In this paper, we first show that pre-training remains important in the context of smaller architectures, and fine-tuning pre-trained compact models can be competitive to more elaborate methods proposed in concurrent work. Starting with pre-trained compact models, we then explore transferring task knowledge from large fine-tuned models through standard knowledge distillation. The resulting simple, yet effective and general algorithm, Pre-trained Distillation, brings further improvements. Through extensive experiments, we more generally explore the interaction between pre-training and distillation under two variables that have been under-studied: model size and properties of unlabeled task data. One surprising observation is that they have a compound effect even when sequentially applied on the same data. 
To accelerate future research, we will make our 24 pre-trained miniature BERT models publicly available.",/pdf/224b8126e47ada10dae37b8f2a68d2e076676eed.pdf,ICLR,2020,Studies how self-supervised learning and knowledge distillation interact in the context of building compact models. +HJgK0h4Ywr,ryl92ogrvH,1569440000000.0,1583910000000.0,273,Theory and Evaluation Metrics for Learning Disentangled Representations,"[""dkdo@deakin.edu.au"", ""truyen.tran@deakin.edu.au""]","[""Kien Do"", ""Truyen Tran""]","[""disentanglement"", ""metrics""]","We make two theoretical contributions to disentanglement learning by (a) defining precise semantics of disentangled representations, and (b) establishing robust metrics for evaluation. First, we characterize the concept “disentangled representations” used in supervised and unsupervised methods along three dimensions–informativeness, separability and interpretability–which can be expressed and quantified explicitly using information-theoretic constructs. This helps explain the behaviors of several well-known disentanglement learning models. We then propose robust metrics for measuring informativeness, separability and interpretability. Through a comprehensive suite of experiments, we show that our metrics correctly characterize the representations learned by different methods and are consistent with qualitative (visual) results. Thus, the metrics allow disentanglement learning methods to be compared on a fair ground. We also empirically uncovered new interesting properties of VAE-based methods and interpreted them with our formulation. These findings are promising and hopefully will encourage the design of more theoretically driven models for learning disentangled representations. 
",/pdf/2e192160d2c979003aa25256ddab01689a94c2b7.pdf,ICLR,2020, +e3KNSdWFOfT,fPFaDWbxtNV,1601310000000.0,1614990000000.0,2595,Solving Min-Max Optimization with Hidden Structure via Gradient Descent Ascent,"[""~Emmanouil-Vasileios_Vlatakis-Gkaragkounis1"", ""~Lampros_Flokas1"", ""~Georgios_Piliouras1""]","[""Emmanouil-Vasileios Vlatakis-Gkaragkounis"", ""Lampros Flokas"", ""Georgios Piliouras""]","[""Min-max optimization"", ""Lyapunov functions"", ""Stability Analysis"", ""Generative Adversarial Networks"", ""Non-convex optimization""]","Many recent AI architectures are inspired by zero-sum games, however, the behavior of their dynamics is still not well understood. Inspired by this, we study standard gradient descent ascent (GDA) dynamics in a specific class of non-convex non-concave zero-sum games, that we call hidden zero-sum games. In this class, players control the inputs of smooth but possibly non-linear functions whose outputs are being applied as inputs to a convex-concave game. Unlike general min-max games, these games have a well-defined notion of solution; outcomes that implement the von-Neumann equilibrium of the ``hidden convex-concave game. We prove that if the hidden game is strictly convex-concave then vanilla GDA converges not merely to local Nash, but typically to the von-Neumann solution. If the game lacks strict convexity properties, GDA may fail to converge to any equilibrium, however, by applying standard regularization techniques we can prove convergence to a von-Neumann solution of a slightly perturbed min-max game. Our convergence guarantees are non-local, which as far as we know is a first-of-its-kind type of result in non-convex non-concave games. Finally, we discuss connections of our framework with generative adversarial networks. +",/pdf/809a8d03b53bf40e39ed01ad02b80a7684777567.pdf,ICLR,2021,We prove non-local asymptotic convergence guarantees in a class of non-convex non-concave zero-sum. 
+BJxOHs0cKm,Syl0N-FYtm,1538090000000.0,1545360000000.0,101,Identifying Generalization Properties in Neural Networks,"[""huan.wang@salesforce.com"", ""nkeskar@salesforce.com"", ""cxiong@salesforce.com"", ""rsocher@salesforce.com""]","[""Huan Wang"", ""Nitish Shirish Keskar"", ""Caiming Xiong"", ""Richard Socher""]","[""generalization"", ""PAC-Bayes"", ""Hessian"", ""perturbation""]","While it has not yet been proven, empirical evidence suggests that model generalization is related to local properties of the optima which can be described via the Hessian. We connect model generalization with the local property of a solution under the PAC-Bayes paradigm. In particular, we prove that model generalization ability is related to the Hessian, the higher-order ""smoothness"" terms characterized by the Lipschitz constant of the Hessian, and the scales of the parameters. Guided by the proof, we propose a metric to score the generalization capability of the model, as well as an algorithm that optimizes the perturbed model accordingly. ",/pdf/264740b8de4799f2111a2f6db4bbde286aa987f0.pdf,ICLR,2019,a theory connecting Hessian of the solution and the generalization power of the model +K4wkUp5xNK,90lxzSewIhJ,1601310000000.0,1614990000000.0,3443,Invariant Causal Representation Learning,"[""~Chaochao_Lu1"", ""~Yuhuai_Wu1"", ""~Jos\u00e9_Miguel_Hern\u00e1ndez-Lobato1"", ""~Bernhard_Sch\u00f6lkopf1""]","[""Chaochao Lu"", ""Yuhuai Wu"", ""Jos\u00e9 Miguel Hern\u00e1ndez-Lobato"", ""Bernhard Sch\u00f6lkopf""]",[],"Due to spurious correlations, machine learning systems often fail to generalize to environments whose distributions differ from the ones used at training time. Prior work addressing this, either explicitly or implicitly, attempted to find a data representation that has an invariant causal relationship with the outcome. 
This is done by leveraging a diverse set of training environments to reduce the effect of spurious features, on top of which an invariant classifier is then built. However, these methods have generalization guarantees only when both data representation and classifiers come from a linear model class. As an alternative, we propose Invariant Causal Representation Learning (ICRL), a learning paradigm that enables out-of-distribution generalization in the nonlinear setting (i.e., nonlinear representations and nonlinear classifiers). It builds upon a practical and general assumption: data representations factorize when conditioning on the outcome and the environment. Based on this, we show identifiability up to a permutation and pointwise transformation. We also prove that all direct causes of the outcome can be fully discovered, which further enables us to obtain generalization guarantees in the nonlinear setting. Extensive experiments on both synthetic and real-world datasets show that our approach significantly outperforms a variety of baseline methods.",/pdf/8d88bacf2003373864c0da490d54b99f9e7a81e3.pdf,ICLR,2021,"We propose Invariant Causal Representation Learning (ICRL), a novel learning paradigm that enables out-of-distribution generalization in the nonlinear setting." +Hye9lnCct7,B1gCvwTqtm,1538090000000.0,1550950000000.0,1099,Learning Actionable Representations with Goal Conditioned Policies,"[""dibya.ghosh@berkeley.edu"", ""abhigupta@berkeley.edu"", ""svlevine@eecs.berkeley.edu""]","[""Dibya Ghosh"", ""Abhishek Gupta"", ""Sergey Levine""]","[""Representation Learning"", ""Reinforcement Learning""]","Representation learning is a central challenge across a range of machine learning areas. In reinforcement learning, effective and functional representations have the potential to tremendously accelerate learning progress and solve more challenging problems. 
Most prior work on representation learning has focused on generative approaches, learning representations that capture all the underlying factors of variation in the observation space in a more disentangled or well-ordered manner. In this paper, we instead aim to learn functionally salient representations: representations that are not necessarily complete in terms of capturing all factors of variation in the observation space, but rather aim to capture those factors of variation that are important for decision making -- that are ""actionable"". These representations are aware of the dynamics of the environment, and capture only the elements of the observation that are necessary for decision making rather than all factors of variation, eliminating the need for explicit reconstruction. We show how these learned representations can be useful to improve exploration for sparse reward problems, to enable long horizon hierarchical reinforcement learning, and as a state representation for learning policies for downstream tasks. We evaluate our method on a number of simulated environments, and compare it to prior methods for representation learning, exploration, and hierarchical reinforcement learning.",/pdf/3ea74941cff53b4fb4027089afaec787ce62ebdd.pdf,ICLR,2019,Learning state representations which capture factors necessary for control +iqmOTi9J7E8,w_cknKncJAg,1601310000000.0,1614990000000.0,2687,Private Split Inference of Deep Networks,"[""~Mohammad_Samragh1"", ""~Hossein_Hosseini4"", ""kambiza@qti.qualcomm.com"", ""jsoriaga@qti.qualcomm.com""]","[""Mohammad Samragh"", ""Hossein Hosseini"", ""Kambiz Azarian"", ""Joseph Soriaga""]","[""ML privacy"", ""split inference""]","Splitting network computations between the edge device and the cloud server is a promising approach for enabling low edge-compute and private inference of neural networks. Current methods for providing the privacy train the model to minimize information leakage for a given set of private attributes. 
In practice, however, the test queries might contain private attributes that are not foreseen during training. +We propose an alternative solution, in which, instead of obfuscating the information corresponding to a set of attributes, the edge device discards the information irrelevant to the main task. To this end, the edge device runs the model up to a split layer determined based on its computational capacity and then removes the activation content that is in the null space of the next layer of the model before sending it to the server. It can further remove the low-energy components of the remaining signal to improve the privacy at the cost of reducing the accuracy. The experimental results show that our methods provide privacy while maintaining the accuracy and introducing only a small computational overhead. ",/pdf/11b11c2add40bfea26ca9fca95faf941769e4802.pdf,ICLR,2021,"We propose a split learning framework for private inference of neural networks, in which the edge device removes the content of the activations that is not relevant to the main task. " +iUTHidd-ylL,U3ObVTCA6Yd,1601310000000.0,1614990000000.0,1563,Matrix Data Deep Decoder - Geometric Learning for Structured Data Completion,"[""~Maria_Schmidt1"", ""~Alexander_Bronstein1""]","[""Maria Schmidt"", ""Alexander Bronstein""]","[""Deep learning"", ""Non-Euclidean data completion"", ""Sparse matrices"", ""Recommender systems"", ""Recommendation systems"", ""Sparse representations""]","In this work, we present a fully convolutional end to end method to reconstruct corrupted sparse matrices of Non-Euclidean data. The classic example for such matrices is recommender systems matrices where the rows/columns represent items/users and the entries are ratings. The method we present is inspired by the surprising and spectacular success of methods like ""deep image prior"" and ""deep decoder"" for corrupted image completion.
In sharp contrast to previous Matrix Completion methods wherein the latent matrix or its factors directly serve as the optimization variable, in the method we present, the matrix is parameterized as the weights of a graph neural network acting on a random noisy input. Then we are tuning the network parameters to get a result as close as possible to the initial sparse matrix (using its factors) getting that way state of the art matrix completion result. In addition to the conceptual simplicity of our method, which is just Non-Euclidean generalization of deep image priors, it holds fewer parameters than previously presented methods which makes the parameters more trackable and the method more computationally efficient and more applicable for the real-world tasks. The method also achieves state-of-the-art results for the matrix completion task on the classical benchmarks in the field. The method also surprisingly shows that untrained convolutional neural network can use a good prior not only for image completion but also for Matrix Completion when redefined for graphs.",/pdf/38ffb209883019f8715291b06a7b8004b3c591cb.pdf,ICLR,2021,Non-Euclidean Data Matrix Completion with end-to-end fully convolutional graph neural network based on Deep Image Prior Generalization +AJY3fGPF1DC,Xg8n3Rn2cdR,1601310000000.0,1614990000000.0,1261,Selecting Treatment Effects Models for Domain Adaptation Using Causal Knowledge,"[""~Trent_Kyono1"", ""~Ioana_Bica1"", ""~Zhaozhi_Qian1"", ""~Mihaela_van_der_Schaar2""]","[""Trent Kyono"", ""Ioana Bica"", ""Zhaozhi Qian"", ""Mihaela van der Schaar""]","[""causal inference"", ""treatment effects"", ""healthcare""]","Selecting causal inference models for estimating individualized treatment effects (ITE) from observational data presents a unique challenge since the counterfactual outcomes are never observed. 
The problem is challenged further in the unsupervised domain adaptation (UDA) setting where we only have access to labeled samples in the source domain, but desire selecting a model that achieves good performance on a target domain for which only unlabeled samples are available. Existing techniques for UDA model selection are designed for the predictive setting. These methods examine discriminative density ratios between the input covariates in the source and target domain and do not factor in the model's predictions in the target domain. Because of this, two models with identical performance on the source domain would receive the same risk score by existing methods, but in reality, have significantly different performance in the test domain. We leverage the invariance of causal structures across domains to propose a novel model selection metric specifically designed for ITE methods under the UDA setting. In particular, we propose selecting models whose predictions of interventions' effects satisfy known causal structures in the target domain. Experimentally, our method selects ITE models that are more robust to covariate shifts on several healthcare datasets, including estimating the effect of ventilation in COVID-19 patients from different geographic locations.",/pdf/5ce1a042bb18ddc43b40ec3c1ea4dfbd119aaa09.pdf,ICLR,2021,We take advantage of the invariance of causal graphs across domains and propose a novel model selection metric for individualized treatment effect models in the unsupervised domain adaptation setting. +Hyig0zb0Z,H1ZwaG-Cb,1509140000000.0,1518730000000.0,1020,Gated ConvNets for Letter-Based ASR,"[""vitaliy888@fb.com"", ""gab@fb.com"", ""locronan@fb.com""]","[""Vitaliy Liptchinsky"", ""Gabriel Synnaeve"", ""Ronan Collobert""]","[""automatic speech recognition"", ""letter-based acoustic model"", ""gated convnets""]","In this paper we introduce a new speech recognition system, leveraging a simple letter-based ConvNet acoustic model. 
The acoustic model requires only audio transcription for training -- no alignment annotations, nor any forced alignment step is needed. At inference, our decoder takes only a word list and a language model, and is fed with letter scores from the acoustic model -- no phonetic word lexicon is needed. Key ingredients for the acoustic model are Gated Linear Units and high dropout. We show near state-of-the-art results in word error rate on the LibriSpeech corpus with MFSC features, both on the clean and other configurations. +",/pdf/fd0e5b0202837238db3b2fbaf15b614696ade0a1.pdf,ICLR,2018,A letter-based ConvNet acoustic model leads to a simple and competitive speech recognition pipeline. +7EDgLu9reQD,0-njZSaYvwW,1601310000000.0,1616500000000.0,3563,SALD: Sign Agnostic Learning with Derivatives,"[""~Matan_Atzmon1"", ""~Yaron_Lipman1""]","[""Matan Atzmon"", ""Yaron Lipman""]","[""implicit neural representations"", ""3D shapes learning"", ""sign agnostic learning""]","Learning 3D geometry directly from raw data, such as point clouds, triangle soups, or unoriented meshes is still a challenging task that feeds many downstream computer vision and graphics applications. + +In this paper, we introduce SALD: a method for learning implicit neural representations of shapes directly from raw data. We generalize sign agnostic learning (SAL) to include derivatives: given an unsigned distance function to the input raw data, we advocate a novel sign agnostic regression loss, incorporating both pointwise values and gradients of the unsigned distance function. Optimizing this loss leads to a signed implicit function solution, the zero level set of which is a high quality and valid manifold approximation to the input 3D data. The motivation behind SALD is that incorporating derivatives in a regression loss leads to a lower sample complexity, and consequently better fitting. 
In addition, we provide empirical evidence, as well as theoretical motivation in 2D that SAL enjoys a minimal surface property, favoring minimal area solutions. More importantly, we are able to show that this property still holds for SALD, i.e., with derivatives included. + +We demonstrate the efficacy of SALD for shape space learning on two challenging datasets: ShapeNet that contains inconsistent orientation and non-manifold meshes, and D-Faust that contains raw 3D scans (triangle soups). On both these datasets, we present state-of-the-art results.",/pdf/8ee2bee9e961323ee6b93586fd34377a074e8889.pdf,ICLR,2021,Sign agnostic learning with derivatives for learning high fidelity 3D implicit neural representations shape space from raw data. +SJ1nzBeA-,SkCozSlAW,1509080000000.0,1519270000000.0,250,Multi-Task Learning for Document Ranking and Query Suggestion,"[""wasiahmad@cs.ucla.edu"", ""kwchang@cs.ucla.edu"", ""hw5x@virginia.edu""]","[""Wasi Uddin Ahmad"", ""Kai-Wei Chang"", ""Hongning Wang""]","[""Multitask Learning"", ""Document Ranking"", ""Query Suggestion""]","We propose a multi-task learning framework to jointly learn document ranking and query suggestion for web search. It consists of two major components, a document ranker, and a query recommender. Document ranker combines current query and session information and compares the combined representation with document representation to rank the documents. Query recommender tracks users' query reformulation sequence considering all previous in-session queries using a sequence to sequence approach. As both tasks are driven by the users' underlying search intent, we perform joint learning of these two components through session recurrence, which encodes search context and intent. 
Extensive comparisons against state-of-the-art document ranking and query suggestion algorithms are performed on the public AOL search log, and the promising results endorse the effectiveness of the joint learning framework.",/pdf/47c769a32eac9274edb21007afe21431e9c383cb.pdf,ICLR,2018, +tYxG_OMs9WE,9YzncB8Ak-1,1601310000000.0,1613630000000.0,2614,Property Controllable Variational Autoencoder via Invertible Mutual Dependence,"[""~Xiaojie_Guo1"", ""~Yuanqi_Du1"", ""~Liang_Zhao1""]","[""Xiaojie Guo"", ""Yuanqi Du"", ""Liang Zhao""]","[""deep generative models"", ""interpretable latent representation"", ""disentangled representation learning""]","Deep generative models have made important progress towards modeling complex, high dimensional data via learning latent representations. Their usefulness is nevertheless often limited by a lack of control over the generative process or a poor understanding of the latent representation. To overcome these issues, attention is now focused on discovering latent variables correlated to the data properties and ways to manipulate these properties. This paper presents the new Property controllable VAE (PCVAE), where a new Bayesian model is proposed to inductively bias the latent representation using explicit data properties via novel group-wise and property-wise disentanglement. Each data property corresponds seamlessly to a latent variable, by innovatively enforcing invertible mutual dependence between them. This allows us to move along the learned latent dimensions to control specific properties of the generated data with great precision. Quantitative and qualitative evaluations confirm that the PCVAE outperforms the existing models by up to 28% in capturing and 65% in manipulating the desired properties.",/pdf/7a243ac1776d6ff97e3e45e24a68cc1d897b9b36.pdf,ICLR,2021,A novel generative model for learning interpretable latent representation for generating data with desired properties. 
+rygunsAqYQ,r1x_Jrpqt7,1538090000000.0,1545360000000.0,726,Implicit Maximum Likelihood Estimation,"[""ke.li@eecs.berkeley.edu"", ""malik@eecs.berkeley.edu""]","[""Ke Li"", ""Jitendra Malik""]","[""likelihood-free inference"", ""implicit probabilistic models""]","Implicit probabilistic models are models defined naturally in terms of a sampling procedure and often induce a likelihood function that cannot be expressed explicitly. We develop a simple method for estimating parameters in implicit models that does not require knowledge of the form of the likelihood function or any derived quantities, but can be shown to be equivalent to maximizing likelihood under some conditions. Our result holds in the non-asymptotic parametric setting, where both the capacity of the model and the number of data examples are finite. We also demonstrate encouraging experimental results. ",/pdf/ae8aebd478202c1b544dca674b44118d7b6b4df3.pdf,ICLR,2019,We develop a new likelihood-free parameter estimation method that is equivalent to maximum likelihood under some conditions +ryewE3R5YX,BJxuzlCcFQ,1538090000000.0,1545360000000.0,1460,Characterizing Attacks on Deep Reinforcement Learning,"[""xiaocw@umich.edu"", ""xinleipan@berkeley.edu"", ""_w@eecs.berkeley.edu"", ""lxbosky@gmail.com"", ""jianpeng@illinois.edu"", ""sunmj15@mails.tsinghua.com"", ""jinfengyi.ustc@gmail.com"", ""mingyan@umich.edu"", ""dawnsong@gmail.com""]","[""Chaowei Xiao"", ""Xinlei Pan"", ""Warren He"", ""Bo Li"", ""Jian Peng"", ""Mingjie Sun"", ""Jinfeng Yi"", ""Mingyan Liu"", ""Dawn Song.""]",[],"Deep Reinforcement learning (DRL) has achieved great success in various applications, such as playing computer games and controlling robotic manipulation. However, recent studies show that machine learning models are vulnerable to adversarial examples, which are carefully crafted instances that aim to mislead learning models into making arbitrarily incorrect predictions, which has raised severe security concerns. 
DRL has been attacked by adding perturbation to each observed frame. However, such observation-based attacks are not quite realistic considering that it would be hard for adversaries to directly manipulate pixel values in practice. Therefore, we propose to understand the vulnerabilities of DRL from various perspectives and provide a thorough taxonomy of adversarial perturbation against DRL, and we conduct the first experiments on unexplored parts of this taxonomy. In addition to current observation-based attacks against DRL, we propose attacks based on the actions and environment dynamics. Among these experiments, we introduce a novel sequence-based attack to attack a sequence of frames for real-time scenarios such as autonomous driving, and the first targeted attack that perturbs environment dynamics to let the agent fail in a specific way. We show empirically that our sequence-based attack can generate effective perturbations in a blackbox setting in real time with a small number of queries, independent of episode length. We conduct extensive experiments to compare the effectiveness of different attacks with several baseline attack methods in several game playing, robotics control, and autonomous driving environments.",/pdf/e7e48c995bd46e04da8b6adc7d2decdd22f4ea92.pdf,ICLR,2019, +ldxlzGYWDmW,eh7PPiDj647,1601310000000.0,1612770000000.0,576,Effective Abstract Reasoning with Dual-Contrast Network,"[""~Tao_Zhuo3"", ""~Mohan_Kankanhalli1""]","[""Tao Zhuo"", ""Mohan Kankanhalli""]","[""abstract reasoning"", ""raven's progressive matrices"", ""deep learning""]","As a step towards improving the abstract reasoning capability of machines, we aim to solve Raven’s Progressive Matrices (RPM) with neural networks, since solving RPM puzzles is highly correlated with human intelligence. 
Unlike previous methods that use auxiliary annotations or assume hidden rules to produce appropriate feature representation, we only use the ground truth answer of each question for model learning, aiming for an intelligent agent to have a strong learning capability with a small amount of supervision. Based on the RPM problem formulation, the correct answer filled into the missing entry of the third row/column has to best satisfy the same rules shared between the first two rows/columns. Thus, we design a simple yet effective Dual-Contrast Network (DCNet) to exploit the inherent structure of RPM puzzles. Specifically, a rule contrast module is designed to compare the latent rules between the filled row/column and the first two rows/columns; a choice contrast module is designed to increase the relative differences between candidate choices. Experimental results on the RAVEN and PGM datasets show that DCNet outperforms the state-of-the-art methods by a large margin of 5.77%. Further experiments on few training samples and model generalization also show the effectiveness of DCNet. Code is available at https://github.com/visiontao/dcnet.",/pdf/130f155d219a92a6ce0511bf6e936499ff17abdd.pdf,ICLR,2021,We propose a simple yet effective Dual-Contrast Network (DCNet) to solve Raven's progressive matrices without using auxiliary annotations and assumptions. +rygUoeHKvB,HylajkbtwH,1569440000000.0,1577170000000.0,2510,Deep exploration by novelty-pursuit with maximum state entropy,"[""liziniu1997@gmail.com"", ""chenxh@lamda.nju.edu.cn"", ""yuy@lamda.nju.edu.cn""]","[""Zi-Niu Li"", ""Xiong-Hui Chen"", ""Yang Yu""]","[""Exploration"", ""Reinforcement Learning""]","Efficient exploration is essential to reinforcement learning in huge state spaces. Recent approaches to address this issue include the intrinsically motivated goal exploration process (IMGEP) and the maximum state entropy exploration (MSEE). 
In this paper, we disclose that goal-conditioned exploration behaviors in IMGEP can also maximize the state entropy, which bridges the IMGEP and the MSEE. From this connection, we propose a maximum entropy criterion for goal selection in goal-conditioned exploration, which results in the new exploration method novelty-pursuit. Novelty-pursuit performs the exploration in two stages: first, it selects a goal for the goal-conditioned exploration policy to reach the boundary of the explored region; then, it takes random actions to explore the non-explored region. We demonstrate the effectiveness of the proposed method in environments ranging from simple mazes and Mujoco tasks to the long-horizon video game of SuperMarioBros. Experimental results show that the proposed method outperforms the state-of-the-art approaches that use curiosity-driven exploration.",/pdf/55d7d10556a027cc280505b8fe4e654ccc26217e.pdf,ICLR,2020,We propose an efficient exploration method called Novelty-pursuit for reinforcement learning. This method bridges the intrinsically motivated goal exploration process and the maximum state entropy exploration. +UoaQUQREMOs,ZMG-Z0GPwZ,1601310000000.0,1615910000000.0,1771,CT-Net: Channel Tensorization Network for Video Classification,"[""~Kunchang_Li1"", ""~Xianhang_Li1"", ""~Yali_Wang1"", ""~Jun_Wang7"", ""~Yu_Qiao1""]","[""Kunchang Li"", ""Xianhang Li"", ""Yali Wang"", ""Jun Wang"", ""Yu Qiao""]","[""Video Classification"", ""3D Convolution"", ""Channel Tensorization""]","3D convolution is powerful for video classification but often computationally expensive; recent studies mainly focus on decomposing it on spatial-temporal and/or channel dimensions. Unfortunately, most approaches fail to achieve a preferable balance between convolutional efficiency and feature-interaction sufficiency. 
For this reason, we propose a concise and novel Channel Tensorization Network (CT-Net), by treating the channel dimension of input feature as a multiplication of K sub-dimensions. On one hand, it naturally factorizes convolution in a multiple dimension way, leading to a light computation burden. On the other hand, it can effectively enhance feature interaction from different channels, and progressively enlarge the 3D receptive field of such interaction to boost classification accuracy. Furthermore, we equip our CT-Module with a Tensor Excitation (TE) mechanism. It can learn to exploit spatial, temporal and channel attention in a high-dimensional manner, to improve the cooperative power of all the feature dimensions in our CT-Module. Finally, we flexibly adapt ResNet as our CT-Net. Extensive experiments are conducted on several challenging video benchmarks, e.g., Kinetics-400, Something-Something V1 and V2. Our CT-Net outperforms a number of recent SOTA approaches, in terms of accuracy and/or efficiency.",/pdf/c320f306f4e64aa4b2f77386c7128cf4b1caef92.pdf,ICLR,2021,"To achieve convolution efficiency and feature-interaction sufficiency, we propose a Channel Tensorization Network (CT-Net), by treating the channel dimension of input feature as a multiplication of K sub-dimensions." +ryG6xZ-RZ,B1ealbWAW,1509130000000.0,1518730000000.0,645,DLVM: A modern compiler infrastructure for deep learning systems,"[""xwei12@illinois.edu"", ""lanes@illinois.edu"", ""vadve@illinois.edu""]","[""Richard Wei"", ""Lane Schwartz"", ""Vikram Adve""]","[""deep learning"", ""automatic differentiation"", ""algorithmic differentiation"", ""domain specific languages"", ""neural networks"", ""programming languages"", ""DSLs""]","Deep learning software demands reliability and performance. However, many of the existing deep learning frameworks are software libraries that act as an unsafe DSL in Python and a computation graph interpreter. 
We present DLVM, a design and implementation of a compiler infrastructure with a linear algebra intermediate representation, algorithmic differentiation by adjoint code generation, domain-specific optimizations and a code generator targeting GPU via LLVM. Designed as a modern compiler infrastructure inspired by LLVM, DLVM is more modular and more generic than existing deep learning compiler frameworks, and supports tensor DSLs with high expressivity. With our prototypical staged DSL embedded in Swift, we argue that the DLVM system enables a form of modular, safe and performant frameworks for deep learning.",/pdf/508794338a98dc9d8b0d4487987f921642876361.pdf,ICLR,2018,We introduce a novel compiler infrastructure that addresses shortcomings of existing deep learning frameworks. +HJg_ECEKDr,HylrrSLuDH,1569440000000.0,1577170000000.0,1081,Generative Teaching Networks: Accelerating Neural Architecture Search by Learning to Generate Synthetic Training Data,"[""felipe.such@uber.com"", ""aditya.rawal@uber.com"", ""joel.lehman@uber.com"", ""kstanley@uber.com"", ""jeffclune@uber.com""]","[""Felipe Petroski Such"", ""Aditya Rawal"", ""Joel Lehman"", ""Kenneth Stanley"", ""Jeff Clune""]","[""Generative models"", ""generating synthetic data"", ""neural architecture search"", ""learning to teach"", ""meta-learning""]","This paper investigates the intriguing question of whether we can create learning algorithms that automatically generate training data, learning environments, and curricula in order to help AI agents rapidly learn. We show that such algorithms are possible via Generative Teaching Networks (GTNs), a general approach that is applicable to supervised, unsupervised, and reinforcement learning. GTNs are deep neural networks that generate data and/or training environments that a learner (e.g.\ a freshly initialized neural network) trains on before being tested on a target task. 
We then differentiate \emph{through the entire learning process} via meta-gradients to update the GTN parameters to improve performance on the target task. GTNs have the beneficial property that they can theoretically generate any type of data or training environment, making their potential impact large. This paper introduces GTNs, discusses their potential, and showcases that they can substantially accelerate learning. We also demonstrate a practical and exciting application of GTNs: accelerating the evaluation of candidate architectures for neural architecture search (NAS), which is rate-limited by such evaluations, enabling massive speed-ups in NAS. GTN-NAS improves the NAS state of the art, finding higher performing architectures when controlling for the search proposal mechanism. GTN-NAS also is competitive with the overall state of the art approaches, which achieve top performance while using orders of magnitude less computation than typical NAS methods. Overall, GTNs represent a first step toward the ambitious goal of algorithms that generate their own training data and, in doing so, open a variety of interesting new research questions and directions.",/pdf/6cec4c9577171ff5efb0a75f7cb1c507cbc310b8.pdf,ICLR,2020,"We meta-learn a DNN to generate synthetic training data that rapidly teaches a learning DNN a target task, speeding up neural architecture search nine-fold. 
" +r1eCukHYDH,HyeifQ0dPB,1569440000000.0,1577170000000.0,1823,Manifold Learning and Alignment with Generative Adversarial Networks,"[""jkim@bi.snu.ac.kr"", ""sjjung@bi.snu.ac.kr"", ""hdlee@bi.snu.ac.kr"", ""btzhang@bi.snu.ac.kr""]","[""Jiseob Kim"", ""Seungjae Jung"", ""Hyundo Lee"", ""Byoung-Tak Zhang""]","[""Generative Adversarial Networks"", ""Manifold Learning"", ""Manifold Alignment""]","We present a generative adversarial network (GAN) that conducts manifold learning and alignment (MLA): A task to learn the multi-manifold structure underlying data and to align those manifolds without any correspondence information. Our main idea is to exploit the powerful abstraction ability of encoder architecture. Specifically, we define multiple generators to model multiple manifolds, but in a particular way that their inverse maps can be commonly represented by a single smooth encoder. Then, the abstraction ability of the encoder enforces semantic similarities between the generators and gives a plausibly aligned embedding in the latent space. In experiments with MNIST, 3D-Chair, and UT-Zap50k datasets, we demonstrate the superiority of our model in learning the manifolds by FID scores and in aligning the manifolds by disentanglement scores. Furthermore, by virtue of the abstractive modeling, we show that our model can generate data from an untrained manifold, which is unique to our model.",/pdf/2926ddbaa2624db20a9a79f90978808c917630fb.pdf,ICLR,2020,"We present a generative adversarial network that performs both multi-manifold learning and manifold alignment, utilizing the abstraction ability of encoder architecture. 
" +rJxcBpNKPr,rygdzTIvDr,1569440000000.0,1577170000000.0,533,OvA-INN: Continual Learning with Invertible Neural Networks,"[""guillaume.hocquet@live.fr"", ""olivier.bichler@cea.fr"", ""damien.querlioz@c2n.upsaclay.fr""]","[""HOCQUET Guillaume"", ""BICHLER Olivier"", ""QUERLIOZ Damien""]","[""Deep Learning"", ""Continual Learning"", ""Invertible Neural Networks""]","In the field of Continual Learning, the objective is to learn several tasks one after the other without access to the data from previous tasks. Several solutions have been proposed to tackle this problem but they usually assume that the user knows which of the tasks to perform at test time on a particular sample, or rely on small samples from previous data and most of them suffer of a substantial drop in accuracy when updated with batches of only one class at a time. In this article, we propose a new method, OvA-INN, which is able to learn one class at a time and without storing any of the previous data. To achieve this, for each class, we train a specific Invertible Neural Network to output the zero vector for its class. At test time, we can predict the class of a sample by identifying which network outputs the vector with the smallest norm. With this method, we show that we can take advantage of pretrained models by stacking an invertible network on top of a features extractor. This way, we are able to outperform state-of-the-art approaches that rely on features learning for the Continual Learning of MNIST and CIFAR-100 datasets. In our experiments, we are reaching 72% accuracy on CIFAR-100 after training our model one class at a time.",/pdf/5d175381cdce4d1da57c01a7775e00df7b87d4f7.pdf,ICLR,2020,We propose to train an Invertible Neural Network for each class to perform class-by-class Continual Learning. 
+HyeJmlrFvH,rJgYtmgKwr,1569440000000.0,1577170000000.0,2194,Provably Communication-efficient Data-parallel SGD via Nonuniform Quantization,"[""alir@vectorinstitute.ai"", ""faghri@cs.toronto.edu"", ""droy@utstat.toronto.edu"", ""dan.alistarh@ist.ac.at"", ""markovilya197@gmail.com"", ""vitalii.aksenov@ist.ac.at""]","[""Ali Ramezani-Kebrya"", ""Fartash Faghri"", ""Ilya Markov"", ""Vitalii Aksenov"", ""Dan Alistarh"", ""Daniel M. Roy""]",[],"As the size and complexity of models and datasets grow, so does the need for communication-efficient variants of stochastic gradient descent that can be deployed on clusters to perform model fitting in parallel. Alistarh et al. (2017) describe two variants of data-parallel SGD that quantize and encode gradients to lessen communication costs. For the first variant, QSGD, they provide strong theoretical guarantees. For the second variant, which we call QSGDinf, they demonstrate impressive empirical gains for distributed training of large neural networks. Building on their work, we propose an alternative scheme for quantizing gradients and show that it yields stronger theoretical guarantees than exist for QSGD while matching the empirical performance of QSGDinf.",/pdf/e6fe72b209054a2cc13eba7a3649d629ccfca64b.pdf,ICLR,2020,NUQSGD closes the gap between the theoretical guarantees of QSGD and the empirical performance of QSGDinf. +Byldr3RqKX,HklGrCTqK7,1538090000000.0,1545360000000.0,1555,Tinkering with black boxes: counterfactuals uncover modularity in generative models,"[""michel.besserve@tuebingen.mpg.de"", ""remy.sun@ens-rennes.fr"", ""bs@tuebingen.mpg.de""]","[""Michel Besserve"", ""Remy Sun"", ""Bernhard Schoelkopf""]","[""generatice models"", ""causality"", ""disentangled representations""]","Deep generative models such as Generative Adversarial Networks (GANs) and +Variational Auto-Encoders (VAEs) are important tools to capture and investigate +the properties of complex empirical data. 
However, the complexity of their inner +elements makes their functioning challenging to assess and modify. In this +respect, these architectures behave as black box models. In order to better +understand the function of such networks, we analyze their modularity based on +the counterfactual manipulation of their internal variables. Our experiments on the +generation of human faces with VAEs and GANs support that modularity between +activation maps distributed over channels of generator architectures is achieved +to some degree, and can be used to better understand how these systems operate and to allow meaningful transformations of the generated images without further training.",/pdf/2dd7d96831ef070735f1b563c53ae14851c3c20e.pdf,ICLR,2019,We investigate the modularity of deep generative models. +SyVhg20cK7,S1e0Tnnctm,1538090000000.0,1545360000000.0,1116,Inducing Cooperation via Learning to reshape rewards in semi-cooperative multi-agent reinforcement learning,"[""ddhostallero@kaist.ac.kr"", ""kdw2139@gmail.com"", ""khson@lanada.kaist.ac.kr"", ""yiyung@kaist.edu""]","[""David Earl Hostallero"", ""Daewoo Kim"", ""Kyunghwan Son"", ""Yung Yi""]","[""multi-agent reinforcement learning"", ""deep reinforcement learning"", ""multi-agent systems""]","We propose a deep reinforcement learning algorithm for semi-cooperative multi-agent tasks, where agents are equipped with their separate reward functions, yet with willingness to cooperate. Under these semi-cooperative scenarios, popular methods of centralized training with decentralized execution for inducing cooperation and removing the non-stationarity problem do not work well due to lack of a common shared reward as well as poor scalability in centralized training. Our algorithm, called Peer-Evaluation based Dual DQN (PED-DQN), proposes to give peer evaluation signals to observed agents, which quantifies how they feel about a certain transition. 
This exchange of peer evaluation over time turns out to lead agents to gradually reshape their reward functions so that their action choices from the myopic best-response tend to result in the good joint action with high cooperation. This evaluation-based method also allows flexible and scalable training by not assuming knowledge of the number of other agents and their observation and action spaces. We provide the performance evaluation of PED-DQN for the scenarios ranging from a simple two-person prisoner's dilemma to more complex semi-cooperative multi-agent tasks. In special cases where agents share a common reward function as in the centralized training methods, we show that inter-agent +evaluation leads to better performance +",/pdf/c0d82610150172220b6d0f9634e5fc3585f07791.pdf,ICLR,2019,We use a peer evaluation mechanism to make semi-cooperative agents learn collaborative strategies in multiagent reinforcement learning settings +Hy6b4Pqee,,1478290000000.0,1488970000000.0,377,Deep Probabilistic Programming,"[""dustin@cs.columbia.edu"", ""mathoffm@adobe.com"", ""rif@google.com"", ""ebrevdo@google.com"", ""kpmurphy@google.com"", ""david.blei@columbia.edu""]","[""Dustin Tran"", ""Matthew D. Hoffman"", ""Rif A. Saurous"", ""Eugene Brevdo"", ""Kevin Murphy"", ""David M. Blei""]",[],"We propose Edward, a Turing-complete probabilistic programming language. Edward defines two compositional representations—random variables and inference. By treating inference as a first class citizen, on a par with modeling, we show that probabilistic programming can be as flexible and computationally efficient as traditional deep learning. For flexibility, Edward makes it easy to fit the same model using a variety of composable inference methods, ranging from point estimation to variational inference to MCMC. In addition, Edward can reuse the modeling representation as part of inference, facilitating the design of rich variational models and generative adversarial networks. 
For efficiency, Edward is integrated into TensorFlow, providing significant speedups over existing probabilistic systems. For example, we show on a benchmark logistic regression task that Edward is at least 35x faster than Stan and 6x faster than PyMC3. Further, Edward incurs no runtime overhead: it is as fast as handwritten TensorFlow.",/pdf/b51fb1ef5fa81889f226cfb62f5b6870b0ddb4fc.pdf,ICLR,2017, +9YlaeLfuhJF,PEs_BUes5i3,1601310000000.0,1616010000000.0,2132,Model Patching: Closing the Subgroup Performance Gap with Data Augmentation,"[""~Karan_Goel1"", ""~Albert_Gu1"", ""~Yixuan_Li1"", ""~Christopher_Re1""]","[""Karan Goel"", ""Albert Gu"", ""Yixuan Li"", ""Christopher Re""]","[""Robust Machine Learning"", ""Data Augmentation"", ""Consistency Training"", ""Invariant Representations""]","Classifiers in machine learning are often brittle when deployed. Particularly concerning are models with inconsistent performance on specific subgroups of a class, e.g., exhibiting disparities in skin cancer classification in the presence or absence of a spurious bandage. To mitigate these performance differences, we introduce model patching, a two-stage framework for improving robustness that encourages the model to be invariant to subgroup differences, and focus on class information shared by subgroups. Model patching first models subgroup features within a class and learns semantic transformations between them, and then trains a classifier with data augmentations that deliberately manipulate subgroup features. We instantiate model patching with CAMEL, which (1) uses a CycleGAN to learn the intra-class, inter-subgroup augmentations, and (2) balances subgroup performance using a theoretically-motivated subgroup consistency regularizer, accompanied by a new robust objective. We demonstrate CAMEL’s effectiveness on 3 benchmark datasets, with reductions in robust error of up to 33% relative to the best baseline. 
Lastly, CAMEL successfully patches a model that fails due to spurious features on a real-world skin cancer dataset.",/pdf/a78017923e9af1b5c37e2ec56d3564794b91dd79.pdf,ICLR,2021,We describe how to fix classifiers that fail on subgroups of a class using a combination of learned data augmentation & consistency training to achieve subgroup invariance. +SyxXWC4KPB,BJgAAZNuwH,1569440000000.0,1577170000000.0,960,Structured consistency loss for semi-supervised semantic segmentation,"[""win98man1@gmail.com"", ""jyjang1090@gmail.com"", ""phw08132@gmail.com""]","[""JongMok Kim"", ""Joo Young Jang"", ""Hyunwoo Park""]","[""semi-supervised learning"", ""semantic segmentation"", ""structured prediction"", ""structured consistency loss""]","The consistency loss has played a key role in solving problems in recent studies on semi-supervised learning. Yet extant studies with the consistency loss are limited to its application to classification tasks; extant studies on semi-supervised semantic segmentation rely on pixel-wise classification, which does not reflect the structured nature of characteristics in prediction. We propose a structured consistency loss to address this limitation of extant studies. Structured consistency loss promotes consistency in inter-pixel similarity between teacher and student networks. Specifically, collaboration with CutMix optimizes the efficient performance of semi-supervised semantic segmentation with structured consistency loss by reducing computational burden dramatically. The superiority of proposed method is verified with the Cityscapes; The Cityscapes benchmark results with validation and with test data are 81.9 mIoU and 83.84 mIoU respectively. This ranks the first place on the pixel-level semantic labeling task of Cityscapes benchmark suite. 
To the best of our knowledge, we are the first to present the superiority of state-of-the-art semi-supervised learning in semantic segmentation.",/pdf/a5dccfe309f72e65e7312203ae56cb012d2171d2.pdf,ICLR,2020,We propose a novel structured consistency loss for semi-supervised semantic segmentation +BygZK2VYvB,BJx_PGuqIr,1569440000000.0,1577170000000.0,67,Utilizing Edge Features in Graph Neural Networks via Variational Information Maximization,"[""chenpf.cuhk@gmail.com"", ""wwliu@cse.cuhk.edu.hk"", ""kimhsieh@tencent.com"", ""gycchen@tencent.com"", ""pheng@cse.cuhk.edu.hk""]","[""Pengfei Chen"", ""Weiwen Liu"", ""Chang-Yu Hsieh"", ""Guangyong Chen"", ""Pheng Ann Heng""]","[""Graph Neural Network"", ""Edge Feature"", ""Mutual Information""]","Graph Neural Networks (GNNs) broadly follow the scheme that the representation vector of each node is updated recursively using the message from neighbor nodes, where the message of a neighbor is usually pre-processed with a parameterized transform matrix. To make better use of edge features, we propose the Edge Information maximized Graph Neural Network (EIGNN) that maximizes the Mutual Information (MI) between edge features and message passing channels. The MI is reformulated as a differentiable objective via a variational approach. We theoretically show that the newly introduced objective enables the model to preserve edge information, and empirically corroborate the enhanced performance of MI-maximized models across a broad range of learning tasks including regression on molecular graphs and relation prediction in knowledge graphs.",/pdf/a08a34e34420bd1b3daf369c3b4bebcee807070b.pdf,ICLR,2020,We use a principled variational approach to preserve edge information in graph neural networks and show the importance of edge features and the superiority of our method in extensive benchmarks. 
+H1livgrFvr,B1gd12etwH,1569440000000.0,1577170000000.0,2374,Out-of-Distribution Image Detection Using the Normalized Compression Distance,"[""hunu12@postech.ac.kr"", ""dongha0914@postech.ac.kr"", ""hwanjoyu@postech.ac.kr""]","[""Sehun Yu"", ""Donga Lee"", ""Hwanjo Yu""]","[""Out-of-Distribution Detection"", ""Normalized Compression Distance"", ""Convolutional Neural Networks""]","For detecting out-of-distribution images, whose underlying distribution differs from that of the training dataset, we tackle the problem of applying out-of-distribution detection methods to already deployed convolutional neural networks. Most recent approaches have to utilize out-of-distribution samples for validation or retrain the model, which makes them less practical for real-world applications. We propose a novel out-of-distribution detection method MALCOM, which neither uses any out-of-distribution samples nor retrains the model. Inspired by the method using the global average pooling on the feature maps of the convolutional neural networks, the goal of our method is to extract informative sequential patterns from the feature maps. To this end, we introduce a similarity metric which focuses on the shared patterns between two sequences. In short, MALCOM uses both the global average and spatial pattern of the feature maps to accurately identify out-of-distribution samples. ",/pdf/1bf680519866ba4e59ce494f1fe30917c9a3f7c8.pdf,ICLR,2020,We propose MALCOM which utilizes both the global average and spatial pattern of the feature maps to accurately identify out-of-distribution samples. 
+j9Rv7qdXjd,ygStmSYspEB,1601310000000.0,1613710000000.0,1775,Interpretable Neural Architecture Search via Bayesian Optimisation with Weisfeiler-Lehman Kernels,"[""~Binxin_Ru1"", ""~Xingchen_Wan1"", ""~Xiaowen_Dong1"", ""~Michael_Osborne1""]","[""Binxin Ru"", ""Xingchen Wan"", ""Xiaowen Dong"", ""Michael Osborne""]",[],"Current neural architecture search (NAS) strategies focus only on finding a single, good, architecture. They offer little insight into why a specific network is performing well, or how we should modify the architecture if we want further improvements. We propose a Bayesian optimisation (BO) approach for NAS that combines the Weisfeiler-Lehman graph kernel with a Gaussian process surrogate. Our method not only optimises the architecture in a highly data-efficient manner, but also affords interpretability by discovering useful network features and their corresponding impact on the network performance. Moreover, our method is capable of capturing the topological structures of the architectures and is scalable to large graphs, thus making the high-dimensional and graph-like search spaces amenable to BO. We demonstrate empirically that our surrogate model is capable of identifying useful motifs which can guide the generation of new architectures. We finally show that our method outperforms existing NAS approaches to achieve the state of the art on both closed- and open-domain search spaces.",/pdf/7a44783a6b1ada0693c095a8030ac42d0cdcf005.pdf,ICLR,2021,"We propose a NAS method that is sample-efficient, highly performant and interpretable." 
+B1x1ma4tDr,B1x_Xca8Pr,1569440000000.0,1598470000000.0,435,DDSP: Differentiable Digital Signal Processing,"[""jesseengel@google.com"", ""hanoih@google.com"", ""gcj@google.com"", ""adarob@google.com""]","[""Jesse Engel"", ""Lamtharn (Hanoi) Hantrakul"", ""Chenjie Gu"", ""Adam Roberts""]","[""dsp"", ""audio"", ""music"", ""nsynth"", ""wavenet"", ""wavernn"", ""vocoder"", ""synthesizer"", ""sound"", ""signal"", ""processing"", ""tensorflow"", ""autoencoder"", ""disentanglement""]","Most generative models of audio directly generate samples in one of two domains: time or frequency. While sufficient to express any signal, these representations are inefficient, as they do not utilize existing knowledge of how sound is generated and perceived. A third approach (vocoders/synthesizers) successfully incorporates strong domain knowledge of signal processing and perception, but has been less actively researched due to limited expressivity and difficulty integrating with modern auto-differentiation-based machine learning methods. In this paper, we introduce the Differentiable Digital Signal Processing (DDSP) library, which enables direct integration of classic signal processing elements with deep learning methods. Focusing on audio synthesis, we achieve high-fidelity generation without the need for large autoregressive models or adversarial losses, demonstrating that DDSP enables utilizing strong inductive biases without losing the expressive power of neural networks. Further, we show that combining interpretable modules permits manipulation of each separate model component, with applications such as independent control of pitch and loudness, realistic extrapolation to pitches not seen during training, blind dereverberation of room acoustics, transfer of extracted room acoustics to new environments, and transformation of timbre between disparate sources. 
In short, DDSP enables an interpretable and modular approach to generative modeling, without sacrificing the benefits of deep learning. The library is available at https://github.com/magenta/ddsp and we encourage further contributions from the community and domain experts. +",/pdf/bd8d353bca498f66f2bf5db02c5fda8135120349.pdf,ICLR,2020,Better audio synthesis by combining interpretable DSP with end-to-end learning. +H1lMogrKDH,HJxydk-KDH,1569440000000.0,1577170000000.0,2501,LEARNING DIFFICULT PERCEPTUAL TASKS WITH HODGKIN-HUXLEY NETWORKS,"[""alan.lockett@gmail.com"", ""ankitp@bcm.edu"", ""paulp@bcm.edu""]","[""Alan Lockett"", ""Ankit Patel"", ""Paul Pfaffinger""]","[""conductance-weighted averaging"", ""neural modeling"", ""normalization methods""]","This paper demonstrates that a computational neural network model using ion channel-based conductances to transmit information can solve standard computer vision datasets at near state-of-the-art performance. Although not fully biologically accurate, this model incorporates fundamental biophysical principles underlying the control of membrane potential and the processing of information by Ohmic ion channels. The key computational step employs Conductance-Weighted Averaging (CWA) in place of the traditional affine transformation, representing a fundamentally different computational principle. +Importantly, CWA based networks are self-normalizing and range-limited. We also demonstrate for the first time that a network with excitatory and inhibitory neurons and nonnegative synapse strengths can successfully solve computer vision problems. Although CWA models do not yet surpass the current state-of-the-art in deep learning, the results are competitive on CIFAR-10. There remain avenues for improving these networks, e.g. by more closely modeling ion channel function and connectivity patterns of excitatory and inhibitory neurons found in the brain. 
",/pdf/0ce4b025a82c256268796897dbb249314f032420.pdf,ICLR,2020,A network of static time Hodgkin-Huxley neurons can perform well on computer vision datasets. +SRzz6RtOdKR,vc3kbOLXHf1,1601310000000.0,1614990000000.0,2598,Batch Inverse-Variance Weighting: Deep Heteroscedastic Regression,"[""~Vincent_Mai1"", ""~Waleed_Khamies1"", ""~Liam_Paull1""]","[""Vincent Mai"", ""Waleed Khamies"", ""Liam Paull""]","[""Regression"", ""Noisy labels"", ""Supervised Learning"", ""Uncertainty"", ""Variance"", ""Heteroscedastic"", ""Privileged Information""]","In model learning, when the training dataset on which the parameters are optimized and the testing dataset on which the model is evaluated are not sampled from identical distributions, we say that the datasets are misaligned. It is well-known that this misalignment can negatively impact model performance. A common source of misalignment is that the inputs are sampled from different distributions. Another source for this misalignment is that the label generating process used to create the training dataset is imperfect. In this work, we consider this setting and additionally assume that the label generating process is able to provide us with a quantity for the role of each label in the misalignment between the datasets, which we consider to be privileged information. Specifically, we consider the task of regression with labels corrupted by heteroscedastic noise and we assume that we have access to an estimate of the variance over each sample. We propose a general approach to include this privileged information in the loss function together with dataset statistics inferred from the mini-batch to mitigate the impact of the dataset misalignment. Subsequently, we propose a specific algorithm for the heteroscedastic regression case, called Batch Inverse-Variance weighting, which adapts inverse-variance weighting for linear regression to the case of neural network function approximation. 
We demonstrate that this approach achieves a significant improvement in network training performances compared to baselines when confronted with high, input-independent noise.",/pdf/3b8feb4acb73a876bf25c4c28aea3d5d339d376d.pdf,ICLR,2021,A method to reduce the effect of heteroscedastic noisy labels in regression by weighting them based on their variance and the variance of the other samples in the minibatch. +SkxxIs0qY7,rJeAFRFbFX,1538090000000.0,1545360000000.0,145,CoT: Cooperative Training for Generative Modeling of Discrete Data,"[""steve_lu@apex.sjtu.edu.cn"", ""yulantao@apex.sjtu.edu.cn"", ""siyuanfeng@apex.sjtu.edu"", ""ymzhu@apex.sjtu.edu.cn"", ""wnzhang@apex.sjtu.edu.cn"", ""yyu@apex.sjtu.edu.cn""]","[""Sidi Lu"", ""Lantao Yu"", ""Siyuan Feng"", ""Yaoming Zhu"", ""Weinan Zhang"", ""Yong Yu""]","[""Generative Models"", ""Sequence Modeling"", ""Text Generation""]","We propose Cooperative Training (CoT) for training generative models that measure a tractable density for discrete data. CoT coordinately trains a generator G and an auxiliary predictive mediator M. The training target of M is to estimate a mixture density of the learned distribution G and the target distribution P, and that of G is to minimize the Jensen-Shannon divergence estimated through M. CoT achieves independent success without the necessity of pre-training via Maximum Likelihood Estimation or involving high-variance algorithms like REINFORCE. This low-variance algorithm is theoretically proved to be superior for both sample generation and likelihood prediction. We also theoretically and empirically show the superiority of CoT over most previous algorithms in terms of generative quality and diversity, predictive generalization ability and computational cost.",/pdf/985b6264a97066a72904cbbfdcb95cb1876c4425.pdf,ICLR,2019,"We proposed Cooperative Training, a novel training algorithm for generative modeling of discrete data." 
+mNtmhaDkAr,wYXMnCLMEXz,1601310000000.0,1616030000000.0,3826,Predicting Inductive Biases of Pre-Trained Models,"[""~Charles_Lovering1"", ""rohan_jha@brown.edu"", ""~Tal_Linzen1"", ""~Ellie_Pavlick1""]","[""Charles Lovering"", ""Rohan Jha"", ""Tal Linzen"", ""Ellie Pavlick""]","[""information-theoretical probing"", ""probing"", ""challenge sets"", ""natural language processing""]","Most current NLP systems are based on a pre-train-then-fine-tune paradigm, in which a large neural network is first trained in a self-supervised way designed to encourage the network to extract broadly-useful linguistic features, and then fine-tuned for a specific task of interest. Recent work attempts to understand why this recipe works and explain when it fails. Currently, such analyses have produced two sets of apparently-contradictory results. Work that analyzes the representations that result from pre-training (via ""probing classifiers"") finds evidence that rich features of linguistic structure can be decoded with high accuracy, but work that analyzes model behavior after fine-tuning (via ""challenge sets"") indicates that decisions are often not based on such structure but rather on spurious heuristics specific to the training set. In this work, we test the hypothesis that the extent to which a feature influences a model's decisions can be predicted using a combination of two factors: The feature's ""extractability"" after pre-training (measured using information-theoretic probing techniques), and the ""evidence"" available during fine-tuning (defined as the feature's co-occurrence rate with the label). 
In experiments with both synthetic and natural language data, we find strong evidence (statistically significant correlations) supporting this hypothesis.",/pdf/5f8e7508b216ea50a36e7f4584e4e6d8953917be.pdf,ICLR,2021,"We find that feature extractability, measured by probing classifiers, can be viewed as an inductive bias: the more extractable a feature is after pre-training, the less statistical evidence needed during fine-tuning for the model to use the feature." +ryeYHi0ctQ,BylpuqDvKX,1538090000000.0,1551960000000.0,108,DPSNet: End-to-end Deep Plane Sweep Stereo,"[""dlarl8927@kaist.ac.kr"", ""haegonj@andrew.cmu.edu"", ""stevelin@microsoft.com"", ""iskweon77@kaist.ac.kr""]","[""Sunghoon Im"", ""Hae-Gon Jeon"", ""Stephen Lin"", ""In So Kweon""]","[""Deep Learning"", ""Stereo"", ""Depth"", ""Geometry""]","Multiview stereo aims to reconstruct scene depth from images acquired by a camera under arbitrary motion. Recent methods address this problem through deep learning, which can utilize semantic cues to deal with challenges such as textureless and reflective regions. In this paper, we present a convolutional neural network called DPSNet (Deep Plane Sweep Network) whose design is inspired by best practices of traditional geometry-based approaches. Rather than directly estimating depth and/or optical flow correspondence from image pairs as done in many previous deep learning methods, DPSNet takes a plane sweep approach that involves building a cost volume from deep features using the plane sweep algorithm, regularizing the cost volume via a context-aware cost aggregation, and regressing the depth map from the cost volume. The cost volume is constructed using a differentiable warping process that allows for end-to-end training of the network. 
Through the effective incorporation of conventional multiview stereo concepts within a deep learning framework, DPSNet achieves state-of-the-art reconstruction results on a variety of challenging datasets.",/pdf/6c524664342b2dad1ed394e5fbedd840485332f7.pdf,ICLR,2019,A convolution neural network for multi-view stereo matching whose design is inspired by best practices of traditional geometry-based approaches +rkxhX209FX,S1xpkDi5Fm,1538090000000.0,1545360000000.0,1398,An Active Learning Framework for Efficient Robust Policy Search,"[""saikirann94@gmail.com"", ""nandan@iitm.ac.in"", ""ravi@cse.iitm.ac.in""]","[""Sai Kiran Narayanaswami"", ""Nandan Sudarsanam"", ""Balaraman Ravindran""]","[""Deep Reinforcement Learning""]","Robust Policy Search is the problem of learning policies that do not degrade in performance when subject to unseen environment model parameters. It is particularly relevant for transferring policies learned in a simulation environment to the real world. Several existing approaches involve sampling large batches of trajectories which reflect the differences in various possible environments, and then selecting some subset of these to learn robust policies, such as the ones that result in the worst performance. We propose an active learning based framework, EffAcTS, to selectively choose model parameters for this purpose so as to collect only as much data as necessary to select such a subset. We apply this framework to an existing method, namely EPOpt, and experimentally validate the gains in sample efficiency and the performance of our approach on standard continuous control tasks. 
We also present a Multi-Task Learning perspective to the problem of Robust Policy Search, and draw connections from our proposed framework to existing work on Multi-Task Learning.",/pdf/cb006bc4b32d26f907b94d3925ebf95e3d795dd4.pdf,ICLR,2019,An Active Learning framework that leads to efficient robust RL and opens up possibilities in Multi-Task RL +BJ8vJebC-,BkHwygWRW,1509130000000.0,1519450000000.0,543,Synthetic and Natural Noise Both Break Neural Machine Translation,"[""belinkov@mit.edu"", ""ybisk@yonatanbisk.com""]","[""Yonatan Belinkov"", ""Yonatan Bisk""]","[""neural machine translation"", ""characters"", ""noise"", ""adversarial examples"", ""robust training""]","Character-based neural machine translation (NMT) models alleviate out-of-vocabulary issues, learn morphology, and move us closer to completely end-to-end translation systems. Unfortunately, they are also very brittle and easily falter when presented with noisy data. In this paper, we confront NMT models with synthetic and natural sources of noise. We find that state-of-the-art models fail to translate even moderately noisy texts that humans have no trouble comprehending. We explore two approaches to increase model robustness: structure-invariant word representations and robust training on noisy texts. We find that a model based on a character convolutional neural network is able to simultaneously learn representations robust to multiple kinds of noise. 
",/pdf/76dfbeba1cf42abb13f7dc148795613794e38195.pdf,ICLR,2018,CharNMT is brittle +ryzHXnR5Y7,BkxAFx05Y7,1538090000000.0,1545360000000.0,1355,Select Via Proxy: Efficient Data Selection For Training Deep Networks,"[""cody@cs.stanford.edu"", ""mussmann@stanford.edu"", ""baharanm@stanford.edu"", ""pbailis@stanford.edu"", ""pliang@cs.stanford.edu"", ""jure@cs.stanford.edu"", ""mzaharia@stanford.edu""]","[""Cody Coleman"", ""Stephen Mussmann"", ""Baharan Mirzasoleiman"", ""Peter Bailis"", ""Percy Liang"", ""Jure Leskovec"", ""Matei Zaharia""]","[""data selection"", ""deep learning"", ""uncertainty sampling""]","At internet scale, applications collect a tremendous amount of data by logging user events, analyzing text, and collecting images. This data powers a variety of machine learning models for tasks such as image classification, language modeling, content recommendation, and advertising. However, training large models over all available data can be computationally expensive, creating a bottleneck in the development of new machine learning models. In this work, we develop a novel approach to efficiently select a subset of training data to achieve faster training with no loss in model predictive performance. In our approach, we first train a small proxy model quickly, which we then use to estimate the utility of individual training data points, and then select the most informative ones for training the large target model. Extensive experiments show that our approach leads to a 1.6x and 1.8x speed-up on CIFAR10 and SVHN by selecting 60% and 50% subsets of the data, while maintaining the predictive performance of the model trained on the entire dataset.",/pdf/4df17e198dea337baaf404031db38b500cf9345c.pdf,ICLR,2019,we develop an efficient method for selecting training data to quickly and efficiently learn large machine learning models. 
+BkpXqwUTZ,HJh79DLab,1508440000000.0,1518730000000.0,18,Iterative temporal differencing with fixed random feedback alignment support spike-time dependent plasticity in vanilla backpropagation for deep learning,"[""arasdar@uri.edu""]","[""Aras Dargazany"", ""Kunal Mankodiya""]","[""Iterative temporal differencing"", ""feedback alignment"", ""spike-time dependent plasticity"", ""vanilla backpropagation"", ""deep learning""]","In vanilla backpropagation (VBP), activation function matters considerably in terms of non-linearity and differentiability. +Vanishing gradient has been an important problem related to the bad choice of activation function in deep learning (DL). +This work shows that a differentiable activation function is not necessary any more for error backpropagation. +The derivative of the activation function can be replaced by an iterative temporal differencing (ITD) using fixed random feedback weight alignment (FBA). +Using FBA with ITD, we can transform the VBP into a more biologically plausible approach for learning deep neural network architectures. +We don't claim that ITD works completely the same as the spike-time dependent plasticity (STDP) in our brain but this work can be a step toward the integration of STDP-based error backpropagation in deep learning.",/pdf/775ad0b287f599d4744d54c1a7e2189ee66e40be.pdf,ICLR,2018,Iterative temporal differencing with fixed random feedback alignment support spike-time dependent plasticity in vanilla backpropagation for deep learning. 
+SyeyF0VtDr,BJlBpb_ODS,1569440000000.0,1577170000000.0,1232,Recurrent Event Network : Global Structure Inference Over Temporal Knowledge Graph,"[""woojeong.jin@usc.edu"", ""jian567@usc.edu"", ""meng.qu@umontreal.ca"", ""tongc2@andrew.cmu.edu"", ""changlin.zhang@usc.edu"", ""pszekely@isi.edu"", ""xiangren@usc.edu""]","[""Woojeong Jin"", ""He Jiang"", ""Meng Qu"", ""Tong Chen"", ""Changlin Zhang"", ""Pedro\u00a0Szekely"", ""Xiang Ren""]","[""Temporal Knowledge Graphs"", ""Representation Learning"", ""Graph Sequence Inference"", ""Knowledge Graph Completion""]","Modeling dynamically-evolving, multi-relational graph data has received a surge of interests with the rapid growth of heterogeneous event data. However, predicting future events on such data requires global structure inference over time and the ability to integrate temporal and structural information, which are not yet well understood. We present Recurrent Event Network (RE-Net), a novel autoregressive architecture for modeling temporal sequences of multi-relational graphs (e.g., temporal knowledge graph), which can perform sequential, global structure inference over future time stamps to predict new events. RE-Net employs a recurrent event encoder to model the temporally conditioned joint probability distribution for the event sequences, and equips the event encoder with a neighborhood aggregator for modeling the concurrent events within a time window associated with each entity. We apply teacher forcing for model training over historical data, and infer graph sequences over future time stamps by sampling from the learned joint distribution in a sequential manner. We evaluate the proposed method via temporal link prediction on five public datasets. 
Extensive experiments demonstrate the strength of RE-Net, especially on multi-step inference over future time stamps.",/pdf/076ffbc995d7dcde4d6b4289ece94ac94fdf0e53.pdf,ICLR,2020,We propose an autoregressive model to infer graph structures on temporal knowledge graphs. +rk3b2qxCW,rknZh9eA-,1509100000000.0,1518730000000.0,357,Policy Gradient For Multidimensional Action Spaces: Action Sampling and Entropy Bonus,"[""quan.hovuong@gmail.com"", ""yiming.zhang@nyu.edu"", ""kenny.song@nyu.edu"", ""xygong@mit.edu"", ""keithwross@nyu.edu""]","[""Vuong Ho Quan"", ""Yiming Zhang"", ""Kenny Song"", ""Xiao-Yue Gong"", ""Keith W. Ross""]","[""deep reinforcement learning"", ""policy gradient"", ""multidimensional action space"", ""entropy bonus"", ""entropy regularization"", ""discrete action space""]","In recent years deep reinforcement learning has been shown to be adept at solving sequential decision processes with high-dimensional state spaces such as in the Atari games. Many reinforcement learning problems, however, involve high-dimensional discrete action spaces as well as high-dimensional state spaces. In this paper, we develop a novel policy gradient methodology for the case of large multidimensional discrete action spaces. We propose two approaches for creating parameterized policies: LSTM parameterization and a Modified MDP (MMDP) giving rise to Feed-Forward Network (FFN) parameterization. Both of these approaches provide expressive models to which backpropagation can be applied for training. We then consider entropy bonus, which is typically added to the reward function to enhance exploration. In the case of high-dimensional action spaces, calculating the entropy and the gradient of the entropy requires enumerating all the actions in the action space and running forward and backpropagation for each action, which may be computationally infeasible. We develop several novel unbiased estimators for the entropy bonus and its gradient. 
Finally, we test our algorithms on two environments: a multi-hunter multi-rabbit grid game and a multi-agent multi-arm bandit problem.",/pdf/3cf3964a291a7dc317a4ed334d65ebefa741fa20.pdf,ICLR,2018,policy parameterizations and unbiased policy entropy estimators for MDP with large multidimensional discrete action space +TTUVg6vkNjK,eSo3LNud2U6,1601310000000.0,1616060000000.0,850,RODE: Learning Roles to Decompose Multi-Agent Tasks,"[""~Tonghan_Wang1"", ""~Tarun_Gupta3"", ""~Anuj_Mahajan1"", ""~Bei_Peng2"", ""~Shimon_Whiteson1"", ""~Chongjie_Zhang1""]","[""Tonghan Wang"", ""Tarun Gupta"", ""Anuj Mahajan"", ""Bei Peng"", ""Shimon Whiteson"", ""Chongjie Zhang""]","[""Multi-Agent Reinforcement Learning"", ""Role-Based Learning"", ""Hierarchical Multi-Agent Learning"", ""Multi-Agent Transfer Learning""]","Role-based learning holds the promise of achieving scalable multi-agent learning by decomposing complex tasks using roles. However, it is largely unclear how to efficiently discover such a set of roles. To solve this problem, we propose to first decompose joint action spaces into restricted role action spaces by clustering actions according to their effects on the environment and other agents. Learning a role selector based on action effects makes role discovery much easier because it forms a bi-level learning hierarchy: the role selector searches in a smaller role space and at a lower temporal resolution, while role policies learn in significantly reduced primitive action-observation spaces. We further integrate information about action effects into the role policies to boost learning efficiency and policy generalization. By virtue of these advances, our method (1) outperforms the current state-of-the-art MARL algorithms on 9 of the 14 scenarios that comprise the challenging StarCraft II micromanagement benchmark and (2) achieves rapid transfer to new environments with three times the number of agents. 
Demonstrative videos can be viewed at https://sites.google.com/view/rode-marl.",/pdf/b3099a13254cda0cd68fedf29a0227d9684bc973.pdf,ICLR,2021,"We propose a scalable role-based multi-agent learning method which effectively discovers roles based on joint action space decomposition according to action effects, establishing a new state of the art on the StarCraft multi-agent benchmark." +30EvkP2aQLD,4JswdvRJVIn,1601310000000.0,1616970000000.0,2298,What are the Statistical Limits of Offline RL with Linear Function Approximation?,"[""~Ruosong_Wang1"", ""~Dean_Foster1"", ""~Sham_M._Kakade1""]","[""Ruosong Wang"", ""Dean Foster"", ""Sham M. Kakade""]","[""batch reinforcement learning"", ""function approximation"", ""lower bound"", ""representation""]","Offline reinforcement learning seeks to utilize offline (observational) data to guide the learning of (causal) sequential decision making strategies. The hope is that offline reinforcement learning coupled with function approximation methods (to deal with the curse of dimensionality) can provide a means to help alleviate the excessive sample complexity burden in modern sequential decision making problems. However, the extent to which this broader approach can be effective is not well understood, where the literature largely consists of sufficient conditions. + +This work focuses on the basic question of what are necessary representational and distributional conditions that permit provable sample-efficient offline reinforcement learning. Perhaps surprisingly, our main result shows that even if: i) we have realizability in that the true value function of \emph{every} policy is linear in a given set of features and 2) our off-policy data has good coverage over all features (under a strong spectral condition), any algorithm still (information-theoretically) requires a number of offline samples that is exponential in the problem horizon to non-trivially estimate the value of \emph{any} given policy. 
Our results highlight that sample-efficient offline policy evaluation is not possible unless significantly stronger conditions hold; such conditions include either having low distribution shift (where the offline data distribution is close to the distribution of the policy to be evaluated) or significantly stronger representational conditions (beyond realizability).",/pdf/abc8c212aff033c6042f681a3e27a3c9af250066.pdf,ICLR,2021,Exponential lower bounds for batch RL with linear function approximation. +H1gy1erYDH,B1ltiOyYDr,1569440000000.0,1577170000000.0,2047,CaptainGAN: Navigate Through Embedding Space For Better Text Generation,"[""jsaon92@gmail.com"", ""alvin.chiang.180@gmail.com"", ""liangtaiwan1230@gmail.com"", ""gblin75468@gmail.com"", ""cph@yoctol.com"", ""raywu0@gmail.com"", ""ypiheyn.imm02g@g2.nctu.edu.tw"", ""cyhuang@ntu.edu.tw""]","[""Chun-Hsing Lin"", ""Alvin Chiang"", ""Chi-Liang Liu"", ""Chien-Fu Lin"", ""Po-Hsien Chu"", ""Siang-Ruei Wu"", ""Yi-En Tsai"", ""Chung-Yang (Ric) Huang""]","[""Generative Adversarial Network"", ""Text Generation"", ""Straight-Through Estimator""]","Score-function-based text generation approaches such as REINFORCE, in general, suffer from high computational complexity and training instability problems. This is mainly due to the non-differentiable nature of the discrete space sampling and thus these methods have to treat the discriminator as a reward function and ignore the gradient information. In this paper, we propose a novel approach, CaptainGAN, which adopts the straight-through gradient estimator and introduces a ”re-centered” gradient estimation technique to steer the generator toward better text tokens through the embedding space. Our method is stable to train and converges quickly without maximum likelihood pre-training. 
On multiple metrics of text quality and diversity, our method outperforms existing GAN-based methods on natural language generation.",/pdf/1e28e6406283f37f7caec31ef50dfe43f9a3d97f.pdf,ICLR,2020,An effective gradient-based method for training a text generating GAN +BJxgz2R9t7,SJe0W6c5Ym,1538090000000.0,1546110000000.0,1232,Learning To Solve Circuit-SAT: An Unsupervised Differentiable Approach,"[""saeed.amizadeh@gmail.com"", ""sergiym@microsoft.com"", ""markus.weimer@microsoft.com""]","[""Saeed Amizadeh"", ""Sergiy Matusevych"", ""Markus Weimer""]","[""Neuro-Symbolic Methods"", ""Circuit Satisfiability"", ""Neural SAT Solver"", ""Graph Neural Networks""]","Recent efforts to combine Representation Learning with Formal Methods, commonly known as the Neuro-Symbolic Methods, have given rise to a new trend of applying rich neural architectures to solve classical combinatorial optimization problems. In this paper, we propose a neural framework that can learn to solve the Circuit Satisfiability problem. Our framework is built upon two fundamental contributions: a rich embedding architecture that encodes the problem structure and an end-to-end differentiable training procedure that mimics Reinforcement Learning and trains the model directly toward solving the SAT problem. The experimental results show the superior out-of-sample generalization performance of our framework compared to the recently developed NeuroSAT method.",/pdf/ba13bce9a0eb129d8a00c2300bde416990ab5ae4.pdf,ICLR,2019,We propose a neural framework that can learn to solve the Circuit Satisfiability problem from (unlabeled) circuit instances. 
+Byg9AR4YDB,SyedaFcuwH,1569440000000.0,1577170000000.0,1442,Exploring Cellular Protein Localization Through Semantic Image Synthesis,"[""daniel.li@columbia.edu"", ""ma.qiang@columbia.edu"", ""andrew@ml.berkeley.edu"", ""justin.cheung@stonybrookmedicine.edu"", ""peerster@gmail.com"", ""itsik@cs.columbia.edu""]","[""Daniel Li"", ""Qiang Ma"", ""Andrew Liu"", ""Justin Cheung"", ""Dana Pe\u2019er"", ""Itsik Pe\u2019er""]","[""Computational biology"", ""image synthesis"", ""GANs"", ""exploring multiplex images"", ""attention"", ""interpretability""]","Cell-cell interactions have an integral role in tumorigenesis as they are critical in governing immune responses. As such, investigating specific cell-cell interactions has the potential to not only expand upon the understanding of tumorigenesis, but also guide clinical management of patient responses to cancer immunotherapies. A recent imaging technique for exploring cell-cell interactions, multiplexed ion beam imaging by time-of-flight (MIBI-TOF), allows for cells to be quantified in 36 different protein markers at sub-cellular resolutions in situ as high resolution multiplexed images. To explore the MIBI images, we propose a GAN for multiplexed data with protein specific attention. By conditioning image generation on cell types, sizes, and neighborhoods through semantic segmentation maps, we are able to observe how these factors affect cell-cell interactions simultaneously in different protein channels. Furthermore, we design a set of metrics and offer the first insights towards cell spatial orientations, cell protein expressions, and cell neighborhoods. Our model, cell-cell interaction GAN (CCIGAN), outperforms or matches existing image synthesis methods on all conventional measures and significantly outperforms on biologically motivated metrics. 
To our knowledge, we are the first to systematically model multiple cellular protein behaviors and interactions under simulated conditions through image synthesis.",/pdf/1d2c1c92c8b26fec3c1d485395664632bbfa2f8e.pdf,ICLR,2020,"We explore cell-cell interactions across tumor environment contexts observed in highly multiplexed images, by image synthesis using a novel attention GAN architecture." +r1eEG20qKQ,B1gYhkRcYX,1538090000000.0,1552020000000.0,1254,Self-Tuning Networks: Bilevel Optimization of Hyperparameters using Structured Best-Response Functions,"[""mmackay@cs.toronto.edu"", ""pvicol@cs.toronto.edu"", ""lorraine@cs.toronto.edu"", ""duvenaud@cs.toronto.edu"", ""rgrosse@cs.toronto.edu""]","[""Matthew Mackay"", ""Paul Vicol"", ""Jonathan Lorraine"", ""David Duvenaud"", ""Roger Grosse""]","[""hyperparameter optimization"", ""game theory"", ""optimization""]","Hyperparameter optimization can be formulated as a bilevel optimization problem, where the optimal parameters on the training set depend on the hyperparameters. We aim to adapt regularization hyperparameters for neural networks by fitting compact approximations to the best-response function, which maps hyperparameters to optimal weights and biases. We show how to construct scalable best-response approximations for neural networks by modeling the best-response as a single network whose hidden units are gated conditionally on the regularizer. We justify this approximation by showing the exact best-response for a shallow linear network with L2-regularized Jacobian can be represented by a similar gating mechanism. We fit this model using a gradient-based hyperparameter optimization algorithm which alternates between approximating the best-response around the current hyperparameters and optimizing the hyperparameters using the approximate best-response function. 
Unlike other gradient-based approaches, we do not require differentiating the training loss with respect to the hyperparameters, allowing us to tune discrete hyperparameters, data augmentation hyperparameters, and dropout probabilities. Because the hyperparameters are adapted online, our approach discovers hyperparameter schedules that can outperform fixed hyperparameter values. Empirically, our approach outperforms competing hyperparameter optimization methods on large-scale deep learning problems. We call our networks, which update their own hyperparameters online during training, Self-Tuning Networks (STNs).",/pdf/724b484decb86782e05d9e039fcc75e1040ea79d.pdf,ICLR,2019,"We use a hypernetwork to predict optimal weights given hyperparameters, and jointly train everything together." +b4Phn_aTm_e,MimvjOUGt2OT,1601310000000.0,1614990000000.0,3408,Pseudo Label-Guided Multi Task Learning for Scene Understanding,"[""~Sunkyung_Kim1"", ""~Hyesong_Choi1"", ""~Dongbo_Min3""]","[""Sunkyung Kim"", ""Hyesong Choi"", ""Dongbo Min""]","[""Multi-task learning"", ""monocular depth estimation"", ""semantic segmentation"", ""pseudo label"", ""cross-view consistency""]","Multi-task learning (MTL) for scene understanding has been actively studied by exploiting correlation of multiple tasks. This work focuses on improving the performance of the MTL network that infers depth and semantic segmentation maps from a single image. Specifically, we propose a novel MTL architecture, called Pseudo-MTL, that introduces pseudo labels for joint learning of monocular depth estimation and semantic segmentation tasks. The pseudo ground truth depth maps, generated from pretrained stereo matching methods, are leveraged to supervise the monocular depth estimation. More importantly, the pseudo depth labels serve to impose a cross-view consistency on the estimated monocular depth and segmentation maps of two views. 
This mitigates the mismatch problem incurred by inconsistent prediction results across two views. A thorough ablation study validates that the cross-view consistency leads to a substantial performance gain by ensuring inference-view invariance for the two tasks.",/pdf/59be1c152eb2cc9f77ada4b6a5e640ff33ae9cee.pdf,ICLR,2021,"This paper proposes a novel multi-task learning (MTL) architecture, called Pseudo-MTL, that leverages pseudo labels for joint learning of monocular depth estimation and semantic segmentation tasks." +Bkg5aoAqKm,Bye6Yq35YX,1538090000000.0,1545360000000.0,830,Fast Binary Functional Search on Graph,"[""laos1984@gmail.com"", ""zhixin0825@gmail.com"", ""zhaozhuoxu@gmail.com"", ""pingli98@gmail.com""]","[""Shulong Tan"", ""Zhixin Zhou"", ""Zhaozhuo Xu"", ""Ping Li""]","[""Binary Functional Search"", ""Large-scale Search"", ""Approximate Nearest Neighbor Search""]","Large-scale search is an essential task in modern information systems. Numerous learning based models are proposed to capture semantic level similarity measures for searching or ranking. However, these measures are usually complicated and beyond metric distances. As Approximate Nearest Neighbor Search (ANNS) techniques have specifications on metric distances, efficient searching by advanced measures is still an open question. In this paper, we formulate large-scale search as a general task, Optimal Binary Functional Search (OBFS), which contains ANNS as special cases. We analyze existing OBFS methods' limitations and explain they are not applicable for complicated searching measures. We propose a flexible graph-based solution for OBFS, Search on L2 Graph (SL2G). SL2G approximates gradient descent in Euclidean space, with accessible conditions. 
Experiments demonstrate SL2G's efficiency in searching by advanced matching measures (i.e., Neural Network based measures).",/pdf/7a2a3227130ef7d7eae554093e2cc8f7db79f891.pdf,ICLR,2019,Efficient Search by Neural Network based searching measures. +SkxJ8REYPH,ByxSsTUOPS,1569440000000.0,1583910000000.0,1123,SlowMo: Improving Communication-Efficient Distributed SGD with Slow Momentum,"[""jianyuw1@andrew.cmu.edu"", ""tantia@fb.com"", ""ballasn@fb.com"", ""mikerabbat@fb.com""]","[""Jianyu Wang"", ""Vinayak Tantia"", ""Nicolas Ballas"", ""Michael Rabbat""]","[""distributed optimization"", ""decentralized training methods"", ""communication-efficient distributed training with momentum"", ""large-scale parallel SGD""]","Distributed optimization is essential for training large models on large datasets. Multiple approaches have been proposed to reduce the communication overhead in distributed training, such as synchronizing only after performing multiple local SGD steps, and decentralized methods (e.g., using gossip algorithms) to decouple communications among workers. Although these methods run faster than AllReduce-based methods, which use blocking communication before every update, the resulting models may be less accurate after the same number of updates. Inspired by the BMUF method of Chen & Huo (2016), we propose a slow momentum (SlowMo) framework, where workers periodically synchronize and perform a momentum update, after multiple iterations of a base optimization algorithm. Experiments on image classification and machine translation tasks demonstrate that SlowMo consistently yields improvements in optimization and generalization performance relative to the base optimizer, even when the additional overhead is amortized over many updates so that the SlowMo runtime is on par with that of the base optimizer. We provide theoretical convergence guarantees showing that SlowMo converges to a stationary point of smooth non-convex losses. 
Since BMUF can be expressed through the SlowMo framework, our results also correspond to the first theoretical convergence guarantees for BMUF.",/pdf/e8558035e2690cc168ce3d147df58cab1cb0d93e.pdf,ICLR,2020,SlowMo improves the optimization and generalization performance of communication-efficient decentralized algorithms without sacrificing speed. +S1xh5sYgx,,1478240000000.0,1478240000000.0,125,SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size,"[""forresti@eecs.berkeley.edu"", ""songhan@stanford.edu"", ""moskewcz@eecs.berkeley.edu"", ""kashraf@eecs.berkeley.edu"", ""dally@stanford.edu"", ""keutzer@eecs.berkeley.edu""]","[""Forrest N. Iandola"", ""Song Han"", ""Matthew W. Moskewicz"", ""Khalid Ashraf"", ""William J. Dally"", ""Kurt Keutzer""]","[""Computer vision"", ""Deep learning""]","Recent research on deep neural networks has focused primarily on improving accuracy. For a given accuracy level, it is typically possible to identify multiple DNN architectures that achieve that accuracy level. With equivalent accuracy, smaller DNN architectures offer at least three advantages: (1) Smaller DNNs require less communication across servers during distributed training. (2) Smaller DNNs require less bandwidth to export a new model from the cloud to an autonomous car. (3) Smaller DNNs are more feasible to deploy on FPGAs and other hardware with limited memory. To provide all of these advantages, we propose a small DNN architecture called SqueezeNet. SqueezeNet achieves AlexNet-level accuracy on ImageNet with 50x fewer parameters. 
Additionally, with model compression techniques we are able to compress SqueezeNet to less than 0.5MB (510x smaller than AlexNet).",/pdf/6b649f83c76024a3120850a3f9e78d1560ad6fcd.pdf,ICLR,2017,Small CNN models +HJlfAo09KX,SJec3AK5Fm,1538090000000.0,1545360000000.0,875,Guaranteed Recovery of One-Hidden-Layer Neural Networks via Cross Entropy,"[""fu.436@osu.edu"", ""yuejiechi@cmu.edu"", ""liang.889@osu.edu""]","[""Haoyu Fu"", ""Yuejie Chi"", ""Yingbin Liang""]","[""cross entropy"", ""neural networks"", ""parameter recovery""]","We study model recovery for data classification, where the training labels are generated from a one-hidden-layer fully-connected neural network with sigmoid activations, and the goal is to recover the weight vectors of the neural network. We prove that under Gaussian inputs, the empirical risk function using cross entropy exhibits strong convexity and smoothness uniformly in a local neighborhood of the ground truth, as soon as the sample complexity is sufficiently large. This implies that if initialized in this neighborhood, which can be achieved via the tensor method, gradient descent converges linearly to a critical point that is provably close to the ground truth without requiring a fresh set of samples at each iteration. To the best of our knowledge, this is the first global convergence guarantee established for the empirical risk minimization using cross entropy via gradient descent for learning one-hidden-layer neural networks, at the near-optimal sample and computational complexity with respect to the network input dimension.",/pdf/28df4734fd8a152b0297b6c078605ef964201dc3.pdf,ICLR,2019,We provide the first theoretical analysis of guaranteed recovery of one-hidden-layer neural networks under cross entropy loss for classification problems. 
+3rRgu7OGgBI,qeb3JLvL7nK,1601310000000.0,1614990000000.0,996,Bi-tuning of Pre-trained Representations,"[""zhongjinchengwork@gmail.com"", ""~Ximei_Wang1"", ""kz19@mails.tsinghua.edu.cn"", ""~Jianmin_Wang1"", ""~Mingsheng_Long5""]","[""Jincheng Zhong"", ""Ximei Wang"", ""Zhi Kou"", ""Jianmin Wang"", ""Mingsheng Long""]","[""Deep learning"", ""fine-tuning"", ""pre-training""]","It is common within the deep learning community to first pre-train a deep neural network from a large-scale dataset and then fine-tune the pre-trained model to a specific downstream task. Recently, both supervised and unsupervised pre-training approaches to learning representations have achieved remarkable advances, which exploit the discriminative knowledge of labels and the intrinsic structure of data, respectively. It follows natural intuition that both discriminative knowledge and intrinsic structure of the downstream task can be useful for fine-tuning; however, existing fine-tuning methods mainly leverage the former and discard the latter. A question arises: How to fully explore the intrinsic structure of data for boosting fine-tuning? In this paper, we propose Bi-tuning, a general learning framework for fine-tuning both supervised and unsupervised pre-trained representations to downstream tasks. Bi-tuning generalizes the vanilla fine-tuning by integrating two heads upon the backbone of pre-trained representations: a classifier head with an improved contrastive cross-entropy loss to better leverage the label information in an instance-contrast way, and a projector head with a newly-designed categorical contrastive learning loss to fully exploit the intrinsic structure of data in a category-consistent way. 
Comprehensive experiments confirm that Bi-tuning achieves state-of-the-art results for fine-tuning tasks of both supervised and unsupervised pre-trained models by large margins (e.g., 10.7% absolute rise in accuracy on CUB in low-data regime).",/pdf/34ccd635fcab0acb0f1c026a488ae37bfcb51159.pdf,ICLR,2021,This paper proposes a general approach to deeply fine-tuning both supervised and unsupervised pre-trained representations to downstream tasks. +jWkw45-9AbL,zSdzofRhNFy,1601310000000.0,1615770000000.0,2042,A Distributional Approach to Controlled Text Generation,"[""~Muhammad_Khalifa2"", ""~Hady_Elsahar2"", ""~Marc_Dymetman1""]","[""Muhammad Khalifa"", ""Hady Elsahar"", ""Marc Dymetman""]","[""Controlled NLG"", ""Pretrained Language Models"", ""Bias in Language Models"", ""Energy-Based Models"", ""Information Geometry"", ""Exponential Families""]","We propose a Distributional Approach for addressing Controlled Text Generation from pre-trained Language Models (LM). This approach permits specifying, in a single formal framework, both “pointwise” and “distributional” constraints over the target LM (to our knowledge, the first model with such generality) while minimizing KL divergence from the initial LM distribution. The optimal target distribution is then uniquely determined as an explicit EBM (Energy-Based Model) representation. From that optimal representation, we then train a target controlled Autoregressive LM through an adaptive distributional variant of Policy Gradient. We conduct a first set of experiments over pointwise constraints showing the advantages of our approach over a set of baselines, in terms of obtaining a controlled LM balancing constraint satisfaction with divergence from the pretrained LM. We then perform experiments over distributional constraints, a unique feature of our approach, demonstrating its potential as a remedy to the problem of Bias in Language Models. 
Through an ablation study, we show the effectiveness of our adaptive technique for obtaining faster convergence. +Code available at https://github.com/naver/gdc",/pdf/11f0063ca9b22ee0ed8462004057c2417891ade2.pdf,ICLR,2021,"We propose a novel approach to Controlled NLG, relying on Constraints over Distributions, Information Geometry, and Sampling from Energy-Based Models." +rytNfI1AZ,SJ_VzI1A-,1509020000000.0,1519390000000.0,124,Training wide residual networks for deployment using a single bit for each weight,"[""mark.mcdonnell@unisa.edu.au""]","[""Mark D. McDonnell""]","[""wide residual networks"", ""model compression"", ""quantization"", ""1-bit weights""]","For fast and energy-efficient deployment of trained deep neural networks on resource-constrained embedded hardware, each learned weight parameter should ideally be represented and stored using a single bit. Error-rates usually increase when this requirement is imposed. Here, we report large improvements in error rates on multiple datasets, for deep convolutional neural networks deployed with 1-bit-per-weight. Using wide residual networks as our main baseline, our approach simplifies existing methods that binarize weights by applying the sign function in training; we apply scaling factors for each layer with constant unlearned values equal to the layer-specific standard deviations used for initialization. For CIFAR-10, CIFAR-100 and ImageNet, and models with 1-bit-per-weight requiring less than 10 MB of parameter memory, we achieve error rates of 3.9%, 18.5% and 26.0% / 8.5% (Top-1 / Top-5) respectively. We also considered MNIST, SVHN and ImageNet32, achieving 1-bit-per-weight test results of 0.27%, 1.9%, and 41.3% / 19.1% respectively. For CIFAR, our error rates halve previously reported values, and are within about 1% of our error-rates for the same network with full-precision weights. 
For networks that overfit, we also show significant improvements in error rate by not learning batch normalization scale and offset parameters. This applies to both full precision and 1-bit-per-weight networks. Using a warm-restart learning-rate schedule, we found that training for 1-bit-per-weight is just as fast as full-precision networks, with better accuracy than standard schedules, and achieved about 98%-99% of peak performance in just 62 training epochs for CIFAR-10/100. For full training code and trained models in MATLAB, Keras and PyTorch see https://github.com/McDonnell-Lab/1-bit-per-weight/ .",/pdf/861cb006a62eb71925571a5d4979901d047a92ea.pdf,ICLR,2018,"We train wide residual networks that can be immediately deployed using only a single bit for each convolutional weight, with significantly better accuracy than past methods." +UmrVpylRExB,5Jugm4T8K8x,1601310000000.0,1614990000000.0,3573,Dual-Tree Wavelet Packet CNNs for Image Classification,"[""~Hubert_Leterme1"", ""kevin.polisano@univ-grenoble-alpes.fr"", ""valerie.perrier@grenoble-inp.fr"", ""~Karteek_Alahari1""]","[""Hubert Leterme"", ""K\u00e9vin Polisano"", ""Val\u00e9rie Perrier"", ""Karteek Alahari""]","[""convolutional neural networks"", ""wavelet packet transform"", ""dual-tree wavelet packet transform"", ""image classification"", ""deep learning"", ""image processing""]","In this paper, we target an important issue of deep convolutional neural networks (CNNs) — the lack of a mathematical understanding of their properties. We present an explicit formalism that is motivated by the similarities between trained CNN kernels and oriented Gabor filters for addressing this problem. The core idea is to constrain the behavior of convolutional layers by splitting them into a succession of wavelet packet decompositions, which are modulated by freely-trained mixture weights. 
We evaluate our approach with three variants of wavelet decompositions with the AlexNet architecture for image classification as an example. The first variant relies on the separable wavelet packet transform while the other two implement the 2D dual-tree real and complex wavelet packet transforms, taking advantage of their feature extraction properties such as directional selectivity and shift invariance. Our experiments show that we achieve the accuracy rate of standard AlexNet, but with a significantly lower number of parameters, and an interpretation of the network that is grounded in mathematical theory.",/pdf/8797c50437804b0b3888915a5ab2faa982f59ca1.pdf,ICLR,2021,We introduce the dual-tree wavelet packet transform into convolutional neural networks in order to constrain their behavior while keeping their predicting power. +BJzuKiC9KX,H1xVFWjqFQ,1538090000000.0,1545360000000.0,459,Revisiting Reweighted Wake-Sleep,"[""tuananh@robots.ox.ac.uk"", ""adamk@robots.ox.ac.uk"", ""nsid@robots.ox.ac.uk"", ""y.w.teh@stats.ox.ac.uk"", ""fwood@cs.ubc.ca""]","[""Tuan Anh Le"", ""Adam R. Kosiorek"", ""N. Siddharth"", ""Yee Whye Teh"", ""Frank Wood""]","[""variational inference"", ""approximate inference"", ""generative models"", ""gradient estimators""]"," Discrete latent-variable models, while applicable in a variety of settings, can often be difficult to learn. Sampling discrete latent variables can result in high-variance gradient estimators for two primary reasons: 1) branching on the samples within the model, and 2) the lack of a pathwise derivative for the samples. While current state-of-the-art methods employ control-variate schemes for the former and continuous-relaxation methods for the latter, their utility is limited by the complexities of implementing and training effective control-variate schemes and the necessity of evaluating (potentially exponentially) many branch paths in the model. 
Here, we revisit the Reweighted Wake Sleep (RWS; Bornschein and Bengio, 2015) algorithm, and through extensive evaluations, show that it circumvents both these issues, outperforming current state-of-the-art methods in learning discrete latent-variable models. Moreover, we observe that, unlike the Importance-weighted Autoencoder, RWS learns better models and inference networks with increasing numbers of particles, and that its benefits extend to continuous latent-variable models as well. Our results suggest that RWS is a competitive, often preferable, alternative for learning deep generative models.",/pdf/c653bb9ffd52e2374b3b1568ae2b870a653004f6.pdf,ICLR,2019,Empirical analysis and explanation of particle-based gradient estimators for approximate inference with deep generative models. +BJxsrgStvr,BJgzA_etwB,1569440000000.0,1596780000000.0,2297,Drawing Early-Bird Tickets: Toward More Efficient Training of Deep Networks,"[""hy34@rice.edu"", ""cl114@rice.edu"", ""px5@rice.edu"", ""yf22@rice.edu"", ""yw68@rice.edu"", ""chernxh@tamu.edu"", ""richb@rice.edu"", ""atlaswang@tamu.edu"", ""yingyan.lin@rice.edu""]","[""Haoran You"", ""Chaojian Li"", ""Pengfei Xu"", ""Yonggan Fu"", ""Yue Wang"", ""Xiaohan Chen"", ""Richard G. Baraniuk"", ""Zhangyang Wang"", ""Yingyan Lin""]",[],"(Frankle & Carbin, 2019) shows that there exist winning tickets (small but critical subnetworks) for dense, randomly initialized networks, that can be trained alone to achieve comparable accuracies to the latter in a similar number of iterations. However, the identification of these winning tickets still requires the costly train-prune-retrain process, limiting their practical benefits. In this paper, we discover for the first time that the winning tickets can be identified at the very early training stage, which we term as Early-Bird (EB) tickets, via low-cost training schemes (e.g., early stopping and low-precision training) at large learning rates. 
Our finding of EB tickets is consistent with recently reported observations that the key connectivity patterns of neural networks emerge early. Furthermore, we propose a mask distance metric that can be used to identify EB tickets with low computational overhead, without needing to know the true winning tickets that emerge after the full training. Finally, we leverage the existence of EB tickets and the proposed mask distance to develop efficient training methods, which are achieved by first identifying EB tickets via low-cost schemes, and then continuing to train merely the EB tickets towards the target accuracy. Experiments based on various deep networks and datasets validate: 1) the existence of EB tickets and the effectiveness of mask distance in efficiently identifying them; and 2) that the proposed efficient training via EB tickets can achieve up to 5.8x ~ 10.7x energy savings while maintaining comparable or even better accuracy as compared to the most competitive state-of-the-art training methods, demonstrating a promising and easily adopted method for tackling cost-prohibitive deep network training.",/pdf/a1f8e34017463e572125292f8ac060420e835a1b.pdf,ICLR,2020, +rket4i0qtX,H1l7YpILKX,1538090000000.0,1545360000000.0,20,"The meaning of ""most"" for visual question answering models","[""aok25@cam.ac.uk"", ""aac10@cam.ac.uk""]","[""Alexander Kuhnle"", ""Ann Copestake""]","[""quantifier"", ""evaluation methodology"", ""psycholinguistics"", ""visual question answering""]","The correct interpretation of quantifier statements in the context of a visual scene requires non-trivial inference mechanisms. For the example of ""most"", we discuss two strategies which rely on fundamentally different cognitive concepts. Our aim is to identify what strategy deep learning models for visual question answering learn when trained on such questions. 
To this end, we carefully design data to replicate experiments from psycholinguistics where the same question was investigated for humans. Focusing on the FiLM visual question answering model, our experiments indicate that a form of approximate number system emerges whose performance declines with more difficult scenes as predicted by Weber's law. Moreover, we identify confounding factors, like spatial arrangement of the scene, which impede the effectiveness of this system.",/pdf/cfa9e1cebabbdee1621e129032dbba854de511f2.pdf,ICLR,2019,Psychology-inspired evaluation of quantifier understanding for visual question answering models +HJ7O61Yxe,,1478190000000.0,1481880000000.0,73,Modelling Relational Time Series using Gaussian Embeddings,"[""ludovic.dossantos@lip6.fr"", ""ali.ziat@vedecom.fr"", ""ludovic.denoyer@lip6.fr"", ""benjamin.piwowarski@lip6.fr"", ""patrick.gallinari@lip6.fr""]","[""Ludovic Dos Santos"", ""Ali Ziat"", ""Ludovic Denoyer"", ""Benjamin Piwowarski"", ""Patrick Gallinari""]","[""Applications"", ""Deep learning""]","We address the problem of modeling multiple simultaneous time series where the observations are correlated not only inside each series, but among the different series. This problem happens in many domains such as ecology, meteorology, etc. We propose a new dynamical state space model, based on representation learning, for modeling the evolution of such series. The joint relational and temporal dynamics of the series are modeled as Gaussian distributions in a latent space. A decoder maps the latent representations to the observations. The two components (dynamic model and decoder) are jointly trained. Using stochastic representations allows us to model the uncertainty inherent to observations and to predict unobserved values together with a confidence in the prediction.",/pdf/393639b108094123bb479f130715a48dede5fe30.pdf,ICLR,2017,We learn latent gaussian distributions for modelling correlated series. 
+ByldLrqlx,,1478280000000.0,1488550000000.0,252,DeepCoder: Learning to Write Programs,"[""matej.balog@gmail.com"", ""t-algaun@microsoft.com"", ""mabrocks@microsoft.com"", ""Sebastian.Nowozin@microsoft.com"", ""dtarlow@microsoft.com""]","[""Matej Balog"", ""Alexander L. Gaunt"", ""Marc Brockschmidt"", ""Sebastian Nowozin"", ""Daniel Tarlow""]","[""Deep learning"", ""Supervised Learning"", ""Applications"", ""Structured prediction""]","We develop a first line of attack for solving programming competition-style problems from input-output examples using deep learning. The approach is to train a neural network to predict properties of the program that generated the outputs from the inputs. We use the neural network's predictions to augment search techniques from the programming languages community, including enumerative search and an SMT-based solver. Empirically, we show that our approach leads to an order of magnitude speedup over the strong non-augmented baselines and a Recurrent Neural Network approach, and that we are able to solve problems of difficulty comparable to the simplest problems on programming competition websites.",/pdf/ffe925e50a919120e1753d6ecf7de6dbd57546c8.pdf,ICLR,2017, +4qR3coiNaIv,TMEjtinshw1,1601310000000.0,1616060000000.0,846,Scalable Bayesian Inverse Reinforcement Learning,"[""~Alex_James_Chan1"", ""~Mihaela_van_der_Schaar2""]","[""Alex James Chan"", ""Mihaela van der Schaar""]","[""Bayesian"", ""Inverse reinforcement learning"", ""Imitation Learning""]","Bayesian inference over the reward presents an ideal solution to the ill-posed nature of the inverse reinforcement learning problem. Unfortunately current methods generally do not scale well beyond the small tabular setting due to the need for an inner-loop MDP solver, and even non-Bayesian methods that do themselves scale often require extensive interaction with the environment to perform well, being inappropriate for high stakes or costly applications such as healthcare. 
In this paper we introduce our method, Approximate Variational Reward Imitation Learning (AVRIL), that addresses both of these issues by jointly learning an approximate posterior distribution over the reward that scales to arbitrarily complicated state spaces alongside an appropriate policy in a completely offline manner through a variational approach to said latent reward. Applying our method to real medical data alongside classic control simulations, we demonstrate Bayesian reward inference in environments beyond the scope of current methods, as well as task performance competitive with focused offline imitation learning algorithms.",/pdf/a94f190029f9ebd5affafc141a770843d501a8a2.pdf,ICLR,2021,A variational inference approach to Bayesian inverse reinforcement learning. +GNv-TyWu3PY,w6pA28FOxbq,1601310000000.0,1614990000000.0,307,Robust Learning for Congestion-Aware Routing,"[""~Sreenivas_Gollapudi2"", ""kostaskollias@google.com"", ""~Benjamin_Plaut2"", ""~Ameya_Velingker1""]","[""Sreenivas Gollapudi"", ""Kostas Kollias"", ""Benjamin Plaut"", ""Ameya Velingker""]","[""routing algorithms"", ""adversarial learning"", ""congestion functions""]","We consider the problem of routing users through a network with unknown congestion functions over an infinite time horizon. On each time step $t$, the algorithm receives a routing request and must select a valid path. For each edge $e$ in the selected path, the algorithm incurs a cost $c_e^t = f_e(x_e^t) + \eta_e^t$, where $x_e^t$ is the flow on edge $e$ at time $t$, $f_e$ is the congestion function, and $\eta_e^t$ is a noise sample drawn from an unknown distribution. The algorithm observes $c_e^t$, and can use this observation in future routing decisions. The routing requests are supplied adversarially. 
+ +We present an algorithm with cumulative regret $\tilde{O}(|E| t^{2/3})$, where the regret on each time step is defined as the difference between the total cost incurred by our chosen path and the minimum cost among all valid paths. Our algorithm has space complexity $O(|E| t^{1/3})$ and time complexity $O(|E| \log t)$. We also validate our algorithm empirically using graphs from New York City road networks.",/pdf/ef57175baf996f8ec69b1f7cebda0c28401ff666.pdf,ICLR,2021,"We present an algorithm which learns an optimal routing policy on any graph for any Lipschitz-continuous congestion functions, even in the presence of noisy observations and adversarial routing requests." +Sygx4305KQ,S1lRj5D9Ym,1538090000000.0,1545360000000.0,1418,Small steps and giant leaps: Minimal Newton solvers for Deep Learning,"[""joao@robots.ox.ac.uk"", ""hyenal@robots.ox.ac.uk"", ""albanie@robots.ox.ac.uk"", ""vedali@robots.ox.ac.uk""]","[""Joao Henriques"", ""Sebastien Ehrhardt"", ""Samuel Albanie"", ""Andrea Vedaldi""]","[""deep learning""]","We propose a fast second-order method that can be used as a drop-in replacement for current deep learning solvers. Compared to stochastic gradient descent (SGD), it only requires two additional forward-mode automatic differentiation operations per iteration, which has a computational cost comparable to two standard forward passes and is easy to implement. Our method addresses long-standing issues with current second-order solvers, which invert an approximate Hessian matrix every iteration exactly or by conjugate-gradient methods, procedures that are much slower than a SGD step. Instead, we propose to keep a single estimate of the gradient projected by the inverse Hessian matrix, and update it once per iteration with just two passes over the network. This estimate has the same size and is similar to the momentum variable that is commonly used in SGD. No estimate of the Hessian is maintained. 
+We first validate our method, called CurveBall, on small problems with known solutions (noisy Rosenbrock function and degenerate 2-layer linear networks), where current deep learning solvers struggle. We then train several large models on CIFAR and ImageNet, including ResNet and VGG-f networks, where we demonstrate faster convergence with no hyperparameter tuning. We also show our optimiser's generality by testing on a large set of randomly-generated architectures.",/pdf/1e1dc23b5fa5436d68821b3cf095809205d60d3e.pdf,ICLR,2019,A fast second-order solver for deep learning that works on ImageNet-scale problems with no hyper-parameter tuning +Skl6peHFwS,rkxn6V-YDS,1569440000000.0,1577170000000.0,2590,Best feature performance in codeswitched hate speech texts,"[""eombui@anu.ac.ke"", ""lmuchemi@uonbi.ac.ke"", ""waiganjo@uonbi.ac.ke""]","[""Edward Ombui"", ""Lawrence Muchemi"", ""Peter Wagacha""]","[""Hate Speech"", ""Code-switching"", ""feature selection"", ""representation learning""]","How well can hate speech concept be abstracted in order to inform automatic classification in codeswitched texts by machine learning classifiers? We explore different representations and empirically evaluate their predictiveness using both conventional and deep learning algorithms in identifying hate speech in a ~48k human-annotated dataset that contain mixed languages, a phenomenon common among multilingual speakers. This paper espouses a novel approach to handle this challenge by introducing a hierarchical approach that employs Latent Dirichlet Allocation to generate topic models that feed into another high-level feature set that we acronym PDC. PDC groups similar meaning words in word families during the preprocessing stage for supervised learning models. The high-level PDC features generated are based on Ombui et al, (2019) hate speech annotation framework that is informed by the triangular theory of hate (Stanberg,2003). 
Results obtained from frequency-based models using the PDC feature on the annotated dataset of ~48k short messages comprising tweets generated during the 2012 and 2017 Kenyan presidential elections indicate an improvement on classification accuracy in identifying hate speech as compared to the baseline",/pdf/d8741878becd17755e18333f269789a84f66520f.pdf,ICLR,2020,Analysis of features and algorithm performance on codeswitched language datasets +6jlNy83JUQ_,k2Oouo4d4rg,1601310000000.0,1614990000000.0,2865,Low Complexity Approximate Bayesian Logistic Regression for Sparse Online Learning,"[""~Gil_I._Shamir1"", ""~Wojciech_Szpankowski1""]","[""Gil I. Shamir"", ""Wojciech Szpankowski""]","[""Bayesian methods"", ""logistic regression"", ""regret"", ""online learning"", ""MDL.""]","Theoretical results show that Bayesian methods can achieve lower bounds on regret for online logistic regression. In practice, however, such techniques may not be feasible especially for very large feature sets. Various approximations that, for huge sparse feature sets, diminish the theoretical advantages, must be used. Often, they apply stochastic gradient methods with hyper-parameters that must be tuned on some surrogate loss, defeating theoretical advantages of Bayesian methods. The surrogate loss, defined to approximate the mixture, requires techniques such as Monte Carlo sampling, increasing computations per example. We propose low complexity analytical approximations for sparse online logistic and probit regressions. Unlike variational inference and other methods, our methods use analytical closed forms, substantially lowering computations. Unlike dense solutions, +such as Gaussian Mixtures, our methods allow for sparse problems with huge feature sets without increasing complexity. With the analytical closed forms, there is also no need for applying stochastic gradient methods on surrogate losses, and for tuning and balancing learning and regularization hyper-parameters. 
Empirical results top the performance of the more computationally involved methods. Like such methods, our methods still reveal per feature and per example uncertainty measures. +",/pdf/dfce76a5c9bbc1dcca5109804d029d5a3684a266.pdf,ICLR,2021,Simple online learning methods for logistic regression with empirical regret better than other methods that appears to be close to lower bounds. +rklhqkHFDB,rJgjSsCdwB,1569440000000.0,1577170000000.0,1892,LARGE SCALE REPRESENTATION LEARNING FROM TRIPLET COMPARISONS,"[""siyavash.haghiri@gmail.com"", ""leena.chennuru-vankadara@uni-tuebingen.de"", ""luxburg@informatik.uni-tuebingen.de""]","[""Siavash Haghiri"", ""Leena Chennuru Vankadara"", ""Ulrike von Luxburg""]","[""representation learning"", ""triplet comparison"", ""contrastive learning"", ""ordinal embedding""]","In this paper, we discuss the fundamental problem of representation learning from a new perspective. It has been observed in many supervised/unsupervised DNNs that the final layer of the network often provides an informative representation for many tasks, even though the network has been trained to perform a particular task. The common ingredient in all previous studies is a low-level feature representation for items, for example, RGB values of images in the image context. In the present work, we assume that no meaningful representation of the items is given. Instead, we are provided with the answers to some triplet comparisons of the following form: Is item A more similar to item B or item C? We provide a fast algorithm based on DNNs that constructs a Euclidean representation for the items, using solely the answers to the above-mentioned triplet comparisons. This problem has been studied in a sub-community of machine learning by the name ""Ordinal Embedding"". Previous approaches to the problem are painfully slow and cannot scale to larger datasets. 
We demonstrate that our proposed approach is significantly faster than available methods, and can scale to real-world large datasets. + +Thereby, we also draw attention to the less explored idea of using neural networks to directly, approximately solve non-convex, NP-hard optimization problems that arise naturally in unsupervised learning problems.",/pdf/3d40dac6797637c8548ca6626171ba28533e3a96.pdf,ICLR,2020, +Hy3_KuYxg,,1478230000000.0,1479010000000.0,108,Divide and Conquer with Neural Networks,"[""anv273@nyu.edu"", ""bruna@cims.nyu.edu""]","[""Alex Nowak"", ""Joan Bruna""]","[""Deep learning""]","We consider the learning of algorithmic tasks by mere observation of input-output pairs. +Rather than studying this as a black-box discrete regression problem with no assumption whatsoever +on the input-output mapping, we concentrate on tasks that are amenable to the principle of divide and conquer, and study what are its implications in terms of learning. + +This principle creates a powerful inductive bias that we exploit with neural architectures that are defined recursively, by learning two scale-invariant atomic operators: how to split a given input into two disjoint sets, and how to merge two partially solved tasks into a larger partial solution. The scale invariance creates parameter sharing across all stages of the architecture, and the dynamic design creates architectures whose complexity can be tuned in a differentiable manner. + +As a result, our model is trained by backpropagation not only to minimize the errors at the output, but also to do so as efficiently as possible, by enforcing shallower computation graphs. Moreover, thanks to the scale invariance, the model can be trained with only input/output pairs, removing the need to know oracle intermediate split and merge decisions. 
As it turns out, accuracy and complexity are not independent qualities, and we verify empirically that when the learnt complexity matches the underlying complexity of the task, this results in higher accuracy and better generalization in two paradigmatic problems: sorting and finding planar convex hulls.",/pdf/5fc8ed0ce293f8444896021bdacd823dcb8cea37.pdf,ICLR,2017,learn dynamic programming with neural networks +sebtMY-TrXh,8ZrqTI2nHU0,1601310000000.0,1614990000000.0,1584,AriEL: Volume Coding for Sentence Generation Comparisons,"[""~Luca_Celotti1"", ""~Simon_Brodeur1"", ""~Jean_Rouat1""]","[""Luca Celotti"", ""Simon Brodeur"", ""Jean Rouat""]",[],"Mapping sequences of discrete data to a point in a continuous space makes it difficult to retrieve those sequences via random sampling. Mapping the input to a volume would make it easier to retrieve at test time, and that is the strategy followed by the family of approaches based on Variational Autoencoder. However the fact that they are at the same time optimizing for prediction and for smoothness of representation, forces them to trade-off between the two. We benchmark the performance of some of the standard methods in deep learning to generate sentences by uniformly sampling a continuous space. We do it by proposing AriEL, that constructs volumes in a continuous space, without the need of encouraging the creation of volumes through the loss function. We first benchmark on a toy grammar, that allows to automatically evaluate the language learned and generated by the models. Then, we benchmark on a real dataset of human dialogues. Our results indicate that the random access to the stored information can be significantly improved, since our method AriEL is able to generate a wider variety of correct language by randomly sampling the latent space. VAE follows in performance for the toy dataset while, AE and Transformer follow for the real dataset. 
This partially supports the hypothesis that encoding information into volumes instead of into points, leads to improved retrieval of learned information with random sampling. We hope this analysis can clarify directions to lead to better generators.",/pdf/f060c74a965cc2f640663f97fac67d11823b6746.pdf,ICLR,2021,Coding discrete information into volumes in a continuous space can improve generation by random sampling retrieval. +r1xywsC9tQ,H1gA15L5YQ,1538090000000.0,1545360000000.0,226,Mapping the hyponymy relation of wordnet onto vector Spaces,"[""jean-philippe.bernardy@gu.se"", ""aleksandre.maskharashvili@gu.se""]","[""Jean-Philippe Bernardy"", ""Aleksandre Maskharashvili""]","[""fasttext"", ""hyponymy"", ""wordnet""]"," In this paper, we investigate mapping the hyponymy relation of + wordnet to feature vectors. + We aim to model lexical knowledge in such a way that it can be used as + input in generic machine-learning models, such as phrase entailment + predictors. + We propose two models. The first one leverages an existing mapping of + words to feature vectors (fasttext), and attempts to classify + such vectors as within or outside of each class. The second model is fully supervised, + using solely wordnet as a ground truth. It maps each concept to an + interval or a disjunction thereof. + On the first model, we approach, but not quite attain state of the + art performance. The second model can achieve near-perfect accuracy. 
+",/pdf/02fcd10e0c1e0a7a5e9ea671896bfafd8e11b770.pdf,ICLR,2019,We investigate mapping the hyponymy relation of wordnet to feature vectors +Hrtbm8u0RXu,vBuOOx9RkD5,1601310000000.0,1614990000000.0,2522,Provable Memorization via Deep Neural Networks using Sub-linear Parameters,"[""~Sejun_Park1"", ""~Jaeho_Lee3"", ""~Chulhee_Yun1"", ""~Jinwoo_Shin1""]","[""Sejun Park"", ""Jaeho Lee"", ""Chulhee Yun"", ""Jinwoo Shin""]","[""memorization""]","It is known that $\Theta(N)$ parameters are sufficient for neural networks to memorize arbitrary $N$ input-label pairs. By exploiting depth, we show that $\Theta(N^{2/3})$ parameters suffice to memorize $N$ pairs, under a mild condition on the separation of input points. In particular, deeper networks (even with width $3$) are shown to memorize more pairs than shallow networks, which also agrees with the recent line of works on the benefits of depth for function approximation. We also provide empirical results that support our theoretical findings.",/pdf/bfddce25f94a9054d088a60fe8876f1963c08888.pdf,ICLR,2021,Neural networks with o(N) parameters can memorize arbitrary N samples under some mild condition. +r1xa9TVFvH,S1xWFVRPDr,1569440000000.0,1577170000000.0,724,NeuralUCB: Contextual Bandits with Neural Network-Based Exploration,"[""drzhou@cs.ucla.edu"", ""lihongli.cs@gmail.com"", ""qgu@cs.ucla.edu""]","[""Dongruo Zhou"", ""Lihong Li"", ""Quanquan Gu""]","[""contextual bandits"", ""neural network"", ""upper confidence bound""]","We study the stochastic contextual bandit problem, where the reward is generated from an unknown bounded function with additive noise. We propose the NeuralUCB algorithm, which leverages the representation power of deep neural networks and uses the neural network-based random feature mapping to construct an upper confidence bound (UCB) of reward for efficient exploration. We prove that, under mild assumptions, NeuralUCB achieves $\tilde O(\sqrt{T})$ regret bound, where $T$ is the number of rounds. 
To the best of our knowledge, our algorithm is the first neural network-based contextual bandit algorithm with near-optimal regret guarantee. Preliminary experiment results on synthetic data corroborate our theory, and shed light on potential applications of our algorithm to real-world problems.",/pdf/33a0927d6e52ad2cacb15acfe6503121d2daf589.pdf,ICLR,2020, +SJzRZ-WCZ,r1g0-WbCW,1509130000000.0,1518730000000.0,657,Latent Space Oddity: on the Curvature of Deep Generative Models,"[""gear@dtu.dk"", ""lkai@dtu.dk"", ""sohau@dtu.dk""]","[""Georgios Arvanitidis"", ""Lars Kai Hansen"", ""S\u00f8ren Hauberg""]","[""Generative models"", ""Riemannian Geometry"", ""Latent Space""]","Deep generative models provide a systematic way to learn nonlinear data distributions through a set of latent variables and a nonlinear ""generator"" function that maps latent points into the input space. The nonlinearity of the generator implies that the latent space gives a distorted view of the input space. Under mild conditions, we show that this distortion can be characterized by a stochastic Riemannian metric, and we demonstrate that distances and interpolants are significantly improved under this metric. This in turn improves probability distributions, sampling algorithms and clustering in the latent space. Our geometric analysis further reveals that current generators provide poor variance estimates and we propose a new generator architecture with vastly improved variance estimates. 
Results are demonstrated on convolutional and fully connected variational autoencoders, but the formalism easily generalizes to other deep generative models.",/pdf/1812e30a45f4f7f74ea11279b906923b889f4019.pdf,ICLR,2018, +BJgdOh4Ywr,Sye9k-3ULr,1569440000000.0,1577170000000.0,48,Visual Imitation with Reinforcement Learning using Recurrent Siamese Networks,"[""gberseth@gmail.com"", ""christopher.pal@polymtl.ca""]","[""Glen Berseth"", ""Christopher Pal""]","[""imitation learning"", ""reinforcement learning"", ""imitation from video""]","It would be desirable for a reinforcement learning (RL) based agent to learn behaviour by merely watching a demonstration. However, defining rewards that facilitate this goal within the RL paradigm remains a challenge. Here we address this problem with Siamese networks, trained to compute distances between observed behaviours and the agent’s behaviours. Given a desired motion such Siamese networks can be used to provide a reward signal to an RL agent via the distance between the desired motion and the agent’s motion. We experiment with an RNN-based comparator model that can compute distances in space and time between motion clips while training an RL policy to minimize this distance. Through experimentation, we have also found that the inclusion of multi-task data and an additional image encoding loss helps enforce the temporal consistency. These two components appear to balance reward for matching a specific instance of a behaviour versus that behaviour in general. Furthermore, we focus here on a particularly challenging form of this problem where only a single demonstration is provided for a given task – the one-shot learning setting. We demonstrate our approach on humanoid agents in both 2D with 10 degrees of freedom (DoF) and 3D with 38 DoF.",/pdf/87b3aeaae15eff061a2a49170fcbab9dbc1511ae.pdf,ICLR,2020,Learning recurrent distance models for imitation from a single video clip using reinforcement learning. 
+H1gBsgBYwH,H1gO9y-Yvr,1569440000000.0,1583910000000.0,2506,Generalization of Two-layer Neural Networks: An Asymptotic Viewpoint,"[""jba@cs.toronto.edu"", ""erdogdu@cs.toronto.edu"", ""taiji@mist.i.u-tokyo.ac.jp"", ""dennywu@cs.toronto.edu"", ""ztz16@mails.tsinghua.edu.cn""]","[""Jimmy Ba"", ""Murat Erdogdu"", ""Taiji Suzuki"", ""Denny Wu"", ""Tianzong Zhang""]","[""Neural Networks"", ""Generalization"", ""High-dimensional Statistics""]","This paper investigates the generalization properties of two-layer neural networks in high-dimensions, i.e. when the number of samples $n$, features $d$, and neurons $h$ tend to infinity at the same rate. Specifically, we derive the exact population risk of the unregularized least squares regression problem with two-layer neural networks when either the first or the second layer is trained using a gradient flow under different initialization setups. When only the second layer coefficients are optimized, we recover the \textit{double descent} phenomenon: a cusp in the population risk appears at $h\approx n$ and further overparameterization decreases the risk. In contrast, when the first layer weights are optimized, we highlight how different scales of initialization lead to different inductive bias, and show that the resulting risk is \textit{independent} of overparameterization. Our theoretical and experimental results suggest that previously studied model setups that provably give rise to \textit{double descent} might not translate to optimizing two-layer neural networks.",/pdf/f00f191828c2fb2bfaedb094247dab585d9b6b7f.pdf,ICLR,2020,"Derived population risk of two-layer neural networks in high dimensions and examined presence / absence of ""double descent""." 
+Skgvy64tvr,rJeUcjwSPH,1569440000000.0,1583910000000.0,305,Enhancing Adversarial Defense by k-Winners-Take-All,"[""chang@cs.columbia.edu"", ""pz2225@columbia.edu"", ""cxz@cs.columbia.edu""]","[""Chang Xiao"", ""Peilin Zhong"", ""Changxi Zheng""]","[""adversarial defense"", ""activation function"", ""winner takes all""]","We propose a simple change to existing neural network structures for better defending against gradient-based adversarial attacks. Instead of using popular activation functions (such as ReLU), we advocate the use of k-Winners-Take-All (k-WTA) activation, a C0 discontinuous function that purposely invalidates the neural network model’s gradient at densely distributed input data points. The proposed k-WTA activation can be readily used in nearly all existing networks and training methods with no significant overhead. Our proposal is theoretically rationalized. We analyze why the discontinuities in k-WTA networks can largely prevent gradient-based search of adversarial examples and why they at the same time remain innocuous to the network training. This understanding is also empirically backed. We test k-WTA activation on various network structures optimized by a training method, be it adversarial training or not. In all cases, the robustness of k-WTA networks outperforms that of traditional networks under white-box attacks.",/pdf/5c94cf92b84ba613df82441fcef5299a5c0105ad.pdf,ICLR,2020,"We propose a simple change to existing neural network structures for better defending against gradient-based adversarial attacks, using the k-winners-take-all activation function." 
+SJMnG2C9YX,Syes4fOqF7,1538090000000.0,1545360000000.0,1305,Complementary-label learning for arbitrary losses and models,"[""ishida@ms.k.u-tokyo.ac.jp"", ""gang.niu@riken.jp"", ""aditya.menon@anu.edu.au"", ""sugi@k.u-tokyo.ac.jp""]","[""Takashi Ishida"", ""Gang Niu"", ""Aditya Krishna Menon"", ""Masashi Sugiyama""]","[""complementary labels"", ""weak supervision""]","In contrast to the standard classification paradigm where the true (or possibly noisy) class is given to each training pattern, complementary-label learning only uses training patterns each equipped with a complementary label. This only specifies one of the classes that the pattern does not belong to. The seminal paper on complementary-label learning proposed an unbiased estimator of the classification risk that can be computed only from complementarily labeled data. However, it required a restrictive condition on the loss functions, making it impossible to use popular losses such as the softmax cross-entropy loss. Recently, another formulation with the softmax cross-entropy loss was proposed with consistency guarantee. However, this formulation does not explicitly involve a risk estimator. Thus model/hyper-parameter selection is not possible by cross-validation; we may need additional ordinarily labeled data for validation purposes, which is not available in the current setup. In this paper, we give a novel general framework of complementary-label learning, and derive an unbiased risk estimator for arbitrary losses and models. We further improve the risk estimator by non-negative correction and demonstrate its superiority through experiments.",/pdf/e98a6e24cbb3e2d9545a12aa83f4115c1bc1f800.pdf,ICLR,2019,"From now on, you can train ResNet and DenseNet, even if no class label given for training is correct!" 
+B1lnzn0ctQ,B1xa-XacFQ,1538090000000.0,1550650000000.0,1304,ALISTA: Analytic Weights Are As Good As Learned Weights in LISTA,"[""liujl11@math.ucla.edu"", ""chernxh@tamu.edu"", ""atlaswang@tamu.edu"", ""wotaoyin@math.ucla.edu""]","[""Jialin Liu"", ""Xiaohan Chen"", ""Zhangyang Wang"", ""Wotao Yin""]","[""sparse recovery"", ""neural networks""]","Deep neural networks based on unfolding an iterative algorithm, for example, LISTA (learned iterative shrinkage thresholding algorithm), have been an empirical success for sparse signal recovery. The weights of these neural networks are currently determined by data-driven “black-box” training. In this work, we propose Analytic LISTA (ALISTA), where the weight matrix in LISTA is computed as the solution to a data-free optimization problem, leaving only the stepsize and threshold parameters to data-driven learning. This significantly simplifies the training. Specifically, the data-free optimization problem is based on coherence minimization. We show our ALISTA retains the optimal linear convergence proved in (Chen et al., 2018) and has a performance comparable to LISTA. Furthermore, we extend ALISTA to convolutional linear operators, again determined in a data-free manner. We also propose a feed-forward framework that combines the data-free optimization and ALISTA networks from end to end, one that can be jointly trained to gain robustness to small perturbations in the encoding model.",/pdf/e664f80e199d7c901289a4b7319d1f7558b83040.pdf,ICLR,2019, +HkljioCcFQ,BkgqzBYctm,1538090000000.0,1551280000000.0,655,MARGINALIZED AVERAGE ATTENTIONAL NETWORK FOR WEAKLY-SUPERVISED LEARNING,"[""yuanyuan910115@gmail.com"", ""lv_yueming@outlook.com"", ""shenxiluc@gmail.com"", ""ivor.tsang@uts.edu.au"", ""dyyeung@cse.ust.hk""]","[""Yuan Yuan"", ""Yueming Lyu"", ""Xi Shen"", ""Ivor W. 
Tsang"", ""Dit-Yan Yeung""]","[""feature aggregation"", ""weakly supervised learning"", ""temporal action localization""]","In weakly-supervised temporal action localization, previous works have failed to locate dense and integral regions for each entire action due to the overestimation of the most salient regions. To alleviate this issue, we propose a marginalized average attentional network (MAAN) to suppress the dominant response of the most salient regions in a principled manner. The MAAN employs a novel marginalized average aggregation (MAA) module and learns a set of latent discriminative probabilities in an end-to-end fashion. MAA samples multiple subsets from the video snippet features according to a set of latent discriminative probabilities and takes the expectation over all the averaged subset features. Theoretically, we prove that the MAA module with learned latent discriminative probabilities successfully reduces the difference in responses between the most salient regions and the others. Therefore, MAAN is able to generate better class activation sequences and identify dense and integral action regions in the videos. Moreover, we propose a fast algorithm to reduce the complexity of constructing MAA from $O(2^T)$ to $O(T^2)$. Extensive experiments on two large-scale video datasets show that our MAAN achieves a superior performance on weakly-supervised temporal action localization. 
+ + +",/pdf/f19c85e5192744836bdb8673a35e70549fdf1f79.pdf,ICLR,2019,A novel marginalized average attentional network for weakly-supervised temporal action localization +B1gdkxHFDH,rklrs9ytvr,1569440000000.0,1583910000000.0,2068,Training individually fair ML models with sensitive subspace robustness,"[""mikhail.yurochkin@ibm.com"", ""amandarg@umich.edu"", ""yuekai@umich.edu""]","[""Mikhail Yurochkin"", ""Amanda Bower"", ""Yuekai Sun""]","[""fairness"", ""adversarial robustness""]","We consider training machine learning models that are fair in the sense that their performance is invariant under certain sensitive perturbations to the inputs. For example, the performance of a resume screening system should be invariant under changes to the gender and/or ethnicity of the applicant. We formalize this notion of algorithmic fairness as a variant of individual fairness and develop a distributionally robust optimization approach to enforce it during training. We also demonstrate the effectiveness of the approach on two ML tasks that are susceptible to gender and racial biases. ",/pdf/c58026f0eb4878500263d20e9fb3ceb1ba26c7ca.pdf,ICLR,2020,Algorithm for training individually fair classifier using adversarial robustness +BkeDGJBKvB,H1lZtz3OvB,1569440000000.0,1577170000000.0,1583,Multitask Soft Option Learning,"[""maximilian.igl@gmail.com"", ""gambs@robots.ox.ac.uk"", ""jinkehe1996@gmail.com"", ""nantas@robots.ox.ac.uk"", ""nsid@robots.ox.ac.uk"", ""wendelin.boehmer@cs.ox.ac.uk"", ""shimon.whiteson@cs.ox.ac.uk""]","[""Maximilian Igl"", ""Andrew Gambardella"", ""Jinke He"", ""Nantas Nardelli"", ""N. Siddharth"", ""Wendelin B\u00f6hmer"", ""Shimon Whiteson""]","[""Hierarchical Reinforcement Learning"", ""Reinforcement Learning"", ""Control as Inference"", ""Options"", ""Multitask Learning""]","We present Multitask Soft Option Learning (MSOL), a hierarchical multi-task framework based on Planning-as-Inference. 
MSOL extends the concept of Options, using separate variational posteriors for each task, regularized by a shared prior. The learned soft-options are temporally extended, allowing a higher-level master policy to train faster on new tasks by making decisions with lower frequency. Additionally, MSOL allows fine-tuning of soft-options for new tasks without unlearning previously useful behavior, and avoids problems with local minima in multitask training. We demonstrate empirically that MSOL significantly outperforms both hierarchical and flat transfer-learning baselines in challenging multi-task environments.",/pdf/5e7c12118369a575ab52f9fb553689c36468b080.pdf,ICLR,2020,"In Hierarchical RL, we introduce the notion of a 'soft', i.e. adaptable, option and show that this helps learning in multitask settings." +SJzvDjAcK7,Skl_gFZYYQ,1538090000000.0,1545360000000.0,271,Intriguing Properties of Learned Representations,"[""amartya.sanyal@cs.ox.ac.uk"", ""varunk@cs.ox.ac.uk"", ""philip.torr@eng.ox.ac.uk""]","[""Amartya Sanyal"", ""Varun Kanade"", ""Philip H. Torr""]","[""deep learning"", ""low rank representations"", ""adversarial robustness""]","A key feature of neural networks, particularly deep convolutional neural networks, is their ability to learn useful representations from data. The very last layer of a neural network is then simply a linear model trained on these learned representations. Despite their numerous applications in other tasks such as classification, retrieval, clustering etc., a.k.a. transfer learning, not much work has been published that investigates the structure of these representations or indeed whether structure can be imposed on them during the training process. + +In this paper, we study the effective dimensionality of the learned representations by models that have proved highly successful for image classification. 
We focus on ResNet-18, ResNet-50 and VGG-19 and observe that when trained on CIFAR10 or CIFAR100, the learned representations exhibit a fairly low rank structure. We propose a modification to the training procedure, which further encourages low rank structure on learned activations. Empirically, we show that this has implications for robustness to adversarial examples and compression.",/pdf/8e48f18cc855c1a41314a43723a4a98315671607.pdf,ICLR,2019,Imposing a low rank structure on learned representations in deep networks yields a lot of interesting benefits. +BJgd81SYwr,HylbYP6OwB,1569440000000.0,1583910000000.0,1735,Meta Dropout: Learning to Perturb Latent Features for Generalization,"[""haebeom.lee@kaist.ac.kr"", ""namsan@kaist.ac.kr"", ""eunhoy@kaist.ac.kr"", ""sjhwang82@kaist.ac.kr""]","[""Hae Beom Lee"", ""Taewook Nam"", ""Eunho Yang"", ""Sung Ju Hwang""]",[],"A machine learning model that generalizes well should obtain low errors on unseen test examples. Thus, if we know how to optimally perturb training examples to account for test examples, we may achieve better generalization performance. However, obtaining such perturbation is not possible in standard machine learning frameworks as the distribution of the test data is unknown. To tackle this challenge, we propose a novel regularization method, meta-dropout, which learns to perturb the latent features of training examples for generalization in a meta-learning framework. Specifically, we meta-learn a noise generator which outputs a multiplicative noise distribution for latent features, to obtain low errors on the test instances in an input-dependent manner. Then, the learned noise generator can perturb the training examples of unseen tasks at the meta-test time for improved generalization. 
We validate our method on few-shot classification datasets, whose results show that it significantly improves the generalization performance of the base model, and largely outperforms existing regularization methods such as information bottleneck, manifold mixup, and information dropout.",/pdf/0eed1795c32d19171f8f21961a0cdc58d8315d41.pdf,ICLR,2020, +ryeh4jA9F7,HkeDx8REKQ,1538090000000.0,1545360000000.0,37,Playing the Game of Universal Adversarial Perturbations,"[""perolat@google.com"", ""mateuszm@google.com"", ""piot@google.com"", ""pietquin@google.com""]","[""Julien Perolet"", ""Mateusz Malinowski"", ""Bilal Piot"", ""Olivier Pietquin""]","[""adversarial perturbations"", ""universal adversarial perturbations"", ""game theory"", ""robust machine learning""]","We study the problem of learning classifiers robust to universal adversarial perturbations. While prior work approaches this problem via robust optimization, adversarial training, or input transformation, we instead phrase it as a two-player zero-sum game. In this new formulation, both players simultaneously play the same game, where one player chooses a classifier that minimizes a classification loss whilst the other player creates an adversarial perturbation that increases the same loss when applied to every sample in the training set. +By observing that performing a classification (respectively creating adversarial samples) is the best response to the other player, we propose a novel extension of a game-theoretic algorithm, namely fictitious play, to the domain of training robust classifiers. 
Finally, we empirically show the robustness and versatility of our approach in two defence scenarios where universal attacks are performed on several image classification datasets -- CIFAR10, CIFAR100 and ImageNet.",/pdf/4f61f7e7ae37b7c1dbfba6cf4472e05befd60945.pdf,ICLR,2019,"We propose a robustification method under the presence of universal adversarial perturbations, by connecting a game theoretic method (fictitious play) with the problem of robustification, and making it more scalable." +Vd7lCMvtLqg,q2pqOghtwLi,1601310000000.0,1615440000000.0,1353,Anchor & Transform: Learning Sparse Embeddings for Large Vocabularies,"[""~Paul_Pu_Liang1"", ""~Manzil_Zaheer1"", ""~Yuan_Wang1"", ""~Amr_Ahmed1""]","[""Paul Pu Liang"", ""Manzil Zaheer"", ""Yuan Wang"", ""Amr Ahmed""]","[""sparse embeddings"", ""large vocabularies"", ""text classification"", ""language modeling"", ""recommendation systems""]","Learning continuous representations of discrete objects such as text, users, movies, and URLs lies at the heart of many applications including language and user modeling. When using discrete objects as input to neural networks, we often ignore the underlying structures (e.g., natural groupings and similarities) and embed the objects independently into individual vectors. As a result, existing methods do not scale to large vocabulary sizes. In this paper, we design a simple and efficient embedding algorithm that learns a small set of anchor embeddings and a sparse transformation matrix. We call our method Anchor & Transform (ANT) as the embeddings of discrete objects are a sparse linear combination of the anchors, weighted according to the transformation matrix. ANT is scalable, flexible, and end-to-end trainable. We further provide a statistical interpretation of our algorithm as a Bayesian nonparametric prior for embeddings that encourages sparsity and leverages natural groupings among objects. 
By deriving an approximate inference algorithm based on Small Variance Asymptotics, we obtain a natural extension that automatically learns the optimal number of anchors instead of having to tune it as a hyperparameter. On text classification, language modeling, and movie recommendation benchmarks, we show that ANT is particularly suitable for large vocabulary sizes and demonstrates stronger performance with fewer parameters (up to 40x compression) as compared to existing compression baselines.",/pdf/22fba9165ecde0ee1ac4f19e7d5c5d99fd4c5146.pdf,ICLR,2021,End-to-end learning of sparse embeddings for large vocabularies with a Bayesian nonparametric interpretation that results in up to 40x smaller embedding tables. +9_J4DrgC_db,g46mJvgBC9k,1601310000000.0,1614990000000.0,3412,Deep Coherent Exploration For Continuous Control,"[""~Yijie_Zhang1"", ""~Herke_van_Hoof4""]","[""Yijie Zhang"", ""Herke van Hoof""]","[""reinforcement learning"", ""exploration"", ""latent variable models""]","In policy search methods for reinforcement learning (RL), exploration is often performed by injecting noise either in action space at each step independently or in parameter space over each full trajectory. In prior work, it has been shown that with linear policies, a more balanced trade-off between these two exploration strategies is beneficial. However, that method did not scale to policies using deep neural networks. In this paper, we introduce Deep Coherent Exploration, a general and scalable exploration framework for deep RL algorithms on continuous control, that generalizes step-based and trajectory-based exploration. This framework models the last layer parameters of the policy network as latent variables and uses a recursive inference step within the policy update to handle these latent variables in a scalable manner. 
We find that Deep Coherent Exploration improves the speed and stability of learning of A2C, PPO, and SAC on several continuous control tasks.",/pdf/3591e5e8c956d0f41ba0efbab42b96f6dfac050a.pdf,ICLR,2021, +S1EzRgb0W,HkQGAg-0W,1509130000000.0,1518730000000.0,635,Explaining the Mistakes of Neural Networks with Latent Sympathetic Examples,"[""riaan.zoetmulder@student.uva.nl"", ""egavves@uva.nl"", ""peter.ed.oconnor@gmail.com""]","[""Riaan Zoetmulder"", ""Efstratios Gavves"", ""Peter O'Connor""]","[""Deep learning"", ""Adversarial Examples"", ""Difference Target Propagation"", ""Generative Modelling"", ""Classifiers"", ""Explaining"", ""Sympathetic Examples""]","Neural networks make mistakes. The reason why a mistake is made often remains a mystery. As such neural networks often are considered a black box. It would be useful to have a method that can give an explanation that is intuitive to a user as to why an image is misclassified. In this paper we develop a method for explaining the mistakes of a classifier model by visually showing what must be added to an image such that it is correctly classified. Our work combines the fields of adversarial examples, generative modeling and a correction technique based on difference target propagation to create an technique that creates explanations of why an image is misclassified. In this paper we explain our method and demonstrate it on MNIST and CelebA. This approach could aid in demystifying neural networks for a user. +",/pdf/931e9440a4f12402d9ead65e0c72e51efefaae0e.pdf,ICLR,2018,New way of explaining why a neural network has misclassified an image +rkGZuJb0b,HJZWdJbA-,1509120000000.0,1518730000000.0,486,Compact Neural Networks based on the Multiscale Entanglement Renormalization Ansatz,"[""andrew.hallam.10@ucl.ac.uk"", ""edward.grant.16@ucl.ac.uk"", ""vstojevic@gtn.ai"", ""s.severini@ucl.ac.uk"", ""andrew.green@ucl.ac.uk""]","[""Andrew Hallam"", ""Edward Grant"", ""Vid Stojevic"", ""Simone Severini"", ""Andrew G. 
Green""]","[""Neural Networks"", ""Tensor Networks"", ""Tensor Trains""]","The goal of this paper is to demonstrate a method for tensorizing neural networks based upon an efficient way of approximating scale invariant quantum states, the Multi-scale Entanglement Renormalization Ansatz (MERA). We employ MERA as a replacement for linear layers in a neural network and test this implementation on the CIFAR-10 dataset. The proposed method outperforms factorization using tensor trains, providing greater compression for the same level of accuracy and greater accuracy for the same level of compression. We demonstrate MERA-layers with 3900 times fewer parameters and a reduction in accuracy of less than 1% compared to the equivalent fully connected layers. +",/pdf/c5ebb754365181006035c23258ecef24b24615e6.pdf,ICLR,2018,"We replace the fully connected layers of a neural network with the multi-scale entanglement renormalization ansatz, a type of quantum operation which describes long range correlations. " +HkCvZXbC-,H1jUW7ZRb,1509140000000.0,1518730000000.0,1129,3C-GAN: AN CONDITION-CONTEXT-COMPOSITE GENERATIVE ADVERSARIAL NETWORKS FOR GENERATING IMAGES SEPARATELY,"[""ycharn@cs.unc.edu"", ""vjojic@gmail.com""]","[""Yeu-Chern Harn"", ""Vladimir Jojic""]",[],"We present 3C-GAN: a novel multiple generators structures, that contains one conditional generator that generates a semantic part of an image conditional on its input label, and one context generator generates the rest of an image. Compared to original GAN model, this model has multiple generators and gives control over what its generators should generate. Unlike previous multi-generator models use a subsequent generation process, that one layer is generated given the previous layer, our model uses a process of generating different part of the images together. This way the model contains fewer parameters and the generation speed is faster. 
Specifically, the model leverages the label information to separate the object from the image correctly. Since the model conditional on the label information does not restrict to generate other parts of an image, we proposed a cost function that encourages the model to generate only the succinct part of an image in terms of label discrimination. We also found an exclusive prior on the mask of the model help separate the object. The experiments on MNIST, SVHN, and CelebA datasets show 3C-GAN can generate different objects with different generators simultaneously, according to the labels given to each generator.",/pdf/4e7adb7f2c608488a1c1cc4482fae685ee2da3e8.pdf,ICLR,2018, +aUX5Plaq7Oy,7TFh5gTCf6x,1601310000000.0,1611940000000.0,3532,Learning continuous-time PDEs from sparse data with graph neural networks,"[""~Valerii_Iakovlev1"", ""~Markus_Heinonen1"", ""~Harri_L\u00e4hdesm\u00e4ki1""]","[""Valerii Iakovlev"", ""Markus Heinonen"", ""Harri L\u00e4hdesm\u00e4ki""]","[""dynamical systems"", ""partial differential equations"", ""PDEs"", ""graph neural networks"", ""continuous time""]","The behavior of many dynamical systems follow complex, yet still unknown partial differential equations (PDEs). While several machine learning methods have been proposed to learn PDEs directly from data, previous methods are limited to discrete-time approximations or make the limiting assumption of the observations arriving at regular grids. We propose a general continuous-time differential model for dynamical systems whose governing equations are parameterized by message passing graph neural networks. The model admits arbitrary space and time discretizations, which removes constraints on the locations of observation points and time intervals between the observations. The model is trained with continuous-time adjoint method enabling efficient neural PDE inference. We demonstrate the model's ability to work with unstructured grids, arbitrary time steps, and noisy observations. 
We compare our method with existing approaches on several well-known physical systems that involve first and higher-order PDEs with state-of-the-art predictive performance.",/pdf/65985773a2d3e7c1e7d121a48729c30c9a007745.pdf,ICLR,2021,The paper introduces a method for learning partial differential equations on arbitrary spatial and temporal grids. +r1eQeCEYwB,BJxGeVQ_vS,1569440000000.0,1577170000000.0,922,GRAPH ANALYSIS AND GRAPH POOLING IN THE SPATIAL DOMAIN,"[""mostafarahmani@baidu.com"", ""liping11@baidu.com""]","[""Mostafa Rahmani"", ""Ping Li""]","[""Graph Neural Network"", ""Graph Classification"", ""Graph Pooling"", ""Graph Embedding""]","The spatial convolution layer which is widely used in the Graph Neural Networks (GNNs) aggregates the feature vector of each node with the feature vectors of its neighboring nodes. The GNN is not aware of the locations of the nodes in the global structure of the graph and when the local structures corresponding to different nodes are similar to each other, the convolution layer maps all those nodes to similar or same feature vectors in the continuous feature space. Therefore, the GNN cannot distinguish two graphs if their difference is not in their local structures. In addition, when the nodes are not labeled/attributed the convolution layers can fail to distinguish even different local structures. In this paper, we propose an effective solution to address this problem of the GNNs. The proposed approach leverages a spatial representation of the graph which makes the neural network aware of the differences between the nodes and also their locations in the graph. The spatial representation which is equivalent to a point-cloud representation of the graph is obtained by a graph embedding method. 
Using the proposed approach, the local feature extractor of the GNN distinguishes similar local structures in different locations of the graph and the GNN infers the topological structure of the graph from the spatial distribution of the locally extracted feature vectors. Moreover, the spatial representation is utilized to simplify the graph down-sampling problem. A new graph pooling method is proposed and it is shown that the proposed pooling method achieves competitive or better results in comparison with the state-of-the-art methods. +",/pdf/7ce731d0848d434b54a0d0ae89eb102fbb263703.pdf,ICLR,2020,Addressing a serious shortcoming of the GNNs by making them aware of the role of the nodes in the structure of the graph and proposing a novel graph pooling method. +F9sPTWSKznC,_1q6QNVPefm,1601310000000.0,1614990000000.0,2530,DiP Benchmark Tests: Evaluation Benchmarks for Discourse Phenomena in MT,"[""~Prathyusha_Jwalapuram1"", ""~Barbara_Rychalska1"", ""~Shafiq_Joty1"", ""~Dominika_Basaj1""]","[""Prathyusha Jwalapuram"", ""Barbara Rychalska"", ""Shafiq Joty"", ""Dominika Basaj""]","[""machine translation"", ""discourse"", ""evaluation"", ""benchmark"", ""testsets"", ""leaderboard""]","Despite increasing instances of machine translation (MT) systems including extrasentential context information, the evidence for translation quality improvement is sparse, especially for discourse phenomena. Popular metrics like BLEU are not expressive or sensitive enough to capture quality improvements or drops that are minor in size but significant in perception. We introduce the first of their kind MT benchmark testsets that aim to track and hail improvements across four main discourse phenomena: anaphora, lexical consistency, coherence and readability, and discourse connective translation. We also introduce evaluation methods for these tasks, and evaluate several competitive baseline MT systems on the curated datasets. 
Surprisingly, we find that the complex context-aware models that we test do not improve discourse-related translations consistently across languages and phenomena. Our evaluation benchmark is available as a leaderboard at . ",/pdf/fea04e966fb34fe8d27503da83377b31b3dcbde2.pdf,ICLR,2021,"We introduce first-of-their-kind discourse benchmark testsets and evaluation procedures aimed at tracking improvement in machine translation quality for phenomena like anaphora, coherence & readability, lexical consistency and discourse connectives." +rkEFLFqee,,1478300000000.0,1493410000000.0,497,Decomposing Motion and Content for Natural Video Sequence Prediction,"[""rubville@umich.edu"", ""jimyang@adobe.com"", ""maga33@postech.ac.kr"", ""timelin@buaa.edu.cn"", ""honglak@umich.edu""]","[""Ruben Villegas"", ""Jimei Yang"", ""Seunghoon Hong"", ""Xunyu Lin"", ""Honglak Lee""]","[""Computer vision"", ""Deep learning"", ""Unsupervised Learning""]","We propose a deep neural network for the prediction of future frames in natural video sequences. To effectively handle complex evolution of pixels in videos, we propose to decompose the motion and content, two key components generating dynamics in videos. Our model is built upon the Encoder-Decoder Convolutional Neural Network and Convolutional LSTM for pixel-level prediction, which independently capture the spatial layout of an image and the corresponding temporal dynamics. By independently modeling motion and content, predicting the next frame reduces to converting the extracted content features into the next frame content by the identified motion features, which simplifies the task of prediction. Our model is end-to-end trainable over multiple time steps, and naturally learns to decompose motion and content without separate training. We evaluate the proposed network architecture on human activity videos using KTH, Weizmann action, and UCF-101 datasets. We show state-of-the-art performance in comparison to recent approaches. 
To the best of our knowledge, this is the first end-to-end trainable network architecture with motion and content separation to model the spatio-temporal dynamics for pixel-level future prediction in natural videos.",/pdf/2ff0b9e83fee44be6779e3f5e3bfd6e35777e0bb.pdf,ICLR,2017, +xQnvyc6r3LL,8lODwpqJaijI,1601310000000.0,1614990000000.0,717,Finding Patient Zero: Learning Contagion Source with Graph Neural Networks,"[""~Chintan_Shah2"", ""~Nima_Dehmamy1"", ""nicolaperra@gmail.com"", ""~Matteo_Chinazzi1"", ""~Albert-Laszlo_Barabasi1"", ""~Alessandro_Vespignani1"", ""~Rose_Yu1""]","[""Chintan Shah"", ""Nima Dehmamy"", ""Nicola Perra"", ""Matteo Chinazzi"", ""Albert-Laszlo Barabasi"", ""Alessandro Vespignani"", ""Rose Yu""]","[""contagion dynamics"", ""theory of graph neural networks"", ""epidemic modeling""]","Locating the source of an epidemic, or patient zero (P0), can provide critical insights into the infection's transmission course and allow efficient resource allocation. +Existing methods use graph-theoretic centrality measures and expensive message-passing algorithms, requiring knowledge of the underlying dynamics and its parameters. +In this paper, we revisit this problem using graph neural networks (GNNs) to learn P0. +We observe that GNNs can identify P0 close to the theoretical bound on accuracy, without explicit input of dynamics or its parameters. +In addition, GNN is over 100 times faster than classic methods for inference on arbitrary graph topologies. +Our theoretical bound also shows that the epidemic is like a ticking clock, emphasizing the importance of early contact-tracing. +We find a maximum time after which accurate recovery of the source becomes impossible, regardless of the algorithm used.",/pdf/06c7989fa82ee317be6ad157fb26a389ebcee01e.pdf,ICLR,2021,Investigation and theoretical understanding for the behavior of GNN when using it to find patient zero in epidemic dynamics. 
+HkgqFiAcFm,HylKTELFYm,1538090000000.0,1550350000000.0,468,Marginal Policy Gradients: A Unified Family of Estimators for Bounded Action Spaces with Applications,"[""eisenach@princeton.edu"", ""h.yang@rochester.edu"", ""ji.liu.uwisc@gmail.com"", ""hanliu.cmu@gmail.com""]","[""Carson Eisenach"", ""Haichuan Yang"", ""Ji Liu"", ""Han Liu""]","[""reinforcement learning"", ""policy gradient"", ""MOBA games""]","Many complex domains, such as robotics control and real-time strategy (RTS) games, require an agent to learn a continuous control. In the former, an agent learns a policy over R^d and in the latter, over a discrete set of actions each of which is parametrized by a continuous parameter. Such problems are naturally solved using policy based reinforcement learning (RL) methods, but unfortunately these often suffer from high variance leading to instability and slow convergence. Unnecessary variance is introduced whenever policies over bounded action spaces are modeled using distributions with unbounded support by applying a transformation T to the sampled action before execution in the environment. Recently, the variance reduced clipped action policy gradient (CAPG) was introduced for actions in bounded intervals, but to date no variance reduced methods exist when the action is a direction, something often seen in RTS games. To this end we introduce the angular policy gradient (APG), a stochastic policy gradient method for directional control. With the marginal policy gradients family of estimators we present a unified analysis of the variance reduction properties of APG and CAPG; our results provide a stronger guarantee than existing analyses for CAPG. 
Experimental results on a popular RTS game and a navigation task show that the APG estimator offers a substantial improvement over the standard policy gradient.",/pdf/d5223f992f7043a906e04fdf9c5a9c2ff93dd07d.pdf,ICLR,2019, +SJgwNerKvB,r1e7AIeFvr,1569440000000.0,1583910000000.0,2252,Continual learning with hypernetworks,"[""voswaldj@ethz.ch"", ""henningc@ethz.ch"", ""sacramento@ini.ethz.ch"", ""bgrewe@ethz.ch""]","[""Johannes von Oswald"", ""Christian Henning"", ""Jo\u00e3o Sacramento"", ""Benjamin F. Grewe""]","[""Continual Learning"", ""Catastrophic Forgetting"", ""Meta Model"", ""Hypernetwork""]","Artificial neural networks suffer from catastrophic forgetting when they are sequentially trained on multiple tasks. To overcome this problem, we present a novel approach based on task-conditioned hypernetworks, i.e., networks that generate the weights of a target model based on task identity. Continual learning (CL) is less difficult for this class of models thanks to a simple key feature: instead of recalling the input-output relations of all previously seen data, task-conditioned hypernetworks only require rehearsing task-specific weight realizations, which can be maintained in memory using a simple regularizer. Besides achieving state-of-the-art performance on standard CL benchmarks, additional experiments on long task sequences reveal that task-conditioned hypernetworks display a very large capacity to retain previous memories. Notably, such long memory lifetimes are achieved in a compressive regime, when the number of trainable hypernetwork weights is comparable or smaller than target network size. We provide insight into the structure of low-dimensional task embedding spaces (the input space of the hypernetwork) and show that task-conditioned hypernetworks demonstrate transfer learning. 
Finally, forward information transfer is further supported by empirical results on a challenging CL benchmark based on the CIFAR-10/100 image datasets.",/pdf/5206218e137ab12a45ab2c7cdde9c53fb4c73b94.pdf,ICLR,2020, +SkxybANtDB,BJe0NA7OwH,1569440000000.0,1583910000000.0,950,Dynamic Time Lag Regression: Predicting What & When,"[""mandar.chandorkar@cwi.nl"", ""furtlehn@lri.fr"", ""bala.poduval@unh.edu"", ""e.camporeale@cwi.nl"", ""michele.sebag@lri.fr""]","[""Mandar Chandorkar"", ""Cyril Furtlehner"", ""Bala Poduval"", ""Enrico Camporeale"", ""Michele Sebag""]","[""Dynamic Time-Lag Regression"", ""Time Delay"", ""Regression"", ""Time Series""]","This paper tackles a new regression problem, called Dynamic Time-Lag Regression (DTLR), where a cause signal drives an effect signal with an unknown time delay. +The motivating application, pertaining to space weather modelling, aims to predict the near-Earth solar wind speed based on estimates of the Sun's coronal magnetic field. +DTLR differs from mainstream regression and from sequence-to-sequence learning in two respects: firstly, no ground truth (e.g., pairs of associated sub-sequences) is available; secondly, the cause signal contains much information irrelevant to the effect signal (the solar magnetic field governs the solar wind propagation in the heliosphere, of which the Earth's magnetosphere is but a minuscule region). + +A Bayesian approach is presented to tackle the specifics of the DTLR problem, with theoretical justifications based on linear stability analysis. A proof of concept on synthetic problems is presented. Finally, the empirical results on the solar wind modelling task improve on the state of the art in solar wind forecasting.",/pdf/a4f69bb7bd2fb1d3fb10b51090e1347bcb70bebe.pdf,ICLR,2020,We propose a new regression framework for temporal phenomena having non-stationary time-lag dependencies. 
+BJfRpoA9YX,BJgoJPF9YX,1538090000000.0,1545360000000.0,851,Adversarial Information Factorization,"[""ac2211@ic.ac.uk"", ""ym1008@ic.ac.uk"", ""biswasengupta@gmail.com"", ""aab01@ic.ac.uk""]","[""Antonia Creswell"", ""Yumnah Mohamied"", ""Biswa Sengupta"", ""Anil Bharath""]","[""disentangled representations"", ""factored representations"", ""generative adversarial networks"", ""variational auto encoders"", ""generative models""]","We propose a novel generative model architecture designed to learn representations for images that factor out a single attribute from the rest of the representation. A single object may have many attributes which when altered do not change the identity of the object itself. Consider the human face; the identity of a particular person is independent of whether or not they happen to be wearing glasses. The attribute of wearing glasses can be changed without changing the identity of the person. However, the ability to manipulate and alter image attributes without altering the object identity is not a trivial task. Here, we are interested in learning a representation of the image that separates the identity of an object (such as a human face) from an attribute (such as 'wearing glasses'). We demonstrate the success of our factorization approach by using the learned representation to synthesize the same face with and without a chosen attribute. We refer to this specific synthesis process as image attribute manipulation. We further demonstrate that our model achieves competitive scores, with state of the art, on a facial attribute classification task.",/pdf/12f26bff1aaf98460516d5296c845b09f3cd2ed7.pdf,ICLR,2019,Learn representations for images that factor out a single attribute. 
+BkGiPoC5FX,Bye9rTuVtX,1538090000000.0,1545360000000.0,297,Efficient Convolutional Neural Network Training with Direct Feedback Alignment,"[""hdh4797@kaist.ac.kr"", ""hjyoo@kaist.ac.kr""]","[""Donghyeon Han"", ""Hoi-jun Yoo""]","[""Direct Feedback Alignment"", ""Convolutional Neural Network"", ""DNN Training""]","There were many algorithms to substitute the back-propagation (BP) in the deep neural network (DNN) training. However, they could not become popular because their training accuracy and the computational efficiency were worse than BP. One of them was direct feedback alignment (DFA), but it showed low training performance especially for the convolutional neural network (CNN). In this paper, we overcome the limitation of the DFA algorithm by combining with the conventional BP during the CNN training. To improve the training stability, we also suggest the feedback weight initialization method by analyzing the patterns of the fixed random matrices in the DFA. Finally, we propose the new training algorithm, binary direct feedback alignment (BDFA) to minimize the computational cost while maintaining the training accuracy compared with the DFA. In our experiments, we use the CIFAR-10 and CIFAR-100 dataset to simulate the CNN learning from the scratch and apply the BDFA to the online learning based object tracking application to examine the training in the small dataset environment. Our proposed algorithms show better performance than conventional BP in both two different training tasks especially when the dataset is small.",/pdf/5dcc9a62d816f75f56b6dde2609bb197a2c5487b.pdf,ICLR,2019, +BJ4prNx0W,ryVarEgCZ,1509080000000.0,1518730000000.0,237,Learning what to learn in a neural program,"[""ricshin@berkeley.edu"", ""dawnsong.travel@gmail.com""]","[""Richard Shin"", ""Dawn Song""]",[],"Learning programs with neural networks is a challenging task, addressed by a long line of existing work. 
It is difficult to learn neural networks which will generalize to problem instances that are much larger than those used during training. Furthermore, even when the learned neural program empirically works on all test inputs, we cannot verify that it will work on every possible input. Recent work has shown that it is possible to address these issues by using recursion in the Neural Programmer-Interpreter, but this technique requires a verification set which is difficult to construct without knowledge of the internals of the oracle used to generate training data. In this work, we show how to automatically build such a verification set, which can also be directly used for training. By interactively querying an oracle, we can construct this set with minimal additional knowledge about the oracle. We empirically demonstrate that our method allows automated learning and verification of a recursive NPI program with provably perfect generalization. +",/pdf/ab677e08c59c0214387c85c569e11ea85d784997.pdf,ICLR,2018, +Sked_0EYwB,SJeEaaPdwr,1569440000000.0,1577170000000.0,1217,Objective Mismatch in Model-based Reinforcement Learning,"[""nol@berkeley.edu"", ""brandon.amos.cs@gmail.com"", ""omry@fb.com"", ""rcalandra@fb.com""]","[""Nathan Lambert"", ""Brandon Amos"", ""Omry Yadan"", ""Roberto Calandra""]","[""Model-based Reinforcement learning"", ""dynamics model"", ""reinforcement learning""]","Model-based reinforcement learning (MBRL) has been shown to be a powerful framework for data-efficiently learning control of continuous tasks. Recent work in MBRL has mostly focused on using more advanced function approximators and planning schemes, leaving the general framework virtually unchanged since its conception. In this paper, we identify a fundamental issue of the standard MBRL framework -- what we call the objective mismatch issue. Objective mismatch arises when one objective is optimized in the hope that a second, often uncorrelated, metric will also be optimized. 
In the context of MBRL, we characterize the objective mismatch between training the forward dynamics model w.r.t. the likelihood of the one-step ahead prediction, and the overall goal of improving performance on a downstream control task. For example, this issue can emerge with the realization that dynamics models effective for a specific task do not necessarily need to be globally accurate, and vice versa globally accurate models might not be sufficiently accurate locally to obtain good control performance on a specific task. In our experiments, we study this objective mismatch issue and demonstrate that the likelihood of the one-step ahead prediction is not always correlated with downstream control performance. This observation highlights a critical flaw in the current MBRL framework which will require further research to be fully understood and addressed. We propose an initial method to mitigate the mismatch issue by re-weighting dynamics model training. Building on it, we conclude with a discussion about other potential directions of future research for addressing this issue.",/pdf/93d75b65626dc045229193fd318ba031a1189e63.pdf,ICLR,2020,"We define, explore, and begin to address the objective mismatch issue in model-based reinforcement learning." +HyiRazbRb,S1U_6f-Rb,1509140000000.0,1518730000000.0,1027,Demystifying overcomplete nonlinear auto-encoders: fast SGD convergence towards sparse representation from random initialization,"[""tangch@gwu.edu"", ""cmontel@gwu.edu""]","[""Cheng Tang"", ""Claire Monteleoni""]","[""stochastic gradient descent"", ""autoencoders"", ""nonconvex optimization"", ""representation learning"", ""theory""]","Auto-encoders are commonly used for unsupervised representation learning and for pre-training deeper neural networks. 
+When its activation function is linear and the encoding dimension (width of hidden layer) is smaller than the input dimension, it is well known that auto-encoder is optimized to learn the principal components of the data distribution (Oja1982). +However, when the activation is nonlinear and when the width is larger than the input dimension (overcomplete), auto-encoder behaves differently from PCA, and in fact is known to perform well empirically for sparse coding problems. + +We provide a theoretical explanation for this empirically observed phenomenon, when rectified-linear unit (ReLu) is adopted as the activation function and the hidden-layer width is set to be large. +In this case, we show that, with significant probability, initializing the weight matrix of an auto-encoder by sampling from a spherical Gaussian distribution followed by stochastic gradient descent (SGD) training converges towards the ground-truth representation for a class of sparse dictionary learning models. +In addition, we can show that, conditioning on convergence, the expected convergence rate is O(1/t), where t is the number of updates. +Our analysis quantifies how increasing hidden layer width helps the training performance when random initialization is used, and how the norm of network weights influence the speed of SGD convergence. ",/pdf/bdb223b72b592d7166defadedd6717ddcfb6c855.pdf,ICLR,2018,theoretical analysis of nonlinear wide autoencoder +djwS0m4Ft_A,S-wTeuDKil,1601310000000.0,1611610000000.0,1422,Evaluating the Disentanglement of Deep Generative Models through Manifold Topology,"[""~Sharon_Zhou1"", ""~Eric_Zelikman1"", ""fredlu@stanford.edu"", ""~Andrew_Y._Ng1"", ""~Gunnar_E._Carlsson1"", ""~Stefano_Ermon1""]","[""Sharon Zhou"", ""Eric Zelikman"", ""Fred Lu"", ""Andrew Y. Ng"", ""Gunnar E. 
Carlsson"", ""Stefano Ermon""]","[""generative models"", ""evaluation"", ""disentanglement""]","Learning disentangled representations is regarded as a fundamental task for improving the generalization, robustness, and interpretability of generative models. However, measuring disentanglement has been challenging and inconsistent, often dependent on an ad-hoc external model or specific to a certain dataset. To address this, we present a method for quantifying disentanglement that only uses the generative model, by measuring the topological similarity of conditional submanifolds in the learned representation. This method showcases both unsupervised and supervised variants. To illustrate the effectiveness and applicability of our method, we empirically evaluate several state-of-the-art models across multiple datasets. We find that our method ranks models similarly to existing methods. We make our code publicly available at https://github.com/stanfordmlgroup/disentanglement.",/pdf/1a87763ff698ed701fa9648dad3eeb0fee6bb67b.pdf,ICLR,2021,Evaluate disentanglement of generative models by measuring manifold topology using persistent homology +BJlSHsAcK7,HJgQms2tdm,1538090000000.0,1545360000000.0,86,Overcoming catastrophic forgetting through weight consolidation and long-term memory,"[""shixianwen1993@gmail.com"", ""itti@usc.edu""]","[""Shixian Wen"", ""Laurent Itti""]","[""Catastrophic Forgetting"", ""Life-Long Learning"", ""adversarial examples""]","Sequential learning of multiple tasks in artificial neural networks using gradient descent leads to catastrophic forgetting, whereby previously learned knowledge is erased during learning of new, disjoint knowledge. Here, we propose a new approach to sequential learning which leverages the recent discovery of adversarial examples. We use adversarial subspaces from previous tasks to enable learning of new tasks with less interference. 
We apply our method to sequentially learning to classify digits 0, 1, 2 (task 1), 4, 5, 6, (task 2), and 7, 8, 9 (task 3) in MNIST (disjoint MNIST task). We compare and combine our Adversarial Direction (AD) method with the recently proposed Elastic Weight Consolidation (EWC) method for sequential learning. We train each task for 20 epochs, which yields good initial performance (99.24% correct task 1 performance). After training task 2, and then task 3, both plain gradient descent (PGD) and EWC largely forget task 1 (task 1 accuracy 32.95% for PGD and 41.02% for EWC), while our combined approach (AD+EWC) still achieves 94.53% correct on task 1. We obtain similar results with a much more difficult disjoint CIFAR10 task (70.10% initial task 1 performance, 67.73% after learning tasks 2 and 3 for AD+EWC, while PGD and EWC both fall to chance level). We confirm qualitatively similar results for EMNIST with 5 tasks and under 3 variants of our approach. Our results suggest that AD+EWC can provide better sequential learning performance than either PGD or EWC.",/pdf/d416b27ccc89fe33c2a5d954a1277f2bb407f0bb.pdf,ICLR,2019,We enable sequential learning of multiple tasks by adding task-dependent memory units to avoid interference between tasks +rkgCJ64tDB,rygDhM3rwH,1569440000000.0,1577170000000.0,321,Scale-Equivariant Neural Networks with Decomposed Convolutional Filters,"[""zhu@math.duke.edu"", ""qiang.qiu@duke.edu"", ""robert.calderbank@duke.edu"", ""guillermo.sapiro@duke.edu"", ""xiuyuan.cheng@duke.edu""]","[""Wei Zhu"", ""Qiang Qiu"", ""Robert Calderbank"", ""Guillermo Sapiro"", ""Xiuyuan Cheng""]","[""scale-equivariant"", ""convolutional neural network"", ""deformation robustness""]","Encoding the input scale information explicitly into the representation learned by a convolutional neural network (CNN) is beneficial for many vision tasks especially when dealing with multiscale input signals. 
We study, in this paper, a scale-equivariant CNN architecture with joint convolutions across the space and the scaling group, which is shown to be both sufficient and necessary to achieve scale-equivariant representations. To reduce the model complexity and computational burden, we decompose the convolutional filters under two pre-fixed separable bases and truncate the expansion to low-frequency components. A further benefit of the truncated filter expansion is the improved deformation robustness of the equivariant representation. Numerical experiments demonstrate that the proposed scale-equivariant neural network with decomposed convolutional filters (ScDCFNet) achieves significantly improved performance in multiscale image classification and better interpretability than regular CNNs at a reduced model size.",/pdf/283d6920332ea72c1c8b8ae89bdb9f0c8ec35789.pdf,ICLR,2020,We construct scale-equivariant convolutional neural networks in the most general form with both computational efficiency and proved deformation robustness. +Hye-LiR5Y7,S1enJOHjdQ,1538090000000.0,1545360000000.0,149,SOSELETO: A Unified Approach to Transfer Learning and Training with Noisy Labels,"[""orlitany@gmail.com"", ""danielfreedman@google.com""]","[""Or Litany"", ""Daniel Freedman""]","[""transfer learning""]","We present SOSELETO (SOurce SELEction for Target Optimization), a new method for exploiting a source dataset to solve a classification problem on a target dataset. SOSELETO is based on the following simple intuition: some source examples are more informative than others for the target problem. To capture this intuition, source samples are each given weights; these weights are solved for jointly with the source and target classification problems via a bilevel optimization scheme. The target therefore gets to choose the source samples which are most informative for its own classification task. 
Furthermore, the bilevel nature of the optimization acts as a kind of regularization on the target, mitigating overfitting. SOSELETO may be applied to both classic transfer learning, as well as the problem of training on datasets with noisy labels; we show state of the art results on both of these problems.",/pdf/0a15ff87f4565b8c8b32435937bc6a6acc6388b9.pdf,ICLR,2019,"Learning with limited training data by exploiting ""helpful"" instances from a rich data source. " +SyZI0GWCZ,B1fE0fZA-,1509140000000.0,1518790000000.0,1055,Decision-Based Adversarial Attacks: Reliable Attacks Against Black-Box Machine Learning Models,"[""wieland.brendel@bethgelab.org"", ""jonas.rauber@bethgelab.org"", ""matthias.bethge@bethgelab.org""]","[""Wieland Brendel *"", ""Jonas Rauber *"", ""Matthias Bethge""]","[""adversarial attacks"", ""adversarial examples"", ""adversarials"", ""robustness"", ""security""]","Many machine learning algorithms are vulnerable to almost imperceptible perturbations of their inputs. So far it was unclear how much risk adversarial perturbations carry for the safety of real-world machine learning applications because most methods used to generate such perturbations rely either on detailed model information (gradient-based attacks) or on confidence scores such as class probabilities (score-based attacks), neither of which are available in most real-world scenarios. In many such cases one currently needs to retreat to transfer-based attacks which rely on cumbersome substitute models, need access to the training data and can be defended against. Here we emphasise the importance of attacks which solely rely on the final model decision. Such decision-based attacks are (1) applicable to real-world black-box models such as autonomous cars, (2) need less knowledge and are easier to apply than transfer-based attacks and (3) are more robust to simple defences than gradient- or score-based attacks. 
Previous attacks in this category were limited to simple models or simple datasets. Here we introduce the Boundary Attack, a decision-based attack that starts from a large adversarial perturbation and then seeks to reduce the perturbation while staying adversarial. The attack is conceptually simple, requires close to no hyperparameter tuning, does not rely on substitute models and is competitive with the best gradient-based attacks in standard computer vision tasks like ImageNet. We apply the attack on two black-box algorithms from Clarifai.com. The Boundary Attack in particular and the class of decision-based attacks in general open new avenues to study the robustness of machine learning models and raise new questions regarding the safety of deployed machine learning systems. An implementation of the attack is available as part of Foolbox (https://github.com/bethgelab/foolbox).",/pdf/da19d08fdc4c341975c0042678b941d40ccbea3d.pdf,ICLR,2018,A novel adversarial attack that can directly attack real-world black-box machine learning models without transfer. +LIR3aVGIlln,92FTOZu3X8R,1601310000000.0,1614990000000.0,1174,Equivariant Normalizing Flows for Point Processes and Sets,"[""~Marin_Bilo\u01611"", ""~Stephan_G\u00fcnnemann1""]","[""Marin Bilo\u0161"", ""Stephan G\u00fcnnemann""]","[""point process"", ""set"", ""normalizing flow"", ""equivariance""]","A point process describes how random sets of exchangeable points are generated. The points usually influence the positions of each other via attractive and repulsive forces. To model this behavior, it is enough to transform the samples from the uniform process with a sufficiently complex equivariant function. However, learning the parameters of the resulting process is challenging since the likelihood is hard to estimate and often intractable. This leads us to our proposed model - CONFET. Based on continuous normalizing flows, it allows arbitrary interactions between points while having tractable likelihood. 
Experiments on various real and synthetic datasets show the improved performance of our new scalable approach.",/pdf/bb14f827cd90f091ad0fecca471f10287215c02d.pdf,ICLR,2021,Having permutation equivariant mapping in continuous normalizing flows allows modeling densities over sets. +HketHo0qFm,rJxzwqFFt7,1538090000000.0,1545360000000.0,104,Hybrid Policies Using Inverse Rewards for Reinforcement Learning,"[""yao.shi@huawei.com"", ""xiatian14@huawei.com"", ""zhaoguanjun1@huawei.com"", ""gaoxin17@huawei.com""]","[""Yao Shi"", ""Tian Xia"", ""Guanjun Zhao"", ""Xin Gao""]","[""Reinforcement Learning"", ""Rewards""]","This paper puts forward a broad-spectrum improvement for reinforcement learning algorithms, which combines the policies using original rewards and inverse (negative) rewards. The policies using inverse rewards are competitive with the original policies, and help the original policies correct their mis-actions. We have proved the convergence of the inverse policies. The experiments for some games in OpenAI Gym show that the hybrid policies based on deep Q-learning, double Q-learning, and on-policy actor-critic obtain rewards up to 63.8%, 97.8%, and 54.7% higher than the original algorithms. 
The improved policies are more stable than the original policies as well.",/pdf/2c627441f7a0650c9a826b8196186312d882b7ad.pdf,ICLR,2019,"A broad-spectrum improvement for reinforcement learning algorithms, which combines the policies using original rewards and inverse (negative) rewards" +f_GA2IU9-K-,_x7hZNxcifF,1601310000000.0,1614990000000.0,1504,Non-decreasing Quantile Function Network with Efficient Exploration for Distributional Reinforcement Learning,"[""~Fan_Zhou7"", ""~Zhoufan_Zhu1"", ""~Qi_Kuang1"", ""~Liwen_Zhang3""]","[""Fan Zhou"", ""Zhoufan Zhu"", ""Qi Kuang"", ""Liwen Zhang""]","[""Non-decreasing Quantile Function"", ""Distributional Reinforcement Learning"", ""Distributional Prediction Error"", ""Exploration""]","Although distributional reinforcement learning (DRL) has been widely examined in the past few years, there are two open questions people are still trying to address. One is how to ensure the validity of the learned quantile function, the other is how to efficiently utilize the distribution information. This paper attempts to provide some new perspectives to encourage the future in-depth studies in these two fields. We first propose a non-decreasing quantile function network (NDQFN) to guarantee the monotonicity of the obtained quantile estimates and then design a general exploration framework called distributional prediction error (DPE) for DRL which utilizes the entire distribution of the quantile function. 
In this paper, we not only discuss the theoretical necessity of our method but also show the performance gain it achieves in practice by comparing with some competitors on Atari 2600 Games especially in some hard-explored games.",/pdf/5dc36909b1e6b83f4581be371aa49b2253276154.pdf,ICLR,2021,This paper introduces a general framework to obtain non-decreasing quantile estimate for Distributional Reinforcement Learning and proposes an efficient exploration method for quantile value based DRL algorithms +QxQkG-gIKJM,G7Emp6Kvuv,1601310000000.0,1614990000000.0,1369,Optimistic Exploration with Backward Bootstrapped Bonus for Deep Reinforcement Learning,"[""~Chenjia_Bai2"", ""~Lingxiao_Wang6"", ""~Peng_Liu5"", ""~Zhaoran_Wang1"", ""~Jianye_HAO1"", ""~Yingnan_Zhao1""]","[""Chenjia Bai"", ""Lingxiao Wang"", ""Peng Liu"", ""Zhaoran Wang"", ""Jianye HAO"", ""Yingnan Zhao""]","[""optimistic exploration"", ""backward bootstrapped bonus"", ""posterior sampling"", ""reinforcement learning""]","Optimism in the face of uncertainty is a principled approach for provably efficient exploration for reinforcement learning in tabular and linear settings. However, such an approach is challenging in developing practical exploration algorithms for Deep Reinforcement Learning (DRL). To address this problem, we propose an Optimistic Exploration algorithm with Backward Bootstrapped Bonus (OEB3) for DRL by following these two principles. OEB3 is built on bootstrapped deep $Q$-learning, a non-parametric posterior sampling method for temporally-extended exploration. Based on such a temporally-extended exploration, we construct an UCB-bonus indicating the uncertainty of $Q$-functions. The UCB-bonus is further utilized to estimate an optimistic $Q$-value, which encourages the agent to explore the scarcely visited states and actions to reduce uncertainty. 
In the estimation of the $Q$-function, we adopt an episodic backward update strategy to propagate the future uncertainty to the estimated $Q$-function consistently. Extensive evaluations show that OEB3 outperforms several state-of-the-art exploration approaches in MNIST maze and 49 Atari games.",/pdf/8cd7fc7e31f8036a122a16a0c67d16375883801f.pdf,ICLR,2021, +ryxC-kBYDS,SyxsnAi_wS,1569440000000.0,1577170000000.0,1563,Gaussian Conditional Random Fields for Classification,"[""aapetrovic@mas.bg.ac.rs"", ""nikolic@matf.bg.ac.rs"", ""milos.jovanovic@fon.bg.ac.rs"", ""boris.delibasic@fon.bg.ac.rs""]","[""Andrija Petrovic"", ""Mladen Nikolic"", ""Milos Jovanovic"", ""Boris Delibasic""]","[""Structured classification"", ""Gaussian conditional random fields"", ""Empirical Bayes"", ""Local variational approximation"", ""discriminative graph-based model""]","In this paper, a Gaussian conditional random field model for structured binary classification (GCRFBC) is proposed. The model is applicable to classification problems with undirected graphs, intractable for standard classification CRFs. The model representation of GCRFBC is extended by latent variables which yield some appealing properties. Thanks to the GCRF latent structure, the model becomes tractable, efficient, and open to improvements previously applied to GCRF regression. Two different forms of the algorithm are presented: GCRFBCb (GCRFBC - Bayesian) and GCRFBCnb (GCRFBC - non-Bayesian). The extended method of local variational approximation of the sigmoid function is used for solving empirical Bayes in the GCRFBCb variant, whereas the MAP value of latent variables is the basis for learning and inference in the GCRFBCnb variant. The inference in GCRFBCb is solved by Newton-Cotes formulas for one-dimensional integration. Both models are evaluated on synthetic data and real-world data. It was shown that both models achieve better prediction performance than relevant baselines. 
Advantages and disadvantages of the proposed models are discussed.",/pdf/30b16c53343ef8803bcfe247aaa7ee631acb8d80.pdf,ICLR,2020, +H1M7soActX,rJgDwPu5YQ,1538090000000.0,1545360000000.0,608,The Anisotropic Noise in Stochastic Gradient Descent: Its Behavior of Escaping from Minima and Regularization Effects,"[""zhanxing.zhu@pku.edu.cn"", ""pkuwjf@pku.edu.cn"", ""byu@pku.edu.cn"", ""leiwu@pku.edu.cn"", ""jwma@math.pku.edu.cn""]","[""Zhanxing Zhu"", ""Jingfeng Wu"", ""Bing Yu"", ""Lei Wu"", ""Jinwen Ma""]","[""Stochastic gradient descent"", ""anisotropic noise"", ""regularization""]","Understanding the behavior of stochastic gradient descent (SGD) in the context of deep neural networks has raised lots of concerns recently. Along this line, we theoretically study a general form of gradient based optimization dynamics with unbiased noise, which unifies SGD and standard Langevin dynamics. Through investigating this general optimization dynamics, we analyze the behavior of SGD on escaping from minima and its regularization effects. A novel indicator is derived to characterize the efficiency of escaping from minima through measuring the alignment of noise covariance and the curvature of the loss function. Based on this indicator, two conditions are established to show which type of noise structure is superior to isotropic noise in terms of escaping efficiency. We further show that the anisotropic noise in SGD satisfies the two conditions, and thus helps to escape from sharp and poor minima effectively, towards more stable and flat minima that typically generalize well. We verify our understanding through comparing +this anisotropic diffusion with full gradient descent plus isotropic diffusion (i.e. Langevin dynamics) and other types of position-dependent noise.",/pdf/9bf0c38e3b6e894e55304589933458a2d48e7419.pdf,ICLR,2019,We provide theoretical and empirical analysis on the role of anisotropic noise introduced by stochastic gradient on escaping from minima. 
+BkMiWhR5K7,S1xPwZ2YYQ,1538090000000.0,1550710000000.0,1206,Prior Convictions: Black-box Adversarial Attacks with Bandits and Priors,"[""ailyas@mit.edu"", ""engstrom@mit.edu"", ""madry@mit.edu""]","[""Andrew Ilyas"", ""Logan Engstrom"", ""Aleksander Madry""]","[""adversarial examples"", ""gradient estimation"", ""black-box attacks"", ""model-based optimization"", ""bandit optimization""]","We study the problem of generating adversarial examples in a black-box setting in which only loss-oracle access to a model is available. We introduce a framework that conceptually unifies much of the existing work on black-box attacks, and demonstrate that the current state-of-the-art methods are optimal in a natural sense. Despite this optimality, we show how to improve black-box attacks by bringing a new element into the problem: gradient priors. We give a bandit optimization-based algorithm that allows us to seamlessly integrate any such priors, and we explicitly identify and incorporate two examples. The resulting methods use two to four times fewer queries and fail two to five times less than the current state-of-the-art. The code for reproducing our work is available at https://git.io/fAjOJ.",/pdf/dc922426472f0f1907b6247b8d425abdd084dc2b.pdf,ICLR,2019,"We present a unifying view on black-box adversarial attacks as a gradient estimation problem, and then present a framework (based on bandits optimization) to integrate priors into gradient estimation, leading to significantly increased performance." +H1gz_nNYDS,r1e6gF9gIH,1569440000000.0,1577170000000.0,34,AutoSlim: Towards One-Shot Architecture Search for Channel Numbers,"[""jyu79@illinois.edu"", ""t-huang1@illinois.edu""]","[""Jiahui Yu"", ""Thomas Huang""]","[""AutoSlim"", ""Neural Architecture Search"", ""Efficient Networks"", ""Network Pruning""]"," +We study how to set the number of channels in a neural network to achieve better accuracy under constrained resources (e.g., FLOPs, latency, memory footprint or model size). 
A simple and one-shot approach, named AutoSlim, is presented. Instead of training many network samples and searching with reinforcement learning, we train a single slimmable network to approximate the network accuracy of different channel configurations. We then iteratively evaluate the trained slimmable model and greedily slim the layer with minimal accuracy drop. By this single pass, we can obtain the optimized channel configurations under different resource constraints. We present experiments with MobileNet v1, MobileNet v2, ResNet-50 and RL-searched MNasNet on ImageNet classification. We show significant improvements over their default channel configurations. We also achieve better accuracy than recent channel pruning methods and neural architecture search methods with 100X lower search cost. + +Notably, by setting optimized channel numbers, our AutoSlim-MobileNet-v2 at 305M FLOPs achieves 74.2% top-1 accuracy, 2.4% better than default MobileNet-v2 (301M FLOPs), and even 0.2% better than RL-searched MNasNet (317M FLOPs). Our AutoSlim-ResNet-50 at 570M FLOPs, without depthwise convolutions, achieves 1.3% better accuracy than MobileNet-v1 (569M FLOPs). +",/pdf/efbd615da73f83d215fe415d8b172f87d0b69d10.pdf,ICLR,2020,"We present an automated approach to search the number of channels in a neural network to achieve better accuracy under constrained resources (e.g., FLOPs, latency, memory footprint or model size)." +SJxfxnA9K7,BklmigC9tm,1538090000000.0,1545360000000.0,1057,Structured Prediction using cGANs with Fusion Discriminator,"[""faisalm@jhu.edu"", ""wxu47@jhu.edu"", ""ndurr@jhu.edu"", ""jeremiah.johnson@unh.edu"", ""alan.l.yuille@gmail.com""]","[""Faisal Mahmood"", ""Wenhao Xu"", ""Nicholas J. Durr"", ""Jeremiah W. 
Johnson"", ""Alan Yuille""]","[""Generative Adversarial Networks"", ""GANs"", ""conditional GANs"", ""Discriminator"", ""Fusion""]","We propose a novel method for incorporating conditional information into a generative adversarial network (GAN) for structured prediction tasks. This method is based on fusing features from the generated and conditional information in feature space and allows the discriminator to better capture higher-order statistics from the data. This method also increases the strength of the signals passed through the network where the real or generated data and the conditional data agree. The proposed method is conceptually simpler than the joint convolutional neural network - conditional Markov random field (CNN-CRF) models and enforces higher-order consistency without being limited to a very specific class of high-order potentials. Experimental results demonstrate that this method leads to improvement on a variety of different structured prediction tasks including image synthesis, semantic segmentation, and depth estimation.",/pdf/acb4d17dbd0806deea009c6ff570731e88f83dab.pdf,ICLR,2019,We propose a novel way to incorporate conditional image information into the discriminator of GANs using feature fusion that can be used for structured prediction tasks. +2m0g1wEafh,0VoCnjhaEO3,1601310000000.0,1616060000000.0,3614,Benefit of deep learning with non-convex noisy gradient descent: Provable excess risk bound and superiority to kernel methods,"[""~Taiji_Suzuki1"", ""shunta_akiyama@mist.i.u-tokyo.ac.jp""]","[""Taiji Suzuki"", ""Shunta Akiyama""]","[""Excess risk"", ""minimax optimal rate"", ""local Rademacher complexity"", ""fast learning rate"", ""kernel method"", ""linear estimator""]","Establishing a theoretical analysis that explains why deep learning can outperform shallow learning such as kernel methods is one of the biggest issues in the deep learning literature. 
Towards answering this question, we evaluate excess risk of a deep learning estimator trained by a noisy gradient descent with ridge regularization on a mildly overparameterized neural network, +and discuss its superiority to a class of linear estimators that includes neural tangent kernel approach, random feature model, other kernel methods, $k$-NN estimator and so on. We consider a teacher-student regression model, and eventually show that {\it any} linear estimator can be outperformed by deep learning in a sense of the minimax optimal rate especially for a high dimension setting. The obtained excess bounds are so-called fast learning rate which is faster than $O(1/\sqrt{n})$ that is obtained by usual Rademacher complexity analysis. This discrepancy is induced by the non-convex geometry of the model and the noisy gradient descent used for neural network training provably reaches a near global optimal solution even though the loss landscape is highly non-convex. Although the noisy gradient descent does not employ any explicit or implicit sparsity inducing regularization, it shows a preferable generalization performance that dominates linear estimators.",/pdf/9698297c5705cc24f44b95ca9d67c8366730f110.pdf,ICLR,2021, +T1EMbxGNEJC,k8UjvJ2gN9w,1601310000000.0,1614990000000.0,715,RankingMatch: Delving into Semi-Supervised Learning with Consistency Regularization and Ranking Loss,"[""~Trung_Quang_Tran1"", ""~Mingu_Kang1"", ""~Daeyoung_Kim3""]","[""Trung Quang Tran"", ""Mingu Kang"", ""Daeyoung Kim""]","[""BatchMean Triplet Loss"", ""Semi-Supervised Learning"", ""Consistency Regularization"", ""Metric Learning""]","Semi-supervised learning (SSL) has played an important role in leveraging unlabeled data when labeled data is limited. One of the most successful SSL approaches is based on consistency regularization, which encourages the model to produce unchanged with perturbed input. However, there has been less attention spent on inputs that have the same label. 
Motivated by the observation that the inputs having the same label should have the similar model outputs, we propose a novel method, RankingMatch, that considers not only the perturbed inputs but also the similarity among the inputs having the same label. We especially introduce a new objective function, dubbed BatchMean Triplet loss, which has the advantage of computational efficiency while taking into account all input samples. Our RankingMatch achieves state-of-the-art performance across many standard SSL benchmarks with a variety of labeled data amounts, including 95.13% accuracy on CIFAR-10 with 250 labels, 77.65% accuracy on CIFAR-100 with 10000 labels, 97.76% accuracy on SVHN with 250 labels, and 97.77% accuracy on SVHN with 1000 labels. We also perform an ablation study to prove the efficacy of the proposed BatchMean Triplet loss against existing versions of Triplet loss.",/pdf/847fd268c9e92405e035bc63d9185678bc19cb3b.pdf,ICLR,2021,"We propose RankingMatch, a novel semi-supervised learning method that encourages the model to produce the same prediction for not only the different augmented versions of the same input but also the samples from the same class." +TtYSU29zgR,8yn7RRUzKjA,1601310000000.0,1615830000000.0,1564,Primal Wasserstein Imitation Learning,"[""~Robert_Dadashi2"", ""~Leonard_Hussenot1"", ""~Matthieu_Geist1"", ""~Olivier_Pietquin1""]","[""Robert Dadashi"", ""Leonard Hussenot"", ""Matthieu Geist"", ""Olivier Pietquin""]","[""Reinforcement Learning"", ""Inverse Reinforcement Learning"", ""Imitation Learning"", ""Optimal Transport"", ""Wasserstein distance""]","Imitation Learning (IL) methods seek to match the behavior of an agent with that of an expert. In the present work, we propose a new IL method based on a conceptually simple algorithm: Primal Wasserstein Imitation Learning (PWIL), which ties to the primal form of the Wasserstein distance between the expert and the agent state-action distributions. 
We present a reward function which is derived offline, as opposed to recent adversarial IL algorithms that learn a reward function through interactions with the environment, and which requires little fine-tuning. We show that we can recover expert behavior on a variety of continuous control tasks of the MuJoCo domain in a sample efficient manner in terms of agent interactions and of expert interactions with the environment. Finally, we show that the behavior of the agent we train matches the behavior of the expert with the Wasserstein distance, rather than the commonly used proxy of performance.",/pdf/a0cf4ec3c4a75e202ba613caaea4f173d8e53101.pdf,ICLR,2021,A new Imitation Learning method based on optimal transport. +SJxzFySKwH,ryl8bEC_PH,1569440000000.0,1600740000000.0,1832,On the Equivalence between Positional Node Embeddings and Structural Graph Representations,"[""bsriniv@purdue.edu"", ""ribeiro@cs.purdue.edu""]","[""Balasubramaniam Srinivasan"", ""Bruno Ribeiro""]","[""Graph Neural Networks"", ""Structural Graph Representations"", ""Node Embeddings"", ""Relational Learning"", ""Invariant Theory"", ""Theory"", ""Deep Learning"", ""Representational Power"", ""Graph Isomorphism""]","This work provides the first unifying theoretical framework for node (positional) embeddings and structural graph representations, bridging methods like matrix factorization and graph neural networks. Using invariant theory, we show that relationship between structural representations and node embeddings is analogous to that of a distribution and its samples. We prove that all tasks that can be performed by node embeddings can also be performed by structural representations and vice-versa. We also show that the concept of transductive and inductive learning is unrelated to node embeddings and graph representations, clearing another source of confusion in the literature. 
Finally, we introduce new practical guidelines to generating and using node embeddings, which further augments standard operating procedures used today.",/pdf/f09dce2089f54246bef56fb0cbd14717aa609772.pdf,ICLR,2020,We develop the foundations of a unifying theoretical framework connecting node embeddings and structural graph representations through invariant theory +wXgk_iCiYGo,LGFvZopahtr,1601310000000.0,1618920000000.0,3008,A Diffusion Theory For Deep Learning Dynamics: Stochastic Gradient Descent Exponentially Favors Flat Minima,"[""~Zeke_Xie1"", ""~Issei_Sato1"", ""~Masashi_Sugiyama1""]","[""Zeke Xie"", ""Issei Sato"", ""Masashi Sugiyama""]","[""deep learning dynamics"", ""SGD"", ""diffusion"", ""flat minima"", ""stochastic optimization""]","Stochastic Gradient Descent (SGD) and its variants are mainstream methods for training deep networks in practice. SGD is known to find a flat minimum that often generalizes well. However, it is mathematically unclear how deep learning can select a flat minimum among so many minima. To answer the question quantitatively, we develop a density diffusion theory to reveal how minima selection quantitatively depends on the minima sharpness and the hyperparameters. To the best of our knowledge, we are the first to theoretically and empirically prove that, benefited from the Hessian-dependent covariance of stochastic gradient noise, SGD favors flat minima exponentially more than sharp minima, while Gradient Descent (GD) with injected white noise favors flat minima only polynomially more than sharp minima. We also reveal that either a small learning rate or large-batch training requires exponentially many iterations to escape from minima in terms of the ratio of the batch size and learning rate. 
Thus, large-batch training cannot search flat minima efficiently in a realistic computational time.",/pdf/8d09cb383c404f3ef7a8782e7e20297845235b60.pdf,ICLR,2021,"We prove that, benefited from the Hessian-dependent covariance of stochastic gradient noise, SGD favors flat minima exponentially more than sharp minima." +UEtNMTl6yN,19rSixb40BQ,1601310000000.0,1614990000000.0,1388,Neural Pooling for Graph Neural Networks,"[""~Sai_Sree_Harsha1"", ""deepak.mishra@iist.ac.in""]","[""Sai Sree Harsha"", ""Deepak Mishra""]","[""graph neural networks"", ""graph pooling"", ""representation learning""]","Tasks such as graph classification require graph pooling to learn graph-level representations from constituent node representations. In this work, we propose two novel methods using fully connected neural network layers for graph pooling, namely Neural Pooling Method 1 and 2. Our proposed methods have the ability to handle a variable number of nodes in different graphs, and are also invariant to the isomorphic structures of graphs. In addition, compared to existing graph pooling methods, our proposed methods are able to capture information from all nodes, collect second-order statistics, and leverage the ability of neural networks to learn relationships among node representations, making them more powerful. We perform experiments on graph classification tasks in the bio-informatics and social network domains to determine the effectiveness of our proposed methods. Experimental results show that our methods lead to an absolute increase of up to 1.2% in classification accuracy over previous works and a general decrease in standard deviation across multiple runs, indicating greater reliability. Experimental results also indicate that this improvement in performance is consistent across several datasets. +",/pdf/46a99ec81e50a5fc602bab0100abf770bd520d48.pdf,ICLR,2021,"A novel graph pooling method for graph neural networks, which can generate high quality graph representation." 
+rkl42iA5t7,H1g_KrhcKm,1538090000000.0,1545360000000.0,700,NETWORK COMPRESSION USING CORRELATION ANALYSIS OF LAYER RESPONSES,"[""xsuaucuadros@apple.com"", ""lzappella@apple.com"", ""napostoloff@apple.com""]","[""Xavier Suau"", ""Luca Zappella"", ""Nicholas Apostoloff""]","[""Artificial Intelligence"", ""Deep learning"", ""Machine learning"", ""Compression""]","Principal Filter Analysis (PFA) is an easy to implement, yet effective method for neural network compression. PFA exploits the intrinsic correlation between filter responses within network layers to recommend a smaller network footprint. We propose two compression algorithms: the first allows a user to specify the proportion of the original spectral energy that should be preserved in each layer after compression, while the second is a heuristic that leads to a parameter-free approach that automatically selects the compression used at each layer. Both algorithms are evaluated against several architectures and datasets, and we show considerable compression rates without compromising accuracy, e.g., for VGG-16 on CIFAR-10, CIFAR-100 and ImageNet, PFA achieves a compression rate of 8x, 3x, and 1.4x with an accuracy gain of 0.4%, 1.4% points, and 2.4% respectively. In our tests we also demonstrate that networks compressed with PFA achieve an accuracy that is very close to the empirical upper bound for a given compression ratio. Finally, we show how PFA is an effective tool for simultaneous compression and domain adaptation.",/pdf/86af237b32e7e38780eda929b8127ae2954eb0f0.pdf,ICLR,2019,"We propose an easy to implement, yet effective method for neural network compression. PFA exploits the intrinsic correlation between filter responses within network layers to recommend a smaller network footprint." 
+SJezGp4YPr,BJewe9KIvS,1569440000000.0,1583910000000.0,403,Geometric Insights into the Convergence of Nonlinear TD Learning,"[""david.brandfonbrener@nyu.edu"", ""bruna@cims.nyu.edu""]","[""David Brandfonbrener"", ""Joan Bruna""]","[""TD"", ""nonlinear"", ""convergence"", ""value estimation"", ""reinforcement learning""]","While there are convergence guarantees for temporal difference (TD) learning when using linear function approximators, the situation for nonlinear models is far less understood, and divergent examples are known. Here we take a first step towards extending theoretical convergence guarantees to TD learning with nonlinear function approximation. More precisely, we consider the expected learning dynamics of the TD(0) algorithm for value estimation. As the step-size converges to zero, these dynamics are defined by a nonlinear ODE which depends on the geometry of the space of function approximators, the structure of the underlying Markov chain, and their interaction. We find a set of function approximators that includes ReLU networks and has geometry amenable to TD learning regardless of environment, so that the solution performs about as well as linear TD in the worst case. Then, we show how environments that are more reversible induce dynamics that are better for TD learning and prove global convergence to the true value function for well-conditioned function approximators. Finally, we generalize a divergent counterexample to a family of divergent problems to demonstrate how the interaction between approximator and environment can go wrong and to motivate the assumptions needed to prove convergence. 
",/pdf/c07a4589f711c340b9d0263d05b00df197d9816c.pdf,ICLR,2020, +rJJRDvcex,,1478290000000.0,1484790000000.0,401,Layer Recurrent Neural Networks,"[""weidi.xie@eng.ox.ac.uk"", ""alison.noble@eng.ox.ac.uk"", ""az@robots.ox.ac.uk""]","[""Weidi Xie"", ""Alison Noble"", ""Andrew Zisserman""]","[""Deep learning"", ""Computer vision""]","In this paper, we propose a Layer-RNN (L-RNN) module that is able to learn contextual information adaptively using within-layer recurrence. Our contributions are three-fold: +(i) we propose a hybrid neural network architecture that interleaves traditional convolutional layers with L-RNN module for learning long- range dependencies at multiple levels; +(ii) we show that a L-RNN module can be seamlessly inserted into any convolutional layer of a pre-trained CNN, and the entire network then fine-tuned, leading to a boost in performance; +(iii) we report experiments on the CIFAR-10 classification task, showing that a network with interleaved convolutional layers and L-RNN modules, achieves comparable results (5.39% top1 error) using only 15 layers and fewer parameters to ResNet-164 (5.46%); and on the PASCAL VOC2012 semantic segmentation task, we show that the performance of a pre-trained FCN network can be boosted by 5% (mean IOU) by simply inserting Layer-RNNs.",/pdf/ee86e273bd4e6d2047a43e5556a9052cab91564f.pdf,ICLR,2017,We propose a Layer-RNN (L-RNN) network that is able to learn contextual information adaptively using within-layer recurrence. We further propose to insert L-RNN to pre-trained CNNs seamlessly. 
+Sc8cY4Jpi3s,pEroVwNbrqD,1601310000000.0,1614990000000.0,1116,Towards Practical Second Order Optimization for Deep Learning,"[""~Rohan_Anil1"", ""~Vineet_Gupta1"", ""~Tomer_Koren1"", ""kevinregan@google.com"", ""~Yoram_Singer3""]","[""Rohan Anil"", ""Vineet Gupta"", ""Tomer Koren"", ""Kevin Regan"", ""Yoram Singer""]","[""large scale distributed deep learning"", ""second order optimization"", ""bert"", ""resnet"", ""criteo"", ""transformer"", ""machine translation""]","Optimization in machine learning, both theoretical and applied, is presently dominated by first-order gradient methods such as stochastic gradient descent. Second-order optimization methods, that involve second derivatives and/or second order statistics of the data, are far less prevalent despite strong theoretical properties, due to their prohibitive computation, memory and communication costs. In an attempt to bridge this gap between theoretical and practical optimization, we present a scalable implementation of a second-order preconditioned method (concretely, a variant of full-matrix Adagrad), that along with several critical algorithmic and numerical improvements, provides significant convergence and wall-clock time improvements compared to conventional first-order methods on state-of-the-art deep models. Our novel design effectively utilizes the prevalent heterogeneous hardware architecture for training deep models, consisting of a multicore CPU coupled with multiple accelerator units. We demonstrate superior performance compared to state-of-the-art on very large learning tasks such as machine translation with Transformers, language modeling with BERT, click-through rate prediction on Criteo, and image classification on ImageNet with ResNet-50.",/pdf/ba4eced55322a403544f413af257d9329c648218.pdf,ICLR,2021,We outperform state-of-the-art first-order optimizers on a variety of tasks using a distributed second-order method. 
+BJl7mxBYvB,r1egbVgYvS,1569440000000.0,1577170000000.0,2205,Robust Reinforcement Learning via Adversarial Training with Langevin Dynamics,"[""yu.huang@epfl.ch"", ""kamalaruban.parameswaran@epfl.ch"", ""paul.rolland@epfl.ch"", ""ya-ping.hsieh@epfl.ch"", ""volkan.cevher@epfl.ch""]","[""Huang Yu-Ting"", ""Parameswaran Kamalaruban"", ""Paul Rolland"", ""Ya-Ping Hsieh"", ""Volkan Cevher""]","[""deep reinforcement learning"", ""robust reinforcement learning"", ""min-max problem""]","We re-think the Two-Player Reinforcement Learning (RL) as an instance of a distribution sampling problem in infinite dimensions. Using the powerful Stochastic Gradient Langevin Dynamics, we propose a new two-player RL algorithm, which is a sampling variant of the two-player policy gradient method. Our new algorithm consistently outperforms existing baselines, in terms of generalization across differing training and testing conditions, on several MuJoCo environments.",/pdf/f8808c89d6bb8a95d65fc0d1ae9286cb040922d6.pdf,ICLR,2020, +H1bhRHeA-,rylh0reAW,1509080000000.0,1518730000000.0,266,Unbiased scalable softmax optimization,"[""ff2316@columbia.edu"", ""garud@ieor.columbia.edu""]","[""Francois Fagan"", ""Garud Iyengar""]","[""softmax"", ""optimization"", ""implicit sgd""]","Recent neural network and language models have begun to rely on softmax distributions with an extremely large number of categories. In this context calculating the softmax normalizing constant is prohibitively expensive. This has spurred a growing literature of efficiently computable but biased estimates of the softmax. In this paper we present the first two unbiased algorithms for maximizing the softmax likelihood whose work per iteration is independent of the number of classes and datapoints (and does not require extra work at the end of each epoch). 
We compare our unbiased methods' empirical performance to the state-of-the-art on seven real world datasets, where they comprehensively outperform all competitors.",/pdf/3bdd09f84aa2cc174a240bd994591ac27eb8b4ec.pdf,ICLR,2018,Propose first methods for exactly optimizing the softmax distribution using stochastic gradient with runtime independent on the number of classes or datapoints. +SkFEGHx0Z,SyONfHeR-,1509080000000.0,1518730000000.0,248,Nearest Neighbour Radial Basis Function Solvers for Deep Neural Networks,"[""benjamin.meyer@monash.edu"", ""ben.harwood@monash.edu"", ""tom.drummond@monash.edu""]","[""Benjamin J. Meyer"", ""Ben Harwood"", ""Tom Drummond""]",[],"We present a radial basis function solver for convolutional neural networks that can be directly applied to both distance metric learning and classification problems. Our method treats all training features from a deep neural network as radial basis function centres and computes loss by summing the influence of a feature's nearby centres in the embedding space. Having a radial basis function centred on each training feature is made scalable by treating it as an approximate nearest neighbour search problem. End-to-end learning of the network and solver is carried out, mapping high dimensional features into clusters of the same class. This results in a well formed embedding space, where semantically related instances are likely to be located near one another, regardless of whether or not the network was trained on those classes. The same loss function is used for both the metric learning and classification problems. We show that our radial basis function solver outperforms state-of-the-art embedding approaches on the Stanford Cars196 and CUB-200-2011 datasets. 
Additionally, we show that when used as a classifier, our method outperforms a conventional softmax classifier on the CUB-200-2011, Stanford Cars196, Oxford 102 Flowers and Leafsnap fine-grained classification datasets.",/pdf/3e3becd68f1c4326b53ce2d8f8c6847e51f086bf.pdf,ICLR,2018, +Hy8X3aKee,,1478250000000.0,1478640000000.0,150,Deep Symbolic Representation Learning for Heterogeneous Time-series Classification,"[""zhangshengdongofgz@gmail.com"", ""Soheil.Bahrampour@us.bosch.com"", ""Naveen.Ramakrishnan@us.bosch.com"", ""mohak@mohakshah.com""]","[""Shengdong Zhang"", ""Soheil Bahrampour"", ""Naveen Ramakrishnan"", ""Mohak Shah""]",[],"In this paper, we consider the problem of event classification with multi-variate time series data consisting of heterogeneous (continuous and categorical) variables. The complex temporal dependencies between the variables combined with sparsity of the data makes the event classification problem particularly challenging. Most state-of-art approaches address this either by designing hand-engineered features or breaking up the problem over homogeneous variates. In this work, we propose and compare three representation learning algorithms over symbolized sequences which enables classification of heterogeneous time-series data using a deep architecture. The proposed representations are trained jointly along with the rest of the network architecture in an end-to-end fashion that makes the learned features discriminative for the given task. 
Experiments on three real-world datasets demonstrate the effectiveness of the proposed approaches.",/pdf/9c219f891cd8702b713d29638f7b6e0696347536.pdf,ICLR,2017, +p84tly8c4zf,9BYtS4wWkHy,1601310000000.0,1614990000000.0,1942,WeMix: How to Better Utilize Data Augmentation,"[""~Yi_Xu8"", ""asaf.noy@alibaba-inc.com"", ""~Ming_Lin4"", ""~Qi_Qian1"", ""~Li_Hao1"", ""~Rong_Jin1""]","[""Yi Xu"", ""Asaf Noy"", ""Ming Lin"", ""Qi Qian"", ""Li Hao"", ""Rong Jin""]","[""Data Augmentation"", ""Data Bias"", ""Non-convex Optimization"", ""Deep Learning Theory""]","Data augmentation is a widely used training trick in deep learning to improve the network generalization ability. Despite many encouraging results, several recent studies did point out limitations of the conventional data augmentation scheme in certain scenarios, calling for a better theoretical understanding of data augmentation. In this work, we develop a comprehensive analysis that reveals pros and cons of data augmentation. The main limitation of data augmentation arises from the data bias, i.e. the augmented data distribution can be quite different from the original one. This data bias leads to a suboptimal performance of existing data augmentation methods. To this end, we develop two novel algorithms, termed ""AugDrop"" and ""MixLoss"", to correct the data bias in the data augmentation. Our theoretical analysis shows that both algorithms are guaranteed to improve the effect of data augmentation through the bias correction, which is further validated by our empirical studies. Finally, we propose a generic algorithm ""WeMix"" by combining AugDrop and MixLoss, whose effectiveness is observed from extensive empirical evaluations.",/pdf/8de217fd0a05d9a273022ddaaafeda3d22a50ba2.pdf,ICLR,2021,This paper theoretically and empirically studies how to better utilize data augmentation by correcting data bias. 
+W0MKrbVOxtd,xrE_wyqoLWZ,1601310000000.0,1614990000000.0,2509,One Vertex Attack on Graph Neural Networks-based Spatiotemporal Forecasting,"[""~Fuqiang_Liu2"", ""luis.miranda-moreno@mcgill.ca"", ""lijun.sun@mcgill.ca""]","[""Fuqiang Liu"", ""Luis Miranda Moreno"", ""Lijun Sun""]","[""adversarial attack"", ""graph neural networks"", ""spatiotemporal forecasting""]","Spatiotemporal forecasting plays an essential role in intelligent transportation systems (ITS) and numerous applications, such as route planning, navigation, and automatic driving. Deep Spatiotemporal Graph Neural Networks, which capture both spatial and temporal patterns, have achieved great success in traffic forecasting applications. Though Deep Neural Networks (DNNs) have been proven to be vulnerable to carefully designed perturbations in multiple domains like objection classification and graph classification, these adversarial works cannot be directly applied to spatiotemporal GNNs because of their causality and spatiotemporal mechanism. There is still a lack of studies on the vulnerability and robustness of spatiotemporal GNNs. Particularly, if spatiotemporal GNNs are vulnerable in real-world traffic applications, a hacker can easily cause serious traffic congestion and even a city-scale breakdown. To fill this gap, we design One Vertex Attack to break deep spatiotemporal GNNs by attacking a single one vertex. To achieve this, we apply the genetic algorithm with a universal attack method as the evaluation function to locate the weakest vertex; then perturbations are generated by solving an optimization problem with the inverse estimation. Empirical studies prove that perturbations in one vertex can be diffused into most of the graph when spatiotemporal GNNs are under One Vertex Attack.",/pdf/58f1ae537784b0dc5fc6ec9f561a549d91772fff.pdf,ICLR,2021,This paper proposes an adversarial attack method that is able to break GNN-based spatiotemporal forecasting models by poisoning only one vertex. 
+H6ZWlQrPGS2,mHSLeOoxvM,1601310000000.0,1614990000000.0,518,Fast Binarized Neural Network Training with Partial Pre-training,"[""~Alex_Renda2"", ""~Joshua_Wolff_Fromm1""]","[""Alex Renda"", ""Joshua Wolff Fromm""]","[""binarized neural network"", ""binary"", ""quantized"", ""1-bit"", ""low precision""]","Binarized neural networks, networks with weights and activations constrained to lie in a 2-element set, allow for more time- and resource-efficient inference than standard floating-point networks. However, binarized neural networks typically take more training to plateau in accuracy than their floating-point counterparts, in terms of both iteration count and wall clock time. We demonstrate a technique, partial pre-training, that allows for faster from-scratch training of binarized neural networks by first training the network as a standard floating-point network for a short amount of time, then converting the network to a binarized neural network and continuing to train from there. Without tuning any hyperparameters across four networks on three different datasets, partial pre-training is able to train binarized neural networks between $1.26\times$ and $1.61\times$ faster than when training a binarized network from scratch using standard low-precision training. +",/pdf/2fd73f4b3bcc5d360f0967c0647400b9ef196138.pdf,ICLR,2021,"We demonstrate a technique, partial pre-training, that allows for faster from-scratch training of binarized neural networks." +j0yLJ-MsgJ,9Q-ud9bydh-,1601310000000.0,1614990000000.0,2067,Class Imbalance in Few-Shot Learning,"[""~Mateusz_Ochal1"", ""~Massimiliano_Patacchiola1"", ""jose.vazquez@seebyte.com"", ""~Amos_Storkey1"", ""~Sen_Wang7""]","[""Mateusz Ochal"", ""Massimiliano Patacchiola"", ""Jose Vazquez"", ""Amos Storkey"", ""Sen Wang""]","[""few-shot learning"", ""class imbalance""]","Few-shot learning aims to train models on a limited number of labeled samples from a support set in order to generalize to unseen samples from a query set. 
In the standard setup, the support set contains an equal amount of data points for each class. This assumption overlooks many practical considerations arising from the dynamic nature of the real world, such as class imbalance. In this paper, we present a detailed study of few-shot class imbalance along three axes: dataset vs. support set imbalance, effect of different imbalance distributions (linear, step, random), and effect of rebalancing techniques. We extensively compare over 10 state-of-the-art few-shot learning methods using backbones of different depths on multiple datasets. Our analysis reveals that 1) compared to the balanced task, the performances of their class-imbalance counterparts always drop, by up to $18.0\%$ for optimization-based methods, although feature-transfer and metric-based methods generally suffer less, 2) strategies used to mitigate imbalance in supervised learning can be adapted to the few-shot case resulting in better performances, 3) the effects of imbalance at the dataset level are less significant than the effects at the support set level. The code to reproduce the experiments is released under an open-source license.",/pdf/a32aec740948dabc3d41a77a828dc2c1b58d5cd6.pdf,ICLR,2021,We extensively compare over 10 few-shot learning methods in the class imbalance problem. +YCXrx6rRCXO,ro9P1Tm39xw,1601310000000.0,1615340000000.0,2225,Faster Binary Embeddings for Preserving Euclidean Distances,"[""~Jinjie_Zhang1"", ""~Rayan_Saab1""]","[""Jinjie Zhang"", ""Rayan Saab""]","[""Binary Embeddings"", ""Johnson-Lindenstrauss Transforms"", ""Sigma Delta Quantization""]","We propose a fast, distance-preserving, binary embedding algorithm to transform a high-dimensional dataset $\mathcal{T}\subseteq\mathbb{R}^n$ into binary sequences in the cube $\{\pm 1\}^m$. 
When $\mathcal{T}$ consists of well-spread (i.e., non-sparse) vectors, our embedding method applies a stable noise-shaping quantization scheme to $A x$ where $A\in\mathbb{R}^{m\times n}$ is a sparse Gaussian random matrix. This contrasts with most binary embedding methods, which usually use $x\mapsto \mathrm{sign}(Ax)$ for the embedding. Moreover, we show that Euclidean distances among the elements of $\mathcal{T}$ are approximated by the $\ell_1$ norm on the images of $\{\pm 1\}^m$ under a fast linear transformation. This again contrasts with standard methods, where the Hamming distance is used instead. Our method is both fast and memory efficient, with time complexity $O(m)$ and space complexity $O(m)$ on well-spread data. When the data is not well-spread, we show that the approach still works provided that data is transformed via a Walsh-Hadamard matrix, but now the cost is $O(n\log n)$ per data point. Further, we prove that the method is accurate and its associated error is comparable to that of a continuous valued Johnson-Lindenstrauss embedding plus a quantization error that admits a polynomial decay as the embedding dimension $m$ increases. + Thus the length of the binary codes required to achieve a desired accuracy is quite small, and we show it can even be compressed further without compromising the accuracy. To illustrate our results, we test the proposed method on natural images and show that it achieves strong performance.",/pdf/1eba3bf99a991505d994341a4156be4959947011.pdf,ICLR,2021,We propose a fast binary embedding algorithm to preserve Euclidean distances among well-spread vectors and it achieves optimal bit complexity. +S1xzyhR9Y7,BklnKGFtKX,1538090000000.0,1545360000000.0,967,Improving Sentence Representations with Multi-view Frameworks,"[""shuaitang93@ucsd.edu"", ""desa@ucsd.edu""]","[""Shuai Tang"", ""Virginia R. 
de Sa""]","[""multi-view"", ""learning"", ""sentence"", ""representation""]","Multi-view learning can provide self-supervision when different views are available of the same data. Distributional hypothesis provides another form of useful self-supervision from adjacent sentences which are plentiful in large unlabelled corpora. Motivated by the asymmetry in the two hemispheres of the human brain as well as the observation that different learning architectures tend to emphasise different aspects of sentence meaning, we present two multi-view frameworks for learning sentence representations in an unsupervised fashion. One framework uses a generative objective and the other a discriminative one. In both frameworks, the final representation is an ensemble of two views, in which, one view encodes the input sentence with a Recurrent Neural Network (RNN), and the other view encodes it with a simple linear model. We show that, after learning, the vectors produced by our multi-view frameworks provide improved representations over their single-view learnt counterparts, and the combination of different views gives representational improvement over each view and demonstrates solid transferability on standard downstream tasks.",/pdf/690632205a87a8ad49ef5774f5bb7ebb0d79db9a.pdf,ICLR,2019,Multi-view learning improves unsupervised sentence representation learning +rkgdYhVtvH,H1efoJPhIH,1569440000000.0,1577170000000.0,83,Unifying Graph Convolutional Neural Networks and Label Propagation,"[""wanghongwei55@gmail.com"", ""jure@cs.stanford.edu""]","[""Hongwei Wang"", ""Jure Leskovec""]","[""graph convolutional neural networks"", ""label propagation"", ""node classification""]","Label Propagation (LPA) and Graph Convolutional Neural Networks (GCN) are both message passing algorithms on graphs. Both solve the task of node classification but LPA propagates node label information across the edges of the graph, while GCN propagates and transforms node feature information. 
However, while conceptually similar, theoretical relation between LPA and GCN has not yet been investigated. Here we study the relationship between LPA and GCN in terms of two aspects: (1) feature/label smoothing where we analyze how the feature/label of one node are spread over its neighbors; And, (2) feature/label influence of how much the initial feature/label of one node influences the final feature/label of another node. Based on our theoretical analysis, we propose an end-to-end model that unifies GCN and LPA for node classification. In our unified model, edge weights are learnable, and the LPA serves as regularization to assist the GCN in learning proper edge weights that lead to improved classification performance. Our model can also be seen as learning attention weights based on node labels, which is more task-oriented than existing feature-based attention models. In a number of experiments on real-world graphs, our model shows superiority over state-of-the-art GCN-based methods in terms of node classification accuracy. +",/pdf/0094b00cbf899a537bf66ed4fb1b2c8dcf16890c.pdf,ICLR,2020,"This paper studies theoretical relationships between Graph Convolutional Neural Networks (GCN) and Label Propagation Algorithm (LPA), then proposes an end-to-end model that unifies GCN and LPA for node classification." +BklUAoAcY7,r1gsNT6ct7,1538090000000.0,1545360000000.0,897,Unsupervised Learning of Sentence Representations Using Sequence Consistency,"[""sidbrahma@gmail.com""]","[""Siddhartha Brahma""]","[""sentence representation"", ""unsupervised learning"", ""LSTM""]","Computing universal distributed representations of sentences is a fundamental task in natural language processing. We propose ConsSent, a simple yet surprisingly powerful unsupervised method to learn such representations by enforcing consistency constraints on sequences of tokens. 
We consider two classes of such constraints – sequences that form a sentence and between two sequences that form a sentence when merged. We learn sentence encoders by training them to distinguish between consistent and inconsistent examples, the latter being generated by randomly perturbing consistent examples in six different ways. Extensive evaluation on several transfer learning and linguistic probing tasks shows improved performance over strong unsupervised and supervised baselines, substantially surpassing them in several cases. Our best results are achieved by training sentence encoders in a multitask setting and by an ensemble of encoders trained on the individual tasks.",/pdf/8aeefa8b06ed3af4abf2497b8e4e965d50a8e285.pdf,ICLR,2019,Good sentence encoders can be learned by training them to distinguish between consistent and inconsistent (pairs of) sequences that are generated in an unsupervised manner. +SklE_CNFPr,BkxIOhD_Dr,1569440000000.0,1577170000000.0,1207,Zeroth Order Optimization by a Mixture of Evolution Strategies,"[""jimwang@gatech.edu"", ""xl374@scarletmail.rutgers.edu"", ""liping11@baidu.com""]","[""Jun-Kun Wang"", ""Xiaoyun Li"", ""Ping Li""]",[],"Evolution strategies or zeroth-order optimization algorithms have become popular in some areas of optimization and machine learning where only the oracle of function value evaluations is available. The central idea in the design of the algorithms is by querying function values of some perturbed points in the neighborhood of the current update and constructing a pseudo-gradient using the function values. In recent years, there is a growing interest in developing new ways of perturbation. Though the new perturbation methods are well motivating, most of them are criticized for lack of convergence guarantees even when the underlying function is convex. 
Perhaps the only methods that enjoy convergence guarantees are the ones that sample the perturbed points uniformly from a unit sphere or from a multivariate Gaussian distribution with an isotropic covariance. In this work, we tackle the non-convergence issue and propose sampling perturbed points from a mixture of distributions. Experiments show that our proposed method can identify the best perturbation scheme for the convergence and might also help to leverage the complementariness of different perturbation schemes. +",/pdf/f333e090161c8ee9f3feb4124758a76dc30ef134.pdf,ICLR,2020, +udaowxM8rz,#NAME?,1601310000000.0,1614990000000.0,1101,Increasing the Coverage and Balance of Robustness Benchmarks by Using Non-Overlapping Corruptions,"[""~Alfred_LAUGROS1"", ""~Alice_Caplier2"", ""~Matthieu_Ospici1""]","[""Alfred LAUGROS"", ""Alice Caplier"", ""Matthieu Ospici""]","[""Computer Vision"", ""Robustness"", ""Common Corruptions"", ""Benchmark""]","Neural Networks are sensitive to various corruptions that usually occur in real-world applications such as low-lighting conditions, blurs, noises, etc. To estimate the robustness of neural networks to these common corruptions, we generally use a group of modeled corruptions gathered into a benchmark. We argue that corruption benchmarks often have a poor coverage: being robust to them only implies being robust to a narrow range of corruptions. They are also often unbalanced: they give too much importance to some corruptions compared to others. In this paper, we propose to build corruption benchmarks with only non-overlapping corruptions, to improve their coverage and their balance. Two corruptions overlap when the robustnesses of neural networks to these corruptions are correlated. We propose the first metric to measure the overlapping between two corruptions. We provide an algorithm that uses this metric to build benchmarks of Non-Overlapping Corruptions. 
Using this algorithm, we build from ImageNet a new corruption benchmark called ImageNet-NOC. We show that ImageNet-NOC is balanced and covers several kinds of corruptions that are not covered by ImageNet-C.",/pdf/6f6ca8bb6ca3227650ee2da8a47a2ecf777a234a.pdf,ICLR,2021, +HJldzhA5tQ,HkxxGZ39Km,1538090000000.0,1545360000000.0,1278,Learning powerful policies and better dynamics models by encouraging consistency,"[""sshagunsodhani@gmail.com"", ""anirudhgoyal9119@gmail.com"", ""tristan.deleu@gmail.com"", ""yoshua.bengio@mila.quebec"", ""tangjianpku@gmail.com""]","[""Shagun Sodhani"", ""Anirudh Goyal"", ""Tristan Deleu"", ""Yoshua Bengio"", ""Jian Tang""]","[""model-based reinforcement learning"", ""deep learning"", ""generative agents"", ""policy gradient"", ""imitation learning""]","Model-based reinforcement learning approaches have the promise of being sample efficient. Much of the progress in learning dynamics models in RL has been made by learning models via supervised learning. There is enough evidence that humans build a model of the environment, not only by observing the environment but also by interacting with the environment. Interaction with the environment allows humans to carry out ""experiments"": taking actions that help uncover true causal relationships which can be used for building better dynamics models. Analogously, we would expect such interaction to be helpful for a learning agent while learning to model the environment dynamics. In this paper, we build upon this intuition, by using an auxiliary cost function to ensure consistency between what the agent observes (by acting in the real world) and what it imagines (by acting in the ``learned'' world). 
Our empirical analysis shows that the proposed approach helps to train powerful policies as well as better dynamics models.",/pdf/0e02b02c2af7f6c2c18e7e0af134730ef8f1d108.pdf,ICLR,2019,"In this paper, we formulate a way to ensure consistency between the predictions of dynamics model and the real observations from the environment. Thus allowing the agent to learn powerful policies, as well as better dynamics models." +HkgTTh4FDH,rkednnbEPr,1569440000000.0,1587800000000.0,244,Implicit Bias of Gradient Descent based Adversarial Training on Separable Data,"[""yli939@gatech.edu"", ""xxf13@psu.edu"", ""huan.xu@isye.gatech.edu"", ""tourzhao@gatech.edu""]","[""Yan Li"", ""Ethan X.Fang"", ""Huan Xu"", ""Tuo Zhao""]","[""implicit bias"", ""adversarial training"", ""robustness"", ""gradient descent""]","Adversarial training is a principled approach for training robust neural networks. Despite tremendous successes in practice, its theoretical properties still remain largely unexplored. In this paper, we provide new theoretical insights of gradient descent based adversarial training by studying its computational properties, specifically on its implicit bias. We take the binary classification task on linearly separable data as an illustrative example, where the loss asymptotically attains its infimum as the parameter diverges to infinity along certain directions. Specifically, we show that for any fixed iteration $T$, when the adversarial perturbation during training has proper bounded L2 norm, the classifier learned by gradient descent based adversarial training converges in direction to the maximum L2 norm margin classifier at the rate of $O(1/\sqrt{T})$, significantly faster than the rate $O(1/\log T)$ of training with clean data. 
In addition, when the adversarial perturbation during training has bounded Lq norm, the resulting classifier converges in direction to a maximum mixed-norm margin classifier, which has a natural interpretation of robustness, as being the maximum L2 norm margin classifier under worst-case bounded Lq norm perturbation to the data. Our findings provide theoretical backups for adversarial training that it indeed promotes robustness against adversarial perturbation.",/pdf/4c97f1e2f10751f45110d91b3db5805e5766cae0.pdf,ICLR,2020,"The solution of gradient descent based adversarial training converges in direction to a robust max margin solution that is adapted to adversary geometry, using L2 perturbation also shows significant speed-up in convergence compared to clean training." +BJx4rerFwB,B1ehHdgtPH,1569440000000.0,1577170000000.0,2282,wMAN: WEAKLY-SUPERVISED MOMENT ALIGNMENT NETWORK FOR TEXT-BASED VIDEO SEGMENT RETRIEVAL,"[""rxtan@bu.edu"", ""huijuan@berkeley.edu"", ""saenko@bu.edu"", ""bplumme2@illinois.edu""]","[""Reuben Tan"", ""Huijuan Xu"", ""Kate Saenko"", ""Bryan A. Plummer""]","[""vision"", ""language"", ""video moment retrieval""]","Given a video and a sentence, the goal of weakly-supervised video moment retrieval is to locate the video segment which is described by the sentence without having access to temporal annotations during training. Instead, a model must learn how to identify the correct segment (i.e. moment) when only being provided with video-sentence pairs. Thus, an inherent challenge is automatically inferring the latent correspondence between visual and language representations. To facilitate this alignment, we propose our Weakly-supervised Moment Alignment Network (wMAN) which exploits a multi-level co-attention mechanism to learn richer multimodal representations. The aforementioned mechanism is comprised of a Frame-By-Word interaction module as well as a novel Word-Conditioned Visual Graph (WCVG). 
Our approach also incorporates a novel application of positional encodings, commonly used in Transformers, to learn visual-semantic representations that contain contextual information of their relative positions in the temporal sequence through iterative message-passing. Comprehensive experiments on the DiDeMo and Charades-STA datasets demonstrate the effectiveness of our learned representations: our combined wMAN model not only outperforms the state-of-the-art weakly-supervised method by a significant margin but also does better than strongly-supervised state-of-the-art methods on some metrics.",/pdf/20f0ed815a46db98827ac164ea423e25b0825e56.pdf,ICLR,2020,Weakly-Supervised Text-Based Video Moment Retrieval +vkxGQB9f2Vg,HeTHXb3wrU,1601310000000.0,1614990000000.0,413,Fourier Stochastic Backpropagation,"[""~Amine_Echraibi1"", ""joachim.floconcholet@orange.com"", ""stephane.gosselin@orange.com"", ""sandrine.vaton@imt-atlantique.fr""]","[""Amine Echraibi"", ""Joachim Flocon Cholet"", ""St\u00e9phane Gosselin"", ""Sandrine Vaton""]","[""Stochastic Backpropagation"", ""Variational Inference"", ""Probabilistic Graphical Models"", ""Deep Learning""]","Backpropagating gradients through random variables is at the heart of numerous machine learning applications. In this paper, we present a general framework for deriving stochastic backpropagation rules for any distribution, discrete or continuous. Our approach exploits the link between the characteristic function and the Fourier transform, to transport the derivatives from the parameters of the distribution to the random variable. Our method generalizes previously known estimators, and results in new estimators for the gamma, beta, Dirichlet and Laplace distributions. 
Furthermore, we show that the classical deterministic backpropagation rule and the discrete random variable case, can also be interpreted through stochastic backpropagation.",/pdf/d28d2e78d0f0eb13f2df02db0ba06d43b0995666.pdf,ICLR,2021,Communicating new theoretical results concerning stochastic backpropagation. +B1xwcyHFDr,SJxRPcCOvr,1569440000000.0,1583910000000.0,1880,Learning Robust Representations via Multi-View Information Bottleneck,"[""m.federici@uva.nl"", ""duttanjan@gmail.com"", ""patrickforre@gmail.com"", ""nate@kushman.org"", ""zeynepakata@gmail.com""]","[""Marco Federici"", ""Anjan Dutta"", ""Patrick Forr\u00e9"", ""Nate Kushman"", ""Zeynep Akata""]","[""Information Bottleneck"", ""Multi-View Learning"", ""Representation Learning"", ""Information Theory""]","The information bottleneck principle provides an information-theoretic method for representation learning, by training an encoder to retain all information which is relevant for predicting the label while minimizing the amount of other, excess information in the representation. The original formulation, however, requires labeled data to identify the superfluous information. In this work, we extend this ability to the multi-view unsupervised setting, where two views of the same underlying entity are provided but the label is unknown. This enables us to identify superfluous information as that not shared by both views. A theoretical analysis leads to the definition of a new multi-view model that produces state-of-the-art results on the Sketchy dataset and label-limited versions of the MIR-Flickr dataset. 
We also extend our theory to the single-view setting by taking advantage of standard data augmentation techniques, empirically showing better generalization capabilities when compared to common unsupervised approaches for representation learning.",/pdf/bd46688c90403447019b2861c8a8356430a7726d.pdf,ICLR,2020,We extend the information bottleneck method to the unsupervised multiview setting and show state of the art results on standard datasets +rJzIBfZAb,S1zLrM-C-,1509140000000.0,1519410000000.0,835,Towards Deep Learning Models Resistant to Adversarial Attacks,"[""madry@mit.edu"", ""amakelov@mit.edu"", ""ludwigs@mit.edu"", ""tsipras@mit.edu"", ""avladu@mit.edu""]","[""Aleksander Madry"", ""Aleksandar Makelov"", ""Ludwig Schmidt"", ""Dimitris Tsipras"", ""Adrian Vladu""]","[""adversarial examples"", ""robust optimization"", ""ML security""]","Recent work has demonstrated that neural networks are vulnerable to adversarial examples, i.e., inputs that are almost indistinguishable from natural data and yet classified incorrectly by the network. To address this problem, we study the adversarial robustness of neural networks through the lens of robust optimization. This approach provides us with a broad and unifying view on much prior work on this topic. Its principled nature also enables us to identify methods for both training and attacking neural networks that are reliable and, in a certain sense, universal. In particular, they specify a concrete security guarantee that would protect against a well-defined class of adversaries. These methods let us train networks with significantly improved resistance to a wide range of adversarial attacks. They also suggest robustness against a first-order adversary as a natural security guarantee. 
We believe that robustness against such well-defined classes of adversaries is an important stepping stone towards fully resistant deep learning models.",/pdf/82e3f58baa3f400762f0f44e5649ce6dfe28c0b9.pdf,ICLR,2018,"We provide a principled, optimization-based re-look at the notion of adversarial examples, and develop methods that produce models that are adversarially robust against a wide range of adversaries." +r1zOg309tX,B1gDbVM5KX,1538090000000.0,1545360000000.0,1091,Understanding the Effectiveness of Lipschitz-Continuity in Generative Adversarial Nets,"[""heyohai@apex.sjtu.edu.cn"", ""songyuxuan@apex.sjtu.edu.cn"", ""yulantao@apex.sjtu.edu.cn"", ""wanghongwei55@gmail.com"", ""wnzhang@sjtu.edu.cn"", ""zhzhang@math.pku.edu.cn"", ""yyu@apex.sjtu.edu.cn""]","[""Zhiming Zhou"", ""Yuxuan Song"", ""Lantao Yu"", ""Hongwei Wang"", ""Weinan Zhang"", ""Zhihua Zhang"", ""Yong Yu""]","[""GANs"", ""Lipschitz-continuity"", ""convergence""]","In this paper, we investigate the underlying factor that leads to the failure and success in training of GANs. Specifically, we study the property of the optimal discriminative function $f^*(x)$ and show that $f^*(x)$ in most GANs can only reflect the local densities at $x$, which means the value of $f^*(x)$ for points in the fake distribution ($P_g$) does not contain any information useful about the location of other points in the real distribution ($P_r$). Given that the supports of the real and fake distributions are usually disjoint, we argue that such a $f^*(x)$ and its gradient tell nothing about ""how to pull $P_g$ to $P_r$"", which turns out to be the fundamental cause of failure in training of GANs. We further demonstrate that a well-defined distance metric (including the dual form of Wasserstein distance with a compacted constraint) does not necessarily ensure the convergence of GANs. 
Finally, we propose Lipschitz-continuity condition as a general solution and show that in a large family of GAN objectives, Lipschitz condition is capable of connecting $P_g$ and $P_r$ through $f^*(x)$ such that the gradient $\nabla_{\!x}f^*(x)$ at each sample $x \sim P_g$ points towards some real sample $y \sim P_r$.",/pdf/d4e4c5ceda932c48e5bca0d996e5ca598b339031.pdf,ICLR,2019,"We disclose the fundamental cause of failure in training of GANs, and demonstrate that Lipschitz-continuity is a general solution to this issue." +SJlJSaEFwS,H1x9tcEDvS,1569440000000.0,1577170000000.0,507,Robust Cross-lingual Embeddings from Parallel Sentences ,"[""asabet@uwaterloo.ca"", ""prakhar.gupta@epfl.ch"", ""jean-baptiste.cordonnier@epfl.ch"", ""robert.west@epfl.ch"", ""martin.jaggi@epfl.ch""]","[""Ali Sabet"", ""Prakhar Gupta"", ""Jean-Baptiste Cordonnier"", ""Robert West"", ""Martin Jaggi""]","[""Cross-lingual embeddings"", ""sent2vec"", ""word2vec"", ""bilingual"", ""word translation"", ""sentence retrieval"", ""text"", ""NLP"", ""word vectors"", ""sentence vectors""]","Recent advances in cross-lingual word embeddings have primarily relied on mapping-based methods, which project pretrained word embeddings from different languages into a shared space through a linear transformation. However, these approaches assume word embedding spaces are isomorphic between different languages, which has been shown not to hold in practice (Søgaard et al., 2018), and fundamentally limits their performance. This motivates investigating joint learning methods which can overcome this impediment, by simultaneously learning embeddings across languages via a cross-lingual term in the training objective. Given the abundance of parallel data available (Tiedemann, 2012), we propose a bilingual extension of the CBOW method which leverages sentence-aligned corpora to obtain robust cross-lingual word and sentence representations. 
Our approach significantly improves cross-lingual sentence retrieval performance over all other approaches, as well as convincingly outscores mapping methods while maintaining parity with jointly trained methods on word-translation. It also achieves parity with a deep RNN method on a zero-shot cross-lingual document classification task, requiring far fewer computational resources for training and inference. As an additional advantage, our bilingual method also improves the quality of monolingual word vectors despite training on much smaller datasets. We make our code and models publicly available. +",/pdf/993ee5a9d143de4da2bb1b92c02bf8601fde72f4.pdf,ICLR,2020,Joint method for learning cross-lingual embeddings with state-of-art performance for cross-lingual tasks and mono-lingual quality +HJgkj0NFwr,S1gWhkFOPB,1569440000000.0,1577170000000.0,1304,Differentiable Architecture Compression,"[""sss1@andrew.cmu.edu"", ""khetan2@illinois.edu"", ""zkarnin@gmail.com""]","[""Shashank Singh"", ""Ashish Khetan"", ""Zohar Karnin""]",[],"In many learning situations, resources at inference time are significantly more constrained than resources at training time. This paper studies a general paradigm, called Differentiable ARchitecture Compression (DARC), that combines model compression and architecture search to learn models that are resource-efficient at inference time. Given a resource-intensive base architecture, DARC utilizes the training data to learn which sub-components can be replaced by cheaper alternatives. The high-level technique can be applied to any neural architecture, and we report experiments on state-of-the-art convolutional neural networks for image classification. For a WideResNet with 97.2% accuracy on CIFAR-10, we improve single-sample inference speed by 2.28X and memory footprint by 5.64X, with no accuracy loss. For a ResNet with 79.15% Top-1 accuracy on ImageNet, we improve batch inference speed by 1.29X and memory footprint by 3.57X with 1% accuracy loss. 
We also give theoretical Rademacher complexity bounds in simplified cases, showing how DARC avoids over-fitting despite over-parameterization.",/pdf/e977776887e6d696fb022a76dc104cac2610c018.pdf,ICLR,2020, +HyeuP2EtDB,rJghCqTWBS,1569440000000.0,1577170000000.0,10,Scoring-Aggregating-Planning: Learning task-agnostic priors from interactions and sparse rewards for zero-shot generalization,"[""huazhe_xu@eecs.berkeley.edu"", ""boyuanchen@berkeley.edu"", ""yg@eecs.berkeley.edu"", ""trevordarrell@eecs.berkeley.edu""]","[""Huazhe Xu"", ""Boyuan Chen"", ""Yang Gao"", ""Trevor Darrell""]","[""learning priors from exploration data"", ""policy zero-shot generalization"", ""reward shaping"", ""model-based""]","Humans can learn task-agnostic priors from interactive experience and utilize the priors for novel tasks without any finetuning. In this paper, we propose Scoring-Aggregating-Planning (SAP), a framework that can learn task-agnostic semantics and dynamics priors from arbitrary quality interactions as well as the corresponding sparse rewards and then plan on unseen tasks in zero-shot condition. The framework finds a neural score function for local regional state and action pairs that can be aggregated to approximate the quality of a full trajectory; moreover, a dynamics model that is learned with self-supervision can be incorporated for planning. Many of previous works that leverage interactive data for policy learning either need massive on-policy environmental interactions or assume access to expert data while we can achieve a similar goal with pure off-policy imperfect data. Instantiating our framework results in a generalizable policy to unseen tasks. 
Experiments demonstrate that the proposed method can outperform baseline methods on a wide range of applications including gridworld, robotics tasks and video games.",/pdf/a86483704a793bacecf55a3183a94d366ff62e2f.pdf,ICLR,2020,We learn dense scores and dynamics model as priors from exploration data and use them to induce a good policy in new tasks in zero-shot condition. +ryl1r1BYDS,H1giik6_wB,1569440000000.0,1577170000000.0,1675,Multiagent Reinforcement Learning in Games with an Iterated Dominance Solution,"[""yorambac@gmail.com"", ""lattimore@google.com"", ""garnelo@google.com"", ""perolat@google.com"", ""dbalduzzi@google.com"", ""twa@google.com"", ""baveja@google.com"", ""thore@google.com""]","[""Yoram Bachrach"", ""Tor Lattimore"", ""Marta Garnelo"", ""Julien Perolat"", ""David Balduzzi"", ""Thomas Anthony"", ""Satinder Singh"", ""Thore Graepel""]","[""multiagent"", ""reinforcement learning"", ""iterated dominance"", ""mechanism design"", ""Nash equilibrium""]","Multiagent reinforcement learning (MARL) attempts to optimize policies of intelligent agents interacting in the same environment. However, it may fail to converge to a Nash equilibrium in some games. We study independent MARL under the more demanding solution concept of iterated elimination of strictly dominated strategies. In dominance solvable games, if players iteratively eliminate strictly dominated strategies until no further strategies can be eliminated, we obtain a single strategy profile. We show that convergence to the iterated dominance solution is guaranteed for several reinforcement learning algorithms (for multiple independent learners). We illustrate an application of our results by studying mechanism design for principal-agent problems, where a principal wishes to incentivize agents to exert costly effort in a joint project when it can only observe whether the project succeeded, but not whether agents actually exerted effort. 
We show that MARL converges to the desired outcome if the rewards are designed so that exerting effort is the iterated dominance solution, but fails if it is merely a Nash equilibrium.",/pdf/6c04be67f4f39fb0b8872f9930b70e07c7ab83e7.pdf,ICLR,2020,"For games that are solvable by iterated elimination of dominated strategies, we prove that simple standard reinforcement learning algorithms converge to the iterated dominance solution." +SkeQniAqK7,SJeKXzacK7,1538090000000.0,1545360000000.0,696,Combining Learned Representations for Combinatorial Optimization,"[""saavan@berkeley.edu"", ""sayeef@berkeley.edu""]","[""Saavan Patel"", ""Sayeef Salahuddin""]","[""Generative Models"", ""Restricted Boltzmann Machines"", ""Transfer Learning"", ""Compositional Learning""]","We propose a new approach to combine Restricted Boltzmann Machines (RBMs) that can be used to solve combinatorial optimization problems. This allows synthesis of larger models from smaller RBMs that have been pretrained, thus effectively bypassing the problem of learning in large RBMs, and creating a system able to model a large, complex multi-modal space. We validate this approach by using learned representations to create ``invertible boolean logic'', where we can use Markov chain Monte Carlo (MCMC) approaches to find the solution to large scale boolean satisfiability problems and show viability towards other combinatorial optimization problems. Using this method, we are able to solve 64 bit addition based problems, as well as factorize 16 bit numbers. We find that these combined representations can provide a more accurate result for the same sample size as compared to a fully trained model. ",/pdf/d38e063981e2be150ef7fe583cf8c9f569086a0f.pdf,ICLR,2019,We use combinations of RBMs to solve number factorization and combinatorial optimization problems. 
+kuqBCnJuD4Z,PaFJD1KqRZH,1601310000000.0,1614990000000.0,1494,FedMes: Speeding Up Federated Learning with Multiple Edge Servers,"[""~Dong-Jun_Han1"", ""ejaqmf@jejunu.ac.kr"", ""savertm@kaist.ac.kr"", ""~Jaekyun_Moon2""]","[""Dong-Jun Han"", ""Minseok Choi"", ""Jungwuk Park"", ""Jaekyun Moon""]","[""Federated Learning"", ""Edge Servers"", ""Wireless Edge Networks""]","We consider federated learning with multiple wireless edge servers having their own local coverages. We focus on speeding up training in this increasingly practical setup. Our key idea is to utilize the devices located in the overlapping areas between the coverage of edge servers; in the model-downloading stage, the devices in the overlapping areas receive multiple models from different edge servers, take the average of the received models, and then update the model with their local data. These devices send their updated model to multiple edge servers by broadcasting, which acts as bridges for sharing the trained models between servers. Even when some edge servers are given biased datasets within their coverages, their training processes can be assisted by coverages of adjacent servers, through the devices in the overlapping regions. As a result, the proposed scheme does not require costly communications with the central cloud server (located at the higher tier of edge servers) for model synchronization, significantly reducing the overall training time compared to the conventional cloud-based federated learning systems. Extensive experimental results show remarkable performance gains of our scheme compared to existing methods.",/pdf/73e26adfd659c09744ebfa7b2ad18ad178f6cc86.pdf,ICLR,2021,We propose a scheme to speed up federated learning in an increasingly practical setup with multiple edge servers having their own local coverages. 
+ZlIfK1wCubc,DDrMSPHFc_W,1601310000000.0,1614990000000.0,452,Contrasting distinct structured views to learn sentence embeddings,"[""~Antoine_Simoulin1"", ""benoit.crabbe@gmail.com""]","[""Antoine Simoulin"", ""Benoit Crabb\u00e9""]","[""Sentence"", ""Embeddings"", ""Structure"", ""Contrastive"", ""Multi-views""]"," We propose a self-supervised method that builds sentence embeddings from the combination of diverse explicit syntactic structures of a sentence. We assume structure is crucial to build consistent representations as we expect sentence meaning to be a function from both syntax and semantic aspects. In this perspective, we hypothesize that some linguistic representations might be better adapted given the considered task or sentence. We, therefore, propose to jointly learn individual representation functions for different syntactic frameworks. Again, by hypothesis, all such functions should encode similar semantic information differently and consequently, be complementary for building better sentential semantic embeddings. To assess such hypothesis, we propose an original contrastive multi-view framework that induces an explicit interaction between models during the training phase. We make experiments combining various structures such as dependency, constituency, or sequential schemes. We evaluate our method on standard sentence embedding benchmarks. Our results outperform comparable methods on several tasks. + + +",/pdf/b2813cbb69f650d7c6e65b7066472d7100f2a5e4.pdf,ICLR,2021,We propose a self-supervised method that builds sentence embeddings from the combination of diverse explicit syntactic structures. +Qe_de8HpWK,_YtvRiqf01J,1601310000000.0,1614990000000.0,2433,GenQu: A Hybrid System for Learning Classical Data in Quantum States,"[""sstein17@fordham.edu"", ""rtischio@fordham.edu"", ""bbaheri@kent.edu"", ""ychen638@fordham.edu"", ""~Ying_Mao1"", ""qguan@kent.edu"", ""ang.li@pnnl.gov"", ""bo.fang@pnnl.gov""]","[""Samuel A. 
Stein"", ""Ray Marie Tischio"", ""Betis Baheri"", ""Yiwen Chen"", ""Ying Mao"", ""Qiang Guan"", ""Ang Li"", ""Bo Fang""]","[""Quantum Machine Learning"", ""Qubits"", ""Kernel Methods"", ""Deep Neural Network""]","Deep neural network-powered artificial intelligence has rapidly changed our daily life with various applications. However, as one of the essential steps of deep neural networks, training a heavily-weighted network requires a tremendous amount of computing resources. Especially in the post Moore's Law era, the limit of semiconductor fabrication technology has restricted the development of learning algorithms to cope with the increasing high-intensity training data. Meanwhile, quantum computing has exhibited its significant potential in terms of speeding up the traditionally compute-intensive workloads. For example, Google illustrates quantum supremacy by completing a sampling calculation task in 200 seconds, which is otherwise impracticable on the world's largest supercomputers. To this end, quantum-based learning becomes an area of interest, with the promise of a quantum speedup. In this paper, we propose GenQu, a hybrid and general-purpose quantum framework for learning classical data through quantum states. We evaluate GenQu with real datasets and conduct experiments on both simulations and real quantum computer IBM-Q. Our evaluation demonstrates that, compared with classical solutions, the proposed models running on GenQu framework achieve similar accuracy with a much smaller number of qubits, while significantly reducing the parameter size by up to 95.8\% and converging 66.67% faster. 
",/pdf/a512c04a58785954c8f8f3b15188ac67d4fdadbb.pdf,ICLR,2021,"We propose GenQu, a hybrid and general-purpose quantum system for learning classical data in quantum states" +BkewX2C9tX,SylcW6T5FX,1538090000000.0,1545360000000.0,1366,Analyzing Federated Learning through an Adversarial Lens,"[""abhagoji@princeton.edu"", ""supriyo@us.ibm.com"", ""scalo@us.ibm.com"", ""pmittal@princeton.edu""]","[""Arjun Nitin Bhagoji"", ""Supriyo Chakraborty"", ""Seraphin Calo"", ""Prateek Mittal""]","[""federated learning"", ""model poisoning""]","Federated learning distributes model training among a multitude of agents, who, guided by privacy concerns, perform training using their local data but share only model parameter updates, for iterative aggregation at the server. In this work, we explore the threat of model poisoning attacks on federated learning initiated by a single, non-colluding malicious agent where the adversarial objective is to cause the model to misclassify a set of chosen inputs with high confidence. We explore a number of strategies to carry out this attack, starting with simple boosting of the malicious agent's update to overcome the effects of other agents' updates. To increase attack stealth, we propose an alternating minimization strategy, which alternately optimizes for the training loss and the adversarial objective. We follow up by using parameter estimation for the benign agents' updates to improve on attack success. Finally, we use a suite of interpretability techniques to generate visual explanations of model decisions for both benign and malicious models and show that the explanations are nearly visually indistinguishable. 
Our results indicate that even a highly constrained adversary can carry out model poisoning attacks while simultaneously maintaining stealth, thus highlighting the vulnerability of the federated learning setting and the need to develop effective defense strategies.",/pdf/b0b36674a8d2b3fef2cdcabd5c6053c12be24eeb.pdf,ICLR,2019,Effective model poisoning attacks on federated learning able to cause high-confidence targeted misclassification of desired inputs +Ske5UANYDB,B1xEWXvdwH,1569440000000.0,1577170000000.0,1149,Benefit of Interpolation in Nearest Neighbor Algorithms,"[""xing49@purdue.edu"", ""qfsong@purdue.edu"", ""chengg@purdue.edu""]","[""Yue Xing"", ""Qifan Song"", ""Guang Cheng""]","[""Data Interpolation"", ""Multiplicative Constant"", ""W-Shaped Double Descent"", ""Nearest Neighbor Algorithm""]","The over-parameterized models attract much attention in the era of data science and deep learning. It is empirically observed that although these models, e.g. deep neural networks, over-fit the training data, they can still achieve small testing error, and sometimes even outperform traditional algorithms which are designed to avoid over-fitting. The major goal of this work is to sharply quantify the benefit of data interpolation in the context of nearest neighbors (NN) algorithm. Specifically, we consider a class of interpolated weighting schemes and then carefully characterize their asymptotic performances. Our analysis reveals a U-shaped performance curve with respect to the level of data interpolation, and proves that a mild degree of data interpolation strictly improves the prediction accuracy and statistical stability over those of the (un-interpolated) optimal $k$NN algorithm. This theoretically justifies (predicts) the existence of the second U-shaped curve in the recently discovered double descent phenomenon. 
Note that our goal in this study is not to promote the use of interpolated-NN method, but to obtain theoretical insights on data interpolation inspired by the aforementioned phenomenon.",/pdf/c3afedd4587bde1e38b7383e9cc8783e4795cf5b.pdf,ICLR,2020, +NfZ6g2OmXEk,OJ65O8RzK6M,1601310000000.0,1614990000000.0,615,Prioritized Level Replay,"[""~Minqi_Jiang1"", ""~Edward_Grefenstette1"", ""~Tim_Rockt\u00e4schel1""]","[""Minqi Jiang"", ""Edward Grefenstette"", ""Tim Rockt\u00e4schel""]","[""Reinforcement Learning"", ""Procedurally Generated Environments"", ""Curriculum Learning"", ""Procgen Benchmark""]","Simulated environments with procedurally generated content have become popular benchmarks for testing systematic generalization of reinforcement learning agents. Every level in such an environment is algorithmically created, thereby exhibiting a unique configuration of underlying factors of variation, such as layout, positions of entities, asset appearances, or even the rules governing environment transitions. Fixed sets of training levels can be determined to aid comparison and reproducibility, and test levels can be held out to evaluate the generalization and robustness of agents. While prior work samples training levels in a direct way (e.g.~uniformly) for the agent to learn from, we investigate the hypothesis that different levels provide different learning progress for an agent at specific times during training. We introduce Prioritized Level Replay, a general framework for estimating the future learning potential of a level given the current state of the agent's policy. We find that temporal-difference (TD) errors, while previously used to selectively sample past transitions, also prove effective for scoring a level's future learning potential when the agent replays (that is, revisits) that level to generate entirely new episodes of experiences from it. 
We report significantly improved sample-efficiency and generalization on the majority of Procgen Benchmark environments as well as two challenging MiniGrid environments. Lastly, we present a qualitative analysis showing that Prioritized Level Replay induces an implicit curriculum, taking the agent gradually from easier to harder levels.",/pdf/97566ca56e2490adb67bfbdbd18acdb617462a8d.pdf,ICLR,2021,"TD error can be exploited to score procedurally generated levels for future learning potential, thereby inducing a curriculum from easier to harder levels and providing significant gains in OpenAI Procgen Benchmark and MiniGrid." +C0qJUx5dxFb,vmB959VkKSC,1601310000000.0,1615890000000.0,3621,Neural networks with late-phase weights,"[""~Johannes_Von_Oswald1"", ""seijink@ethz.ch"", ""~Joao_Sacramento1"", ""~Alexander_Meulemans1"", ""~Christian_Henning1"", ""~Benjamin_F_Grewe1""]","[""Johannes Von Oswald"", ""Seijin Kobayashi"", ""Joao Sacramento"", ""Alexander Meulemans"", ""Christian Henning"", ""Benjamin F Grewe""]",[],"The largely successful method of training neural networks is to learn their weights using some variant of stochastic gradient descent (SGD). Here, we show that the solutions found by SGD can be further improved by ensembling a subset of the weights in late stages of learning. At the end of learning, we obtain back a single model by taking a spatial average in weight space. To avoid incurring increased computational costs, we investigate a family of low-dimensional late-phase weight models which interact multiplicatively with the remaining parameters. Our results show that augmenting standard models with late-phase weights improves generalization in established benchmarks such as CIFAR-10/100, ImageNet and enwik8. 
These findings are complemented with a theoretical analysis of a noisy quadratic problem which provides a simplified picture of the late phases of neural network learning.",/pdf/b8fee003035442b69f0891687d5afc8ab42545d4.pdf,ICLR,2021, +ryeX-nC9YQ,SyeljTpcKX,1538090000000.0,1545360000000.0,1155,Dimension-Free Bounds for Low-Precision Training,"[""lzlz19971997@gmail.com"", ""cdesa@cs.cornell.edu""]","[""Zheng Li"", ""Christopher De Sa""]","[""low precision"", ""stochastic gradient descent""]","Low-precision training is a promising way of decreasing the time and energy cost of training machine learning models. +Previous work has analyzed low-precision training algorithms, such as low-precision stochastic gradient descent, and derived theoretical bounds on their convergence rates. +These bounds tend to depend on the dimension of the model $d$ in that the number of bits needed to achieve a particular error bound increases as $d$ increases. +This is undesirable because a motivating application for low-precision training is large-scale models, such as deep learning, where $d$ can be huge. +In this paper, we prove dimension-independent bounds for low-precision training algorithms that use fixed-point arithmetic, which lets us better understand what affects the convergence of these algorithms as parameters scale. 
+Our methods also generalize naturally to let us prove new convergence bounds on low-precision training with other quantization schemes, such as low-precision floating-point computation and logarithmic quantization.",/pdf/e2fa34b3b6519990013c7618cb3a71a04531f868.pdf,ICLR,2019,we proved dimension-independent bounds for low-precision training algorithms +SkhQHMW0W,HJ5XHM-0b,1509140000000.0,1519440000000.0,833,Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training,"[""yujunlin@stanford.edu"", ""songhan@stanford.edu"", ""huizi@stanford.edu"", ""yu-wang@mail.tsinghua.edu.cn"", ""dally@stanford.edu""]","[""Yujun Lin"", ""Song Han"", ""Huizi Mao"", ""Yu Wang"", ""Bill Dally""]","[""distributed training""]","Large-scale distributed training requires significant communication bandwidth for gradient exchange that limits the scalability of multi-node training, and requires expensive high-bandwidth network infrastructure. The situation gets even worse with distributed training on mobile devices (federated learning), which suffers from higher latency, lower throughput, and intermittent poor connections. In this paper, we find 99.9% of the gradient exchange in distributed SGD is redundant, and propose Deep Gradient Compression (DGC) to greatly reduce the communication bandwidth. To preserve accuracy during compression, DGC employs four methods: momentum correction, local gradient clipping, momentum factor masking, and warm-up training. We have applied Deep Gradient Compression to image classification, speech recognition, and language modeling with multiple datasets including Cifar10, ImageNet, Penn Treebank, and Librispeech Corpus. On these scenarios, Deep Gradient Compression achieves a gradient compression ratio from 270x to 600x without losing accuracy, cutting the gradient size of ResNet-50 from 97MB to 0.35MB, and for DeepSpeech from 488MB to 0.74MB. 
Deep gradient compression enables large-scale distributed training on inexpensive commodity 1Gbps Ethernet and facilitates distributed training on mobile.",/pdf/41772454cc4bd99cc9865acd9eb52dadf67ccb50.pdf,ICLR,2018,we find 99.9% of the gradient exchange in distributed SGD is redundant; we reduce the communication bandwidth by two orders of magnitude without losing accuracy. +BJepcaEtwB,SJx0B4APPr,1569440000000.0,1577170000000.0,723,Meta-Graph: Few shot Link Prediction via Meta Learning,"[""joey.bose@mail.mcgill.ca"", ""ankit.jain@uber.com"", ""piero.molino@uber.com"", ""wlh@cs.mcgill.ca""]","[""Avishek Joey Bose"", ""Ankit Jain"", ""Piero Molino"", ""William L. Hamilton""]","[""Meta Learning"", ""Link Prediction"", ""Graph Representation Learning"", ""Graph Neural Networks""]","We consider the task of few shot link prediction, where the goal is to predict missing edges across multiple graphs using only a small sample of known edges. We show that current link prediction methods are generally ill-equipped to handle this task---as they cannot effectively transfer knowledge between graphs in a multi-graph setting and are unable to effectively learn from very sparse data. To address this challenge, we introduce a new gradient-based meta learning framework, Meta-Graph, that leverages higher-order gradients along with a learned graph signature function that conditionally generates a graph neural network initialization. Using a novel set of few shot link prediction benchmarks, we show that Meta-Graph enables not only fast adaptation but also better final convergence and can effectively learn using only a small sample of true edges.",/pdf/67c358571a31c4b956e1f71ca633990bf1a9b792.pdf,ICLR,2020,We apply gradient based meta-learning to the graph domain and introduce a new graph specific transfer function to further bootstrap the process. 
+dx4b7lm8jMM,N0-idM_vnHF,1601310000000.0,1616850000000.0,3363,Seq2Tens: An Efficient Representation of Sequences by Low-Rank Tensor Projections,"[""~Csaba_Toth2"", ""~Patric_Bonnier1"", ""~Harald_Oberhauser1""]","[""Csaba Toth"", ""Patric Bonnier"", ""Harald Oberhauser""]","[""time series"", ""sequential data"", ""representation learning"", ""low-rank tensors"", ""classification"", ""generative modelling""]","Sequential data such as time series, video, or text can be challenging to analyse as the ordered structure gives rise to complex dependencies. At the heart of this is non-commutativity, in the sense that reordering the elements of a sequence can completely change its meaning. We use a classical mathematical object -- the free algebra -- to capture this non-commutativity. To address the innate computational complexity of this algebra, we use compositions of low-rank tensor projections. This yields modular and scalable building blocks that give state-of-the-art performance on standard benchmarks such as multivariate time series classification, mortality prediction and generative models for video.",/pdf/bc313164adf3017b7e94a07aecbd830b43e5c49a.pdf,ICLR,2021,An Efficient Representation of Sequences by Low-Rank Tensor Projections +BkljIlHtvS,rJgh_clKPr,1569440000000.0,1577170000000.0,2335,Decoupling Adaptation from Modeling with Meta-Optimizers for Meta Learning,"[""arnolds@usc.edu"", ""shariqiqbal2810@gmail.com"", ""fsha@google.com""]","[""S\u00e9bastien M.R. Arnold"", ""Shariq Iqbal"", ""Fei Sha""]","[""meta-learning"", ""MAML"", ""analysis"", ""depth"", ""meta-optimizers""]","Meta-learning methods, most notably Model-Agnostic Meta-Learning (Finn et al, 2017) or MAML, have achieved great success in adapting to new tasks quickly, after having been trained on similar tasks. +The mechanism behind their success, however, is poorly understood. 
+We begin this work with an experimental analysis of MAML, finding that deep models are crucial for its success, even given sets of simple tasks where a linear model would suffice on any individual task. +Furthermore, on image-recognition tasks, we find that the early layers of MAML-trained models learn task-invariant features, while later layers are used for adaptation, providing further evidence that these models require greater capacity than is strictly necessary for their individual tasks. +Following our findings, we propose a method which enables better use of model capacity at inference time by separating the adaptation aspect of meta-learning into parameters that are only used for adaptation but are not part of the forward model. +We find that our approach enables more effective meta-learning in smaller models, which are suitably sized for the individual tasks. +",/pdf/626d9ba8d193fe70a4b6e387259d50cb98aa97a4.pdf,ICLR,2020,We find that deep models are crucial for MAML to work and propose a method which enables effective meta-learning in smaller models. +b-7nwWHFtw,uVO9pd6SWHE,1601310000000.0,1614990000000.0,2810,Privacy-preserving Learning via Deep Net Pruning,"[""~YANGSIBO_HUANG1"", ""~Xiaoxiao_Li1"", ""yushans@princeton.edu"", ""~Sachin_Ravi1"", ""~Zhao_Song3"", ""~Sanjeev_Arora1"", ""~Kai_Li8""]","[""YANGSIBO HUANG"", ""Xiaoxiao Li"", ""Yushan Su"", ""Sachin Ravi"", ""Zhao Song"", ""Sanjeev Arora"", ""Kai Li""]","[""Neural Network Pruning"", ""Differential Privacy""]","Neural network pruning has demonstrated its success in significantly improving the computational efficiency of deep models while only introducing a small reduction on final accuracy. In this paper, we explore an extra bonus of neural network pruning in terms of enhancing privacy. Specifically, we show a novel connection between magnitude-based pruning and adding differentially private noise to intermediate layers under the over-parameterized regime. 
To the best of our knowledge, this is the first work that bridges pruning with the theory of differential privacy. The paper also presents experimental results by running the model inversion attack on two benchmark datasets, which supports the theoretical finding. ",/pdf/c9483b7f9192d3c71bd14c1fff6d2c7ac3caaa0c.pdf,ICLR,2021,"This paper shows a novel connection between magnitude-based pruning and adding differentially private noise to intermediate layers, under the over-parameterized regime" +DGttsPh502x,GUoi_kIrDn8,1601310000000.0,1614990000000.0,3438,Unsupervised Discovery of Interpretable Latent Manipulations in Language VAEs,"[""~Max_Ryabinin1"", ""~Artem_Babenko1"", ""~Elena_Voita1""]","[""Max Ryabinin"", ""Artem Babenko"", ""Elena Voita""]","[""interpretability"", ""unsupervised interpretable directions"", ""controllable text generation""]","Language generation models are attracting more and more attention due to their constantly increasing quality and remarkable generation results. State-of-the-art NLG models like BART/T5/GPT-3 do not have latent spaces, therefore there is no natural way to perform controlled generation. In contrast, less popular models with explicit latent spaces have the innate ability to manipulate text attributes by moving along latent directions. For images, properties of latent spaces are well-studied: there exist interpretable directions (e.g. zooming, aging, background removal) and they can even be found without supervision. This success is expected: latent space image models, especially GANs, achieve state-of-the-art generation results and hence have been the focus of the research community. For language, this is not the case: text GANs are hard to train because of non-differentiable discrete data generation, and language VAEs suffer from posterior collapse and fill the latent space poorly. This makes finding interpretable text controls challenging. 
In this work, we make the first step towards unsupervised discovery of interpretable directions in language latent spaces. For this, we turn to methods shown to work in the image domain. Surprisingly, we find that running PCA on VAE representations of training data consistently outperforms shifts along the coordinate and random directions. This approach is simple, data-adaptive, does not require training and discovers meaningful directions, e.g. sentence length, subject age, and verb tense. Our work lays foundations for two important areas: first, it allows to compare models in terms of latent space interpretability, and second, it provides a baseline for unsupervised latent controls discovery.",/pdf/46aa1ab23832fe111723276d57c5a058caf25fa9.pdf,ICLR,2021,"We propose the first method for unsupervised discovery of interpretable attribute manipulations in text variational autoencoders; this method is very simple, fast, and outperforms the baselines by a large margin." +d9Emve8gG5E,ZyGdE55ZQE,1601310000000.0,1614990000000.0,758,OFFER PERSONALIZATION USING TEMPORAL CONVOLUTION NETWORK AND OPTIMIZATION,"[""~Ankur_Verma1""]","[""Ankur Verma""]","[""Machine Learning"", ""Deep Learning"", ""Optimization"", ""Time-Series"", ""Offer Personalization""]","Lately, personalized marketing has become important for retail/e-retail firms due to significant rise in online shopping and market competition. Increase in online shopping and high market competition has led to an increase in promotional expenditure for online retailers, and hence, rolling out optimal offers has become imperative to maintain balance between number of transactions and profit. In this paper, we propose our approach to solve the offer optimization problem at the intersection of consumer, item and time in retail setting. To optimize offer, we first build a generalized non-linear model using Temporal Convolutional Network to predict the item purchase probability at consumer level for the given time period. 
Secondly, we establish the functional relationship between historical offer values and purchase probabilities obtained from the model, which is then used to estimate offer-elasticity of purchase probability at consumer item granularity. Finally, using estimated elasticities, we optimize offer values using constraint based optimization technique. This paper describes our detailed methodology and presents the results of modelling and optimization across categories.",/pdf/a1a219347cf0f1e62dad7b230e33a4bd2a208320.pdf,ICLR,2021,"Deep Learning and Optimization based approach to solve the offer optimization problem at the intersection of consumer, item and time in retail setting." +rJeZS3RcYm,SkgPDdact7,1538090000000.0,1545360000000.0,1514,Simple Black-box Adversarial Attacks,"[""cg563@cornell.edu"", ""jrg365@cornell.edu"", ""yy785@cornell.edu"", ""andrew@cornell.edu"", ""kqw4@cornell.edu""]","[""Chuan Guo"", ""Jacob R. Gardner"", ""Yurong You"", ""Andrew G. Wilson"", ""Kilian Q. Weinberger""]",[],"The construction of adversarial images is a search problem in high dimensions within a small region around a target image. The goal is to find an imperceptibly modified image that is misclassified by a target model. In the black-box setting, only sporadic feedback is provided through occasional model evaluations. In this paper we provide a new algorithm whose search strategy is based on an intriguingly simple iterative principle: We randomly pick a low frequency component of the discrete cosine transform (DCT) and either add or subtract it to the target image. Model evaluations are only required to identify whether an operation decreases the adversarial loss. Despite its simplicity, the proposed method can be used for targeted and untargeted attacks --- resulting in previously unprecedented query efficiency in both settings. 
We require a median of 600 black-box model queries (ResNet-50) to produce an adversarial ImageNet image, and we successfully attack Google Cloud Vision with 2500 median queries, averaging to a cost of only $3 per image. We argue that our proposed algorithm should serve as a strong baseline for future adversarial black-box attacks, in particular because it is extremely fast and can be implemented in less than 20 lines of PyTorch code. ",/pdf/dfedb03996426d491479278b797615cb2b0de79f.pdf,ICLR,2019, +M_eaMB2DOxw,A-tMqVz10vB,1601310000000.0,1614990000000.0,1952,On Representing (Anti)Symmetric Functions,"[""~Marcus_Hutter1""]","[""Marcus Hutter""]","[""Neural network"", ""approximation"", ""universality"", ""Slater determinant"", ""Vandermonde matrix"", ""equivariance"", ""symmetry"", ""anti-symmetry"", ""symmetric polynomials"", ""polarized basis"", ""multilayer perceptron"", ""continuity"", ""smoothness""]","Permutation-invariant, -equivariant, and -covariant functions and anti-symmetric functions are important in quantum physics, computer vision, and other disciplines. Applications often require most or all of the following properties: (a) a large class of such functions can be approximated, e.g. all continuous function (b) only the (anti)symmetric functions can be represented (c) a fast algorithm for computing the approximation (d) the representation itself is continuous or differentiable (e) the architecture is suitable for learning the function from data (Anti)symmetric neural networks have recently been developed and applied with great success. A few theoretical approximation results have been proven, but many questions are still open, especially for particles in more than one dimension and the anti-symmetric case, which this work focuses on. More concretely, we derive natural polynomial approximations in the symmetric case, and approximations based on a single generalized Slater determinant in the anti-symmetric case. 
Unlike some previous super-exponential and discontinuous approximations, these seem a more promising basis for future tighter bounds.",/pdf/ceb82fadd82d17e8d0b7c92bdc97cacfc2ace6d8.pdf,ICLR,2021,"We prove universality of the symmetric/equivariant 2-hidden-layer Perceptron and of the FermiNet with a single generalized Slater determinant, both based on polynomials and for particles of arbitrary dimension." +HJe5_6VKwS,r1l82XhPDB,1569440000000.0,1577170000000.0,642,Model-based Saliency for the Detection of Adversarial Examples,"[""lisaschut94@gmail.com"", ""yarin.gal@cs.ox.ac.uk""]","[""Lisa Schut"", ""Yarin Gal""]","[""Adversarial Examples"", ""Defense"", ""Model-based Saliency""]","Adversarial perturbations cause a shift in the salient features of an image, which may result in a misclassification. We demonstrate that gradient-based saliency approaches are unable to capture this shift, and develop a new defense which detects adversarial examples based on learnt saliency models instead. We study two approaches: a CNN trained to distinguish between natural and adversarial images using the saliency masks produced by our learnt saliency model, and a CNN trained on the salient pixels themselves as its input. On MNIST, CIFAR-10 and ASSIRA, our defenses are able to detect various adversarial attacks, including strong attacks such as C&W and DeepFool, contrary to gradient-based saliency and detectors which rely on the input image. The latter are unable to detect adversarial images when the L_2- and L_infinity- norms of the perturbations are too small. 
Lastly, we find that the salient pixel based detector improves on saliency map based detectors as it is more robust to white-box attacks.",/pdf/37a8a79df61747fa43219cde649fe785fe2c7aee.pdf,ICLR,2020,We show that gradients are unable to capture shifts in saliency due to adversarial perturbations and present an alternative adversarial defense using learnt saliency models that is effective against both black-box and white-box attacks. +rkxwShA9Ym,ryxWcmhqKQ,1538090000000.0,1547750000000.0,1552,Label super-resolution networks,"[""kolya_malkin@hotmail.com"", ""dcrobins@gatech.edu"", ""le.hou@stonybrook.edu"", ""rsoobitsky@chesapeakeconservancy.org"", ""jczawlytko@chesapeakeconservancy.org"", ""samaras@cs.stonybrook.edu"", ""joel.saltz@stonybrookmedicine.edu"", ""lujoppa@microsoft.com"", ""jojic@microsoft.com""]","[""Kolya Malkin"", ""Caleb Robinson"", ""Le Hou"", ""Rachel Soobitsky"", ""Jacob Czawlytko"", ""Dimitris Samaras"", ""Joel Saltz"", ""Lucas Joppa"", ""Nebojsa Jojic""]","[""weakly supervised segmentation"", ""land cover mapping"", ""medical imaging""]","We present a deep learning-based method for super-resolving coarse (low-resolution) labels assigned to groups of image pixels into pixel-level (high-resolution) labels, given the joint distribution between those low- and high-resolution labels. This method involves a novel loss function that minimizes the distance between a distribution determined by a set of model outputs and the corresponding distribution given by low-resolution labels over the same set of outputs. This setup does not require that the high-resolution classes match the low-resolution classes and can be used in high-resolution semantic segmentation tasks where high-resolution labeled data is not available. 
Furthermore, our proposed method is able to utilize both data with low-resolution labels and any available high-resolution labels, which we show improves performance compared to a network trained only with the same amount of high-resolution data. +We test our proposed algorithm in a challenging land cover mapping task to super-resolve labels at a 30m resolution to a separate set of labels at a 1m resolution. We compare our algorithm with models that are trained on high-resolution data and show that 1) we can achieve similar performance using only low-resolution data; and 2) we can achieve better performance when we incorporate a small amount of high-resolution data in our training. We also test our approach on a medical imaging problem, resolving low-resolution probability maps into high-resolution segmentation of lymphocytes with accuracy equal to that of fully supervised models.",/pdf/c67df427800bf5178ceb18ba82580d8613d042dd.pdf,ICLR,2019,"Super-resolving coarse labels into pixel-level labels, applied to aerial imagery and medical scans." +Bkab5dqxe,,1478290000000.0,1488650000000.0,462,A Compositional Object-Based Approach to Learning Physical Dynamics,"[""mbchang@mit.edu"", ""tomeru@mit.edu"", ""torralba@mit.edu"", ""jbt@mit.edu""]","[""Michael Chang"", ""Tomer Ullman"", ""Antonio Torralba"", ""Joshua Tenenbaum""]","[""Deep learning"", ""Unsupervised Learning""]","We present the Neural Physics Engine (NPE), a framework for learning simulators of intuitive physics that naturally generalize across variable object count and different scene configurations. We propose a factorization of a physical scene into composable object-based representations and a neural network architecture whose compositional structure factorizes object dynamics into pairwise interactions. 
Like a symbolic physics engine, the NPE is endowed with generic notions of objects and their interactions; realized as a neural network, it can be trained via stochastic gradient descent to adapt to specific object properties and dynamics of different worlds. We evaluate the efficacy of our approach on simple rigid body dynamics in two-dimensional worlds. By comparing to less structured architectures, we show that the NPE's compositional representation of the structure in physical interactions improves its ability to predict movement, generalize across variable object count and different scene configurations, and infer latent properties of objects such as mass.",/pdf/2c1326b0f5fc2791b2d903ea43219c27ae3002bc.pdf,ICLR,2017,We propose a factorization of a physical scene into composable object-based representations and also a model architecture whose compositional structure factorizes object dynamics into pairwise interactions. +Bkg8jjC9KQ,rklZBDq9FX,1538090000000.0,1560260000000.0,626,Optimistic mirror descent in saddle-point problems: Going the extra (gradient) mile,"[""panayotis.mertikopoulos@imag.fr"", ""bruno_lecouat@i2r.a-star.edu.sg"", ""houssam_zenati@i2r.a-star.edu.sg"", ""foocs@i2r.a-star.edu.sg"", ""vijay@i2r.a-star.edu.sg"", ""georgios@sutd.edu.sg""]","[""Panayotis Mertikopoulos"", ""Bruno Lecouat"", ""Houssam Zenati"", ""Chuan-Sheng Foo"", ""Vijay Chandrasekhar"", ""Georgios Piliouras""]","[""Mirror descent"", ""extra-gradient"", ""generative adversarial networks"", ""saddle-point problems""]","Owing to their connection with generative adversarial networks (GANs), saddle-point problems have recently attracted considerable interest in machine learning and beyond. By necessity, most theoretical guarantees revolve around convex-concave (or even linear) problems; however, making theoretical inroads towards efficient GAN training depends crucially on moving beyond this classic framework. 
To make piecemeal progress along these lines, we analyze the behavior of mirror descent (MD) in a class of non-monotone problems whose solutions coincide with those of a naturally associated variational inequality – a property which we call coherence. We first show that ordinary, “vanilla” MD converges under a strict version of this condition, but not otherwise; in particular, it may fail to converge even in bilinear models with a unique solution. We then show that this deficiency is mitigated by optimism: by taking an “extra-gradient” step, optimistic mirror descent (OMD) converges in all coherent problems. Our analysis generalizes and extends the results of Daskalakis et al. [2018] for optimistic gradient descent (OGD) in bilinear problems, and makes concrete headway for provable convergence beyond convex-concave games. We also provide stochastic analogues of these results, and we validate our analysis by numerical experiments in a wide array of GAN models (including Gaussian mixture models, and the CelebA and CIFAR-10 datasets).",/pdf/e68620ae4c38287559cc1c2f04ff57bf59833d07.pdf,ICLR,2019,We show how the inclusion of an extra-gradient step in first-order GAN training methods can improve stability and lead to improved convergence results. +rkxtNaNKwr,r1lnSDmvvS,1569440000000.0,1577170000000.0,495,Evolutionary Reinforcement Learning for Sample-Efficient Multiagent Coordination,"[""shauharda.khadka@intel.com"", ""somdeb.majumdar@intel.com"", ""santiago.miret@intel.com"", ""smcaleer@uci.edu"", ""kagan.tumer@oregonstate.edu""]","[""Shauharda Khadka"", ""Somdeb Majumdar"", ""Santiago Miret"", ""Stephen McAleer"", ""Kagan Tumer""]","[""reinforcement learning"", ""multiagent"", ""neuroevolution""]","Many cooperative multiagent reinforcement learning environments provide agents with a sparse team-based reward as well as a dense agent-specific reward that incentivizes learning basic skills. 
Training policies solely on the team-based reward is often difficult due to its sparsity. Also, relying solely on the agent-specific reward is sub-optimal because it usually does not capture the team coordination objective. A common approach is to use reward shaping to construct a proxy reward by combining the individual rewards. However, this requires manual tuning for each environment. We introduce Multiagent Evolutionary Reinforcement Learning (MERL), a split-level training platform that handles the two objectives separately through two optimization processes. An evolutionary algorithm maximizes the sparse team-based objective through neuroevolution on a population of teams. Concurrently, a gradient-based optimizer trains policies to only maximize the dense agent-specific rewards. The gradient-based policies are periodically added to the evolutionary population as a way of information transfer between the two optimization processes. This enables the evolutionary algorithm to use skills learned via the agent-specific rewards toward optimizing the global objective. Results demonstrate that MERL significantly outperforms state-of-the-art methods such as MADDPG on a number of difficult coordination benchmarks. ",/pdf/87542d001b51422c2553a915781b66822029f666.pdf,ICLR,2020,Reinforcement learning for problems that involve multiple agents coordinating to achieve a sparse team objective +HkxwmRVtwH,SklXe9H_Pr,1569440000000.0,1577170000000.0,1041,Gaussian Process Meta-Representations Of Neural Networks,"[""theofanis.karaletsos@gmail.com"", ""thang.buivn@gmail.com""]","[""Theofanis Karaletsos"", ""Thang Bui""]","[""Bayesian Neural Networks"", ""Representation Learning"", ""Gaussian Processes"", ""Variational Inference""]","Bayesian inference offers a theoretically grounded and general way to train neural networks and can potentially give calibrated uncertainty. It is, however, challenging to specify a meaningful and tractable prior over the network parameters. 
More crucially, many existing inference methods assume mean-field approximate posteriors, ignoring interactions between parameters in high-dimensional weight space. To this end, this paper introduces two innovations: (i) a Gaussian process-based hierarchical model for the network parameters based on recently introduced unit embeddings that can flexibly encode weight structures, and (ii) input-dependent contextual variables for the weight prior that can provide convenient ways to regularize the function space being modeled by the NN through the use of kernels. +Furthermore, we develop an efficient structured variational inference scheme that alleviates the need to perform inference in the weight space whilst retaining and learning non-trivial correlations between network parameters. +We show these models provide desirable test-time uncertainty estimates, demonstrate cases of modeling inductive biases for neural networks with kernels and demonstrate competitive predictive performance of the proposed model and algorithm over alternative approaches on a range of classification and active learning tasks.",/pdf/66333f2b50cd54da9e0dcef0917547c25d8c566c.pdf,ICLR,2020,We derive a Gaussian Process prior for Bayesian Neural Networks based on representations of units and use compositional kernels to model inductive biases for deep learning. +HJxYwiC5tm,BJlPFW_ctX,1538090000000.0,1545360000000.0,281,Why do deep convolutional networks generalize so poorly to small image transformations?,"[""aharon.azulay@mail.huji.ac.il"", ""yweiss@cs.huji.ac.il""]","[""Aharon Azulay"", ""Yair Weiss""]","[""Convolutional neural networks"", ""The sampling theorem"", ""Sensitivity to small image transformations"", ""Dataset bias"", ""Shiftability""]","Deep convolutional network architectures are often assumed to guarantee generalization for small image translations and deformations. 
In this paper we show that modern CNNs (VGG16, ResNet50, and InceptionResNetV2) can drastically change their output when an image is translated in the image plane by a few pixels, and that this failure of generalization also happens with other realistic small image transformations. Furthermore, we see these failures to generalize more frequently in more modern networks. We show that these failures are related to the fact that the architecture of modern CNNs ignores the classical sampling theorem so that generalization is not guaranteed. We also show that biases in the statistics of commonly used image datasets make it unlikely that CNNs will learn to be invariant to these transformations. Taken together, our results suggest that the performance of CNNs in object recognition falls far short of the generalization capabilities of humans.",/pdf/67577627cc2bcea8d277f19807e77b576684c043.pdf,ICLR,2019,"Modern deep CNNs are not invariant to translations, scalings and other realistic image transformations, and this lack of invariance is related to the subsampling operation and the biases contained in image datasets." +SJxJtiRqt7,Byg7mzt9tQ,1538090000000.0,1545360000000.0,405,Generating Images from Sounds Using Multimodal Features and GANs,"[""app@live.jp"", ""tshino@nict.go.jp"", ""kaoruamano@nict.go.jp""]","[""Jeonghyun Lyu"", ""Takashi Shinozaki"", ""Kaoru Amano""]","[""deep learning"", ""machine learning"", ""multimodal"", ""generative adversarial networks""]","Although generative adversarial networks (GANs) have enabled us to convert images from one domain to another similar one, converting between different sensory modalities, such as images and sounds, has been difficult. This study aims to propose a network that reconstructs images from sounds. First, video data with both images and sounds are labeled with pre-trained classifiers. Second, image and sound features are extracted from the data using pre-trained classifiers. 
Third, multimodal layers are introduced to extract features that are common to both the images and sounds. These layers are trained to extract similar features regardless of the input modality, such as images only, sounds only, and both images and sounds. Once the multimodal layers have been trained, features are extracted from input sounds and converted into image features using a feature-to-feature GAN. Finally, the generated image features are used to reconstruct images. Experimental results show that this method can successfully convert from the sound domain into the image domain. When we applied a pre-trained classifier to both the generated and original images, 31.9% of the examples had at least one of their top 10 labels in common, suggesting reasonably good image generation. Our results suggest that common representations can be learned for different modalities, and that the proposed method can be applied not only to sound-to-image conversion but also to other conversions, such as from images to sounds.",/pdf/4ceae2e37f600b40c20a1109ee3721196cc32e73.pdf,ICLR,2019,We propose a method of converting from the sound domain into the image domain based on multimodal features and stacked GANs. +BygXFkSYDH,BJx0dNAdvr,1569440000000.0,1583910000000.0,1836,Target-Embedding Autoencoders for Supervised Representation Learning,"[""daniel.jarrett@eng.ox.ac.uk"", ""mv472@damtp.cam.ac.uk""]","[""Daniel Jarrett"", ""Mihaela van der Schaar""]","[""autoencoders"", ""supervised learning"", ""representation learning"", ""target-embedding"", ""label-embedding""]","Autoencoder-based learning has emerged as a staple for disciplining representations in unsupervised and semi-supervised settings. This paper analyzes a framework for improving generalization in a purely supervised setting, where the target space is high-dimensional. 
We motivate and formalize the general framework of target-embedding autoencoders (TEA) for supervised prediction, learning intermediate latent representations jointly optimized to be both predictable from features as well as predictive of targets---encoding the prior that variations in targets are driven by a compact set of underlying factors. As our theoretical contribution, we provide a guarantee of generalization for linear TEAs by demonstrating uniform stability, interpreting the benefit of the auxiliary reconstruction task as a form of regularization. As our empirical contribution, we extend validation of this approach beyond existing static classification applications to multivariate sequence forecasting, verifying their advantage on both linear and nonlinear recurrent architectures---thereby underscoring the further generality of this framework beyond feedforward instantiations.",/pdf/faa226acb48cee96a11ff11e7382d4999f80dc00.pdf,ICLR,2020, +zcOJOUjUcyF,NNurYmFMfUz,1601310000000.0,1614990000000.0,601,Better Optimization can Reduce Sample Complexity: Active Semi-Supervised Learning via Convergence Rate Control,"[""~Seo_Taek_Kong1"", ""soomin.jeon@vuno.co"", ""~Jaewon_Lee2"", ""hongseok@vuno.co"", ""~Kyu-Hwan_Jung1""]","[""Seo Taek Kong"", ""Soomin Jeon"", ""Jaewon Lee"", ""Hong-Seok Lee"", ""Kyu-Hwan Jung""]","[""Active Learning"", ""Semi-Supervised Learning"", ""Neural Tangent Kernel"", ""Deep Learning""]","Reducing the sample complexity associated with deep learning (DL) remains one of the most important problems in both theory and practice since its advent. Semi-supervised learning (SSL) tackles this task by leveraging unlabeled instances which are usually more accessible than their labeled counterparts. Active learning (AL) directly seeks to reduce the sample complexity by training a classification network and querying unlabeled instances to be annotated by a human-in-the-loop. 
Under relatively strict settings, it has been shown that both SSL and AL can theoretically achieve the same performance as fully-supervised learning (SL) using far fewer labeled samples. While empirical works have shown that SSL can attain this benefit in practice, DL-based AL algorithms have yet to show their success to the extent achieved by SSL. Given the accessible pool of unlabeled instances in pool-based AL, we argue that the annotation efficiency brought by AL algorithms that seek diversity on labeled samples can be improved upon when using SSL as the training scheme. Equipped with a few theoretical insights, we designed an AL algorithm that rather focuses on controlling the convergence rate of a classification network by actively querying instances to improve the rate of convergence upon inclusion to the labeled set. We name this AL scheme convergence rate control (CRC), and our experiments show that a deep neural network trained using a combination of CRC and a recently proposed SSL algorithm can quickly achieve high performance using far fewer labeled samples than SL. In contrast to a few works combining independently developed AL and SSL (ASSL) algorithms, our method is a natural fit to ASSL, and we hope our work can catalyze research combining AL and SSL as opposed to an exclusion of either.",/pdf/aa841b885f44dff285b9447c3dbd6df19dfdc733.pdf,ICLR,2021,We propose a new active learning algorithm inspired by neural tangent kernels and demonstrate its effectiveness when combined with semi-supervised learning. +EQtwFlmq7mx,mH1mujXmpJ0,1601310000000.0,1614990000000.0,3407,"Stochastic Proximal Point Algorithm for Large-scale Nonconvex Optimization: Convergence, Implementation, and Application to Neural Networks","[""aysegul.bumin@ufl.edu"", ""~Kejun_Huang1""]","[""Aysegul Bumin"", ""Kejun Huang""]",[],"We revisit the stochastic proximal point algorithm (SPPA) for large-scale nonconvex optimization problems. 
SPPA has been shown to converge faster and more stably than the celebrated stochastic gradient descent (SGD) algorithm, and its many variations, for convex problems. However, the per-iteration update of SPPA is defined abstractly and has long been considered expensive. In this paper, we show that efficient implementation of SPPA can be achieved. If the problem is a nonlinear least squares problem, each iteration of SPPA can be efficiently implemented by Gauss-Newton; with a linear algebra trick, the resulting complexity is of the same order as that of SGD. For more generic problems, SPPA can still be implemented with L-BFGS or accelerated gradient with high efficiency. Another contribution of this work is the convergence of SPPA to a stationary point in expectation for nonconvex problems. The result is encouraging in that it admits more flexible choices of the step sizes under similar assumptions. The proposed algorithm is elaborated for both regression and classification problems using different neural network structures. Real data experiments showcase its effectiveness in terms of convergence and accuracy compared to SGD and its variants.",/pdf/4d3b299ce69bb905d9b0e8114eb34601b869d132.pdf,ICLR,2021, +S1eYHoC5FX,ryey_J9YFX,1538090000000.0,1550860000000.0,106,DARTS: Differentiable Architecture Search,"[""hanxiaol@cs.cmu.edu"", ""simonyan@google.com"", ""yiming@cs.cmu.edu""]","[""Hanxiao Liu"", ""Karen Simonyan"", ""Yiming Yang""]","[""deep learning"", ""autoML"", ""neural architecture search"", ""image classification"", ""language modeling""]","This paper addresses the scalability challenge of architecture search by formulating the task in a differentiable manner. Unlike conventional approaches of applying evolution or reinforcement learning over a discrete and non-differentiable search space, our method is based on the continuous relaxation of the architecture representation, allowing efficient search of the architecture using gradient descent. 
Extensive experiments on CIFAR-10, ImageNet, Penn Treebank and WikiText-2 show that our algorithm excels in discovering high-performance convolutional architectures for image classification and recurrent architectures for language modeling, while being orders of magnitude faster than state-of-the-art non-differentiable techniques.",/pdf/3b1d8d3eac4b3da2ffdd9566bf835ada35e54fac.pdf,ICLR,2019,"We propose a differentiable architecture search algorithm for both convolutional and recurrent networks, achieving competitive performance with the state of the art using orders of magnitude less computation resources." +BJ8fyHceg,,1478280000000.0,1484190000000.0,229,Tuning Recurrent Neural Networks with Reinforcement Learning,"[""jaquesn@mit.edu"", ""sg717@cam.ac.uk"", ""ret26@cam.ack.uk"", ""deck@google.com""]","[""Natasha Jaques"", ""Shixiang Gu"", ""Richard E. Turner"", ""Douglas Eck""]","[""Deep learning"", ""Reinforcement Learning"", ""Structured prediction"", ""Supervised Learning"", ""Applications""]","The approach of training sequence models using supervised learning and next-step prediction suffers from known failure modes. For example, it is notoriously difficult to ensure multi-step generated sequences have coherent global structure. We propose a novel sequence-learning approach in which we use a pre-trained Recurrent Neural Network (RNN) to supply part of the reward value in a Reinforcement Learning (RL) model. Thus, we can refine a sequence predictor by optimizing for some imposed reward functions, while maintaining good predictive properties learned from data. We propose efficient ways to solve this by augmenting deep Q-learning with a cross-entropy reward and deriving novel off-policy methods for RNNs from KL control. We explore the usefulness of our approach in the context of music generation. An LSTM is trained on a large corpus of songs to predict the next note in a musical sequence. This Note RNN is then refined using our method and rules of music theory. 
We show that by combining maximum likelihood (ML) and RL in this way, we can not only produce more pleasing melodies, but significantly reduce unwanted behaviors and failure modes of the RNN, while maintaining information learned from data.",/pdf/c4f118ae6bf24cd110be68a6280bac447bdfd1df.pdf,ICLR,2017,"RL Tuner is a method for refining an LSTM trained on data by using RL to impose desired behaviors, while maintaining good predictive properties learned from data." +qcKh_Msv1GP,W83Iuk5svvh,1601310000000.0,1614990000000.0,2787,Motif-Driven Contrastive Learning of Graph Representations,"[""~Shichang_Zhang2"", ""~Ziniu_Hu1"", ""arjunsub@ucla.edu"", ""~Yizhou_Sun1""]","[""Shichang Zhang"", ""Ziniu Hu"", ""Arjun Subramonian"", ""Yizhou Sun""]","[""graph neural network"", ""self-supervised learning"", ""contrastive learning"", ""graph motif learning""]","Graph motifs are significant subgraph patterns occurring frequently in graphs, and they play important roles in representing the whole graph characteristics. For example, in the chemical domain, functional groups are motifs that can determine molecule properties. Mining and utilizing motifs, however, is a non-trivial task for large graph datasets. Traditional motif discovery approaches mostly rely on exact counting or statistical estimation, which are hard to scale for a large number of graphs with continuous and high-dimension features. In light of the significance and challenges of motif mining, we propose : MICRO-Graph: a framework for \underline{M}ot\underline{I}f-driven \underline{C}ontrastive lea\underline{R}ning \underline{O}f \underline{G}raph representations to: 1) pre-train Graph Neural Networks (GNNs) in a self-supervised manner to automatically extract graph motifs from large graph datasets; 2) leverage learned motifs to guide the contrastive learning of graph representations, which further benefit various graph downstream tasks. 
Specifically, given a graph dataset, a motif learner clusters similar and significant subgraphs into corresponding motif slots. Based on the learned motifs, a motif-guided subgraph segmenter is trained to generate more informative subgraphs, which are used to conduct graph-to-subgraph contrastive learning of GNNs. Our discovery strategy is to simultaneously perform clustering and contrastive learning on dynamically sampled subgraphs. The clustering part pulls together similar subgraphs across different whole graphs, while the contrastive part pushes away dissimilar ones. Meanwhile, our learnable sampler generates subgraph samples better aligned with the discovery procedure. By pre-training on the ogbn-molhiv molecule dataset with our proposed MICRO-Graph, the pre-trained GNN model can enhance various chemical property prediction downstream tasks with scarce labels by 2.0%, which is significantly higher than other state-of-the-art self-supervised learning baselines. +",/pdf/938edae56f95e180d491633988f2c03bb7077dae.pdf,ICLR,2021,Learn graph motifs and use motifs to benefit contrastive learning of whole graph representations +lQdXeXDoWtI,XQgJ5DH4O7V,1601310000000.0,1615980000000.0,1580,In Search of Lost Domain Generalization,"[""~Ishaan_Gulrajani1"", ""~David_Lopez-Paz2""]","[""Ishaan Gulrajani"", ""David Lopez-Paz""]","[""domain generalization"", ""reproducible research""]","The goal of domain generalization algorithms is to predict well on distributions different from those seen during training. +While a myriad of domain generalization algorithms exist, inconsistencies in experimental conditions---datasets, network architectures, and model selection criteria---render fair comparisons difficult. +The goal of this paper is to understand how useful domain generalization algorithms are in realistic settings. 
+As a first step, we realize that model selection is non-trivial for domain generalization tasks, and we argue that algorithms without a model selection criterion remain incomplete. +Next we implement DomainBed, a testbed for domain generalization including seven benchmarks, fourteen algorithms, and three model selection criteria. +When conducting extensive experiments using DomainBed we find that when carefully implemented and tuned, ERM outperforms the state-of-the-art in terms of average performance. +Furthermore, no algorithm included in DomainBed outperforms ERM by more than one point when evaluated under the same experimental conditions. +We hope that the release of DomainBed, alongside contributions from fellow researchers, will streamline reproducible and rigorous advances in domain generalization.",/pdf/83e32f2664eddb15fdc2ac88f7bff07dd2682647.pdf,ICLR,2021,Our ERM baseline achieves state-of-the-art performance across many domain generalization benchmarks +Hke4l2AcKQ,HyxEZna9Y7,1538090000000.0,1546750000000.0,1064,MAE: Mutual Posterior-Divergence Regularization for Variational AutoEncoders,"[""xuezhem@cs.cmu.edu"", ""ctzhou@cs.cmu.edu"", ""ehovy@cs.cmu.edu""]","[""Xuezhe Ma"", ""Chunting Zhou"", ""Eduard Hovy""]","[""VAE"", ""regularization"", ""auto-regressive""]","Variational Autoencoder (VAE), a simple and effective deep generative model, has led to a number of impressive empirical successes and spawned many advanced variants and theoretical investigations. However, recent studies demonstrate that, when equipped with expressive generative distributions (aka. decoders), VAE suffers from learning uninformative latent representations with the observation called KL Varnishing, in which case VAE collapses into an unconditional generative model. 
In this work, we introduce mutual posterior-divergence regularization, a novel regularization that is able to control the geometry of the latent space to accomplish meaningful representation learning, while achieving comparable or superior capability of density estimation. Experiments on three image benchmark datasets demonstrate that, when equipped with powerful decoders, our model performs well both on density estimation and representation learning.",/pdf/babbf506605364d38ab996f0e6495bc76e74d526.pdf,ICLR,2019, +H1ggKyrYwB,SJg_jmA_wS,1569440000000.0,1577170000000.0,1828,On Incorporating Semantic Prior Knowlegde in Deep Learning Through Embedding-Space Constraints,"[""damien.teney@adelaide.edu.au"", ""ehsan.abbasnejad@adelaide.edu.au"", ""anton.vandenhengel@adelaide.edu.au""]","[""Damien Teney"", ""Ehsan Abbasnejad"", ""Anton van den Hengel""]","[""regularizers"", ""vision"", ""language"", ""vqa"", ""visual question answering""]","The knowledge that humans hold about a problem often extends far beyond a set of training data and output labels. While the success of deep learning mostly relies on supervised training, important properties cannot be inferred efficiently from end-to-end annotations alone, for example causal relations or domain-specific invariances. We present a general technique to supplement supervised training with prior knowledge expressed as relations between training instances. We illustrate the method on the task of visual question answering to exploit various auxiliary annotations, including relations of equivalence and of logical entailment between questions. Existing methods to use these annotations, including auxiliary losses and data augmentation, cannot guarantee the strict inclusion of these relations into the model since they require a careful balancing against the end-to-end objective. Our method uses these relations to shape the embedding space of the model, and treats them as strict constraints on its learned representations. 
The resulting model encodes relations that better generalize across instances. In the context of VQA, this approach brings significant improvements in accuracy and robustness, in particular over the common practice of incorporating the constraints as a soft regularizer. We also show that incorporating this type of prior knowledge with our method brings consistent improvements, independently from the amount of supervised data used. It demonstrates the value of an additional training signal that is otherwise difficult to extract from end-to-end annotations alone.",/pdf/0869aa5fa37d5844ab3de30a838970607b226cd3.pdf,ICLR,2020,Training method to enforce strict constraints on learned embeddings during supervised training. Applied to visual question answering. +B1xU4nAqK7,HyxAtgRctQ,1538090000000.0,1545360000000.0,1451,Unsupervised Exploration with Deep Model-Based Reinforcement Learning,"[""kchua@berkeley.edu"", ""rmcallister@berkeley.edu"", ""roberto.calandra@berkeley.edu"", ""svlevine@eecs.berkeley.edu""]","[""Kurtland Chua"", ""Rowan McAllister"", ""Roberto Calandra"", ""Sergey Levine""]","[""exploration"", ""model based reinforcement learning""]","Reinforcement learning (RL) often requires large numbers of trials to solve a single specific task. This is in sharp contrast to human and animal learning: humans and animals can use past experience to acquire an understanding about the world, which they can then use to perform new tasks with minimal additional learning. In this work, we study how an unsupervised exploration phase can be used to build up such prior knowledge, which can then be utilized in a second phase to perform new tasks, either directly without any additional exploration, or through minimal fine-tuning. A critical question with this approach is: what kind of knowledge should be transferred from the unsupervised phase to the goal-directed phase? We argue that model-based RL offers an appealing solution. 
By transferring models, which are task-agnostic, we can perform new tasks without any additional learning at all. However, this relies on having a suitable exploration method during unsupervised training, and a model-based RL method that can effectively utilize modern high-capacity parametric function classes, such as deep neural networks. We show that both challenges can be addressed by representing model-uncertainty, which can both guide exploration in the unsupervised phase and ensure that the errors in the model are not exploited by the planner in the goal-directed phase. We illustrate, on simple simulated benchmark tasks, that our method can perform various goal-directed skills on the first attempt, and can improve further with fine-tuning, exceeding the performance of alternative exploration methods.",/pdf/175cb1413aa109f9ab1332986c8926a2441b1541.pdf,ICLR,2019, +HkmaTz-0W,r1zj6Gb0Z,1509140000000.0,1518730000000.0,1032,Visualizing the Loss Landscape of Neural Nets,"[""haoli@cs.umd.edu"", ""xuzh@cs.umd.edu"", ""taylor@usna.edu"", ""tomg@cs.umd.edu""]","[""Hao Li"", ""Zheng Xu"", ""Gavin Taylor"", ""Tom Goldstein""]","[""visualization"", ""loss surface"", ""flatness"", ""sharpness""]","Neural network training relies on our ability to find ""good"" minimizers of highly non-convex loss functions. It is well known that certain network architecture designs (e.g., skip connections) produce loss functions that are easier to train, and well-chosen training parameters (batch size, learning rate, optimizer) produce minimizers that generalize better. However, the reasons for these differences, and their effect on the underlying loss landscape, are not well understood. + +In this paper, we explore the structure of neural loss functions, and the effect of loss landscapes on generalization, using a range of visualization methods. 
First, we introduce a simple ""filter normalization"" method that helps us visualize loss function curvature, and make meaningful side-by-side comparisons between loss functions. Then, using a variety of visualizations, we explore how network architecture affects the loss landscape, and how training parameters affect the shape of minimizers.",/pdf/8a3069d820795b1eca9e21c8f7432b08ed642e07.pdf,ICLR,2018,"We explore the structure of neural loss functions, and the effect of loss landscapes on generalization, using a range of visualization methods." +9l9WD4ahJgs,YMn9EQqvXIl,1601310000000.0,1614990000000.0,2299,Automatic Data Augmentation for Generalization in Reinforcement Learning,"[""~Roberta_Raileanu2"", ""~Maxwell_Goldstein1"", ""~Denis_Yarats1"", ""~Ilya_Kostrikov1"", ""~Rob_Fergus1""]","[""Roberta Raileanu"", ""Maxwell Goldstein"", ""Denis Yarats"", ""Ilya Kostrikov"", ""Rob Fergus""]","[""reinforcement learning"", ""generalization"", ""data augmentation""]","Deep reinforcement learning (RL) agents often fail to generalize beyond their training environments. To alleviate this problem, recent work has proposed the use of data augmentation. However, different tasks tend to benefit from different types of augmentations and selecting the right one typically requires expert knowledge. In this paper, we introduce three approaches for automatically finding an effective augmentation for any RL task. These are combined with two novel regularization terms for the policy and value function, required to make the use of data augmentation theoretically sound for actor-critic algorithms. We evaluate our method on the Procgen benchmark which consists of 16 procedurally generated environments and show that it improves test performance by 40% relative to standard RL algorithms. Our approach also outperforms methods specifically designed to improve generalization in RL, thus setting a new state-of-the-art on Procgen. 
In addition, our agent learns policies and representations which are more robust to changes in the environment that are irrelevant for solving the task, such as the background. ",/pdf/ae2f7ad5c2753439062781784eaa3039a4ef7b2a.pdf,ICLR,2021,"We propose an approach for automatically finding an augmentation, which is used to regularize the policy and value function in order to improve generalization in reinforcement learning." +Syxgbh05tQ,rkg9z-AqYm,1538090000000.0,1545360000000.0,1138,Lyapunov-based Safe Policy Optimization,"[""yinlamchow@google.com"", ""ofirnachum@google.com"", ""mohammad.ghavamzadeh@inria.fr"", ""duenez@google.com""]","[""Yinlam Chow"", ""Ofir Nachum"", ""Mohammad Ghavamzadeh"", ""Edgar Guzman-Duenez""]","[""Reinforcement Learning"", ""Safe Learning"", ""Lyapunov Functions"", ""Constrained Markov Decision Problems""]","In many reinforcement learning applications, it is crucial that the agent interacts with the environment only through safe policies, i.e., policies that do not take the agent to certain undesirable situations. These problems are often formulated as a constrained Markov decision process (CMDP) in which the agent's goal is to optimize its main objective while not violating a number of safety constraints. In this paper, we propose safe policy optimization algorithms that are based on the Lyapunov approach to CMDPs, an approach that has well-established theoretical guarantees in control engineering. We first show how to generate a set of state-dependent Lyapunov constraints from the original CMDP safety constraints. We then propose safe policy gradient algorithms that train a neural network policy using DDPG or PPO, while guaranteeing near-constraint satisfaction at every policy update by projecting either the policy parameter or the action onto the set of feasible solutions induced by the linearized Lyapunov constraints. 
Unlike the existing (safe) constrained PG algorithms, ours are more data efficient as they are able to utilize both on-policy and off-policy data. Furthermore, the action-projection version of our algorithms often leads to less conservative policy updates and allows for natural integration into an end-to-end PG training pipeline. We evaluate our algorithms and compare them with CPO and the Lagrangian method on several high-dimensional continuous state and action simulated robot locomotion tasks, in which the agent must satisfy certain safety constraints while minimizing its expected cumulative cost. ",/pdf/fee03d2d032ea8e86de00ee567448d3f474aa254.pdf,ICLR,2019,Safe Reinforcement Learning Algorithms for Continuous Control +HJeOMhA5K7,BkeWVd3qK7,1538090000000.0,1545360000000.0,1277,Human-Guided Column Networks: Augmenting Deep Learning with Advice,"[""mayukh.das1@utdallas.edu"", ""yangyu@hlt.utdallas.edu"", ""devendra.dhami@utdallas.edu"", ""gautam.kunapuli@utdallas.edu"", ""sriraam.natarajan@utdallas.edu""]","[""Mayukh Das"", ""Yang Yu"", ""Devendra Singh Dhami"", ""Gautam Kunapuli"", ""Sriraam Natarajan""]","[""Knowledge-guided learning"", ""Human advice"", ""Column Networks"", ""Knowledge-based relational deep model"", ""Collective classification""]","While extremely successful in several applications, especially with low-level representations; sparse, noisy samples and structured domains (with multiple objects and interactions) are some of the open challenges in most deep models. Column Networks, a deep architecture, can succinctly capture such domain structure and interactions, but may still be prone to sub-optimal learning from sparse and noisy samples. Inspired by the success of human-advice guided learning in AI, especially in data-scarce domains, we propose Knowledge-augmented Column Networks that leverage human advice/knowledge for better learning with noisy/sparse samples. 
Our experiments demonstrate how our approach leads to either superior overall performance or faster convergence.",/pdf/53505ad692482326a1037a294abfd45a8ef112ba.pdf,ICLR,2019,Guiding relation-aware deep models towards better learning with human knowledge. +rkeSiiA5Fm,HyeNTdj5tQ,1538090000000.0,1552520000000.0,617,Deep Learning 3D Shapes Using Alt-az Anisotropic 2-Sphere Convolution,"[""liu66@purdue.edu"", ""yao153@purdue.edu"", ""chihochoi@purdue.edu"", ""asinha@magicleap.com"", ""ramani@purdue.edu""]","[""Min Liu"", ""Fupin Yao"", ""Chiho Choi"", ""Ayan Sinha"", ""Karthik Ramani""]","[""Spherical Convolution"", ""Geometric deep learning"", ""3D shape analysis""]","The ground-breaking performance obtained by deep convolutional neural networks (CNNs) for image processing tasks is inspiring research efforts attempting to extend it for 3D geometric tasks. One of the main challenges in applying CNNs to 3D shape analysis is how to define a natural convolution operator on non-Euclidean surfaces. In this paper, we present a method for applying deep learning to 3D surfaces using their spherical descriptors and alt-az anisotropic convolution on the 2-sphere. A cascaded set of geodesic disk filters rotates on the 2-sphere and collects spherical patterns in order to extract geometric features for various 3D shape analysis tasks. We demonstrate theoretically and experimentally that our proposed method has the potential to bridge the gap between 2D images and 3D shapes with the desired rotation equivariance/invariance, and its effectiveness is evaluated in applications of non-rigid/rigid shape classification and shape retrieval.",/pdf/2e28fb2677d505daa8a0263c694b35e6a63ba67a.pdf,ICLR,2019,A method for applying deep learning to 3D surfaces using their spherical descriptors and alt-az anisotropic convolution on the 2-sphere. 
+HJlA0C4tPS,SJgm-i5dDB,1569440000000.0,1583910000000.0,1451,A Probabilistic Formulation of Unsupervised Text Style Transfer,"[""junxianh@cs.cmu.edu"", ""xinyiw1@cs.cmu.edu"", ""gneubig@cs.cmu.edu"", ""tberg@eng.ucsd.edu""]","[""Junxian He"", ""Xinyi Wang"", ""Graham Neubig"", ""Taylor Berg-Kirkpatrick""]","[""unsupervised text style transfer"", ""deep latent sequence model""]","We present a deep generative model for unsupervised text style transfer that unifies previously proposed non-generative techniques. Our probabilistic approach models non-parallel data from two domains as a partially observed parallel corpus. By hypothesizing a parallel latent sequence that generates each observed sequence, our model learns to transform sequences from one domain to another in a completely unsupervised fashion. In contrast with traditional generative sequence models (e.g. the HMM), our model makes few assumptions about the data it generates: it uses a recurrent language model as a prior and an encoder-decoder as a transduction distribution. While computation of marginal data likelihood is intractable in this model class, we show that amortized variational inference admits a practical surrogate. Further, by drawing connections between our variational objective and other recent unsupervised style transfer and machine translation techniques, we show how our probabilistic view can unify some known non-generative objectives such as backtranslation and adversarial loss. Finally, we demonstrate the effectiveness of our method on a wide range of unsupervised style transfer tasks, including sentiment transfer, formality transfer, word decipherment, author imitation, and related language translation. Across all style transfer tasks, our approach yields substantial gains over state-of-the-art non-generative baselines, including the state-of-the-art unsupervised machine translation techniques that our approach generalizes. 
Further, we conduct experiments on a standard unsupervised machine translation task and find that our unified approach matches the current state-of-the-art.",/pdf/b040383264bfff0ce2ed87f26c4859dcf3ab59be.pdf,ICLR,2020,"We formulate a probabilistic latent sequence model to tackle unsupervised text style transfer, and show its effectiveness across a suite of unsupervised text style transfer tasks. " +BymIbLKgl,,1478220000000.0,1487280000000.0,97,Learning Invariant Representations Of Planar Curves ,"[""paigautam@cs.technion.ac.il"", ""twerd@cs.technion.ac.il"", ""ron@cs.technion.ac.il""]","[""Gautam Pai"", ""Aaron Wetzler"", ""Ron Kimmel""]","[""Computer vision"", ""Deep learning"", ""Supervised Learning"", ""Applications""]","We propose a metric learning framework for the construction of invariant geometric +functions of planar curves for the Euclidean and Similarity group of transformations. +We leverage on the representational power of convolutional neural +networks to compute these geometric quantities. In comparison with axiomatic +constructions, we show that the invariants approximated by the learning architectures +have better numerical qualities such as robustness to noise, resiliency to +sampling, as well as the ability to adapt to occlusion and partiality. 
Finally, we develop +a novel multi-scale representation in a similarity metric learning paradigm.",/pdf/908e82ec04bafe607a8b0d01ebe6e12436ac4929.pdf,ICLR,2017, +ku4sJKvnbwV,m02Pkgcnarz,1601310000000.0,1614990000000.0,1782,Model-Based Reinforcement Learning via Latent-Space Collocation,"[""~Oleh_Rybkin1"", ""zchuning@seas.upenn.edu"", ""~Anusha_Nagabandi1"", ""~Kostas_Daniilidis1"", ""~Igor_Mordatch4"", ""~Sergey_Levine1""]","[""Oleh Rybkin"", ""Chuning Zhu"", ""Anusha Nagabandi"", ""Kostas Daniilidis"", ""Igor Mordatch"", ""Sergey Levine""]","[""visual model-based reinforcement learning"", ""visual planning"", ""long-horizon planning"", ""collocation""]","The ability to construct and execute long-term plans enables intelligent agents to solve complex multi-step tasks and prevents myopic behavior only seeking the short-term reward. Recent work has achieved significant progress on building agents that can predict and plan from raw visual observations. However, existing visual planning methods still require a densely shaped reward that provides the algorithm with a short-term signal that is always easy to optimize. These algorithms fail when the shaped reward is not available as they use simplistic planning methods such as sampling-based random shooting and are unable to plan for a distant goal. Instead, to achieve long-horizon visual control, we propose to use collocation-based planning, a powerful optimal control technique that plans forward a sequence of states while constraining the transitions to be physical. We propose a planning algorithm that adapts collocation to visual planning by leveraging probabilistic latent variable models. A model-based reinforcement learning agent equipped with our planning algorithm significantly outperforms prior model-based agents on challenging visual control tasks with sparse rewards and long-term goals. 
",/pdf/b49e6f52bfd75ff017717eb3bac45a84c28eb0c4.pdf,ICLR,2021,We propose a visual model-based reinforcement agent that uses collocation in the latent space to plan and outperforms prior shooting-based planning methods. +S1EwLkW0W,rJ-wIy-AW,1509120000000.0,1518730000000.0,482,"Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients","[""lukas.balles@tuebingen.mpg.de"", ""ph@tue.mpg.de""]","[""Lukas Balles"", ""Philipp Hennig""]","[""Stochastic Optimization"", ""Deep Learning""]","The ADAM optimizer is exceedingly popular in the deep learning community. Often it works very well, sometimes it doesn’t. Why? We interpret ADAM as a combination of two aspects: for each weight, the update direction is determined by the sign of the stochastic gradient, whereas the update magnitude is solely determined by an estimate of its relative variance. We disentangle these two aspects and analyze them in isolation, shedding light on ADAM’s inner workings. Transferring the ""variance adaptation"" to momentum-SGD gives rise to a novel method, completing the practitioner’s toolbox for problems where ADAM fails.",/pdf/d608f3fd3ce1628b92d6ad15a19268ae25f5440f.pdf,ICLR,2018,Analyzing the popular Adam optimizer +r1xpF0VYDS,BJekJPd_wS,1569440000000.0,1577170000000.0,1263,Quantum algorithm for finding the negative curvature direction,"[""kzha3670@uni.sydney.edu.au"", ""min-hsiu.hsieh@uts.edu.au"", ""liu.liu1@sydney.edu.au"", ""dacheng.tao@sydney.edu.au""]","[""Kaining Zhang"", ""Min-Hsiu Hsieh"", ""Liu Liu"", ""Dacheng Tao""]","[""quantum algorithm"", ""negative curvature""]","We present an efficient quantum algorithm aiming to find the negative curvature direction for escaping the saddle point, which is a critical subroutine for many second-order non-convex optimization algorithms. 
We prove that our algorithm could produce the target state corresponding to the negative curvature direction with query complexity O(polylog(d)ε^(-1)), where d is the dimension of the optimization function. The quantum negative curvature finding algorithm is exponentially faster than any known classical method which takes time at least O(dε^(−1/2)). Moreover, we propose an efficient algorithm to achieve the classical read-out of the target state. Our classical read-out algorithm runs exponentially faster on the degree of d than existing counterparts.",/pdf/16d1b232c8e39c4690f0a24c456383db093e4caf.pdf,ICLR,2020,We present an efficient quantum algorithm aiming to find the negative curvature direction. +Byx5R0NKPr,BJxI6tqdDr,1569440000000.0,1577170000000.0,1441,Learning Calibratable Policies using Programmatic Style-Consistency,"[""ezhan@caltech.edu"", ""atseng@caltech.edu"", ""yyue@caltech.edu"", ""adswamin@microsoft.com"", ""mahauskn@microsoft.com""]","[""Eric Zhan"", ""Albert Tseng"", ""Yisong Yue"", ""Adith Swaminathan"", ""Matthew Hausknecht""]","[""imitation learning"", ""conditional generation"", ""data programming""]","We study the important and challenging problem of controllable generation of long-term sequential behaviors. Solutions to this problem would impact many applications, such as calibrating behaviors of AI agents in games or predicting player trajectories in sports. In contrast to the well-studied areas of controllable generation of images, text, and speech, there are significant challenges that are unique to or exacerbated by generating long-term behaviors: how should we specify the factors of variation to control, and how can we ensure that the generated temporal behavior faithfully demonstrates diverse styles? In this paper, we leverage large amounts of raw behavioral data to learn policies that can be calibrated to generate a diverse range of behavior styles (e.g., aggressive versus passive play in sports). 
Inspired by recent work on leveraging programmatic labeling functions, we present a novel framework that combines imitation learning with data programming to learn style-calibratable policies. Our primary technical contribution is a formal notion of style-consistency as a learning objective, and its integration with conventional imitation learning approaches. We evaluate our framework using demonstrations from professional basketball players and agents in the MuJoCo physics environment, and show that our learned policies can be accurately calibrated to generate interesting behavior styles in both domains.",/pdf/cb6a6c65b18cde25d44a1ae4eb95e8e32a38d861.pdf,ICLR,2020,We introduce a framework for style-consistent imitation of diverse behaviors. +g6OrH2oT5so,WuVN_iURik9,1601310000000.0,1614990000000.0,1071,Bridging the Imitation Gap by Adaptive Insubordination,"[""~Luca_Weihs1"", ""~Unnat_Jain1"", ""~Jordi_Salvador3"", ""~Svetlana_Lazebnik1"", ""~Aniruddha_Kembhavi1"", ""~Alex_Schwing1""]","[""Luca Weihs"", ""Unnat Jain"", ""Jordi Salvador"", ""Svetlana Lazebnik"", ""Aniruddha Kembhavi"", ""Alex Schwing""]","[""Privileged Experts"", ""Imitation Learning"", ""Reinforcement Learning"", ""Actor-Critic"", ""Behavior Cloning"", ""MiniGrid"", ""Knowledge Distillation""]","When expert supervision is available, practitioners often use imitation learning with varying degrees of success. We show that when an expert has access to privileged information that is unavailable to the student, this information is marginalized in the student policy during imitation learning resulting in an ''imitation gap'' and, potentially, poor results. Prior work bridges this gap via a progression from imitation learning to reinforcement learning. While often successful, gradual progression fails for tasks that require frequent switches between exploration and memorization skills. 
To better address these tasks and alleviate the imitation gap we propose 'Adaptive Insubordination' (ADVISOR), which dynamically weights imitation and reward-based reinforcement learning losses during training, enabling switching between imitation and exploration. On a suite of challenging didactic and MiniGrid tasks, we show that ADVISOR outperforms pure imitation, pure reinforcement learning, as well as their sequential and parallel combinations.",/pdf/e804ae83daa38ca4e0fa739f5ea8e5f9849b92e2.pdf,ICLR,2021,"Imitation learning can fail when the expert uses privileged information, we address this by combining imitation and reward-based reinforcement learning losses using dynamic weights." +ryxB2lBtvH,B1xnHxZYPB,1569440000000.0,1583910000000.0,2535,Learning to Coordinate Manipulation Skills via Skill Behavior Diversification,"[""lee504@usc.edu"", ""jingyuny@usc.edu"", ""limjj@usc.edu""]","[""Youngwoon Lee"", ""Jingyun Yang"", ""Joseph J. Lim""]","[""reinforcement learning"", ""hierarchical reinforcement learning"", ""modular framework"", ""skill coordination"", ""bimanual manipulation""]","When mastering a complex manipulation task, humans often decompose the task into sub-skills of their body parts, practice the sub-skills independently, and then execute the sub-skills together. Similarly, a robot with multiple end-effectors can perform complex tasks by coordinating sub-skills of each end-effector. To realize temporal and behavioral coordination of skills, we propose a modular framework that first individually trains sub-skills of each end-effector with skill behavior diversification, and then learns to coordinate end-effectors using diverse behaviors of the skills. We demonstrate that our proposed framework is able to efficiently coordinate skills to solve challenging collaborative control tasks such as picking up a long bar, placing a block inside a container while pushing the container with two robot arms, and pushing a box with two ant agents. 
Videos and code are available at https://clvrai.com/coordination",/pdf/b91b8720df5d00e5d36d1b9877ba04c392abb48a.pdf,ICLR,2020,We propose to tackle complex tasks of multiple agents by learning composable primitive skills and coordination of the skills. +L7Irrt5sMQa,WFutbm3_i1w,1601310000000.0,1614990000000.0,1808,The Surprising Power of Graph Neural Networks with Random Node Initialization,"[""ralph.abboud@cs.ox.ac.uk"", ""~Ismail_Ilkan_Ceylan2"", ""~Martin_Grohe1"", ""~Thomas_Lukasiewicz2""]","[""Ralph Abboud"", ""Ismail Ilkan Ceylan"", ""Martin Grohe"", ""Thomas Lukasiewicz""]","[""graph representation learning"", ""graph neural networks"", ""expressiveness"", ""universality"", ""random node initialization"", ""Weisfeiler-Lehman heuristic"", ""higher-order graph neural networks""]","Graph neural networks (GNNs) are effective models for representation learning on graph-structured data. However, standard GNNs are limited in their expressive power, as they cannot distinguish graphs beyond the capability of the Weisfeiler-Leman (1-WL) graph isomorphism heuristic. This limitation motivated a large body of work, including higher-order GNNs, which are provably more powerful models. To date, higher-order invariant and equivariant networks are the only models with known universality results, but these results are practically hindered by prohibitive computational complexity. Thus, despite their limitations, standard GNNs are commonly used, due to their strong practical performance. In practice, GNNs have shown a promising performance when enhanced with random node initialization (RNI), where the idea is to train and run the models with randomized initial node features. In this paper, we analyze the expressive power of GNNs with RNI, and pose the following question: are GNNs with RNI more expressive than GNNs? 
We prove that this is indeed the case, by showing that GNNs with RNI are universal, a first such result for GNNs not relying on computationally demanding higher-order properties. We then empirically analyze the effect of RNI on GNNs, based on carefully constructed datasets. Our empirical findings support the superior performance of GNNs with RNI over standard GNNs. In fact, we demonstrate that the performance of GNNs with RNI is often comparable with or better than that of higher-order GNNs, while keeping the much lower memory requirements of standard GNNs. However, this improvement typically comes at the cost of slower model convergence. Somewhat surprisingly, we found that the convergence rate and the accuracy of the models can be improved by using only a partial random initialization regime.",/pdf/67a08c8b7e475c6134b3b5e8211691940829f4c9.pdf,ICLR,2021,"We prove that graph neural networks with random node initialization are universal, and verify this theoretical expressivity with a detailed empirical evaluation using carefully constructed datasets." +Syoiqwcxx,,1478290000000.0,1484240000000.0,418,Local minima in training of deep networks,"[""swirszcz@google.com"", ""lejlot@google.com"", ""razp@google.com""]","[""Grzegorz Swirszcz"", ""Wojciech Marian Czarnecki"", ""Razvan Pascanu""]","[""Theory"", ""Deep learning"", ""Supervised Learning"", ""Optimization""]","There has been a lot of recent interest in trying to characterize the error surface of deep models. This stems from a long standing question. Given that deep networks are highly nonlinear systems optimized by local gradient methods, why do they not seem to be affected by bad local minima? It is widely believed that training of deep models using gradient methods works so well because the error surface either has no local minima, or if they exist they need to be close in value to the global minimum. It is known that such results hold under strong assumptions which are not satisfied by real models. 
In this paper we present examples showing that for such theorem to be true additional assumptions on the data, initialization schemes and/or the model classes have to be made. We look at the particular case of finite size datasets. We demonstrate that in this scenario one can construct counter-examples (datasets or initialization schemes) when the network does become susceptible to bad local minima over the weight space.",/pdf/4fb369cc2525eceac892afa2dae6377f8ddfc7c0.pdf,ICLR,2017,"As a contribution to the discussion about error surface and the question why ""deep and cheap"" learning works so well we present concrete examples of local minima and obstacles arising in the training of deep models." +BJxvEh0cFQ,ByxHgT35KQ,1538090000000.0,1550890000000.0,1458,K for the Price of 1: Parameter-efficient Multi-task and Transfer Learning,"[""pramodkm@uchicago.edu"", ""mark.sandler@gmail.com"", ""azhmogin@google.com"", ""howarda@google.com""]","[""Pramod Kaushik Mudrakarta"", ""Mark Sandler"", ""Andrey Zhmoginov"", ""Andrew Howard""]","[""deep learning"", ""mobile"", ""transfer learning"", ""multi-task learning"", ""computer vision"", ""small models"", ""imagenet"", ""inception"", ""batch normalization""]","We introduce a novel method that enables parameter-efficient transfer and multi-task learning with deep neural networks. The basic approach is to learn a model patch - a small set of parameters - that will specialize to each task, instead of fine-tuning the last layer or the entire network. For instance, we show that learning a set of scales and biases is sufficient to convert a pretrained network to perform well on qualitatively different problems (e.g. converting a Single Shot MultiBox Detection (SSD) model into a 1000-class image classification model while reusing 98% of parameters of the SSD feature extractor). 
Similarly, we show that re-learning existing low-parameter layers (such as depth-wise convolutions) while keeping the rest of the network frozen also improves transfer-learning accuracy significantly. Our approach allows both simultaneous (multi-task) as well as sequential transfer learning. In several multi-task learning problems, despite using much fewer parameters than traditional logits-only fine-tuning, we match single-task performance. +",/pdf/9eaa0161656b9baebf26e14f030b153f69d764e0.pdf,ICLR,2019,"A novel and practically effective method to adapt pretrained neural networks to new tasks by retraining a minimal (e.g., less than 2%) number of parameters" +BJepq2VtDB,B1ghboWgDS,1569440000000.0,1577170000000.0,131,Training Deep Networks with Stochastic Gradient Normalized by Layerwise Adaptive Second Moments,"[""boris.ginsburg@gmail.com"", ""pcastonguay@nvidia.com"", ""grinchuk.alexey@gmail.com"", ""kuchaev@gmail.com"", ""vlavrukhin@yahoo.com"", ""rleary@nvidia.com"", ""jasoli@nvidia.com"", ""huyenntkvn@gmail.com"", ""yangzhang@nvidia.com"", ""jocohen@nvidia.com""]","[""Boris Ginsburg"", ""Patrice Castonguay"", ""Oleksii Hrinchuk"", ""Oleksii Kuchaiev"", ""Vitaly Lavrukhin"", ""Ryan Leary"", ""Jason Li"", ""Huyen Nguyen"", ""Yang Zhang"", ""Jonathan M. Cohen""]","[""deep learning"", ""optimization"", ""SGD"", ""Adam"", ""NovoGrad"", ""large batch training""]","We propose NovoGrad, an adaptive stochastic gradient descent method with layer-wise gradient normalization and decoupled weight decay. In our experiments on neural networks for image classification, speech recognition, machine translation, and language modeling, it performs on par or better than well tuned SGD with momentum and Adam/AdamW. 
+Additionally, NovoGrad (1) is robust to the choice of learning rate and weight initialization, (2) works well in a large batch setting, and (3) has two times smaller memory footprint than Adam.",/pdf/158c3188dbf0472fd77d354715021ad4e271cb55.pdf,ICLR,2020,NovoGrad - an adaptive SGD method with layer-wise gradient normalization and decoupled weight decay. +rklr9kHFDB,Byxa19CuvH,1569440000000.0,1583910000000.0,1876,Rotation-invariant clustering of neuronal responses in primary visual cortex,"[""ivan.ustyuzhaninov@bethgelab.org"", ""santiago.cadena@bethgelab.org"", ""froudara@bcm.edu"", ""paul.fahey@bcm.edu"", ""eywalker@bcm.edu"", ""ecobos@bcm.edu"", ""reimer@bcm.edu"", ""fabian.sinz@bcm.edu"", ""astolias@bcm.edu"", ""matthias@bethgelab.org"", ""alexander.ecker@uni-tuebingen.de""]","[""Ivan Ustyuzhaninov"", ""Santiago A. Cadena"", ""Emmanouil Froudarakis"", ""Paul G. Fahey"", ""Edgar Y. Walker"", ""Erick Cobos"", ""Jacob Reimer"", ""Fabian H. Sinz"", ""Andreas S. Tolias"", ""Matthias Bethge"", ""Alexander S. Ecker""]","[""computational neuroscience"", ""neural system identification"", ""functional cell types"", ""deep learning"", ""rotational equivariance""]","Similar to a convolutional neural network (CNN), the mammalian retina encodes visual information into several dozen nonlinear feature maps, each formed by one ganglion cell type that tiles the visual space in an approximately shift-equivariant manner. Whether such organization into distinct cell types is maintained at the level of cortical image processing is an open question. Predictive models building upon convolutional features have been shown to provide state-of-the-art performance, and have recently been extended to include rotation equivariance in order to account for the orientation selectivity of V1 neurons. 
However, generally no direct correspondence between CNN feature maps and groups of individual neurons emerges in these models, thus rendering it an open question whether V1 neurons form distinct functional clusters. Here we build upon the rotation-equivariant representation of a CNN-based V1 model and propose a methodology for clustering the representations of neurons in this model to find functional cell types independent of preferred orientations of the neurons. We apply this method to a dataset of 6000 neurons and visualize the preferred stimuli of the resulting clusters. Our results highlight the range of non-linear computations in mouse V1.",/pdf/8047d6ef874c09100bca5cd5dcaf81258fbd5bcd.pdf,ICLR,2020,We classify mouse V1 neurons into putative functional cell types based on their representations in a CNN predicting neural responses +rJo9n9Feg,,1478240000000.0,1481610000000.0,117,Chess Game Concepts Emerge under Weak Supervision: A Case Study of Tic-tac-toe,"[""zhao-h13@mails.tsinghua.edu.cn"", ""lu-m13@mails.tsinghua.edu.cn"", ""anbang.yao@intel.com"", ""yurong.chen@intel.com"", ""chinazhangli@mail.tsinghua.edu.cn""]","[""Hao Zhao"", ""Ming Lu"", ""Anbang Yao"", ""Yurong Chen"", ""Li Zhang""]","[""Semi-Supervised Learning""]","This paper explores the possibility of learning chess game concepts under weak supervision with convolutional neural networks, which is a topic that has not been visited to the best of our knowledge. We put this task in three different backgrounds: (1) deep reinforcement learning has shown an amazing capability to learn a mapping from visual inputs to most rewarding actions, without knowing the concepts of a video game. But how could we confirm that the network understands these concepts or it just does not? (2) cross-modal supervision for visual representation learning draws much attention recently. Is this methodology still applicable when it comes to the domain of game concepts and actions? 
(3) class activation mapping is widely recognized as a visualization technique to help us understand what a network has learnt. Is it possible for it to activate at non-salient regions? With the simplest chess game tic-tac-toe, we report interesting results as answers to those three questions mentioned above. All codes, pre-processed datasets and pre-trained models will be released.",/pdf/479d3f447c4e6159ef11734a26a7bdfd1d82d7b5.pdf,ICLR,2017,investigating whether a CNN understands concepts from a new perspective +#NAME?,25pN8C0sCh-,1601310000000.0,1611610000000.0,1592,Categorical Normalizing Flows via Continuous Transformations,"[""~Phillip_Lippe1"", ""~Efstratios_Gavves1""]","[""Phillip Lippe"", ""Efstratios Gavves""]","[""Normalizing Flows"", ""Density Estimation"", ""Graph Generation""]","Despite their popularity, to date, the application of normalizing flows on categorical data stays limited. The current practice of using dequantization to map discrete data to a continuous space is inapplicable as categorical data has no intrinsic order. Instead, categorical data have complex and latent relations that must be inferred, like the synonymy between words. In this paper, we investigate Categorical Normalizing Flows, that is normalizing flows for categorical data. By casting the encoding of categorical data in continuous space as a variational inference problem, we jointly optimize the continuous representation and the model likelihood. Using a factorized decoder, we introduce an inductive bias to model any interactions in the normalizing flow. As a consequence, we do not only simplify the optimization compared to having a joint decoder, but also make it possible to scale up to a large number of categories that is currently impossible with discrete normalizing flows. Based on Categorical Normalizing Flows, we propose GraphCNF a permutation-invariant generative model on graphs. 
GraphCNF implements a three step approach modeling the nodes, edges, and adjacency matrix stepwise to increase efficiency. On molecule generation, GraphCNF outperforms both one-shot and autoregressive flow-based state-of-the-art. +",/pdf/ce39823e0ee0cf47b03da3ce20b4927562d5b1b5.pdf,ICLR,2021,"We explore the application of normalizing flows on categorical data and propose a permutation-invariant generative model on graphs, called GraphCNF." +ZpS34ymonwE,i6QEt6iTXHJ,1601310000000.0,1614990000000.0,424,Meta Adversarial Training,"[""~Jan_Hendrik_Metzen1"", ""~Nicole_Finnie2"", ""~Robin_Hutmacher1""]","[""Jan Hendrik Metzen"", ""Nicole Finnie"", ""Robin Hutmacher""]","[""robustness"", ""adversarial examples"", ""adversarial training"", ""physical-world adversarial attacks"", ""adversarial patch"", ""universal perturbation""]","Recently demonstrated physical-world adversarial attacks have exposed vulnerabilities in perception systems that pose severe risks for safety-critical applications such as autonomous driving. These attacks place adversarial artifacts in the physical world that indirectly cause the addition of universal perturbations to inputs of a model that can fool it in a variety of contexts. Adversarial training is the most effective defense against image-dependent adversarial attacks. However, tailoring adversarial training to universal perturbations is computationally expensive since the optimal universal perturbations depend on the model weights which change during training. We propose meta adversarial training (MAT), a novel combination of adversarial training with meta-learning, which overcomes this challenge by meta-learning universal perturbations along with model training. MAT requires little extra computation while continuously adapting a large set of perturbations to the current model. We present results for universal patch and universal perturbation attacks on image classification and traffic-light detection. 
MAT considerably increases robustness against universal patch attacks compared to prior work. ",/pdf/009b18427e01e4136476d01cd1865979b9fad7e9.pdf,ICLR,2021,"We propose Meta Adversarial Training (MAT), which allows efficiently training models with largely increased robustness against universal patch and universal perturbation attacks." +SJlpM3RqKQ,Byg0eunKFQ,1538090000000.0,1545360000000.0,1311,Expanding the Reach of Federated Learning by Reducing Client Resource Requirements,"[""scaldas@cmu.edu"", ""konkey@google.com"", ""mcmahan@google.com"", ""talwalkar@cmu.edu""]","[""Sebastian Caldas"", ""Jakub Kone\u010dn\u00fd"", ""Brendan McMahan"", ""Ameet Talwalkar""]",[],"Communication on heterogeneous edge networks is a fundamental bottleneck in Federated Learning (FL), restricting both model capacity and user participation. To address this issue, we introduce two novel strategies to reduce communication costs: (1) the use of lossy compression on the global model sent server-to-client; and (2) Federated Dropout, which allows users to efficiently train locally on smaller subsets of the global model and also provides a reduction in both client-to-server communication and local computation. We empirically show that these strategies, combined with existing compression approaches for client-to-server communication, collectively provide up to a 9.6x reduction in server-to-client communication, a 1.5x reduction in local computation, and a 24x reduction in upload communication, all without degrading the quality of the final model. 
We thus comprehensively reduce FL's impact on client device resources, allowing higher capacity models to be trained, and a more diverse set of users to be reached.",/pdf/d5a0a28825a2bd3bd1bdc6d690341f029806d152.pdf,ICLR,2019, +r1Usiwcex,,1478290000000.0,1484950000000.0,424,Counterpoint by Convolution,"[""chengzhiannahuang@gmail.com"", ""tim.cooijmans@umontreal.ca"", ""adarob@google.com"", ""aaron.courville@umontreal.ca"", ""deck@google.com""]","[""Cheng-Zhi Anna Huang"", ""Tim Cooijmans"", ""Adam Roberts"", ""Aaron Courville"", ""Douglas Eck""]","[""Deep learning"", ""Applications"", ""Unsupervised Learning""]","Machine learning models of music typically break down the task of composition into a chronological process, composing a piece of music in a single pass from beginning to end. On the contrary, human composers write music in a nonlinear fashion, scribbling motifs here and there, often revisiting choices previously made. We explore the use of blocked Gibbs sampling as an analogue to the human approach, and introduce Coconet, a convolutional neural network in the NADE family of generative models. Despite ostensibly sampling from the same distribution as the NADE ancestral sampling procedure, we find that a blocked Gibbs approach significantly improves sample quality. We provide evidence that this is due to some conditional distributions being poorly modeled. Moreover, we show that even the cheap approximate blocked Gibbs procedure from Yao et al. (2014) yields better samples than ancestral sampling. 
We demonstrate the versatility of our method on unconditioned polyphonic music generation.",/pdf/fbfa9f4044a6f033361d0e1c10d31e61e4f15a36.pdf,ICLR,2017,"NADE generative model of music, with new insights on sampling" +5l9zj5G7vDY,LCGBC15KNA,1601310000000.0,1615890000000.0,3098,Spatially Structured Recurrent Modules,"[""~Nasim_Rahaman1"", ""~Anirudh_Goyal1"", ""~Muhammad_Waleed_Gondal1"", ""~Manuel_Wuthrich1"", ""~Stefan_Bauer1"", ""~Yash_Sharma1"", ""~Yoshua_Bengio1"", ""~Bernhard_Sch\u00f6lkopf1""]","[""Nasim Rahaman"", ""Anirudh Goyal"", ""Muhammad Waleed Gondal"", ""Manuel Wuthrich"", ""Stefan Bauer"", ""Yash Sharma"", ""Yoshua Bengio"", ""Bernhard Sch\u00f6lkopf""]","[""spatio-temporal modelling"", ""modular architectures"", ""recurrent neural networks"", ""partially observed environments""]","Capturing the structure of a data-generating process by means of appropriate inductive biases can help in learning models that generalise well and are robust to changes in the input distribution. While methods that harness spatial and temporal structures find broad application, recent work has demonstrated the potential of models that leverage sparse and modular structure using an ensemble of sparingly interacting modules. In this work, we take a step towards dynamic models that are capable of simultaneously exploiting both modular and spatiotemporal structures. To this end, we model the dynamical system as a collection of autonomous but sparsely interacting sub-systems that interact according to a learned topology which is informed by the spatial structure of the underlying system. This gives rise to a class of models that are well suited for capturing the dynamics of systems that only offer local views into their state, along with corresponding spatial locations of those views. 
On the tasks of video prediction from cropped frames and multi-agent world modelling from partial observations in the challenging Starcraft2 domain, we find our models to be more robust to the number of available views and better capable of generalisation to novel tasks without additional training than strong baselines that perform equally well or better on the training distribution. ",/pdf/3590e3dd48376daa86d4fee6c6cb3c8b051d03b9.pdf,ICLR,2021,We model a dynamical system as a collection of recurrent modules that interact according to a spatially informed but learned topology. +S1e4Q6EtDH,H1lMaa0UwB,1569440000000.0,1577170000000.0,445,Tensorized Embedding Layers for Efficient Model Compression,"[""oleksii.hrinchuk@skoltech.ru"", ""khrulkov.v@gmail.com"", ""leyla.mirvakhabova@skoltech.ru"", ""i.oseledets@skoltech.ru""]","[""Oleksii Hrinchuk"", ""Valentin Khrulkov"", ""Leyla Mirvakhabova"", ""Ivan Oseledets""]","[""Embedding layers compression"", ""tensor networks"", ""low-rank factorization""]","The embedding layers transforming input words into real vectors are the key components of deep neural networks used in natural language processing. However, when the vocabulary is large, the corresponding weight matrices can be enormous, which precludes their deployment in a limited resource setting. We introduce a novel way of parametrizing embedding layers based on the Tensor Train (TT) decomposition, which allows compressing the model significantly at the cost of a negligible drop or even a slight gain in performance. We evaluate our method on a wide range of benchmarks in natural language processing and analyze the trade-off between performance and compression ratios for a wide range of architectures, from MLPs to LSTMs and Transformers.",/pdf/806b36b718d81cdd04610220370d676a000dda7f.pdf,ICLR,2020,Embedding layers are factorized with Tensor Train decomposition to reduce their memory footprint. 
+6k7VdojAIK,E-ZSAMn3L3_y,1601310000000.0,1616070000000.0,767,Practical Massively Parallel Monte-Carlo Tree Search Applied to Molecular Design,"[""~Xiufeng_Yang1"", ""~Tanuj_Aasawat1"", ""~Kazuki_Yoshizoe2""]","[""Xiufeng Yang"", ""Tanuj Aasawat"", ""Kazuki Yoshizoe""]","[""parallel Monte Carlo Tree Search (MCTS)"", ""Upper Confidence bound applied to Trees (UCT)"", ""molecular design""]","It is common practice to use large computational resources to train neural networks, known from many examples, such as reinforcement learning applications. However, while massively parallel computing is often used for training models, it is rarely used to search solutions for combinatorial optimization problems. This paper proposes a novel massively parallel Monte-Carlo Tree Search (MP-MCTS) algorithm that works efficiently for a 1,000 worker scale on a distributed memory environment using multiple compute nodes and applies it to molecular design. This paper is the first work that applies distributed MCTS to a real-world and non-game problem. Existing works on large-scale parallel MCTS show efficient scalability in terms of the number of rollouts up to 100 workers. Still, they suffer from the degradation in the quality of the solutions. MP-MCTS maintains the search quality at a larger scale. By running MP-MCTS on 256 CPU cores for only 10 minutes, we obtained candidate molecules with similar scores to non-parallel MCTS running for 42 hours. Moreover, our results based on parallel MCTS (combined with a simple RNN model) significantly outperform existing state-of-the-art work. Our method is generic and is expected to speed up other applications of MCTS.",/pdf/87e21d56e100df25c3170406746c2f73d33dfc66.pdf,ICLR,2021,Novel massively parallel MCTS achieves state-of-the-art score in molecular design benchmark. 
+HJWLfGWRb,BygUfzbAW,1509130000000.0,1520440000000.0,789,Matrix capsules with EM routing,"[""geoffhinton@google.com"", ""sasabour@google.com"", ""frosst@google.com""]","[""Geoffrey E Hinton"", ""Sara Sabour"", ""Nicholas Frosst""]","[""Computer Vision"", ""Deep Learning"", ""Dynamic routing""]","A capsule is a group of neurons whose outputs represent different properties of the same entity. Each layer in a capsule network contains many capsules. We describe a version of capsules in which each capsule has a logistic unit to represent the presence of an entity and a 4x4 matrix which could learn to represent the relationship between that entity and the viewer (the pose). A capsule in one layer votes for the pose matrix of many different capsules in the layer above by multiplying its own pose matrix by trainable viewpoint-invariant transformation matrices that could learn to represent part-whole relationships. Each of these votes is weighted by an assignment coefficient. These coefficients are iteratively updated for each image using the Expectation-Maximization algorithm such that the output of each capsule is routed to a capsule in the layer above that receives a cluster of similar votes. The transformation matrices are trained discriminatively by backpropagating through the unrolled iterations of EM between each pair of adjacent capsule layers. On the smallNORB benchmark, capsules reduce the number of test errors by 45\% compared to the state-of-the-art. Capsules also show far more resistance to white box adversarial attacks than our baseline convolutional neural network.",/pdf/8f973934873678bd6d0ed09097bcf11760c465f6.pdf,ICLR,2018,"Capsule networks with learned pose matrices and EM routing improves state of the art classification on smallNORB, improves generalizability to new view points, and white box adversarial robustness. 
" +ESVGfJM9a7,XUZxOLa9AB,1601310000000.0,1614990000000.0,980,Neural Point Process for Forecasting Spatiotemporal Events,"[""~Zihao_Zhou1"", ""~Xingyi_Yang1"", ""~Xinyi_He1"", ""~Ryan_Rossi1"", ""~Handong_Zhao3"", ""~Rose_Yu1""]","[""Zihao Zhou"", ""Xingyi Yang"", ""Xinyi He"", ""Ryan Rossi"", ""Handong Zhao"", ""Rose Yu""]","[""spatiotemporal point process"", ""deep sequence models"", ""time series""]","Forecasting events occurring in space and time is a fundamental problem. Existing neural point process models are only temporal and are limited in spatial inference. We propose a family of deep sequence models that integrate spatiotemporal point processes with deep neural networks. Our novel Neural Spatiotemporal Point Process model is flexible, efficient, and can accurately predict irregularly sampled events. The key construction of our approach is based on space-time separation of temporal intensity function and time-conditioned spatial density function, which is approximated by kernel density estimation. We validate our model on the synthetic spatiotemporal Hawkes process and self-correcting process. On many benchmark spatiotemporal event forecasting datasets, our model demonstrates superior performances. To the best of our knowledge, this is the first neural point process model that can jointly predict the continuous space and time of events. ",/pdf/df4bd9e4df4daa538c8801f93e30ef1c598ffa66.pdf,ICLR,2021,"A novel Neural Spatiotemporal Point Process model for irregularly sampled spatiotemporal event forecasting, which integrates deep neural networks with spatiotemporal point processes." 
+r1gRTCVFvB,S1l3uVqdDS,1569440000000.0,1583910000000.0,1412,Decoupling Representation and Classifier for Long-Tailed Recognition,"[""kang@u.nus.edu"", ""xiesaining@gmail.com"", ""maroffm@gmail.com"", ""zhicheng.yan@live.com"", ""albert.gordo.s@gmail.com"", ""elefjia@nus.edu.sg"", ""ykalant@image.ntua.gr""]","[""Bingyi Kang"", ""Saining Xie"", ""Marcus Rohrbach"", ""Zhicheng Yan"", ""Albert Gordo"", ""Jiashi Feng"", ""Yannis Kalantidis""]","[""long-tailed recognition"", ""classification""]","The long-tail distribution of the visual world poses great challenges for deep learning based classification models on how to handle the class imbalance problem. Existing solutions usually involve class-balancing strategies, e.g., by loss re-weighting, data re-sampling, or transfer learning from head- to tail-classes, but most of them adhere to the scheme of jointly learning representations and classifiers. In this work, we decouple the learning procedure into representation learning and classification, and systematically explore how different balancing strategies affect them for long-tailed recognition. The findings are surprising: (1) data imbalance might not be an issue in learning high-quality representations; (2) with representations learned with the simplest instance-balanced (natural) sampling, it is also possible to achieve strong long-tailed recognition ability by adjusting only the classifier. We conduct extensive experiments and set new state-of-the-art performance on common long-tailed benchmarks like ImageNet-LT, Places-LT and iNaturalist, showing that it is possible to outperform carefully designed losses, sampling strategies, even complex modules with memory, by using a straightforward approach that decouples representation and classification. 
Our code is available at https://github.com/facebookresearch/classifier-balancing.",/pdf/2be5582a3de2661961b882fb82da2c1937c67162.pdf,ICLR,2020, +S1xnXRVFwH,SJxwFRB_DB,1569440000000.0,1583910000000.0,1053,Playing the lottery with rewards and multiple languages: lottery tickets in RL and NLP,"[""haonanu@gmail.com"", ""edunov@fb.com"", ""yuandong@fb.com"", ""arimorcos@gmail.com""]","[""Haonan Yu"", ""Sergey Edunov"", ""Yuandong Tian"", ""Ari S. Morcos""]","[""lottery tickets"", ""nlp"", ""transformer"", ""rl"", ""reinforcement learning""]","The lottery ticket hypothesis proposes that over-parameterization of deep neural networks (DNNs) aids training by increasing the probability of a “lucky” sub-network initialization being present rather than by helping the optimization process (Frankle & Carbin, 2019). Intriguingly, this phenomenon suggests that initialization strategies for DNNs can be improved substantially, but the lottery ticket hypothesis has only previously been tested in the context of supervised learning for natural image tasks. Here, we evaluate whether “winning ticket” initializations exist in two different domains: natural language processing (NLP) and reinforcement learning (RL). For NLP, we examined both recurrent LSTM models and large-scale Transformer models (Vaswani et al., 2017). For RL, we analyzed a number of discrete-action space tasks, including both classic control and pixel control. Consistent with work in supervised image classification, we confirm that winning ticket initializations generally outperform parameter-matched random initializations, even at extreme pruning rates for both NLP and RL. Notably, we are able to find winning ticket initializations for Transformers which enable models one-third the size to achieve nearly equivalent performance. 
Together, these results suggest that the lottery ticket hypothesis is not restricted to supervised learning of natural images, but rather represents a broader phenomenon in DNNs.",/pdf/f018a57a0d0b930f39fb95f78e7928d9e1616dd9.pdf,ICLR,2020,"We find that the lottery ticket phenomenon is present in both NLP and RL, and find that it can be used to train compressed Transformers to high performance" +ryeFY0EFwS,rkxGdBOuvS,1569440000000.0,1583910000000.0,1256,Coherent Gradients: An Approach to Understanding Generalization in Gradient Descent-based Optimization,"[""satrajit@gmail.com""]","[""Satrajit Chatterjee""]","[""generalization"", ""deep learning""]","An open question in the Deep Learning community is why neural networks trained with Gradient Descent generalize well on real datasets even though they are capable of fitting random data. We propose an approach to answering this question based on a hypothesis about the dynamics of gradient descent that we call Coherent Gradients: Gradients from similar examples are similar and so the overall gradient is stronger in certain directions where these reinforce each other. Thus changes to the network parameters during training are biased towards those that (locally) simultaneously benefit many examples when such similarity exists. We support this hypothesis with heuristic arguments and perturbative experiments and outline how this can explain several common empirical observations about Deep Learning. Furthermore, our analysis is not just descriptive, but prescriptive. It suggests a natural modification to gradient descent that can greatly reduce overfitting.",/pdf/2484d08967b6883dd5824cccfb67216ee65d38b6.pdf,ICLR,2020,We propose a hypothesis for why gradient descent generalizes based on how per-example gradients interact with each other. 
+IgIk8RRT-Z,fy-RxiuI-K7,1601310000000.0,1616030000000.0,2447,CompOFA – Compound Once-For-All Networks for Faster Multi-Platform Deployment,"[""~Manas_Sahni1"", ""shreyavarshini@gatech.edu"", ""~Alind_Khare1"", ""atumanov@gatech.edu""]","[""Manas Sahni"", ""Shreya Varshini"", ""Alind Khare"", ""Alexey Tumanov""]","[""Efficient Deep Learning"", ""Latency-aware Neural Architecture Search"", ""AutoML""]","The emergence of CNNs in mainstream deployment has necessitated methods to design and train efficient architectures tailored to maximize the accuracy under diverse hardware and latency constraints. To scale these resource-intensive tasks with an increasing number of deployment targets, Once-For-All (OFA) proposed an approach to jointly train several models at once with a constant training cost. However, this cost remains as high as 40-50 GPU days and also suffers from a combinatorial explosion of sub-optimal model configurations. We seek to reduce this search space -- and hence the training budget -- by constraining search to models close to the accuracy-latency Pareto frontier. We incorporate insights of compound relationships between model dimensions to build CompOFA, a design space smaller by several orders of magnitude. Through experiments on ImageNet, we demonstrate that even with simple heuristics we can achieve a 2x reduction in training time and 216x speedup in model search/extraction time compared to the state of the art, without loss of Pareto optimality! We also show that this smaller design space is dense enough to support equally accurate models for a similar diversity of hardware and latency targets, while also reducing the complexity of the training and subsequent extraction algorithms. Our source code is available at https://github.com/gatech-sysml/CompOFA",/pdf/cd9ed036121abc86a3630081eb6c6264788c8194.pdf,ICLR,2021,CNN design-space and system insights for faster latency-guided training and searching of models for diverse deployment targets. 
+Byg5KyHYwr,HkgFXDC_DS,1569440000000.0,1577170000000.0,1851,Self-Imitation Learning via Trajectory-Conditioned Policy for Hard-Exploration Tasks,"[""guoyijie@umich.edu"", ""jwook@umich.edu"", ""moczulski@google.com"", ""bengio@google.com"", ""mnorouzi@google.com"", ""honglak@google.com""]","[""Yijie Guo"", ""Jongwook Choi"", ""Marcin Moczulski"", ""Samy Bengio"", ""Mohammad Norouzi"", ""Honglak Lee""]","[""imitation learning"", ""hard-exploration tasks"", ""exploration and exploitation""]","Imitation learning from human-expert demonstrations has been shown to be greatly helpful for challenging reinforcement learning problems with sparse environment rewards. However, it is very difficult to achieve similar success without relying on expert demonstrations. Recent works on self-imitation learning showed that imitating the agent's own past good experience could indirectly drive exploration in some environments, but these methods often lead to sub-optimal and myopic behavior. To address this issue, we argue that exploration in diverse directions by imitating diverse trajectories, instead of focusing on limited good trajectories, is more desirable for the hard-exploration tasks. We propose a new method of learning a trajectory-conditioned policy to imitate diverse trajectories from the agent's own past experiences and show that such self-imitation helps avoid myopic behavior and increases the chance of finding a globally optimal solution for hard-exploration tasks, especially when there are misleading rewards. Our method significantly outperforms existing self-imitation learning and count-based exploration methods on various hard-exploration tasks with local optima. 
In particular, we report a state-of-the-art score of more than 20,000 points on Montezumas Revenge without using expert demonstrations or resetting to arbitrary states.",/pdf/574d0a4dfa347dfe9cefed9f9ce8d3fa23ec61e7.pdf,ICLR,2020,Self-imitation learning of diverse trajectories with trajectory-conditioned policy +u4WfreuXxnk,RNfRQqH1JI5,1601310000000.0,1614990000000.0,421,Single-Node Attack for Fooling Graph Neural Networks,"[""benfin@campus.technion.ac.il"", ""~Chaim_Baskin1"", ""~Evgenii_Zheltonozhskii1"", ""~Uri_Alon1""]","[""Ben Finkelshtein"", ""Chaim Baskin"", ""Evgenii Zheltonozhskii"", ""Uri Alon""]","[""graphs"", ""GNN"", ""adversarial"", ""attack""]","Graph neural networks (GNNs) have shown broad applicability in a variety of domains. +Some of these domains, such as social networks and product recommendations, are fertile ground for malicious users and behavior. +In this paper, we show that GNNs are vulnerable to the extremely limited scenario of a single-node adversarial example, where the node cannot be picked by the attacker. +That is, an attacker can force the GNN to classify any target node to a chosen label by only slightly perturbing another single arbitrary node in the graph, even when not being able to pick that specific attacker node. When the adversary is allowed to pick a specific attacker node, the attack is even more effective. +We show that this attack is effective across various GNN types (e.g., GraphSAGE, GCN, GAT, and GIN), across a variety of real-world datasets, and as a targeted and non-targeted attack. +Our code is available anonymously at https://github.com/gnnattack/SINGLE .",/pdf/6696010c6581e0ec94c2d7c97182ced7f103d2da.pdf,ICLR,2021,GNNs are vulnerable to adversarial attacks from a single attacker node. 
+ZDnzZrTqU9N,rGcpqFHBdPO,1601310000000.0,1617180000000.0,3696,Modeling the Second Player in Distributionally Robust Optimization,"[""~Paul_Michel1"", ""~Tatsunori_Hashimoto1"", ""~Graham_Neubig1""]","[""Paul Michel"", ""Tatsunori Hashimoto"", ""Graham Neubig""]","[""distributionally robust optimization"", ""deep learning"", ""robustness"", ""adversarial learning""]","Distributionally robust optimization (DRO) provides a framework for training machine learning models that are able to perform well on a collection of related data distributions (the ""uncertainty set""). This is done by solving a min-max game: the model is trained to minimize its maximum expected loss among all distributions in the uncertainty set. While careful design of the uncertainty set is critical to the success of the DRO procedure, previous work has been limited to relatively simple alternatives that keep the min-max optimization problem exactly tractable, such as $f$-divergence balls. In this paper, we argue instead for the use of neural generative models to characterize the worst-case distribution, allowing for more flexible and problem-specific selection of the uncertainty set. However, while simple conceptually, this approach poses a number of implementation and optimization challenges. To circumvent these issues, we propose a relaxation of the KL-constrained inner maximization objective that makes the DRO problem more amenable to gradient-based optimization of large scale generative models, and develop model selection heuristics to guide hyper-parameter search. 
On both toy settings and realistic NLP tasks, we find that the proposed approach yields models that are more robust than comparable baselines.",/pdf/e31aab007b357e40ee683c0d59d6d939a70c3b2a.pdf,ICLR,2021,"We use generative neural models to define the uncertainty set in distributionally robust optimization, and show that this helps train more robust classifiers" +ByxdUySKvS,BkxJDPT_wH,1569440000000.0,1583910000000.0,1734,Adversarial AutoAugment,"[""zhangxinyu10@huawei.com"", ""wangqiang168@huawei.com"", ""zhangjian157@huawei.com"", ""zorro.zhongzhao@huawei.com""]","[""Xinyu Zhang"", ""Qiang Wang"", ""Jian Zhang"", ""Zhao Zhong""]","[""Automatic Data Augmentation"", ""Adversarial Learning"", ""Reinforcement Learning""]","Data augmentation (DA) has been widely utilized to improve generalization in training deep neural networks. Recently, human-designed data augmentation has been gradually replaced by automatically learned augmentation policy. Through finding the best policy in well-designed search space of data augmentation, AutoAugment (Cubuk et al., 2019) can significantly improve validation accuracy on image classification tasks. However, this approach is not computationally practical for large-scale problems. In this paper, we develop an adversarial method to arrive at a computationally-affordable solution called Adversarial AutoAugment, which can simultaneously optimize target related object and augmentation policy search loss. The augmentation policy network attempts to increase the training loss of a target network through generating adversarial augmentation policies, while the target network can learn more robust features from harder examples to improve the generalization. In contrast to prior work, we reuse the computation in target network training for policy evaluation, and dispense with the retraining of the target network. Compared to AutoAugment, this leads to about 12x reduction in computing cost and 11x shortening in time overhead on ImageNet. 
We show experimental results of our approach on CIFAR-10/CIFAR-100, ImageNet, and demonstrate significant performance improvements over state-of-the-art. On CIFAR-10, we achieve a top-1 test error of 1.36%, which is the currently best performing single model. On ImageNet, we achieve a leading performance of top-1 accuracy 79.40% on ResNet-50 and 80.00% on ResNet-50-D without extra data.",/pdf/890e57507ed7da50ce0627561b3e9699ece5a2cb.pdf,ICLR,2020,We introduce the idea of adversarial learning into automatic data augmentation to improve the generalization of a targe network. +r1gKNs0qYX,SJlPe9o_F7,1538090000000.0,1545360000000.0,21,Filter Training and Maximum Response: Classification via Discerning,"[""gul2@uci.edu""]","[""Lei Gu""]","[""filter training"", ""maximum response"", ""multiple check"", ""ensemble learning""]","This report introduces a training and recognition scheme, in which classification is realized via class-wise discerning. Trained with datasets whose labels are randomly shuffled except for one class of interest, a neural network learns class-wise parameter values, and remolds itself from a feature sorter into feature filters, each of which discerns objects belonging to one of the classes only. Classification of an input can be inferred from the maximum response of the filters. A multiple check with multiple versions of filters can diminish fluctuation and yields better performance. This scheme of discerning, maximum response and multiple check is a method of general viability to improve performance of feedforward networks, and the filter training itself is a promising feature abstraction procedure. In contrast to the direct sorting, the scheme mimics the classification process mediated by a series of one component picking.",/pdf/9a0066cc0eefc78484561fbc69c77a123fd49a18.pdf,ICLR,2019,The proposed scheme mimics the classification process mediated by a series of one component picking. 
+H1eqQeHFDS,rkxsTVxYDS,1569440000000.0,1585950000000.0,2221,AdvectiveNet: An Eulerian-Lagrangian Fluidic Reservoir for Point Cloud Processing ,"[""xingzhe.he95@gmail.com"", ""helen.l.cao.22@dartmouth.edu"", ""bo.zhu@dartmouth.edu""]","[""Xingzhe He"", ""Helen Lu Cao"", ""Bo Zhu""]","[""Point Cloud Processing"", ""Physical Reservoir Learning"", ""Eulerian-Lagrangian Method"", ""PIC/FLIP""]","This paper presents a novel physics-inspired deep learning approach for point cloud processing motivated by the natural flow phenomena in fluid mechanics. Our learning architecture jointly defines data in an Eulerian world space, using a static background grid, and a Lagrangian material space, using moving particles. By introducing this Eulerian-Lagrangian representation, we are able to naturally evolve and accumulate particle features using flow velocities generated from a generalized, high-dimensional force field. We demonstrate the efficacy of this system by solving various point cloud classification and segmentation problems with state-of-the-art performance. The entire geometric reservoir and data flow mimic the pipeline of the classic PIC/FLIP scheme in modeling natural flow, bridging the disciplines of geometric machine learning and physical simulation.",/pdf/ed964d2fea5030138e3f09a86b61ad5272f3d315.pdf,ICLR,2020,We present a new grid-particle learning method to process point clouds motivated by computational fluid dynamics. 
+SJgIPJBFvH,r1ltVhTdvH,1569440000000.0,1583910000000.0,1767,Fantastic Generalization Measures and Where to Find Them,"[""ydjiang@google.com"", ""neyshabur@google.com"", ""dilipkay@google.com"", ""hmobahi@google.com"", ""bengio@google.com""]","[""Yiding Jiang*"", ""Behnam Neyshabur*"", ""Hossein Mobahi"", ""Dilip Krishnan"", ""Samy Bengio""]","[""Generalization"", ""correlation"", ""experiments""]","Generalization of deep networks has been intensely researched in recent years, resulting in a number of theoretical bounds and empirically motivated measures. However, most papers proposing such measures only study a small set of models, leaving open the question of whether these measures are truly useful in practice. We present the first large scale study of generalization bounds and measures in deep networks. We train over two thousand CIFAR-10 networks with systematic changes in important hyper-parameters. We attempt to uncover potential causal relationships between each measure and generalization, by using rank correlation coefficient and its modified forms. We analyze the results and show that some of the studied measures are very promising for further research.",/pdf/c915c7896213fa5b3abc239ba4c4a5a800b404ef.pdf,ICLR,2020,"We empirically study generalization measures over more than 2000 models, identify common pitfall in existing practice of studying generalization measures and provide some new bounds based on measures in our study." +NZj7TnMr01,wOjHnkGjLTw,1601310000000.0,1614990000000.0,763,Improving Neural Network Accuracy and Calibration Under Distributional Shift with Prior Augmented Data ,"[""~Jeffrey_Ryan_Willette1"", ""~Juho_Lee2"", ""~Sung_Ju_Hwang1""]","[""Jeffrey Ryan Willette"", ""Juho Lee"", ""Sung Ju Hwang""]","[""Bayesian"", ""Calibration""]","Neural networks have proven successful at learning from complex data distributions by acting as universal function approximators. 
However, neural networks are often overconfident in their predictions, which leads to inaccurate and miscalibrated probabilistic predictions. The problem of overconfidence becomes especially apparent in cases where the test-time data distribution differs from that which was seen during training. We propose a solution to this problem by seeking out regions in arbitrary feature space where the model is unjustifiably overconfident, and conditionally raising the entropy of those predictions towards that of the Bayesian prior on the distribution of the labels. Our method results in a better calibrated network and is agnostic to the underlying model structure, so it can be applied to any neural network which produces a probability density as an output. We demonstrate the effectiveness of our method and validate its performance on both classification and regression problems by applying it to the training of recent state-of-the-art neural network models.",/pdf/4ef69fdbb876bc28086eea1083c0c740b8931544.pdf,ICLR,2021,We propose a method of training existing neural network models which results in better calibrated probabilistic outputs. +HklvmlrKPB,Hye84ExFPS,1569440000000.0,1577170000000.0,2214,Improving Sequential Latent Variable Models with Autoregressive Flows,"[""jmarino@caltech.edu"", ""lei_chen_4@sfu.ca"", ""jiawei_he_2@sfu.ca"", ""stephan.mandt@gmail.com""]","[""Joseph Marino"", ""Lei Chen"", ""Jiawei He"", ""Stephan Mandt""]","[""Autoregressive Flows"", ""Sequence Modeling"", ""Latent Variable Models"", ""Video Modeling"", ""Variational Inference""]","We propose an approach for sequence modeling based on autoregressive normalizing flows. Each autoregressive transform, acting across time, serves as a moving reference frame for modeling higher-level dynamics. This technique provides a simple, general-purpose method for improving sequence modeling, with connections to existing and classical techniques. 
We demonstrate the proposed approach both with standalone models, as well as a part of larger sequential latent variable models. Results are presented on three benchmark video datasets, where flow-based dynamics improve log-likelihood performance over baseline models.",/pdf/e62dd47e2346912a2e38bbd17d28bc8740db819c.pdf,ICLR,2020,We show how autoregressive flows can be used to improve sequential latent variable models. +bQf4aGhfmFx,5RKYeUPtsZ9,1601310000000.0,1614990000000.0,2551,Effective Regularization Through Loss-Function Metalearning,"[""~Santiago_Gonzalez1"", ""~Risto_Miikkulainen1""]","[""Santiago Gonzalez"", ""Risto Miikkulainen""]","[""regularization"", ""loss"", ""loss function"", ""metalearning"", ""meta-learning"", ""optimization"", ""theory"", ""robustness"", ""adversarial attacks""]","Loss-function metalearning can be used to discover novel, customized loss functions for deep neural networks, resulting in improved performance, faster training, and improved data utilization. A likely explanation is that such functions discourage overfitting, leading to effective regularization. This paper theoretically demonstrates that this is indeed the case: decomposition of learning rules makes it possible to characterize the training dynamics and show that loss functions evolved through TaylorGLO regularize both in the beginning and end of learning, and maintain an invariant in between. The invariant can be utilized to make the metalearning process more efficient in practice, and the regularization can train networks that are robust against adversarial attacks. Loss-function optimization can thus be seen as a well-founded new aspect of metalearning in neural networks.",/pdf/1998ab969838a3ea95026c0de0c91a664ee75d3d.pdf,ICLR,2021,This paper provides a theoretical foundation to explain how and why metalearned loss functions are able to regularize. 
+rJe-Pr9le,,1478280000000.0,1481850000000.0,253,Multi-task learning with deep model based reinforcement learning,"[""asierm@student.ethz.ch""]","[""Asier Mujika""]","[""Reinforcement Learning"", ""Deep learning"", ""Games"", ""Transfer Learning""]","In recent years, model-free methods that use deep learning have achieved great success in many different reinforcement learning environments. Most successful approaches focus on solving a single task, while multi-task reinforcement learning remains an open problem. In this paper, we present a model based approach to deep reinforcement learning which we use to solve different tasks simultaneously. We show that our approach not only does not degrade but actually benefits from learning multiple tasks. For our model, we also present a new kind of recurrent neural network inspired by residual networks that decouples memory from computation allowing to model complex environments that do not require lots of memory. The code will be released before ICLR 2017.",/pdf/63b92f7ce8708e9df35dc299a7b3472b6cf9731e.pdf,ICLR,2017,"We build a world model, based on CNN's and RNN's, to play multiple ATARI games simultaneously, achieving super-human performance." +H13F3Pqll,,1478290000000.0,1478380000000.0,433,Inverse Problems in Computer Vision using Adversarial Imagination Priors,"[""htung@cs.cmu.edu"", ""katef@cs.cmu.edu""]","[""Hsiao-Yu Fish Tung"", ""Katerina Fragkiadaki""]","[""Unsupervised Learning"", ""Deep learning""]","Given an image, humans effortlessly run the image formation process backwards in their minds: they can tell albedo from shading, foreground from background, and imagine the occluded parts of the scene behind foreground objects. In this work, we propose a weakly supervised inversion machine trained to generate similar imaginations that when rendered using differentiable, graphics-like decoders, produce the original visual input. 
We constrain the imagination spaces by providing exemplar memory repositories in the form of foreground segmented objects, albedo, shading, background scenes and imposing adversarial losses on the imagination spaces. Our model learns to perform such inversion with weak supervision, without ever having seen paired annotated data, that is, without having seen the image paired with the corresponding ground-truth imaginations. We demonstrate our method by applying it to three Computer Vision tasks: image in-painting, intrinsic decomposition and object segmentation, each task having its own differentiable renderer. Data driven adversarial imagination priors effectively guide inversion, minimize the need for hand designed priors of smoothness or good continuation, or the need for paired annotated data.",/pdf/3c3449d94f4a81cddaaabe2b21ee20f81d2276d1.pdf,ICLR,2017,"We present a model that given a visual image learns to generate imaginations of complete scenes, albedo, shading etc, by using adversarial data driven priors on the imaginations spaces." +wS0UFjsNYjn,b98qq6KutuK,1601310000000.0,1616050000000.0,1492,Meta-GMVAE: Mixture of Gaussian VAE for Unsupervised Meta-Learning,"[""~Dong_Bok_Lee1"", ""~Dongchan_Min1"", ""~Seanie_Lee1"", ""~Sung_Ju_Hwang1""]","[""Dong Bok Lee"", ""Dongchan Min"", ""Seanie Lee"", ""Sung Ju Hwang""]","[""Unsupervised Learning"", ""Meta-Learning"", ""Unsupervised Meta-learning"", ""Variational Autoencoders""]","Unsupervised learning aims to learn meaningful representations from unlabeled data which can captures its intrinsic structure, that can be transferred to downstream tasks. Meta-learning, whose objective is to learn to generalize across tasks such that the learned model can rapidly adapt to a novel task, shares the spirit of unsupervised learning in that the both seek to learn more effective and efficient learning procedure than learning from scratch. 
The fundamental difference of the two is that the most meta-learning approaches are supervised, assuming full access to the labels. However, acquiring labeled dataset for meta-training not only is costly as it requires human efforts in labeling but also limits its applications to pre-defined task distributions. In this paper, we propose a principled unsupervised meta-learning model, namely Meta-GMVAE, based on Variational Autoencoder (VAE) and set-level variational inference. Moreover, we introduce a mixture of Gaussian (GMM) prior, assuming that each modality represents each class-concept in a randomly sampled episode, which we optimize with Expectation-Maximization (EM). Then, the learned model can be used for downstream few-shot classification tasks, where we obtain task-specific parameters by performing semi-supervised EM on the latent representations of the support and query set, and predict labels of the query set by computing aggregated posteriors. We validate our model on Omniglot and Mini-ImageNet datasets by evaluating its performance on downstream few-shot classification tasks. The results show that our model obtain impressive performance gains over existing unsupervised meta-learning baselines, even outperforming supervised MAML on a certain setting.",/pdf/7b58adedb02a73d26b32a949a08c9238409022a5.pdf,ICLR,2021,"we propose a principled unsupervised meta-learning model which meta-learns a set-level variational posterior, by matching it with multi-modal prior distribution obtained by EM." 
+Qr0aRliE_Hb,UvmA8zzNaIl,1601310000000.0,1616740000000.0,3080,Simple Augmentation Goes a Long Way: ADRL for DNN Quantization,"[""~Lin_Ning1"", ""~Guoyang_Chen1"", ""weifeng.z@alibaba-inc.com"", ""~Xipeng_Shen1""]","[""Lin Ning"", ""Guoyang Chen"", ""Weifeng Zhang"", ""Xipeng Shen""]","[""Reinforcement Learning"", ""Quantization"", ""mixed precision"", ""augmented deep reinforcement learning"", ""DNN""]","Mixed precision quantization improves DNN performance by assigning different layers with different bit-width values. Searching for the optimal bit-width for each layer, however, remains a challenge. Deep Reinforcement Learning (DRL) shows some recent promise. It however suffers instability due to function approximation errors, causing large variances in the early training stages, slow convergence, and suboptimal policies in the mixed-precision quantization problem. This paper proposes augmented DRL (ADRL) as a way to alleviate these issues. This new strategy augments the neural networks in DRL with a complementary scheme to boost the performance of learning. The paper examines the effectiveness of ADRL both analytically and empirically, showing that it can produce more accurate quantized models than the state of the art DRL-based quantization while improving the learning speed by 4.5-64 times. 
",/pdf/4f1af14f420632aa60f163e48701a935fae3a547.pdf,ICLR,2021,Augments the neural networks in Deep Reinforcement Learning(DRL) with a complementary scheme to boost the performance of learning and solve the common low convergence problem in the early stage of DRL +BkgXT24tDS,SJlWnz8XPH,1569440000000.0,1583910000000.0,220,Additive Powers-of-Two Quantization: An Efficient Non-uniform Discretization for Neural Networks,"[""loafyuhang@gmail.com"", ""xindong@g.harvard.edu"", ""wangwei@comp.nus.edu.sg""]","[""Yuhang Li"", ""Xin Dong"", ""Wei Wang""]","[""Quantization"", ""Efficient Inference"", ""Neural Networks""]","We propose Additive Powers-of-Two~(APoT) quantization, an efficient non-uniform quantization scheme for the bell-shaped and long-tailed distribution of weights and activations in neural networks. By constraining all quantization levels as the sum of Powers-of-Two terms, APoT quantization enjoys high computational efficiency and a good match with the distribution of weights. A simple reparameterization of the clipping function is applied to generate a better-defined gradient for learning the clipping threshold. Moreover, weight normalization is presented to refine the distribution of weights to make the training more stable and consistent. Experimental results show that our proposed method outperforms state-of-the-art methods, and is even competitive with the full-precision models, demonstrating the effectiveness of our proposed APoT quantization. 
For example, our 4-bit quantized ResNet-50 on ImageNet achieves 76.6% top-1 accuracy without bells and whistles; meanwhile, our model reduces 22% computational cost compared with the uniformly quantized counterpart.",/pdf/65bbe05758ea9345981db8c060d6b183847f4ced.pdf,ICLR,2020, +rkg-mA4FDr,Syl6mSH_DH,1569440000000.0,1583910000000.0,1027,Pre-training Tasks for Embedding-based Large-scale Retrieval,"[""wchang2@cs.cmu.edu"", ""felixyu@google.com"", ""yinwen@google.com"", ""yiming@cs.cmu.edu"", ""sanjivk@google.com""]","[""Wei-Cheng Chang"", ""Felix X. Yu"", ""Yin-Wen Chang"", ""Yiming Yang"", ""Sanjiv Kumar""]","[""natural language processing"", ""large-scale retrieval"", ""unsupervised representation learning"", ""paragraph-level pre-training"", ""two-tower Transformer models""]","We consider the large-scale query-document retrieval problem: given a query (e.g., a question), return the set of relevant documents (e.g., paragraphs containing the answer) from a large document corpus. This problem is often solved in two steps. The retrieval phase first reduces the solution space, returning a subset of candidate documents. The scoring phase then re-ranks the documents. Critically, the retrieval algorithm not only desires high recall but also requires to be highly efficient, returning candidates in time sublinear to the number of documents. Unlike the scoring phase witnessing significant advances recently due to the BERT-style pre-training tasks on cross-attention models, the retrieval phase remains less well studied. Most previous works rely on classic Information Retrieval (IR) methods such as BM-25 (token matching + TF-IDF weights). These models only accept sparse handcrafted features and can not be optimized for different downstream tasks of interest. In this paper, we conduct a comprehensive study on the embedding-based retrieval models. We show that the key ingredient of learning a strong embedding-based Transformer model is the set of pre-training tasks. 
With adequately designed paragraph-level pre-training tasks, the Transformer models can remarkably improve over the widely-used BM-25 as well as embedding models without Transformers. The paragraph-level pre-training tasks we studied are Inverse Cloze Task (ICT), Body First Selection (BFS), Wiki Link Prediction (WLP), and the combination of all three.",/pdf/b90af495b4bef6333a12055e38a1d02a48383f28.pdf,ICLR,2020,We consider large-scale retrieval problems such as question answering retrieval and present a comprehensive study of how different sentence level pre-training improving the BERT-style token-level pre-training for two-tower Transformer models. +ryGkSo0qYm,rJl6V6TUKQ,1538090000000.0,1548050000000.0,51,Large Scale Graph Learning From Smooth Signals,"[""v.kalofolias@gmail.com"", ""nathanael.perraudin@sdsc.ethz.ch""]","[""Vassilis Kalofolias"", ""Nathana\u00ebl Perraudin""]","[""Graph learning"", ""Graph signal processing"", ""Network inference""]","Graphs are a prevalent tool in data science, as they model the inherent structure of the data. Typically they are constructed either by connecting nearest samples, or by learning them from data, solving an optimization problem. While graph learning does achieve a better quality, it also comes with a higher computational cost. In particular, the current state-of-the-art model cost is O(n^2) for n samples. +In this paper, we show how to scale it, obtaining an approximation with leading cost of O(n log(n)), with quality that approaches the exact graph learning model. 
Our algorithm uses known approximate nearest neighbor techniques to reduce the number of variables, and automatically selects the correct parameters of the model, requiring a single intuitive input: the desired edge density.",/pdf/def56f715a464b2fbec9bec2780092eaf31a8697.pdf,ICLR,2019, +rylDzTEKwr,SkxGmbjLwr,1569440000000.0,1577170000000.0,416,Variational Hashing-based Collaborative Filtering with Self-Masking,"[""c.hansen@di.ku.dk"", ""chrh@di.ku.dk"", ""simonsen@di.ku.dk"", ""s.alstrup@di.ku.dk"", ""c.lioma@di.ku.dk""]","[""Casper Hansen"", ""Christian Hansen"", ""Jakob Grue Simonsen"", ""Stephen Alstrup"", ""Christina Lioma""]","[""hashing"", ""collaborative filtering"", ""information retrieval"", ""supervised learning""]","Hashing-based collaborative filtering learns binary vector representations (hash codes) of users and items, such that recommendations can be computed very efficiently using the Hamming distance, which is simply the sum of differing bits between two hash codes. A problem with hashing-based collaborative filtering using the Hamming distance, is that each bit is equally weighted in the distance computation, but in practice some bits might encode more important properties than other bits, where the importance depends on the user. +To this end, we propose an end-to-end trainable variational hashing-based collaborative filtering approach that uses the novel concept of self-masking: the user hash code acts as a mask on the items (using the Boolean AND operation), such that it learns to encode which bits are important to the user, rather than the user's preference towards the underlying item property that the bits represent. This allows a binary user-level importance weighting of each item without the need to store additional weights for each user. We experimentally evaluate our approach against state-of-the-art baselines on 4 datasets, and obtain significant gains of up to 12% in NDCG. 
We also make available an efficient implementation of self-masking, which experimentally yields <4% runtime overhead compared to the standard Hamming distance.",/pdf/f512a2ad40b5d496038e7c14a49ac8a0523dd445.pdf,ICLR,2020,"We propose a new variational hashing-based collaborative filtering approach optimized for a novel self-mask variant of the Hamming distance, which outperforms state-of-the-art by up to 12% on NDCG." +HJlzxgBtwH,Syerq3yYPr,1569440000000.0,1577170000000.0,2090,Minimally distorted Adversarial Examples with a Fast Adaptive Boundary Attack,"[""francesco91.croce@gmail.com"", ""matthias.hein@uni-tuebingen.de""]","[""Francesco Croce"", ""Matthias Hein""]","[""adversarial attacks"", ""adversarial robustness""]","The evaluation of robustness against adversarial manipulations of neural networks-based classifiers is mainly tested with empirical attacks as the methods for the exact computation, even when available, do not scale to large networks. We propose in this paper a new white-box adversarial attack wrt the $l_p$-norms for $p \in \{1,2,\infty\}$ aiming at finding the minimal perturbation necessary to change the class of a given input. It has an intuitive geometric meaning, yields quickly high quality results, minimizes the size of the perturbation (so that it returns the robust accuracy at every threshold with a single run). It performs better or similarly to state-of-the-art attacks which are partially specialized to one $l_p$-norm.",/pdf/66364311aad6bf929fc69eec50408d4c5c138a0a.pdf,ICLR,2020,"We introduce a white-box adversarial attack wrt the $l_1$-, $l_2$- and $l_\infty$-norm achieving state-of-the-art performances, minimizing the norm of the perturbations and being computationally cheap." 
+H1gX8C4YPr,Syl0pA8OvS,1569440000000.0,1583910000000.0,1132,DD-PPO: Learning Near-Perfect PointGoal Navigators from 2.5 Billion Frames,"[""etw@gatech.edu"", ""akadian@fb.com"", ""arimorcos@gmail.com"", ""leestef@oregonstate.edu"", ""irfan@gatech.edu"", ""parikh@gatech.edu"", ""msavva@sfu.ca"", ""dbatra@gatech.edu""]","[""Erik Wijmans"", ""Abhishek Kadian"", ""Ari Morcos"", ""Stefan Lee"", ""Irfan Essa"", ""Devi Parikh"", ""Manolis Savva"", ""Dhruv Batra""]","[""autonomous navigation"", ""habitat"", ""embodied AI"", ""pointgoal navigation"", ""reinforcement learning""]","We present Decentralized Distributed Proximal Policy Optimization (DD-PPO), a method for distributed reinforcement learning in resource-intensive simulated environments. DD-PPO is distributed (uses multiple machines), decentralized (lacks a centralized server), and synchronous (no computation is ever ""stale""), making it conceptually simple and easy to implement. In our experiments on training virtual robots to navigate in Habitat-Sim, DD-PPO exhibits near-linear scaling -- achieving a speedup of 107x on 128 GPUs over a serial implementation. We leverage this scaling to train an agent for 2.5 Billion steps of experience (the equivalent of 80 years of human experience) -- over 6 months of GPU-time training in under 3 days of wall-clock time with 64 GPUs. + +This massive-scale training not only sets the state of art on Habitat Autonomous Navigation Challenge 2019, but essentially ""solves"" the task -- near-perfect autonomous navigation in an unseen environment without access to a map, directly from an RGB-D camera and a GPS+Compass sensor. Fortuitously, error vs computation exhibits a power-law-like distribution; thus, 90% of peak performance is obtained relatively early (at 100 million steps) and relatively cheaply (under 1 day with 8 GPUs). 
Finally, we show that the scene understanding and navigation policies learned can be transferred to other navigation tasks -- the analog of ""ImageNet pre-training + task-specific fine-tuning"" for embodied AI. Our model outperforms ImageNet pre-trained CNNs on these transfer tasks and can serve as a universal resource (all models and code are publicly available). ",/pdf/7cd7e4ba1dad47a6766dcc250d99d0a2300f2f85.pdf,ICLR,2020, +HyDMX0l0Z,B1LzXCxCW,1509120000000.0,1518730000000.0,448,Towards Effective GANs for Data Distributions with Diverse Modes,"[""sanchit@cse.iitm.ac.in"", ""garry@cse.iitm.ac.in"", ""miteshk@cse.iitm.ac.in""]","[""Sanchit Agrawal"", ""Gurneet Singh"", ""Mitesh Khapra""]","[""generative adversarial networks"", ""GANs"", ""deep learning"", ""unsupervised learning"", ""generative models"", ""adversarial learning""]","Generative Adversarial Networks (GANs), when trained on large datasets with diverse modes, are known to produce conflated images which do not distinctly belong to any of the modes. We hypothesize that this problem occurs due to the interaction between two facts: (1) For datasets with large variety, it is likely that the modes lie on separate manifolds. (2) The generator (G) is formulated as a continuous function, and the input noise is derived from a connected set, due to which G's output is a connected set. If G covers all modes, then there must be some portion of G's output which connects them. This corresponds to undesirable, conflated images. We develop theoretical arguments to support these intuitions. We propose a novel method to break the second assumption via learnable discontinuities in the latent noise space. Equivalently, it can be viewed as training several generators, thus creating discontinuities in the G function. We also augment the GAN formulation with a classifier C that predicts which noise partition/generator produced the output images, encouraging diversity between each partition/generator. 
We experiment on MNIST, celebA, STL-10, and a difficult dataset with clearly distinct modes, and show that the noise partitions correspond to different modes of the data distribution, and produce images of superior quality.",/pdf/825f7f81714b0581addb74645baed8fa4b3abce7.pdf,ICLR,2018,We introduce theory to explain the failure of GANs on complex datasets and propose a solution to fix it. +r1gIwgSYwr,HJlPFixKvS,1569440000000.0,1577170000000.0,2362,Localized Meta-Learning: A PAC-Bayes Analysis for Meta-Learning Beyond Global Prior,"[""chliu@smu.edu.sg"", ""lutaott@zju.edu.cn"", ""doyensahoo@gmail.com"", ""yfang@smu.edu.sg"", ""shoi@salesforce.com""]","[""Chenghao Liu"", ""Tao Lu"", ""Doyen Sahoo"", ""Yuan Fang"", ""Steven C.H. Hoi.""]","[""localized meta-learning"", ""PAC-Bayes"", ""meta-learning""]","Meta-learning methods learn the meta-knowledge among various training tasks and aim to promote the learning of new tasks under the task similarity assumption. However, such meta-knowledge is often represented as a fixed distribution, which is too restrictive to capture various specific task information. In this work, we present a localized meta-learning framework based on PAC-Bayes theory. In particular, we propose a LCC-based prior predictor that allows the meta learner to adaptively generate local meta-knowledge for a specific task. We further develop a practical algorithm with deep neural network based on the bound. Empirical results on real-world datasets demonstrate the efficacy of the proposed method. 
",/pdf/f02bf2d890c2bbc402cae7ac376f9f17dd9b711b.pdf,ICLR,2020, +nlWgE3A-iS,gESAWF-0KJG,1601310000000.0,1614990000000.0,2485,ReaPER: Improving Sample Efficiency in Model-Based Latent Imagination,"[""~Martin_A_Bertran1"", ""~Guillermo_Sapiro1"", ""~mariano_phielipp1""]","[""Martin A Bertran"", ""Guillermo Sapiro"", ""mariano phielipp""]","[""model-based reinforcement learning"", ""visual control"", ""sample efficiency""]","Deep Reinforcement Learning (DRL) can distill behavioural policies from sensory input that solve complex tasks, however, the policies tend to be task-specific and sample inefficient, requiring a large number of interactions with the environment that may be costly or impractical for many real world applications. Model-based DRL (MBRL) can allow learned behaviours and dynamics from one task to be translated to a new task in a related environment, but still suffer from low sample efficiency. In this work we introduce ReaPER, an algorithm that addresses the sample efficiency challenge in model-based DRL, we illustrate the power of the proposed solution on the DeepMind Control benchmark. Our improvements are driven by sparse , self-supervised, contrastive model representations and efficient use of past experience. We empirically analyze each novel component of ReaPER and analyze how they contribute to sample efficiency. We also illustrate how other standard alternatives fail to improve upon previous methods. Code will be made available.",/pdf/22b122aa441549b121a193de3fd185edbf840a68.pdf,ICLR,2021,"We introduce ReaPER, an algorithm that addresses the sample efficiency challenge in model-based DRL, we illustrate the power of the proposed solution on the DeepMind Control benchmark." 
+rkeZ9a4Fwr,B1eYtOaDPr,1569440000000.0,1577170000000.0,695,Disentangling Improves VAEs' Robustness to Adversarial Attacks,"[""mwilletts@turing.ac.uk"", ""acamuto@turing.ac.uk"", ""sroberts@turing.ac.uk"", ""cholmes@turing.ac.uk""]","[""Matthew Willetts"", ""Alexander Camuto"", ""Stephen Roberts"", ""Chris Holmes""]",[],"This paper is concerned with the robustness of VAEs to adversarial attacks. We highlight that conventional VAEs are brittle under attack but that methods recently introduced for disentanglement such as β-TCVAE (Chen et al., 2018) improve robustness, as demonstrated through a variety of previously proposed adversarial attacks (Tabacof et al. (2016); Gondim-Ribeiro et al. (2018); Kos et al.(2018)). This motivated us to develop Seatbelt-VAE, a new hierarchical disentangled VAE that is designed to be significantly more robust to adversarial attacks than existing approaches, while retaining high quality reconstructions.",/pdf/f04841bcdca53a40a72609d4f255e236160817e9.pdf,ICLR,2020,"We show that disentangled VAEs are more robust than vanilla VAEs to adversarial attacks that aim to trick them into decoding the adversarial input to a chosen target. We then develop an even more robust hierarchical disentangled VAE, Seatbelt-VAE." +SkhU2fcll,,1478270000000.0,1487280000000.0,184,Deep Multi-task Representation Learning: A Tensor Factorisation Approach,"[""yongxin.yang@qmul.ac.uk"", ""t.hospedales@qmul.ac.uk""]","[""Yongxin Yang"", ""Timothy M. Hospedales""]",[],"Most contemporary multi-task learning methods assume linear models. This setting is considered shallow in the era of deep learning. In this paper, we present a new deep multi-task representation learning framework that learns cross-task sharing structure at every layer in a deep network. 
Our approach is based on generalising the matrix factorisation techniques explicitly or implicitly used by many conventional MTL algorithms to tensor factorisation, to realise automatic learning of end-to-end knowledge sharing in deep networks. This is in contrast to existing deep learning approaches that need a user-defined multi-task sharing strategy. Our approach applies to both homogeneous and heterogeneous MTL. Experiments demonstrate the efficacy of our deep multi-task representation learning in terms of both higher accuracy and fewer design choices.",/pdf/ec8f6654ac93c61aee2c445af25d3b49c6db086c.pdf,ICLR,2017,A multi-task representation learning framework that learns cross-task sharing structure at every layer in a deep network. +RgDq8-AwvtN,mVmjvs3bz3a,1601310000000.0,1614990000000.0,359,"Model-Based Robust Deep Learning: Generalizing to Natural, Out-of-Distribution Data","[""~Alexander_Robey1"", ""~Hamed_Hassani2"", ""~George_J._Pappas1""]","[""Alexander Robey"", ""Hamed Hassani"", ""George J. Pappas""]","[""robustness"", ""out-of-distribution generalization"", ""natural variation"", ""deep learning""]","While deep learning (DL) has resulted in major breakthroughs in many applications, the frameworks commonly used in DL remain fragile to seemingly innocuous changes in the data. In response, adversarial training has emerged as a principled approach for improving the robustness of DL against norm-bounded perturbations. Despite this progress, DL is also known to be fragile to unbounded shifts in the data distribution due to many forms of natural variation, including changes in weather or lighting in images. However, there are remarkably few techniques that can address robustness to natural, out-of-distribution shifts in the data distribution in a general context. To address this gap, we propose a paradigm shift from perturbation-based adversarial robustness to model-based robust deep learning. 
Critical to our paradigm is to obtain models of natural variation, which vary data over a range of natural conditions. Then by exploiting these models, we develop three novel model-based robust training algorithms that improve the robustness of DL with respect to natural variation. Our extensive experiments show that across a variety of natural conditions in twelve distinct datasets, classifiers trained with our algorithms significantly outperform classifiers trained via ERM, adversarial training, and domain adaptation techniques. Specifically, when training on ImageNet and testing on various subsets of ImageNet-c, our algorithms improve over baseline methods by up to 30 percentage points in top-1 accuracy. Further, we show that our methods provide robustness (1) against natural, out-of-distribution data, (2) against multiple simultaneous distributional shifts, and (3) to domains entirely unseen during training.",/pdf/750ae608d0a114890e34f3984d78ac00e473dece.pdf,ICLR,2021,"We provide algorithms that can be used to significantly improve robustness against natural, out-of-distribution shifts in the data distribution." +S1lAOhEKPS,HJedS1OF8B,1569440000000.0,1577170000000.0,60,X-Forest: Approximate Random Projection Trees for Similarity Measurement,"[""zyk@pku.edu.cn"", ""chenpeiqing@pku.edu.cn"", ""benkerd@pku.edu.cn"", ""yangtongemail@gmail.com"", ""jie.jiang@pku.edu.cn"", ""bin.cui@pku.edu.cn"", ""nicholas.zhang@huawei.com"", ""steve.uhlig@qmul.ac.uk""]","[""Yikai Zhao"", ""Peiqing Chen"", ""Zidong Zhao"", ""Tong Yang"", ""Jie Jiang"", ""Bin Cui"", ""Gong Zhang"", ""Steve Uhlig""]",[],"Similarity measurement plays a central role in various data mining and machine learning tasks. Generally, a similarity measurement solution should, in an ideal state, possess the following three properties: accuracy, efficiency and independence from prior knowledge. Yet unfortunately, vital as similarity measurements are, no previous works have addressed all of them. 
In this paper, we propose X-Forest, consisting of a group of approximate Random Projection Trees, such that all three targets mentioned above are tackled simultaneously. Our key techniques are as follows. First, we introduced RP Trees into the tasks of similarity measurement such that accuracy is improved. In addition, we enforce certain layers in each tree to share identical projection vectors, such that exalted efficiency is achieved. Last but not least, we introduce randomness into partition to eliminate its reliance on prior knowledge. We conduct experiments on three real-world datasets, whose results demonstrate that our model, X-Forest, reaches an efficiency of up to 3.5 times higher than RP Trees with negligible compromising on its accuracy, while also being able to outperform traditional Euclidean distance-based similarity metrics by as much as 20% with respect to clustering tasks. We have released codes in github anonymously so as to meet the demand of reproducibility.",/pdf/2907d4f441292d83e15ff606e48318aef12db63a.pdf,ICLR,2020, +HyEl3o05Fm,BJeeReacFX,1538090000000.0,1545360000000.0,682,Stochastic Adversarial Video Prediction,"[""alexlee_gk@cs.berkeley.edu"", ""rich.zhang@eecs.berkeley.edu"", ""febert@berkeley.edu"", ""pabbeel@cs.berkeley.edu"", ""cbfinn@eecs.berkeley.edu"", ""svlevine@eecs.berkeley.edu""]","[""Alex X. Lee"", ""Richard Zhang"", ""Frederik Ebert"", ""Pieter Abbeel"", ""Chelsea Finn"", ""Sergey Levine""]","[""video prediction"", ""GANs"", ""variational autoencoder""]","Being able to predict what may happen in the future requires an in-depth understanding of the physical and causal rules that govern the world. A model that is able to do so has a number of appealing applications, from robotic planning to representation learning. 
However, learning to predict raw future observations, such as frames in a video, is exceedingly challenging—the ambiguous nature of the problem can cause a naively designed model to average together possible futures into a single, blurry prediction. Recently, this has been addressed by two distinct approaches: (a) latent variational variable models that explicitly model underlying stochasticity and (b) adversarially-trained models that aim to produce naturalistic images. However, a standard latent variable model can struggle to produce realistic results, and a standard adversarially-trained model underutilizes latent variables and fails to produce diverse predictions. We show that these distinct methods are in fact complementary. Combining the two produces predictions that look more realistic to human raters and better cover the range of possible futures. Our method outperforms prior works in these aspects.",/pdf/756cdfcc085e000045b66a1cc06c33dd6be3761c.pdf,ICLR,2019, +rkMhusC5Y7,S1xQTBpDt7,1538090000000.0,1545360000000.0,391,Learning to Coordinate Multiple Reinforcement Learning Agents for Diverse Query Reformulation,"[""rodrigonogueira@nyu.edu"", ""jbulian@google.com"", ""massi@google.com""]","[""Rodrigo Nogueira"", ""Jannis Bulian"", ""Massimiliano Ciaramita""]","[""Reinforcement Learning"", ""Multi-agent"", ""Information Retrieval"", ""Question-Answering"", ""Query Reformulation"", ""Query Expansion""]","We propose a method to efficiently learn diverse strategies in reinforcement learning for query reformulation in the tasks of document retrieval and question answering. In the proposed framework an agent consists of multiple specialized sub-agents and a meta-agent that learns to aggregate the answers from sub-agents to produce a final answer. Sub-agents are trained on disjoint partitions of the training data, while the meta-agent is trained on the full training set. 
Our method makes learning faster, because it is highly parallelizable, and has better generalization performance than strong baselines, such as an ensemble of agents trained on the full data. We show that the improved performance is due to the increased diversity of reformulation strategies. ",/pdf/b986c75df9b58699521f2f2e53c49d2668b56c09.pdf,ICLR,2019,Multiple diverse query reformulation agents trained with reinforcement learning to improve search engines. +RuUdMAU-XbI,l3Lv8tDM68c,1601310000000.0,1614990000000.0,11,Dynamic Graph: Learning Instance-aware Connectivity for Neural Networks,"[""~Kun_Yuan1"", ""~Quanquan_Li1"", ""~Dapeng_Chen4"", ""~Aojun_Zhou2"", ""~Junjie_Yan1""]","[""Kun Yuan"", ""Quanquan Li"", ""Dapeng Chen"", ""Aojun Zhou"", ""Junjie Yan""]","[""dynamic network"", ""data-dependent"", ""complete graph""]","One practice of employing deep neural networks is to apply the same architecture to all the input instances. However, a fixed architecture may not be representative enough for data with high diversity. To promote the model capacity, existing approaches usually employ larger convolutional kernels or deeper network structure, which may increase the computational cost. In this paper, we address this issue by raising the Dynamic Graph Network (DG-Net). The network learns the instance-aware connectivity, which creates different forward paths for different instances. Specifically, the network is initialized as a complete directed acyclic graph, where the nodes represent convolutional blocks and the edges represent the connection paths. We generate edge weights by a learnable module \textit{router} and select the edges whose weights are larger than a threshold, to adjust the connectivity of the neural network structure. Instead of using the same path of the network, DG-Net aggregates features dynamically in each node, which allows the network to have more representation ability. 
To facilitate the training, we represent the network connectivity of each sample in an adjacency matrix. The matrix is updated to aggregate features in the forward pass, cached in the memory, and used for gradient computing in the backward pass. We verify the effectiveness of our method with several static architectures, including MobileNetV2, ResNet, ResNeXt, and RegNet. Extensive experiments are performed on ImageNet classification and COCO object detection, which shows the effectiveness and generalization ability of our approach.",/pdf/51cf6bf6a8fb30f24c6838c226f2e3853d9c3076.pdf,ICLR,2021,Dynamic Graph Networks promote the model capacity by performing instance-aware connectivity for neural networks. +u9ax42K7ND,3Czaj-ZOMxa,1601310000000.0,1614990000000.0,101,Hierarchical Meta Reinforcement Learning for Multi-Task Environments,"[""~Dongyang_Zhao1"", ""~Yue_Huang5"", ""~Changnan_Xiao1"", ""~Yue_Li5"", ""~Shihong_Deng1""]","[""Dongyang Zhao"", ""Yue Huang"", ""Changnan Xiao"", ""Yue Li"", ""Shihong Deng""]","[""Reinforcement Learning"", ""Multi-task"", ""Hierarchical"", ""Meta Learning""]","Deep reinforcement learning algorithms aim to achieve human-level intelligence by solving practical decisions-making problems, which are often composed of multiple sub-tasks. Complex and subtle relationships between sub-tasks make traditional methods hard to give a promising solution. We implement a first-person shooting environment with random spatial structures to illustrate a typical representative of this kind. A desirable agent should be capable of balancing between different sub-tasks: navigation to find enemies and shooting to kill them. To address the problem brought by the environment, we propose a Meta Soft Hierarchical reinforcement learning framework (MeSH), in which each low-level sub-policy focuses on a specific sub-task respectively and high-level policy automatically learns to utilize low-level sub-policies through meta-gradients. 
The proposed framework is able to disentangle multiple sub-tasks and discover proper low-level policies under different situations. The effectiveness and efficiency of the framework are shown by a series of comparison experiments. Both environment and algorithm code will be provided for open source to encourage further research.",/pdf/883282234864314238ab80918076a9a9a2d52ca0.pdf,ICLR,2021,A new multi-task environment & a hierarchical meta reinforcement learning framework +rJggX0EKwS,B1xp1BBOwH,1569440000000.0,1577170000000.0,1025,The Benefits of Over-parameterization at Initialization in Deep ReLU Networks,"[""devansharpit@gmail.com"", ""yoshua.bengio@mila.quebec""]","[""Devansh Arpit"", ""Yoshua Bengio""]","[""deep relu networks"", ""he initialization"", ""norm preserving"", ""gradient preserving""]","It has been noted in existing literature that over-parameterization in ReLU networks generally improves performance. While there could be several factors involved behind this, we prove some desirable theoretical properties at initialization which may be enjoyed by ReLU networks. Specifically, it is known that He initialization in deep ReLU networks asymptotically preserves variance of activations in the forward pass and variance of gradients in the backward pass for infinitely wide networks, thus preserving the flow of information in both directions. Our paper goes beyond these results and shows novel properties that hold under He initialization: i) the norm of hidden activation of each layer is equal to the norm of the input, and, ii) the norm of weight gradient of each layer is equal to the product of norm of the input vector and the error at output layer. These results are derived using the PAC analysis framework, and hold true for finitely sized datasets such that the width of the ReLU network only needs to be larger than a certain finite lower bound. 
As we show, this lower bound depends on the depth of the network and the number of samples, and by the virtue of being a lower bound, over-parameterized ReLU networks are endowed with these desirable properties. For the aforementioned hidden activation norm property under He initialization, we further extend our theory and show that this property holds for a finite width network even when the number of data samples is infinite. Thus we overcome several limitations of existing papers, and show new properties of deep ReLU networks at initialization.",/pdf/fca83ed84912e22fc21ae710ebc604be30e38b45.pdf,ICLR,2020,We show that the norm of hidden activations and the norm of weight gradients are a function of the norm of input data and error at output. We relax the assumption made by previous papers that study weight initialization in deep ReLU networks. +tL89RnzIiCd,TbWbwAsplhu,1601310000000.0,1615990000000.0,3489,Hopfield Networks is All You Need,"[""~Hubert_Ramsauer2"", ""~Bernhard_Sch\u00e4fl1"", ""~Johannes_Lehner1"", ""~Philipp_Seidl1"", ""~Michael_Widrich2"", ""~Lukas_Gruber2"", ""~Markus_Holzleitner1"", ""~Thomas_Adler1"", ""openreview20@kreil.org"", ""~Michael_K_Kopp1"", ""~G\u00fcnter_Klambauer1"", ""~Johannes_Brandstetter1"", ""~Sepp_Hochreiter1""]","[""Hubert Ramsauer"", ""Bernhard Sch\u00e4fl"", ""Johannes Lehner"", ""Philipp Seidl"", ""Michael Widrich"", ""Lukas Gruber"", ""Markus Holzleitner"", ""Thomas Adler"", ""David Kreil"", ""Michael K Kopp"", ""G\u00fcnter Klambauer"", ""Johannes Brandstetter"", ""Sepp Hochreiter""]","[""Modern Hopfield Network"", ""Energy"", ""Attention"", ""Convergence"", ""Storage Capacity"", ""Hopfield layer"", ""Associative Memory""]","We introduce a modern Hopfield network with continuous states and a corresponding update rule. The new Hopfield network can store exponentially (with the dimension of the associative space) many patterns, retrieves the pattern with one update, and has exponentially small retrieval errors. 
It has three types of energy minima (fixed points of the update): (1) global fixed point averaging over all patterns, (2) metastable states averaging over a subset of patterns, and (3) fixed points which store a single pattern. The new update rule is equivalent to the attention mechanism used in transformers. This equivalence enables a characterization of the heads of transformer models. These heads perform in the first layers preferably global averaging and in higher layers partial averaging via metastable states. The new modern Hopfield network can be integrated into deep learning architectures as layers to allow the storage of and access to raw input data, intermediate results, or learned prototypes. +These Hopfield layers enable new ways of deep learning, beyond fully-connected, convolutional, or recurrent networks, and provide pooling, memory, association, and attention mechanisms. We demonstrate the broad applicability of the Hopfield layers +across various domains. Hopfield layers improved state-of-the-art on three out of four considered multiple instance learning problems as well as on immune repertoire classification with several hundreds of thousands of instances. On the UCI benchmark collections of small classification tasks, where deep learning methods typically struggle, Hopfield layers yielded a new state-of-the-art when compared to different machine learning methods. Finally, Hopfield layers achieved state-of-the-art on two drug design datasets. The implementation is available at: \url{https://github.com/ml-jku/hopfield-layers}",/pdf/4dfbed3a6ececb7282dfef90fd6c03812ae0da7b.pdf,ICLR,2021,A novel continuous Hopfield network is proposed whose update rule is the attention mechanism of the transformer model and which can be integrated into deep learning architectures. 
+BJxAHgSYDB,BygnNFltPB,1569440000000.0,1577170000000.0,2306,Learning to Rank Learning Curves,"[""martin.wistuba@ibm.com"", ""tejaswinip@us.ibm.com""]","[""Martin Wistuba"", ""Tejaswini Pedapati""]",[],"Many automated machine learning methods, such as those for hyperparameter and neural architecture optimization, are computationally expensive because they involve training many different model configurations. In this work, we present a new method that saves computational budget by terminating poor configurations early on in the training. In contrast to existing methods, we consider this task as a ranking and transfer learning problem. We qualitatively show that by optimizing a pairwise ranking loss and leveraging learning curves from other data sets, our model is able to effectively rank learning curves without having to observe many or very long learning curves. We further demonstrate that our method can be used to accelerate a neural architecture search by a factor of up to 100 without a significant performance degradation of the discovered architecture. In further experiments we analyze the quality of ranking, the influence of different model components as well as the predictive behavior of the model.",/pdf/33da53866a9f53de2f7681ee7b84874c35f534c2.pdf,ICLR,2020,Learn to rank learning curves in order to stop unpromising training jobs early. Novelty: use of pairwise ranking loss to directly model the probability of improving and transfer learning across data sets to reduce required training data. +HkpbnH9lx,,1478280000000.0,1487090000000.0,269,Density estimation using Real NVP,"[""dinh.laurent@gmail.com"", ""jaschasd@google.com"", ""bengio@google.com""]","[""Laurent Dinh"", ""Jascha Sohl-Dickstein"", ""Samy Bengio""]","[""Deep learning"", ""Unsupervised Learning""]","Unsupervised learning of probabilistic models is a central yet challenging problem in machine learning. 
Specifically, designing models with tractable learning, sampling, inference and evaluation is crucial in solving this task. We extend the space of such models using real-valued non-volume preserving (real NVP) transformations, a set of powerful invertible and learnable transformations, resulting in an unsupervised learning algorithm with exact log-likelihood computation, exact sampling, exact inference of latent variables, and an interpretable latent space. We demonstrate its ability to model natural images on four datasets through sampling, log-likelihood evaluation and latent variable manipulations.",/pdf/10807df6f626b504c0d7abe90bb81aef3fb9fb13.pdf,ICLR,2017,Efficient invertible neural networks for density estimation and generation +HJgCF0VFwr,rJg3Pw__DH,1569440000000.0,1583910000000.0,1267,Probabilistic Connection Importance Inference and Lossless Compression of Deep Neural Networks,"[""xin_xing@fas.harvard.edu"", ""longsha@brandeis.edu"", ""hongpeng@brandeis.edu"", ""zuofeng.shang@njit.edu"", ""jliu@stat.harvard.edu""]","[""Xin Xing"", ""Long Sha"", ""Pengyu Hong"", ""Zuofeng Shang"", ""Jun S. Liu""]",[],"Deep neural networks (DNNs) can be huge in size, requiring a considerable amount of energy and computational resources to operate, which limits their applications in numerous scenarios. It is thus of interest to compress DNNs while maintaining their performance levels. We here propose a probabilistic importance inference approach for pruning DNNs. Specifically, we test the significance of the relevance of a connection in a DNN to the DNN’s outputs using a nonparametric scoring test and keep only the significant ones. 
Experimental results show that the proposed approach achieves better lossless compression rates than existing techniques.",/pdf/0b4d4acde853a70c2593060c34e9c3d57e83700f.pdf,ICLR,2020, +H1BO9M-0Z,S1ODqM-Ab,1509140000000.0,1518730000000.0,905,Lifelong Word Embedding via Meta-Learning,"[""hxu48@uic.edu"", ""liub@uic.edu"", ""lshu3@uic.edu"", ""psyu@uic.edu""]","[""Hu Xu"", ""Bing Liu"", ""Lei Shu"", ""Philip S. Yu""]","[""Lifelong learning"", ""meta learning"", ""word embedding""]","Learning high-quality word embeddings is of significant importance in achieving better performance in many down-stream learning tasks. On one hand, traditional word embeddings are trained on a large scale corpus for general-purpose tasks, which are often sub-optimal for many domain-specific tasks. On the other hand, many domain-specific tasks do not have a large enough domain corpus to obtain high-quality embeddings. We observe that domains are not isolated and a small domain corpus can leverage the learned knowledge from many past domains to augment that corpus in order to generate high-quality embeddings. In this paper, we formulate the learning of word embeddings as a lifelong learning process. Given knowledge learned from many previous domains and a small new domain corpus, the proposed method can effectively generate new domain embeddings by leveraging a simple but effective algorithm and a meta-learner, where the meta-learner is able to provide word context similarity information at the domain-level. Experimental results demonstrate that the proposed method can effectively learn new domain embeddings from a small corpus and past domain knowledge\footnote{We will release the code after final revisions.}. 
We also demonstrate that general-purpose embeddings trained from a large scale corpus are sub-optimal in domain-specific tasks.",/pdf/2d129dcd1101fe5b2fbb27768913ebad82c5727e.pdf,ICLR,2018,learning better domain embeddings via lifelong learning and meta-learning +rkxt8oC9FQ,rkeInEZuK7,1538090000000.0,1545360000000.0,192,Perfect Match: A Simple Method for Learning Representations For Counterfactual Inference With Neural Networks,"[""patrick.schwab@hest.ethz.ch"", ""llorenz@student.ethz.ch"", ""walter.karlen@hest.ethz.ch""]","[""Patrick Schwab"", ""Lorenz Linhardt"", ""Walter Karlen""]",[],"Learning representations for counterfactual inference from observational data is of high practical relevance for many domains, such as healthcare, public policy and economics. Counterfactual inference enables one to answer ""What if...?"" questions, such as ""What would be the outcome if we gave this patient treatment $t_1$?"". However, current methods for training neural networks for counterfactual inference on observational data are either overly complex, limited to settings with only two available treatment options, or both. Here, we present Perfect Match (PM), a method for training neural networks for counterfactual inference that is easy to implement, compatible with any architecture, does not add computational complexity or hyperparameters, and extends to any number of treatments. PM is based on the idea of augmenting samples within a minibatch with their propensity-matched nearest neighbours. 
Our experiments demonstrate that PM outperforms a number of more complex state-of-the-art methods in inferring counterfactual outcomes across several real-world and semi-synthetic datasets.",/pdf/395dff4f82864bacdd447ad49564dc36d74cc3b7.pdf,ICLR,2019, +SkxaueHFPB,r1gNPaetwS,1569440000000.0,1577170000000.0,2416,Implicit competitive regularization in GANs,"[""florian.schaefer@caltech.edu"", ""devzhk@sjtu.edu.cn"", ""anima@caltech.edu""]","[""Florian Schaefer"", ""Hongkai Zheng"", ""Anima Anandkumar""]","[""GAN"", ""competitive optimization"", ""game theory""]","Generative adversarial networks (GANs) are capable of producing high quality samples, but they suffer from numerous issues such as instability and mode collapse during training. To combat this, we propose to model the generator and discriminator as agents acting under local information, uncertainty, and awareness of their opponent. By doing so we achieve stable convergence, even when the underlying game has no Nash equilibria. We call this mechanism \emph{implicit competitive regularization} (ICR) and show that it is present in the recently proposed \emph{competitive gradient descent} (CGD). +When comparing CGD to Adam using a variety of loss functions and regularizers on CIFAR10, CGD shows a much more consistent performance, which we attribute to ICR. +In our experiments, we achieve the highest inception score when using the WGAN loss (without gradient penalty or weight clipping) together with CGD. This can be interpreted as minimizing a form of integral probability metric based on ICR.",/pdf/f3bd24900cf17a2cd24913ecd24768a7feacc6e2.pdf,ICLR,2020,Training GANs by modeling networks as agents acting with limited information and in awareness of their opponent imposes an implicit regularization that leads to stable convergence and prevents mode collapse. +BJxQxeBYwH,SJg05hyFPB,1569440000000.0,1577170000000.0,2092,Are Powerful Graph Neural Nets Necessary? 
A Dissection on Graph Classification,"[""iamtingchen@gmail.com"", ""biansonghz@gmail.com"", ""yzsun@cs.ucla.edu""]","[""Ting Chen"", ""Song Bian"", ""Yizhou Sun""]","[""graph neural nets"", ""graph classification"", ""set function""]","Graph Neural Nets (GNNs) have received increasing attention, partially due to their superior performance in many node and graph classification tasks. However, there is a lack of understanding on what they are learning and how sophisticated the learned graph functions are. In this work, we propose a dissection of GNNs on graph classification into two parts: 1) the graph filtering, where graph-based neighbor aggregations are performed, and 2) the set function, where a set of hidden node features are composed for prediction. To study the importance of both parts, we propose to linearize them separately. We first linearize the graph filtering function, resulting in Graph Feature Network (GFN), which is a simple lightweight neural net defined on a \textit{set} of graph augmented features. Further linearization of GFN's set function results in Graph Linear Network (GLN), which is a linear function. Empirically we perform evaluations on common graph classification benchmarks. To our surprise, we find that, despite the simplification, GFN could match or exceed the best accuracies produced by recently proposed GNNs (with a fraction of computation cost), while GLN underperforms significantly. Our results demonstrate the importance of non-linear set function, and suggest that linear graph filtering with non-linear set function is an efficient and powerful scheme for modeling existing graph classification benchmarks.",/pdf/60f60e3259ea10fe4f9eb2c67202aa7fa1f4f980.pdf,ICLR,2020,"We propose a dissection of GNNs through linearization of the parts, and find that linear graph filtering with non-linear set function is powerful enough for common graph classification benchmarks." 
+Sks3zF9eg,,1478300000000.0,1484310000000.0,473,Taming the waves: sine as activation function in deep neural networks,"[""giambattista.parascandolo@tut.fi"", ""heikki.huttunen@tut.fi"", ""tuomas.virtanen@tut.fi""]","[""Giambattista Parascandolo"", ""Heikki Huttunen"", ""Tuomas Virtanen""]","[""Theory"", ""Deep learning"", ""Optimization"", ""Supervised Learning""]","Most deep neural networks use non-periodic and monotonic—or at least +quasiconvex— activation functions. While sinusoidal activation functions have +been successfully used for specific applications, they remain largely ignored and +regarded as difficult to train. In this paper we formally characterize why these +networks can indeed often be difficult to train even in very simple scenarios, and +describe how the presence of infinitely many and shallow local minima emerges +from the architecture. We also provide an explanation to the good performance +achieved on a typical classification task, by showing that for several network architectures +the presence of the periodic cycles is largely ignored when the learning +is successful. Finally, we show that there are non-trivial tasks—such as learning +algorithms—where networks using sinusoidal activations can learn faster than +more established monotonic functions.",/pdf/5b46760a646ca9fecbe8ce41647b6a3cca4f6948.pdf,ICLR,2017,"Why nets with sine as activation function are difficult to train in theory. 
Also, they often don't use the periodic part if not needed, but when it's beneficial they might learn faster" +HyoST_9xl,,1478290000000.0,1488220000000.0,465,DSD: Dense-Sparse-Dense Training for Deep Neural Networks,"[""songhan@stanford.edu"", ""jpool@nvidia.com"", ""sharan@baidu.com"", ""huizi@stanford.edu"", ""enhaog@stanford.edu"", ""sjtang@stanford.edu"", ""eriche@google.com"", ""vajdap@fb.com"", ""mano@fb.com"", ""johntran@nvidia.com"", ""bcatanzaro@nvidia.com"", ""dally@stanford.edu""]","[""Song Han"", ""Jeff Pool"", ""Sharan Narang"", ""Huizi Mao"", ""Enhao Gong"", ""Shijian Tang"", ""Erich Elsen"", ""Peter Vajda"", ""Manohar Paluri"", ""John Tran"", ""Bryan Catanzaro"", ""William J. Dally""]","[""Deep learning""]","Modern deep neural networks have a large number of parameters, making them very hard to train. We propose DSD, a dense-sparse-dense training flow, for regularizing deep neural networks and achieving better optimization performance. In the first D (Dense) step, we train a dense network to learn connection weights and importance. In the S (Sparse) step, we regularize the network by pruning the unimportant connections with small weights and retraining the network given the sparsity constraint. In the final D (re-Dense) step, we increase the model capacity by removing the sparsity constraint, re-initialize the pruned parameters from zero and retrain the whole dense network. Experiments show that DSD training can improve the performance for a wide range of CNNs, RNNs and LSTMs on the tasks of image classification, caption generation and speech recognition. On ImageNet, DSD improved the Top1 accuracy of GoogLeNet by 1.1%, VGG-16 by 4.3%, ResNet-18 by 1.2% and ResNet-50 by 1.1%, respectively. On the WSJ’93 dataset, DSD improved DeepSpeech and DeepSpeech2 WER by 2.0% and 1.1%. On the Flickr-8K dataset, DSD improved the NeuralTalk BLEU score by over 1.7. 
DSD is easy to use in practice: at training time, DSD incurs only one extra hyper-parameter: the sparsity ratio in the S step. At testing time, DSD doesn’t change the network architecture or incur any inference overhead. The consistent and significant performance gain of DSD experiments shows the inadequacy of the current training methods for finding the best local optimum, while DSD effectively achieves superior optimization performance for finding a better solution. DSD models are available to download at https://songhan.github.io/DSD.",/pdf/317ff05385cb43f0aac187a87a96396c15ca1c1c.pdf,ICLR,2017,DSD effectively achieves superior optimization performance on a wide range of deep neural networks. +HJlEUoR9Km,S1xYBjZqtQ,1538090000000.0,1545360000000.0,165,Improved resistance of neural networks to adversarial images through generative pre-training,"[""joachim.wabnig@nokia-bell-labs.com""]","[""Joachim Wabnig""]","[""adversarial images"", ""Boltzmann machine"", ""mean field approximation""]",We train a feed forward neural network with increased robustness against adversarial attacks compared to conventional training approaches. This is achieved using a novel pre-trained building block based on a mean field description of a Boltzmann machine. On the MNIST dataset the method achieves strong adversarial resistance without data augmentation or adversarial training. We show that the increased adversarial resistance is correlated with the generative performance of the underlying Boltzmann machine.,/pdf/c258f19792cc3f4be4d568218abbb1a19f48724c.pdf,ICLR,2019,Generative pre-training with mean field Boltzmann machines increases robustness against adversarial images in neural networks. 
+BkxDthVtvS,BklQKcSnIr,1569440000000.0,1577170000000.0,82,Equivariant neural networks and equivarification,"[""baoerkao@gmail.com"", ""linqi.song@cityu.edu.hk""]","[""Erkao Bao"", ""Linqi Song""]","[""equivariant"", ""invariant"", ""neural network"", ""equivarification""]","A key difference from existing works is that our equivarification method can be applied without knowledge of the detailed functions of a layer in a neural network, and hence, can be generalized to any feedforward neural networks. Although the network size scales up, the constructed equivariant neural network does not increase the complexity of the network compared with the original one, in terms of the number of parameters. As an illustration, we build an equivariant neural network for image classification by equivarifying a convolutional neural network. Results show that our proposed method significantly reduces the design and training complexity, yet preserving the learning performance in terms of accuracy.",/pdf/9dfd9da1e9291567a85e8be6233f74fa0369ff49.pdf,ICLR,2020, +0zvfm-nZqQs,eIxBtKLk54u,1601310000000.0,1615590000000.0,741,Undistillable: Making A Nasty Teacher That CANNOT teach students,"[""~Haoyu_Ma1"", ""~Tianlong_Chen1"", ""~Ting-Kuei_Hu1"", ""~Chenyu_You1"", ""~Xiaohui_Xie2"", ""~Zhangyang_Wang1""]","[""Haoyu Ma"", ""Tianlong Chen"", ""Ting-Kuei Hu"", ""Chenyu You"", ""Xiaohui Xie"", ""Zhangyang Wang""]","[""knowledge distillation"", ""avoid knowledge leaking""]","Knowledge Distillation (KD) is a widely used technique to transfer knowledge from pre-trained teacher models to (usually more lightweight) student models. However, in certain situations, this technique is more of a curse than a blessing. 
For instance, KD poses a potential risk of exposing intellectual properties (IPs): even if a trained machine learning model is released in ``black boxes'' (e.g., as executable software or APIs without open-sourcing code), it can still be replicated by KD through imitating input-output behaviors. To prevent this unwanted effect of KD, this paper introduces and investigates a concept called $\textit{Nasty Teacher}$: a specially trained teacher network that yields nearly the same performance as a normal one, but would significantly degrade the performance of student models learned by imitating it. We propose a simple yet effective algorithm to build the nasty teacher, called $\textit{self-undermining knowledge distillation}$. Specifically, we aim to maximize the difference between the output of the nasty teacher and a normal pre-trained network. Extensive experiments on several datasets demonstrate that our method is effective on both standard KD and data-free KD, providing the desirable KD-immunity to model owners for the first time. We hope our preliminary study can draw more awareness and interest in this new practical problem of both social and legal importance. Our codes and pre-trained models can be found at: $\url{https://github.com/VITA-Group/Nasty-Teacher}$.",/pdf/42f6ff4cc0e85c1f3a226c56205d2f78953cdc7c.pdf,ICLR,2021,"We propose the Nasty Teacher, a defensive approach to prevent unauthorized cloning from a teacher model through knowledge distillation. " +vXj_ucZQ4hA,HJSVoIdp5fW,1601310000000.0,1615920000000.0,2155,Robust Pruning at Initialization,"[""~Soufiane_Hayou1"", ""~Jean-Francois_Ton1"", ""~Arnaud_Doucet2"", ""~Yee_Whye_Teh1""]","[""Soufiane Hayou"", ""Jean-Francois Ton"", ""Arnaud Doucet"", ""Yee Whye Teh""]","[""Pruning"", ""Initialization"", ""Compression""]","Overparameterized Neural Networks (NN) display state-of-the-art performance. 
However, there is a growing need for smaller, energy-efficient, neural networks to be able to use machine learning applications on devices with limited computational resources. A popular approach consists of using pruning techniques. While these techniques have traditionally focused on pruning pre-trained NN (LeCun et al.,1990; Hassibi et al., 1993), recent work by Lee et al. (2018) has shown promising results when pruning at initialization. However, for Deep NNs, such procedures remain unsatisfactory as the resulting pruned networks can be difficult to train and, for instance, they do not prevent one layer from being fully pruned. In this paper, we provide a comprehensive theoretical analysis of Magnitude and Gradient based pruning at initialization and training of sparse architectures. This allows us to propose novel principled approaches which we validate experimentally on a variety of NN architectures.",/pdf/2e8257a71504620d9109ae3b12c4c0c228f26db0.pdf,ICLR,2021,Making pruning at initialization robust to Gradient vanishing/exploding +agHLCOBM5jP,9I-QItuJqqR,1601310000000.0,1614560000000.0,3284,Fully Unsupervised Diversity Denoising with Convolutional Variational Autoencoders,"[""~Mangal_Prakash1"", ""~Alexander_Krull3"", ""~Florian_Jug1""]","[""Mangal Prakash"", ""Alexander Krull"", ""Florian Jug""]","[""Diversity denoising"", ""Unsupervised denoising"", ""Variational Autoencoders"", ""Noise model""]","Deep Learning based methods have emerged as the indisputable leaders for virtually all image restoration tasks. Especially in the domain of microscopy images, various content-aware image restoration (CARE) approaches are now used to improve the interpretability of acquired data. Naturally, there are limitations to what can be restored in corrupted images, and like for all inverse problems, many potential solutions exist, and one of them must be chosen. 
Here, we propose DivNoising, a denoising approach based on fully convolutional variational autoencoders (VAEs), overcoming the problem of having to choose a single solution by predicting a whole distribution of denoised images. First we introduce a principled way of formulating the unsupervised denoising problem within the VAE framework by explicitly incorporating imaging noise models into the decoder. Our approach is fully unsupervised, only requiring noisy images and a suitable description of the imaging noise distribution. We show that such a noise model can either be measured, bootstrapped from noisy data, or co-learned during training. If desired, consensus predictions can be inferred from a set of DivNoising predictions, leading to competitive results with other unsupervised methods and, on occasion, even with the supervised state-of-the-art. DivNoising samples from the posterior enable a plethora of useful applications. We are (i) showing denoising results for 13 datasets, (ii) discussing how optical character recognition (OCR) applications can benefit from diverse predictions, and are (iii) demonstrating how instance cell segmentation improves when using diverse DivNoising predictions.",/pdf/2afe972808ebb66f3926468902039c366b274c59.pdf,ICLR,2021,DivNoising performs fully unsupervised diversity denoising using fully convolutional variational autoencoders and achieves SOTA results for a number of well known datasets while also enabling VAE-like sampling +rkgIW1HKPB,BJl3i5sdDB,1569440000000.0,1577170000000.0,1545,Unsupervised Representation Learning by Predicting Random Distances,"[""hu.wang@adelaide.edu.au"", ""pangguansong@gmail.com"", ""chunhua.shen@adelaide.edu.au"", ""201520121828@mail.scut.edu.cn""]","[""Hu Wang"", ""Guansong Pang"", ""Chunhua Shen"", ""Congbo Ma""]","[""representation learning"", ""unsupervised learning"", ""anomaly detection"", ""clustering""]","Deep neural networks have gained tremendous success in a broad range of machine learning 
tasks due to its remarkable capability to learn semantic-rich features from high-dimensional data. However, they often require large-scale labelled data to successfully learn such features, which significantly hinders their adaption into unsupervised learning tasks, such as anomaly detection and clustering, and limits their applications into critical domains where obtaining massive labelled data is prohibitively expensive. To enable downstream unsupervised learning on those domains, in this work we propose to learn features without using any labelled data by training neural networks to predict data distances in a randomly projected space. Random mapping is a highly efficient yet theoretical proven approach to obtain approximately preserved distances. To well predict these random distances, the representation learner is optimised to learn class structures that are implicitly embedded in the randomly projected space. Experimental results on 19 real-world datasets show our learned representations substantially outperform state-of-the-art competing methods in both anomaly detection and clustering tasks.",/pdf/4f6a753f5cef654e9ea6f5afbde831b6872943de.pdf,ICLR,2020,"This paper introduces a novel Random Distance Prediction model to learn expressive feature representations in a fully unsupervised fashion by predicting random distances, enabling substantially improved anomaly detection and clustering performance." 
+jQUf0TmN-oT,RSCkB4N5n6E,1601310000000.0,1615540000000.0,2602,SACoD: Sensor Algorithm Co-Design Towards Efficient CNN-powered Intelligent PhlatCam,"[""~Yonggan_Fu1"", ""~Yang_Zhang3"", ""~Yue_Wang3"", ""~Zhihan_Lu1"", ""~Vivek_Boominathan1"", ""~Ashok_Veeraraghavan1"", ""~Yingyan_Lin1""]","[""Yonggan Fu"", ""Yang Zhang"", ""Yue Wang"", ""Zhihan Lu"", ""Vivek Boominathan"", ""Ashok Veeraraghavan"", ""Yingyan Lin""]","[""Sensor Network Co-design"", ""neural architecture search""]"," There has been a booming demand for integrating Convolutional Neural Networks (CNNs) powered functionalities into Internet-of-Thing (IoT) devices to enable ubiquitous intelligent ""IoT cameras”. However, more extensive applications of such IoT systems are still limited by two challenges. First, some applications, especially medicine- and wearable-related ones, impose stringent requirements on the camera form factor. Second, powerful CNNs often require considerable storage and energy cost, whereas IoT devices often suffer from limited resources. PhlatCam, with its form factor potentially reduced by orders of magnitude, has emerged as a promising solution to the first aforementioned challenge, while the second one remains a bottleneck. Existing compression techniques, which can potentially tackle the second challenge, are far from realizing the full potential in storage and energy reduction, because they mostly focus on the CNN algorithm itself. To this end, this work proposes SACoD, a Sensor Algorithm Co-Design framework to develop more efficient CNN-powered PhlatCam. In particular, the mask coded in the PhlatCam sensor and the backend CNN model are jointly optimized in terms of both model parameters and architectures via differential neural architecture search. 
Extensive experiments including both simulation and physical measurement on manufactured masks show that the proposed SACoD framework achieves aggressive model compression and energy savings while maintaining or even boosting the task accuracy, when benchmarking over two state-of-the-art (SOTA) designs with six datasets on four different tasks. We also perform visualization for better understanding the superiority of SACoD generated designs. All the codes will be released publicly upon acceptance. ",/pdf/e389e16aa3bb7ce8b9b7209b5092d64697871c19.pdf,ICLR,2021,"We propose SACoD, a Sensor Algorithm Co-Design framework to develop more efficient CNN-powered PhlatCam." +n1wPkibo2R,F4hcBKLB0pI,1601310000000.0,1614990000000.0,2840,An Efficient Protocol for Distributed Column Subset Selection in the Entrywise $\ell_p$ Norm,"[""~Shuli_Jiang1"", ""dongyul@cs.cmu.edu"", ""mengzeli@andrew.cmu.edu"", ""~Arvind_V._Mahankali1"", ""~David_Woodruff2""]","[""Shuli Jiang"", ""Dongyu Li"", ""Irene Mengze Li"", ""Arvind V. Mahankali"", ""David Woodruff""]","[""Column Subset Selection"", ""Distributed Learning""]","We give a distributed protocol with nearly-optimal communication and number of rounds for Column Subset Selection with respect to the entrywise {$\ell_1$} norm ($k$-CSS$_1$), and more generally, for the $\ell_p$-norm with $1 \leq p < 2$. We study matrix factorization in $\ell_1$-norm loss, rather than the more standard Frobenius norm loss, because the $\ell_1$ norm is more robust to noise, which is observed to lead to improved performance in a wide range of computer vision and robotics problems. +In the distributed setting, we consider $s$ servers in the standard coordinator model of communication, where the columns of the input matrix $A \in \mathbb{R}^{d \times n}$ ($n \gg d$) are distributed across the $s$ servers. 
We give a protocol in this model with $\widetilde{O}(sdk)$ communication, $1$ round, and polynomial running time, and which achieves a multiplicative $k^{\frac{1}{p} - \frac{1}{2}}\poly(\log nd)$-approximation to the best possible column subset. A key ingredient in our proof is the reduction to the $\ell_{p,2}$-norm, which corresponds to the $p$-norm of the vector of Euclidean norms of each of the columns of $A$. This enables us to use strong coreset constructions for Euclidean norms, which previously had not been used in this context. This naturally also allows us to implement our algorithm in the popular streaming model of computation. We further propose a greedy algorithm for selecting columns, which can be used by the coordinator, and show the first provable guarantees for a greedy algorithm for the $\ell_{1,2}$ norm. Finally, we implement our protocol and give significant practical advantages on real-world data analysis tasks.",/pdf/e29c8abd902cd3abe1add9e136bf01b5fbfbf9f6.pdf,ICLR,2021, +F1vEjWK-lH_,EMduyDb5wVM,1601310000000.0,1616040000000.0,2546,Gradient Vaccine: Investigating and Improving Multi-task Optimization in Massively Multilingual Models,"[""~Zirui_Wang1"", ""~Yulia_Tsvetkov1"", ""~Orhan_Firat1"", ""~Yuan_Cao2""]","[""Zirui Wang"", ""Yulia Tsvetkov"", ""Orhan Firat"", ""Yuan Cao""]","[""Multi-task Learning"", ""Multilingual Modeling""]","Massively multilingual models subsuming tens or even hundreds of languages pose great challenges to multi-task optimization. While it is a common practice to apply a language-agnostic procedure optimizing a joint multilingual task objective, how to properly characterize and take advantage of its underlying problem structure for improving optimization efficiency remains under-explored. In this paper, we attempt to peek into the black-box of multilingual optimization through the lens of loss function geometry. 
We find that gradient similarity measured along the optimization trajectory is an important signal, which correlates well with not only language proximity but also the overall model performance. Such observation helps us to identify a critical limitation of existing gradient-based multi-task learning methods, and thus we derive a simple and scalable optimization procedure, named Gradient Vaccine, which encourages more geometrically aligned parameter updates for close tasks. Empirically, our method obtains significant model performance gains on multilingual machine translation and XTREME benchmark tasks for multilingual language models. Our work reveals the importance of properly measuring and utilizing language proximity in multilingual optimization, and has broader implications for multi-task learning beyond multilingual modeling.",/pdf/4958372042631716242a4b3f1a10231548614522.pdf,ICLR,2021, +yvzMA5im3h,mt9575LHKqe,1601310000000.0,1614990000000.0,810,Graph Joint Attention Networks,"[""~Tiantian_He1"", ""bailu@ntu.edu.sg"", ""~Yew-Soon_Ong1""]","[""Tiantian He"", ""Lu Bai"", ""Yew-Soon Ong""]","[""Graph attention networks"", ""Joint attention mechanism"", ""Graph transductive learning""]","Graph attention networks (GATs) have been recognized as powerful tools for learning in graph structured data. However, how to enable the attention mechanisms in GATs to smoothly consider both structural and feature information is still very challenging. In this paper, we propose Graph Joint Attention Networks (JATs) to address the aforementioned challenge. Different from previous attention-based graph neural networks (GNNs), JATs adopt novel joint attention mechanisms which can automatically determine the relative significance between node features and structural coefficients learned from graph subspace, when computing the attention scores. Therefore, representations concerning more structural properties can be inferred by JATs. 
+Besides, we theoretically analyze the expressive power of JATs and further propose an improved strategy for the joint attention mechanisms that enables JATs to reach the upper bound of expressive power which every message-passing GNN can ultimately achieve, i.e., 1-WL test. JATs can thereby be seen as most powerful message-passing GNNs. The proposed neural architecture has been extensively tested on widely used benchmarking datasets, including Cora, Cite, and Pubmed and has been compared with state-of-the-art GNNs for node classification tasks. Experimental results show that JATs achieve state-of-the-art performance on all the testing datasets.",/pdf/0957a57f9f812b828dd59a585ea9d1ac40a5db41.pdf,ICLR,2021, +HJgKYlSKvr,Hklr7AxYDB,1569440000000.0,1577170000000.0,2443,Unsupervised Generative 3D Shape Learning from Natural Images,"[""attila.szabo@inf.unibe.ch"", ""givi.meishvili@inf.unibe.ch"", ""paolo.favaro@inf.unibe.ch""]","[""Attila Szabo"", ""Givi Meishvili"", ""Paolo Favaro""]","[""unsupervised"", ""3D"", ""differentiable"", ""rendering"", ""disentangling"", ""interpretable""]","In this paper we present, to the best of our knowledge, the first method to learn a generative model of 3D shapes from natural images in a fully unsupervised way. For example, we do not use any ground truth 3D or 2D annotations, stereo video, and ego-motion during the training. Our approach follows the general strategy of Generative Adversarial Networks, where an image generator network learns to create image samples that are realistic enough to fool a discriminator network into believing that they are natural images. In contrast, in our approach the image generation is split into 2 stages. In the first stage a generator network outputs 3D objects. In the second, a differentiable renderer produces an image of the 3D object from a random viewpoint. The key observation is that a realistic 3D object should yield a realistic rendering from any plausible viewpoint. 
Thus, by randomizing the choice of the viewpoint our proposed training forces the generator network to learn an interpretable 3D representation disentangled from the viewpoint. In this work, a 3D representation consists of a triangle mesh and a texture map that is used to color the triangle surface by using the UV-mapping technique. We provide analysis of our learning approach, expose its ambiguities and show how to overcome them. Experimentally, we demonstrate that our method can learn realistic 3D shapes of faces by using only the natural images of the FFHQ dataset.",/pdf/20883019eeab7c402e03cfe7ec5b5d42e5c7fbe4.pdf,ICLR,2020,We train a generative 3D model of shapes from natural images in a fully unsupervised way. +BJGVX3CqYm,SJleNcaqYX,1538090000000.0,1545360000000.0,1349,Mixed Precision Quantization of ConvNets via Differentiable Neural Architecture Search,"[""bichen@berkeley.edu"", ""yanghan@instagram.com"", ""stzpz@fb.com"", ""yuandong@fb.com"", ""vajdap@fb.com""]","[""Bichen Wu"", ""Yanghan Wang"", ""Peizhao Zhang"", ""Yuandong Tian"", ""Peter Vajda"", ""Kurt Keutzer""]","[""Neural Net Quantization"", ""Neural Architecture Search""]","Recent work in network quantization has substantially reduced the time and space complexity of neural network inference, enabling their deployment on embedded and mobile devices with limited computational and memory resources. However, existing quantization methods often represent all weights and activations with the same precision (bit-width). In this paper, we explore a new dimension of the design space: quantizing different layers with different bit-widths. We formulate this problem as a neural architecture search problem and propose a novel differentiable neural architecture search (DNAS) framework to efficiently explore its exponential search space with gradient-based optimization. Experiments show we surpass the state-of-the-art compression of ResNet on CIFAR-10 and ImageNet. 
Our quantized models with 21.1x smaller model size or 103.9x lower computational cost can still outperform baseline quantized or even full precision models.",/pdf/c28f925c389b2212e02c426c60439e409aef7401.pdf,ICLR,2019,A novel differentiable neural architecture search framework for mixed quantization of ConvNets. +BJlPLlrFvH,S1lS4qltvB,1569440000000.0,1577170000000.0,2327,Variable Complexity in the Univariate and Multivariate Structural Causal Model,"[""tomerga2@post.tau.ac.il"", ""ofirnabati@mail.tau.ac.il"", ""wolf@fb.com""]","[""Tomer Galanti"", ""Ofir Nabati"", ""Lior Wolf""]",[],"We show that by comparing the individual complexities of univariate cause and effect in the Structural Causal Model, one can identify the cause and the effect, without considering their interaction at all. The entropy of each variable is ineffective in measuring the complexity, and we propose to capture it by an autoencoder that operates on the list of sorted samples. Comparing the reconstruction errors of the two autoencoders, one for each variable, is shown to perform well on the accepted benchmarks of the field. + +In the multivariate case, where one can ensure that the complexities of the cause and effect are balanced, we propose a new method that mimics the disentangled structure of the causal model. We extend the results of~\cite{Zhang:2009:IPC:1795114.1795190} to the multidimensional case, showing that such modeling is only likely in the direction of causality. Furthermore, the learned model is shown theoretically to perform the separation to the causal component and to the residual (noise) component. 
Our multidimensional method obtains a significantly higher accuracy than the literature methods.",/pdf/62346780de1a89a548f1a246fbba8cdc764b697f.pdf,ICLR,2020, +HJlHzJBFwB,BJg5Rb2OwB,1569440000000.0,1577170000000.0,1578,Accelerating Monte Carlo Bayesian Inference via Approximating Predictive Uncertainty over the Simplex,"[""yufeicui92@gmail.com"", ""satie.yao@my.cityu.edu.hk"", ""qiaoli045@gmail.com"", ""abchan@cityu.edu.hk"", ""jasonxue@cityu.edu.hk""]","[""Yufei Cui"", ""Wuguannan Yao"", ""Qiao Li"", ""Antoni Chan"", ""Chun Jason Xue""]",[],"Estimating the predictive uncertainty of a Bayesian learning model is critical in various decision-making problems, e.g., reinforcement learning, detecting adversarial attack, self-driving car. As the model posterior is almost always intractable, most efforts were made on finding an accurate approximation of the true posterior. Even though a decent estimation of the model posterior is obtained, another approximation is required to compute the predictive distribution over the desired output. A common accurate solution is to use Monte Carlo (MC) integration. However, it needs to maintain a large number of samples, evaluate the model repeatedly and average multiple model outputs. In many real-world cases, this is computationally prohibitive. In this work, assuming that the exact posterior or a decent approximation is obtained, we propose a generic framework to approximate the output probability distribution induced by model posterior with a parameterized model and in an amortized fashion. The aim is to approximate the true uncertainty of a specific Bayesian model, meanwhile alleviating the heavy workload of MC integration at testing time. The proposed method is universally applicable to Bayesian classification models that allow for posterior sampling. Theoretically, we show that the idea of amortization incurs no additional costs on approximation performance. 
Empirical results validate the strong practical performance of our approach.",/pdf/1fa1a674f5cf6de871a6ff8631c689b76a526518.pdf,ICLR,2020, +Hyghb2Rct7,BJlc-Z9qtQ,1538090000000.0,1545360000000.0,1208,SIMILE: Introducing Sequential Information towards More Effective Imitation Learning,"[""ytongbai@gmail.com"", ""198808xc@gmail.com""]","[""Yutong Bai"", ""Lingxi Xie""]","[""Reinforcement Learning"", ""Imitation Learning"", ""Sequential Information""]","Reinforcement learning (RL) is a metaheuristic aiming at teaching an agent to interact with an environment and maximizing the reward in a complex task. RL algorithms often encounter the difficulty in defining a reward function in a sparse solution space. Imitation learning (IL) deals with this issue by providing a few expert demonstrations, and then either mimicking the expert's behavior (behavioral cloning, BC) or recovering the reward function by assuming the optimality of the expert (inverse reinforcement learning, IRL). Conventional IL approaches formulate the agent policy by mapping one single state to a distribution over actions, which did not consider sequential information. This strategy can be less accurate especially in IL, a weakly supervised learning environment, especially when the number of expert demonstrations is limited. + +This paper presents an effective approach named Sequential IMItation LEarning (SIMILE). The core idea is to introduce sequential information, so that an agent can refer to both the current state and past state-action pairs to make a decision. We formulate our approach into a recurrent model, and instantiate it using LSTM so as to fuse both long-term and short-term information. SIMILE is a generalized IL framework which is easily applied to BC and IRL, two major types of IL algorithms. Experiments are performed on several robot controlling tasks in OpenAI Gym. 
SIMILE not only achieves performance gain over the baseline approaches, but also enjoys the benefit of faster convergence and better stability of testing performance. These advantages verify a higher learning efficiency of SIMILE, and implies its potential applications in real-world scenarios, i.e., when the agent-environment interaction is more difficult and/or expensive.",/pdf/b095a3ad08413e428a99de63b23602810e353f6c.pdf,ICLR,2019,This paper introduces sequential information to improve inverse reinforcement learning algorithms +BJxeHyrKPB,ryeVde6OwS,1569440000000.0,1577170000000.0,1678,RATE-DISTORTION OPTIMIZATION GUIDED AUTOENCODER FOR GENERATIVE APPROACH,"[""kato.keizo@jp.fujitsu.com"", ""zhoujing@cn.fujitsu.com"", ""anaka@jp.fujitsu.com""]","[""Keizo Kato"", ""Jing Zhou"", ""Akira Nakagawa""]","[""Autoencoder"", ""Rate-distortion optimization"", ""Generative model"", ""Unsupervised learning"", ""Jacobian""]","In the generative model approach of machine learning, it is essential to acquire an accurate probabilistic model and compress the dimension of data for easy treatment. However, in the conventional deep-autoencoder based generative model such as VAE, the probability of the real space cannot be obtained correctly from that of in the latent space, because the scaling between both spaces is not controlled. This has also been an obstacle to quantifying the impact of the variation of latent variables on data. In this paper, we propose a method to learn parametric probability distribution and autoencoder simultaneously based on Rate-Distortion Optimization to support scaling control. It is proved theoretically and experimentally that (i) the probability distribution of the latent space obtained by this model is proportional to the probability distribution of the real space because Jacobian between two spaces is constant: (ii) our model behaves as non-linear PCA, which enables to evaluate the influence of latent variables on data. 
Furthermore, to verify the usefulness on the practical application, we evaluate its performance in unsupervised anomaly detection and outperform current state-of-the-art methods.",/pdf/bbd1aa96a619d0bf499e0f2aebb4235e6bac37d5.pdf,ICLR,2020,"We propose an autoencoder based on Rate-Distortion Optimization. With our model, log-likelihood maximization is possible without ELBO." +6BRLOfrMhW,8ObDbp2iLHJ,1601310000000.0,1615800000000.0,696,Partitioned Learned Bloom Filters,"[""~Kapil_Vaidya1"", ""eric_knorr@g.harvard.edu"", ""~Michael_Mitzenmacher1"", ""~Tim_Kraska1""]","[""Kapil Vaidya"", ""Eric Knorr"", ""Michael Mitzenmacher"", ""Tim Kraska""]","[""optimization"", ""data structures"", ""algorithms"", ""theory"", ""learned algorithms""]","Bloom filters are space-efficient probabilistic data structures that are used to test whether an element is a member of a set, and may return false positives. Recently, variations referred to as learned Bloom filters were developed that can provide improved performance in terms of the rate of false positives, by using a learned model for the represented set. However, previous methods for learned Bloom filters do not take full advantage of the learned model. Here we show how to frame the problem of optimal model utilization as an optimization problem, and using our framework derive algorithms that can achieve near-optimal performance in many cases.",/pdf/9d3c15624a3da1883d53dc7d7e286e835c51a105.pdf,ICLR,2021, +HyeGBj09Fm,BkxQbsw0uX,1538090000000.0,1550670000000.0,70,Generating Liquid Simulations with Deformation-aware Neural Networks,"[""lukas.prantl@tum.de"", ""boris.bonev@tum.de"", ""nils.thuerey@tum.de""]","[""Lukas Prantl"", ""Boris Bonev"", ""Nils Thuerey""]","[""deformation learning"", ""spatial transformer networks"", ""fluid simulation""]","We propose a novel approach for deformation-aware neural networks that learn the weighting and synthesis of dense volumetric deformation fields. 
Our method specifically targets the space-time representation of physical surfaces from liquid simulations. Liquids exhibit highly complex, non-linear behavior under changing simulation conditions such as different initial conditions. Our algorithm captures these complex phenomena in two stages: a first neural network computes a weighting function for a set of pre-computed deformations, while a second network directly generates a deformation field for refining the surface. Key for successful training runs in this setting is a suitable loss function that encodes the effect of the deformations, and a robust calculation of the corresponding gradients. To demonstrate the effectiveness of our approach, we showcase our method with several complex examples of flowing liquids with topology changes. Our representation makes it possible to rapidly generate the desired implicit surfaces. We have implemented a mobile application to demonstrate that real-time interactions with complex liquid effects are possible with our approach.",/pdf/3c157edb9037ee72d313c9d30fa730a4f9647d54.pdf,ICLR,2019,Learning weighting and deformations of space-time data sets for highly efficient approximations of liquid behavior. 
+F8xpAPm_ZKS,q0XSb1TWI9kE,1601310000000.0,1614990000000.0,3792,Model-Free Counterfactual Credit Assignment,"[""mesnard@google.com"", ""~Theophane_Weber1"", ""~Fabio_Viola2"", ""thakoor@google.com"", ""alaas@google.com"", ""~Anna_Harutyunyan1"", ""~Will_Dabney1"", ""~Tom_Stepleton1"", ""~Nicolas_Heess1"", ""~Marcus_Hutter1"", ""~Lars_Holger_Buesing1"", ""~Remi_Munos1""]","[""Thomas Mesnard"", ""Theophane Weber"", ""Fabio Viola"", ""Shantanu Thakoor"", ""Alaa Saade"", ""Anna Harutyunyan"", ""Will Dabney"", ""Tom Stepleton"", ""Nicolas Heess"", ""Marcus Hutter"", ""Lars Holger Buesing"", ""Remi Munos""]","[""credit assignment"", ""model-free RL"", ""causality"", ""hindsight""]","Credit assignment in reinforcement learning is the problem of measuring an action’s influence on future rewards. +In particular, this requires separating \emph{skill} from \emph{luck}, ie.\ disentangling the effect of an action on rewards from that of external factors and subsequent actions. To achieve this, we adapt the notion of counterfactuals from causality theory to a model-free RL setup. +The key idea is to condition value functions on \emph{future} events, by learning to extract relevant information from a trajectory. We then propose to use these as future-conditional baselines and critics in policy gradient algorithms and we develop a valid, practical variant with provably lower variance, while achieving unbiasedness by constraining the hindsight information not to contain information about the agent’s actions. We demonstrate the efficacy and validity of our algorithm on a number of illustrative problems.",/pdf/89070bd0f4794115918b5ff91eb9ca9bd3c5f7db.pdf,ICLR,2021,"Under an appropriate action-independence constraint, future-conditional baselines are valid to use in policy gradients and lead to drastically reduced variance and faster learning in certain environments with difficult credit assignment." 
+H38f_9b90BO,lc6M716i2ib,1601310000000.0,1614990000000.0,317,Towards Robust Graph Neural Networks against Label Noise,"[""~Jun_Xia1"", ""~Haitao_Lin2"", ""~Yongjie_Xu1"", ""~Lirong_Wu1"", ""~Zhangyang_Gao1"", ""~Siyuan_Li6"", ""~Stan_Z._Li2""]","[""Jun Xia"", ""Haitao Lin"", ""Yongjie Xu"", ""Lirong Wu"", ""Zhangyang Gao"", ""Siyuan Li"", ""Stan Z. Li""]","[""Graph Neural Networks"", ""Graph Node Classification"", ""Label Noise""]","Massive labeled data have been used in training deep neural networks, thus label noise has become an important issue therein. Although learning with noisy labels has made great progress on image datasets in recent years, it has not yet been studied in connection with utilizing GNNs to classify graph nodes. In this paper, we proposed a method, named LPM, to address the problem using Label Propagation (LP) and Meta learning. Different from previous methods designed for image datasets, our method is based on a special attribute (label smoothness) of graph-structured data, i.e., neighboring nodes in a graph tend to have the same label. A pseudo label is computed from the neighboring labels for each node in the training set using LP; meta learning is utilized to learn a proper aggregation of the original and pseudo label as the final label. Experimental results demonstrate that LPM outperforms state-of-the-art methods in graph node classification task with both synthetic and real-world label noise. Source code to reproduce all results will be released.",/pdf/0f5774c96a06ab9bc0c9353ff3f2a59b2ef885ae.pdf,ICLR,2021,"To the best of our knowledge, we are the first to focus on the label noise existing in utilizing GNNs to classify graph nodes." 
+C_p3TDhOXW_,jY3wMYEuzTE,1601310000000.0,1614990000000.0,2655,Prior Preference Learning From Experts: Designing A Reward with Active Inference,"[""~Jin_Young_Shin1"", ""~Cheolhyeong_Kim1"", ""~Hyung_Ju_Hwang1""]","[""Jin Young Shin"", ""Cheolhyeong Kim"", ""Hyung Ju Hwang""]","[""Active Inference"", ""Free Energy Principle"", ""Reinforcement Learning"", ""Reward Design""]","Active inference may be defined as Bayesian modeling of a brain with a biologically plausible model of the agent. Its primary idea relies on the free energy principle and the prior preference of the agent. An agent will choose an action that leads to its prior preference for a future observation. In this paper, we claim that active inference can be interpreted using reinforcement learning (RL) algorithms and find a theoretical connection between them. We extend the concept of expected free energy (EFE), which is a core quantity in active inference, and claim that EFE can be treated as a negative value function. Motivated by the concept of prior preference and a theoretical connection, we propose a simple but novel method for learning a prior preference from experts. This illustrates that the problem with RL can be approached with a new perspective of active inference. Experimental results of prior preference learning show the possibility of active inference with EFE-based rewards and its application to an inverse RL problem.",/pdf/46763e1c73516be5c1027f1c73c975146e60a19c.pdf,ICLR,2021,We propose a new method to design a reward from experts' simulations based on a concept of active inference. 
+rJg7BA4YDr,r1xENcLdDB,1569440000000.0,1577170000000.0,1104,NEURAL EXECUTION ENGINES,"[""yujunyan@umich.edu"", ""kswersky@google.com"", ""dkoutra@umich.edu"", ""parthas@google.com"", ""miladh@google.com""]","[""Yujun Yan"", ""Kevin Swersky"", ""Danai Koutra"", ""Parthasarathy Ranganathan"", ""Milad Hashemi""]","[""neural computation"", ""strong generalization"", ""numerical reasoning""]","Turing complete computation and reasoning are often regarded as necessary precursors to general intelligence. There has been a significant body of work studying neural networks that mimic general computation, but these networks fail to generalize to data distributions that are outside of their training set. We study this problem through the lens of fundamental computer science problems: sorting and graph processing. We modify the masking mechanism of a transformer in order to allow them to implement rudimentary functions with strong generalization. We call this model the Neural Execution Engine, and show that it learns, through supervision, to numerically compute the basic subroutines comprising these algorithms with near perfect accuracy. Moreover, it retains this level of accuracy while generalizing to unseen data and long sequences outside of the training distribution.",/pdf/c9a2e09395b51b0be064f2e31ea38016298b1124.pdf,ICLR,2020,"We propose neural execution engines (NEEs), which leverage a learned mask and supervised execution traces to mimic the functionality of subroutines and demonstrate strong generalization." +rye4g3AqFm,SylX2xAcFQ,1538090000000.0,1550880000000.0,1066,Deep learning generalizes because the parameter-function map is biased towards simple functions,"[""guillermo.valle@dtc.ox.ac.uk"", ""chico.camargo@gmail.com"", ""ard.louis@physics.ox.ac.uk""]","[""Guillermo Valle-Perez"", ""Chico Q. Camargo"", ""Ard A. 
Louis""]","[""generalization"", ""deep learning theory"", ""PAC-Bayes"", ""Gaussian processes"", ""parameter-function map"", ""simplicity bias""]","Deep neural networks (DNNs) generalize remarkably well without explicit regularization even in the strongly over-parametrized regime where classical learning theory would instead predict that they would severely overfit. While many proposals for some kind of implicit regularization have been made to rationalise this success, there is no consensus for the fundamental reason why DNNs do not strongly overfit. In this paper, we provide a new explanation. By applying a very general probability-complexity bound recently derived from algorithmic information theory (AIT), we argue that the parameter-function map of many DNNs should be exponentially biased towards simple functions. We then provide clear evidence for this strong simplicity bias in a model DNN for Boolean functions, as well as in much larger fully connected and convolutional networks trained on CIFAR10 and MNIST. +As the target functions in many real problems are expected to be highly structured, this intrinsic simplicity bias helps explain why deep networks generalize well on real world problems. +This picture also facilitates a novel PAC-Bayes approach where the prior is taken over the DNN input-output function space, rather than the more conventional prior over parameter space. If we assume that the training algorithm samples parameters close to uniformly within the zero-error region then the PAC-Bayes theorem can be used to guarantee good expected generalization for target functions producing high-likelihood training sets. 
By exploiting recently discovered connections between DNNs and Gaussian processes to estimate the marginal likelihood, we produce relatively tight generalization PAC-Bayes error bounds which correlate well with the true error on realistic datasets such as MNIST and CIFAR10 and for architectures including convolutional and fully connected networks.",/pdf/047045cff29435c46508e8cd66055d206867c3e4.pdf,ICLR,2019,The parameter-function map of deep networks is hugely biased; this can explain why they generalize. We use PAC-Bayes and Gaussian processes to obtain nonvacuous bounds. +BJedHRVtPB,HylSqnLOvr,1569440000000.0,1584640000000.0,1117,Pseudo-LiDAR++: Accurate Depth for 3D Object Detection in Autonomous Driving,"[""yy785@cornell.edu"", ""yw763@cornell.edu"", ""weilunchao760414@gmail.com"", ""dg595@cornell.edu"", ""gp346@cornell.edu"", ""bharathh@cs.cornell.edu"", ""mc288@cornell.edu"", ""kqw4@cornell.edu""]","[""Yurong You"", ""Yan Wang"", ""Wei-Lun Chao"", ""Divyansh Garg"", ""Geoff Pleiss"", ""Bharath Hariharan"", ""Mark Campbell"", ""Kilian Q. Weinberger""]","[""pseudo-LiDAR"", ""3D-object detection"", ""stereo depth estimation"", ""autonomous driving""]","Detecting objects such as cars and pedestrians in 3D plays an indispensable role in autonomous driving. Existing approaches largely rely on expensive LiDAR sensors for accurate depth information. While recently pseudo-LiDAR has been introduced as a promising alternative, at a much lower cost based solely on stereo images, there is still a notable performance gap. +In this paper we provide substantial advances to the pseudo-LiDAR framework through improvements in stereo depth estimation. Concretely, we adapt the stereo network architecture and loss function to be more aligned with accurate depth estimation of faraway objects --- currently the primary weakness of pseudo-LiDAR. 
Further, we explore the idea to leverage cheaper but extremely sparse LiDAR sensors, which alone provide insufficient information for 3D detection, to de-bias our depth estimation. We propose a depth-propagation algorithm, guided by the initial depth estimates, to diffuse these few exact measurements across the entire depth map. We show on the KITTI object detection benchmark that our combined approach yields substantial improvements in depth estimation and stereo-based 3D object detection --- outperforming the previous state-of-the-art detection accuracy for faraway objects by 40%. Our code is available at https://github.com/mileyan/Pseudo_Lidar_V2.",/pdf/f9532a87f751b9201aea15e1df1db90e51c9411f.pdf,ICLR,2020, +a9nIWs-Orh,NsHv-le8W28,1601310000000.0,1614990000000.0,1723,Deepening Hidden Representations from Pre-trained Language Models,"[""~Junjie_Yang3"", ""~hai_zhao1""]","[""Junjie Yang"", ""hai zhao""]","[""Natural Language Processing"", ""Representation Learning""]","Transformer-based pre-trained language models have proven to be effective for learning contextualized language representation. However, current approaches only take advantage of the output of the encoder's final layer when fine-tuning the downstream tasks. We argue that only taking single layer's output restricts the power of pre-trained representation. Thus we deepen the representation learned by the model by fusing the hidden representation in terms of an explicit HIdden Representation Extractor (HIRE), which automatically absorbs the complementary representation with respect to the output from the final layer. 
Utilizing RoBERTa as the backbone encoder, our proposed improvement over the pre-trained models is shown effective on multiple natural language understanding tasks and help our model rival with the state-of-the-art models on the GLUE benchmark.",/pdf/75f70119590c29caa6c9ac3bb6411fe5d2a6a6cf.pdf,ICLR,2021, +rkx8l3Cctm,rygv5JeFKX,1538090000000.0,1545360000000.0,1076,Safe Policy Learning from Observations,"[""elad.sarafian@gmail.com"", ""avivt@berkeley.edu"", ""sarit@cs.biu.ac.il""]","[""Elad Sarafian"", ""Aviv Tamar"", ""Sarit Kraus""]","[""learning from observations"", ""safe reinforcement learning"", ""deep reinforcement learning""]","In this paper, we consider the problem of learning a policy by observing numerous non-expert agents. Our goal is to extract a policy that, with high-confidence, acts better than the agents' average performance. Such a setting is important for real-world problems where expert data is scarce but non-expert data can easily be obtained, e.g. by crowdsourcing. Our approach is to pose this problem as safe policy improvement in reinforcement learning. First, we evaluate an average behavior policy and approximate its value function. Then, we develop a stochastic policy improvement algorithm that safely improves the average behavior. The primary advantages of our approach, termed Rerouted Behavior Improvement (RBI), over other safe learning methods are its stability in the presence of value estimation errors and the elimination of a policy search process. We demonstrate these advantages in the Taxi grid-world domain and in four games from the Atari learning environment.",/pdf/86b6100f6aeeda9c534845d3464f029e707aeb33.pdf,ICLR,2019,"An algorithm for learning to improve upon the behavior demonstrated by multiple unknown policies, by combining imitation learning and a novel safe policy improvement step that is resilient to value estimation errors." 
+Bkx5XyrtPS,HygHg_nOvr,1569440000000.0,1577170000000.0,1627,Depth creates no more spurious local minima in linear networks,"[""liqzhang@google.com""]","[""Li Zhang""]","[""local minimum"", ""deep linear network""]","We show that for any convex differentiable loss, a deep linear network has no spurious local minima as long as it is true for the two layer case. This reduction greatly simplifies the study on the existence of spurious local minima in deep linear networks. When applied to the quadratic loss, our result immediately implies the powerful result by Kawaguchi (2016). Further, with the recent work by Zhou & Liang (2018), we can remove all the assumptions in (Kawaguchi, 2016). This property holds for more general “multi-tower” linear networks too. Our proof builds on the work in (Laurent & von Brecht, 2018) and develops a new perturbation argument to show that any spurious local minimum must have full rank, a structural property which can be useful more generally.",/pdf/a19b8ec3346af2360040a0059c740b3fa900c6b8.pdf,ICLR,2020,We show that a deep linear network has no spurious local minima as long as it is true for the two layer case. 
+H1lTRJBtwB,HklpedJtDH,1569440000000.0,1577170000000.0,2041,Compositional Transfer in Hierarchical Reinforcement Learning,"[""mwulfmeier@google.com"", ""aabdolmaleki@google.com"", ""rhafner@google.com"", ""springenberg@google.com"", ""neunertm@google.com"", ""thertweck@google.com"", ""thomaslampe@google.com"", ""heess@google.com"", ""riedmiller@google.com""]","[""Markus Wulfmeier"", ""Abbas Abdolmaleki"", ""Roland Hafner"", ""Jost Tobias Springenberg"", ""Michael Neunert"", ""Tim Hertweck"", ""Thomas Lampe"", ""Noah Siegel"", ""Nicolas Heess"", ""Martin Riedmiller""]","[""Multitask"", ""Transfer Learning"", ""Reinforcement Learning"", ""Hierarchical Reinforcement Learning"", ""Compositional"", ""Off-Policy""]","The successful application of flexible, general learning algorithms to real-world robotics applications is often limited by their poor data-efficiency. To address the challenge, domains with more than one dominant task of interest encourage the sharing of information across tasks to limit required experiment time. To this end, we investigate compositional inductive biases in the form of hierarchical policies as a mechanism for knowledge transfer across tasks in reinforcement learning (RL). We demonstrate that this type of hierarchy enables positive transfer while mitigating negative interference. Furthermore, we demonstrate the benefits of additional incentives to efficiently decompose task solutions. Our experiments show that these incentives are naturally given in multitask learning and can be easily introduced for single objectives. We design an RL algorithm that enables stable and fast learning of structured policies and the effective reuse of both behavior components and transition data across tasks in an off-policy setting. 
Finally, we evaluate our algorithm in simulated environments as well as physical robot experiments and demonstrate substantial improvements in data-efficiency over competitive baselines.",/pdf/b3a5e162ed7008cf4955f9d4990fd7799385a0e2.pdf,ICLR,2020,"We develop a hierarchical, actor-critic algorithm for compositional transfer by sharing policy components and demonstrate component specialization and related direct benefits in multitask domains as well as its adaptation for single tasks." +H1e0Wp4KvH,r1x4-fKLPS,1569440000000.0,1583910000000.0,395,Automated curriculum generation through setter-solver interactions,"[""lampinen@stanford.edu"", ""sracaniere@google.com"", ""adamsantoro@google.com"", ""reichert@google.com"", ""vladfi@google.com"", ""countzero@google.com""]","[""Sebastien Racaniere"", ""Andrew Lampinen"", ""Adam Santoro"", ""David Reichert"", ""Vlad Firoiu"", ""Timothy Lillicrap""]","[""Deep Reinforcement Learning"", ""Automatic Curriculum""]","Reinforcement learning algorithms use correlations between policies and rewards to improve agent performance. But in dynamic or sparsely rewarding environments these correlations are often too small, or rewarding events are too infrequent to make learning feasible. Human education instead relies on curricula –the breakdown of tasks into simpler, static challenges with dense rewards– to build up to complex behaviors. While curricula are also useful for artificial agents, hand-crafting them is time consuming. This has led researchers to explore automatic curriculum generation. Here we explore automatic curriculum generation in rich, dynamic environments. Using a setter-solver paradigm we show the importance of considering goal validity, goal feasibility, and goal coverage to construct useful curricula. 
We demonstrate the success of our approach in rich but sparsely rewarding 2D and 3D environments, where an agent is tasked to achieve a single goal selected from a set of possible goals that varies between episodes, and identify challenges for future work. Finally, we demonstrate the value of a novel technique that guides agents towards a desired goal distribution. Altogether, these results represent a substantial step towards applying automatic task curricula to learn complex, otherwise unlearnable goals, and to our knowledge are the first to demonstrate automated curriculum generation for goal-conditioned agents in environments where the possible goals vary between episodes.",/pdf/30d564052b18a3822b6a924daa6730e9aec54d01.pdf,ICLR,2020,We investigate automatic curriculum generation and identify a number of losses useful to learn to generate a curriculum of tasks. +rJl63fZRb,r1VuhMWAb,1509140000000.0,1521790000000.0,972,Parametrized Hierarchical Procedures for Neural Programming,"[""roy.d.fox@gmail.com"", ""shin.richard@gmail.com"", ""sanjay@eecs.berkeley.edu"", ""goldberg@berkeley.edu"", ""dawnsong.travel@gmail.com"", ""istoica@cs.berkeley.edu""]","[""Roy Fox"", ""Richard Shin"", ""Sanjay Krishnan"", ""Ken Goldberg"", ""Dawn Song"", ""Ion Stoica""]","[""Neural programming"", ""Hierarchical Control""]","Neural programs are highly accurate and structured policies that perform algorithmic tasks by controlling the behavior of a computation mechanism. Despite the potential to increase the interpretability and the compositionality of the behavior of artificial agents, it remains difficult to learn from demonstrations neural networks that represent computer programs. The main challenges that set algorithmic domains apart from other imitation learning domains are the need for high accuracy, the involvement of specific structures of data, and the extremely limited observability. 
To address these challenges, we propose to model programs as Parametrized Hierarchical Procedures (PHPs). A PHP is a sequence of conditional operations, using a program counter along with the observation to select between taking an elementary action, invoking another PHP as a sub-procedure, and returning to the caller. We develop an algorithm for training PHPs from a set of supervisor demonstrations, only some of which are annotated with the internal call structure, and apply it to efficient level-wise training of multi-level PHPs. We show in two benchmarks, NanoCraft and long-hand addition, that PHPs can learn neural programs more accurately from smaller amounts of both annotated and unannotated demonstrations.",/pdf/536c16681ee50070e91b9227d45e74e815a79982.pdf,ICLR,2018,"We introduce the PHP model for hierarchical representation of neural programs, and an algorithm for learning PHPs from a mixture of strong and weak supervision." +HJeTo2VFwH,HygZ6vGbPB,1569440000000.0,1583910000000.0,168,A Signal Propagation Perspective for Pruning Neural Networks at Initialization,"[""namhoon@robots.ox.ac.uk"", ""thalaiyasingam.ajanthan@anu.edu.au"", ""stephen.gould@anu.edu.au"", ""phst@robots.ox.ac.uk""]","[""Namhoon Lee"", ""Thalaiyasingam Ajanthan"", ""Stephen Gould"", ""Philip H. S. Torr""]","[""neural network pruning"", ""signal propagation perspective"", ""sparse neural networks""]","Network pruning is a promising avenue for compressing deep neural networks. A typical approach to pruning starts by training a model and then removing redundant parameters while minimizing the impact on what is learned. Alternatively, a recent approach shows that pruning can be done at initialization prior to training, based on a saliency criterion called connection sensitivity. However, it remains unclear exactly why pruning an untrained, randomly initialized neural network is effective. 
In this work, by noting connection sensitivity as a form of gradient, we formally characterize initialization conditions to ensure reliable connection sensitivity measurements, which in turn yields effective pruning results. Moreover, we analyze the signal propagation properties of the resulting pruned networks and introduce a simple, data-free method to improve their trainability. Our modifications to the existing pruning at initialization method lead to improved results on all tested network models for image classification tasks. Furthermore, we empirically study the effect of supervision for pruning and demonstrate that our signal propagation perspective, combined with unsupervised pruning, can be useful in various scenarios where pruning is applied to non-standard arbitrarily-designed architectures.",/pdf/0337afac0712aa60d2998f4a056e13ebdf41232d.pdf,ICLR,2020,We formally characterize the initialization conditions for effective pruning at initialization and analyze the signal propagation properties of the resulting pruned networks which leads to a method to enhance their trainability and pruning results. +ryfcCo0ctQ,HklXNp29FQ,1538090000000.0,1545360000000.0,922,Convergent Reinforcement Learning with Function Approximation: A Bilevel Optimization Perspective,"[""zy6@princeton.edu"", ""zuyuefu2022@u.northwestern.edu"", ""kzhang66@illinois.edu"", ""zhaoranwang@gmail.com""]","[""Zhuoran Yang"", ""Zuyue Fu"", ""Kaiqing Zhang"", ""Zhaoran Wang""]","[""reinforcement learning"", ""Deep Q-networks"", ""actor-critic algorithm"", ""ODE approximation""]"," We study reinforcement learning algorithms with nonlinear function approximation in the online setting. By formulating both the problems of value function estimation and policy learning as bilevel optimization problems, we propose online Q-learning and actor-critic algorithms for these two problems respectively. Our algorithms are gradient-based methods and thus are computationally efficient. 
Moreover, by approximating the iterates using differential equations, we establish convergence guarantees for the proposed algorithms. Thorough numerical experiments are conducted to back up our theory.",/pdf/b5090a93bdd9dbc5fb39e9816dc1ed3b3652c393.pdf,ICLR,2019, +YTyHkF4P03w,RYRRsbvEzjJ,1601310000000.0,1614990000000.0,3234,What to Prune and What Not to Prune at Initialization,"[""~Maham_Haroon1""]","[""Maham Haroon""]","[""Network Sparsity"", ""Machine Learning"", ""Initialization Pruning""]","Post-training dropout based approaches achieve high sparsity and are well established means of deciphering problems relating to computational cost and overfitting in Neural Network architectures. Contrastingly, pruning at initialization is still far behind. Initialization pruning is more efficacious when it comes to scaling computation cost of the network. Furthermore, it handles overfitting just as well as post training dropout. It is also averse to retraining losses. + +In approbation of the above reasons, the paper presents two approaches to prune at initialization. The goal is to achieve higher sparsity while preserving performance. 1) K-starts, begins with k random p-sparse matrices at initialization. In the first couple of epochs the network then determines the ""fittest"" of these p-sparse matrices in an attempt to find the ""lottery ticket"" p-sparse network. The approach is adopted from how evolutionary algorithms find the best individual. Depending on the Neural Network architecture, fitness criteria can be based on magnitude of network weights, magnitude of gradient accumulation over an epoch or a combination of both. 2) Dissipating gradients approach, aims at eliminating weights that remain within a fraction of their initial value during the first couple of epochs. Removing weights in this manner despite their magnitude best preserves performance of the network. Contrarily, the approach also takes the most epochs to achieve higher sparsity. 
3) Combination of dissipating gradients and kstarts outperforms either methods and random dropout consistently. + +The benefits of using the provided pertaining approaches are: 1) They do not require specific knowledge of the classification task, fixing of dropout threshold or regularization parameters 2) Retraining of the model is neither necessary nor affects the performance of the p-sparse network. + +We evaluate the efficacy of the said methods on Autoencoders and Fully Connected Multilayered Perceptrons. The datasets used are MNIST and Fashion MNIST.",/pdf/032a6594153b919b8bd489958cc98ea55d34510a.pdf,ICLR,2021,Provides methods to prune weights at initialization +SJg9z6VFDr,S1esP6sUDr,1569440000000.0,1577170000000.0,422,Ordinary differential equations on graph networks,"[""j.zhuang@yale.edu"", ""nicha.dvornek@yale.edu"", ""xiaoxiao.li@yale.edu"", ""james.duncan@yale.edu""]","[""Juntang Zhuang"", ""Nicha Dvornek"", ""Xiaoxiao Li"", ""James S. Duncan""]","[""Graph Networks"", ""Ordinary differential equation""]","Recently various neural networks have been proposed for irregularly structured data such as graphs and manifolds. To our knowledge, all existing graph networks have discrete depth. Inspired by neural ordinary differential equation (NODE) for data in the Euclidean domain, we extend the idea of continuous-depth models to graph data, and propose graph ordinary differential equation (GODE). The derivative of hidden node states are parameterized with a graph neural network, and the output states are the solution to this ordinary differential equation. We demonstrate two end-to-end methods for efficient training of GODE: (1) indirect back-propagation with the adjoint method; (2) direct back-propagation through the ODE solver, which accurately computes the gradient. We demonstrate that direct backprop outperforms the adjoint method in experiments. We then introduce a family of bijective blocks, which enables $\mathcal{O}(1)$ memory consumption. 
We demonstrate that GODE can be easily adapted to different existing graph neural networks and improve accuracy. We validate the performance of GODE in both semi-supervised node classification tasks and graph classification tasks. Our GODE model achieves a continuous model in time, memory efficiency, accurate gradient estimation, and generalizability with different graph networks.",/pdf/5c8a94186856d657aa0a2d8b80c095bf6ddcf282.pdf,ICLR,2020,Apply ordinary differential equation model on graph structured data +SZ3wtsXfzQR,obWgxZtCBn3N,1601310000000.0,1616810000000.0,3384,Theoretical bounds on estimation error for meta-learning,"[""~James_Lucas1"", ""~Mengye_Ren1"", ""~Irene_Raissa_KAMENI_KAMENI1"", ""~Toniann_Pitassi1"", ""~Richard_Zemel1""]","[""James Lucas"", ""Mengye Ren"", ""Irene Raissa KAMENI KAMENI"", ""Toniann Pitassi"", ""Richard Zemel""]","[""meta learning"", ""few-shot"", ""minimax risk"", ""lower bounds"", ""learning theory""]","Machine learning models have traditionally been developed under the assumption that the training and test distributions match exactly. However, recent success in few-shot learning and related problems are encouraging signs that these models can be adapted to more realistic settings where train and test distributions differ. Unfortunately, there is severely limited theoretical support for these algorithms and little is known about the difficulty of these problems. In this work, we provide novel information-theoretic lower-bounds on minimax rates of convergence for algorithms that are trained on data from multiple sources and tested on novel data. Our bounds depend intuitively on the information shared between sources of data, and characterize the difficulty of learning in this setting for arbitrary algorithms. 
We demonstrate these bounds on a hierarchical Bayesian model of meta-learning, computing both upper and lower bounds on parameter estimation via maximum-a-posteriori inference.",/pdf/f6e0a3923ea91b65f312ccf597276c427be18097.pdf,ICLR,2021,We prove novel minimax risk lower bounds and upper bounds for meta learners +nXSDybDWV3,ATp1HuqHgz0,1601310000000.0,1614990000000.0,1105,Einstein VI: General and Integrated Stein Variational Inference in NumPyro,"[""~Ahmad_Salim_Al-Sibahi1"", ""ola@di.ku.dk"", ""christophe.ley@ugent.be"", ""thamelry@binf.ku.dk""]","[""Ahmad Salim Al-Sibahi"", ""Ola R\u00f8nning"", ""Christophe Ley"", ""Thomas Wim Hamelryck""]","[""Stein variational inference"", ""variational inference"", ""probabilistic programming"", ""Pyro"", ""deep probabilistic programming"", ""deep learning""]","Stein Variational Inference is a technique for approximate Bayesian inference that is recently gaining popularity since it combines the scalability of traditional Variational Inference (VI) with the flexibility of non-parametric particle based inference methods. While there has been considerable progress in development of algorithms, integration in existing probabilistic programming languages (PPLs) with an easy-to-use interface is currently lacking. EinStein VI is a lightweight composable library that integrates the latest Stein Variational Inference methods with the NumPyro PPL. Inference with EinStein VI relies on ELBO-within-Stein to support use of custom inference programs (guides), non-linear scaling of repulsion force, second-order gradient updates using matrix-valued kernels and parameter transforms. We demonstrate the achieved synergy of the different Stein techniques and the versatility of EinStein VI library by applying it on examples. Compared to traditional Stochastic VI, EinStein VI is better at capturing uncertainty and representing richer posteriors. 
We use several applications to show how one can use Neural Transforms (NeuTra) and second-order optimization to provide better inference using EinStein VI. We show how EinStein VI can be used to infer the parameters of a Latent Dirichlet Allocation model with a neural guide. The results indicate that Einstein VI can be combined with NumPyro’s support for automatic marginalization to do inference over models with discrete latent variables. Finally, we introduce an example with a novel extension to Deep Markov Models, called the Stein Mixture Deep Markov Model (SM-DMM), which shows that EinStein VI can be scaled to reasonably large models with over 500,000 parameters.",/pdf/362f99d2de5cf06e043089f7a18cbf3345166104.pdf,ICLR,2021,"We present EinStein Variational Inference, a technique for inference that integrates all the latest developments within Stein VI into NumPyro, adds optimizable parameter transforms and supports ELBO optimization." +SygpC6Ntvr,Byg81GfdPr,1569440000000.0,1586720000000.0,873,Minimizing FLOPs to Learn Efficient Sparse Representations,"[""bparia@cs.cmu.edu"", ""cjyeh@cs.cmu.edu"", ""a061105@gmail.com"", ""ningxu01@gmail.com"", ""pradeepr@cs.cmu.edu"", ""bapoczos@cs.cmu.edu""]","[""Biswajit Paria"", ""Chih-Kuan Yeh"", ""Ian E.H. Yen"", ""Ning Xu"", ""Pradeep Ravikumar"", ""Barnab\u00e1s P\u00f3czos""]","[""sparse embeddings"", ""deep representations"", ""metric learning"", ""regularization""]","Deep representation learning has become one of the most widely adopted approaches for visual search, recommendation, and identification. Retrieval of such representations from a large database is however computationally challenging. Approximate methods based on learning compact representations, have been widely explored for this problem, such as locality sensitive hashing, product quantization, and PCA. 
In this work, in contrast to learning compact representations, we propose to learn high dimensional and sparse representations that have similar representational capacity as dense embeddings while being more efficient due to sparse matrix multiplication operations which can be much faster than dense multiplication. Following the key insight that the number of operations decreases quadratically with the sparsity of embeddings provided the non-zero entries are distributed uniformly across dimensions, we propose a novel approach to learn such distributed sparse embeddings via the use of a carefully constructed regularization function that directly minimizes a continuous relaxation of the number of floating-point operations (FLOPs) incurred during retrieval. Our experiments show that our approach is competitive to the other baselines and yields a similar or better speed-vs-accuracy tradeoff on practical datasets.",/pdf/a4828a25256072c99f7232b0756d6310ac4c4d02.pdf,ICLR,2020,"We propose an approach to learn sparse high dimensional representations that are fast to search, by incorporating a surrogate of the number of operations directly into the loss function." +HygsfnR9Ym,rJxDGe59K7,1538090000000.0,1550610000000.0,1299,Recall Traces: Backtracking Models for Efficient Reinforcement Learning,"[""anirudhgoyal9119@gmail.com"", ""philemon@google.com"", ""liam.fedus@gmail.com"", ""singhalsoumye@gmail.com"", ""countzero@google.com"", ""svlevine@eecs.berkeley.edu"", ""hugolarochelle@google.com"", ""yoshua.bengio@mila.quebec""]","[""Anirudh Goyal"", ""Philemon Brakel"", ""William Fedus"", ""Soumye Singhal"", ""Timothy Lillicrap"", ""Sergey Levine"", ""Hugo Larochelle"", ""Yoshua Bengio""]","[""Model free RL"", ""Variational Inference""]","In many environments only a tiny subset of all states yield high reward. In these cases, few of the interactions with the environment provide a relevant learning signal. 
Hence, we may want to preferentially train on those high-reward states and the probable trajectories leading to them. +To this end, we advocate for the use of a \textit{backtracking model} that predicts the preceding states that terminate at a given high-reward state. We can train a model which, starting from a high value state (or one that is estimated to have high value), predicts and samples which (state, action)-tuples may have led to that high value state. These traces of (state, action) pairs, which we refer to as Recall Traces, sampled from this backtracking model starting from a high value state, are informative as they terminate in good states, and hence we can use these traces to improve a policy. We provide a variational interpretation for this idea and a practical algorithm in which the backtracking model samples from an approximate posterior distribution over trajectories which lead to large rewards. Our method improves the sample efficiency of both on- and off-policy RL algorithms across several environments and tasks. ",/pdf/3c2f3116d77fa2ce0af93b56a95feb16e9b9d9a3.pdf,ICLR,2019,"A backward model of previous (state, action) given the next state, i.e. P(s_t, a_t | s_{t+1}), can be used to simulate additional trajectories terminating at states of interest! Improves RL learning efficiency." +rJehllrtDS,SJeJ7AyYvr,1569440000000.0,1577170000000.0,2114,Rethinking deep active learning: Using unlabeled data at model training,"[""oriane.simeoni@inria.fr"", ""mateusz.budnik@inria.fr"", ""yannis@avrithis.net"", ""guig@irisa.fr""]","[""Oriane Sim\u00e9oni"", ""Mateusz Budnik"", ""Yannis Avrithis"", ""Guillaume Gravier""]","[""active learning"", ""deep learning"", ""semi-supervised learning"", ""unsupervised feature learning""]","Active learning typically focuses on training a model on few labeled examples alone, while unlabeled ones are only used for acquisition. 
In this work we depart from this setting by using both labeled and unlabeled data during model training across active learning cycles. We do so by using unsupervised feature learning at the beginning of the active learning pipeline and semi-supervised learning at every active learning cycle, on all available data. The former has not been investigated before in active learning, while the study of the latter in the context of deep learning is scarce and recent findings are not conclusive with respect to its benefit. Our idea is orthogonal to acquisition strategies by using more data, much like ensemble methods use more models. By systematically evaluating on a number of popular acquisition strategies and datasets, we find that the use of unlabeled data during model training brings a spectacular accuracy improvement in image classification, compared to the differences between acquisition strategies. We thus explore smaller label budgets, even one label per class. ",/pdf/c9d2f97fdeae9be9a94f12a6b4dfaad4a6dc8038.pdf,ICLR,2020,"We revisit deep active learning by making use of the unlabeled data through unsupervised and semi-supervised learning, allowing us to drastically improve the results using the same annotation effort." +IDFQI9OY6K,q-kEimvzeAG,1601310000000.0,1611610000000.0,193,Interactive Weak Supervision: Learning Useful Heuristics for Data Labeling,"[""~Benedikt_Boecking1"", ""~Willie_Neiswanger2"", ""~Eric_Xing1"", ""~Artur_Dubrawski2""]","[""Benedikt Boecking"", ""Willie Neiswanger"", ""Eric Xing"", ""Artur Dubrawski""]","[""weak supervision"", ""data programming"", ""data labeling"", ""active learning""]","Obtaining large annotated datasets is critical for training successful machine learning models and it is often a bottleneck in practice. Weak supervision offers a promising alternative for producing labeled datasets without ground truth annotations by generating probabilistic labels using multiple noisy heuristics. 
This process can scale to large datasets and has demonstrated state of the art performance in diverse domains such as healthcare and e-commerce. One practical issue with learning from user-generated heuristics is that their creation requires creativity, foresight, and domain expertise from those who hand-craft them, a process which can be tedious and subjective. We develop the first framework for interactive weak supervision in which a method proposes heuristics and learns from user feedback given on each proposed heuristic. Our experiments demonstrate that only a small number of feedback iterations are needed to train models that achieve highly competitive test set performance without access to ground truth training labels. We conduct user studies, which show that users are able to effectively provide feedback on heuristics and that test set results track the performance of simulated oracles.",/pdf/7b8b0c300266ed08c27bfea5eaaa513189d0caab.pdf,ICLR,2021,We introduce a framework and method for training classifiers on datasets without ground truth annotation by interacting with domain experts to discover good weak supervision sources. +SkeJPertPS,BkldR9eYDB,1569440000000.0,1577170000000.0,2344,Collaborative Training of Balanced Random Forests for Open Set Domain Adaptation,"[""jongbin.ryu@gmail.com"", ""maybe@hanyang.ac.kr"", ""jlim@hanyang.ac.kr""]","[""Jongbin Ryu"", ""Jiun Bae"", ""Jongwoo Lim""]",[],"In this paper, we introduce a collaborative training algorithm of balanced random forests for domain adaptation tasks which can avoid the overfitting problem. In real scenarios, most domain adaptation algorithms face the challenges from noisy, insufficient training data. Moreover in open set categorization, unknown or misaligned source and target categories adds difficulty. In such cases, conventional methods suffer from overfitting and fail to successfully transfer the knowledge of the source to the target domain. 
To address these issues, the following two techniques are proposed. First, we introduce the optimized decision tree construction method, in which the data at each node are split into equal sizes while maximizing the information gain. Compared to the conventional random forests, it generates larger and more balanced decision trees due to the even-split constraint, which contributes to enhanced discrimination power and reduced overfitting. Second, to tackle the domain misalignment problem, we propose the domain alignment loss which penalizes uneven splits of the source and target domain data. By collaboratively optimizing the information gain of the labeled source data as well as the entropy of unlabeled target data distributions, the proposed CoBRF algorithm achieves significantly better performance than the state-of-the-art methods. The proposed algorithm is extensively evaluated in various experimental setups in challenging domain adaptation tasks with noisy and small training data as well as open set domain adaptation problems, for two backbone networks of AlexNet and ResNet-50.",/pdf/0b9cfb123ef467704ea8381ee6539b391ba14028.pdf,ICLR,2020, +3-a23gHXQmr,hbAVNuG5UQv,1601310000000.0,1614990000000.0,2182,Parametric Density Estimation with Uncertainty using Deep Ensembles,"[""~Abel_Peirson1"", ""thowell@stanford.edu"", ""mtirlea@stanford.edu""]","[""Abel Peirson"", ""Taylor Howell"", ""Marius Aurel Tirlea""]","[""Deep ensembles"", ""deep learning"", ""computer vision"", ""density estimation"", ""uncertainty""]","In parametric density estimation, the parameters of a known probability density are typically recovered from measurements by maximizing the log-likelihood. Prior knowledge of measurement uncertainties is not included in this method -- potentially producing degraded or even biased parameter estimates. +We propose an efficient two-step, general-purpose approach for parametric density estimation using deep ensembles. 
+Feature predictions and their uncertainties are returned by a deep ensemble and then combined in an importance weighted maximum likelihood estimation to recover parameters representing a known density along with their respective errors. To compare the bias-variance tradeoff of different approaches, we define an appropriate figure of merit. +We illustrate a number of use cases for our method in the physical sciences and demonstrate state-of-the-art results for X-ray polarimetry that outperform current classical and deep learning methods.",/pdf/ba13f3ea34ba2d31e265ce6ac2a382175ac01b7b.pdf,ICLR,2021,Deep ensemble predictive uncertainties can inform parametric density estimation in a number of applications. +B1vRTeqxg,,1478260000000.0,1484850000000.0,165,Learning Continuous Semantic Representations of Symbolic Expressions,"[""m.allamanis@ed.ac.uk"", ""pankajan.chanthirasegaran@ed.ac.uk"", ""pkohli@microsoft.com"", ""csutton@ed.ac.uk""]","[""Miltiadis Allamanis"", ""Pankajan Chanthirasegaran"", ""Pushmeet Kohli"", ""Charles Sutton""]","[""Deep learning""]","The question of how procedural knowledge is represented and inferred is a fundamental problem in machine learning and artificial intelligence. Recent work on program induction has proposed neural architectures, based on abstractions like stacks, Turing machines, and interpreters, that operate on abstract computational machines or on execution traces. But the recursive abstraction that is central to procedural knowledge is perhaps most naturally represented by symbolic representations that have syntactic structure, such as logical expressions and source code. Combining abstract, symbolic reasoning with continuous neural reasoning is a grand challenge of representation learning. As a step in this direction, we propose a new architecture, called neural equivalence networks, for the problem of learning continuous semantic representations of mathematical and logical expressions. 
These networks are trained to represent semantic equivalence, even of expressions that are syntactically very different. The challenge is that semantic representations must be computed in a syntax-directed manner, because semantics is compositional, but at the same time, small changes in syntax can lead to very large changes in semantics, which can be difficult for continuous neural architectures. We perform an exhaustive evaluation on the task of checking equivalence on a highly diverse class of symbolic algebraic and boolean expression types, showing that our model significantly outperforms existing architectures. +",/pdf/8cfedc863eec47c1c047f231b88712e77bd698ab.pdf,ICLR,2017,"Assign continuous vectors to logical and algebraic symbolic expressions in such a way that semantically equivalent, but syntactically diverse expressions are assigned to identical (or highly similar) continuous vectors." +ryxgJTEYDr,ryx2v1QSwH,1569440000000.0,1583910000000.0,288,Reinforcement Learning with Competitive Ensembles of Information-Constrained Primitives,"[""anirudhgoyal9119@gmail.com"", ""sshagunsodhani@gmail.com"", ""jbinas@gmail.com"", ""xbpeng@berkeley.edu"", ""svlevine@eecs.berkeley.edu"", ""yoshua.bengio@mila.quebec""]","[""Anirudh Goyal"", ""Shagun Sodhani"", ""Jonathan Binas"", ""Xue Bin Peng"", ""Sergey Levine"", ""Yoshua Bengio""]","[""Reinforcement Learning"", ""Variational Information Bottleneck"", ""Learning primitives""]","Reinforcement learning agents that operate in diverse and complex environments can benefit from the structured decomposition of their behavior. Often, this is addressed in the context of hierarchical reinforcement learning, where the aim is to decompose a policy into lower-level primitives or options, and a higher-level meta-policy that triggers the appropriate behaviors for a given situation. However, the meta-policy must still produce appropriate decisions in all states. 
+In this work, we propose a policy design that decomposes into primitives, similarly to hierarchical reinforcement learning, but without a high-level meta-policy. Instead, each primitive can decide for themselves whether they wish to act in the current state. +We use an information-theoretic mechanism for enabling this decentralized decision: each primitive chooses how much information it needs about the current state to make a decision and the primitive that requests the most information about the current state acts in the world. The primitives are regularized to use as little information as possible, which leads to natural competition and specialization. We experimentally demonstrate that this policy architecture improves over both flat and hierarchical policies in terms of generalization. ",/pdf/8ce5bb18a2df4be9d75fa4294af41e2cf8a4ab37.pdf,ICLR,2020,"Learning an implicit master policy, as a master policy in HRL can fail to generalize." +rJEyrjRqYX,HJxQXBo8tm,1538090000000.0,1545360000000.0,53,Reduced-Gate Convolutional LSTM Design Using Predictive Coding for Next-Frame Video Prediction,"[""nelly.elsayed5@gmail.com"", ""maida@louisiana.edu"", ""mab0778@louisiana.edu""]","[""Nelly Elsayed"", ""Anthony S. Maida"", ""Magdy Bayoumi""]","[""rgcLSTM"", ""convolutional LSTM"", ""unsupervised learning"", ""predictive coding"", ""video prediction"", ""moving MNIST"", ""KITTI datasets"", ""deep learning""]","Spatiotemporal sequence prediction is an important problem in deep learning. We +study next-frame video prediction using a deep-learning-based predictive coding +framework that uses convolutional, long short-term memory (convLSTM) modules. +We introduce a novel reduced-gate convolutional LSTM architecture. Our +reduced-gate model achieves better next-frame prediction accuracy than the original +convolutional LSTM while using a smaller parameter budget, thereby reducing +training time. 
We tested our reduced gate modules within a predictive coding architecture +on the moving MNIST and KITTI datasets. We found that our reduced-gate +model has a significant reduction of approximately 40 percent of the total +number of training parameters and training time in comparison with the standard +LSTM model which makes it attractive for hardware implementation especially +on small devices.",/pdf/b5c5837202d2f45bc52d02ef302a9dbf4ac5f77c.pdf,ICLR,2019,A novel reduced-gate convolutional LSTM design using predictive coding for next-frame video prediction +ByJ7obb0b,Hy0fsZb0W,1509130000000.0,1518730000000.0,710,Understanding and Exploiting the Low-Rank Structure of Deep Networks,"[""craig.bakker@pnnl.gov"", ""michael.j.henry@pnnl.gov"", ""nathan.hodas@pnnl.gov""]","[""Craig Bakker"", ""Michael J. Henry"", ""Nathan O. Hodas""]","[""Deep Learning"", ""Derivative Calculations"", ""Optimization Algorithms""]","Training methods for deep networks are primarily variants on stochastic gradient descent. Techniques that use (approximate) second-order information are rarely used because of the computational cost and noise associated with those approaches in deep learning contexts. However, in this paper, we show how feedforward deep networks exhibit a low-rank derivative structure. This low-rank structure makes it possible to use second-order information without needing approximations and without incurring a significantly greater computational cost than gradient descent. To demonstrate this capability, we implement Cubic Regularization (CR) on a feedforward deep network with stochastic gradient descent and two of its variants. There, we use CR to calculate learning rates on a per-iteration basis while training on the MNIST and CIFAR-10 datasets. CR proved particularly successful in escaping plateau regions of the objective function. We also found that this approach requires less problem-specific information (e.g. 
an optimal initial learning rate) than other first-order methods in order to perform well.",/pdf/fcf45f2e6a6e59e9053aca6e2bd5bef824327ea9.pdf,ICLR,2018,"We show that deep learning network derivatives have a low-rank structure, and this structure allows us to use second-order derivative information to calculate learning rates adaptively and in a computationally feasible manner." +Twf5rUVeU-I,RhWoe19nZ_,1601310000000.0,1614990000000.0,3090,Convergence Analysis of Homotopy-SGD for Non-Convex Optimization,"[""~Matilde_Gargiani1"", ""~Andrea_Zanelli1"", ""~Moritz_Diehl1"", ""~Quoc_Tran-Dinh2"", ""~Frank_Hutter1""]","[""Matilde Gargiani"", ""Andrea Zanelli"", ""Moritz Diehl"", ""Quoc Tran-Dinh"", ""Frank Hutter""]","[""deep learning"", ""numerical optimization"", ""transfer learning""]","First-order stochastic methods for solving large-scale non-convex optimization problems are widely used in many big-data applications, e.g. training deep neural networks as well as other complex and potentially non-convex machine learning +models. Their inexpensive iterations generally come together with slow global convergence rate (mostly sublinear), leading to the necessity of carrying out a very high number of iterations before the iterates reach a neighborhood of a minimizer. In this work, we present a first-order stochastic algorithm based on a combination of homotopy methods and SGD, called Homotopy-Stochastic Gradient Descent (H-SGD), which finds interesting connections with some proposed heuristics in the literature, e.g. optimization by Gaussian continuation, training by diffusion, mollifying networks. Under some mild and realistic assumptions on the problem structure, we conduct a theoretical analysis of the proposed algorithm. Our analysis shows that, with a specifically designed scheme for the homotopy parameter, H-SGD enjoys a global linear rate of convergence to a neighborhood of a minimizer while maintaining fast and inexpensive iterations. 
Experimental evaluations confirm the theoretical results and show that H-SGD can outperform standard SGD.",/pdf/504bc7be5d81dca691f4a3d468d3ac696c4b60e0.pdf,ICLR,2021,"In this work, we present and study both theoretically and empirically a novel first-order stochastic algorithm based on a combination of homotopy methods and SGD, called Homotopy-Stochastic Gradient Descent (H-SGD)." +S1x63TEYvr,BkeTrIguwH,1569440000000.0,1577170000000.0,798,Latent Question Reformulation and Information Accumulation for Multi-Hop Machine Reading,"[""quentin.grail@naverlabs.com"", ""julien.perez@naverlabs.com"", ""eric.gaussier@imag.fr""]","[""Quentin Grail"", ""Julien Perez"", ""Eric Gaussier""]","[""question-answering"", ""machine comprehension"", ""deep learning""]","Multi-hop text-based question-answering is a current challenge in machine comprehension. +This task requires to sequentially integrate facts from multiple passages to answer complex natural language questions. +In this paper, we propose a novel architecture, called the Latent Question Reformulation Network (LQR-net), a multi-hop and parallel attentive network designed for question-answering tasks that require reasoning capabilities. +LQR-net is composed of an association of \textbf{reading modules} and \textbf{reformulation modules}. +The purpose of the reading module is to produce a question-aware representation of the document. +From this document representation, the reformulation module extracts essential elements to calculate an updated representation of the question. +This updated question is then passed to the following hop. +We evaluate our architecture on the \hotpotqa question-answering dataset designed to assess multi-hop reasoning capabilities. +Our model achieves competitive results on the public leaderboard and outperforms the best current \textit{published} models in terms of Exact Match (EM) and $F_1$ score. 
+Finally, we show that an analysis of the sequential reformulations can provide interpretable reasoning paths.",/pdf/cc1c1dbccaa76a14f138d704a3f59b8261d386ca.pdf,ICLR,2020,"In this paper, we propose the Latent Question Reformulation Network (LQR-net), a multi-hop and parallel attentive network designed for question-answering tasks that require reasoning capabilities." +HkeKVh05Fm,BJlHnIc5KX,1538090000000.0,1545360000000.0,1468,Multi-Grained Entity Proposal Network for Named Entity Recognition,"[""cxia8@uic.edu"", ""czhang99@uic.edu"", ""tytaoyang@tencent.com"", ""yaliangli@tencent.com"", ""ndu@tencent.com"", ""kevinxwu@tencent.com"", ""davidwfan@tencent.com"", ""fenglong@buffalo.edu"", ""psyu@uic.edu""]","[""Congying Xia"", ""Chenwei Zhang"", ""Tao Yang"", ""Yaliang Li"", ""Nan Du"", ""Xian Wu"", ""Wei Fan"", ""Fenglong Ma"", ""Philip S. Yu""]",[],"In this paper, we focus on a new Named Entity Recognition (NER) task, i.e., the Multi-grained NER task. This task aims to simultaneously detect both fine-grained and coarse-grained entities in sentences. Correspondingly, we develop a novel Multi-grained Entity Proposal Network (MGEPN). Different from traditional NER models which regard NER as a sequential labeling task, MGEPN provides a new method that proposes entity candidates in the Proposal Network and classifies entities into different categories in the Classification Network. All possible entity candidates including fine-grained ones and coarse-grained ones are proposed in the Proposal Network, which enables the MGEPN model to identify multi-grained entities. In order to better identify named entities and determine their categories, context information is utilized and transferred from the Proposal Network to the Classification Network during the learning process. A novel Entity-Context attention mechanism is also introduced to help the model focus on entity-related context information. 
Experiments show that our model can obtain state-of-the-art performance on two real-world datasets for both the Multi-grained NER task and the traditional NER task.",/pdf/145b2323d510f7d10651be630c364ef01f25eb0e.pdf,ICLR,2019, +HJG0ojCcFm,SJeehO29KX,1538090000000.0,1545360000000.0,668,Negotiating Team Formation Using Deep Reinforcement Learning,"[""yorambac@google.com"", ""reverett@google.com"", ""edwardhughes@google.com"", ""jzl@google.com"", ""angeliki@google.com"", ""lanctot@google.com"", ""mjohanson@google.com"", ""lejlot@google.com"", ""thore@google.com""]","[""Yoram Bachrach"", ""Richard Everett"", ""Edward Hughes"", ""Angeliki Lazaridou"", ""Joel Leibo"", ""Marc Lanctot"", ""Mike Johanson"", ""Wojtek Czarnecki"", ""Thore Graepel""]","[""Reinforcement Learning"", ""Negotiation"", ""Team Formation"", ""Cooperative Game Theory"", ""Shapley Value""]","When autonomous agents interact in the same environment, they must often cooperate to achieve their goals. One way for agents to cooperate effectively is to form a team, make a binding agreement on a joint plan, and execute it. However, when agents are self-interested, the gains from team formation must be allocated appropriately to incentivize agreement. Various approaches for multi-agent negotiation have been proposed, but typically only work for particular negotiation protocols. More general methods usually require human input or domain-specific data, and so do not scale. To address this, we propose a framework for training agents to negotiate and form teams using deep reinforcement learning. Importantly, our method makes no assumptions about the specific negotiation protocol, and is instead completely experience driven. We evaluate our approach on both non-spatial and spatially extended team-formation negotiation environments, demonstrating that our agents beat hand-crafted bots and reach negotiation outcomes consistent with fair solutions predicted by cooperative game theory. 
Additionally, we investigate how the physical location of agents influences negotiation outcomes.",/pdf/065174ca86712eb29f1ee4bc7a7c315a9a8f8cfe.pdf,ICLR,2019,Reinforcement learning can be used to train agents to negotiate team formation across many negotiation protocols +OcTUl1kc_00,0HCvHiiil3L,1601310000000.0,1614990000000.0,848,Are Graph Convolutional Networks Fully Exploiting the Graph Structure?,"[""~Davide_Buffelli1"", ""~Fabio_Vandin2""]","[""Davide Buffelli"", ""Fabio Vandin""]","[""Graph Representation Learning"", ""Graph Neural Networks"", ""Random Walks""]","Graph Convolutional Networks (GCNs) represent the state-of-the-art for many graph related tasks. At every layer, GCNs rely on the graph structure to define an aggregation strategy where each node updates its representation by combining information from its neighbours. A known limitation of GCNs is their inability to infer long-range dependencies. In fact, as the number of layers increases, information gets smoothed and node embeddings become indistinguishable, negatively affecting performance. In this paper we formalize four levels of injection of graph structural information, and use them to analyze the importance of long-range dependencies. We then propose a novel regularization technique based on random walks with restart, called RWRReg, which encourages the network to encode long-range information into node embeddings. RWRReg does not require additional operations at inference time, is model-agnostic, and is further supported by our theoretical analysis connecting it to the Weisfeiler-Leman algorithm. 
Our experimental analysis, on both transductive and inductive tasks, shows that the lack of long-range structural information greatly affects the performance of state-of-the-art models, and that the long-range information exploited by RWRReg leads to an average accuracy improvement of more than $5\%$ on all considered tasks.",/pdf/fb0703957a11ffdd1e24447e54f88bef43e3c126.pdf,ICLR,2021, +wC99I7uIFe,vKPCtf7jak6,1601310000000.0,1614990000000.0,1392,D2p-fed:Differentially Private Federated Learning with Efficient Communication,"[""~Lun_Wang1"", ""ruoxijia@vt.edu"", ""~Dawn_Song1""]","[""Lun Wang"", ""Ruoxi Jia"", ""Dawn Song""]","[""Differential Privacy"", ""Federated Learning"", ""Communication Efficiency""]","In this paper, we propose the discrete Gaussian based differentially private federated learning (D2p-fed), a unified scheme to achieve both differential privacy (DP) and communication efficiency in federated learning (FL). In particular, compared with the only prior work taking care of both aspects, D2p-fed provides stronger privacy guarantee, better composability and smaller communication cost. The key idea is to apply the discrete Gaussian noise to the private data transmission. We provide complete analysis of the privacy guarantee, communication cost and convergence rate of D2p-fed. We evaluated D2p-fed on INFIMNIST and CIFAR10. The results show that D2p-fed outperforms the-state-of-the-art by 4.7% to 13.0% in terms of model accuracy while saving one third of the communication cost. The code for evaluation is available in the supplementary material.",/pdf/5a376a722796864e18ea349aa92e2c8cb378aa70.pdf,ICLR,2021,"We propose D2p-fed, a differentially private federated learning protocol with efficient communication." 
+_i3ASPp12WS,1BW2HZC926,1601310000000.0,1615710000000.0,2704,Online Adversarial Purification based on Self-supervised Learning,"[""~Changhao_Shi1"", ""~Chester_Holtz1"", ""~Gal_Mishne1""]","[""Changhao Shi"", ""Chester Holtz"", ""Gal Mishne""]","[""Adversarial Robustness"", ""Self-Supervised Learning""]","Deep neural networks are known to be vulnerable to adversarial examples, where a perturbation in the input space leads to an amplified shift in the latent network representation. In this paper, we combine canonical supervised learning with self-supervised representation learning, and present Self-supervised Online Adversarial Purification (SOAP), a novel defense strategy that uses a self-supervised loss to purify adversarial examples at test-time. Our approach leverages the label-independent nature of self-supervised signals and counters the adversarial perturbation with respect to the self-supervised tasks. SOAP yields competitive robust accuracy against state-of-the-art adversarial training and purification methods, with considerably less training complexity. In addition, our approach is robust even when adversaries are given the knowledge of the purification defense strategy. To the best of our knowledge, our paper is the first that generalizes the idea of using self-supervised signals to perform online test-time purification.",/pdf/c72b2431912d9433eb862f7ff1c59d589191b939.pdf,ICLR,2021, +ByToKu9ll,,1478290000000.0,1483770000000.0,460,Evaluation of Defensive Methods for DNNs against Multiple Adversarial Evasion Models,"[""jungyhuk@gmail.com"", ""bbbli@umich.edu"", ""yevgeniy.vorobeychik@vanderbilt.edu""]","[""Xinyun Chen"", ""Bo Li"", ""Yevgeniy Vorobeychik""]","[""Deep learning""]","Due to deep cascades of nonlinear units, deep neural networks (DNNs) can automatically learn non-local generalization priors from data and have achieved high performance in various applications. 
+However, such properties have also opened a door for adversaries to generate the so-called adversarial examples to fool DNNs. Specifically, adversaries can inject small perturbations to the input data and therefore decrease the performance of deep neural networks significantly. +Even worse, these adversarial examples have the transferability to attack a black-box model based on finite queries without knowledge of the target model. +Therefore, we aim to empirically compare different defensive strategies against various adversary models and analyze the cross-model efficiency for these robust learners. We conclude that the adversarial retraining framework also has the transferability, which can defend adversarial examples without requiring prior knowledge of the adversary models. +We compare the general adversarial retraining framework with the state-of-the-art robust deep neural networks, such as distillation, autoencoder stacked with classifier (AEC), and our improved version, IAEC, to evaluate their robustness as well as the vulnerability in terms of the distortion required to mislead the learner. +Our experimental results show that the adversarial retraining framework can defend most of the adversarial examples notably and consistently without adding additional +vulnerabilities or performance penalty to the original model.",/pdf/5ffc265e88c70d978a04bdfa570cb41fc25e89e8.pdf,ICLR,2017,robust adversarial retraining +l-PrrQrK0QR,6vpyNoA5b6V9,1601310000000.0,1615840000000.0,2753,Dataset Meta-Learning from Kernel Ridge-Regression,"[""~Timothy_Nguyen1"", ""~Zhourong_Chen3"", ""~Jaehoon_Lee2""]","[""Timothy Nguyen"", ""Zhourong Chen"", ""Jaehoon Lee""]","[""dataset distillation"", ""dataset compression"", ""meta-learning"", ""kernel-ridge regression"", ""neural kernels"", ""infinite-width networks"", ""dataset corruption""]","One of the most fundamental aspects of any machine learning algorithm is the training data used by the algorithm. 
+We introduce the novel concept of $\epsilon$-approximation of datasets, obtaining datasets which are much smaller than or are significant corruptions of the original training data while maintaining similar performance. We introduce a meta-learning algorithm Kernel Inducing Points (KIP) for obtaining such remarkable datasets, drawing inspiration from recent developments in the correspondence between infinitely-wide neural networks and kernel ridge-regression (KRR). For KRR tasks, we demonstrate that KIP can compress datasets by one or two orders of magnitude, significantly improving previous dataset distillation and subset selection methods while obtaining state of the art results for MNIST and CIFAR10 classification. Furthermore, our KIP-learned datasets are transferable to the training of finite-width neural networks even beyond the lazy-training regime. Consequently, we obtain state of the art results for neural network dataset distillation with potential applications to privacy-preservation.",/pdf/e5bd67ca9948951b21c82c12b69280270a7bfe71.pdf,ICLR,2021,"We introduce a meta-learning approach to distilling datasets, achieving state of the art performance for kernel-ridge regression and neural networks." +r1R5Z19le,,1478250000000.0,1491120000000.0,158,Semi-supervised deep learning by metric embedding,"[""ehoffer@tx.technion.ac.il"", ""nailon@cs.technion.ac.il""]","[""Elad Hoffer"", ""Nir Ailon""]","[""Deep learning"", ""Semi-Supervised Learning""]","Deep networks are successfully used as classification models yielding state-of-the-art results when trained on a large number of labeled samples. These models, however, are usually much less suited for semi-supervised problems because of their tendency to overfit easily when trained on small amounts of data. In this work we will explore a new training objective that is targeting a semi-supervised regime with only a small subset of labeled data. 
This criterion is based on a deep metric embedding over distance relations within the set of labeled samples, together with constraints over the embeddings of the unlabeled set. The final learned representations are discriminative in euclidean space, and hence can be used with subsequent nearest-neighbor classification using the labeled samples.",/pdf/7eac89ab912b7990b401913de9106ef5e6b257ee.pdf,ICLR,2017, +5rc0K0ezhqI,kryG5oOhrVA,1601310000000.0,1614990000000.0,3552,Unpacking Information Bottlenecks: Surrogate Objectives for Deep Learning,"[""~Andreas_Kirsch1"", ""~Clare_Lyle1"", ""~Yarin_Gal1""]","[""Andreas Kirsch"", ""Clare Lyle"", ""Yarin Gal""]","[""deep learning"", ""information bottleneck"", ""information theory""]","The Information Bottleneck principle offers both a mechanism to explain how deep neural networks train and generalize, as well as a regularized objective with which to train models. However, multiple competing objectives are proposed in the literature, and the information-theoretic quantities used in these objectives are difficult to compute for large deep neural networks, which in turn limits their use as a training objective. In this work, we review these quantities, compare and unify previously proposed objectives, which allows us to develop surrogate objectives more friendly to optimization without relying on cumbersome tools such as density estimation. We find that these surrogate objectives allow us to apply the information bottleneck to modern neural network architectures. 
We demonstrate our insights on MNIST, CIFAR-10 and ImageNette with modern DNN architectures (ResNets).",/pdf/3f22bd3a33699e4ef49ff0ebaee825f3c9d07913.pdf,ICLR,2021, +H92-E4kFwbR,Oha9LwCMqTB,1601310000000.0,1614990000000.0,3766,Composite Adversarial Training for Multiple Adversarial Perturbations and Beyond,"[""~Xinyang_Zhang5"", ""zxz147@psu.edu"", ""~Ting_Wang1""]","[""Xinyang Zhang"", ""Zheng Zhang"", ""Ting Wang""]","[""adversarial examples"", ""deep learning"", ""robustness""]","One intriguing property of deep neural networks (DNNs) is their vulnerability to adversarial perturbations. Despite the plethora of work on defending against individual perturbation models, improving DNN robustness against the combinations of multiple perturbations is still fairly under-studied. In this paper, we propose \underline{c}omposite \underline{a}dversarial \underline{t}raining (CAT), a novel training method that flexibly integrates and optimizes multiple adversarial losses, leading to significant robustness improvement with respect to individual perturbations as well as their ``compositions''. Through empirical evaluation on benchmark datasets and models, we show that CAT outperforms existing adversarial training methods by large margins in defending against the compositions of pixel perturbations and spatial transformations, two major classes of adversarial perturbation models, while incurring limited impact on clean inputs.",/pdf/79c82a8ddd33e753d73b0e0d0bd874a7db0fa8dc.pdf,ICLR,2021,"A new adversarial training framework for multiple adversarial perturbations and a new ""composite"" adversary." 
+Hkex2a4FPr,ByxrWnJdwS,1569440000000.0,1577170000000.0,768,On Variational Learning of Controllable Representations for Text without Supervision,"[""pxu4@ualberta.ca"", ""yanshuaicao@gmail.com"", ""jcheung@cs.mcgill.ca""]","[""Peng Xu"", ""Yanshuai Cao"", ""Jackie Chi Kit Cheung""]","[""sequence variational autoencoders"", ""unsupervised learning"", ""controllable text generation"", ""text style transfer""]","The variational autoencoder (VAE) has found success in modelling the manifold of natural images on certain datasets, allowing meaningful images to be generated while interpolating or extrapolating in the latent code space, but it is unclear whether similar capabilities are feasible for text considering its discrete nature. In this work, we investigate the reason why unsupervised learning of controllable representations fails for text. We find that traditional sequence VAEs can learn disentangled representations through their latent codes to some extent, but they often fail to properly decode when the latent factor is being manipulated, because the manipulated codes often land in holes or vacant regions in the aggregated posterior latent space, which the decoding network is not trained to process. Both as a validation of the explanation and as a fix to the problem, we propose to constrain the posterior mean to a learned probability simplex, and performs manipulation within this simplex. Our proposed method mitigates the latent vacancy problem and achieves the first success in unsupervised learning of controllable representations for text. Empirically, our method significantly outperforms unsupervised baselines and is competitive with strong supervised approaches on text style transfer. Furthermore, when switching the latent factor (e.g., topic) during a long sentence generation, our proposed framework can often complete the sentence in a seemingly natural way -- a capability that has never been attempted by previous methods. 
",/pdf/2a25fbea3d6cc82f80c7438feb5452c1513c21f4.pdf,ICLR,2020,"why previous VAEs on text cannot learn controllable latent representation as on images, as well as a fix to enable the first success towards controlled text generation without supervision" +r1VGvBcxl,,1478280000000.0,1488480000000.0,255,Reinforcement Learning through Asynchronous Advantage Actor-Critic on a GPU,"[""mb2@uiuc.edu"", ""ifrosio@nvidia.com"", ""styree@nvidia.com"", ""jclemons@nvidia.com"", ""jkautz@nvidia.com""]","[""Mohammad Babaeizadeh"", ""Iuri Frosio"", ""Stephen Tyree"", ""Jason Clemons"", ""Jan Kautz""]","[""Reinforcement Learning""]","We introduce a hybrid CPU/GPU version of the Asynchronous Advantage Actor-Critic (A3C) algorithm, currently the state-of-the-art method in reinforcement learning for various gaming tasks. We analyze its computational traits and concentrate on aspects critical to leveraging the GPU's computational power. We introduce a system of queues and a dynamic scheduling strategy, potentially helpful for other asynchronous algorithms as well. Our hybrid CPU/GPU version of A3C, based on TensorFlow, achieves a significant speed up compared to a CPU implementation; we make it publicly available to other researchers at https://github.com/NVlabs/GA3C.",/pdf/0dbb876b6cb8698603dbc236fb8bdac201f323f4.pdf,ICLR,2017,Implementation and analysis of the computational aspect of a GPU version of the Asynchronous Advantage Actor-Critic (A3C) algorithm +6c6KZUdm1Nq,Tf1GFXEzz8L,1601310000000.0,1614990000000.0,3505,Regression from Upper One-side Labeled Data,"[""~Takayuki_Katsuki2""]","[""Takayuki Katsuki""]","[""regression"", ""weakly-supervised learning"", ""healthcare""]","We address a regression problem from weakly labeled data that are correctly labeled only above a regression line, i.e., upper one-side labeled data. +The label values of the data are the results of sensing the magnitude of some phenomenon. 
+In this case, the labels often contain missing or incomplete observations whose values are lower than those of correct observations and are also usually lower than the regression line. It follows that data labeled with lower values than the estimations of a regression function (lower-side data) are mixed with data that should originally be labeled above the regression line (upper-side data). +When such missing label observations are observed in a non-negligible amount, we thus should assume our lower-side data to be unlabeled data that are a mix of original upper- and lower-side data. +We formulate a regression problem from these upper-side labeled and lower-side unlabeled data. We then derive a learning algorithm in an unbiased and consistent manner to ordinary regression that is learned from data labeled correctly in both upper- and lower-side cases. Our key idea is that we can derive a gradient that requires only upper-side data and unlabeled data as the equivalent expression of that for ordinary regression. We additionally found that a specific class of losses enables us to learn unbiased solutions practically. In numerical experiments on synthetic and real-world datasets, we demonstrate the advantages of our algorithm.",/pdf/28e8bccb94f4fe91a59d906c56db4c89749e420a.pdf,ICLR,2021, +HylwpREtDr,S1eEmGq_wH,1569440000000.0,1577170000000.0,1398,Active Learning Graph Neural Networks via Node Feature Propagation,"[""yuexinw@andrew.cmu.edu"", ""yichongx@cs.cmu.edu"", ""aarti@cs.cmu.edu"", ""awd@cs.cmu.edu"", ""yiming@cs.cmu.edu""]","[""Yuexin Wu"", ""Yichong Xu"", ""Aarti Singh"", ""Artur Dubrawski"", ""Yiming Yang""]","[""Graph Learning"", ""Active Learning""]","Graph Neural Networks (GNNs) for prediction tasks like node classification or edge prediction have received increasing attention in recent machine learning from graphically structured data. However, a large quantity of labeled graphs is difficult to obtain, which significantly limit the true success of GNNs. 
Although active learning has been widely studied for addressing label-sparse issues with other data types like text, images, etc., how to make it effective over graphs is an open question for research. In this paper, we present an investigation on active learning with GNNs for node classification tasks. Specifically, we propose a new method, which uses node feature propagation followed by K-Medoids clustering of the nodes for instance selection in active learning. With a theoretical bound analysis we justify the design choice of our approach. In our experiments on four benchmark datasets, the proposed method outperforms other representative baseline methods consistently and significantly.",/pdf/1a056637dd35c7944cfe240dba93469ee8cd15ec.pdf,ICLR,2020,This paper introduces a clustering-based active learning algorithm on graphs. +SJNDWNOlg,,1478140000000.0,1478140000000.0,55,What Is the Best Practice for CNNs Applied to Visual Instance Retrieval?,"[""jiedong.hao@cripac.ia.ac.cn"", ""jdong@nlpr.ia.ac.cn"", ""wwang@nlpr.ia.ac.cn"", ""tnt@nlpr.ia.ac.cn""]","[""Jiedong Hao"", ""Jing Dong"", ""Wei Wang"", ""Tieniu Tan""]","[""Computer vision"", ""Deep learning""]","Previous work has shown that feature maps of deep convolutional neural networks (CNNs) +can be interpreted as feature representation of a particular image region. Features aggregated from +these feature maps have been exploited for image retrieval tasks and achieved state-of-the-art performances in +recent years. The key to the success of such methods is the feature representation. However, the different +factors that impact the effectiveness of features are still not explored thoroughly. There is much less +discussion about the best combination of them. + +The main contribution of our paper is a thorough evaluation of the various factors that affect the +discriminative ability of the features extracted from CNNs. 
Based on the evaluation results, we also identify +the best choices for different factors and propose a new multi-scale image feature representation method to +encode the image effectively. Finally, we show that the proposed method generalises well and outperforms +the state-of-the-art methods on four typical datasets used for visual instance retrieval.",/pdf/a7ed76c871eabe3940c81677382ff3c612068f5b.pdf,ICLR,2017, +B1lxV6NFPH,ryl7XT-vwB,1569440000000.0,1577170000000.0,474,BANANAS: Bayesian Optimization with Neural Networks for Neural Architecture Search,"[""crwhite@cs.cmu.edu"", ""willie@cs.cmu.edu"", ""yash@realityengines.ai""]","[""Colin White"", ""Willie Neiswanger"", ""Yash Savani""]","[""neural architecture search"", ""Bayesian optimization""]","Neural Architecture Search (NAS) has seen an explosion of research in the past few years. A variety of methods have been proposed to perform NAS, including reinforcement learning, Bayesian optimization with a Gaussian process model, evolutionary search, and gradient descent. In this work, we design a NAS algorithm that performs Bayesian optimization using a neural network model. + +We develop a path-based encoding scheme to featurize the neural architectures that are used to train the neural network model. This strategy is particularly effective for encoding architectures in cell-based search spaces. After training on just 200 random neural architectures, we are able to predict the validation accuracy of a new architecture to within one percent of its true accuracy on average. This may be of independent interest beyond Bayesian neural architecture search. + +We test our algorithm on the NASBench dataset (Ying et al. 2019), and show that our algorithm significantly outperforms other NAS methods including evolutionary search, reinforcement learning, and AlphaX (Wang et al. 2019). Our algorithm is over 100x more efficient than random search, and 3.8x more efficient than the next-best algorithm. 
We also test our algorithm on the search space used in DARTS (Liu et al. 2018), and show that our algorithm is competitive with state-of-the-art NAS algorithms on this search space.",/pdf/487a6256ad818b3bd5ded80fa530e92488af5a69.pdf,ICLR,2020,"We design a NAS algorithm that performs Bayesian optimization using a neural network model, which takes advantage of a novel way to featurize neural architectures, and it performs very well on multiple search spaces." +0z1HScLBEpb,LxJonWAJ6g,1601310000000.0,1614990000000.0,430,UneVEn: Universal Value Exploration for Multi-Agent Reinforcement Learning,"[""~Tarun_Gupta3"", ""~Anuj_Mahajan1"", ""~Bei_Peng2"", ""~Wendelin_Boehmer1"", ""~Shimon_Whiteson1""]","[""Tarun Gupta"", ""Anuj Mahajan"", ""Bei Peng"", ""Wendelin Boehmer"", ""Shimon Whiteson""]","[""multi-agent reinforcement learning"", ""deep Q-learning"", ""universal value functions"", ""successor features"", ""relative overgeneralization""]","This paper focuses on cooperative value-based multi-agent reinforcement learning (MARL) in the paradigm of centralized training with decentralized execution (CTDE). Current state-of-the-art value-based MARL methods leverage CTDE to learn a centralized joint-action value function as a monotonic mixing of each agent's utility function, which enables easy decentralization. However, this monotonic restriction leads to inefficient exploration in tasks with nonmonotonic returns due to suboptimal approximations of the values of joint actions. To address this, we present a novel MARL approach called Universal Value Exploration (UneVEn), which uses universal successor features (USFs) to learn policies of tasks related to the target task, but with simpler reward functions in a sample efficient manner. UneVEn uses novel action-selection schemes between randomly sampled related tasks during exploration, which enables the monotonic joint-action value function of the target task to place more importance on useful joint actions. 
Empirical results on a challenging cooperative predator-prey task requiring significant coordination amongst agents show that UneVEn significantly outperforms state-of-the-art baselines.",/pdf/49bea0f98bdfadd0db694855c5dde3a58062c566.pdf,ICLR,2021,We propose universal value exploration (UneVEn) for multi-agent reinforcement learning (MARL) to address the suboptimal approximations of employed monotonic joint-action value function in current SOTA value-based MARL methods on non-monotonic tasks. +_77KiX2VIEg,-8YL6LCIg8n,1601310000000.0,1614990000000.0,3500,On the Effectiveness of Deep Ensembles for Small Data Tasks,"[""~Lorenzo_Brigato1"", ""~Luca_Iocchi1""]","[""Lorenzo Brigato"", ""Luca Iocchi""]","[""small data"", ""deep learning"", ""ensembles"", ""classification""]","Deep neural networks represent the gold standard for image classification. +However, they usually need large amounts of data to reach superior performance. +In this work, we focus on image classification problems with a few labeled examples per class and improve sample efficiency in the low data regime by using an ensemble of relatively small deep networks. +For the first time, our work broadly studies the existing concept of neural ensembling in small data domains, through an extensive validation using popular datasets and architectures. +We show that deep ensembling is a simple yet effective technique that outperforms current state-of-the-art approaches for learning from small datasets. +We compare different ensemble configurations to their deeper and wider competitors given a total fixed computational budget and provide empirical evidence of their advantage. 
+Furthermore, we investigate the effectiveness of different losses and show that their choice should be made considering different factors.",/pdf/9b6abfc0e32c7d3dc8a675aa71f791e2df11704a.pdf,ICLR,2021,Deep ensembles of relatively small deep networks improve the state of the art of image classification problems in small data regimes +S1lg0jAcYm,H1xy9x29F7,1538090000000.0,1548980000000.0,862,ARM: Augment-REINFORCE-Merge Gradient for Stochastic Binary Networks,"[""mzyin@utexas.edu"", ""mingyuan.zhou@mccombs.utexas.edu""]","[""Mingzhang Yin"", ""Mingyuan Zhou""]","[""Antithetic sampling"", ""variable augmentation"", ""deep discrete latent variable models"", ""variance reduction"", ""variational auto-encoder""]","To backpropagate the gradients through stochastic binary layers, we propose the augment-REINFORCE-merge (ARM) estimator that is unbiased, exhibits low variance, and has low computational complexity. Exploiting variable augmentation, REINFORCE, and reparameterization, the ARM estimator achieves adaptive variance reduction for Monte Carlo integration by merging two expectations via common random numbers. The variance-reduction mechanism of the ARM estimator can also be attributed to either antithetic sampling in an augmented space, or the use of an optimal anti-symmetric ""self-control"" baseline function together with the REINFORCE estimator in that augmented space. Experimental results show the ARM estimator provides state-of-the-art performance in auto-encoding variational inference and maximum likelihood estimation, for discrete latent variable models with one or multiple stochastic binary layers. 
Python code for reproducible research is publicly available.",/pdf/5e88b16c21ed525f101915dc79067b994bb8f958.pdf,ICLR,2019,An unbiased and low-variance gradient estimator for discrete latent variable models +HkJ1rgbCb,HJ00ExZAZ,1509130000000.0,1518730000000.0,578,Using Deep Reinforcement Learning to Generate Rationales for Molecules,"[""bensonc@mit.edu"", ""ccoley@mit.edu"", ""regina@csail.mit.edu"", ""tommi@csail.mit.edu""]","[""Benson Chen"", ""Connor Coley"", ""Regina Barzilay"", ""Tommi Jaakkola""]","[""Reinforcement Learning"", ""Chemistry"", ""Interpretable Models""]","Deep learning algorithms are increasingly used in modeling chemical processes. However, black box predictions without rationales have limited use in practical applications, such as drug design. To this end, we learn to identify molecular substructures -- rationales -- that are associated with the target chemical property (e.g., toxicity). The rationales are learned in an unsupervised fashion, requiring no additional information beyond the end-to-end task. We formulate this problem as a reinforcement learning problem over the molecular graph, parametrized by two convolution networks corresponding to the rationale selection and prediction based on it, where the latter induces the reward function. We evaluate the approach on two benchmark toxicity datasets. We demonstrate that our model sustains high performance under the additional constraint that predictions strictly follow the rationales. Additionally, we validate the extracted rationales through comparison against those described in chemical literature and through synthetic experiments. ",/pdf/16b1214dab93b299062351e1d98257d46f9263a8.pdf,ICLR,2018,We use reinforcement learning over molecular graphs to generate rationales for interpretable molecular property prediction. 
+#NAME?,wNCFIPeiAcU,1601310000000.0,1614990000000.0,2573,Interpretable Meta-Reinforcement Learning with Actor-Critic Method,"[""~Xingyuan_Liang1"", ""liuxy@seu.edu.cn""]","[""Xingyuan Liang"", ""Xu-Ying Liu""]","[""meta-reinforcement learning"", ""actor-critic"", ""deep learning"", ""interpretable""]","Meta-reinforcement learning (meta-RL) algorithms have successfully trained agent systems to perform well on different tasks within only few updates. However, in gradient-based meta-RL algorithms, the Q-function at adaptation step is mainly estimated by the return of few trajectories, which can lead to high variance in Q-value and biased meta-gradient estimation, and the adaptation uses a large number of batched trajectories. To address these challenges, we propose a new meta-RL algorithm that can reduce the variance and bias of the meta-gradient estimation and perform few-shot task data sampling, which makes the meta-policy more interpretable. We reformulate the meta-RL objective, and introduce contextual Q-function as a meta-policy critic during task adaptation step and learn the Q-function under a soft actor-critic (SAC) framework. 
The experimental results on a 2D navigation task and meta-RL benchmarks show that our approach can learn a more interpretable meta-policy to explore unknown environments and that its performance is comparable to previous gradient-based algorithms.",/pdf/9f12231f7b8c5a1c8e9bdc0db1c6b71e760f3521.pdf,ICLR,2021,a new meta-RL algorithm that can reduce the variance and bias of the meta-gradient estimation and perform few-shot task data sampling +C3qvk5IQIJY,tTQrn2MsSMt,1601310000000.0,1616060000000.0,2213,Understanding Over-parameterization in Generative Adversarial Networks,"[""~Yogesh_Balaji1"", ""sajedi@usc.edu"", ""~Neha_Mukund_Kalibhat1"", ""~Mucong_Ding1"", ""~Dominik_St\u00f6ger1"", ""~Mahdi_Soltanolkotabi1"", ""~Soheil_Feizi2""]","[""Yogesh Balaji"", ""Mohammadmahdi Sajedi"", ""Neha Mukund Kalibhat"", ""Mucong Ding"", ""Dominik St\u00f6ger"", ""Mahdi Soltanolkotabi"", ""Soheil Feizi""]","[""GAN"", ""Over-parameterization"", ""min-max optimization""]","A broad class of unsupervised deep learning methods such as Generative Adversarial Networks (GANs) involve training of overparameterized models where the number of parameters of the model exceeds a certain threshold. Indeed, most successful GANs used in practice are trained using overparameterized generator and discriminator networks, both in terms of depth and width. A large body of work in supervised learning has shown the importance of model overparameterization in the convergence of the gradient descent (GD) to globally optimal solutions. In contrast, the unsupervised setting and GANs in particular involve non-convex concave mini-max optimization problems that are often trained using Gradient Descent/Ascent (GDA). +The role and benefits of model overparameterization in the convergence of GDA to a global saddle point in non-convex concave problems are far less understood. In this work, we present a comprehensive analysis of the importance of model overparameterization in GANs both theoretically and empirically. 
We theoretically show that in an overparameterized GAN model with a $1$-layer neural network generator and a linear discriminator, GDA converges to a global saddle point of the underlying non-convex concave min-max problem. To the best of our knowledge, this is the first result for global convergence of GDA in such settings. Our theory is based on a more general result that holds for a broader class of nonlinear generators and discriminators that obey certain assumptions (including deeper generators and random feature discriminators). Our theory utilizes and builds upon a novel connection with the convergence analysis of linear time-varying dynamical systems which may have broader implications for understanding the convergence behavior of GDA for non-convex concave problems involving overparameterized models. We also empirically study the role of model overparameterization in GANs using several large-scale experiments on CIFAR-10 and Celeb-A datasets. Our experiments show that overparameterization improves the quality of generated samples across various model architectures and datasets. Remarkably, we observe that overparameterization leads to faster and more stable convergence behavior of GDA across the board.",/pdf/c7a1dc846cfadfda171147b173c3155d0de4ef8a.pdf,ICLR,2021,We present an analysis of over-parameterization in GANs both theoretically and empirically. +yeeS_HULL7Z,_t8dSXx7B1F,1601310000000.0,1614990000000.0,820,Attention-Based Clustering: Learning a Kernel from Context,"[""~Samuel_Coward1"", ""erik.visse-martindale@uk.zuken.com"", ""~Chithrupa_Ramesh1""]","[""Samuel Coward"", ""Erik Visse-Martindale"", ""Chithrupa Ramesh""]","[""Similarity learning"", ""kernel methods"", ""constrained clustering"", ""transformer analysis"", ""spectral clustering"", ""supervised learning"", ""deep learning""]","In machine learning, no data point stands alone. We believe that context is an underappreciated concept in many machine learning methods. 
We propose Attention-Based Clustering (ABC), a neural architecture based on the attention mechanism, which is designed to learn latent representations that adapt to context within an input set, and which is inherently agnostic to input sizes and number of clusters. By learning a similarity kernel, our method directly combines with any out-of-the-box kernel-based clustering approach. We present competitive results for clustering Omniglot characters and include analytical evidence of the effectiveness of an attention-based approach for clustering. ",/pdf/53f330d0f4cb5e6abfa5b444d281ec22ccfa6b65.pdf,ICLR,2021,"We propose an attention-based architecture that utilizes contextual information to learn a kernel, and combine it with an off-the-shelf clustering method to obtain state-of-the-art results on the Omniglot dataset." +BJewlyStDr,r1xkGSidPB,1569440000000.0,1583910000000.0,1510,On Bonus Based Exploration Methods In The Arcade Learning Environment,"[""adrien.alitaiga@gmail.com"", ""liamfedus@google.com"", ""marlosm@google.com"", ""aaron.courville@gmail.com"", ""bellemare@google.com""]","[""Adrien Ali Taiga"", ""William Fedus"", ""Marlos C. Machado"", ""Aaron Courville"", ""Marc G. Bellemare""]","[""exploration"", ""arcade learning environment"", ""bonus-based methods""]","Research on exploration in reinforcement learning, as applied to Atari 2600 game-playing, has emphasized tackling difficult exploration problems such as Montezuma's Revenge (Bellemare et al., 2016). Recently, bonus-based exploration methods, which explore by augmenting the environment reward, have reached above-human average performance on such domains. In this paper we reassess popular bonus-based exploration methods within a common evaluation framework. We combine Rainbow (Hessel et al., 2018) with different exploration bonuses and evaluate its performance on Montezuma's Revenge, Bellemare et al.'s set of hard of exploration games with sparse rewards, and the whole Atari 2600 suite. 
We find that while exploration bonuses lead to higher score on Montezuma's Revenge they do not provide meaningful gains over the simpler epsilon-greedy scheme. In fact, we find that methods that perform best on that game often underperform epsilon-greedy on easy exploration Atari 2600 games. We find that our conclusions remain valid even when hyperparameters are tuned for these easy-exploration games. Finally, we find that none of the methods surveyed benefit from additional training samples (1 billion frames, versus Rainbow's 200 million) on Bellemare et al.'s hard exploration games. Our results suggest that recent gains in Montezuma's Revenge may be better attributed to architecture change, rather than better exploration schemes; and that the real pace of progress in exploration research for Atari 2600 games may have been obfuscated by good results on a single domain.",/pdf/43cfcb6b201731f2f54251fcc64d67a6bb883c96.pdf,ICLR,2020,We find that existing bonus-based exploration methods have not been able to address the exploration-exploitation trade-off in the Arcade Learning Environment. +SyW4Gjg0W,SklVGsxCZ,1509110000000.0,1518730000000.0,367,Kernel Graph Convolutional Neural Nets,"[""giannisnik@hotmail.com"", ""pmeladianos@aueb.gr"", ""antoine.tixier-1@colorado.edu"", ""kskianis@lix.polytechnique.fr"", ""mvazirg@lix.polytechnique.fr""]","[""Giannis Nikolentzos"", ""Polykarpos Meladianos"", ""Antoine J-P Tixier"", ""Konstantinos Skianis"", ""Michalis Vazirgiannis""]",[],"Graph kernels have been successfully applied to many graph classification problems. Typically, a kernel is first designed, and then an SVM classifier is trained based on the features defined implicitly by this kernel. This two-stage approach decouples data representation from learning, which is suboptimal. On the other hand, Convolutional Neural Networks (CNNs) have the capability to learn their own features directly from the raw data during training. 
Unfortunately, they cannot handle irregular data such as graphs. We address this challenge by using graph kernels to embed meaningful local neighborhoods of the graphs in a continuous vector space. A set of filters is then convolved with these patches, pooled, and the output is then passed to a feedforward network. With limited parameter tuning, our approach outperforms strong baselines on 7 out of 10 benchmark datasets, and reaches comparable performance elsewhere. Code and data are publicly available.",/pdf/834e53db11e516983962363252577d9c330b4197.pdf,ICLR,2018, +_ojjh-QFiFr,8JVva4S5NxR3,1601310000000.0,1614990000000.0,2086,"Language-Mediated, Object-Centric Representation Learning","[""~Ruocheng_Wang2"", ""~Jiayuan_Mao1"", ""~Samuel_Gershman1"", ""~Jiajun_Wu1""]","[""Ruocheng Wang"", ""Jiayuan Mao"", ""Samuel Gershman"", ""Jiajun Wu""]","[""Object-Centric Representation Learning"", ""Concept Learning""]","We present Language-mediated, Object-centric Representation Learning (LORL), learning disentangled, object-centric scene representations from vision and language. LORL builds upon recent advances in unsupervised object segmentation, notably MONet and Slot Attention. Just like these algorithms, LORL also learns an object-centric representation by reconstructing the input image. But LORL further learns to associate the learned representations to concepts, i.e., words for object categories, properties, and spatial relationships, from language input. These object-centric concepts derived from language facilitate the learning of object-centric representations. LORL can be integrated with various unsupervised segmentation algorithms that are language-agnostic. Experiments show that LORL consistently improves the performance of MONet and Slot Attention on two datasets via the help of language. 
We also show that concepts learned by LORL aid downstream tasks such as referential expression interpretation.",/pdf/25ae18c0c04aa8f948dd240ecfbb54b34e035bcb.pdf,ICLR,2021,"We present a framework for learning disentangled, object-centric scene representations from vision and language." +Bygre3R9Fm,r1ltVLT9t7,1538090000000.0,1545360000000.0,1072,DEFactor: Differentiable Edge Factorization-based Probabilistic Graph Generation,"[""rim.assouel@hotmail.fr"", ""mohamed.ahmed@benevolent.ai"", ""marwin.segler@benevolent.ai"", ""amir.saffari@benevolent.ai"", ""yoshua.bengio@mila.quebec""]","[""Rim Assouel"", ""Mohamed Ahmed"", ""Marwin Segler"", ""Amir Saffari"", ""Yoshua Bengio""]","[""molecular graphs"", ""conditional autoencoder"", ""graph autoencoder""]","Generating novel molecules with optimal properties is a crucial step in many industries such as drug discovery. +Recently, deep generative models have shown a promising way of performing de-novo molecular design. +Although graph generative models are currently available they either have a graph size dependency in their number of parameters, limiting their use to only very small graphs or are formulated as a sequence of discrete actions needed to construct a graph, making the output graph non-differentiable w.r.t the model parameters, therefore preventing them to be used in scenarios such as conditional graph generation. In this work we propose a model for conditional graph generation that is computationally efficient and enables direct optimisation of the graph. We demonstrate favourable performance of our model on prototype-based molecular graph conditional generation tasks.",/pdf/919b588768ad744dcb63250c18f1b7bbe79b2fe4.pdf,ICLR,2019,New scalable graph decoding scheme that allows to perform direct molecular graph conditional generation. 
+fgpXAu8puGj,4X1_WdNd6n5,1601310000000.0,1614990000000.0,672,NAHAS: Neural Architecture and Hardware Accelerator Search,"[""~Yanqi_Zhou1"", ""~Xuanyi_Dong1"", ""~Daiyi_Peng1"", ""ethanzhu@google.com"", ""~Amir_Yazdanbakhsh1"", ""~Berkin_Akin1"", ""~Mingxing_Tan3"", ""~James_Laudon1""]","[""Yanqi Zhou"", ""Xuanyi Dong"", ""Daiyi Peng"", ""Ethan Zhu"", ""Amir Yazdanbakhsh"", ""Berkin Akin"", ""Mingxing Tan"", ""James Laudon""]","[""neural architecture search"", ""systems"", ""hardware""]","Neural architectures and hardware accelerators have been two driving forces for the rapid progress in deep learning. +Although previous works have optimized either neural architectures given fixed hardware, or hardware given fixed neural architectures, none has considered optimizing them jointly. In this paper, we study the importance of co-designing neural architectures and hardware accelerators. To this end, we propose NAHAS, an automated hardware design paradigm that jointly searches for the best configuration for both neural architecture and accelerator. In NAHAS, accelerator hardware design is conditioned on the dynamically explored neural networks for the targeted application, instead of fixed architectures, thus providing better performance opportunities. Our experiments with an industry-standard edge accelerator show that NAHAS consistently outperforms previous platform-aware neural architecture search and state-of-the-art EfficientNet on all latency targets by 0.5% - 1% ImageNet top-1 accuracy, while reducing latency by about 20%. Joint optimization reduces the search samples by 2x and reduces the latency constraint violations from 3 violations to 1 violation per 4 searches, compared to independently optimizing the two sub spaces.",/pdf/4bec214a6a17e41b35876512de1c4b3976c6176b.pdf,ICLR,2021,"We propose NAHAS, a latency-driven software/hardware co-optimizer that jointly optimize the design of neural architectures and a mobile edge processor." 
+SklibJBFDB,Bye4gTouPH,1569440000000.0,1577170000000.0,1555,Evaluating Semantic Representations of Source Code,"[""yaza.wainakh@gmail.com"", ""moiz.rauf@iste.uni-stuttgart.de"", ""michael@binaervarianz.de""]","[""Yaza Wainakh"", ""Moiz Rauf"", ""Michael Pradel""]","[""embeddings"", ""representation"", ""source code"", ""identifiers""]","Learned representations of source code enable various software developer tools, e.g., to detect bugs or to predict program properties. At the core of code representations often are word embeddings of identifier names in source code, because identifiers account for the majority of source code vocabulary and convey important semantic information. Unfortunately, there currently is no generally accepted way of evaluating the quality of word embeddings of identifiers, and current evaluations are biased toward specific downstream tasks. This paper presents IdBench, the first benchmark for evaluating to what extent word embeddings of identifiers represent semantic relatedness and similarity. The benchmark is based on thousands of ratings gathered by surveying 500 software developers. We use IdBench to evaluate state-of-the-art embedding techniques proposed for natural language, an embedding technique specifically designed for source code, and lexical string distance functions, as these are often used in current developer tools. Our results show that the effectiveness of embeddings varies significantly across different embedding techniques and that the best available embeddings successfully represent semantic relatedness. On the downside, no existing embedding provides a satisfactory representation of semantic similarities, e.g., because embeddings consider identifiers with opposing meanings as similar, which may lead to fatal mistakes in downstream developer tools. IdBench provides a gold standard to guide the development of novel embeddings that address the current limitations. 
+",/pdf/c62c376c1ea19bc9ccceecf4e38814d9f3baebe7.pdf,ICLR,2020,A benchmark to evaluate neural embeddings of identifiers in source code. +rkMt1bWAZ,SkWKJbZCW,1509130000000.0,1518730000000.0,641,Bias-Variance Decomposition for Boltzmann Machines,"[""mahito@nii.ac.jp"", ""tsuda@k.u-tokyo.ac.jp"", ""hiro@brain.riken.jp""]","[""Mahito Sugiyama"", ""Koji Tsuda"", ""Hiroyuki Nakahara""]","[""Boltzmann machine"", ""bias-variance decomposition"", ""information geometry""]","We achieve bias-variance decomposition for Boltzmann machines using an information geometric formulation. Our decomposition leads to an interesting phenomenon that the variance does not necessarily increase when more parameters are included in Boltzmann machines, while the bias always decreases. Our result gives a theoretical evidence of the generalization ability of deep learning architectures because it provides the possibility of increasing the representation power with avoiding the variance inflation.",/pdf/c4695b8f6c9bea5c3c1368e703d7e64aa7d50ccd.pdf,ICLR,2018,We achieve bias-variance decomposition for Boltzmann machines using an information geometric formulation. +ryk77mbRZ,B1Paz7ZCZ,1509140000000.0,1518730000000.0,1151,Noise-Based Regularizers for Recurrent Neural Networks,"[""abd2141@columbia.edu"", ""altosaar@princeton.edu"", ""rajeshr@cs.princeton.edu"", ""david.blei@columbia.edu""]","[""Adji B. Dieng"", ""Jaan Altosaar"", ""Rajesh Ranganath"", ""David M. Blei""]",[],"Recurrent neural networks (RNNs) are powerful models for sequential data. They can approximate arbitrary computations, and have been used successfully in domains such as text and speech. However, the flexibility of RNNs makes them susceptible to overfitting and regularization is important. We develop a noise-based regularization method for RNNs. The idea is simple and easy to implement: we inject noise in the hidden units of the RNN and then maximize the original RNN's likelihood averaged over the injected noise. 
On a language modeling benchmark, our method achieves better performance than the deterministic RNN and the variational dropout.",/pdf/f5434c16d9149ba2ecf5dff8e5b5a34dce8e600b.pdf,ICLR,2018, +Tio_oO2ga3u,wqpYK7yW7_,1601310000000.0,1614990000000.0,2379,Deep Ensemble Kernel Learning,"[""dagrawa2@vols.utk.edu"", ""~Jacob_D_Hinkle1""]","[""Devanshu Agrawal"", ""Jacob D Hinkle""]","[""kernel-learning"", ""gaussian-process"", ""Bayesian"", ""ensemble""]","Gaussian processes (GPs) are nonparametric Bayesian models that are both flexible and robust to overfitting. One of the main challenges of GP methods is selecting the kernel. In the deep kernel learning (DKL) paradigm, a deep neural network or ``feature network'' is used to map inputs into a latent feature space, where a GP with a ``base kernel'' acts; the resulting model is then trained in an end-to-end fashion. In this work, we introduce the ``deep ensemble kernel learning'' (DEKL) model, which is a special case of DKL. In DEKL, a linear base kernel is used, enabling exact optimization of the base kernel hyperparameters and a scalable inference method that does not require approximation by inducing points. We also represent the feature network as a concatenation of an ensemble of learner networks with a common architecture, allowing for easy model parallelism. We show that DEKL is able to approximate any kernel if the number of learners in the ensemble is arbitrarily large. 
Comparing the DEKL model to DKL and deep ensemble (DE) baselines on both synthetic and real-world regression tasks, we find that DEKL often outperforms both baselines in terms of predictive performance and that the DEKL learners tend to be more diverse (i.e., less correlated with one another) compared to the DE learners.",/pdf/11cd5068e221323b508fb39db7b5fdd89deb9b23.pdf,ICLR,2021,"We present a joint training method for neural network ensembles using deep kernel learning with a linear kernel, derive an efficient variational inference framework for it, and show that it is a universal kernel approximator." +HkgB2TNYPS,HJeR61ldwB,1569440000000.0,1583910000000.0,780,A Theoretical Analysis of the Number of Shots in Few-Shot Learning,"[""tianshi.cao@mail.utoronto.ca"", ""law@cs.toronto.edu"", ""fidler@cs.toronto.edu""]","[""Tianshi Cao"", ""Marc T Law"", ""Sanja Fidler""]","[""Few shot learning"", ""Meta Learning"", ""Performance Bounds""]","Few-shot classification is the task of predicting the category of an example from a set of few labeled examples. The number of labeled examples per category is called the number of shots (or shot number). Recent works tackle this task through meta-learning, where a meta-learner extracts information from observed tasks during meta-training to quickly adapt to new tasks during meta-testing. In this formulation, the number of shots exploited during meta-training has an impact on the recognition performance at meta-test time. Generally, the shot number used in meta-training should match the one used in meta-testing to obtain the best performance. We introduce a theoretical analysis of the impact of the shot number on Prototypical Networks, a state-of-the-art few-shot classification method. From our analysis, we propose a simple method that is robust to the choice of shot number used during meta-training, which is a crucial hyperparameter. 
The performance of our model trained for an arbitrary meta-training shot number shows great performance for different values of meta-testing shot numbers. We experimentally demonstrate our approach on different few-shot classification benchmarks.",/pdf/5c16618d45cfeaeb2884417e23a1d7c31d1d68e6.pdf,ICLR,2020,The paper analyzes the effect of shot number on prototypical networks and proposes a robust method when the shot number differs from meta-training to meta-testing time. +S1lqMn05Ym,BygZzp35YX,1538090000000.0,1550860000000.0,1293,Information asymmetry in KL-regularized RL,"[""agalashov@google.com"", ""sidmj@google.com"", ""leonardh@google.com"", ""dhruvat@google.com"", ""schwarzjn@google.com"", ""gdesjardins@google.com"", ""lejlot@google.com"", ""ywteh@google.com"", ""razp@google.com"", ""heess@google.com""]","[""Alexandre Galashov"", ""Siddhant M. Jayakumar"", ""Leonard Hasenclever"", ""Dhruva Tirumala"", ""Jonathan Schwarz"", ""Guillaume Desjardins"", ""Wojciech M. Czarnecki"", ""Yee Whye Teh"", ""Razvan Pascanu"", ""Nicolas Heess""]","[""Deep Reinforcement Learning"", ""Continuous Control"", ""RL as Inference""]","Many real world tasks exhibit rich structure that is repeated across different parts of the state space or in time. In this work we study the possibility of leveraging such repeated structure to speed up and regularize learning. We start from the KL regularized expected reward objective which introduces an additional component, a default policy. Instead of relying on a fixed default policy, we learn it from data. But crucially, we restrict the amount of information the default policy receives, forcing it to learn reusable behaviors that help the policy learn faster. We formalize this strategy and discuss connections to information bottleneck approaches and to the variational EM algorithm. 
We present empirical results in both discrete and continuous action domains and demonstrate that, for certain tasks, learning a default policy alongside the policy can significantly speed up and improve learning. +Please watch the video demonstrating learned experts and default policies on several continuous control tasks ( https://youtu.be/U2qA3llzus8 ).",/pdf/b702f9a26f2cd46be2d9fed0c8a09c176930e77c.pdf,ICLR,2019,"Limiting state information for the default policy can improve performance, in a KL-regularized RL framework where both agent and default policy are optimized together" +SJNceh0cFX,HJle4WC5Km,1538090000000.0,1545360000000.0,1104,A RECURRENT NEURAL CASCADE-BASED MODEL FOR CONTINUOUS-TIME DIFFUSION PROCESS,"[""sylvain.lamprier@lip6.fr""]","[""Sylvain Lamprier""]","[""Information Diffusion"", ""Recurrent Neural Network"", ""Black Box Inference""]","Many works have been proposed in the literature to capture the dynamics of diffusion in networks. While some of them define graphical markovian models to extract temporal relationships between node infections in networks, others consider diffusion episodes as sequences of infections via recurrent neural models. In this paper we propose a model at the crossroads of these two extremes, which embeds the history of diffusion in infected nodes as hidden continuous states. Depending on the trajectory followed by the content before reaching a given node, the distribution of influence probabilities may vary. However, content trajectories are usually hidden in the data, which induces challenging learning problems. We propose a topological recurrent neural model which exhibits good experimental performances for diffusion modelling and prediction. 
",/pdf/a0e4b6907190189796c0584ed8124db6e51da0bf.pdf,ICLR,2019, +zrT3HcsWSAt,drsOFmc3KTg,1601310000000.0,1615510000000.0,2279,Behavioral Cloning from Noisy Demonstrations,"[""~Fumihiro_Sasaki2"", ""ryohta.yamashina@jp.ricoh.com""]","[""Fumihiro Sasaki"", ""Ryota Yamashina""]","[""Imitation Learning"", ""Inverse Reinforcement Learning"", ""Noisy Demonstrations""]","We consider the problem of learning an optimal expert behavior policy given noisy demonstrations that contain observations from both optimal and non-optimal expert behaviors. Popular imitation learning algorithms, such as generative adversarial imitation learning, assume that (clear) demonstrations are given from optimal expert policies but not the non-optimal ones, and thus often fail to imitate the optimal expert behaviors given the noisy demonstrations. Prior works that address the problem require (1) learning policies through environment interactions in the same fashion as reinforcement learning, and (2) annotating each demonstration with confidence scores or rankings. However, such environment interactions and annotations in real-world settings take impractically long training time and a significant human effort. In this paper, we propose an imitation learning algorithm to address the problem without any environment interactions and annotations associated with the non-optimal demonstrations. The proposed algorithm learns ensemble policies with a generalized behavioral cloning (BC) objective function where we exploit another policy already learned by BC. Experimental results show that the proposed algorithm can learn behavior policies that are much closer to the optimal policies than ones learned by BC.",/pdf/980d70256a0232aceda73c88d52522e48fff995d.pdf,ICLR,2021,We propose an imitation learning algorithm to learn from non-optimal (noisy) demonstrations without any environment interactions and annotations associated with the demonstrations. 
+S19dR9x0b,SkYdRqgR-,1509110000000.0,1518730000000.0,361,Alternating Multi-bit Quantization for Recurrent Neural Networks,"[""xuen@pku.edu.cn"", ""tianduo@taobao.com"", ""zlin@pku.edu.cn"", ""santong.oww@taobao.com"", ""lingzun.cyb@alibaba-inc.com"", ""qingfeng@taobao.com"", ""zha@cis.pku.edu.cn""]","[""Chen Xu"", ""Jianqiang Yao"", ""Zhouchen Lin"", ""Wenwu Ou"", ""Yuanbin Cao"", ""Zhirong Wang"", ""Hongbin Zha""]","[""Alternating Minimization"", ""Quantized Recurrent Neural Network"", ""Binary Search Tree""]","Recurrent neural networks have achieved excellent performance in many applications. However, on portable devices with limited resources, the models are often too large to deploy. For applications on the server with large scale concurrent requests, the latency during inference can also be very critical for costly computing resources. In this work, we address these problems by quantizing the network, both weights and activations, into multiple binary codes {-1,+1}. We formulate the quantization as an optimization problem. Under the key observation that once the quantization coefficients are fixed the binary codes can be derived efficiently by binary search tree, alternating minimization is then applied. We test the quantization for two well-known RNNs, i.e., long short term memory (LSTM) and gated recurrent unit (GRU), on the language models. Compared with the full-precision counterpart, by 2-bit quantization we can achieve ~16x memory saving and ~6x real inference acceleration on CPUs, with only a reasonable loss in the accuracy. By 3-bit quantization, we can achieve almost no loss in the accuracy or even surpass the original model, with ~10.5x memory saving and ~3x real inference acceleration. Both results beat the existing quantization works with large margins. We extend our alternating quantization to image classification tasks. 
In both RNNs and feedforward neural networks, the method also achieves excellent performance.",/pdf/13a05841e630c5aaf6bd6f033bbfb98071c80c11.pdf,ICLR,2018,We propose a new quantization method and apply it to quantize RNNs for both compression and acceleration +H1lGHsA9KX,S1eNFL0-YX,1538090000000.0,1545360000000.0,68,A Resizable Mini-batch Gradient Descent based on a Multi-Armed Bandit,"[""ipcng00@kaist.ac.kr"", ""sunghun.kang@kaist.ac.kr"", ""cd_yoo@kaist.ac.kr""]","[""Seong Jin Cho"", ""Sunghun Kang"", ""Chang D. Yoo""]","[""Batch size"", ""Optimization"", ""Mini-batch gradient descent"", ""Multi-armed bandit""]","Determining the appropriate batch size for mini-batch gradient descent is always time consuming as it often relies on grid search. This paper considers a resizable mini-batch gradient descent (RMGD) algorithm based on a multi-armed bandit that achieves performance equivalent to that of best fixed batch-size. At each epoch, the RMGD samples a batch size according to a certain probability distribution proportional to a batch being successful in reducing the loss function. Sampling from this probability provides a mechanism for exploring different batch size and exploiting batch sizes with history of success. After obtaining the validation loss at each epoch with the sampled batch size, the probability distribution is updated to incorporate the effectiveness of the sampled batch size. Experimental results show that the RMGD achieves performance better than the best performing single batch size. It is surprising that the RMGD achieves better performance than grid search. Furthermore, it attains this performance in a shorter amount of time than grid search.",/pdf/9b31dd8a6abc5d5d37be4088ce2eef5d8cb66d13.pdf,ICLR,2019,An optimization algorithm that explores various batch sizes based on probability and automatically exploits successful batch size which minimizes validation loss. 
+ByxRM0Ntvr,SkxfL7HuDr,1569440000000.0,1583910000000.0,1020,Are Transformers universal approximators of sequence-to-sequence functions?,"[""chulheey@mit.edu"", ""bsrinadh@google.com"", ""ankitsrawat@google.com"", ""sashank@google.com"", ""sanjivk@google.com""]","[""Chulhee Yun"", ""Srinadh Bhojanapalli"", ""Ankit Singh Rawat"", ""Sashank Reddi"", ""Sanjiv Kumar""]","[""Transformer"", ""universal approximation"", ""contextual mapping"", ""expressive power"", ""permutation equivariance""]","Despite the widespread adoption of Transformer models for NLP tasks, the expressive power of these models is not well-understood. In this paper, we establish that Transformer models are universal approximators of continuous permutation equivariant sequence-to-sequence functions with compact support, which is quite surprising given the amount of shared parameters in these models. Furthermore, using positional encodings, we circumvent the restriction of permutation equivariance, and show that Transformer models can universally approximate arbitrary continuous sequence-to-sequence functions on a compact domain. Interestingly, our proof techniques clearly highlight the different roles of the self-attention and the feed-forward layers in Transformers. In particular, we prove that fixed width self-attention layers can compute contextual mappings of the input sequences, playing a key role in the universal approximation property of Transformers. Based on this insight from our analysis, we consider other simpler alternatives to self-attention layers and empirically evaluate them.",/pdf/23ea8fe1cb7484f280fc69c6dd8b02a6348b4e2a.pdf,ICLR,2020,We prove that Transformer networks are universal approximators of sequence-to-sequence functions. 
+HkgnpiR9Y7,r1elfJzqF7,1538090000000.0,1545360000000.0,838,Recycling the discriminator for improving the inference mapping of GAN,"[""duhyeonbang@yonsei.ac.kr"", ""kateshim@yonsei.ac.kr""]","[""Duhyeon Bang"", ""Hyunjung Shim""]",[],"Generative adversarial networks (GANs) have achieved outstanding success in generating the high-quality data. Focusing on the generation process, existing GANs learn a unidirectional mapping from the latent vector to the data. Later, various studies point out that the latent space of GANs is semantically meaningful and can be utilized in advanced data analysis and manipulation. In order to analyze the real data in the latent space of GANs, it is necessary to investigate the inverse generation mapping from the data to the latent vector. To tackle this problem, the bidirectional generative models introduce an encoder to establish the inverse path of the generation process. Unfortunately, this effort leads to the degradation of generation quality because the imperfect generator rather interferes the encoder training and vice versa. +In this paper, we propose an effective algorithm to infer the latent vector based on existing unidirectional GANs by preserving their generation quality. +It is important to note that we focus on increasing the accuracy and efficiency of the inference mapping but not influencing the GAN performance (i.e., the quality or the diversity of the generated sample). +Furthermore, utilizing the proposed inference mapping algorithm, we suggest a new metric for evaluating the GAN models by measuring the reconstruction error of unseen real data. +The experimental analysis demonstrates that the proposed algorithm achieves more accurate inference mapping than the existing method and provides the robust metric for evaluating GAN performance. 
",/pdf/978632ecd4fe110db10e6ca53a9f37cb5ab3c6d8.pdf,ICLR,2019, +jphnJNOwe36,9g3L7WfW57g,1601310000000.0,1616450000000.0,1977,Overparameterisation and worst-case generalisation: friend or foe?,"[""~Aditya_Krishna_Menon1"", ""~Ankit_Singh_Rawat1"", ""~Sanjiv_Kumar1""]","[""Aditya Krishna Menon"", ""Ankit Singh Rawat"", ""Sanjiv Kumar""]","[""overparameterisation"", ""worst-case generalisation""]","Overparameterised neural networks have demonstrated the remarkable ability to perfectly fit training samples, while still generalising to unseen test samples. However, several recent works have revealed that such models' good average performance does not always translate to good worst-case performance: in particular, they may perform poorly on subgroups that are under-represented in the training set. In this paper, we show that in certain settings, overparameterised models' performance on under-represented subgroups may be improved via post-hoc processing. Specifically, such models' bias can be restricted to their classification layers, and manifest as structured prediction shifts for rare subgroups. We detail two post-hoc correction techniques to mitigate this bias, which operate purely on the outputs of standard model training. We empirically verify that with such post-hoc correction, overparameterisation can improve average and worst-case performance.",/pdf/87a7cc95e2626c1015dbcadc41a7564ced3fa495.pdf,ICLR,2021,Overparameterised models' worst-subgroup performance can be improved via post-hoc processing. +udbMZR1cKE6,YRGesZaLco,1601310000000.0,1614990000000.0,2167,Grounding Language to Entities for Generalization in Reinforcement Learning,"[""~H._J._Austin_Wang1"", ""~Karthik_R_Narasimhan1""]","[""H. J. Austin Wang"", ""Karthik R Narasimhan""]","[""reinforcement learning"", ""language grounding""]","In this paper, we consider the problem of leveraging textual descriptions to improve generalization of control policies to new scenarios. 
Unlike prior work in this space, we do not assume access to any form of prior knowledge connecting text and state observations, and learn both symbol grounding and control policy simultaneously. This is challenging due to a lack of concrete supervision, and incorrect groundings can result in worse performance than policies that do not use the text at all. +We develop a new model, EMMA (Entity Mapper with Multi-modal Attention) which uses a multi-modal entity-conditioned attention module that allows for selective focus over relevant sentences in the manual for each entity in the environment. EMMA is end-to-end differentiable and can learn a latent grounding of entities and dynamics from text to observations using environment rewards as the only source of supervision. +To empirically test our model, we design a new framework of 1320 games and collect text manuals with free-form natural language via crowd-sourcing. We demonstrate that EMMA achieves successful zero-shot generalization to unseen games with new dynamics, obtaining significantly higher rewards compared to multiple baselines. The grounding acquired by EMMA is also robust to noisy descriptions and linguistic variation.",/pdf/b623f87a2e8f4603ca80b916a8dbb1ec49618fab.pdf,ICLR,2021,We use textual descriptions to improve generalization of control policies to new environments without prior knowledge connecting text and state observations. +rJxe3xSYDS,rkxEgxbYPS,1569440000000.0,1583910000000.0,2523,Extreme Classification via Adversarial Softmax Approximation,"[""rbamler@uci.edu"", ""stephan.mandt@gmail.com""]","[""Robert Bamler"", ""Stephan Mandt""]","[""Extreme classification"", ""negative sampling""]","Training a classifier over a large number of classes, known as 'extreme classification', has become a topic of major interest with applications in technology, science, and e-commerce. 
Traditional softmax regression induces a gradient cost proportional to the number of classes C, which often is prohibitively expensive. A popular scalable softmax approximation relies on uniform negative sampling, which suffers from slow convergence due to a poor signal-to-noise ratio. In this paper, we propose a simple training method for drastically enhancing the gradient signal by drawing negative samples from an adversarial model that mimics the data distribution. Our contributions are three-fold: (i) an adversarial sampling mechanism that produces negative samples at a cost only logarithmic in C, thus still resulting in cheap gradient updates; (ii) a mathematical proof that this adversarial sampling minimizes the gradient variance while any bias due to non-uniform sampling can be removed; (iii) experimental results on large scale data sets that show a reduction of the training time by an order of magnitude relative to several competitive baselines. +",/pdf/a27d71bd050d73f92c63ca09118c8d644e3b68e6.pdf,ICLR,2020,"An efficient, unbiased approximation of the softmax loss function for extreme classification" +SkgtbaVYvH,Hklwzk_LDH,1569440000000.0,1577170000000.0,383,AutoLR: A Method for Automatic Tuning of Learning Rate,"[""nkwatra@microsoft.com"", ""thejasvenkatesh97@gmail.com"", ""t-niiyer@microsoft.com"", ""ramjee@microsoft.com"", ""muthian@microsoft.com""]","[""Nipun Kwatra"", ""V Thejas"", ""Nikhil Iyer"", ""Ramachandran Ramjee"", ""Muthian Sivathanu""]","[""Automatic Learning Rate"", ""Deep Learning"", ""Generalization"", ""Stochastic Optimization""]","One very important hyperparameter for training deep neural networks is the +learning rate of the optimizer. The choice of learning rate schedule determines +the computational cost of getting close to a minimum, how close you actually get +to the minimum, and most importantly the kind of local minimum (wide/narrow) +attained. 
The kind of minima attained has a significant impact on the +generalization accuracy of the network. Current systems employ hand tuned +learning rate schedules, which are painstakingly tuned for each network and +dataset. Given that the state space of schedules is huge, finding a +satisfactory learning rate schedule can be very time consuming. In this paper, +we present AutoLR, a method for auto-tuning the learning rate as training +proceeds. Our method works with any optimizer, and we demonstrate results on +SGD, Momentum, and Adam optimizers. + +We extensively evaluate AutoLR on multiple datasets, models, and across +multiple optimizers. We compare favorably against state of the art learning +rate schedules for the given dataset and models, including for ImageNet on +Resnet-50, Cifar-10 on Resnet-18, and SQuAD fine-tuning on BERT. For example, +AutoLR achieves an EM score of 81.2 on SQuAD v1.1 with BERT_BASE compared to +80.8 reported in (Devlin et al. (2018)) by just auto-tuning the learning rate +schedule. To the best of our knowledge, this is the first automatic learning +rate tuning scheme to achieve state of the art generalization accuracy on these +datasets with the given models. 
+",/pdf/9303c3a9ff978eb4dbdff1685a5f6ff60dd97eba.pdf,ICLR,2020,"We present a method to automatically tune learning rate while training DNNs, and achieve or beat generalization accuracy of SOTA learning rates schedules for ImageNet (Resnet-50), Cifar-10 (Resnet-18), IWSLT (Transformer), Squad (Bert)" +SkHkeixAW,rJHyxsgAb,1509110000000.0,1518730000000.0,364,Regularization for Deep Learning: A Taxonomy,"[""jan.kukacka@tum.de"", ""vladimir.golkov@tum.de"", ""cremers@tum.de""]","[""Jan Kuka\u010dka"", ""Vladimir Golkov"", ""Daniel Cremers""]","[""neural networks"", ""deep learning"", ""regularization"", ""data augmentation"", ""network architecture"", ""loss function"", ""dropout"", ""residual learning"", ""optimization""]","Regularization is one of the crucial ingredients of deep learning, yet the term regularization has various definitions, and regularization methods are often studied separately from each other. In our work we present a novel, systematic, unifying taxonomy to categorize existing methods. We distinguish methods that affect data, network architectures, error terms, regularization terms, and optimization procedures. We identify the atomic building blocks of existing methods, and decouple the assumptions they enforce from the mathematical tools they rely on. We do not provide all details about the listed methods; instead, we present an overview of how the methods can be sorted into meaningful categories and sub-categories. This helps revealing links and fundamental similarities between them. Finally, we include practical recommendations both for users and for developers of new regularization methods.",/pdf/a23513dccf41ba532a7828e6b28e719e49fa6d48.pdf,ICLR,2018,"Systematic categorization of regularization methods for deep learning, revealing their similarities." 
+24-DxeAe2af,BUHvlALduJA,1601310000000.0,1614990000000.0,3132,Accurate and fast detection of copy number variations from short-read whole-genome sequencing with deep convolutional neural network,"[""~Jiajin_Li3"", ""sjhwang@ucsc.edu"", ""zhanglucasjifeng@gmail.com"", ""~Jae_Hoon_Sul1""]","[""Jiajin Li"", ""Stephen Hwang"", ""Luke Zhang"", ""Jae Hoon Sul""]","[""copy number variation"", ""deep learning"", ""convolutional neural network"", ""computational biology"", ""DNA sequencing""]","A copy number variant (CNV) is a type of genetic mutation where a stretch of DNA is lost or duplicated once or multiple times. CNVs play important roles in the development of diseases and complex traits. CNV detection with short-read DNA sequencing technology is challenging because CNVs significantly vary in size and are similar to DNA sequencing artifacts. Many methods have been developed but still yield unsatisfactory results with high computational costs. Here, we propose CNV-Net, a novel approach for CNV detection using a six-layer convolutional neural network. We encode DNA sequencing information into RGB images and train the convolutional neural network with these images. The fitted convolutional neural network can then be used to predict CNVs from DNA sequencing data. We benchmark CNV-Net with two high-quality whole-genome sequencing datasets available from the Genome in a Bottle Consortium, considered as gold standard benchmarking datasets for CNV detection. 
We demonstrate that CNV-Net is more accurate and efficient in CNV detection than current tools.",/pdf/998048333ded7a30efa128650ba190c4680059d0.pdf,ICLR,2021,"Developed CNV-Net to detect CNV, a type of genetic mutation, from short-read DNA sequencing using deep convolutional neural network" +rklj3gBYvH,rJxb9lWKwS,1569440000000.0,1577170000000.0,2549,NORML: Nodal Optimization for Recurrent Meta-Learning,"[""davidpetrus94@gmail.com""]","[""David van Niekerk""]","[""meta-learning"", ""learning to learn"", ""few-shot classification"", ""memory-based optimization""]","Meta-learning is an exciting and powerful paradigm that aims to improve the effectiveness of current learning systems. By formulating the learning process as an optimization problem, a model can learn how to learn while requiring significantly less data or experience than traditional approaches. Gradient-based meta-learning methods aim to do just that; however, recent work has shown that the effectiveness of these approaches is primarily due to feature reuse and has very little to do with priming the system for rapid learning (learning to make effective weight updates on unseen data distributions). This work introduces Nodal Optimization for Recurrent Meta-Learning (NORML), a novel meta-learning framework where an LSTM-based meta-learner performs neuron-wise optimization on a learner for efficient task learning. Crucially, the number of meta-learner parameters needed in NORML increases linearly relative to the number of learner parameters, allowing NORML to potentially scale to learner networks with very large numbers of parameters. 
While NORML also benefits from feature reuse it is shown experimentally that the meta-learner LSTM learns to make effective weight updates using information from previous data-points and update steps.",/pdf/6293ea0dc75b348868f8c9ae456d3af4c49b9366.pdf,ICLR,2020,A novel meta-learning method is introduced where a meta-learner learns to optimize a learner's weight updates by optimizing the input and output to and from each node in the learner network. +B1l3M64KwB,SJxZeDhIvB,1569440000000.0,1577170000000.0,426,How many weights are enough : can tensor factorization learn efficient policies ?,"[""phr17@ic.ac.uk"", ""ak711@imperial.ac.uk"", ""y.guo@imperial.ac.uk""]","[""Pierre H. Richemond"", ""Arinbjorn Kolbeinsson"", ""Yike Guo""]","[""reinforcement learning"", ""Q-learning"", ""tensor factorization"", ""low-rank approximation"", ""data efficiency"", ""second-order optimization"", ""scattering""]","Deep reinforcement learning requires a heavy price in terms of sample efficiency and overparameterization in the neural networks used for function approximation. In this work, we employ tensor factorization in order to learn more compact representations for reinforcement learning policies. We show empirically that in the low-data regime, it is possible to learn online policies with 2 to 10 times less total coefficients, with little to no loss of performance. We also leverage progress in second order optimization, and use the theory of wavelet scattering to further reduce the number of learned coefficients, by foregoing learning the topmost convolutional layer filters altogether. 
We evaluate our results on the Atari suite against recent baseline algorithms that represent the state-of-the-art in data efficiency, and get comparable results with an order of magnitude gain in weight parsimony.",/pdf/d460c69d0084c2a2642e45a213ca42d7616f8254.pdf,ICLR,2020, +dgtpE6gKjHn,_U6h9HndOtY,1601310000000.0,1616310000000.0,2078,FedBE: Making Bayesian Model Ensemble Applicable to Federated Learning,"[""~Hong-You_Chen1"", ""~Wei-Lun_Chao1""]","[""Hong-You Chen"", ""Wei-Lun Chao""]",[],"Federated learning aims to collaboratively train a strong global model by accessing users' locally trained models but not their own data. A crucial step is therefore to aggregate local models into a global model, which has been shown to be challenging when users have non-i.i.d. data. In this paper, we propose a novel aggregation algorithm named FedBE, which takes a Bayesian inference perspective by sampling higher-quality global models and combining them via Bayesian model Ensemble, leading to much more robust aggregation. We show that an effective model distribution can be constructed by simply fitting a Gaussian or Dirichlet distribution to the local models. Our empirical studies validate FedBE's superior performance, especially when users' data are not i.i.d. and when the neural networks go deeper. 
Moreover, FedBE is compatible with recent efforts in regularizing users' model training, making it an easily applicable module: you only need to replace the aggregation method but leave other parts of your federated learning algorithm intact.",/pdf/9ba5f8c61a25604e0dd3f88e6c9d54997bdad777.pdf,ICLR,2021, +rJWrK9lAb,ryxSKqxCW,1509100000000.0,1518730000000.0,354,Autoregressive Generative Adversarial Networks,"[""yasin001@e.ntu.edu.sg"", ""ekhyap@ntu.edu.sg"", ""stefan.winkler@adsc.com.sg""]","[""Yasin Yazici"", ""Kim-Hui Yap"", ""Stefan Winkler""]","[""Generative Adversarial Networks"", ""Latent Space Modeling""]","Generative Adversarial Networks (GANs) learn a generative model by playing an adversarial game between a generator and an auxiliary discriminator, which classifies data samples vs. generated ones. However, it does not explicitly model feature co-occurrences in samples. In this paper, we propose a novel Autoregressive Generative Adversarial Network (ARGAN), that models the latent distribution of data using an autoregressive model, rather than relying on binary classification of samples into data/generated categories. In this way, feature co-occurrences in samples can be more efficiently captured. Our model was evaluated on two widely used datasets: CIFAR-10 and STL-10. Its performance is competitive with respect to other GAN models both quantitatively and qualitatively.",/pdf/342eecf1599bca7f7104fd8c29c3ee3bb748da25.pdf,ICLR,2018, +HklmoRVYvr,r1lg2WtdvS,1569440000000.0,1577170000000.0,1314,Long History Short-Term Memory for Long-Term Video Prediction,"[""wonmin.byeon@gmail.com"", ""jkautz@nvidia.com""]","[""Wonmin Byeon"", ""Jan Kautz""]","[""LSTM"", ""video"", ""long-term prediction""]","While video prediction approaches have advanced considerably in recent years, learning to predict long-term future is challenging — ambiguous future or error propagation over time yield blurry predictions. 
To address this challenge, existing algorithms rely on extra supervision (e.g., action or object pose), motion flow learning, or adversarial training. In this paper, we propose a new recurrent unit, Long History Short-Term Memory (LH-STM). LH-STM incorporates long history states into a recurrent unit to learn longer range dependencies. To capture spatio-temporal dynamics in videos, we combined LH-STM with the Context-aware Video Prediction model (ContextVP). Our experiments on the KTH human actions and BAIR robot pushing datasets demonstrate that our approach produces not only sharper near-future predictions, but also farther into the future compared to the state-of-the-art methods. ",/pdf/4c0f4f9698b7400044070a992a193896a40d9fbb.pdf,ICLR,2020,"We propose a new recurrent unit, Long History Short-Term Memory (LH-STM) which incorporates long history states into a recurrent unit to learn longer range dependencies." +HyY4Owjll,,1478350000000.0,1488350000000.0,557,Boosted Generative Models,"[""adityag@cs.stanford.edu"", ""ermon@cs.stanford.edu""]","[""Aditya Grover"", ""Stefano Ermon""]","[""Theory"", ""Deep learning"", ""Unsupervised Learning""]","We propose a new approach for using boosting to create an ensemble of generative models, where models are trained in sequence to correct earlier mistakes. Our algorithm can leverage many existing base learners, including recent latent variable models. Further, our approach allows the ensemble to leverage discriminative models trained to distinguish real data from model generated data. 
We show theoretical conditions under which incorporating a new model to the ensemble will improve the fit and empirically demonstrate the effectiveness of boosting on density estimation and sample generation on real and synthetic datasets.",/pdf/b9c75622baa6089d0084628fa33037f8cbf56932.pdf,ICLR,2017, +SyxKrySYPr,BJeWKMauDS,1569440000000.0,1577170000000.0,1700,Stabilizing Transformers for Reinforcement Learning,"[""eparisot@cs.cmu.edu"", ""songf@google.com"", ""jwrae@google.com"", ""razp@google.com"", ""caglarg@google.com"", ""sidmj@google.com"", ""jaderberg@google.com"", ""rlopezkaufman@google.com"", ""aidanclark@google.com"", ""snoury@google.com"", ""botvinick@google.com"", ""heess@google.com"", ""raia@google.com""]","[""Emilio Parisotto"", ""Francis Song"", ""Jack Rae"", ""Razvan Pascanu"", ""Caglar Gulcehre"", ""Siddhant Jayakumar"", ""Max Jaderberg"", ""Rapha\u00ebl Lopez Kaufman"", ""Aidan Clark"", ""Seb Noury"", ""Matt Botvinick"", ""Nicolas Heess"", ""Raia Hadsell""]","[""Deep Reinforcement Learning"", ""Transformer"", ""Reinforcement Learning"", ""Self-Attention"", ""Memory"", ""Memory for Reinforcement Learning""]","Owing to their ability to both effectively integrate information over long time horizons and scale to massive amounts of data, self-attention architectures have recently shown breakthrough success in natural language processing (NLP), achieving state-of-the-art results in domains such as language modeling and machine translation. Harnessing the transformer's ability to process long time horizons of information could provide a similar performance boost in partially-observable reinforcement learning (RL) domains, but the large-scale transformers used in NLP have yet to be successfully applied to the RL setting. In this work we demonstrate that the standard transformer architecture is difficult to optimize, which was previously observed in the supervised learning setting but becomes especially pronounced with RL objectives. 
We propose architectural modifications that substantially improve the stability and learning speed of the original Transformer and XL variant. The proposed architecture, the Gated Transformer-XL (GTrXL), surpasses LSTMs on challenging memory environments and achieves state-of-the-art results on the multi-task DMLab-30 benchmark suite, exceeding the performance of an external memory architecture. We show that the GTrXL, trained using the same losses, has stability and performance that consistently matches or exceeds a competitive LSTM baseline, including on more reactive tasks where memory is less critical. GTrXL offers an easy-to-train, simple-to-implement but substantially more expressive architectural alternative to the standard multi-layer LSTM ubiquitously used for RL agents in partially-observable environments. ",/pdf/e70ab7ad0e3b98b08fb4b2d28883e8a1a39981d9.pdf,ICLR,2020,"We succeed in stabilizing transformers for training in the RL setting and demonstrate a large improvement over LSTMs on DMLab-30, matching an external memory architecture." +rq_Qr0c1Hyo,pZK1mwfnpSL,1601310000000.0,1612190000000.0,3269,On the Origin of Implicit Regularization in Stochastic Gradient Descent,"[""~Samuel_L_Smith1"", ""~Benoit_Dherin1"", ""~David_Barrett1"", ""~Soham_De2""]","[""Samuel L Smith"", ""Benoit Dherin"", ""David Barrett"", ""Soham De""]","[""SGD"", ""learning rate"", ""batch size"", ""optimization"", ""generalization"", ""implicit regularization"", ""backward error analysis"", ""SDE"", ""stochastic differential equation"", ""ODE"", ""ordinary differential equation""]","For infinitesimal learning rates, stochastic gradient descent (SGD) follows the path of gradient flow on the full batch loss function. However moderately large learning rates can achieve higher test accuracies, and this generalization benefit is not explained by convergence bounds, since the learning rate which maximizes test accuracy is often larger than the learning rate which minimizes training loss. 
To interpret this phenomenon we prove that for SGD with random shuffling, the mean SGD iterate also stays close to the path of gradient flow if the learning rate is small and finite, but on a modified loss. This modified loss is composed of the original loss function and an implicit regularizer, which penalizes the norms of the minibatch gradients. Under mild assumptions, when the batch size is small the scale of the implicit regularization term is proportional to the ratio of the learning rate to the batch size. We verify empirically that explicitly including the implicit regularizer in the loss can enhance the test accuracy when the learning rate is small.",/pdf/e5f4bcf96d3ed905ac91e4ea6e3993321ecda830.pdf,ICLR,2021,"For small finite learning rates, the iterates of Random Shuffling SGD stay close to the path of gradient flow on a modified loss function containing an implicit regularizer." +LiX3ECzDPHZ,hYyY3ldwrNF,1601310000000.0,1617820000000.0,1373,X2T: Training an X-to-Text Typing Interface with Online Learning from User Feedback,"[""jenseng@berkeley.edu"", ""~Siddharth_Reddy1"", ""~Glen_Berseth1"", ""nhardy01@gmail.com"", ""nikhilesh.natraj@ucsf.edu"", ""karunesh.ganguly@ucsf.edu"", ""~Anca_Dragan1"", ""~Sergey_Levine1""]","[""Jensen Gao"", ""Siddharth Reddy"", ""Glen Berseth"", ""Nicholas Hardy"", ""Nikhilesh Natraj"", ""Karunesh Ganguly"", ""Anca Dragan"", ""Sergey Levine""]","[""reinforcement learning"", ""human-computer interaction""]","We aim to help users communicate their intent to machines using flexible, adaptive interfaces that translate arbitrary user input into desired actions. In this work, we focus on assistive typing applications in which a user cannot operate a keyboard, but can instead supply other inputs, such as webcam images that capture eye gaze or neural activity measured by a brain implant. 
Standard methods train a model on a fixed dataset of user inputs, then deploy a static interface that does not learn from its mistakes; in part, because extracting an error signal from user behavior can be challenging. We investigate a simple idea that would enable such interfaces to improve over time, with minimal additional effort from the user: online learning from user feedback on the accuracy of the interface's actions. In the typing domain, we leverage backspaces as feedback that the interface did not perform the desired action. We propose an algorithm called x-to-text (X2T) that trains a predictive model of this feedback signal, and uses this model to fine-tune any existing, default interface for translating user input into actions that select words or characters. We evaluate X2T through a small-scale online user study with 12 participants who type sentences by gazing at their desired words, a large-scale observational study on handwriting samples from 60 users, and a pilot study with one participant using an electrocorticography-based brain-computer interface. The results show that X2T learns to outperform a non-adaptive default interface, stimulates user co-adaptation to the interface, personalizes the interface to individual users, and can leverage offline data collected from the default interface to improve its initial performance and accelerate online learning.",/pdf/b299e92f993e7b50461323c326ad974dc5b67e09.pdf,ICLR,2021,We use online learning from user feedback to train an adaptive interface for typing words using inputs from a brain implant or webcam. 
+HJgd1nAqFX,Hyl76J0qF7,1538090000000.0,1550900000000.0,997,DOM-Q-NET: Grounded RL on Structured Language,"[""sheng.jia@mail.utoronto.ca"", ""kirosjamie@gmail.com"", ""jba@cs.utoronto.ca""]","[""Sheng Jia"", ""Jamie Ryan Kiros"", ""Jimmy Ba""]","[""Reinforcement Learning"", ""Web Navigation"", ""Graph Neural Networks""]","Building agents to interact with the web would allow for significant improvements in knowledge understanding and representation learning. However, web navigation tasks are difficult for current deep reinforcement learning (RL) models due to the large discrete action space and the varying number of actions between the states. In this work, we introduce DOM-Q-NET, a novel architecture for RL-based web navigation to address both of these problems. It parametrizes Q functions with separate networks for different action categories: clicking a DOM element and typing a string input. Our model utilizes a graph neural network to represent the tree-structured HTML of a standard web page. We demonstrate the capabilities of our model on the MiniWoB environment where we can match or outperform existing work without the use of expert demonstrations. Furthermore, we show 2x improvements in sample efficiency when training in the multi-task setting, allowing our model to transfer learned behaviours across tasks. ",/pdf/5b1c984f610ef1e0e115689134efb6b07b1b00aa.pdf,ICLR,2019,Graph-based Deep Q Network for Web Navigation +ryZ3KCy0W,H1x3Y0yRb,1509050000000.0,1518730000000.0,194,Link Weight Prediction with Node Embeddings,"[""yuchen.hou@wsu.edu"", ""holder@wsu.edu""]","[""Yuchen Hou"", ""Lawrence B. Holder""]",[],"Application of deep learning has been successful in various domains such as +image recognition, speech recognition and natural language processing. However, +the research on its application in graph mining is still in an early stage. 
Here we +present the first generic deep learning approach to the graph link weight prediction +problem based on node embeddings. We evaluate this approach with three +different node embedding techniques experimentally and compare its performance with +two state-of-the-art non deep learning baseline approaches. Our experiment +results suggest that this deep learning approach outperforms the baselines by up to +70% depending on the dataset and embedding technique applied. This approach +shows that deep learning can be successfully applied to link weight prediction to +improve prediction accuracy.",/pdf/fd61d4320eab5cdef7a0c6ff148d8672b47a45fa.pdf,ICLR,2018, +HJx0U64FwS,H1lJ1W9PvH,1569440000000.0,1577170000000.0,580,A Mechanism of Implicit Regularization in Deep Learning,"[""kubo@i.kyoto-u.ac.jp"", ""sugiura.genki.42n@st.kyoto-u.ac.jp"", ""shinzato.kenta.82r@st.kyoto-u.ac.jp"", ""oyama.momose.75c@st.kyoto-u.ac.jp""]","[""Masayoshi Kubo"", ""Genki Sugiura"", ""Kenta Shinzato"", ""Momose Oyama""]","[""Implicit Regularization"", ""Generalization"", ""Deep Neural Network"", ""Low Complexity""]","Despite a lot of theoretical efforts, very little is known about mechanisms of implicit regularization by which the low complexity contributes to generalization in deep learning. In particular, causality between the generalization performance, implicit regularization and nonlinearity of activation functions is one of the basic mysteries of deep neural networks (DNNs). In this work, we introduce a novel technique for DNNs called random walk analysis and reveal a mechanism of the implicit regularization caused by nonlinearity of ReLU activation. Surprisingly, our theoretical results suggest that the learned DNNs interpolate almost linearly between data points, which leads to the low complexity solutions in the over-parameterized regime. 
As a result, we prove that stochastic gradient descent can learn a class of continuously differentiable functions with generalization bounds of the order of $O(n^{-2})$ ($n$: the number of samples). Furthermore, our analysis is independent of the kernel methods, including neural tangent kernels.",/pdf/5f29937345b194b2eb2230129d4e6c81e895332d.pdf,ICLR,2020, +BygkQeHKwB,HyeZ9QgFDS,1569440000000.0,1577170000000.0,2196,"Walking on the Edge: Fast, Low-Distortion Adversarial Examples","[""hanwei.zhang@irisa.fr"", ""teddy.furon@inria.fr"", ""yannis@avrithis.net"", ""laurent.amsaleg@irisa.fr""]","[""Hanwei Zhang"", ""Teddy Furon"", ""Yannis Avrithis"", ""Laurent Amsaleg""]","[""Deep learning"", ""adversarial attack""]","Adversarial examples of deep neural networks are receiving ever increasing attention because they help in understanding and reducing the sensitivity to their input. This is natural given the increasing applications of deep neural networks in our everyday lives. When white-box attacks are almost always successful, it is typically only the distortion of the perturbations that matters in their evaluation. + +In this work, we argue that speed is important as well, especially when considering that fast attacks are required by adversarial training. Given more time, iterative methods can always find better solutions. We investigate this speed-distortion trade-off in some depth and introduce a new attack called boundary projection BP that improves upon existing methods by a large margin. 
Our key idea is that the classification boundary is a manifold in the image space: we therefore quickly reach the boundary and then optimize distortion on this manifold.",/pdf/59737fa1b00531666d530779ed2d25760c1a15e0.pdf,ICLR,2020, +Naqw7EHIfrv,PWatAPGtopv,1601310000000.0,1614540000000.0,544,Representation Learning for Sequence Data with Deep Autoencoding Predictive Components,"[""~Junwen_Bai1"", ""~Weiran_Wang1"", ""~Yingbo_Zhou1"", ""~Caiming_Xiong1""]","[""Junwen Bai"", ""Weiran Wang"", ""Yingbo Zhou"", ""Caiming Xiong""]","[""Mutual Information"", ""Unsupervised Learning"", ""Sequence Data"", ""Masked Reconstruction""]","We propose Deep Autoencoding Predictive Components (DAPC) -- a self-supervised representation learning method for sequence data, based on the intuition that useful representations of sequence data should exhibit a simple structure in the latent space. We encourage this latent structure by maximizing an estimate of \emph{predictive information} of latent feature sequences, which is the mutual information between the past and future windows at each time step. In contrast to the mutual information lower bound commonly used by contrastive learning, the estimate of predictive information we adopt is exact under a Gaussian assumption. Additionally, it can be computed without negative sampling. To reduce the degeneracy of the latent space extracted by powerful encoders and keep useful information from the inputs, we regularize predictive information learning with a challenging masked reconstruction loss. 
We demonstrate that our method recovers the latent space of noisy dynamical systems, extracts predictive features for forecasting tasks, and improves automatic speech recognition when used to pretrain the encoder on large amounts of unlabeled data.",/pdf/1d9efef224111e20fa66c34f1102165be8afc889.pdf,ICLR,2021, +QubpWYfdNry,qOjMG1re4vx,1601310000000.0,1615240000000.0,1735,Domain-Robust Visual Imitation Learning with Mutual Information Constraints,"[""~Edoardo_Cetin1"", ""~Oya_Celiktutan2""]","[""Edoardo Cetin"", ""Oya Celiktutan""]","[""Imitation Learning"", ""Reinforcement Learning"", ""Observational Imitation"", ""Third-Person Imitation"", ""Mutual Information"", ""Domain Adaption"", ""Machine Learning""]","Human beings are able to understand objectives and learn by simply observing others perform a task. Imitation learning methods aim to replicate such capabilities, however, they generally depend on access to a full set of optimal states and actions taken with the agent's actuators and from the agent's point of view. In this paper, we introduce a new algorithm - called Disentangling Generative Adversarial Imitation Learning (DisentanGAIL) - with the purpose of bypassing such constraints. Our algorithm enables autonomous agents to learn directly from high dimensional observations of an expert performing a task, by making use of adversarial learning with a latent representation inside the discriminator network. Such latent representation is regularized through mutual information constraints to incentivize learning only features that encode information about the completion levels of the task being demonstrated. This allows to obtain a shared feature space to successfully perform imitation while disregarding the differences between the expert's and the agent's domains. 
Empirically, our algorithm is able to efficiently imitate in a diverse range of control problems including balancing, manipulation and locomotive tasks, while being robust to various domain differences in terms of both environment appearance and agent embodiment.",/pdf/ffeade3d551ddc81dea27db706160b3bc6510cec.pdf,ICLR,2021,"Imitation of visual expert demonstrations robust to appearance and embodiment mismatch, working for high dimensional control problems." +HyfyN30qt7,SkeRrqp5FQ,1538090000000.0,1545360000000.0,1413,NICE: noise injection and clamping estimation for neural network quantization,"[""chaimbaskin@cs.technion.ac.il"", ""lissnatan@campus.technion.ac.il"", ""yoavchai1@mail.tau.ac.il"", ""evgeniizh@campus.technion.ac.il"", ""eli.shw@gmail.com"", ""raja@tauex.tau.ac.il"", ""avi.mendelson@tce.technion.ac.il"", ""bron@cs.technion.ac.il""]","[""Chaim Baskin"", ""Natan Liss"", ""Yoav Chai"", ""Evgenii Zheltonozhskii"", ""Eli Schwartz"", ""Raja Girayes"", ""Avi Mendelson"", ""Alexander M.Bronstein""]","[""Efficient inference"", ""Hardware-efficient model architectures"", ""Quantization""]","Convolutional Neural Networks (CNN) are very popular in many fields including computer vision, speech recognition, natural language processing, to name a few. Though deep learning leads to groundbreaking performance in these domains, the networks used are very demanding computationally and are far from real-time even on a GPU, which is not power efficient and therefore does not suit low power systems such as mobile devices. To overcome this challenge, some solutions have been proposed for quantizing the weights and activations of these networks, which accelerate the runtime significantly. Yet, this acceleration comes at the cost of a larger error. The NICE method proposed in this work trains quantized neural networks by noise injection and a learned clamping, which improve the accuracy. 
This leads to state-of-the-art results on various regression and classification tasks, e.g., ImageNet classification with architectures such as ResNet-18/34/50 with as low as 3-bit weights and 3-bit activations. We implement the proposed solution on an FPGA to demonstrate its applicability for low power real-time applications.",/pdf/bc126879f95f7966ea0d5ca2b80884b653ea28f9.pdf,ICLR,2019,"Combine noise injection, gradual quantization and activation clamping learning to achieve state-of-the-art 3,4 and 5 bit quantization" +A-Sp6CR9-AA,SHaqTowY2SJ,1601310000000.0,1614990000000.0,2630,Sandwich Batch Normalization,"[""~Xinyu_Gong1"", ""~Wuyang_Chen1"", ""~Tianlong_Chen1"", ""~Zhangyang_Wang1""]","[""Xinyu Gong"", ""Wuyang Chen"", ""Tianlong Chen"", ""Zhangyang Wang""]","[""normalization""]","We present Sandwich Batch Normalization ($\textbf{SaBN}$), a frustratingly easy improvement of Batch Normalization (BN) with only a few lines of code changes. SaBN is motivated by addressing the inherent $\textit{feature distribution heterogeneity}$ that can be identified in many tasks, which can arise from model heterogeneity (dynamic architectures, model conditioning, etc.), or data heterogeneity (multiple input domains). A SaBN factorizes the BN affine layer into one shared $\textit{sandwich affine}$ layer, cascaded by several parallel $\textit{independent affine}$ layers. Its variants include further decomposing the normalization layer into multiple parallel ones, and extending similar ideas to instance normalization. We demonstrate the prevailing effectiveness of SaBN (as well as its variants) as a $\textbf{drop-in replacement in four tasks}$: neural architecture search (NAS), image generation, adversarial training, and style transfer. 
Leveraging SaBN immediately boosts two state-of-the-art weight-sharing NAS algorithms significantly on NAS-Bench-201; achieves better Inception Score and FID on CIFAR-10 and ImageNet conditional image generation with three state-of-the-art GANs; substantially improves the robust and standard accuracy for adversarial defense; and produces superior arbitrary stylized results. We also provide visualizations and analysis to help understand why SaBN works. All our code and pre-trained models will be released upon acceptance. ",/pdf/23bfcca6f72c2498e87b7e30d841ed9b4dc93e19.pdf,ICLR,2021,"We present Sandwich Batch Normalization, a plug-and-play module which is able to boost network performance on several tasks, including neural architecture search, conditional image generation, adversarial robustness and neural style transfer." +D3TNqCspFpM,LnzOlIMB06_,1601310000000.0,1614990000000.0,3342,Identifying Treatment Effects under Unobserved Confounding by Causal Representation Learning,"[""~Pengzhou_Abel_Wu1"", ""~Kenji_Fukumizu1""]","[""Pengzhou Abel Wu"", ""Kenji Fukumizu""]","[""VAE"", ""variational autoencoder"", ""Representation Learning"", ""treatment effects"", ""causal inference"", ""Unobserved Confounding"", ""identifiability"", ""CATE"", ""ATE""]","As an important problem of causal inference, we discuss the estimation of treatment effects under the existence of unobserved confounding. By representing the confounder as a latent variable, we propose Counterfactual VAE, a new variant of variational autoencoder, based on recent advances in identifiability of representation learning. Combining the identifiability and classical identification results of causal inference, under mild assumptions on the generative model and with small noise on the outcome, we theoretically show that the confounder is identifiable up to an affine transformation and then the treatment effects can be identified. 
Experiments on synthetic and semi-synthetic datasets demonstrate that our method matches the state-of-the-art, even under settings violating our formal assumptions.",/pdf/8de0dee37d97aabda1ce286dc28d0073b0c05209.pdf,ICLR,2021,"A new VAE architecture is proposed for estimating causal effects under unobserved confounding, with theoretical analysis and state-of-the-art performance." +gp5Uzbl-9C-,IozBPho4N0o,1601310000000.0,1614990000000.0,2766,Systematic Evaluation of Causal Discovery in Visual Model Based Reinforcement Learning,"[""~Nan_Rosemary_Ke1"", ""~Aniket_Rajiv_Didolkar1"", ""~Sarthak_Mittal1"", ""~Anirudh_Goyal1"", ""~Guillaume_Lajoie1"", ""~Stefan_Bauer1"", ""~Danilo_Jimenez_Rezende2"", ""~Michael_Curtis_Mozer1"", ""~Yoshua_Bengio1"", ""~Christopher_Pal1""]","[""Nan Rosemary Ke"", ""Aniket Rajiv Didolkar"", ""Sarthak Mittal"", ""Anirudh Goyal"", ""Guillaume Lajoie"", ""Stefan Bauer"", ""Danilo Jimenez Rezende"", ""Michael Curtis Mozer"", ""Yoshua Bengio"", ""Christopher Pal""]","[""causal induction"", ""model based RL""]","Inducing causal relationships from observations is a classic problem in machine learning. Most work in causality starts from the premise that the causal variables themselves have known semantics or are observed. However, for AI agents such as robots trying to make sense of their environment, the only observables are low-level variables like pixels in images. To generalize well, an agent must induce high-level variables, particularly those which are causal or are affected by causal variables. A central goal for AI and causality is thus the joint discovery of abstract representations and causal structure. In this work, we systematically evaluate the agent's ability to learn underlying causal structure. We note that existing environments for studying causal induction are poorly suited for this objective because they have complicated task-specific causal graphs with many confounding factors. 
Hence, to facilitate research in learning the representation of high-level variables as well as causal structure among these variables, we present a suite of RL environments created to systematically probe the ability of methods to identify variables as well as causal structure among those variables. We evaluate various representation learning algorithms from the literature and find that explicitly incorporating structure and modularity in the model can help causal induction in model-based reinforcement learning.",/pdf/fd60f3b99ed8b26cd60f5f884fe2e6eb7e3ec327.pdf,ICLR,2021,We systematically evaluate aspects of causal induction for visual model based RL. +HygTE309t7,Hyx_e8nqF7,1538090000000.0,1545360000000.0,1492,Outlier Detection from Image Data,"[""lcao@csail.mit.edu"", ""yyan2@wpi.edu"", ""madden@csail.mit.edu"", ""rundenst@cs.wpi.edu""]","[""Lei Cao"", ""Yizhou Yan"", ""Samuel Madden"", ""Elke Rundensteiner""]","[""Image outlier"", ""CNN"", ""Deep Neural Forest""]","Modern applications from Autonomous Vehicles to Video Surveillance generate massive amounts of image data. In this work we propose a novel image outlier detection approach (IOD for short) that leverages a cutting-edge image classifier to discover outliers without using any labeled outliers. We observe that although intuitively the confidence that a convolutional neural network (CNN) has that an image belongs to a particular class could serve as an outlierness measure for each image, directly applying this confidence to detect outliers does not work well. This is because a CNN often has high confidence on an outlier image that does not belong to any target class due to its generalization ability that ensures the high accuracy in classification. To solve this issue, we propose a Deep Neural Forest-based approach that harmonizes the contradictory requirements of accurately classifying images and correctly detecting the outlier images. 
Our experiments using several benchmark image datasets including MNIST, CIFAR-10, CIFAR-100, and SVHN demonstrate the effectiveness of our IOD approach for outlier detection, capturing more than 90% of outliers generated by injecting one image dataset into another, while still preserving the classification accuracy of the multi-class classification problem.",/pdf/da76e55b1d54efe81fed9451ffb20bec35611b52.pdf,ICLR,2019,"A novel approach that detects outliers from image data, while preserving the classification accuracy of image classification" +r1xYr3C5t7,HJe8KQT9Y7,1538090000000.0,1549050000000.0,1560,Neural Message Passing for Multi-Label Classification,"[""jjl5sw@virginia.edu"", ""as5cu@virginia.edu"", ""yq2h@virginia.edu""]","[""Jack Lanchantin"", ""Arshdeep Sekhon"", ""Yanjun Qi""]","[""Multi-label Classification"", ""Graph Neural Networks"", ""Attention"", ""Graph Attention""]","Multi-label classification (MLC) is the task of assigning a set of target labels for a given sample. Modeling the combinatorial label interactions in MLC has been a long-haul challenge. Recurrent neural network (RNN) based encoder-decoder models have shown state-of-the-art performance for solving MLC. However, the sequential nature of modeling label dependencies through an RNN limits its ability in parallel computation, predicting dense labels, and providing interpretable results. In this paper, we propose Message Passing Encoder-Decoder (MPED) Networks, aiming to provide fast, accurate, and interpretable MLC. MPED networks model the joint prediction of labels by replacing all RNNs in the encoder-decoder architecture with message passing mechanisms and dispense with autoregressive inference entirely. The proposed models are simple, fast, accurate, interpretable, and structure-agnostic (can be used on known or unknown structured data). 
Experiments on seven real-world MLC datasets show the proposed models outperform autoregressive RNN models across five different metrics with a significant speedup during training and testing time.",/pdf/ca373e85ce0fb1c02de9776fb168bf23fe19328d.pdf,ICLR,2019,We propose Message Passing Encoder-Decoder networks for a fast and accurate way of modelling label dependencies for multi-label classification. +SkGQujR5FX,S1lJ_ZAYYQ,1538090000000.0,1545360000000.0,342,DANA: Scalable Out-of-the-box Distributed ASGD Without Retuning,"[""idohakimi@gmail.com"", ""saarbarkai@gmail.com"", ""mgabel@cs.toronto.edu"", ""assaf@cs.technion.ac.il""]","[""Ido Hakimi"", ""Saar Barkai"", ""Moshe Gabel"", ""Assaf Schuster""]","[""distributed"", ""asynchronous"", ""gradient staleness"", ""nesterov"", ""optimization"", ""out-of-the-box"", ""stochastic gradient descent"", ""sgd"", ""imagenet"", ""distributed training"", ""neural networks"", ""deep learning""]","Distributed computing can significantly reduce the training time of neural networks. Despite its potential, however, distributed training has not been widely adopted: scaling the training process is difficult, and existing SGD methods require substantial tuning of hyperparameters and learning schedules to achieve sufficient accuracy when increasing the number of workers. In practice, such tuning can be prohibitively expensive given the huge number of potential hyperparameter configurations and the effort required to test each one. + +We propose DANA, a novel approach that scales out-of-the-box to large clusters using the same hyperparameters and learning schedule optimized for training on a single worker, while maintaining similar final accuracy without additional overhead. DANA estimates the future value of model parameters by adapting Nesterov Accelerated Gradient to a distributed setting, and so mitigates the effect of gradient staleness, one of the main difficulties in scaling SGD to more workers. 
+ +Evaluation on three state-of-the-art network architectures and three datasets shows that DANA scales as well as or better than existing work without having to tune any hyperparameters or tweak the learning schedule. For example, DANA achieves 75.73% accuracy on ImageNet when training ResNet-50 with 16 workers, similar to the non-distributed baseline.",/pdf/7117ced4c7b4ad9bc014e8fded491e5cbcc62f8e.pdf,ICLR,2019,A new distributed asynchronous SGD algorithm that achieves state-of-the-art accuracy on existing architectures without any additional tuning or overhead. +H1gDNyrKDS,rklfZTn_vr,1569440000000.0,1583910000000.0,1657,Understanding and Robustifying Differentiable Architecture Search,"[""zelaa@cs.uni-freiburg.de"", ""thomas.elsken@de.bosch.com"", ""saikiat@cs.uni-freiburg.de"", ""marrakch@cs.uni-freiburg.de"", ""brox@cs.uni-freiburg.de"", ""fh@cs.uni-freiburg.de""]","[""Arber Zela"", ""Thomas Elsken"", ""Tonmoy Saikia"", ""Yassine Marrakchi"", ""Thomas Brox"", ""Frank Hutter""]","[""Neural Architecture Search"", ""AutoML"", ""AutoDL"", ""Deep Learning"", ""Computer Vision""]","Differentiable Architecture Search (DARTS) has attracted a lot of attention due to its simplicity and small search costs achieved by a continuous relaxation and an approximation of the resulting bi-level optimization problem. However, DARTS does not work robustly for new problems: we identify a wide range of search spaces for which DARTS yields degenerate architectures with very poor test performance. We study this failure mode and show that, while DARTS successfully minimizes validation loss, the found solutions generalize poorly when they coincide with high validation loss curvature in the architecture space. We show that by adding one of various types of regularization we can robustify DARTS to find solutions with less curvature and better generalization properties. 
Based on these observations, we propose several simple variations of DARTS that perform substantially more robustly in practice. Our observations are robust across five search spaces on three image classification tasks and also hold for the very different domains of disparity estimation (a dense regression task) and language modelling.",/pdf/33427b6cc256082512aff09e18b11986a4412a43.pdf,ICLR,2020,We study the failure modes of DARTS (Differentiable Architecture Search) by looking at the eigenvalues of the Hessian of validation loss w.r.t. the architecture and propose robustifications based on our analysis. +0XXpJ4OtjW,UHbu5Hu-Oux,1601310000000.0,1615840000000.0,2502,Evolving Reinforcement Learning Algorithms,"[""~John_D_Co-Reyes1"", ""~Yingjie_Miao1"", ""~Daiyi_Peng1"", ""ereal@google.com"", ""~Quoc_V_Le1"", ""~Sergey_Levine1"", ""~Honglak_Lee2"", ""~Aleksandra_Faust1""]","[""John D Co-Reyes"", ""Yingjie Miao"", ""Daiyi Peng"", ""Esteban Real"", ""Quoc V Le"", ""Sergey Levine"", ""Honglak Lee"", ""Aleksandra Faust""]","[""reinforcement learning"", ""evolutionary algorithms"", ""meta-learning"", ""genetic programming""]","We propose a method for meta-learning reinforcement learning algorithms by searching over the space of computational graphs which compute the loss function for a value-based model-free RL agent to optimize. The learned algorithms are domain-agnostic and can generalize to new environments not seen during training. Our method can both learn from scratch and bootstrap off known existing algorithms, like DQN, enabling interpretable modifications which improve performance. Learning from scratch on simple classical control and gridworld tasks, our method rediscovers the temporal-difference (TD) algorithm. Bootstrapped from DQN, we highlight two learned algorithms which obtain good generalization performance over other classical control tasks, gridworld type tasks, and Atari games. 
The analysis of the learned algorithm behavior shows resemblance to recently proposed RL algorithms that address overestimation in value-based methods.",/pdf/78e8fae1b2cfbbae3e7010ca2f27649cb057ae84.pdf,ICLR,2021,We meta-learn RL algorithms by evolving computational graphs which compute the loss function for a value-based model-free RL agent to optimize. +wMIdpzTmnct,8EFXP8GbxHx,1601310000000.0,1614990000000.0,2756,Hard-label Manifolds: Unexpected advantages of query efficiency for finding on-manifold adversarial examples,"[""~Washington_Garcia1"", ""~Pin-Yu_Chen1"", ""~Somesh_Jha1"", ""hamilton.clouse.1@us.af.mil"", ""~Kevin_Butler1""]","[""Washington Garcia"", ""Pin-Yu Chen"", ""Somesh Jha"", ""Hamilton Scott Clouse"", ""Kevin Butler""]","[""hard-label attacks"", ""adversarial machine learning"", ""generalization""]","Designing deep networks robust to adversarial examples remains an open problem. Likewise, recent zeroth order hard-label attacks on image classification tasks have shown comparable performance to their first-order alternatives. It is well known that in this setting, the adversary must search for the nearest decision boundary in a query-efficient manner. State-of-the-art (SotA) attacks rely on the concept of pixel grouping, or super-pixels, to perform efficient boundary search. It was recently shown in the first-order setting, that regular adversarial examples leave the data manifold, and on-manifold examples are generalization errors. In this paper, we argue that query efficiency in the zeroth-order setting is connected to the adversary's traversal through the data manifold. In particular, query-efficient hard-label attacks have the unexpected advantage of finding adversarial examples close to the data manifold. We empirically demonstrate that against both natural and robustly trained models, an efficient zeroth-order attack produces samples with a progressively smaller manifold distance measure. 
Further, when a normal zeroth-order attack is made query-efficient through the use of pixel grouping, it can make up to a two-fold increase in query efficiency, and in some cases, reduce a sample's distance to the manifold by an order of magnitude.",/pdf/47e0e80c67932d0108ac9d67b83b7e04b9ae3e71.pdf,ICLR,2021, +ryxPRpEtvH,SylUXc-_vB,1569440000000.0,1577170000000.0,858,Omnibus Dropout for Improving The Probabilistic Classification Outputs of ConvNets,"[""zz452@cornell.edu"", ""adalca@mit.edu"", ""msabuncu@cornell.edu""]","[""Zhilu Zhang"", ""Adrian V. Dalca"", ""Mert R. Sabuncu""]","[""Uncertainty Estimation"", ""Calibration"", ""Deep Learning""]","While neural network models achieve impressive classification accuracy across different tasks, they can suffer from poor calibration of their probabilistic predictions. A Bayesian perspective has recently suggested that dropout, a regularization strategy popularly used during training, can be employed to obtain better probabilistic predictions at test time (Gal & Ghahramani, 2016a). However, empirical results so far have not been encouraging, particularly with convolutional networks. In this paper, through the lens of ensemble learning, we associate this unsatisfactory performance with the correlation between the models sampled with dropout. Motivated by this, we explore the use of various structured dropout techniques to promote model diversity and improve the quality of probabilistic predictions. We also propose an omnibus dropout strategy that combines various structured dropout methods. Using the SVHN, CIFAR-10 and CIFAR-100 datasets, we empirically demonstrate the superior performance of omnibus dropout relative to several widely used strong baselines in addition to regular dropout. Lastly, we show the merit of omnibus dropout in a Bayesian active learning application. 
",/pdf/11d4d33cbcafdbe85c3348375475f71d62e69311.pdf,ICLR,2020,We propose to combine structured dropout methods at different scales for improved model diversity and performance of dropout uncertainty estimates. +HkxQRTNYPH,S1xykub_Pr,1569440000000.0,1635130000000.0,850,Mirror-Generative Neural Machine Translation,"[""zhengzx.142857@gmail.com"", ""zhouhao.nlp@bytedance.com"", ""huangsj@nju.edu.cn"", ""lilei.02@bytedance.com"", ""daixinyu@nju.edu.cn"", ""chenjj@nju.edu.cn""]","[""Zaixiang Zheng"", ""Hao Zhou"", ""Shujian Huang"", ""Lei Li"", ""Xin-Yu Dai"", ""Jiajun Chen""]","[""neural machine translation"", ""generative model"", ""mirror""]","Training neural machine translation models (NMT) requires a large amount of parallel corpus, which is scarce for many language pairs. However, raw non-parallel corpora are often easy to obtain. Existing approaches have not exploited the full potential of non-parallel bilingual data either in training or decoding. In this paper, we propose the mirror-generative NMT (MGNMT), a single unified architecture that simultaneously integrates the source to target translation model, the target to source translation model, and two language models. Both translation models and language models share the same latent semantic space, therefore both translation directions can learn from non-parallel data more effectively. Besides, the translation models and language models can collaborate together during decoding. Our experiments show that the proposed MGNMT consistently outperforms existing approaches in a variety of scenarios and language pairs, including resource-rich and low-resource situations. 
",/pdf/9e9fc8be2e0ee3a8c8fb7c1b7ea52ed40566621d.pdf,ICLR,2020, +yT7-k6Q6gda,4EATFHwaHZo,1601310000000.0,1614990000000.0,2346,Catastrophic Fisher Explosion: Early Phase Fisher Matrix Impacts Generalization,"[""~Stanislaw_Kamil_Jastrzebski1"", ""~Devansh_Arpit2"", ""~Oliver_\u00c5strand1"", ""~Giancarlo_Kerg1"", ""~Huan_Wang1"", ""~Caiming_Xiong1"", ""~richard_socher1"", ""~Kyunghyun_Cho1"", ""~Krzysztof_J._Geras1""]","[""Stanislaw Kamil Jastrzebski"", ""Devansh Arpit"", ""Oliver \u00c5strand"", ""Giancarlo Kerg"", ""Huan Wang"", ""Caiming Xiong"", ""richard socher"", ""Kyunghyun Cho"", ""Krzysztof J. Geras""]","[""early phase of training"", ""implicit regularization"", ""SGD"", ""learning rate"", ""batch size"", ""Hessian"", ""Fisher Information Matrix"", ""curvature"", ""gradient norm""]","The early phase of training has been shown to be important in two ways for deep neural networks. First, the degree of regularization in this phase significantly impacts the final generalization. Second, it is accompanied by a rapid change in the local loss curvature influenced by regularization choices. Connecting these two findings, we show that stochastic gradient descent (SGD) implicitly penalizes the trace of the Fisher Information Matrix (FIM) from the beginning of training. We argue it is an implicit regularizer in SGD by showing that explicitly penalizing the trace of the FIM can significantly improve generalization. We further show that the early value of the trace of the FIM correlates strongly with the final generalization. We highlight that in the absence of implicit or explicit regularization, the trace of the FIM can increase to a large value early in training, to which we refer as catastrophic Fisher explosion. 
Finally, to gain insight into the regularization effect of penalizing the trace of the FIM, we show that 1) it limits memorization by reducing the learning speed of examples with noisy labels more than that of the clean examples, and 2) trajectories with a low initial trace of the FIM end in flat minima, which are commonly associated with good generalization.",/pdf/e62fbda6fe74537da024b315a053d524c1cd8158.pdf,ICLR,2021,Explicit regularization of the trace of the Fisher Information Matrix models implicit regularization in stochastic gradient descent. +r1Kr3TyAb,SkOr361AZ,1509050000000.0,1518730000000.0,182,ANALYSIS ON GRADIENT PROPAGATION IN BATCH NORMALIZED RESIDUAL NETWORKS,"[""abhishekpanigrahi@iitkgp.ac.in"", ""yueruche@usc.edu"", ""cckuo@sipi.usc.edu""]","[""Abhishek Panigrahi"", ""Yueru Chen"", ""C.-C. Jay Kuo""]","[""Batch normalization"", ""gradient backpropagation"", ""Residual network"", ""wide residual network""]","We conduct a mathematical analysis on the Batch normalization (BN) effect on gradient backpropagation in residual network training in this work, which is believed to play a critical role in addressing the gradient vanishing/explosion problem. Specifically, by analyzing the mean and variance behavior of the input and the gradient in the forward and backward passes through the BN and residual branches, respectively, we show that they work together to confine the gradient variance to a certain range across residual blocks in backpropagation. As a result, the gradient vanishing/explosion problem is avoided. Furthermore, we use the same analysis to discuss the tradeoff between depth and width of a residual network and demonstrate that shallower yet wider resnets have stronger learning performance than deeper yet thinner resnets.",/pdf/30f7bde86dde248842de91f37697ac30cd164e95.pdf,ICLR,2018,"Batch normalisation maintains gradient variance throughout training, thus stabilizing optimization." 
+SJxZnR4YvB,rygTHUYOPr,1569440000000.0,1583910000000.0,1345,Distributed Bandit Learning: Near-Optimal Regret with Efficient Communication,"[""yuanhao-16@mails.tsinghua.edu.cn"", ""nickh@pku.edu.cn"", ""cxy30@pku.edu.cn"", ""wanglw@cis.pku.edu.cn""]","[""Yuanhao Wang"", ""Jiachen Hu"", ""Xiaoyu Chen"", ""Liwei Wang""]","[""Theory"", ""Bandit Algorithms"", ""Communication Efficiency""]","We study the problem of regret minimization for distributed bandits learning, in which $M$ agents work collaboratively to minimize their total regret under the coordination of a central server. Our goal is to design communication protocols with near-optimal regret and little communication cost, which is measured by the total amount of transmitted data. For distributed multi-armed bandits, we propose a protocol with near-optimal regret and only $O(M\log(MK))$ communication cost, where $K$ is the number of arms. The communication cost is independent of the time horizon $T$, has only logarithmic dependence on the number of arms, and matches the lower bound except for a logarithmic factor. 
For distributed $d$-dimensional linear bandits, we propose a protocol that achieves near-optimal regret and has communication cost of order $O\left(\left(Md+d\log \log d\right)\log T\right)$, which has only logarithmic dependence on $T$.",/pdf/d768f63a63c71f1196c7304231b92983f9c040c0.pdf,ICLR,2020, +1flmvXGGJaa,aRQRJH6aV-,1601310000000.0,1614990000000.0,3128,NAS-Bench-301 and the Case for Surrogate Benchmarks for Neural Architecture Search,"[""~Julien_Niklas_Siems1"", ""~Lucas_Zimmer1"", ""~Arber_Zela1"", ""~Jovita_Lukasik1"", ""~Margret_Keuper1"", ""~Frank_Hutter1""]","[""Julien Niklas Siems"", ""Lucas Zimmer"", ""Arber Zela"", ""Jovita Lukasik"", ""Margret Keuper"", ""Frank Hutter""]","[""Neural Architecture Search"", ""Benchmarking"", ""Performance Prediction"", ""Deep Learning""]","The most significant barrier to the advancement of Neural Architecture Search (NAS) is its demand for large computational resources, which hinders scientifically sound empirical evaluations. As a remedy, several tabular NAS benchmarks were proposed to simulate runs of NAS methods in seconds. However, all existing tabular NAS benchmarks are limited to extremely small architectural spaces since they rely on exhaustive evaluations of the space. This leads to unrealistic results that do not transfer to larger search spaces. To overcome this fundamental limitation, we propose NAS-Bench-301, the first surrogate NAS benchmark, using a search space containing $10^{18}$ architectures, many orders of magnitude larger than any previous tabular NAS benchmark. After motivating the benefits of a surrogate benchmark over a tabular one, we fit various regression models on our dataset, which consists of $\sim$60k architecture evaluations, and build surrogates via deep ensembles to model uncertainty. We benchmark a wide range of NAS algorithms using NAS-Bench-301 and obtain comparable results to the true benchmark at a fraction of the real cost. 
Finally, we show how NAS-Bench-301 can be used to generate new scientific insights.",/pdf/91c5f8e77cd11407ec3e2a6f3e1ec4916197b93b.pdf,ICLR,2021, +HyxQzBceg,,1478280000000.0,1487790000000.0,241,Deep Variational Information Bottleneck,"[""alemi@google.com"", ""iansf@google.com"", ""jvdillon@google.com"", ""kpmurphy@google.com""]","[""Alexander A. Alemi"", ""Ian Fischer"", ""Joshua V. Dillon"", ""Kevin Murphy""]","[""Theory"", ""Computer vision"", ""Deep learning"", ""Supervised Learning""]","We present a variational approximation to the information bottleneck of Tishby et al. (1999). This variational approach allows us to parameterize the information bottleneck model using a neural network and leverage the reparameterization trick for efficient training. We call this method “Deep Variational Information Bottleneck”, or Deep VIB. We show that models trained with the VIB objective outperform those that are trained with other forms of regularization, in terms of generalization performance and robustness to adversarial attack.",/pdf/8c41248f88f7436402feb8fd572711569713d314.pdf,ICLR,2017,Applying the information bottleneck to deep networks using the variational lower bound and reparameterization trick. +B1lx42A9Ym,SylDeYN9FQ,1538090000000.0,1545360000000.0,1416,Neural Rendering Model: Joint Generation and Prediction for Semi-Supervised Learning,"[""minhnhat@berkeley.edu"", ""mn15@rice.edu"", ""ankit.patel@bcm.edu"", ""anima@caltech.edu"", ""jordan@cs.berkeley.edu"", ""richb@rice.edu""]","[""Nhat Ho"", ""Tan Nguyen"", ""Ankit B. Patel"", ""Anima Anandkumar"", ""Michael I. Jordan"", ""Richard G. Baraniuk""]","[""neural nets"", ""generative models"", ""semi-supervised learning"", ""cross-entropy""]","Unsupervised and semi-supervised learning are important problems that are especially challenging with complex data like natural images. 
Progress on these problems would accelerate if we had access to appropriate generative models under which to pose the associated inference tasks. Inspired by the success of Convolutional Neural Networks (CNNs) for supervised prediction in images, we design the Neural Rendering Model (NRM), a new hierarchical probabilistic generative model whose inference calculations correspond to those in a CNN. The NRM introduces a small set of latent variables at each level of the model and enforces dependencies among all the latent variables via a conjugate prior distribution. The conjugate prior yields a new regularizer for learning based on the paths rendered in the generative model for training CNNs–the Rendering Path Normalization (RPN). We demonstrate that this regularizer improves generalization both in theory and in practice. Likelihood estimation in the NRM yields the new Max-Min cross entropy training loss, which suggests a new deep network architecture–the Max-Min network–which exceeds or matches the state-of-the-art for semi-supervised and supervised learning on SVHN, CIFAR10, and CIFAR100.",/pdf/623e3479006502db960cb45dfe0f3754c04687c6.pdf,ICLR,2019,We develop a new deep generative model for semi-supervised learning and propose a new Max-Min cross-entropy for training CNNs. +Bylnx209YX,BJxDjehqKm,1538090000000.0,1550830000000.0,1112,Adversarial Attacks on Graph Neural Networks via Meta Learning,"[""zuegnerd@in.tum.de"", ""guennemann@in.tum.de""]","[""Daniel Z\u00fcgner"", ""Stephan G\u00fcnnemann""]","[""graph mining"", ""adversarial attacks"", ""meta learning"", ""graph neural networks"", ""node classification""]","Deep learning models for graphs have advanced the state of the art on many tasks. Despite their recent success, little is known about their robustness. We investigate training time attacks on graph neural networks for node classification that perturb the discrete graph structure. 
Our core principle is to use meta-gradients to solve the bilevel problem underlying training-time attacks, essentially treating the graph as a hyperparameter to optimize. Our experiments show that small graph perturbations consistently lead to a strong decrease in performance for graph convolutional networks, and even transfer to unsupervised embeddings. Remarkably, the perturbations created by our algorithm can misguide the graph neural networks such that they perform worse than a simple baseline that ignores all relational information. Our attacks do not assume any knowledge about or access to the target classifiers.",/pdf/9c456c2747f67ea78b03bd154b4ce729e1410d56.pdf,ICLR,2019,We use meta-gradients to attack the training procedure of deep neural networks for graphs. +Syf9Q209YQ,rJe70FnqY7,1538090000000.0,1545360000000.0,1385,Manifold regularization with GANs for semi-supervised learning,"[""bruno_lecouat@i2r.a-star.edu.sg"", ""foo_chuan_sheng@i2r.a-star.edu.sg"", ""houssam.zenati@student.ecp.fr"", ""vijay@i2r.a-star.edu.sg""]","[""Bruno Lecouat"", ""Chuan-Sheng Foo"", ""Houssam Zenati"", ""Vijay Chandrasekhar""]","[""semi-supervised learning"", ""generative adversarial networks"", ""manifold regularization""]","Generative Adversarial Networks are powerful generative models that can model the manifold of natural images. We leverage this property to perform manifold regularization by approximating a variant of the Laplacian norm using a Monte Carlo approximation that is easily computed with the GAN. When incorporated into the semi-supervised feature-matching GAN we achieve state-of-the-art results for semi-supervised learning on CIFAR-10 benchmarks when few labels are used, with a method that is significantly easier to implement than competing methods. 
We find that manifold regularization improves the quality of generated images, and is affected by the quality of the GAN used to approximate the regularizer.",/pdf/5068c6322d3de0b86bb0ca42e2769c3c11d2dacd.pdf,ICLR,2019, +w2mYg3d0eot,uDOvPU2ktrt,1601310000000.0,1615930000000.0,1337,Fast convergence of stochastic subgradient method under interpolation,"[""~Huang_Fang1"", ""zhenanf@cs.ubc.ca"", ""mpf@cs.ubc.ca""]","[""Huang Fang"", ""Zhenan Fan"", ""Michael Friedlander""]","[""Optimization"", ""stochastic subgradient method"", ""interpolation"", ""convergence analysis""]","This paper studies the behaviour of the stochastic subgradient descent (SSGD) method applied to over-parameterized nonsmooth optimization problems that satisfy an interpolation condition. By leveraging the composite structure of the empirical risk minimization problems, we prove that SSGD converges, respectively, with rates $O(1/\epsilon)$ and $O(\log(1/\epsilon))$ for convex and strongly-convex objectives when interpolation holds. These rates coincide with established rates for the stochastic gradient descent (SGD) method applied to smooth problems that also satisfy an interpolation condition. Our analysis provides a partial explanation for the empirical observation that sometimes SGD and SSGD behave similarly for training smooth and nonsmooth machine learning models. 
We also prove that the rate $O(1/\epsilon)$ is optimal for the subgradient method in the convex and interpolation setting.",/pdf/a65651e46213dfe6307698d82f362d04acd34756.pdf,ICLR,2021, +qVyeW-grC2k,a-NkHIJOFkZ,1601310000000.0,1615920000000.0,1616,Long Range Arena : A Benchmark for Efficient Transformers ,"[""~Yi_Tay1"", ""~Mostafa_Dehghani1"", ""~Samira_Abnar1"", ""~Yikang_Shen1"", ""~Dara_Bahri1"", ""~Philip_Pham1"", ""~Jinfeng_Rao2"", ""yangliuy@google.com"", ""~Sebastian_Ruder2"", ""metzler@google.com""]","[""Yi Tay"", ""Mostafa Dehghani"", ""Samira Abnar"", ""Yikang Shen"", ""Dara Bahri"", ""Philip Pham"", ""Jinfeng Rao"", ""Liu Yang"", ""Sebastian Ruder"", ""Donald Metzler""]","[""Transformers"", ""Attention"", ""Deep Learning""]","Transformers do not scale very well to long sequence lengths largely because of quadratic self-attention complexity. In the recent months, a wide spectrum of efficient, fast Transformers have been proposed to tackle this problem, more often than not claiming superior or comparable model quality to vanilla Transformer models. To this date, there is no well-established consensus on how to evaluate this class of models. Moreover, inconsistent benchmarking on a wide spectrum of tasks and datasets makes it difficult to assess relative model quality amongst many models. This paper proposes a systematic and unified benchmark, Long Range Arena, specifically focused on evaluating model quality under long-context scenarios. Our benchmark is a suite of tasks consisting of sequences ranging from $1K$ to $16K$ tokens, encompassing a wide range of data types and modalities such as text, natural, synthetic images, and mathematical expressions requiring similarity, structural, and visual-spatial reasoning. 
We systematically evaluate ten well-established long-range Transformer models (Reformers, Linformers, Linear Transformers, Sinkhorn Transformers, Performers, Synthesizers, Sparse Transformers, and Longformers) on our newly proposed benchmark suite. Long Range Arena paves the way towards better understanding this class of efficient Transformer models, facilitates more research in this direction, and presents new challenging tasks to tackle.",/pdf/c7ddcda9fb422b91032d80ebd1564c35dd6f9fa8.pdf,ICLR,2021,Better benchmarking for Xformers +GY6-6sTvGaf,RRxCVw-vtQ5,1601310000000.0,1615140000000.0,1429,Image Augmentation Is All You Need: Regularizing Deep Reinforcement Learning from Pixels,"[""~Denis_Yarats1"", ""~Ilya_Kostrikov1"", ""~Rob_Fergus1""]","[""Denis Yarats"", ""Ilya Kostrikov"", ""Rob Fergus""]",[],"We propose a simple data augmentation technique that can be applied to standard model-free reinforcement learning algorithms, enabling robust learning directly from pixels without the need for auxiliary losses or pre-training. The approach leverages input perturbations commonly used in computer vision tasks to transform input examples, as well as regularizing the value function and policy. Existing model-free approaches, such as Soft Actor-Critic (SAC), are not able to train deep networks effectively from image pixels. However, the addition of our augmentation method dramatically improves SAC’s performance, enabling it to reach state-of-the-art performance on the DeepMind control suite, surpassing model-based (Hafner et al., 2019; Lee et al., 2019; Hafner et al., 2018) methods and recently proposed contrastive learning (Srinivas et al., 2020). Our approach, which we dub DrQ: Data-regularized Q, can be combined with any model-free reinforcement learning algorithm. 
We further demonstrate this by applying it to DQN and significantly improve its data-efficiency on the Atari 100k benchmark.",/pdf/b8b967965ff52b2eb545d1a7d4284f59f0fc181f.pdf,ICLR,2021,The first successful demonstration that image augmentation can be applied to image-based Deep RL to achieve SOTA performance. +SJlVn6NKPB,HJx5PJxuwB,1569440000000.0,1577170000000.0,778,Representation Learning for Remote Sensing: An Unsupervised Sensor Fusion Approach,"[""aidanswope@gmail.com"", ""xander@descarteslabs.com"", ""kyle@descarteslabs.com""]","[""Aidan M. Swope"", ""Xander H. Rudelis"", ""Kyle T. Story""]","[""unsupervised learning"", ""representation learning"", ""deep learning"", ""remote sensing"", ""sensor fusion""]","In the application of machine learning to remote sensing, labeled data is often scarce or expensive, which impedes the training of powerful models like deep convolutional neural networks. Although unlabeled data is abundant, recent self-supervised learning approaches are ill-suited to the remote sensing domain. In addition, most remote sensing applications currently use only a small subset of the multi-sensor, multi-channel information available, motivating the need for fused multi-sensor representations. We propose a new self-supervised training objective, Contrastive Sensor Fusion, which exploits coterminous data from multiple sources to learn useful representations of every possible combination of those sources. This method uses information common across multiple sensors and bands by training a single model to produce a representation that remains similar when any subset of its input channels is used. Using a dataset of 47 million unlabeled coterminous image triplets, we train an encoder to produce semantically meaningful representations from any possible combination of channels from the input sensors. 
These representations outperform fully supervised ImageNet weights on a remote sensing classification task and improve as more sensors are fused.",/pdf/0f1c988cd122ae0c3c762f0ed8b9bc7e5cb58404.pdf,ICLR,2020,Multiple sensor views imply a self-supervised task for learning what things are in aerial imagery without labels +tlV90jvZbw,MtljYHkBdbH,1601310000000.0,1615790000000.0,2406,Early Stopping in Deep Networks: Double Descent and How to Eliminate it,"[""~Reinhard_Heckel1"", ""~Fatih_Furkan_Yilmaz1""]","[""Reinhard Heckel"", ""Fatih Furkan Yilmaz""]","[""early stopping"", ""double descent""]","Over-parameterized models, such as large deep networks, often exhibit a double descent phenomenon, where, as a function of model size, error first decreases, increases, and decreases at last. This intriguing double descent behavior also occurs as a function of training epochs and has been conjectured to arise because training epochs control the model complexity. In this paper, we show that such epoch-wise double descent occurs for a different reason: It is caused by a superposition of two or more bias-variance tradeoffs that arise because different parts of the network are learned at different epochs, and mitigating this by proper scaling of stepsizes can significantly improve the early stopping performance. We show this analytically for i) linear regression, where differently scaled features give rise to a superposition of bias-variance tradeoffs, and for ii) a wide two-layer neural network, where the first and second layers govern bias-variance tradeoffs. 
Inspired by this theory, we study two standard convolutional networks empirically and show that eliminating epoch-wise double descent through adjusting stepsizes of different layers improves the early stopping performance.",/pdf/eaf02d8eb8ad9232e0b10b405cf104b4547de602.pdf,ICLR,2021,Epoch wise double descent can be explained as a superposition of two or more bias-variance tradeoffs that arise because different parts of the network are learned at different epochs. +IG3jEGLN0jd,Ag286Ov8Lv3,1601310000000.0,1614990000000.0,453,Contrastive estimation reveals topic posterior information to linear models,"[""~Christopher_Tosh1"", ""~Akshay_Krishnamurthy1"", ""~Daniel_Hsu1""]","[""Christopher Tosh"", ""Akshay Krishnamurthy"", ""Daniel Hsu""]","[""contrastive learning"", ""self-supervised learning"", ""representation learning"", ""theory""]","Contrastive learning is an approach to representation learning that utilizes naturally occurring similar and dissimilar pairs of data points to find useful embeddings of data. In the context of document classification under topic modeling assumptions, we prove that contrastive learning is capable of recovering a representation of documents that reveals their underlying topic posterior information to linear models. We apply this procedure in a semi-supervised setup and demonstrate empirically that linear classifiers with these representations perform well in document classification tasks with very few training examples.",/pdf/aefadef37056d027385e06b07b92238609cebc33.pdf,ICLR,2021,This paper demonstrates that contrastive learning on text data produces representations that are linearly related to underlying topic structure. 
+BJfYvo09Y7,SylwuHOcFX,1538090000000.0,1547570000000.0,283,Hierarchical Visuomotor Control of Humanoids,"[""jsmerel@google.com"", ""arahuja@google.com"", ""vuph@google.com"", ""stunya@google.com"", ""liusiqi@google.com"", ""dhruvat@google.com"", ""heess@google.com"", ""gregwayne@google.com""]","[""Josh Merel"", ""Arun Ahuja"", ""Vu Pham"", ""Saran Tunyasuvunakool"", ""Siqi Liu"", ""Dhruva Tirumala"", ""Nicolas Heess"", ""Greg Wayne""]","[""hierarchical reinforcement learning"", ""motor control"", ""motion capture""]","We aim to build complex humanoid agents that integrate perception, motor control, and memory. In this work, we partly factor this problem into low-level motor control from proprioception and high-level coordination of the low-level skills informed by vision. We develop an architecture capable of surprisingly flexible, task-directed motor control of a relatively high-DoF humanoid body by combining pre-training of low-level motor controllers with a high-level, task-focused controller that switches among low-level sub-policies. The resulting system is able to control a physically-simulated humanoid body to solve tasks that require coupling visual perception from an unstabilized egocentric RGB camera during locomotion in the environment. Supplementary video link: https://youtu.be/fBoir7PNxPk",/pdf/c64a5a1a7586ab5357231dd332bb03f3b34fc184.pdf,ICLR,2019,"Solve tasks involving vision-guided humanoid locomotion, reusing locomotion behavior from motion capture data." +Hyg0vbWC-,BJ10PWWC-,1509130000000.0,1519850000000.0,689,Generating Wikipedia by Summarizing Long Sequences,"[""peterjliu@google.com"", ""msaleh@google.com"", ""epot@google.com"", ""bgoodrich@google.com"", ""rsepassi@google.com"", ""lukaszkaiser@google.com"", ""noam@google.com""]","[""Peter J. 
Liu*"", ""Mohammad Saleh*"", ""Etienne Pot"", ""Ben Goodrich"", ""Ryan Sepassi"", ""Lukasz Kaiser"", ""Noam Shazeer""]","[""abstractive summarization"", ""Transformer"", ""long sequences"", ""natural language processing"", ""sequence transduction"", ""Wikipedia"", ""extractive summarization""]","We show that generating English Wikipedia articles can be approached as a multi- +document summarization of source documents. We use extractive summarization +to coarsely identify salient information and a neural abstractive model to generate +the article. For the abstractive model, we introduce a decoder-only architecture +that can scalably attend to very long sequences, much longer than typical encoder- +decoder architectures used in sequence transduction. We show that this model can +generate fluent, coherent multi-sentence paragraphs and even whole Wikipedia +articles. When given reference documents, we show it can extract relevant factual +information as reflected in perplexity, ROUGE scores and human evaluations.",/pdf/21c055592a04c6aab518a8f4b0c4b8f035e988ce.pdf,ICLR,2018,We generate Wikipedia articles abstractively conditioned on source document text. +ryepUj0qtX,rkxfO9NqY7,1538090000000.0,1550750000000.0,213,Conditional Network Embeddings,"[""bo.kang@ugent.be"", ""jefrey.lijffijt@ugent.be"", ""tijl.debie@ugent.be""]","[""Bo Kang"", ""Jefrey Lijffijt"", ""Tijl De Bie""]","[""Network embedding"", ""graph embedding"", ""learning node representations"", ""link prediction"", ""multi-label classification of nodes""]","Network Embeddings (NEs) map the nodes of a given network into $d$-dimensional Euclidean space $\mathbb{R}^d$. Ideally, this mapping is such that 'similar' nodes are mapped onto nearby points, such that the NE can be used for purposes such as link prediction (if 'similar' means being 'more likely to be connected') or classification (if 'similar' means 'being more likely to have the same label'). 
In recent years various methods for NE have been introduced, all following a similar strategy: defining a notion of similarity between nodes (typically some distance measure within the network), a distance measure in the embedding space, and a loss function that penalizes large distances for similar nodes and small distances for dissimilar nodes. + +A difficulty faced by existing methods is that certain networks are fundamentally hard to embed due to their structural properties: (approximate) multipartiteness, certain degree distributions, assortativity, etc. To overcome this, we introduce a conceptual innovation to the NE literature and propose to create \emph{Conditional Network Embeddings} (CNEs); embeddings that maximally add information with respect to given structural properties (e.g. node degrees, block densities, etc.). We use a simple Bayesian approach to achieve this, and propose a block stochastic gradient descent algorithm for fitting it efficiently. + +We demonstrate that CNEs are superior for link prediction and multi-label classification when compared to state-of-the-art methods, and this without adding significant mathematical or computational complexity. Finally, we illustrate the potential of CNE for network visualization.",/pdf/a9cf7d5423f5e84f9c7246e0688e456b56f49a78.pdf,ICLR,2019,"We introduce a network embedding method that accounts for prior information about the network, yielding superior empirical performance." 
+HyyP33gAZ,B1yP23lA-,1509110000000.0,1519020000000.0,403,Activation Maximization Generative Adversarial Nets,"[""heyohai@apex.sjtu.edu.cn"", ""hcai@apex.sjtu.edu.cn"", ""shu.rong@yitu-inc.com"", ""songyuxuan@apex.sjtu.edu.cn"", ""kren@apex.sjtu.edu.cn"", ""wnzhang@sjtu.edu.cn"", ""j.wang@cs.ucl.ac.uk"", ""yyu@apex.sjtu.edu.cn""]","[""Zhiming Zhou"", ""Han Cai"", ""Shu Rong"", ""Yuxuan Song"", ""Kan Ren"", ""Weinan Zhang"", ""Jun Wang"", ""Yong Yu""]","[""Generative Adversarial Nets"", ""GANs"", ""Evaluation Metrics"", ""Generative Model"", ""Deep Learning"", ""Adversarial Learning"", ""Inception Score"", ""AM Score""]","Class labels have been empirically shown useful in improving the sample quality of generative adversarial nets (GANs). In this paper, we mathematically study the properties of the current variants of GANs that make use of class label information. With class aware gradient and cross-entropy decomposition, we reveal how class labels and associated losses influence GAN's training. Based on that, we propose Activation Maximization Generative Adversarial Networks (AM-GAN) as an advanced solution. Comprehensive experiments have been conducted to validate our analysis and evaluate the effectiveness of our solution, where AM-GAN outperforms other strong baselines and achieves state-of-the-art Inception Score (8.91) on CIFAR-10. In addition, we demonstrate that, with the Inception ImageNet classifier, Inception Score mainly tracks the diversity of the generator, and there is, however, no reliable evidence that it can reflect the true sample quality. We thus propose a new metric, called AM Score, to provide more accurate estimation on the sample quality. Our proposed model also outperforms the baseline methods in the new metric.",/pdf/a1f8eb4f218d293121af80f1ad74c264219e4a53.pdf,ICLR,2018,Understand how class labels help GAN training. Propose a new evaluation metric for generative models. 
+HkgsWxrtPB,rJlm--lKwS,1569440000000.0,1583910000000.0,2149,Meta Reinforcement Learning with Autonomous Inference of Subtask Dependencies,"[""srsohn@umich.edu"", ""hjwoo@umich.edu"", ""jwook@umich.edu"", ""honglak@eecs.umich.edu""]","[""Sungryull Sohn"", ""Hyunjae Woo"", ""Jongwook Choi"", ""Honglak Lee""]","[""Meta reinforcement learning"", ""subtask graph""]","We propose and address a novel few-shot RL problem, where a task is characterized by a subtask graph which describes a set of subtasks and their dependencies that are unknown to the agent. The agent needs to quickly adapt to the task over few episodes during adaptation phase to maximize the return in the test phase. Instead of directly learning a meta-policy, we develop a Meta-learner with Subtask Graph Inference (MSGI), which infers the latent parameter of the task by interacting with the environment and maximizes the return given the latent parameter. To facilitate learning, we adopt an intrinsic reward inspired by upper confidence bound (UCB) that encourages efficient exploration. Our experiment results on two grid-world domains and StarCraft II environments show that the proposed method is able to accurately infer the latent task parameter, and to adapt more efficiently than existing meta RL and hierarchical RL methods.",/pdf/15fd85cbcd18cf2efbbee5bc401fe4b7b2be9849.pdf,ICLR,2020,A novel meta-RL method that infers latent subtask structure +B1GIB3A9YX,HJg4jC65KX,1538090000000.0,1545360000000.0,1546,Explicit Recall for Efficient Exploration,"[""dhh19951@gmail.com"", ""maojiayuan@gmail.com"", ""rogar2233cxy@gmail.com"", ""lihongli.cs@gmail.com""]","[""Honghua Dong"", ""Jiayuan Mao"", ""Xinyue Cui"", ""Lihong Li""]","[""Exploration"", ""goal-directed"", ""deep reinforcement learning"", ""explicit memory""]","In this paper, we advocate the use of explicit memory for efficient exploration in reinforcement learning. 
This memory records structured trajectories that have led to interesting states in the past, and can be used by the agent to revisit those states more effectively. In high-dimensional decision making problems, where deep reinforcement learning is considered crucial, our approach provides a simple, transparent and effective way that can be naturally combined with complex, deep learning models. We show how such explicit memory may be used to enhance existing exploration algorithms such as intrinsically motivated ones and count-based ones, and demonstrate our method's advantages in various simulated environments.",/pdf/72b577ce98ac76872d3b6104d0d19aae026f7772.pdf,ICLR,2019,We advocate the use of explicit memory for efficient exploration in reinforcement learning +BJ7d0fW0b,SJOxRzWCb,1509140000000.0,1518730000000.0,1050,Faster Reinforcement Learning with Expert State Sequences,"[""xiaoxiao.guo@ibm.com"", ""shiyu.chang@ibm.com"", ""yum@us.ibm.com""]","[""Xiaoxiao Guo"", ""Shiyu Chang"", ""Mo Yu"", ""Miao Liu"", ""Gerald Tesauro""]","[""Reinforcement Learning"", ""Imitation Learning""]","Imitation learning relies on expert demonstrations. Existing approaches often require that the complete demonstration data, including sequences of actions and states are available. In this paper, we consider a realistic and more difficult scenario where a reinforcement learning agent only has access to the state sequences of an expert, while the expert actions are not available. Inferring the unseen expert actions in a stochastic environment is challenging and usually infeasible when combined with a large state space. We propose a novel policy learning method which only utilizes the expert state sequences without inferring the unseen actions. 
Specifically, our agent first learns to extract useful sub-goal information from the state sequences of the expert and then utilizes the extracted sub-goal information to factorize the action value estimate over state-action pairs and sub-goals. The extracted sub-goals are also used to synthesize guidance rewards in the policy learning. We evaluate our agent on five Doom tasks. Our empirical results show that the proposed method significantly outperforms the conventional DQN method.",/pdf/6dc921376ff46c9bc2055b92f9b2d2581c1fbdd0.pdf,ICLR,2018, +8iW8HOidj1_,AiHY5Lkrov,1601310000000.0,1614990000000.0,493,Dream and Search to Control: Latent Space Planning for Continuous Control,"[""~Anurag_Koul1"", ""~Varun_Kumar_Vijay1"", ""~Alan_Fern1"", ""~Somdeb_Majumdar1""]","[""Anurag Koul"", ""Varun Kumar Vijay"", ""Alan Fern"", ""Somdeb Majumdar""]","[""Reinforcement Learning"", ""Model Based RL"", ""Continuous Control"", ""Search"", ""Planning"", ""MCTS""]","Learning and planning with latent space dynamics has been shown to be useful for sample efficiency in model-based reinforcement learning (MBRL) for discrete and continuous control tasks. In particular, recent work, for discrete action spaces, demonstrated the effectiveness of latent-space planning via Monte-Carlo Tree Search (MCTS) for bootstrapping MBRL during learning and at test time. However, the potential gains from latent-space tree search have not yet been demonstrated for environments with continuous action spaces. In this work, we propose and explore an MBRL approach for continuous action spaces based on tree-based planning over learned latent dynamics. We show that it is possible to demonstrate the types of bootstrapping benefits as previously shown for discrete spaces. In particular, the approach achieves improved sample efficiency and performance on a majority of challenging continuous-control benchmarks compared to the state-of-the-art. 
",/pdf/9caf8c6658ad6d05d1c8cb9ce36accd5f2c6a266.pdf,ICLR,2021,"We show that performing tree-based search on learnt, latent dynamics as a planning mechanism for continuous control outperforms Dreamer." +DigrnXQNMTe,19MvSrrfsv9,1601310000000.0,1614990000000.0,3560,A generalized probability kernel on discrete distributions and its application in two-sample test,"[""~Le_Niu1""]","[""Le Niu""]","[""maximum mean discrepancy"", ""RKHS"", ""two-sample test"", ""empirical estimator"", ""discrete distributions""]","We propose a generalized probability kernel(GPK) on discrete distributions with finite support. This probability kernel, defined as kernel between distributions instead of samples, generalizes the existing discrepancy statistics such as maximum mean discrepancy(MMD) as well as probability product kernels, and extends to more general cases. For both existing and newly proposed statistics, we estimate them through empirical frequency and illustrate the strategy to analyze the resulting bias and convergence bounds. We further propose power-MMD, a natural extension of MMD in the framework of GPK, illustrating its usage for the task of two-sample test. Our work connects the fields of discrete distribution-property estimation and kernel-based hypothesis test, which might shed light on more new possibilities.",/pdf/fc4c480a49160238509fb4ddf1ea6dc26e51abd4.pdf,ICLR,2021, +5jRVa89sZk,Jf9V9CJY0-a,1601310000000.0,1616040000000.0,823,Empirical Analysis of Unlabeled Entity Problem in Named Entity Recognition,"[""~Yangming_Li1"", ""~lemao_liu1"", ""~Shuming_Shi1""]","[""Yangming Li"", ""lemao liu"", ""Shuming Shi""]","[""Named Entity Recognition"", ""Unlabeled Entity Problem"", ""Negative Sampling""]","In many scenarios, named entity recognition (NER) models severely suffer from unlabeled entity problem, where the entities of a sentence may not be fully annotated. Through empirical studies performed on synthetic datasets, we find two causes of performance degradation. 
One is the reduction of annotated entities and the other is treating unlabeled entities as negative instances. The first cause has less impact than the second one and can be mitigated by adopting pretraining language models. The second cause seriously misguides a model in training and greatly affects its performances. Based on the above observations, we propose a general approach, which can almost eliminate the misguidance brought by unlabeled entities. The key idea is to use negative sampling that, to a large extent, avoids training NER models with unlabeled entities. Experiments on synthetic datasets and real-world datasets show that our model is robust to unlabeled entity problem and surpasses prior baselines. On well-annotated datasets, our model is competitive with the state-of-the-art method.",/pdf/3caa712b0d1c2caaf7b3578b53b9f9c46e78db74.pdf,ICLR,2021,This work studies the impacts of the unlabeled entity problem on NER models and how to effectively eliminate them with a general method. +HJtEm4p6Z,S1_EQEp6-,1508880000000.0,1519250000000.0,69,Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning,"[""pingwei01@baidu.com"", ""pengkainan@baidu.com"", ""gibianskyandrew@baidu.com"", ""sercanarik@baidu.com"", ""kannanajay@baidu.com"", ""sharan@baidu.com"", ""raiman@openai.com"", ""miller_john@berkeley.edu""]","[""Wei Ping"", ""Kainan Peng"", ""Andrew Gibiansky"", ""Sercan O. Arik"", ""Ajay Kannan"", ""Sharan Narang"", ""Jonathan Raiman"", ""John Miller""]","[""2000-Speaker Neural TTS"", ""Monotonic Attention"", ""Speech Synthesis""]","We present Deep Voice 3, a fully-convolutional attention-based neural text-to-speech (TTS) system. Deep Voice 3 matches state-of-the-art neural speech synthesis systems in naturalness while training an order of magnitude faster. We scale Deep Voice 3 to dataset sizes unprecedented for TTS, training on more than eight hundred hours of audio from over two thousand speakers. 
In addition, we identify common error modes of attention-based speech synthesis networks, demonstrate how to mitigate them, and compare several different waveform synthesis methods. We also describe how to scale inference to ten million queries per day on a single GPU server.",/pdf/c932ad2481ff3f16a5d9f27b050653f83a47eb75.pdf,ICLR,2018, +HkgHk3RctX,S1lLRz-9tX,1538090000000.0,1545360000000.0,981,Seq2Slate: Re-ranking and Slate Optimization with RNNs,"[""ibello@google.com"", ""sayali@google.com"", ""sagarj@google.com"", ""cboutilier@google.com"", ""edchi@google.com"", ""elade@google.com"", ""xyluo@google.com"", ""mackeya@google.com"", ""meshi@google.com""]","[""Irwan Bello"", ""Sayali Kulkarni"", ""Sagar Jain"", ""Craig Boutilier"", ""Ed Chi"", ""Elad Eban"", ""Xiyang Luo"", ""Alan Mackey"", ""Ofer Meshi""]","[""Recurrent neural networks"", ""learning to rank"", ""pointer networks""]","Ranking is a central task in machine learning and information retrieval. In this task, it is especially important to present the user with a slate of items that is appealing as a whole. This in turn requires taking into account interactions between items, since intuitively, placing an item on the slate affects the decision of which other items should be chosen alongside it. +In this work, we propose a sequence-to-sequence model for ranking called seq2slate. At each step, the model predicts the next item to place on the slate given the items already chosen. The recurrent nature of the model allows complex dependencies between items to be captured directly in a flexible and scalable way. We show how to learn the model end-to-end from weak supervision in the form of easily obtained click-through data. 
We further demonstrate the usefulness of our approach in experiments on standard ranking benchmarks as well as in a real-world recommendation system.",/pdf/dc4fd08c5eeb3146c32f551d9a2ac1622e25c4ba.pdf,ICLR,2019,"A pointer network architecture for re-ranking items, learned from click-through logs." +7hMenh--8g,6kS9tawotXG,1601310000000.0,1614990000000.0,1316,Uncertainty Weighted Offline Reinforcement Learning,"[""~Yue_Wu17"", ""~Shuangfei_Zhai3"", ""~Nitish_Srivastava1"", ""~Joshua_M._Susskind1"", ""~Jian_Zhang23"", ""~Ruslan_Salakhutdinov1"", ""~Hanlin_Goh2""]","[""Yue Wu"", ""Shuangfei Zhai"", ""Nitish Srivastava"", ""Joshua M. Susskind"", ""Jian Zhang"", ""Ruslan Salakhutdinov"", ""Hanlin Goh""]","[""reinforcement learning"", ""offline"", ""batch reinforcement learning"", ""off-policy"", ""uncertainty estimation"", ""dropout"", ""actor-critic"", ""bootstrap error""]","Offline Reinforcement Learning promises to learn effective policies from previously-collected, static datasets without the need for exploration. However, existing Q-learning and actor-critic based off-policy RL algorithms fail when bootstrapping from out-of-distribution (OOD) actions or states. We hypothesize that a key missing ingredient from the existing methods is a proper treatment of uncertainty in the offline setting. We propose Uncertainty Weighted Actor-Critic (UWAC), an algorithm that models the epistemic uncertainty to detect OOD state-action pairs and down-weights their contribution in the training objectives accordingly. Implementation-wise, we adopt a practical and effective dropout-based uncertainty estimation method that introduces very little overhead over existing RL algorithms. Empirically, we observe that UWAC substantially improves model stability during training. 
In addition, UWAC out-performs existing offline RL methods on a variety of competitive tasks, and achieves significant performance gains over the state-of-the-art baseline on datasets with sparse demonstrations collected from human experts.",/pdf/60461108f34124e725ca20163d73a07d72e0df5e.pdf,ICLR,2021,A simple and effective uncertainty weighted training mechanism for stabilizing offline reinforcement learning. +Hyq4yhile,,1478370000000.0,1488580000000.0,571,Learning Invariant Feature Spaces to Transfer Skills with Reinforcement Learning,"[""abhigupta@berkeley.edu"", ""coline@berkeley.edu"", ""yuxuanliu@berkeley.edu"", ""pabbeel@cs.berkeley.edu"", ""svlevine@eecs.berkeley.edu""]","[""Abhishek Gupta"", ""Coline Devin"", ""YuXuan Liu"", ""Pieter Abbeel"", ""Sergey Levine""]","[""Deep learning"", ""Reinforcement Learning"", ""Transfer Learning""]","People can learn a wide range of tasks from their own experience, but can also learn from observing other creatures. This can accelerate acquisition of new skills even when the observed agent differs substantially from the learning agent in terms of morphology. In this paper, we examine how reinforcement learning algorithms can transfer knowledge between morphologically different agents (e.g., different robots). We introduce a problem formulation where two agents are tasked with learning multiple skills by sharing information. Our method uses the skills that were learned by both agents to train invariant feature spaces that can then be used to transfer other skills from one agent to another. The process of learning these invariant feature spaces can be viewed as a kind of ``analogy making,'' or implicit learning of partial correspondences between two distinct domains. 
We evaluate our transfer learning algorithm in two simulated robotic manipulation skills, and illustrate that we can transfer knowledge between simulated robotic arms with different numbers of links, as well as simulated arms with different actuation mechanisms, where one robot is torque-driven while the other is tendon-driven.",/pdf/58005d81a2be1dba738fdb9d0109b9d1e40b74a1.pdf,ICLR,2017,Learning a common feature space between robots with different morphology or actuation to transfer skills. +EUUp9nWXsop,5wU9opypT_a,1601310000000.0,1614990000000.0,187,IALE: Imitating Active Learner Ensembles,"[""~Christoffer_L\u00f6ffler1"", ""~Christopher_Mutschler1""]","[""Christoffer L\u00f6ffler"", ""Christopher Mutschler""]","[""active learning"", ""imitating learning"", ""ensembles""]","Active learning (AL) prioritizes the labeling of the most informative data samples. However, the performance of AL heuristics depends on the structure of the underlying classifier model and the data. We propose an imitation learning scheme that imitates the selection of the best expert heuristic at each stage of the AL cycle in a batch-mode pool-based setting. We use DAGGER to train the policy on a dataset and later apply it to datasets from similar domains. With multiple AL heuristics as experts, the policy is able to reflect the choices of the best AL heuristics given the current state of the AL process. 
Our experiments on well-known datasets show that we outperform both state-of-the-art imitation learners and heuristics.",/pdf/e907b778892aab2e616043942fe07978a57df968.pdf,ICLR,2021,IALE uses imitation learning to learn a policy for pool-based active learning (AL) that imitates a set of hard-coded experts that all together outperform state of the art AL baselines +rJgFjREtwr,HJgaJQt_PH,1569440000000.0,1577170000000.0,1326,Distribution-Guided Local Explanation for Black-Box Classifiers,"[""fwj.edu@gmail.com"", ""eric.mengwang@gmail.com"", ""dumengnan@tamu.edu"", ""nhliu43@tamu.edu"", ""hfut.hsj@gmail.com"", ""hu@cse.tamu.edu""]","[""Weijie Fu"", ""Meng Wang"", ""Mengnan Du"", ""Ninghao Liu"", ""Shijie Hao"", ""Xia Hu""]","[""explanation"", ""cnn"", ""saliency map""]","Existing local explanation methods provide an explanation for each decision of black-box classifiers, in the form of relevance scores of features according to their contributions. To obtain satisfying explainability, many methods introduce ad hoc constraints into the classification loss to regularize these relevance scores. However, the large information gap between the classification loss and these constraints increases the difficulty of tuning hyper-parameters. To bridge this gap, in this paper we present a simple but effective mask predictor. Specifically, we model the above constraints with a distribution controller, and integrate it with a neural network to directly guide the distribution of relevance scores. The benefit of this strategy is to facilitate the setting of involved hyper-parameters, and enable discriminative scores over supporting features. The experimental results demonstrate that our method outperforms others in terms of faithfulness and explainability. Meanwhile, it also provides effective saliency maps for explaining each decision. 
",/pdf/0dc0a31b03cfcd4cdc116450eb64b1b46e373b95.pdf,ICLR,2020,distribution-guided local explanation framework to provide discriminative saliency maps with easy-to-set hyper-parameters +BJe-_CNKPH,HkghIiDuPr,1569440000000.0,1577170000000.0,1201,Attention Interpretability Across NLP Tasks,"[""shikhar@iisc.ac.in"", ""shyamupa@google.com"", ""gtomar@google.com"", ""mfaruqui@google.com""]","[""Shikhar Vashishth"", ""Shyam Upadhyay"", ""Gaurav Singh Tomar"", ""Manaal Faruqui""]","[""Attention"", ""NLP"", ""Interpretability""]","The attention layer in a neural network model provides insights into the model’s reasoning behind its prediction, which are usually criticized for being opaque. Recently, seemingly contradictory viewpoints have emerged about the interpretability of attention weights (Jain & Wallace, 2019; Vig & Belinkov, 2019). Amid such confusion arises the need to understand attention mechanism more systematically. In this work, we attempt to fill this gap by giving a comprehensive explanation which justifies both kinds of observations (i.e., when is attention interpretable and when it is not). Through a series of experiments on diverse NLP tasks, we validate our observations and reinforce our claim of interpretability of attention through manual evaluation.",/pdf/2d62995b00384077cf6a3811bc6ac2ba08b7c0a5.pdf,ICLR,2020,Analysis of attention mechanism across diverse NLP tasks. +r1yjkAtxe,,1478250000000.0,1481640000000.0,151,Spatio-Temporal Abstractions in Reinforcement Learning Through Neural Encoding,"[""nirb@campus.technion.ac.il"", ""tomzahavy@campus.technion.ac.il"", ""shie@ee.technion.ac.il""]","[""Nir Baram"", ""Tom Zahavy"", ""Shie Mannor""]","[""Reinforcement Learning"", ""Deep learning""]","Recent progress in the field of Reinforcement Learning (RL) has enabled to tackle bigger and more challenging tasks. 
However, the increasing complexity of the problems, as well as the use of more sophisticated models such as Deep Neural Networks (DNN), impedes the understanding of artificial agents behavior. In this work, we present the Semi-Aggregated Markov Decision Process (SAMDP) model. The purpose of the SAMDP modeling is to describe and allow a better understanding of complex behaviors by identifying temporal and spatial abstractions. In contrast to other modeling approaches, SAMDP is built in a transformed state-space that encodes the dynamics of the problem. We show that working with the \emph{right} state representation mitigates the problem of finding spatial and temporal abstractions. We describe the process of building the SAMDP model from observed trajectories and give examples for using it in a toy problem and complicated DQN policies. Finally, we show how using the SAMDP we can monitor the policy at hand and make it more robust.",/pdf/529784289bb3165bb17a5d07f11a569bc5f80683.pdf,ICLR,2017,A method for understanding and improving deep agents by creating spatio-temporal abstractions +SJxNzgSKvH,HyxjZGeFwB,1569440000000.0,1577170000000.0,2169,Selective sampling for accelerating training of deep neural networks,"[""berry.weinstein@post.idc.ac.il"", ""shai.fine@idc.ac.il"", ""toky@idc.ac.il""]","[""Berry Weinstein"", ""Shai Fine"", ""Yacov Hel-Or""]",[],"We present a selective sampling method designed to accelerate the training of deep neural networks. To this end, we introduce a novel measurement, the {\it minimal margin score} (MMS), which measures the minimal amount of displacement an input should take until its predicted classification is switched. For multi-class linear classification, the MMS measure is a natural generalization of the margin-based selection criterion, which was thoroughly studied in the binary classification setting. 
In addition, the MMS measure provides an interesting insight into the progress of the training process and can be useful for designing and monitoring new training regimes. Empirically we demonstrate a substantial acceleration when training commonly used deep neural network architectures for popular image classification tasks. The efficiency of our method is compared against the standard training procedures, and against commonly used selective sampling alternatives: Hard negative mining selection, and Entropy-based selection. +Finally, we demonstrate an additional speedup when we adopt a more aggressive learning-drop regime while using the MMS selective sampling method.",/pdf/2f6172635fd4b5b82ca61b88cfca796fee570ab6.pdf,ICLR,2020, +H1e-X64FDB,SylQYYC8PH,1569440000000.0,1577170000000.0,439,"Fast Linear Interpolation for Piecewise-Linear Functions, GAMs, and Deep Lattice Networks","[""nzhang32@gmail.com"", ""canini@google.com"", ""silvasean@google.com"", ""mayagupta@google.com""]","[""Nathan Zhang"", ""Kevin Canini"", ""Sean Silva"", ""and Maya R. Gupta""]","[""hardware"", ""compiler"", ""MLIR"", ""runtime"", ""CPU"", ""interpolation""]","We present fast implementations of linear interpolation operators for both piecewise linear functions and multi-dimensional look-up tables. We use a compiler-based solution (using MLIR) for accelerating this family of workloads. On real-world multi-layer lattice models and a standard CPU, we show these strategies deliver $5-10\times$ faster runtimes compared to a C++ interpreter implementation that uses prior techniques, producing runtimes that are 1000s of times faster than TensorFlow 2.0 for single evaluations.",/pdf/51bd03a213e61d6acf7fd571fa8f430744f81ee4.pdf,ICLR,2020,"Fast implementations of linear interpolation operators are given for both piecewise linear functions and multi-dimensional look-up tables, producing 3-11x faster runtimes for single evaluations." 
+BkbY4psgg,,1478380000000.0,1489200000000.0,597,Making Neural Programming Architectures Generalize via Recursion,"[""jonathon@cs.berkeley.edu"", ""ricshin@cs.berkeley.edu"", ""dawnsong@cs.berkeley.edu""]","[""Jonathon Cai"", ""Richard Shin"", ""Dawn Song""]","[""Deep learning""]","Empirically, neural networks that attempt to learn programs from data have exhibited poor generalizability. Moreover, it has traditionally been difficult to reason about the behavior of these models beyond a certain level of input complexity. In order to address these issues, we propose augmenting neural architectures with a key abstraction: recursion. As an application, we implement recursion in the Neural Programmer-Interpreter framework on four tasks: grade-school addition, bubble sort, topological sort, and quicksort. We demonstrate superior generalizability and interpretability with small amounts of training data. Recursion divides the problem into smaller pieces and drastically reduces the domain of each neural network component, making it tractable to prove guarantees about the overall system’s behavior. Our experience suggests that in order for neural architectures to robustly learn program semantics, it is necessary to incorporate a concept like recursion.",/pdf/342543971002b3e5f08be11d9a6da60b594a6b47.pdf,ICLR,2017, +7ehDLD1yoE0,tlJJvqbkOpm,1601310000000.0,1614990000000.0,945,"STRATA: Simple, Gradient-free Attacks for Models of Code","[""~Jacob_M._Springer1"", ""~Bryn_Marie_Reinstadler1"", ""~Una-May_O'Reilly1""]","[""Jacob M. Springer"", ""Bryn Marie Reinstadler"", ""Una-May O'Reilly""]","[""Deep Learning"", ""Models of Code"", ""Black-box Adversarial Attacks"", ""Adversarial Robustness""]","Adversarial examples are imperceptible perturbations in the input to a neural model that result in misclassification. 
Generating adversarial examples for source code poses an additional challenge compared to the domains of images and natural language, because source code perturbations must adhere to strict semantic guidelines so the resulting programs retain the functional meaning of the code. We propose a simple and efficient gradient-free method for generating state-of-the-art adversarial examples on models of code that can be applied in a white-box or black-box setting. Our method generates untargeted and targeted attacks, and empirically outperforms competing gradient-based methods with less information and less computational effort.",/pdf/3a89287209b67b8ee52ee736753c74c9190c7e2b.pdf,ICLR,2021,We present an efficient state-of-the-art method for constructing gradient-free adversarial attacks for models of code that outperform currently available gradient-based attacks. +S1XXq6lRW,HJfmcpe0Z,1509120000000.0,1518730000000.0,424,Zero-shot Cross Language Text Classification,"[""dsve@dtu.dk"", ""jonas@meinertz.org"", ""olwi@dtu.dk""]","[""Dan Svenstrup"", ""Jonas Meinertz Hansen"", ""Ole Winther""]","[""Cross Language Text Classification"", ""Neural Networks"", ""Machine Learning""]","Labeled text classification datasets are typically only available in a few select languages. In order to train a model for e.g news categorization in a language $L_t$ without a suitable text classification dataset there are two options. The first option is to create a new labeled dataset by hand, and the second option is to transfer label information from an existing labeled dataset in a source language $L_s$ to the target language $L_t$. In this paper we propose a method for sharing label information across languages by means of a language independent text encoder. The encoder will give almost identical representations to multilingual versions of the same text. This means that labeled data in one language can be used to train a classifier that works for the rest of the languages. 
The encoder is trained independently of any concrete classification task and can therefore subsequently be used for any classification task. We show that it is possible to obtain good performance even in the case where only a comparable corpus of texts is available. ",/pdf/ca2fd9d4825d25f3201a9af6b77df7e0c3f1d739.pdf,ICLR,2018,Cross Language Text Classification by universal encoding +dqyK5RKMaW4,iSKwPUQ0e7p,1601310000000.0,1614990000000.0,639,LEARNED HARDWARE/SOFTWARE CO-DESIGN OF NEURAL ACCELERATORS,"[""~Zhan_Shi3"", ""chirag.sakhuja@utexas.edu"", ""~Milad_Hashemi1"", ""~Kevin_Swersky1"", ""~Calvin_Lin1""]","[""Zhan Shi"", ""Chirag Sakhuja"", ""Milad Hashemi"", ""Kevin Swersky"", ""Calvin Lin""]","[""deep learning accelerator"", ""Bayesian optimization"", ""design space exploration"", ""hardware-software co-design""]","The use of deep learning has grown at an exponential rate, giving rise to numerous specialized hardware and software systems for deep learning. Because the design space of deep learning software stacks and hardware accelerators is diverse and vast, prior work considers software optimizations separately from hardware architectures, effectively reducing the search space. Unfortunately, this bifurcated approach means that many profitable design points are never explored. This paper instead casts the problem as hardware/software co-design, with the goal of automatically identifying desirable points in the joint design space. The key to our solution is a new constrained Bayesian optimization framework that avoids invalid solutions by exploiting the highly constrained features of this design space, which are semi-continuous/semi-discrete. 
We evaluate our optimization framework by applying it to a variety of neural models, improving the energy-delay product by 18% (ResNet) and 40% (DQN) over hand-tuned state-of-the-art systems, as well as demonstrating strong results on other neural network architectures, such as MLPs and Transformers.",/pdf/1584bae2d218eef95cebf305e59f99b65ec13b2f.pdf,ICLR,2021,A bilevel Bayesian optimization approach for the hardware/software co-design of neural accelerators. +SkxANsC9tQ,rkxBJVyFum,1538090000000.0,1545360000000.0,46,Learning Graph Representations by Dendrograms,"[""thomas.bonald@telecom-paristech.fr"", ""bertrand.charpentier@live.fr""]","[""Thomas Bonald"", ""Bertrand Charpentier""]","[""Graph"", ""hierarchical clustering"", ""dendrogram"", ""quality metric"", ""reconstruction"", ""entropy""]","Hierarchical clustering is a common approach to analysing the +multi-scale structure of graphs observed in practice. +We propose a novel metric for assessing the quality of a hierarchical clustering. This metric reflects the ability to reconstruct the graph from the dendrogram encoding the hierarchy. The best representation of the graph for this metric in turn yields a novel hierarchical clustering algorithm. Experiments on both real and synthetic data illustrate the efficiency of the approach. +",/pdf/b711a6cde6788973da7b0db7968262113576adaf.pdf,ICLR,2019,Novel quality metric for hierarchical graph clustering +y2I4gyAGlCB,foYUUePqug3,1601310000000.0,1614990000000.0,1628,Imagine That! 
Leveraging Emergent Affordances for 3D Tool Synthesis,"[""~Yizhe_Wu1"", ""~Sudhanshu_Kasewa1"", ""~Oliver_Groth1"", ""~Sasha_Salter1"", ""~Kevin_Li_Sun1"", ""~Oiwi_Parker_Jones1"", ""~Ingmar_Posner1""]","[""Yizhe Wu"", ""Sudhanshu Kasewa"", ""Oliver Groth"", ""Sasha Salter"", ""Kevin Li Sun"", ""Oiwi Parker Jones"", ""Ingmar Posner""]","[""Affordance Learning"", ""Imagination"", ""Generative Models"", ""Activation Maximisation""]","In this paper we explore the richness of information captured by the latent space of a vision-based generative model. The model combines unsupervised generative learning with a task-based performance predictor to learn and to exploit task-relevant object affordances given visual observations from a reaching task, involving a scenario and a stick-like tool. While the learned embedding of the generative model captures factors of variation in 3D tool geometry (e.g. length, width, and shape), the performance predictor identifies sub-manifolds of the embedding that correlate with task success. Within a variety of scenarios, we demonstrate that traversing the latent space via backpropagation from the performance predictor allows us to imagine tools appropriate for the task at hand. Our results indicate that affordances – like the utility for reaching – are encoded along smooth trajectories in latent space. Accessing these emergent affordances by considering only high-level performance criteria (such as task success) enables an agent to manipulate tool geometries in a targeted and deliberate way.",/pdf/607bcedcf30e811e9d04833db98a7d9ceae41560.pdf,ICLR,2021,We demonstrate that a task-driven traversal of a learned latent space leads to object affordances emerging naturally as smooth trajectories in this space accessible via the optimisation of high-level performance criteria. 
+rkGcYi09Km,Hkg6S5bOYX,1538090000000.0,1545360000000.0,470,NUTS: Network for Unsupervised Telegraphic Summarization,"[""chanakya.malireddy@gmail.com"", ""tirthmaniar1998@gmail.com"", ""sajalmaheshwari624@gmail.com"", ""m.shrivastava@iiit.ac.in""]","[""Chanakya Malireddy"", ""Tirth Maniar"", ""Sajal Maheshwari"", ""Manish Shrivastava""]","[""nlp"", ""summarization"", ""unsupervised learning"", ""deep learning""]","Extractive summarization methods operate by ranking and selecting the sentences which best encapsulate the theme of a given document. They do not fare well in domains like fictional narratives where there is no central theme and core information is not encapsulated by a small set of sentences. For the purpose of reducing the size of the document while conveying the idea expressed by each sentence, we need more sentence specific methods. Telegraphic summarization, which selects short segments across several sentences, is better suited for such domains. Telegraphic summarization captures the plot better by retaining shorter versions of each sentence while not really concerning itself with grammatically linking these segments. In this paper, we propose an unsupervised deep learning network (NUTS) to generate telegraphic summaries. +We use multiple encoder-decoder networks and learn to drop portions of the text that are inferable from the chosen segments. The model is agnostic to both sentence length and style. We demonstrate that the summaries produced by our model show significant quantitative and qualitative improvement over those produced by existing methods and baselines.",/pdf/e74210ec375af7a1db0f94484d378a5603321bb6.pdf,ICLR,2019,"In this paper, we propose an unsupervised deep learning network (NUTS) to generate telegraphic summaries." 
+ryCM8zWRb,Sy3z8Mb0W,1509140000000.0,1518730000000.0,843,Recurrent Neural Networks with Top-k Gains for Session-based Recommendations,"[""hidasib@gmail.com"", ""alexk@tid.es""]","[""Bal\u00e1zs Hidasi"", ""Alexandros Karatzoglou""]","[""gru4rec"", ""session-based recommendations"", ""recommender systems"", ""recurrent neural network""]","RNNs have been shown to be excellent models for sequential data and in particular for session-based user behavior. The use of RNNs provides impressive performance benefits over classical methods in session-based recommendations. In this work we introduce a novel ranking loss function tailored for RNNs in recommendation settings. The better performance of this loss over alternatives, along with further tricks and improvements described in this work, yields an overall improvement of up to 35% in terms of MRR and Recall@20 over previous session-based RNN solutions and up to 51% over classical collaborative filtering approaches. Unlike data augmentation-based improvements, our method does not increase training times significantly.",/pdf/12c8b098b16023f2f79ff0ae97725173db97dedc.pdf,ICLR,2018,Improving session-based recommendations with RNNs (GRU4Rec) by 35% using newly designed loss functions and sampling. +rygjcsR9Y7,HJeFJOcqF7,1538090000000.0,1546610000000.0,562,SOM-VAE: Interpretable Discrete Representation Learning on Time Series,"[""fortuin@inf.ethz.ch"", ""mhueser@inf.ethz.ch"", ""locatelf@inf.ethz.ch"", ""heiko.strathmann@gmail.com"", ""raetsch@inf.ethz.ch""]","[""Vincent Fortuin"", ""Matthias H\u00fcser"", ""Francesco Locatello"", ""Heiko Strathmann"", ""Gunnar R\u00e4tsch""]","[""deep learning"", ""self-organizing map"", ""variational autoencoder"", ""representation learning"", ""time series"", ""machine learning"", ""interpretability""]","High-dimensional time series are common in many domains. 
Since human cognition is not optimized to work well in high-dimensional spaces, these areas could benefit from interpretable low-dimensional representations. However, most representation learning algorithms for time series data are difficult to interpret. This is due to non-intuitive mappings from data features to salient properties of the representation and non-smoothness over time. +To address this problem, we propose a new representation learning framework building on ideas from interpretable discrete dimensionality reduction and deep generative modeling. This framework allows us to learn discrete representations of time series, which give rise to smooth and interpretable embeddings with superior clustering performance. We introduce a new way to overcome the non-differentiability in discrete representation learning and present a gradient-based version of the traditional self-organizing map algorithm that is more performant than the original. Furthermore, to allow for a probabilistic interpretation of our method, we integrate a Markov model in the representation space. +This model uncovers the temporal transition structure, improves clustering performance even further and provides additional explanatory insights as well as a natural representation of uncertainty. +We evaluate our model in terms of clustering performance and interpretability on static (Fashion-)MNIST data, a time series of linearly interpolated (Fashion-)MNIST images, a chaotic Lorenz attractor system with two macro states, as well as on a challenging real world medical time series application on the eICU data set. Our learned representations compare favorably with competitor methods and facilitate downstream tasks on the real world data.",/pdf/15c8cbb113daa27eb888054e58b057bcfeb1a203.pdf,ICLR,2019,"We present a method to learn interpretable representations on time series using ideas from variational autoencoders, self-organizing maps and probabilistic models." 
+dcktlmtcM7,6Gg-dMlracf,1601310000000.0,1614990000000.0,730,Neural Time-Dependent Partial Differential Equation,"[""~Yihao_Hu1"", ""~Tong_Zhao3"", ""~Zhiliang_Xu1"", ""~Lizhen_Lin1""]","[""Yihao Hu"", ""Tong Zhao"", ""Zhiliang Xu"", ""Lizhen Lin""]","[""Numerical analysis"", ""Deep learning"", ""Partial differential equation"", ""Machine learning"", ""Predictive modeling""]","Partial differential equations (PDEs) play a crucial role in studying a vast number of problems in science and engineering. Numerically solving nonlinear and/or high-dimensional PDEs is frequently a challenging task. Inspired by the traditional finite difference and finite elements methods and emerging advancements in machine learning, we propose a sequence-to-sequence learning (Seq2Seq) framework called Neural-PDE, which allows one to automatically learn governing rules of any time-dependent PDE system from existing data by using a bidirectional LSTM encoder, and predict the solutions in next $n$ time steps. One critical feature of our proposed framework is that the Neural-PDE is able to simultaneously learn and simulate all variables of interest in a PDE system. We test the Neural-PDE by a range of examples, from one-dimensional PDEs to a multi-dimensional and nonlinear complex fluids model. The results show that the Neural-PDE is capable of learning the initial conditions, boundary conditions and differential operators defining the initial-boundary-value problem of a PDE system without the knowledge of the specific form of the PDE system. In our experiments, the Neural-PDE can efficiently extract the dynamics within 20 epochs training and produce accurate predictions. Furthermore, unlike the traditional machine learning approaches for learning PDEs, such as CNN and MLP, which require great quantity of parameters for model precision, the Neural-PDE shares parameters among all time steps, and thus considerably reduces computational complexity and leads to a fast learning algorithm. 
",/pdf/bb6b3dee2afc2e009e5bae1273537c4ca691e7ee.pdf,ICLR,2021,A sequence-to-sequence (Seq2Seq) learning framework to predict nonlinear time-dependent partial differential equations. +tW4QEInpni,ma0bvnpVw1y,1601310000000.0,1617380000000.0,2648,When Do Curricula Work?,"[""~Xiaoxia_Wu1"", ""~Ethan_Dyer1"", ""~Behnam_Neyshabur1""]","[""Xiaoxia Wu"", ""Ethan Dyer"", ""Behnam Neyshabur""]","[""Curriculum Learning"", ""Understanding Deep Learning"", ""Empirical Investigation""]","Inspired by human learning, researchers have proposed ordering examples during training based on their difficulty. Both curriculum learning, exposing a network to easier examples early in training, and anti-curriculum learning, showing the most difficult examples first, have been suggested as improvements to the standard i.i.d. training. In this work, we set out to investigate the relative benefits of ordered learning. We first investigate the implicit curricula resulting from architectural and optimization bias and find that samples are learned in a highly consistent order. Next, to quantify the benefit of explicit curricula, we conduct extensive experiments over thousands of orderings spanning three kinds of learning: curriculum, anti-curriculum, and random-curriculum -- in which the size of the training dataset is dynamically increased over time, but the examples are randomly ordered. We find that for standard benchmark datasets, curricula have only marginal benefits, and that randomly ordered samples perform as well or better than curricula and anti-curricula, suggesting that any benefit is entirely due to the dynamic training set size. Inspired by common use cases of curriculum learning in practice, we investigate the role of limited training time budget and noisy data in the success of curriculum learning. 
Our experiments demonstrate that curriculum, but not anti-curriculum or random ordering can indeed improve the performance either with limited training time budget or in the existence of noisy data.",/pdf/a6f2f483d8e768e936c0ab7b9c6f8209e4fb79a4.pdf,ICLR,2021,"We conduct extensive experiments over thousands of orderings to investigate the effectiveness of three kinds of learning: curriculum, anti-curriculum, and random-curriculum." +SyzjBiR9t7,rkl7E8FtFQ,1538090000000.0,1545360000000.0,118,MANIFOLDNET: A DEEP NEURAL NETWORK FOR MANIFOLD-VALUED DATA,"[""rudrasischa@gmail.com"", ""josebouza@ufl.edu"", ""jonathan.manton@ieee.org"", ""baba.vemuri@gmail.com""]","[""Rudrasis Chakraborty"", ""Jose Bouza"", ""Jonathan Manton"", ""Baba C. Vemuri""]",[],"Developing deep neural networks (DNNs) for manifold-valued data sets +has gained much interest of late in the deep learning research +community. Examples of manifold-valued data include data from +omnidirectional cameras on automobiles, drones etc., diffusion +magnetic resonance imaging, elastography and others. In this paper, we +present a novel theoretical framework for DNNs to cope with +manifold-valued data inputs. In doing this generalization, we draw +parallels to the widely popular convolutional neural networks (CNNs). +We call our network the ManifoldNet. + +As in vector spaces where convolutions are equivalent to computing the +weighted mean of functions, an analogous definition for +manifold-valued data can be constructed involving the computation of +the weighted Fr\'{e}chet Mean (wFM). To this end, we present a +provably convergent recursive computation of the wFM of the given +data, where the weights makeup the convolution mask, to be +learned. Further, we prove that the proposed wFM layer achieves a +contraction mapping and hence the ManifoldNet does not need the +additional non-linear ReLU unit used in standard CNNs. 
Operations such +as pooling in traditional CNN are no longer necessary in this setting +since wFM is already a pooling type operation. Analogous to the +equivariance of convolution in Euclidean space to translations, we +prove that the wFM is equivariant to the action of the group of +isometries admitted by the Riemannian manifold on which the data +reside. This equivariance property facilitates weight sharing within +the network. We present experiments, using the ManifoldNet framework, +to achieve video classification and image reconstruction using an +auto-encoder+decoder setting. Experimental results demonstrate the +efficacy of ManifoldNet in the context of classification and +reconstruction accuracy.",/pdf/380a4fe9969c914464b716162d8be299f93dc72b.pdf,ICLR,2019, +r1lpx3A9K7,SJlfQAp9t7,1538090000000.0,1545360000000.0,1120,Featurized Bidirectional GAN: Adversarial Defense via Adversarially Learned Semantic Inference,"[""rbao@princeton.edu"", ""sihangl@princeton.edu"", ""qingcanw@princeton.edu""]","[""Ruying Bao"", ""Sihang Liang"", ""Qingcan Wang""]",[],"Deep neural networks have been demonstrated to be vulnerable to adversarial attacks, where small perturbations intentionally added to the original inputs can fool the classifier. In this paper, we propose a defense method, Featurized Bidirectional Generative Adversarial Networks (FBGAN), to extract the semantic features of the input and filter the non-semantic perturbation. FBGAN is pre-trained on the clean dataset in an unsupervised manner, adversarially learning a bidirectional mapping between a high-dimensional data space and a low-dimensional semantic space; also mutual information is applied to disentangle the semantically meaningful features. After the bidirectional mapping, the adversarial data can be reconstructed to denoised data, which could be fed into any pre-trained classifier. 
We empirically show the quality of reconstruction images and the effectiveness of defense.",/pdf/9738a275f7113bead3ff53fb05b6e5620a9f6e1b.pdf,ICLR,2019, +HkxTwkrKDB,BkgnWCaODH,1569440000000.0,1583910000000.0,1785,On Universal Equivariant Set Networks,"[""nimrod.segol@weizmann.ac.il"", ""yaron.lipman@weizmann.ac.il""]","[""Nimrod Segol"", ""Yaron Lipman""]","[""deep learning"", ""universality"", ""set functions"", ""equivariance""]","Using deep neural networks that are either invariant or equivariant to permutations in order to learn functions on unordered sets has become prevalent. The most popular, basic models are DeepSets (Zaheer et al. 2017) and PointNet (Qi et al. 2017). While known to be universal for approximating invariant functions, DeepSets and PointNet are not known to be universal when approximating equivariant set functions. On the other hand, several recent equivariant set architectures have been proven equivariant universal (Sannai et al. 2019, Keriven and Peyre 2019), however these models either use layers that are not permutation equivariant (in the standard sense) and/or use higher order tensor variables which are less practical. There is, therefore, a gap in understanding the universality of popular equivariant set models versus theoretical ones. + +In this paper we close this gap by proving that: (i) PointNet is not equivariant universal; and (ii) adding a single linear transmission layer makes PointNet universal. We call this architecture PointNetST and argue it is the simplest permutation equivariant universal model known to date. Another consequence is that DeepSets is universal, and also PointNetSeg, a popular point cloud segmentation network (used e.g., in Qi et al. 2017) is universal. + +The key theoretical tool used to prove the above results is an explicit characterization of all permutation equivariant polynomial layers. 
Lastly, we provide numerical experiments validating the theoretical results and comparing different permutation equivariant models.",/pdf/d1adcc582763d5140cdd271d6c94bda5bf8e7fb6.pdf,ICLR,2020,Settling permutation equivariance universality for popular deep models. +PGmqOzKEPZN,CF7Yk6JOtiP,1601310000000.0,1614990000000.0,1058,Non-Negative Bregman Divergence Minimization for Deep Direct Density Ratio Estimation,"[""~Masahiro_Kato1"", ""~Takeshi_Teshima1""]","[""Masahiro Kato"", ""Takeshi Teshima""]","[""density ratio estimation"", ""bregman divergence""]","The estimation of the ratio of two probability densities has garnered attention as the density ratio is useful in various machine learning tasks, such as anomaly detection and domain adaptation. To estimate the density ratio, methods collectively known as direct density ratio estimation (DRE) have been explored. These methods are based on the minimization of the Bregman (BR) divergence between a density ratio model and the true density ratio. However, existing direct DRE suffers from serious overfitting when using flexible models such as neural networks. In this paper, we introduce a non-negative correction for empirical risk using only the prior knowledge of the upper bound of the density ratio. This correction makes a DRE method more robust against overfitting and enables the use of flexible models. In the theoretical analysis, we discuss the consistency of the empirical risk. 
In our experiments, the proposed estimators show favorable performance in inlier-based outlier detection and covariate shift adaptation.",/pdf/ae4d2b3aaee89188a5ee652889038b0a6f7f0418.pdf,ICLR,2021,Proposing the non-negative Bregman divergence minimization for density ratio estimation +HJgySxSKvB,H1gS6vgFPS,1569440000000.0,1577170000000.0,2269,Deep Relational Factorization Machines,"[""hongchanggao@gmail.com"", ""gawu@adobe.com"", ""ryrossi@adobe.com"", ""vishy@adobe.com"", ""henghuanghh@gmail.com""]","[""Hongchang Gao"", ""Gang Wu"", ""Ryan Rossi"", ""Viswanathan Swaminathan"", ""Heng Huang""]",[],"Factorization Machines (FMs) are an important supervised learning approach due to their unique ability to capture feature interactions when dealing with high-dimensional sparse data. However, FMs assume each sample is independently observed and hence are incapable of exploiting the interactions among samples. In contrast, Graph Neural Networks (GNNs) have become increasingly popular due to their strength at capturing the dependencies among samples. Unfortunately, they cannot efficiently handle high-dimensional sparse data, which is quite common in modern machine learning tasks. In this work, to leverage their complementary advantages and yet overcome their issues, we propose a novel approach, namely Deep Relational Factorization Machines, which can capture both the feature interaction and the sample interaction. In particular, we reveal the relationship between the feature interaction and the graph, which opens a brand new avenue to deal with high-dimensional features. 
Finally, we demonstrate the effectiveness of the proposed approach with experiments on several real-world datasets.",/pdf/000da9e96eb0a3b356ef3bf0980bc44b279bac20.pdf,ICLR,2020, +SJMO2iCct7,H1xOqj_FKQ,1538090000000.0,1545360000000.0,724,A NOVEL VARIATIONAL FAMILY FOR HIDDEN NON-LINEAR MARKOV MODELS,"[""dh2832@columbia.edu"", ""amoretti@cs.columbia.edu"", ""weiz@janelia.hhmi.org"", ""ss5513@columbia.edu"", ""jpcunni@gmail.com"", ""liam.paninski@gmail.com""]","[""Daniel Hernandez Diaz"", ""Antonio Khalil Moretti"", ""Ziqiang Wei"", ""Shreya Saxena"", ""John Cunningham"", ""Liam Paninski""]","[""variational inference"", ""time series"", ""nonlinear dynamics"", ""neuroscience""]","Latent variable models have been widely applied for the analysis and visualization of large datasets. In the case of sequential data, closed-form inference is possible when the transition and observation functions are linear. However, approximate inference techniques are usually necessary when dealing with nonlinear evolution and observations. Here, we propose a novel variational inference framework for the explicit modeling of time series, Variational Inference for Nonlinear Dynamics (VIND), that is able to uncover nonlinear observation and latent dynamics from sequential data. The framework includes a structured approximate posterior, and an algorithm that relies on the fixed-point iteration method to find the best estimate for latent trajectories. We apply the method to several datasets and show that it is able to accurately infer the underlying dynamics of these systems, in some cases substantially outperforming state-of-the-art methods.",/pdf/3f72f72e68a47cbebab9a219be1c3fc3c98f7b07.pdf,ICLR,2019,We propose a new variational inference algorithm for time series and a novel variational family endowed with nonlinear dynamics. 
+HJGven05Y7,ByeDzz65Ym,1538090000000.0,1551080000000.0,1086,How to train your MAML,"[""a.antoniou@sms.ed.ac.uk"", ""h.l.edwards@sms.ac.uk"", ""a.storkey@sms.ed.ac.uk""]","[""Antreas Antoniou"", ""Harrison Edwards"", ""Amos Storkey""]","[""meta-learning"", ""deep-learning"", ""few-shot learning"", ""supervised learning"", ""neural-networks"", ""stochastic optimization""]","The field of few-shot learning has recently seen substantial advancements. Most of these advancements came from casting few-shot learning as a meta-learning problem. Model Agnostic Meta Learning or MAML is currently one of the best approaches for few-shot learning via meta-learning. MAML is simple, elegant and very powerful, however, it has a variety of issues, such as being very sensitive to neural network architectures, often leading to instability during training, requiring arduous hyperparameter searches to stabilize training and achieve high generalization and being very computationally expensive at both training and inference times. 
In this paper, we propose various modifications to MAML that not only stabilize the system, but also substantially improve the generalization performance, convergence speed and computational overhead of MAML, which we call MAML++.",/pdf/9e41db3a4ce307482a165824c7ec5d415cf3cb8a.pdf,ICLR,2019,"MAML is great, but it has many problems, we solve many of those problems and as a result we learn most hyper parameters end to end, speed-up training and inference and set a new SOTA in few-shot learning" +M71R_ivbTQP,Uy9WgUddVN0,1601310000000.0,1614990000000.0,1069,Extract Local Inference Chains of Deep Neural Nets,"[""~Haiyan_Zhao2"", ""~Tianyi_Zhou1"", ""~Guodong_Long2"", ""~Jing_Jiang6"", ""~Chengqi_Zhang1""]","[""Haiyan Zhao"", ""Tianyi Zhou"", ""Guodong Long"", ""Jing Jiang"", ""Chengqi Zhang""]","[""Model Interpretability"", ""Model Pruning"", ""Attribution"", ""Model Visualization""]","We study how to explain the main steps/chains of inference that a deep neural net (DNN) relies on to produce predictions in a local region of data space. This problem is related to network pruning and interpretable machine learning but the highlighted differences are: (1) fine-tuning of neurons/filters is forbidden: only exact copies are allowed; (2) we target an extremely high pruning rate, e.g., $\geq 95\%$; (3) the interpretation is for the whole inference process in a local region rather than for individual neurons/filters or on a single sample. In this paper, we introduce an efficient method, \name, to extract the local inference chains by optimizing a differentiable sparse scoring for the filters and layers to preserve the outputs on given data from a local region. Thereby, \name~can extract an extremely small sub-network composed of filters exactly copied from the original DNN by removing the filters/layers with small scores. We then visualize the sub-network by applying existing interpretation technique to the retained layer/filter/neurons and on any sample from the local region. 
Its architecture reveals how the inference process stitches and integrates the information layer by layer and filter by filter. We provide detailed and insightful case studies together with three quantitative analyses over thousands of trials to demonstrate the quality, sparsity, fidelity and accuracy of the interpretation within the assigned local regions and over unseen data. In our empirical study, \name~significantly enriches the interpretation and makes the inner mechanism of DNNs more transparent than before. ",/pdf/f3508bc8d25c6ea769cd509a65e59cf8e1e1be3c.pdf,ICLR,2021,Our paper propose a method to visualize the reasoning process of DNN on given data from a local region. +HJgb7lSFwS,BklVh7xFvS,1569440000000.0,1577170000000.0,2200,Distance-based Composable Representations with Neural Networks,"[""graham.spinks@cs.kuleuven.be"", ""sien.moens@cs.kuleuven.be""]","[""Graham Spinks"", ""Marie-Francine Moens""]","[""Representation learning"", ""Wasserstein distance"", ""Composability"", ""Templates""]","We introduce a new deep learning technique that builds individual and class representations based on distance estimates to randomly generated contextual dimensions for different modalities. Recent works have demonstrated advantages to creating representations from probability distributions over their contexts rather than single points in a low-dimensional Euclidean vector space. These methods, however, rely on pre-existing features and are limited to textual information. In this work, we obtain generic template representations that are vectors containing the average distance of a class to randomly generated contextual information. These representations have the benefit of being both interpretable and composable. They are initially learned by estimating the Wasserstein distance for different data subsets with deep neural networks. 
Individual samples or instances can then be compared to the generic class representations, which we call templates, to determine their similarity and thus class membership. We show that this technique, which we call WDVec, delivers good results for multi-label image classification. Additionally, we illustrate the benefit of templates and their composability by performing retrieval with complex queries where we modify the information content in the representations. Our method can be used in conjunction with any existing neural network and create theoretically infinitely large feature maps.",/pdf/d21c65a73518e618cabb980e9daa65e3ff68a383.pdf,ICLR,2020, +N0M_4BkQ05i,y4D7YfvHwMP,1601310000000.0,1616140000000.0,2955,Selective Classification Can Magnify Disparities Across Groups,"[""~Erik_Jones3"", ""~Shiori_Sagawa1"", ""~Pang_Wei_Koh1"", ""~Ananya_Kumar1"", ""~Percy_Liang1""]","[""Erik Jones"", ""Shiori Sagawa"", ""Pang Wei Koh"", ""Ananya Kumar"", ""Percy Liang""]","[""selective classification"", ""group disparities"", ""log-concavity"", ""robustness""]","Selective classification, in which models can abstain on uncertain predictions, is a natural approach to improving accuracy in settings where errors are costly but abstentions are manageable. In this paper, we find that while selective classification can improve average accuracies, it can simultaneously magnify existing accuracy disparities between various groups within a population, especially in the presence of spurious correlations. We observe this behavior consistently across five vision and NLP datasets. Surprisingly, increasing abstentions can even decrease accuracies on some groups. To better understand this phenomenon, we study the margin distribution, which captures the model’s confidences over all predictions. 
For symmetric margin distributions, we prove that whether selective classification monotonically improves or worsens accuracy is fully determined by the accuracy at full coverage (i.e., without any abstentions) and whether the distribution satisfies a property we call left-log-concavity. Our analysis also shows that selective classification tends to magnify full-coverage accuracy disparities. Motivated by our analysis, we train distributionally-robust models that achieve similar full-coverage accuracies across groups and show that selective classification uniformly improves each group on these models. Altogether, our results suggest that selective classification should be used with care and underscore the importance of training models to perform equally well across groups at full coverage.",/pdf/b9ac6534faf7141a9138e3cfcfed7dbada0a6f36.pdf,ICLR,2021, +SJxbHkrKDH,ByluTl6dPB,1569440000000.0,1583910000000.0,1681,Evolutionary Population Curriculum for Scaling Multi-Agent Reinforcement Learning,"[""qianlong@cs.cmu.edu"", ""footoredo@sjtu.edu.cn"", ""abhinavg@cs.cmu.edu"", ""feif@cs.cmu.edu"", ""jxwuyi@gmail.com"", ""dragonwxl123@gmail.com""]","[""Qian Long*"", ""Zihan Zhou*"", ""Abhinav Gupta"", ""Fei Fang"", ""Yi Wu\u2020"", ""Xiaolong Wang\u2020""]","[""multi-agent reinforcement learning"", ""evolutionary learning"", ""curriculum learning""]","In multi-agent games, the complexity of the environment can grow exponentially as the number of agents increases, so it is particularly challenging to learn good policies when the agent population is large. In this paper, we introduce Evolutionary Population Curriculum (EPC), a curriculum learning paradigm that scales up Multi-Agent Reinforcement Learning (MARL) by progressively increasing the population of training agents in a stage-wise manner. 
Furthermore, EPC uses an evolutionary approach to fix an objective misalignment issue throughout the curriculum: agents successfully trained in an early stage with a small population are not necessarily the best candidates for adapting to later stages with scaled populations. Concretely, EPC maintains multiple sets of agents in each stage, performs mix-and-match and fine-tuning over these sets and promotes the sets of agents with the best adaptability to the next stage. We implement EPC on a popular MARL algorithm, MADDPG, and empirically show that our approach consistently outperforms baselines by a large margin as the number of agents grows exponentially. The source code and videos can be found at https://sites.google.com/view/epciclr2020.",/pdf/290621318b771413010855c3dfd371f691b171ac.pdf,ICLR,2020, +H1MOqeHYvB,H1g9gJ-tvB,1569440000000.0,1577170000000.0,2478,At Your Fingertips: Automatic Piano Fingering Detection,"[""amitmoryossef@gmail.com"", ""yanaiela@gmail.com"", ""yoav.goldberg@gmail.com""]","[""Amit Moryossef"", ""Yanai Elazar"", ""Yoav Goldberg""]","[""piano"", ""fingering"", ""dataset""]","Automatic Piano Fingering is a hard task which computers can learn using data. As data collection is hard and expensive, we propose to automate this process by automatically extracting fingerings from public videos and MIDI files, using computer-vision techniques. Running this process on 90 videos results in the largest dataset for piano fingering with more than 150K notes. We show that when running a previously proposed model for automatic piano fingering on our dataset and then fine-tuning it on manually labeled piano fingering data, we achieve state-of-the-art results. +In addition to the fingering extraction method, we also introduce a novel method for transferring deep-learning computer-vision models to work on out-of-domain data, by fine-tuning it on out-of-domain augmentation proposed by a Generative Adversarial Network (GAN). 
+ +For demonstration, we anonymously release a visualization of the output of our process for a single video on https://youtu.be/Gfs1UWQhr5Q",/pdf/654848b30a3510f4595c174bae1079a26b1367fe.pdf,ICLR,2020,"We automatically extract fingering information from videos of piano performances, to be used in automatic fingering prediction models." +S7Aeama_0s,hlwLmRo1UD5,1601310000000.0,1614990000000.0,3682,QRGAN: Quantile Regression Generative Adversarial Networks,"[""sunyeop97@gmail.com"", ""~Tuan_Anh_Nguyen3"", ""dkmin@konkuk.ac.kr""]","[""Sunyeop Lee"", ""Tuan Anh Nguyen"", ""Dugki Min""]","[""Quantile Regression"", ""Generative Adversarial Networks (GANs)"", ""Frechet Inception Distance (FID)"", ""Generative Neural Networks""]","Learning high-dimensional probability distributions by competitively training generative and discriminative neural networks is a prominent approach of Generative Adversarial Networks (GANs) among generative models to model complex real-world data. Nevertheless, training GANs likely suffer from non-convergence problem, mode collapse and gradient explosion or vanishing. Least Squares GAN (LSGANs) and Wasserstein GANs (WGAN) are of representative variants of GANs in literature that diminish the inherent problems of GANs by proposing the modification methodology of loss functions. However, LSGANs often fall into local minima and cause mode collapse. While WGANs unexpectedly encounter with inefficient computation and slow training due to its constraints in Wasserstein distance approximation. In this paper, we propose Quantile Regression GAN (QRGAN) in which quantile regression is adopted to minimize 1-Wasserstein distance between real and generated data distribution as a novel approach in modification of loss functions for improvement of GANs. To study the culprits of mode collapse problem, the output space of discriminator and gradients of fake samples are analyzed to see if the discriminator guides the generator well. 
And we found that the discriminator should not be bounded to specific numbers. Our proposed QRGAN exposes high robustness against mode collapse problem. Furthermore, QRGAN obtains an apparent improvement in the evaluation and comparison of Frechet Inception Distance (FID) for generation performance assessment compared to existing variants of GANs.",/pdf/db0db9192a520c580c9c97043258678a3047b8ea.pdf,ICLR,2021,A novel generative adversarial network with quantile regression for a significant improvement of model robustness and generation performance +BkM3ibZRW,ByW2o-WCb,1509130000000.0,1518730000000.0,715,Adversarially Regularized Autoencoders,"[""jakezhao@cs.nyu.edu"", ""yoonkim@seas.harvard.edu"", ""kz918@nyu.edu"", ""srush@seas.harvard.edu"", ""yann@cs.nyu.edu""]","[""Junbo (Jake) Zhao"", ""Yoon Kim"", ""Kelly Zhang"", ""Alexander M. Rush"", ""Yann LeCun""]","[""representation learning"", ""natural language generation"", ""discrete structure modeling"", ""adversarial training"", ""unaligned text style-transfer""]","While autoencoders are a key technique in representation learning for continuous structures, such as images or wave forms, developing general-purpose autoencoders for discrete structures, such as text sequence or discretized images, has proven to be more challenging. In particular, discrete inputs make it more difficult to learn a smooth encoder that preserves the complex local relationships in the input space. In this work, we propose an adversarially regularized autoencoder (ARAE) with the goal of learning more robust discrete-space representations. ARAE jointly trains both a rich discrete-space encoder, such as an RNN, and a simpler continuous space generator function, while using generative adversarial network (GAN) training to constrain the distributions to be similar. This method yields a smoother contracted code space that maps similar inputs to nearby codes, and also an implicit latent variable GAN model for generation. 
Experiments on text and discretized images demonstrate that the GAN model produces clean interpolations and captures the multimodality of the original space, and that the autoencoder produces improvements in semi-supervised learning as well as state-of-the-art results in unaligned text style transfer task using only a shared continuous-space representation.",/pdf/c6018a1358f0b0242e02f3f51a42bb30b889bb01.pdf,ICLR,2018,"Adversarially Regularized Autoencoders learn smooth representations of discrete structures allowing for interesting results in text generation, such as unaligned style transfer, semi-supervised learning, and latent space interpolation and arithmetic." +YZrQKLHFhv3,2Mdj8j-47Yk,1601310000000.0,1614990000000.0,1521,"MixSize: Training Convnets With Mixed Image Sizes for Improved Accuracy, Speed and Scale Resiliency","[""~Elad_Hoffer1"", ""~Berry_Weinstein1"", ""~Itay_Hubara1"", ""~Tal_Ben-Nun1"", ""htor@inf.ethz.ch"", ""~Daniel_Soudry1""]","[""Elad Hoffer"", ""Berry Weinstein"", ""Itay Hubara"", ""Tal Ben-Nun"", ""Torsten Hoefler"", ""Daniel Soudry""]",[],"Convolutional neural networks (CNNs) are commonly trained using a fixed spatial image size predetermined for a given model. Although trained on images of a specific size, it is well established that CNNs can be used to evaluate a wide range of image sizes at test time, by adjusting the size of intermediate feature maps. \\ +In this work, we describe and evaluate a novel mixed-size training regime that mixes several image sizes at training time. We demonstrate that models trained using our method are more resilient to image size changes and generalize well even on small images. This allows faster inference by using smaller images at test time. For instance, we receive a $76.43\%$ top-1 accuracy using ResNet50 with an image size of $160$, which matches the accuracy of the baseline model with $2 \times$ fewer computations. 
+Furthermore, for a given image size used at test time, we show this method can be exploited either to accelerate training or to improve the final test accuracy. For example, we are able to reach a $79.27\%$ accuracy with a model evaluated at a $288$ spatial size for a relative improvement of $14\%$ over the baseline. +Our PyTorch implementation and pre-trained models are publicly available\footnote{\url{https://github.com/paper-submissions/mix-match}",/pdf/b4bfff228291e1228125c2e53dcf1c1989dc5d38.pdf,ICLR,2021, +ryl-RTEYvB,rJlugDbODS,1569440000000.0,1577170000000.0,846,Robust Learning with Jacobian Regularization,"[""judy@gatech.edu"", ""dan@diffeo.com"", ""shoyaida@fb.com""]","[""Judy Hoffman"", ""Daniel A. Roberts"", ""Sho Yaida""]","[""Supervised Representation Learning"", ""Few-Shot Learning"", ""Regularization"", ""Adversarial Defense"", ""Deep Learning""]","Design of reliable systems must guarantee stability against input perturbations. In machine learning, such a guarantee entails preventing overfitting and ensuring robustness of models against corruption of input data. In order to maximize stability, we analyze and develop a computationally efficient implementation of Jacobian regularization that increases classification margins of neural networks. The stabilizing effect of the Jacobian regularizer leads to significant improvements in robustness, as measured against both random and adversarial input perturbations, without severely degrading generalization properties on clean data.",/pdf/14d72e0cdace4e65f38ca3bd274f8f81bd98372e.pdf,ICLR,2020,We analyze and develop a computationally efficient implementation of Jacobian regularization that increases the classification margins of neural networks. 
+BJlXUsR5KQ,r1ep1T7FYm,1538090000000.0,1545360000000.0,161,Learning Neuron Non-Linearities with Kernel-Based Deep Neural Networks,"[""g.marra@unifi.it"", ""dario.zanca@unifi.it"", ""alessandro.betti@unifi.it"", ""marco.gori@unisi.it""]","[""Giuseppe Marra"", ""Dario Zanca"", ""Alessandro Betti"", ""Marco Gori""]","[""Activation functions"", ""Kernel methods"", ""Recurrent networks""]","The effectiveness of deep neural architectures has been widely supported in terms of both experimental and foundational principles. There is also clear evidence that the activation function (e.g. the rectifier and the LSTM units) plays a crucial role in the complexity of learning. Based on this remark, this paper discusses an optimal selection of the neuron non-linearity in a functional framework that is inspired from classic regularization arguments. A representation theorem is given which indicates that the best activation function is a kernel expansion in the training set, that can be effectively approximated over an opportune set of points modeling 1-D clusters. The idea can be naturally extended to recurrent networks, where the expressiveness of kernel-based activation functions turns out to be a crucial ingredient to capture long-term dependencies. We give experimental evidence of this property by a set of challenging experiments, where we compare the results with neural architectures based on state of the art LSTM cells.",/pdf/a5813434d3fa840a35057f8dec6dcf1a4c7f1e40.pdf,ICLR,2019, +SkJKHMW0Z,ByaOrGWRW,1509140000000.0,1518730000000.0,838,Recurrent Relational Networks for complex relational reasoning,"[""rasmusbergpalm@gmail.com"", ""upaq@google.com"", ""olwi@dtu.dk""]","[""Rasmus Berg Palm"", ""Ulrich Paquet"", ""Ole Winther""]","[""relational reasoning"", ""graph neural networks""]","Humans possess an ability to abstractly reason about objects and their interactions, an ability not shared with state-of-the-art deep learning models. 
Relational networks, introduced by Santoro et al. (2017), add the capacity for relational reasoning to deep neural networks, but are limited in the complexity of the reasoning tasks they can address. We introduce recurrent relational networks which increase the suite of solvable tasks to those that require an order of magnitude more steps of relational reasoning. We use recurrent relational networks to solve Sudoku puzzles and achieve state-of-the-art results by solving 96.6% of the hardest Sudoku puzzles, where relational networks fail to solve any. We also apply our model to the BaBi textual QA dataset solving 19/20 tasks which is competitive with state-of-the-art sparse differentiable neural computers. The recurrent relational network is a general purpose module that can augment any neural network model with the capacity to do many-step relational reasoning.",/pdf/b8777b69c8edec1613717f62a244fb93c75c8a4a.pdf,ICLR,2018,"We introduce Recurrent Relational Networks, a powerful and general neural network module for relational reasoning, and use it to solve 96.6% of the hardest Sudokus and 19/20 BaBi tasks." +SkxcZCNKDS,rylqiL4uwr,1569440000000.0,1577170000000.0,975,"If MaxEnt RL is the Answer, What is the Question?","[""beysenba@cs.cmu.edu"", ""svlevine@eecs.berkeley.edu""]","[""Benjamin Eysenbach"", ""Sergey Levine""]","[""reinforcement learning"", ""maximum entropy"", ""POMDP""]","Experimentally, it has been observed that humans and animals often make decisions that do not maximize their expected utility, but rather choose outcomes randomly, with probability proportional to expected utility. Probability matching, as this strategy is called, is equivalent to maximum entropy reinforcement learning (MaxEnt RL). However, MaxEnt RL does not optimize expected utility. In this paper, we formally show that MaxEnt RL does optimally solve certain classes of control problems with variability in the reward function. 
In particular, we show (1) that MaxEnt RL can be used to solve a certain class of POMDPs, and (2) that MaxEnt RL is equivalent to a two-player game where an adversary chooses the reward function. These results suggest a deeper connection between MaxEnt RL, robust control, and POMDPs, and provide insight for the types of problems for which we might expect MaxEnt RL to produce effective solutions. Specifically, our results suggest that domains with uncertainty in the task goal may be especially well-suited for MaxEnt RL methods.",/pdf/e0de4f92d8a53d399db8d1026ec1cce4062a041d.pdf,ICLR,2020,We show that MaxEnt RL implicitly solves control problems with variability in rewards. +JYVODnDjU20,pScJT4ZBx2U,1601310000000.0,1614990000000.0,3513,UNSUPERVISED ANOMALY DETECTION FROM SEMANTIC SIMILARITY SCORES,"[""nima.rafiee@hhu.de"", ""rahil.gholamipoorfard@hhu.de"", ""~Markus_Kollmann1""]","[""Nima Rafiee"", ""Rahil Gholamipoor"", ""Markus Kollmann""]","[""Anomaly Detection"", ""Out-of-Distribution Detection"", ""Novelty Detection""]","In this paper we present SemSAD, a simple and generic framework for detecting examples that lie out-of-distribution (OOD) for a given training set. The approach is based on learning a semantic similarity measure to find for a given test example the semantically closest example in the training set and then using a discriminator to classify whether the two examples show sufficient semantic dissimilarity such that the test example can be rejected as OOD. We are able to outperform previous approaches for anomaly, novelty, or out-of-distribution detection in the visual domain by a large margin. In particular we obtain AUROC values close to one for the challenging task of detecting examples from CIFAR-10 as out-of-distribution given CIFAR-100 as in-distribution, without making use of label information. 
",/pdf/51e061e26b35a1913ac6d6d7c34821f840a07cea.pdf,ICLR,2021,Combining Contrastive Learning and Discriminative Learning for unsupervised Anomaly Detection +Syejj0NYvr,r1l4XQK_PS,1569440000000.0,1577170000000.0,1332,Adversarial Interpolation Training: A Simple Approach for Improving Model Robustness,"[""hczhang1@gmail.com"", ""wei.xu@horizon.ai""]","[""Haichao Zhang"", ""Wei Xu""]","[""adversarial training"", ""adversarial robustness""]","We propose a simple approach for adversarial training. The proposed approach utilizes an adversarial interpolation scheme for generating adversarial images and accompanying adversarial labels, which are then used in place of the original data for model training. The proposed approach is intuitive to understand, simple to implement and achieves state-of-the-art performance. We evaluate the proposed approach on a number of datasets including CIFAR10, CIFAR100 and SVHN. Extensive empirical results compared with several state-of-the-art methods against different attacks verify the effectiveness of the proposed approach. ",/pdf/1d1fb798039b9a33359a866d4530ddfed4976c69.pdf,ICLR,2020,"adversarial interpolation training: a simple, intuitive and effective approach for improving model robustness" +HyezBa4tPB,rJeZFqrDvS,1569440000000.0,1577170000000.0,514,Dirichlet Wrapper to Quantify Classification Uncertainty in Black-Box Systems,"[""jmenarol7@alumnes.ub.edu"", ""oriol_pujol@ub.edu"", ""jordi.vitria@ub.edu""]","[""Jos\u00e9 Mena Rold\u00e1n"", ""Oriol Pujol Vila"", ""Jordi Vitri\u00e0 Marca""]","[""uncertainty"", ""black-box classifiers"", ""rejection"", ""deep learning"", ""NLP"", ""CV""]","Nowadays, machine learning models are becoming a utility in many sectors. AI companies deliver pre-trained encapsulated models as application programming interfaces (APIs) that developers can combine with third party components, their models, and proprietary data, to create complex data products. 
This complexity and the lack of control and knowledge of the internals of these external components might cause unavoidable effects, such as lack of transparency, difficulty in auditability, and the emergence of uncontrolled potential risks. These issues are especially critical when practitioners use these components as black-boxes in new datasets. In order to provide actionable insights in this type of scenarios, in this work we propose the use of a wrapping deep learning model to enrich the output of a classification black-box with a measure of uncertainty. Given a black-box classifier, we propose a probabilistic neural network that works in parallel to the black-box and uses a Dirichlet layer as the fusion layer with the black-box. This Dirichlet layer yields a distribution on top of the multinomial output parameters of the classifier and enables the estimation of aleatoric uncertainty for any data sample. +Based on the resulting uncertainty measure, we advocate for a rejection system that selects the more confident predictions, discarding those more uncertain, leading to an improvement in the trustability of the resulting system. We showcase the proposed technique and methodology in two practical scenarios, one for NLP and another for computer vision, where a simulated API based is applied to different domains. 
Results demonstrate the effectiveness of the uncertainty computed by the wrapper and its high correlation to wrong predictions and misclassifications.",/pdf/a1b53ac7a08397d12ffe2b7555c0efa583f1dc56.pdf,ICLR,2020,A Dirichlet Deep Learning wrapper to quantify uncertainty in black-box systems applied to a rejection system to improve the quality of predictions +HJcSzz-CZ,HJ5rffZ0-,1509130000000.0,1519950000000.0,788,Meta-Learning for Semi-Supervised Few-Shot Classification,"[""mren@cs.toronto.edu"", ""eleni@cs.toronto.edu"", ""sachinr@princeton.edu"", ""jsnell@cs.toronto.edu"", ""kswersky@google.com"", ""jbt@mit.edu"", ""hugolarochelle@google.com"", ""zemel@cs.toronto.edu""]","[""Mengye Ren"", ""Eleni Triantafillou"", ""Sachin Ravi"", ""Jake Snell"", ""Kevin Swersky"", ""Joshua B. Tenenbaum"", ""Hugo Larochelle"", ""Richard S. Zemel""]","[""Few-shot learning"", ""semi-supervised learning"", ""meta-learning""]","In few-shot classification, we are interested in learning algorithms that train a classifier from only a handful of labeled examples. Recent progress in few-shot classification has featured meta-learning, in which a parameterized model for a learning algorithm is defined and trained on episodes representing different classification problems, each with a small labeled training set and its corresponding test set. In this work, we advance this few-shot classification paradigm towards a scenario where unlabeled examples are also available within each episode. We consider two situations: one where all unlabeled examples are assumed to belong to the same set of classes as the labeled examples of the episode, as well as the more challenging situation where examples from other distractor classes are also provided. To address this paradigm, we propose novel extensions of Prototypical Networks (Snell et al., 2017) that are augmented with the ability to use unlabeled examples when producing prototypes. 
These models are trained in an end-to-end way on episodes, to learn to leverage the unlabeled examples successfully. We evaluate these methods on versions of the Omniglot and miniImageNet benchmarks, adapted to this new framework augmented with unlabeled examples. We also propose a new split of ImageNet, consisting of a large set of classes, with a hierarchical structure. Our experiments confirm that our Prototypical Networks can learn to improve their predictions due to unlabeled examples, much like a semi-supervised algorithm would.",/pdf/0bbc5bd4bf231b441ce95aa2842d6dbfe472b562.pdf,ICLR,2018,We propose novel extensions of Prototypical Networks that are augmented with the ability to use unlabeled examples when producing prototypes. +SJ8M9yup-,BySfqyd6W,1508530000000.0,1518730000000.0,28,On Optimality Conditions for Auto-Encoder Signal Recovery,"[""devansharpit@gmail.com"", ""zybzmhhj@gmail.com"", ""hungngo@buffalo.edu"", ""nnapp@buffalo.edu"", ""venu@cubs.buffalo.edu""]","[""Devansh Arpit"", ""Yingbo Zhou"", ""Hung Q. Ngo"", ""Nils Napp"", ""Venu Govindaraju""]","[""Auto Encoder"", ""Signal Recovery"", ""Sparse Coding""]","Auto-Encoders are unsupervised models that aim to learn patterns from observed data by minimizing a reconstruction cost. The useful representations learned are often found to be sparse and distributed. On the other hand, compressed sensing and sparse coding assume a data generating process, where the observed data is generated from some true latent signal source, and try to recover the corresponding signal from measurements. Looking at auto-encoders from this signal recovery perspective enables us to have a more coherent view of these techniques. In this paper, in particular, we show that the true hidden representation can be approximately recovered if the weight matrices are highly incoherent with unit $ \ell^{2} $ row length and the bias vectors takes the value (approximately) equal to the negative of the data mean. 
The recovery also becomes more and more accurate as the sparsity in hidden signals increases. Additionally, we empirically also demonstrate that auto-encoders are capable of recovering the data generating dictionary when only data samples are given.",/pdf/e21c2e8c1a73c0137fbf115110a023bd50cd92c8.pdf,ICLR,2018, +B1l0wp4tvr,HJlDutsDvH,1569440000000.0,1577170000000.0,616,Information Plane Analysis of Deep Neural Networks via Matrix--Based Renyi's Entropy and Tensor Kernels,"[""kristoffer.k.wickstrom@uit.no"", ""sigurd.lokse@uit.no"", ""michael.c.kampffmeyer@uit.no"", ""yusjlcy9011@cnel.ufl.edu"", ""principe@cnel.ufl.edu"", ""robert.jenssen@uit.no""]","[""Kristoffer Wickstr\u00f8m"", ""Sigurd L\u00f8kse"", ""Michael Kampffmeyer"", ""Shujian Yu"", ""Jose Principe"", ""Robert Jenssen""]","[""information plane"", ""information theory"", ""deep neural networks"", ""entropy"", ""mutual information"", ""tensor kernels""]","Analyzing deep neural networks (DNNs) via information plane (IP) theory has gained tremendous attention recently as a tool to gain insight into, among others, their generalization ability. However, it is by no means obvious how to estimate mutual information (MI) between each hidden layer and the input/desired output, to construct the IP. For instance, hidden layers with many neurons require MI estimators with robustness towards the high dimensionality associated with such layers. MI estimators should also be able to naturally handle convolutional layers, while at the same time being computationally tractable to scale to large networks. None of the existing IP methods to date have been able to study truly deep Convolutional Neural Networks (CNNs), such as the e.g.\ VGG-16. In this paper, we propose an IP analysis using the new matrix--based R\'enyi's entropy coupled with tensor kernels over convolutional layers, leveraging the power of kernel methods to represent properties of the probability distribution independently of the dimensionality of the data. 
The obtained results shed new light on the previous literature concerning small-scale DNNs, however using a completely new approach. Importantly, the new framework enables us to provide the first comprehensive IP analysis of contemporary large-scale DNNs and CNNs, investigating the different training phases and providing new insights into the training dynamics of large-scale neural networks.",/pdf/82bb8a69aca8c4129acd225121948d276a4e3990.pdf,ICLR,2020,First comprehensive information plane analysis of large scale deep neural networks using matrix based entropy and tensor kernels. +ks5nebunVn_,WU5n41zleBj,1601310000000.0,1616010000000.0,1717,Towards Robustness Against Natural Language Word Substitutions,"[""~Xinshuai_Dong1"", ""~Anh_Tuan_Luu2"", ""~Rongrong_Ji5"", ""~Hong_Liu9""]","[""Xinshuai Dong"", ""Anh Tuan Luu"", ""Rongrong Ji"", ""Hong Liu""]","[""Natural Language Processing"", ""Adversarial Defense""]","Robustness against word substitutions has a well-defined and widely acceptable form, i.e., using semantically similar words as substitutions, and thus it is considered as a fundamental stepping-stone towards broader robustness in natural language processing. Previous defense methods capture word substitutions in vector space by using either l_2-ball or hyper-rectangle, which results in perturbation sets that are not inclusive enough or unnecessarily large, and thus impedes mimicry of worst cases for robust training. In this paper, we introduce a novel Adversarial Sparse Convex Combination (ASCC) method. We model the word substitution attack space as a convex hull and leverages a regularization term to enforce perturbation towards an actual substitution, thus aligning our modeling better with the discrete textual space. Based on ASCC method, we further propose ASCC-defense, which leverages ASCC to generate worst-case perturbations and incorporates adversarial training towards robustness. 
Experiments show that ASCC-defense outperforms the current state-of-the-arts in terms of robustness on two prevailing NLP tasks, i.e., sentiment analysis and natural language inference, concerning several attacks across multiple model architectures. Besides, we also envision a new class of defense towards robustness in NLP, where our robustly trained word vectors can be plugged into a normally trained model and enforce its robustness without applying any other defense techniques.",/pdf/164becb7cba519983d9f4cb5dfe5a1661e8cfa13.pdf,ICLR,2021,Capture adversarial word substitutions in the vector space using convex hull towards robustness. +Oc-Aedbjq0,t-1PRrFm6pB,1601310000000.0,1614990000000.0,125,Model Compression via Hyper-Structure Network,"[""~Shangqian_Gao1"", ""~Feihu_Huang1"", ""~Heng_Huang1""]","[""Shangqian Gao"", ""Feihu Huang"", ""Heng Huang""]",[],"In this paper, we propose a novel channel pruning method to solve the problem of compression and acceleration of Convolutional Neural Networks (CNNs). Previous channel pruning methods usually ignore the relationships between channels and layers. Many of them parameterize each channel independently by using gates or similar concepts. To fill this gap, a hyper-structure network is proposed to generate the architecture of the main network. Like the existing hypernet, our hyper-structure network can be optimized by regular backpropagation. Moreover, we use a regularization term to specify the computational resource of the compact network. Usually, FLOPs is used as the criterion of computational resource. However, if FLOPs is used in the regularization, it may over penalize early layers. To address this issue, we further introduce learnable layer-wise scaling factors to balance the gradients from different terms, and they can be optimized by hyper-gradient descent. Extensive experimental results on CIFAR-10 and ImageNet show that our method is competitive with state-of-the-art methods. 
",/pdf/61d20f196eae9154df92776570e6ad43a158449a.pdf,ICLR,2021, +B1lKtjA9FQ,r1l1vGj9FQ,1538090000000.0,1545360000000.0,464,Overfitting Detection of Deep Neural Networks without a Hold Out Set,"[""konrad.groh@de.bosch.com""]","[""Konrad Groh""]","[""deep learning"", ""overfitting"", ""generalization"", ""memorization""]","Overfitting is an ubiquitous problem in neural network training and usually mitigated using a holdout data set. +Here we challenge this rationale and investigate criteria for overfitting without using a holdout data set. +Specifically, we train a model for a fixed number of epochs multiple times with varying fractions of randomized labels and for a range of regularization strengths. +A properly trained model should not be able to attain an accuracy greater than the fraction of properly labeled data points. Otherwise the model overfits. +We introduce two criteria for detecting overfitting and one to detect underfitting. We analyze early stopping, the regularization factor, and network depth. +In safety critical applications we are interested in models and parameter settings which perform well and are not likely to overfit. The methods of this paper allow characterizing and identifying such models.",/pdf/60fb95af056f8b89084ffa9e89d25568cdb43f32.pdf,ICLR,2019,We introduce and analyze several criteria for detecting overfitting. +kXwdjtmMbUr,wFwNegyFbqJ,1601310000000.0,1614990000000.0,1167,Practical Evaluation of Out-of-Distribution Detection Methods for Image Classification,"[""~Engkarat_Techapanurak1"", ""~Takayuki_Okatani1""]","[""Engkarat Techapanurak"", ""Takayuki Okatani""]","[""out-of-distribution"", ""novel class detection"", ""domain shift"", ""concept drift""]","We reconsider the evaluation of OOD detection methods for image recognition. Although many studies have been conducted so far to build better OOD detection methods, most of them follow Hendrycks and Gimpel's work for the method of experimental evaluation. 
While the unified evaluation method is necessary for a fair comparison, there is a question of if its choice of tasks and datasets reflect real-world applications and if the evaluation results can generalize to other OOD detection application scenarios. In this paper, we experimentally evaluate the performance of representative OOD detection methods for three scenarios, i.e., irrelevant input detection, novel class detection, and domain shift detection, on various datasets and classification tasks. The results show that differences in scenarios and datasets alter the relative performance among the methods. Our results can also be used as a guide for practitioners for the selection of OOD detection methods.",/pdf/ff35c3294b7f13e051191893cdb813100d2eb15c.pdf,ICLR,2021,This paper provides a practical evaluation of OOD detection methods that is missing in previous studies and will serve as a practitioners' guide. +S1gEFkrtvH,SkgUKNRuDS,1569440000000.0,1577170000000.0,1837,BasisVAE: Orthogonal Latent Space for Deep Disentangled Representation,"[""seago0828@yonsei.ac.kr"", ""sbcho@yonsei.ac.kr""]","[""Jin-Young Kim"", ""Sung-Bae Cho""]","[""variational autoencoder"", ""latent space"", ""basis"", ""disentangled representation""]","The variational autoencoder, one of the generative models, defines the latent space for the data representation, and uses variational inference to infer the posterior probability. Several methods have been devised to disentangle the latent space for controlling the generative model easily. However, due to the excessive constraints, the more disentangled the latent space is, the lower quality the generative model has. A disentangled generative model would allocate a single feature of the generated data to the only single latent variable. In this paper, we propose a method to decompose the latent space into basis, and reconstruct it by linear combination of the latent bases. 
The proposed model called BasisVAE consists of the encoder that extracts the features of data and estimates the coefficients for linear combination of the latent bases, and the decoder that reconstructs the data with the combined latent bases. In this method, a single latent basis is subject to change in a single generative factor, and relatively invariant to the changes in other factors. It maintains the performance while relaxing the constraint for disentanglement on a basis, as we no longer need to decompose latent space on a standard basis. Experiments on the well-known benchmark datasets of MNIST, 3DFaces and CelebA demonstrate the efficacy of the proposed method, compared to other state-of-the-art methods. The proposed model not only defines the latent space to be separated by the generative factors, but also shows the better quality of the generated and reconstructed images. The disentangled representation is verified with the generated images and the simple classifier trained on the output of the encoder.",/pdf/079b41d419bb116ebe66ee31cbc020649feb676e.pdf,ICLR,2020,Construct orthogonal latent space for deep disentangled representation based on a basis in the linear algebra +B1gUn24tPr,Bkl2MP4MPH,1569440000000.0,1577170000000.0,191,Classification Attention for Chinese NER,"[""geyc2@lenovo.com"", ""yangfan24@lenovo.com"", ""yangpei4@lenovo.com""]","[""Yuchen Ge"", ""FanYang"", ""PeiYang""]","[""Chinese NER"", ""NER"", ""tagging"", ""deeplearning"", ""nlp""]","The character-based model, such as BERT, has achieved remarkable success in Chinese named entity recognition (NER). However, such model would likely miss the overall information of the entity words. In this paper, we propose to combine priori entity information with BERT. 
Instead of relying on additional lexicons or pre-trained word embeddings, our model has generated entity classification embeddings directly on the pre-trained BERT, having the merit of increasing model practicability and avoiding OOV problem. Experiments show that our model has achieved state-of-the-art results on 3 Chinese NER datasets.",/pdf/c4df8bbf5593032c160b6e31ed084b4a73f3daba.pdf,ICLR,2020,Classification Attention for Chinese NER +lJgbDxGhJ4r,89ehJsl1nMw,1601310000000.0,1614990000000.0,2540,OpenCoS: Contrastive Semi-supervised Learning for Handling Open-set Unlabeled Data,"[""~Jongjin_Park1"", ""~Sukmin_Yun1"", ""~Jongheon_Jeong1"", ""~Jinwoo_Shin1""]","[""Jongjin Park"", ""Sukmin Yun"", ""Jongheon Jeong"", ""Jinwoo Shin""]","[""semi-supervised learning"", ""realistic semi-supervised learning"", ""class-distribution mismatch"", ""unsupervised learning""]","Modern semi-supervised learning methods conventionally assume both labeled and unlabeled data have the same class distribution. However, unlabeled data may include out-of-class samples in practice; those that cannot have one-hot encoded labels from a closed-set of classes in label data, i.e., unlabeled data is an open-set. In this paper, we introduce OpenCoS, a method for handling this realistic semi-supervised learning scenario based on a recent framework of contrastive learning. One of our key findings is that out-of-class samples in the unlabeled dataset can be identified effectively via (unsupervised) contrastive learning. OpenCoS utilizes this information to overcome the failure modes in the existing state-of-the-art semi-supervised methods, e.g., ReMixMatch or FixMatch. In particular, we propose to assign soft-labels for out-of-class samples using the representation learned from contrastive learning. 
Our extensive experimental results show the effectiveness of OpenCoS, fixing the state-of-the-art semi-supervised methods to be suitable for diverse scenarios involving open-set unlabeled data.",/pdf/0a830e619701eb11127145c9fa2296c943e27f73.pdf,ICLR,2021,"We utilize unsupervised representations to handle realistic semi-supervised learning, where the class distributions of labeled and unlabeled datasets do not match." +rkevMnRqYQ,ByxW5sTcKX,1538090000000.0,1555620000000.0,1273,Preferences Implicit in the State of the World,"[""rohinmshah@berkeley.edu"", ""dmitrii.krasheninnikov@student.uva.nl"", ""jfalex@stanford.edu"", ""pabbeel@cs.berkeley.edu"", ""anca@berkeley.edu""]","[""Rohin Shah"", ""Dmitrii Krasheninnikov"", ""Jordan Alexander"", ""Pieter Abbeel"", ""Anca Dragan""]","[""Preference learning"", ""Inverse reinforcement learning"", ""Inverse optimal stochastic control"", ""Maximum entropy reinforcement learning"", ""Apprenticeship learning""]","Reinforcement learning (RL) agents optimize only the features specified in a reward function and are indifferent to anything left out inadvertently. This means that we must not only specify what to do, but also the much larger space of what not to do. It is easy to forget these preferences, since these preferences are already satisfied in our environment. This motivates our key insight: when a robot is deployed in an environment that humans act in, the state of the environment is already optimized for what humans want. We can therefore use this implicit preference information from the state to fill in the blanks. We develop an algorithm based on Maximum Causal Entropy IRL and use it to evaluate the idea in a suite of proof-of-concept environments designed to show its properties. We find that information from the initial state can be used to infer both side effects that should be avoided as well as preferences for how the environment should be organized. 
Our code can be found at https://github.com/HumanCompatibleAI/rlsp.",/pdf/b7c89c850d43383e90519a89e0d3874084e94aef.pdf,ICLR,2019,"When a robot is deployed in an environment that humans have been acting in, the state of the environment is already optimized for what humans want, and we can use this to infer human preferences." +LxhlyKH6VP,3JsQ5ssgmXK,1601310000000.0,1614990000000.0,2549,ProGAE: A Geometric Autoencoder-based Generative Model for Disentangling Protein Conformational Space,"[""~Norman_Joseph_Tatro1"", ""~Payel_Das1"", ""~Pin-Yu_Chen1"", ""~Vijil_Chenthamarakshan1"", ""~Rongjie_Lai4""]","[""Norman Joseph Tatro"", ""Payel Das"", ""Pin-Yu Chen"", ""Vijil Chenthamarakshan"", ""Rongjie Lai""]","[""generative models"", ""deep learning"", ""interpretability""]","Understanding the protein conformational landscape is critical, as protein function, as well as modulations thereof due to ligand binding or changes in environment, are intimately connected with structural variations. This work focuses on learning a generative neural network on a simulated ensemble of protein structures obtained using molecular simulation to characterize the distinct structural fluctuations of a protein bound to various drug molecules. Specifically, we use a geometric autoencoder framework to learn separate latent space encodings of the intrinsic and extrinsic geometries of the system. For this purpose, the proposed Protein Geometric AutoEncoder (ProGAE) model is trained on the length of the alpha-carbon pseudobonds and the orientation of the backbone bonds of the protein. Using ProGAE latent embeddings, we reconstruct and generate the conformational ensemble of a protein at or near the experimental resolution. Empowered by the disentangled latent space learning, the intrinsic latent embedding help in geometric error correction, whereas the extrinsic latent embedding is successfully used for classification or property prediction of different drugs bound to a specific protein. 
Additionally, ProGAE is able to be transferred to the structures of a different state of the same protein or to a completely different protein of different size, where only the dense layer decoding from the latent representation needs to be retrained. Results show that our geometric learning-based method enjoys both accuracy and efficiency for generating complex structural variations, charting the path toward scalable and improved approaches for analyzing and enhancing molecular simulations.",/pdf/d564c41e04243386d42ac003632202762c114e80.pdf,ICLR,2021,"We introduce ProGAE, a geometric autoencoder for generating the protein conformational space, with separate latent representations of intrinsic and extrinsic geometry." +ByxT7TNFvH,Syem0GbPDH,1569440000000.0,1583910000000.0,467,Semantically-Guided Representation Learning for Self-Supervised Monocular Depth,"[""vitor.guizilini@tri.global"", ""rayhou@umich.edu"", ""jie.li@tri.global"", ""rares.ambrus@tri.global"", ""adrien.gaidon@tri.global""]","[""Vitor Guizilini"", ""Rui Hou"", ""Jie Li"", ""Rares Ambrus"", ""Adrien Gaidon""]","[""computer vision"", ""machine learning"", ""deep learning"", ""monocular depth estimation"", ""self-supervised learning""]","Self-supervised learning is showing great promise for monocular depth estimation, using geometry as the only source of supervision. Depth networks are indeed capable of learning representations that relate visual appearance to 3D properties by implicitly leveraging category-level patterns. In this work we investigate how to leverage more directly this semantic structure to guide geometric representation learning, while remaining in the self-supervised regime. Instead of using semantic labels and proxy losses in a multi-task approach, we propose a new architecture leveraging fixed pretrained semantic segmentation networks to guide self-supervised representation learning via pixel-adaptive convolutions. 
Furthermore, we propose a two-stage training process to overcome a common semantic bias on dynamic objects via resampling. Our method improves upon the state of the art for self-supervised monocular depth prediction over all pixels, fine-grained details, and per semantic categories. +",/pdf/c0fd688272831a614b38082da78c86779a4e9afd.pdf,ICLR,2020,We propose a novel semantically-guided architecture for self-supervised monocular depth estimation +boZj4g3Jocj,#NAME?,1601310000000.0,1614990000000.0,3553,Learning to communicate through imagination with model-based deep multi-agent reinforcement learning,"[""~Arnu_Pretorius1"", ""scott.a.cameron@live.co.uk"", ""dries.epos@gmail.com"", ""elanvanbiljon@gmail.com"", ""l.francis@instadeep.com"", ""f.azeez@instadeep.com"", ""~Alexandre_Laterre1"", ""kb@instadeep.com""]","[""Arnu Pretorius"", ""Scott Cameron"", ""Andries Petrus Smit"", ""Elan van Biljon"", ""Lawrence Francis"", ""Femi Azeez"", ""Alexandre Laterre"", ""Karim Beguir""]",[],"The human imagination is an integral component of our intelligence. Furthermore, the core utility of our imagination is deeply coupled with communication. Language, argued to have been developed through complex interaction within growing collective societies serves as an instruction to the imagination, giving us the ability to share abstract mental representations and perform joint spatiotemporal planning. In this paper, we explore communication through imagination with multi-agent reinforcement learning. Specifically, we develop a model-based approach where agents jointly plan through recurrent communication of their respective predictions of the future. Each agent has access to a learned world model capable of producing model rollouts of future states and predicted rewards, conditioned on the actions sampled from the agent's policy. These rollouts are then encoded into messages and used to learn a communication protocol during training via differentiable message passing. 
We highlight the benefits of our model-based approach, compared to a set of strong baselines, by developing a set of specialised experiments using novel as well as well-known multi-agent environments.",/pdf/8e517b8c1c71632aecca36506992a30f69b4031e.pdf,ICLR,2021, +jEYKjPE1xYN,EO0Hfx7Fpr7r,1601310000000.0,1616080000000.0,1710,Symmetry-Aware Actor-Critic for 3D Molecular Design,"[""~Gregor_N._C._Simm1"", ""~Robert_Pinsler1"", ""gc121@cam.ac.uk"", ""~Jos\u00e9_Miguel_Hern\u00e1ndez-Lobato1""]","[""Gregor N. C. Simm"", ""Robert Pinsler"", ""G\u00e1bor Cs\u00e1nyi"", ""Jos\u00e9 Miguel Hern\u00e1ndez-Lobato""]","[""deep reinforcement learning"", ""molecular design"", ""covariant neural networks""]","Automating molecular design using deep reinforcement learning (RL) has the potential to greatly accelerate the search for novel materials. Despite recent progress on leveraging graph representations to design molecules, such methods are fundamentally limited by the lack of three-dimensional (3D) information. In light of this, we propose a novel actor-critic architecture for 3D molecular design that can generate molecular structures unattainable with previous approaches. This is achieved by exploiting the symmetries of the design process through a rotationally covariant state-action representation based on a spherical harmonics series expansion. 
We demonstrate the benefits of our approach on several 3D molecular design tasks, where we find that building in such symmetries significantly improves generalization and the quality of generated molecules.",/pdf/910f34d86d427e8bfed4af189076b9d497578643.pdf,ICLR,2021,Covariant actor-critic based on spherical harmonics that exploits symmetries to design molecules in 3D +SJG1wjRqFQ,r1eClnNqKQ,1538090000000.0,1545360000000.0,228,Discrete Structural Planning for Generating Diverse Translations,"[""shu@nlab.ci.i.u-tokyo.ac.jp"", ""nakayama@ci.i.u-tokyo.ac.jp""]","[""Raphael Shu"", ""Hideki Nakayama""]","[""machine translation"", ""syntax"", ""diversity"", ""code learning""]","Planning is important for humans when producing complex languages, which is a missing part in current language generation models. In this work, we add a planning phase in neural machine translation to control the global sentence structure ahead of translation. Our approach learns discrete structural representations to encode syntactic information of target sentences. During translation, we can either let beam search to choose the structural codes automatically or specify the codes manually. The word generation is then conditioned on the selected discrete codes. Experiments show that the translation performance remains intact by learning the codes to capture pure structural variations. Through structural planning, we are able to control the global sentence structure by manipulating the codes. By evaluating with a proposed structural diversity metric, we found that the sentences sampled using different codes have much higher diversity scores. In qualitative analysis, we demonstrate that the sampled paraphrase translations have drastically different structures. 
",/pdf/a644831b836e754acab8f4f0216ef6b0830153cd.pdf,ICLR,2019,Learning discrete structural representation to control sentence generation and obtain diverse outputs +rk9eAFcxg,,1478300000000.0,1489040000000.0,525,Variational Recurrent Adversarial Deep Domain Adaptation,"[""spurusho@usc.edu"", ""wcarvalh@usc.edu"", ""nilanon@usc.edu"", ""yanliu.cs@usc.edu""]","[""Sanjay Purushotham"", ""Wilka Carvalho"", ""Tanachat Nilanon"", ""Yan Liu""]","[""Deep learning"", ""Transfer Learning""]","We study the problem of learning domain invariant representations for time series data while transferring the complex temporal latent dependencies between the domains. Our model termed as Variational Recurrent Adversarial Deep Domain Adaptation (VRADA) is built atop a variational recurrent neural network (VRNN) and trains adversarially to capture complex temporal relationships that are domain-invariant. This is (as far as we know) the first to capture and transfer temporal latent dependencies in multivariate time-series data. 
Through experiments on real-world multivariate healthcare time-series datasets, we empirically demonstrate that learning temporal dependencies helps our model's ability to create domain-invariant representations, allowing our model to outperform current state-of-the-art deep domain adaptation approaches.",/pdf/af99f4be8bec43ed9d2d85713b3966173492c634.pdf,ICLR,2017,We propose Variational Recurrent Adversarial Deep Domain Adaptation approach to capture and transfer temporal latent dependencies in multivariate time-series data +N3zUDGN5lO,T_5gXohAjp,1601310000000.0,1615710000000.0,1292,My Body is a Cage: the Role of Morphology in Graph-Based Incompatible Control,"[""~Vitaly_Kurin1"", ""~Maximilian_Igl1"", ""~Tim_Rockt\u00e4schel1"", ""~Wendelin_Boehmer1"", ""~Shimon_Whiteson1""]","[""Vitaly Kurin"", ""Maximilian Igl"", ""Tim Rockt\u00e4schel"", ""Wendelin Boehmer"", ""Shimon Whiteson""]","[""Deep Reinforcement Learning"", ""Multitask Reinforcement Learning"", ""Graph Neural Networks"", ""Continuous Control"", ""Incompatible Environments""]","Multitask Reinforcement Learning is a promising way to obtain models with better performance, generalisation, data efficiency, and robustness. Most existing work is limited to compatible settings, where the state and action space dimensions are the same across tasks. Graph Neural Networks (GNN) are one way to address incompatible environments, because they can process graphs of arbitrary size. They also allow practitioners to inject biases encoded in the structure of the input graph. Existing work in graph-based continuous control uses the physical morphology of the agent to construct the input graph, i.e., encoding limb features as node labels and using edges to connect the nodes if their corresponded limbs are physically connected. +In this work, we present a series of ablations on existing methods that show that morphological information encoded in the graph does not improve their performance. 
Motivated by the hypothesis that any benefits GNNs extract from the graph structure are outweighed by difficulties they create for message passing, we also propose Amorpheus, a transformer-based approach. Further results show that, while Amorpheus ignores the morphological information that GNNs encode, it nonetheless substantially outperforms GNN-based methods.",/pdf/130809fe4dd2abe36b6f0c395f8cc2f51174bc4d.pdf,ICLR,2021,Transformer-based approach to multitask incompatible continuous control inspired by a hypothesis that any benefits GNNs extract from the graph structure are outweighed by difficulties they create for message passing. +Te1aZ2myPIu,iasdKnZlcTA,1601310000000.0,1614990000000.0,3591,Pretrain-to-Finetune Adversarial Training via Sample-wise Randomized Smoothing,"[""~Lei_Wang22"", ""~Runtian_Zhai1"", ""~Di_He1"", ""~Liwei_Wang1"", ""~Li_Jian1""]","[""Lei Wang"", ""Runtian Zhai"", ""Di He"", ""Liwei Wang"", ""Li Jian""]","[""Adversarial Robustness"", ""Provable Adversarial Defense"", ""Sample-wise Randomized Smoothing.""]","Developing certified models that can provably defense adversarial perturbations is important in machine learning security. Recently, randomized smoothing, combined with other techniques (Cohen et al., 2019; Salman et al., 2019), has been shown to be an effective method to certify models under $l_2$ perturbations. Existing work for certifying $l_2$ perturbations added the same level of Gaussian noise to each sample. The noise level determines the trade-off between the test accuracy and the average certified robust radius. We propose to further improve the defense via sample-wise randomized smoothing, which assigns different noise levels to different samples. Specifically, we propose a pretrain-to-finetune framework that first pretrains a model and then adjusts the noise levels for higher performance based on the model’s outputs. For certification, we carefully allocate specific robust regions for each test sample. 
We perform extensive experiments on CIFAR-10 and MNIST datasets and the experimental results demonstrate that our method can achieve better accuracy-robustness trade-off in the transductive setting.",/pdf/c648ae9244ad1c05afa53f1cef53e1e35a4ad5a1.pdf,ICLR,2021,Propose sample-wise randomized smoothing and achieve better accuracy-robustness trade-off. +B1gXR3NtwS,SJl9cYD4Pr,1569440000000.0,1577170000000.0,257,Deep Bayesian Structure Networks,"[""dzj17@mails.tsinghua.edu.cn"", ""luoyc15@mails.tsinghua.edu.cn"", ""dcszj@tsinghua.edu.cn"", ""dcszb@tsinghua.edu.cn""]","[""Zhijie Deng"", ""Yucen Luo"", ""Jun Zhu"", ""Bo Zhang""]",[],"Bayesian neural networks (BNNs) introduce uncertainty estimation to deep networks by performing Bayesian inference on network weights. However, such models bring the challenges of inference, and further BNNs with weight uncertainty rarely achieve superior performance to standard models. In this paper, we investigate a new line of Bayesian deep learning by performing Bayesian reasoning on the structure of deep neural networks. Drawing inspiration from the neural architecture search, we define the network structure as random weights on the redundant operations between computational nodes, and apply stochastic variational inference techniques to learn the structure distributions of networks. Empirically, the proposed method substantially surpasses the advanced deep neural networks across a range of classification and segmentation tasks. More importantly, our approach also preserves benefits of Bayesian principles, producing improved uncertainty estimation than the strong baselines including MC dropout and variational BNNs algorithms (e.g. noisy EK-FAC). 
",/pdf/3b3e710f54097dd28bf5b5ee0533bd46020c2d71.pdf,ICLR,2020, +lxHgXYN4bwl,kTGTyGnOcH7,1601310000000.0,1616850000000.0,3094,Expressive Power of Invariant and Equivariant Graph Neural Networks,"[""waiss.azizian@ens.fr"", ""~marc_lelarge1""]","[""Waiss Azizian"", ""marc lelarge""]","[""Graph Neural Network"", ""Universality"", ""Approximation""]","Various classes of Graph Neural Networks (GNN) have been proposed and shown to be successful in a wide range of applications with graph structured data. In this paper, we propose a theoretical framework able to compare the expressive power of these GNN architectures. The current universality theorems only apply to intractable classes of GNNs. Here, we prove the first approximation guarantees for practical GNNs, paving the way for a better understanding of their generalization. Our theoretical results are proved for invariant GNNs computing a graph embedding (permutation of the nodes of the input graph does not affect the output) and equivariant GNNs computing an embedding of the nodes (permutation of the input permutes the output). We show that Folklore Graph Neural Networks (FGNN), which are tensor based GNNs augmented with matrix multiplication are the most expressive architectures proposed so far for a given tensor order. We illustrate our results on the Quadratic Assignment Problem (a NP-Hard combinatorial problem) by showing that FGNNs are able to learn how to solve the problem, leading to much better average performances than existing algorithms (based on spectral, SDP or other GNNs architectures). On a practical side, we also implement masked tensors to handle batches of graphs of varying sizes. 
",/pdf/21b5cfad2a0412a534b651374073f23af8310688.pdf,ICLR,2021, +9nIulvlci5,W3Kto-_Pzp,1601310000000.0,1614990000000.0,3347,Neural Random Projection: From the Initial Task To the Input Similarity Problem,"[""~Alan_Savushkin1"", ""nikita.benkovich@kaspersky.com"", ""dmitry.s.golubev@kaspersky.com""]","[""Alan Savushkin"", ""Nikita Benkovich"", ""Dmitry Golubev""]",[],"The data representation plays an important role in evaluating similarity between objects. In this paper, we propose a novel approach for implicit data representation to evaluate similarity of input data using a trained neural network. In contrast to the previous approach, which uses gradients for representation, we utilize only the outputs from the last hidden layer of a neural network and do not use a backward step. The proposed technique explicitly takes into account the initial task and significantly reduces the size of the vector representation, as well as the computation time. Generally, a neural network obtains representations related only to the problem being solved, which makes the last hidden layer representation useless for input similarity task. +In this paper, we consider two reasons for the decline in the quality of representations: correlation between neurons and insufficient size of the last hidden layer. To reduce the correlation between neurons we use orthogonal weight initialization for each layer and modify the loss function to ensure orthogonality of the weights during training. Moreover, we show that activation functions can potentially increase correlation. To solve this problem, we apply modified Batch-Normalization with Dropout. Using orthogonal weight matrices allow us to consider such neural networks as an application of the Random Projection method and get a lower bound estimate for the size of the last hidden layer. We perform experiments on MNIST and physical examination datasets. 
In both experiments, initially, we split a set of labels into two disjoint subsets to train a neural network for binary classification problem, and then use this model to measure similarity between input data and define hidden classes. We also cluster the inputs to evaluate how well objects from the same hidden class are grouped together. Our experimental results show that the proposed approach achieves competitive results on the input similarity task while reducing both computation time and the size of the input representation.",/pdf/8c462fcb2dd146516c4c9c7cb36b2267988cbee1.pdf,ICLR,2021,A neural network from the Random Projection perspective +#NAME?,CXY78YpSSX,1601310000000.0,1614990000000.0,2277,SSW-GAN: Scalable Stage-wise Training of Video GANs,"[""~Lluis_Castrejon1"", ""~Nicolas_Ballas1"", ""~Aaron_Courville3""]","[""Lluis Castrejon"", ""Nicolas Ballas"", ""Aaron Courville""]","[""video generation"", ""GANs"", ""scalable methods""]","Current state-of-the-art generative models for videos have high computational requirements that impede high resolution generations beyond a few frames. In this work we propose a stage-wise strategy to train Generative Adversarial Networks (GANs) for videos. We decompose the generative process to first produce a downsampled video that is then spatially upscaled and temporally interpolated by subsequent stages. Upsampling stages are applied locally on temporal chunks of previous outputs to manage the computational complexity. Stages are defined as Generative Adversarial Networks, which are trained sequentially and independently. 
We validate our approach on Kinetics-600 and BDD100K, for which we train a three stage model capable of generating 128x128 videos with 100 frames.",/pdf/d48909e8a9644b69d2eb51496566c2c2fafc153e.pdf,ICLR,2021,"We propose a scalable methods to generate videos by training a GAN to produce a low resolution and temporally subsampled version of a video, which is then upsampled by one or more local upsampling stages." +Byx91R4twB,SyxPyhzOvB,1569440000000.0,1577170000000.0,902,Adversarial Video Generation on Complex Datasets,"[""aidanclark@google.com"", ""jeffdonahue@google.com"", ""simonyan@google.com""]","[""Aidan Clark"", ""Jeff Donahue"", ""Karen Simonyan""]","[""GAN"", ""generative model"", ""generative adversarial network"", ""video prediction""]","Generative models of natural images have progressed towards high fidelity samples by the strong leveraging of scale. We attempt to carry this success to the field of video modeling by showing that large Generative Adversarial Networks trained on the complex Kinetics-600 dataset are able to produce video samples of substantially higher complexity and fidelity than previous work. Our proposed model, Dual Video Discriminator GAN (DVD-GAN), scales to longer and higher resolution videos by leveraging a computationally efficient decomposition of its discriminator. We evaluate on the related tasks of video synthesis and video prediction, and achieve new state-of-the-art Fréchet Inception Distance for prediction for Kinetics-600, as well as state-of-the-art Inception Score for synthesis on the UCF-101 dataset, alongside establishing a strong baseline for synthesis on Kinetics-600.",/pdf/e5b58cde320ea21c65a48c8bf18e402a1b26cb28.pdf,ICLR,2020,"We propose DVD-GAN, a large video generative model that is state of the art on several tasks and produces highly complex videos when trained on large real world datasets." 
+dgd4EJqsbW5,wH2ayRBbJQn,1601310000000.0,1616030000000.0,2381,Control-Aware Representations for Model-based Reinforcement Learning,"[""bcui@fb.com"", ""~Yinlam_Chow1"", ""~Mohammad_Ghavamzadeh2""]","[""Brandon Cui"", ""Yinlam Chow"", ""Mohammad Ghavamzadeh""]",[],"A major challenge in modern reinforcement learning (RL) is efficient control of dynamical systems from high-dimensional sensory observations. Learning controllable embedding (LCE) is a promising approach that addresses this challenge by embedding the observations into a lower-dimensional latent space, estimating the latent dynamics, and utilizing it to perform control in the latent space. Two important questions in this area are how to learn a representation that is amenable to the control problem at hand, and how to achieve an end-to-end framework for representation learning and control. In this paper, we take a few steps towards addressing these questions. We first formulate an LCE model to learn representations that are suitable to be used by a policy iteration style algorithm in the latent space. We call this model control-aware representation learning (CARL). We derive a loss function and three implementations for CARL. In the offline implementation, we replace the locally-linear control algorithm (e.g., iLQR) used by the existing LCE methods with an RL algorithm, namely model-based soft actor-critic, and show that it results in significant improvement. In online CARL, we interleave representation learning and control, and demonstrate further gain in performance. Finally, we propose value-guided CARL, a variation in which we optimize a weighted version of the CARL loss function, where the weights depend on the TD-error of the current policy. 
We evaluate the proposed algorithms by extensive experiments on benchmark tasks and compare them with several LCE baselines.",/pdf/f0d80d862dab33f2ed69b44a0f14fda119006af8.pdf,ICLR,2021, +rkewaxrtvr,HkxQ7WWFvH,1569440000000.0,1577170000000.0,2577,Privacy-preserving Representation Learning by Disentanglement,"[""tassilo.klein@sap.com"", ""m.nabi@sap.com""]","[""Tassilo Klein"", ""Moin Nabi""]",[],"Deep learning and latest machine learning technology heralded an era of success in data analysis. Accompanied by the ever increasing performance, reaching super-human performance in many areas, is the requirement of amassing more and more data to train these models. Often ignored or underestimated, the big data curation is associated with the risk of privacy leakages. The proposed approach seeks to mitigate these privacy issues. In order to sanitize data from sensitive content, we propose to learn a privacy-preserving data representation by disentangling into public and private parts, with the public part being shareable without privacy infringement. The proposed approach deals with the setting where the private features are not explicit, and are estimated through the course of learning. This is particularly appealing, when the notion of sensitive attribute is ``fuzzy''. We showcase feasibility in terms of classification of facial attributes and identity on the CelebA dataset. The results suggest that the private component can be removed in the cases where the downstream task is known a priori (i.e., ``supervised''), and the case where it is not known a priori (i.e., ``weakly-supervised'').",/pdf/2d3671fc19158fe4b2bb6ee7916651d149a92993.pdf,ICLR,2020, +OBI5QuStBz3,jmYP4xLYhB,1601310000000.0,1614990000000.0,2018,Improved Communication Lower Bounds for Distributed Optimisation,"[""~Janne_H._Korhonen2"", ""~Dan_Alistarh7""]","[""Janne H. 
Korhonen"", ""Dan Alistarh""]","[""distributed optimization"", ""lower bounds"", ""upper bounds"", ""communication complexity""]","Motivated by the interest in communication-efficient methods for distributed machine learning, we consider the communication complexity of minimising a sum of $d$-dimensional functions $\sum_{i = 1}^N f_i (x)$, where each function $f_i$ is held by one of the $N$ different machines. Such tasks arise naturally in large-scale optimisation, where a standard solution is to apply variants of (stochastic) gradient descent. As our main result, we show that $\Omega( Nd \log d / \varepsilon)$ bits in total need to be communicated between the machines to find an additive $\epsilon$-approximation to the minimum of $\sum_{i = 1}^N f_i (x)$. The results holds for deterministic algorithms, and randomised algorithms under some restrictions on the parameter values. Importantly, our lower bounds require no assumptions on the structure of the algorithm, and are matched within constant factors for strongly convex objectives by a new variant of quantised gradient descent. The lower bounds are obtained by bringing over tools from communication complexity to distributed optimisation, an approach we hope will find further use in future. +",/pdf/9e95755f92176d8519fd749f94488cf5454a7507.pdf,ICLR,2021,We give the first tight bounds for the communication complexity of optimizing a sum of quadratic functions in a distributed setting; the result has non-trivial extensions and implications for the fundamental limits of distributed optimization. 
+whE31dn74cL,QoYuLS2GLZB,1601310000000.0,1616040000000.0,2930,A Temporal Kernel Approach for Deep Learning with Continuous-time Information,"[""~Da_Xu2"", ""~Chuanwei_Ruan1"", ""~Evren_Korpeoglu1"", ""~Sushant_Kumar1"", ""~Kannan_Achan1""]","[""Da Xu"", ""Chuanwei Ruan"", ""Evren Korpeoglu"", ""Sushant Kumar"", ""Kannan Achan""]","[""Kernel Learning"", ""Continuous-time System"", ""Spectral Distribution"", ""Random Feature"", ""Reparameterization"", ""Learning Theory""]","Sequential deep learning models such as RNN, causal CNN and attention mechanism do not readily consume continuous-time information. Discretizing the temporal data, as we show, causes inconsistency even for simple continuous-time processes. Current approaches often handle time in a heuristic manner to be consistent with the existing deep learning architectures and implementations. In this paper, we provide a principled way to characterize continuous-time systems using deep learning tools. Notably, the proposed approach applies to all the major deep learning architectures and requires little modifications to the implementation. The critical insight is to represent the continuous-time system by composing neural networks with a temporal kernel, where we gain our intuition from the recent advancements in understanding deep learning with Gaussian process and neural tangent kernel. To represent the temporal kernel, we introduce the random feature approach and convert the kernel learning problem to spectral density estimation under reparameterization. We further prove the convergence and consistency results even when the temporal kernel is non-stationary, and the spectral density is misspecified. 
The simulations and real-data experiments demonstrate the empirical effectiveness of our temporal kernel approach in a broad range of settings.",/pdf/40fc3e707f1a7db2333d5459c3b472809d4e33c1.pdf,ICLR,2021,We propose a temporal kernel learning approach based on random features and reparameterization to characterize the continuous-time information in deep learning models. +cxRUccyjw0S,b3k4A-m8TCg,1601310000000.0,1614990000000.0,582,Learning Disentangled Representations for Image Translation,"[""~Aviv_Gabbay1"", ""~Yedid_Hoshen3""]","[""Aviv Gabbay"", ""Yedid Hoshen""]","[""disentanglement"", ""image translation"", ""latent optimization""]","Recent approaches for unsupervised image translation are strongly reliant on generative adversarial training and architectural locality constraints. Despite their appealing results, it can be easily observed that the learned class and content representations are entangled which often hurts the translation performance. To this end, we propose OverLORD, for learning disentangled representations for the image class and attributes, utilizing latent optimization and carefully designed content and style bottlenecks. We further argue that the commonly used adversarial optimization can be decoupled from representation disentanglement and be applied at a later stage of the training to increase the perceptual quality of the generated images. 
Based on these principles, our model learns significantly more disentangled representations and achieves higher translation quality and greater output diversity than state-of-the-art methods.",/pdf/8c2888067c2f697b9e99a4f975b9b60942696ec2.pdf,ICLR,2021,A disentanglement method for high-fidelity image translation +Hkl_bCVKDr,HJlwbLN_PS,1569440000000.0,1577170000000.0,971,Scaleable input gradient regularization for adversarial robustness,"[""christopher.finlay@mail.mcgill.ca"", ""adam.oberman@mcgill.ca""]","[""Chris Finlay"", ""Adam M Oberman""]","[""adversarial robustness"", ""gradient regularization"", ""robust certification"", ""robustness bounds""]","In this work we revisit gradient regularization for adversarial robustness with some new ingredients. First, we derive new per-image theoretical robustness bounds based on local gradient information. These bounds strongly motivate input gradient regularization. Second, we implement a scaleable version of input gradient regularization which avoids double backpropagation: adversarially robust ImageNet models are trained in 33 hours on four consumer grade GPUs. Finally, we show experimentally and through theoretical certification that input gradient regularization is competitive with adversarial training. 
Moreover we demonstrate that gradient regularization does not lead to gradient obfuscation or gradient masking.",/pdf/75100d74283a226b21efd04803d219f53b015e2f.pdf,ICLR,2020,New robust certification bounds motivate gradient regularization for adversarial robustness +S1xXiREKDB,SyemFWtdwH,1569440000000.0,1577170000000.0,1313,Adversarial training with perturbation generator networks,"[""hyungil0113@snu.ac.kr"", ""yubise7en@snu.ac.kr"", ""junglee@snu.ac.kr""]","[""Hyeungill Lee"", ""Sungyeob Han"", ""Jungwoo Lee""]","[""Adversarial training"", ""Generative model"", ""Adaptive perturbation generator"", ""Robust optimization""]","Despite the remarkable development of recent deep learning techniques, neural networks are still vulnerable to adversarial attacks, i.e., methods that fool the neural networks with perturbations that are too small for human eyes to perceive. Many adversarial training methods were introduced to solve this problem, using adversarial examples as training data. However, the adversarial attack methods used in these techniques are fixed, making the model stronger only to attacks used in training, which is widely known as an overfitting problem. In this paper, we suggest a novel adversarial training approach. In addition to the classifier, our method adds another neural network that generates the most effective adversarial perturbation by finding the weakness of the classifier. This perturbation generator network is trained to produce perturbations that maximize the loss function of the classifier, and these adversarial examples train the classifier with a true label. In short, the two networks compete with each other, performing a minimax game. In this scenario, attack patterns created by the generator network are adaptively altered to the classifier, mitigating the overfitting problem mentioned above. We theoretically proved that our minimax optimization problem is equivalent to minimizing the adversarial loss after all. 
Beyond this, we proposed an evaluation method that could accurately compare a wide-range of adversarial algorithms. Experiments with various datasets show that our method outperforms conventional adversarial algorithms. ",/pdf/2b0d1bf7a636aa5137e020fa00b8ff68335b4f8a.pdf,ICLR,2020,We proposed the adaptive adversarial training algorithm with learnable perturbation generator networks. +SJRpRfKxx,,1478210000000.0,1486810000000.0,83,Recurrent Mixture Density Network for Spatiotemporal Visual Attention,"[""loris.bazzani@gmail.com"", ""hugo.larochelle@usherbrooke.ca"", ""lt@dartmouth.edu""]","[""Loris Bazzani"", ""Hugo Larochelle"", ""Lorenzo Torresani""]","[""Computer vision"", ""Deep learning"", ""Applications""]","In many computer vision tasks, the relevant information to solve the problem at hand is mixed to irrelevant, distracting information. This has motivated researchers to design attentional models that can dynamically focus on parts of images or videos that are salient, e.g., by down-weighting irrelevant pixels. In this work, we propose a spatiotemporal attentional model that learns where to look in a video directly from human fixation data. We model visual attention with a mixture of Gaussians at each frame. This distribution is used to express the probability of saliency for each pixel. Time consistency in videos is modeled hierarchically by: 1) deep 3D convolutional features to represent spatial and short-term time relations and 2) a long short-term memory network on top that aggregates the clip-level representation of sequential clips and therefore expands the temporal domain from few frames to seconds. The parameters of the proposed model are optimized via maximum likelihood estimation using human fixations as training data, without knowledge of the action in each video. Our experiments on Hollywood2 show state-of-the-art performance on saliency prediction for video. 
We also show that our attentional model trained on Hollywood2 generalizes well to UCF101 and it can be leveraged to improve action classification accuracy on both datasets.",/pdf/581303df88ca3bbe6ae8ed1186fcbb3aaa8a6356.pdf,ICLR,2017, +r1SuFjkRW,HJ4uYjyAb,1509040000000.0,1518730000000.0,158,Discrete Sequential Prediction of Continuous Actions for Deep RL,"[""lmetz@google.com"", ""julianibarz@google.com"", ""njaitly@google.com"", ""jcdavidson@google.com""]","[""Luke Metz"", ""Julian Ibarz"", ""Navdeep Jaitly"", ""James Davidson""]","[""Reinforcement learning"", ""continuous control"", ""deep learning""]","It has long been assumed that high dimensional continuous control problems cannot be solved effectively by discretizing individual dimensions of the action space due to the exponentially large number of bins over which policies would have to be learned. In this paper, we draw inspiration from the recent success of sequence-to-sequence models for structured prediction problems to develop policies over discretized spaces. Central to this method is the realization that complex functions over high dimensional spaces can be modeled by neural networks that predict one dimension at a time. Specifically, we show how Q-values and policies over continuous spaces can be modeled using a next step prediction model over discretized dimensions. With this parameterization, it is possible to both leverage the compositional structure of action spaces during learning, as well as compute maxima over action spaces (approximately). On a simple example task we demonstrate empirically that our method can perform global search, which effectively gets around the local optimization issues that plague DDPG. 
We apply the technique to off-policy (Q-learning) methods and show that our method can achieve the state-of-the-art for off-policy methods on several continuous control tasks.",/pdf/2657e56090af97cd72a725e25af9a509cee17281.pdf,ICLR,2018,A method to do Q-learning on continuous action spaces by predicting a sequence of discretized 1-D actions. +HowQIZwD_42,yWO_nw3cN79,1601310000000.0,1614990000000.0,909,Measuring and Harnessing Transference in Multi-Task Learning,"[""~Chris_Fifty1"", ""~Ehsan_Amid1"", ""~Zhe_Zhao3"", ""~Tianhe_Yu1"", ""~Rohan_Anil1"", ""~Chelsea_Finn1""]","[""Chris Fifty"", ""Ehsan Amid"", ""Zhe Zhao"", ""Tianhe Yu"", ""Rohan Anil"", ""Chelsea Finn""]","[""multitask learning""]","Multi-task learning can leverage information learned by one task to benefit the training of other tasks. Despite this capacity, naive formulations often degrade performance and in particular, identifying the tasks that would benefit from co-training remains a challenging design question. In this paper, we analyze the dynamics of information transfer, or transference, across tasks throughout training. Specifically, we develop a similarity measure that can quantify transference among tasks and use this quantity to both better understand the optimization dynamics of multi-task learning as well as improve overall learning performance. In the latter case, we propose two methods to leverage our transference metric. The first operates at a macro-level by selecting which tasks should train together while the second functions at a micro-level by determining how to combine task gradients at each training step. 
We find these methods can lead to significant improvement over prior work on three supervised multi-task learning benchmarks and one multi-task reinforcement learning paradigm.",/pdf/3f2205f7177ea1fd65d8a29a83f8ab595bc22f96.pdf,ICLR,2021,Quantifying information transfer in multi-task learning and leveraging this measure to determine task groupings and improve learning efficiency. +xW9zZm9qK0_,TZvL6mxfHpQ,1601310000000.0,1614990000000.0,1431,Class2Simi: A New Perspective on Learning with Label Noise,"[""~Songhua_Wu1"", ""~Xiaobo_Xia1"", ""~Tongliang_Liu1"", ""~Bo_Han1"", ""~Mingming_Gong1"", ""~Nannan_Wang1"", ""haifeng@leinao.ai"", ""~Gang_Niu1""]","[""Songhua Wu"", ""Xiaobo Xia"", ""Tongliang Liu"", ""Bo Han"", ""Mingming Gong"", ""Nannan Wang"", ""Haifeng Liu"", ""Gang Niu""]",[],"Label noise is ubiquitous in the era of big data. Deep learning algorithms can easily fit the noise and thus cannot generalize well without properly modeling the noise. In this paper, we propose a new perspective on dealing with label noise called ``\textit{Class2Simi}''. Specifically, we transform the training examples with noisy class labels into pairs of examples with noisy similarity labels, and propose a deep learning framework to learn robust classifiers with the noisy similarity labels. Note that a class label shows the class that an instance belongs to; while a similarity label indicates whether or not two instances belong to the same class. It is worthwhile to perform the transformation: We prove that the noise rate for the noisy similarity labels is lower than that of the noisy class labels, because similarity labels themselves are robust to noise. For example, given two instances, even if both of their class labels are incorrect, their similarity label could be correct. 
Due to the lower noise rate, Class2Simi achieves remarkably better classification accuracy than its baselines that directly deal with the noisy class labels.",/pdf/5196043e6871969a64d1fb33e8cbfa8766063cdc.pdf,ICLR,2021, +ByxGSsR9FQ,rkgApEZMYQ,1538090000000.0,1549400000000.0,67,L2-Nonexpansive Neural Networks,"[""qianhaifeng@us.ibm.com"", ""wegman@us.ibm.com""]","[""Haifeng Qian"", ""Mark N. Wegman""]","[""adversarial defense"", ""regularization"", ""robustness"", ""generalization""]","This paper proposes a class of well-conditioned neural networks in which a unit amount of change in the inputs causes at most a unit amount of change in the outputs or any of the internal layers. We develop the known methodology of controlling Lipschitz constants to realize its full potential in maximizing robustness, with a new regularization scheme for linear layers, new ways to adapt nonlinearities and a new loss function. With MNIST and CIFAR-10 classifiers, we demonstrate a number of advantages. Without needing any adversarial training, the proposed classifiers exceed the state of the art in robustness against white-box L2-bounded adversarial attacks. They generalize better than ordinary networks from noisy data with partially random labels. Their outputs are quantitatively meaningful and indicate levels of confidence and generalization, among other desirable properties.",/pdf/782b0c8dfa7e25840d0a1fb44d7161a9f6fb83ba.pdf,ICLR,2019, +B1TTpYKgx,,1478230000000.0,1482080000000.0,115,On the Expressive Power of Deep Neural Networks,"[""maithrar@gmail.com"", ""benmpoole@gmail.com"", ""kleinber@cs.cornell.edu"", ""sganguli@stanford.edu"", ""jaschasd@google.com""]","[""Maithra Raghu"", ""Ben Poole"", ""Jon Kleinberg"", ""Surya Ganguli"", ""Jascha Sohl-Dickstein""]","[""Theory"", ""Deep learning""]","We study the expressive power of deep neural networks before and after +training. 
Considering neural nets after random initialization, we show that +three natural measures of expressivity all display an exponential dependence +on the depth of the network. We prove, theoretically and experimentally, +that all of these measures are in fact related to a fourth quantity, trajectory +length. This quantity grows exponentially in the depth of the network, and +is responsible for the depth sensitivity observed. These results translate +to consequences for networks during and after training. They suggest that +parameters earlier in a network have greater influence on its expressive power +– in particular, given a layer, its influence on expressivity is determined by +the remaining depth of the network after that layer. This is verified with +experiments on MNIST and CIFAR-10. We also explore the effect of training +on the input-output map, and find that it trades off between the stability +and expressivity of the input-output map.",/pdf/13029099cee636443454372dd3471bf701e74616.pdf,ICLR,2017,"Derives and explains the exponential depth sensitivity of different expressivity measures for deep neural networks, and explores consequences during and after training. " +SklwGlHFvH,S1g1DMeKwH,1569440000000.0,1577170000000.0,2177,Learning Curves for Deep Neural Networks: A field theory perspective,"[""omrycohen.38.talpiot@gmail.com"", ""or.malka@mail.huji.ac.il"", ""zohar.ringel@mail.huji.ac.il""]","[""Omry Cohen"", ""Or Malka"", ""Zohar Ringel""]","[""Gaussian Processes"", ""Neural Tangent Kernel"", ""Learning Curves"", ""Field Theory"", ""Statistical Mechanics"", ""Generalization"", ""Deep neural networks""]","A series of recent works established a rigorous correspondence between very wide deep neural networks (DNNs), trained in a particular manner, and noiseless Bayesian Inference with a certain Gaussian Process (GP) known as the Neural Tangent Kernel (NTK). 
Here we extend a known field-theory formalism for GP inference to get a detailed understanding of learning-curves in DNNs trained in the regime of this correspondence (NTK regime). In particular, a renormalization-group approach is used to show that noiseless GP inference using NTK, which lacks a good analytical handle, can be well approximated by noisy GP inference on a related kernel we call the renormalized NTK. Following this, a perturbation-theory analysis is carried in one over the dataset-size yielding analytical expressions for the (fixed-teacher/fixed-target) leading and sub-leading asymptotics of the learning curves. At least for uniform datasets, a coherent picture emerges wherein fully-connected DNNs have a strong implicit bias towards functions which are low order polynomials of the input. ",/pdf/405c2c3ec9b9efe880152c567516e6e81339df63.pdf,ICLR,2020,Nice and accurate predictions for DNN learning curves using a novel field theory approach +HJlY0jA5F7,B1g7yV69tX,1538090000000.0,1545360000000.0,912,Improving Sample-based Evaluation for Generative Adversarial Networks,"[""b1ueber2y@gmail.com"", ""wei-y15@mails.tsinghua.edu.cn"", ""lujiwen@tsinghua.edu.cn"", ""jzhou@tsinghua.edu.cn""]","[""Shaohui Liu*"", ""Yi Wei*"", ""Jiwen Lu"", ""Jie Zhou""]",[],"In this paper, we propose an improved quantitative evaluation framework for Generative Adversarial Networks (GANs) on generating domain-specific images, where we improve conventional evaluation methods on two levels: the feature representation and the evaluation metric. Unlike most existing evaluation frameworks which transfer the representation of ImageNet inception model to map images onto the feature space, our framework uses a specialized encoder to acquire fine-grained domain-specific representation. Moreover, for datasets with multiple classes, we propose Class-Aware Frechet Distance (CAFD), which employs a Gaussian mixture model on the feature space to better fit the multi-manifold feature distribution. 
Experiments and analysis on both the feature level and the image level were conducted to demonstrate improvements of our proposed framework over the recently proposed state-of-the-art FID method. To the best of our knowledge, we are the first to provide counter examples where FID gives inconsistent results with human judgments. It is shown in the experiments that our framework is able to overcome the shortcomings of FID and improve robustness. Code will be made available.",/pdf/52043f6188abbdf8209cfbc0a27b9615698a1ee1.pdf,ICLR,2019,This paper improves existing sample-based evaluation for GANs and contains some insightful experiments. +Hygm8jC9FQ,r1eCVibOYQ,1538090000000.0,1545360000000.0,158,FAVAE: SEQUENCE DISENTANGLEMENT USING INFORMATION BOTTLENECK PRINCIPLE,"[""yamada0224@gmail.com"", ""h-kim@isi.imi.i.u-tokyo.ac.jp"", ""miyoshi@narr.jp"", ""hiroshi_yamakawa@dwango.co.jp""]","[""Masanori Yamada"", ""Kim Heecheol"", ""Kosuke Miyoshi"", ""Hiroshi Yamakawa""]","[""disentangled representation learning""]","A state-of-the-art generative model, a ”factorized action variational autoencoder (FAVAE),” is presented for learning disentangled and interpretable representations from sequential data via the information bottleneck without supervision. The purpose of disentangled representation learning is to obtain interpretable and transferable representations from data. We focused on the disentangled representation of sequential data because there is a wide range of potential applications if disentanglement representation is extended to sequential data such as video, speech, and stock price data. Sequential data is characterized by dynamic factors and static factors: dynamic factors are time-dependent, and static factors are independent of time. Previous works succeed in disentangling static factors and dynamic factors by explicitly modeling the priors of latent variables to distinguish between static and dynamic factors. 
However, this model cannot disentangle representations between dynamic factors, such as disentangling ”picking” and ”throwing” in robotic tasks. In this paper, we propose a new model that can disentangle multiple dynamic factors. Since our method does not require modeling priors, it is capable of disentangling ”between” dynamic factors. In experiments, we show that FAVAE can extract the disentangled dynamic factors.",/pdf/a004e4cd56b0a7d6abe227b276f3fcf920f51763.pdf,ICLR,2019,We propose a new model that can disentangle multiple dynamic factors in sequential data +BygPO2VKPH,Bylj1LLBUH,1569440000000.0,1583910000000.0,44,Sparse Coding with Gated Learned ISTA,"[""wukl14@mails.tsinghua.edu.cn"", ""guoyiwen.ai@bytedance.com"", ""liza19@mails.tsinghua.edu.cn"", ""zcs@mail.tsinghua.edu.cn""]","[""Kailun Wu"", ""Yiwen Guo"", ""Ziang Li"", ""Changshui Zhang""]","[""Sparse coding"", ""deep learning"", ""learned ISTA"", ""convergence analysis""]","In this paper, we study the learned iterative shrinkage thresholding algorithm (LISTA) for solving sparse coding problems. Following assumptions made by prior works, we first discover that the code components in its estimations may be lower than expected, i.e., require gains, and to address this problem, a gated mechanism amenable to theoretical analysis is then introduced. Specific design of the gates is inspired by convergence analyses of the mechanism and hence its effectiveness can be formally guaranteed. In addition to the gain gates, we further introduce overshoot gates for compensating insufficient step size in LISTA. Extensive empirical results confirm our theoretical findings and verify the effectiveness of our method.",/pdf/3af900efdd403a600321a3e02a64f2514b34039f.pdf,ICLR,2020,"We propose gated mechanisms to enhance learned ISTA for sparse coding, with theoretical guarantees on the superiority of the method. 
" +ysXk8cCHcQN,48EGuX2yJP_,1601310000000.0,1614990000000.0,3372,Fast 3D Acoustic Scattering via Discrete Laplacian Based Implicit Function Encoders,"[""~Hsien-Yu_Meng1"", ""~Zhenyu_Tang4"", ""~Dinesh_Manocha2""]","[""Hsien-Yu Meng"", ""Zhenyu Tang"", ""Dinesh Manocha""]","[""wave equation"", ""wave acoustics"", ""geometric deep learning"", ""sound simulation"", ""shape laplacian""]","Acoustic properties of objects corresponding to scattering characteristics are frequently used for 3D audio content creation, environmental acoustic effects, localization and acoustic scene analysis, etc. The numeric solvers used to compute these acoustic properties are too slow for interactive applications. We present a novel geometric deep learning algorithm based on discrete-laplacian and implicit encoders to compute these characteristics for rigid or deformable objects at interactive rates. We use a point cloud approximation of each object, and each point is encoded in a high-dimensional latent space. Our multi-layer network can accurately estimate these acoustic properties for arbitrary topologies and takes less than 1ms per object on a NVIDIA GeForce RTX 2080 Ti GPU. We also prove that our learning method is permutation and rotation invariant and demonstrate high accuracy on objects that are quite different from the training data. We highlight its application to generating environmental acoustic effects in dynamic environments.",/pdf/7dbb05e410b1056ebc7bd985d82428592995b6af.pdf,ICLR,2021,We use a novel neural network to efficiently predict the acoustic scattering effect from 3D objects. 
+PcUprce4TM2,BW5IFXbQtnZ,1601310000000.0,1614990000000.0,995,CAFE: Catastrophic Data Leakage in Federated Learning,"[""jinxiao96@gmail.com"", ""du461007169@gmail.com"", ""~Pin-Yu_Chen1"", ""~Tianyi_Chen1""]","[""Xiao Jin"", ""Ruijie Du"", ""Pin-Yu Chen"", ""Tianyi Chen""]",[],"Private training data can be leaked through the gradient sharing mechanism deployed in machine learning systems, such as federated learning (FL). +Increasing batch size is often viewed as a promising defense strategy against data leakage. In this paper, we revisit this defense premise and propose an advanced data leakage attack to efficiently recover batch data from the shared aggregated gradients. +We name our proposed method as \textit{\underline{c}atastrophic d\underline{a}ta leakage in \underline{f}ederated l\underline{e}arning (CAFE)}. +Comparing to existing data leakage attacks, CAFE demonstrates the ability to perform large-batch data leakage attack with high data recovery quality. +Experimental results on vertical and horizontal FL settings have validated the effectiveness of CAFE in recovering private data from the shared aggregated gradients. +Our results suggest that data participated in FL, especially the vertical case, have a high risk of being leaked from the training gradients. Our analysis implies unprecedented and practical data leakage risks in those learning settings.",/pdf/2510de287f2987580e93df8c54d9ca3ad920e85d.pdf,ICLR,2021, +pAq1h9sQhqd,lzao_xBcvG,1601310000000.0,1614990000000.0,131,Stochastic Canonical Correlation Analysis: A Riemannian Approach,"[""~Zihang_Meng1"", ""~Rudrasis_Chakraborty1"", ""~Vikas_Singh1""]","[""Zihang Meng"", ""Rudrasis Chakraborty"", ""Vikas Singh""]","[""CCA"", ""streaming"", ""differential geometry"", ""DeepCCA"", ""fairness""]"," We present an efficient stochastic algorithm (RSG+) for canonical correlation analysis (CCA) derived via a differential geometric perspective of the underlying optimization task. 
We show that exploiting the Riemannian structure of the problem reveals natural strategies for modified forms of manifold stochastic gradient descent schemes that have been variously used in the literature for numerical optimization on manifolds. Our developments complement existing methods for this problem which either require $O(d^3)$ time complexity per iteration with $O(\frac{1}{\sqrt{t}})$ convergence rate (where $d$ is the dimensionality) or only extract the top $1$ component with $O(\frac{1}{t})$ convergence rate. In contrast, our algorithm achieves $O(d^2k)$ runtime complexity per iteration for extracting top $k$ canonical components with $O(\frac{1}{t})$ convergence rate. We present our theoretical analysis as well as experiments describing the empirical behavior of our algorithm, including a potential application of this idea for training fair models where the label of protected attribute is missing or otherwise unavailable.",/pdf/46a100d101a05682223099f9d44b113c0cb49950.pdf,ICLR,2021,We present an efficient stochastic algorithm (RSG+) for canonical correlation analysis (CCA) derived via a differential geometric perspective of the underlying optimization task. +H1BLjgZCb,SkVUjxWRW,1509130000000.0,1519430000000.0,623,Generating Natural Adversarial Examples,"[""zhengliz@uci.edu"", ""ddua@uci.edu"", ""sameer@uci.edu""]","[""Zhengli Zhao"", ""Dheeru Dua"", ""Sameer Singh""]","[""adversarial examples"", ""generative adversarial networks"", ""interpretability"", ""image classification"", ""textual entailment"", ""machine translation""]","Due to their complex nature, it is hard to characterize the ways in which machine learning models can misbehave or be exploited when deployed. Recent work on adversarial examples, i.e. inputs with minor perturbations that result in substantially different model predictions, is helpful in evaluating the robustness of these models by exposing the adversarial scenarios where they fail. 
However, these malicious perturbations are often unnatural, not semantically meaningful, and not applicable to complicated domains such as language. In this paper, we propose a framework to generate natural and legible adversarial examples that lie on the data manifold, by searching in semantic space of dense and continuous data representation, utilizing the recent advances in generative adversarial networks. We present generated adversaries to demonstrate the potential of the proposed approach for black-box classifiers for a wide range of applications such as image classification, textual entailment, and machine translation. We include experiments to show that the generated adversaries are natural, legible to humans, and useful in evaluating and analyzing black-box classifiers.",/pdf/6abc15f8bf89f63e302640ba7016cd9b9bfd3898.pdf,ICLR,2018,"We propose a framework to generate “natural” adversaries against black-box classifiers for both visual and textual domains, by doing the search for adversaries in the latent semantic space." +rkgyS0VFvr,HkxCSOIdvS,1569440000000.0,1588470000000.0,1097,DBA: Distributed Backdoor Attacks against Federated Learning,"[""chulinxie@zju.edu.cn"", ""nick_cooper@sjtu.edu.cn"", ""pin-yu.chen@ibm.com"", ""lbo@illinois.edu""]","[""Chulin Xie"", ""Keli Huang"", ""Pin-Yu Chen"", ""Bo Li""]","[""distributed backdoor attack"", ""federated learning""]","Backdoor attacks aim to manipulate a subset of training data by injecting adversarial triggers such that machine learning models trained on the tampered dataset will make arbitrarily (targeted) incorrect prediction on the testset with the same trigger embedded. While federated learning (FL) is capable of aggregating information provided by different parties for training a better model, its distributed learning methodology and inherently heterogeneous data distribution across parties may bring new vulnerabilities. 
In addition to recent centralized backdoor attacks on FL where each party embeds the same global trigger during training, we propose the distributed backdoor attack (DBA) --- a novel threat assessment framework developed by fully exploiting the distributed nature of FL. DBA decomposes a global trigger pattern into separate local patterns and embeds them into the training sets of different adversarial parties respectively. Compared to standard centralized backdoors, we show that DBA is substantially more persistent and stealthy against FL on diverse datasets such as finance and image data. We conduct extensive experiments to show that the attack success rate of DBA is significantly higher than that of centralized backdoors under different settings. Moreover, we find that distributed attacks are indeed more insidious, as DBA can evade two state-of-the-art robust FL algorithms against centralized backdoors. We also provide explanations for the effectiveness of DBA via feature visual interpretation and feature importance ranking. +To further explore the properties of DBA, we test the attack performance by varying different trigger factors, including local trigger variations (size, gap, and location), scaling factor in FL, data distribution, and poison ratio and interval. Our proposed DBA and thorough evaluation results shed light on characterizing the robustness of FL.",/pdf/61dc789b9f12be96506a23ddb7670ac132a51d6d.pdf,ICLR,2020,"We propose a novel distributed backdoor attack on federated learning and show that it is not only more effective compared with standard centralized attacks, but also harder to defend against with existing robust FL methods" +H1Go7Koex,,1478360000000.0,1483510000000.0,563,Character-aware Attention Residual Network for Sentence Representation,"[""xzheng008@e.ntu.edu.sg"", ""zhenzhou.wu@sap.com""]","[""Xin Zheng"", ""Zhenzhou Wu""]","[""Deep learning""]","Text classification in general is a well studied area. 
However, classifying short and noisy text remains challenging. Feature sparsity is a major issue. The quality of document representation here has a great impact on the classification accuracy. Existing methods represent text using a bag-of-words model, with TFIDF or other weighting schemes. Recently, word embeddings and even document embeddings have been proposed to represent text. The purpose is to capture features at both the word level and the sentence level. However, character-level information is usually ignored. In this paper, we take word morphology and word semantic meaning into consideration, which are represented by character-aware embedding and word distributed embedding. By concatenating both character-level and word distributed embeddings together and arranging words in order, a sentence representation matrix can be obtained. To overcome the data sparsity problem of short text, a sentence representation vector is then derived based on different views of the sentence representation matrix. The various views contribute to the construction of an enriched sentence embedding. We employ a residual network on the sentence embedding to get a consistent and refined sentence representation. Evaluated on a few short text datasets, our model outperforms state-of-the-art models.",/pdf/dd6c821a700f996289e5d2f1f82c0f60e98df04b.pdf,ICLR,2017,We propose a character-aware attention residual network for short text representation. +HyBbjW-RW,S1LejWW0b,1509130000000.0,1518730000000.0,708,Open Loop Hyperparameter Optimization and Determinantal Point Processes,"[""jessed@cs.cmu.edu"", ""jamieson@cs.washington.edu"", ""nasmith@cs.washington.edu""]","[""Jesse Dodge"", ""Kevin Jamieson"", ""Noah A. 
Smith""]","[""hyperparameter optimization"", ""random search"", ""determinantal point processes"", ""low discrepancy sequences""]","Driven by the need for parallelizable hyperparameter optimization methods, this paper studies \emph{open loop} search methods: sequences that are predetermined and can be generated before a single configuration is evaluated. Examples include grid search, uniform random search, low discrepancy sequences, and other sampling distributions. +In particular, we propose the use of $k$-determinantal point processes in hyperparameter optimization via random search. Compared to conventional uniform random search where hyperparameter settings are sampled independently, a $k$-DPP promotes diversity. We describe an approach that transforms hyperparameter search spaces for efficient use with a $k$-DPP. In addition, we introduce a novel Metropolis-Hastings algorithm which can sample from $k$-DPPs defined over spaces with a mixture of discrete and continuous dimensions. Our experiments show significant benefits over uniform random search in realistic scenarios with a limited budget for training supervised learners, whether in serial or parallel.",/pdf/b6ad00739d857a9c792f582adddbc40ed5ecdc00.pdf,ICLR,2018,"Driven by the need for parallelizable, open-loop hyperparameter optimization methods, we propose the use of $k$-determinantal point processes in hyperparameter optimization via random search." +UAAJMiVjTY_,pQ9_HyZoTN0s,1601310000000.0,1614990000000.0,1133,Abductive Knowledge Induction from Raw Data,"[""~Wang-Zhou_Dai2"", ""~Stephen_Muggleton1""]","[""Wang-Zhou Dai"", ""Stephen Muggleton""]","[""Neural-Symbolic Model"", ""Inductive Logic Programming"", ""Abduction""]","For many reasoning-heavy tasks, it is challenging to find an appropriate end-to-end differentiable approximation to domain-specific inference mechanisms. 
Neural-Symbolic (NeSy) AI divides the end-to-end pipeline into neural perception and symbolic reasoning, which can directly exploit general domain knowledge such as algorithms and logic rules. However, it suffers from the exponential computational complexity caused by the interface between the two components, where the neural model lacks direct supervision, and the symbolic model lacks accurate input facts. As a result, they usually focus on learning the neural model with a sound and complete symbolic knowledge base while avoiding a crucial problem: where does the knowledge come from? In this paper, we present Abductive Meta-Interpretive Learning ($Meta_{Abd}$), which unites abduction and induction to learn perceptual neural network and first-order logic theories simultaneously from raw data. Given the same amount of domain knowledge, we demonstrate that $Meta_{Abd}$ not only outperforms the compared end-to-end models in predictive accuracy and data efficiency but also induces logic programs that can be re-used as background knowledge in subsequent learning tasks. To the best of our knowledge, $Meta_{Abd}$ is the first system that can jointly learn neural networks and recursive first-order logic theories with predicate invention.",/pdf/3a7f97a5938549dfd0fa5219a365aa126f660751.pdf,ICLR,2021,We propose an approach combining abduction and induction to jointly learn neural models and recursive first-order logic programs with predicate invention. +H1e31AEYwB,r1ejw6M_DB,1569440000000.0,1577170000000.0,908,Stiffness: A New Perspective on Generalization in Neural Networks,"[""stanislav.fort@gmail.com"", ""powalnow@google.com"", ""staszek.jastrzebski@gmail.com"", ""srinin@google.com""]","[""Stanislav Fort"", ""Pawe\u0142 Krzysztof Nowak"", ""Stanis\u0142aw Jastrzebski"", ""Srini Narayanan""]","[""stiffness"", ""gradient alignment"", ""critical scale""]","We investigate neural network training and generalization using the concept of stiffness. 
We measure how stiff a network is by looking at how a small gradient step on one example affects the loss on another example. In particular, we study how stiffness depends on 1) class membership, 2) distance between data points in the input space, 3) training iteration, and 4) learning rate. We experiment on MNIST, FASHION MNIST, and CIFAR-10 using fully-connected and convolutional neural networks. Our results demonstrate that stiffness is a useful concept for diagnosing and characterizing generalization. We observe that small learning rates reliably lead to higher stiffness at a given epoch as well as at a given training loss. In addition, we measure how stiffness between two data points depends on their mutual input-space distance, and establish the concept of a dynamical critical length that characterizes the distance over which datapoints react similarly to gradient updates. The dynamical critical length decreases with training and the higher the learning rate, the smaller the critical length.",/pdf/bf0434382619717eacda744a2e8484402f3f1ff0.pdf,ICLR,2020,"We defined the concept of stiffness, showed its utility in providing a perspective to better understand generalization in neural networks, observed its variation with learning rate, and defined the concept of dynamical critical length using it." +S1xaf6VFPB,HyePaZ6LvB,1569440000000.0,1577170000000.0,429,PDP: A General Neural Framework for Learning SAT Solvers,"[""saamizad@microsoft.com"", ""sergiym@microsoft.com"", ""markus.weimer@microsoft.com""]","[""Saeed Amizadeh"", ""Sergiy Matusevych"", ""Markus Weimer""]","[""Neural SAT solvers"", ""Graph Neural Networks"", ""Neural Message Passing"", ""Unsupervised Learning"", ""Neural Decimation""]","There have been recent efforts for incorporating Graph Neural Network models for learning fully neural solvers for constraint satisfaction problems (CSP) and particularly Boolean satisfiability (SAT). 
Despite the unique representational power of these neural embedding models, it is not clear to what extent they actually learn a search strategy vs. statistical biases in the training data. On the other hand, by fixing the search strategy (e.g. greedy search), one would effectively deprive the neural models of learning better strategies than those given. In this paper, we propose a generic neural framework for learning SAT solvers (and in general any CSP solver) that can be described in terms of probabilistic inference and yet learn search strategies beyond greedy search. Our framework is based on the idea of propagation, decimation and prediction (and hence the name PDP) in graphical models, and can be trained directly toward solving SAT in a fully unsupervised manner via energy minimization, as shown in the paper. Our experimental results demonstrate the effectiveness of our framework for SAT solving compared to both neural and the industrial baselines.",/pdf/ce999394f64088f4eae5367c804ee32fea8f15ed.pdf,ICLR,2020,"We propose a general neural message passing framework for SAT solving based on the idea of propagation, decimation and prediction (PDP). " +QtTKTdVrFBB,q7aUeJfrS8H,1601310000000.0,1616650000000.0,1925,Random Feature Attention,"[""~Hao_Peng4"", ""~Nikolaos_Pappas1"", ""~Dani_Yogatama2"", ""~Roy_Schwartz1"", ""~Noah_Smith1"", ""~Lingpeng_Kong1""]","[""Hao Peng"", ""Nikolaos Pappas"", ""Dani Yogatama"", ""Roy Schwartz"", ""Noah Smith"", ""Lingpeng Kong""]","[""Attention"", ""transformers"", ""machine translation"", ""language modeling""]","Transformers are state-of-the-art models for a variety of sequence modeling tasks. At their core is an attention function which models pairwise interactions between the inputs at every timestep. While attention is powerful, it does not scale efficiently to long sequences due to its quadratic time and space complexity in the sequence length. 
We propose RFA, a linear time and space attention that uses random feature methods to approximate the softmax function, and explore its application in transformers. RFA can be used as a drop-in replacement for conventional softmax attention and offers a straightforward way of learning with recency bias through an optional gating mechanism. Experiments on language modeling and machine translation demonstrate that RFA achieves similar or better performance compared to strong transformer baselines. In the machine translation experiment, RFA decodes twice as fast as a vanilla transformer. Compared to existing efficient transformer variants, RFA is competitive in terms of both accuracy and efficiency on three long text classification datasets. Our analysis shows that RFA’s efficiency gains are especially notable on long sequences, suggesting that RFA will be particularly useful in tasks that require working with large inputs, fast decoding speed, or low memory footprints.",/pdf/3066e95a6460c5d1da53125f5cff04e2d4ad6c4a.pdf,ICLR,2021,"We propose a random-feature-based attention that scales linearly in sequence length, and performs on par with strong transformer baselines on language modeling and machine translation." +HJDV5YxCW,ryP4cFxA-,1509100000000.0,1518730000000.0,339,Heterogeneous Bitwidth Binarization in Convolutional Neural Networks,"[""jwfromm@uw.edu"", ""matthaip@microsoft.com"", ""shwetak@cs.washington.edu""]","[""Josh Fromm"", ""Matthai Philipose"", ""Shwetak Patel""]","[""Deep Learning"", ""Computer Vision"", ""Approximation""]","Recent work has shown that performing inference with fast, very-low-bitwidth +(e.g., 1 to 2 bits) representations of values in models can yield surprisingly accurate +results. However, although 2-bit approximated networks have been shown to +be quite accurate, 1 bit approximations, which are twice as fast, have restrictively +low accuracy. 
We propose a method to train models whose weights are a mixture +of bitwidths, that allows us to more finely tune the accuracy/speed trade-off. We +present the “middle-out” criterion for determining the bitwidth for each value, and +show how to integrate it into training models with a desired mixture of bitwidths. +We evaluate several architectures and binarization techniques on the ImageNet +dataset. We show that our heterogeneous bitwidth approximation achieves superlinear +scaling of accuracy with bitwidth. Using an average of only 1.4 bits, we are +able to outperform state-of-the-art 2-bit architectures.",/pdf/997b30c27efab22a9ed7a4687462fd77508761ab.pdf,ICLR,2018,We introduce fractional bitwidth approximation and show it has significant advantages. +7IElVSrNm54,Hb53byu8DFH,1601310000000.0,1614990000000.0,3518,Zero-shot Fairness with Invisible Demographics,"[""~Thomas_Kehrenberg1"", ""~Viktoriia_Sharmanska1"", ""~Myles_Scott_Bartlett1"", ""~Novi_Quadrianto1""]","[""Thomas Kehrenberg"", ""Viktoriia Sharmanska"", ""Myles Scott Bartlett"", ""Novi Quadrianto""]","[""fairness"", ""missing data"", ""adversary"", ""classification"", ""disentanglement""]","In a statistical notion of algorithmic fairness, we partition individuals into groups based on some key demographic factors such as race and gender, and require that some statistics of a classifier be approximately equalized across those groups. Current approaches require complete annotations for demographic factors, or focus on an abstract worst-off group rather than demographic groups. In this paper, we consider the setting where the demographic factors are only partially available. For example, we have training examples for white-skinned and dark-skinned males, and white-skinned females, but we have zero examples for dark-skinned females. We could also have zero examples for females regardless of their skin colors. 
Without additional knowledge, it is impossible to directly control the discrepancy of the classifier's statistics for those invisible groups. We develop a disentanglement algorithm that splits a representation of data into a component that captures the demographic factors and another component that is invariant to them based on a context dataset. The context dataset is much like the deployment dataset, it is unlabeled but it contains individuals from all demographics including the invisible. We cluster the context set, equalize the cluster size to form a ""perfect batch"", and use it as a supervision signal for the disentanglement. We propose a new discriminator loss based on a learnable attention mechanism to distinguish a perfect batch from a non-perfect one. We evaluate our approach on standard classification benchmarks and show that it is indeed possible to protect invisible demographics.",/pdf/28a695c88dd2282d349a523d30bfc5d776d97a2f.pdf,ICLR,2021,We use perfect batches to disentangle the outcomes from the demographic groups via adversarial distribution-matching. +rJBwoM-Cb,B1RzsGZAW,1509140000000.0,1518730000000.0,932,Neural Tree Transducers for Tree to Tree Learning,"[""joao@cis.upenn.edu"", ""dean@foster.net"", ""ungar@cis.upenn.edu""]","[""Jo\u00e3o Sedoc"", ""Dean Foster"", ""Lyle Ungar""]","[""deep learning"", ""tree transduction""]","We introduce a novel approach to tree-to-tree learning, the neural tree transducer (NTT), a top-down depth first context-sensitive tree decoder, which is paired with recursive neural encoders. Our method works purely on tree-to-tree manipulations rather than sequence-to-tree or tree-to-sequence and is able to encode and decode multiple depth trees. We compare our method to sequence-to-sequence models applied to serializations of the trees and show that our method outperforms previous methods for tree-to-tree transduction. 
",/pdf/297b7ddaf5c031893d7c9f0e6587fddd20150379.pdf,ICLR,2018, +rsogjAnYs4z,gCu8yEU6Jne,1601310000000.0,1615940000000.0,1569,Understanding the effects of data parallelism and sparsity on neural network training,"[""~Namhoon_Lee1"", ""~Thalaiyasingam_Ajanthan1"", ""~Philip_Torr1"", ""~Martin_Jaggi1""]","[""Namhoon Lee"", ""Thalaiyasingam Ajanthan"", ""Philip Torr"", ""Martin Jaggi""]","[""data parallelism"", ""sparsity"", ""neural network training""]","We study two factors in neural network training: data parallelism and sparsity; here, data parallelism means processing training data in parallel using distributed systems (or equivalently increasing batch size), so that training can be accelerated; for sparsity, we refer to pruning parameters in a neural network model, so as to reduce computational and memory cost. Despite their promising benefits, however, understanding of their effects on neural network training remains elusive. In this work, we first measure these effects rigorously by conducting extensive experiments while tuning all metaparameters involved in the optimization. As a result, we find across various workloads of data set, network model, and optimization algorithm that there exists a general scaling trend between batch size and number of training steps to convergence for the effect of data parallelism, and further, difficulty of training under sparsity. Then, we develop a theoretical analysis based on the convergence properties of stochastic gradient methods and smoothness of the optimization landscape, which illustrates the observed phenomena precisely and generally, establishing a better account of the effects of data parallelism and sparsity on neural network training.",/pdf/ccc4046bb2f3bad4efbb7f58fb7f860fa43b7d4d.pdf,ICLR,2021,We accurately measure the effects of data parallelism and sparsity on neural network training and develop a theoretical analysis to precisely account for their effects. 
+HkAClQgA-,HJTAgQgRW,1509070000000.0,1519320000000.0,228,A Deep Reinforced Model for Abstractive Summarization,"[""rpaulus@salesforce.com"", ""cxiong@salesforce.com"", ""richard@socher.org""]","[""Romain Paulus"", ""Caiming Xiong"", ""Richard Socher""]","[""deep learning"", ""natural language processing"", ""reinforcement learning"", ""text summarization"", ""sequence generation""]","Attentional, RNN-based encoder-decoder models for abstractive summarization have achieved good performance on short input and output sequences. For longer documents and summaries however these models often include repetitive and incoherent phrases. We introduce a neural network model with a novel intra-attention that attends over the input and continuously generated output separately, and a new training method that combines standard supervised word prediction and reinforcement learning (RL). +Models trained only with supervised learning often exhibit ""exposure bias"" - they assume ground truth is provided at each step during training. +However, when standard word prediction is combined with the global sequence prediction training of RL the resulting summaries become more readable. +We evaluate this model on the CNN/Daily Mail and New York Times datasets. Our model obtains a 41.16 ROUGE-1 score on the CNN/Daily Mail dataset, an improvement over previous state-of-the-art models. Human evaluation also shows that our model produces higher quality summaries.",/pdf/ae4dce902a6df56754a36bcc69366eca1f6e502f.pdf,ICLR,2018,A summarization model combining a new intra-attention and reinforcement learning method to increase summary ROUGE scores and quality for long sequences. +HJStZKqel,,1478300000000.0,1484520000000.0,470,Lifelong Perceptual Programming By Example,"[""t-algaun@microsoft.com"", ""mabrocks@microsoft.com"", ""nkushman@microsoft.com"", ""dtarlow@microsoft.com""]","[""Alexander L. 
Gaunt"", ""Marc Brockschmidt"", ""Nate Kushman"", ""Daniel Tarlow""]","[""Deep learning"", ""Supervised Learning""]","We introduce and develop solutions for the problem of Lifelong Perceptual Programming By Example (LPPBE). The problem is to induce a series of programs that require understanding perceptual data like images or text. LPPBE systems learn from weak supervision (input-output examples) and incrementally construct a shared library of components that grows and improves as more tasks are solved. Methodologically, we extend differentiable interpreters to operate on perceptual data and to share components across tasks. Empirically we show that this leads to a lifelong learning system that transfers knowledge to new tasks more effectively than baselines, and the performance on earlier tasks continues to improve even as the system learns on new, different tasks.",/pdf/1af94cbdee9cb33e75af6d46adbbb6e4601abcb9.pdf,ICLR,2017,Combination of differentiable interpreters and neural networks for lifelong learning of a model composed of neural and source code functions +SJU4ayYgl,,1478190000000.0,1487760000000.0,72,Semi-Supervised Classification with Graph Convolutional Networks,"[""T.N.Kipf@uva.nl"", ""M.Welling@uva.nl""]","[""Thomas N. Kipf"", ""Max Welling""]","[""Deep learning"", ""Semi-Supervised Learning""]",We present a scalable approach for semi-supervised learning on graph-structured data that is based on an efficient variant of convolutional neural networks which operate directly on graphs. We motivate the choice of our convolutional architecture via a localized first-order approximation of spectral graph convolutions. Our model scales linearly in the number of graph edges and learns hidden layer representations that encode both local graph structure and features of nodes. 
In a number of experiments on citation networks and on a knowledge graph dataset we demonstrate that our approach outperforms related methods by a significant margin.,/pdf/fe93c12d3f33c50e9a05d3b8d4b6f1f97c51501c.pdf,ICLR,2017,Semi-supervised classification with a CNN model for graphs. State-of-the-art results on a number of citation network datasets. +LIOgGKRCYkG,#NAME?,1601310000000.0,1614990000000.0,937,Target Training: Tricking Adversarial Attacks to Fail,"[""~Blerta_Lindqvist1""]","[""Blerta Lindqvist""]","[""adversarial machine learning""]","Recent adversarial defense approaches have failed. Untargeted gradient-based attacks cause classifiers to choose any wrong class. Our novel white-box defense tricks untargeted attacks into becoming attacks targeted at designated target classes. From these target classes, we derive the real classes. The Target Training defense tricks the minimization at the core of untargeted, gradient-based adversarial attacks: minimize the sum of (1) perturbation and (2) classifier adversarial loss. Target Training changes the classifier minimally, and trains it with additional duplicated points (at 0 distance) labeled with designated classes. These differently-labeled duplicated samples minimize both terms (1) and (2) of the minimization, steering attack convergence to samples of designated classes, from which correct classification is derived. Importantly, Target Training eliminates the need to know the attack and the overhead of generating adversarial samples of attacks that minimize perturbations. Without using adversarial samples and against an adaptive attack aware of our defense, Target Training exceeds even default, unsecured classifier accuracy of 84.3% for CIFAR10 with 86.6% against DeepFool attack; and achieves 83.2% against CW-$L_2$ (κ=0) attack. Using adversarial samples, we achieve 75.6% against CW-$L_2$ (κ=40). 
Due to our deliberate choice of low-capacity classifiers, Target Training does not withstand $L_\infty$ adaptive attacks in CIFAR10 but withstands CW-$L_\infty$ (κ=0) in MNIST. Target Training presents a fundamental change in adversarial defense strategy.",/pdf/8706d42623fa5afbd6497a9eefa1333f6218a769.pdf,ICLR,2021,Target Training tricks untargeted attacks into becoming attacks targeted at designated target classes. +r1fYuytex,,1478190000000.0,1487890000000.0,71,Sparsely-Connected Neural Networks: Towards Efficient VLSI Implementation of Deep Neural Networks,"[""arash.ardakani@mail.mcgill.ca"", ""carlo.condo@mail.mcgill.ca"", ""warren.gross@mcgill.ca""]","[""Arash Ardakani"", ""Carlo Condo"", ""Warren J. Gross""]","[""Deep learning"", ""Applications"", ""Optimization""]","Recently deep neural networks have received considerable attention due to their ability to extract and represent high-level abstractions in data sets. Deep neural networks such as fully-connected and convolutional neural networks have shown excellent performance on a wide range of recognition and classification tasks. However, their hardware implementations currently suffer from large silicon area and high power consumption due to their high degree of complexity. The power/energy consumption of neural networks is dominated by memory accesses, the majority of which occur in fully-connected networks. In fact, they contain most of the deep neural network parameters. In this paper, we propose sparsely-connected networks, by showing that the number of connections in fully-connected networks can be reduced by up to 90% while improving the accuracy performance on three popular datasets (MNIST, CIFAR10 and SVHN). We then propose an efficient hardware architecture based on linear-feedback shift registers to reduce the memory requirements of the proposed sparsely-connected networks. 
The proposed architecture can save up to 90% of memory compared to the conventional implementations of fully-connected neural networks. Moreover, implementation results show up to 84% reduction in the energy consumption of a single neuron of the proposed sparsely-connected networks compared to a single neuron of fully-connected neural networks.",/pdf/ae297fea4af26c7ec220b0b744098a7e7ef26a0a.pdf,ICLR,2017,We show that the number of connections in fully-connected networks can be reduced by up to 90% while improving the accuracy performance. +BylIciRcYQ,SJxOkROqF7,1538090000000.0,1551370000000.0,535,SGD Converges to Global Minimum in Deep Learning via Star-convex Path,"[""yi.zhou610@duke.edu"", ""baymax@mail.ustc.edu.cn"", ""huishuai.zhang@microsoft.com"", ""liang.889@osu.edu"", ""vahid.tarokh@duke.edu""]","[""Yi Zhou"", ""Junjie Yang"", ""Huishuai Zhang"", ""Yingbin Liang"", ""Vahid Tarokh""]","[""SGD"", ""deep learning"", ""global minimum"", ""convergence""]","Stochastic gradient descent (SGD) has been found to be surprisingly effective in training a variety of deep neural networks. However, there is still a lack of understanding on how and why SGD can train these complex networks towards a global minimum. In this study, we establish the convergence of SGD to a global minimum for nonconvex optimization problems that are commonly encountered in neural network training. Our argument exploits the following two important properties: 1) the training loss can achieve zero value (approximately), which has been widely observed in deep learning; 2) SGD follows a star-convex path, which is verified by various experiments in this paper. In such a context, our analysis shows that SGD, although has long been considered as a randomized algorithm, converges in an intrinsically deterministic manner to a global minimum. 
",/pdf/06cea9de669d2b7ce5b2e75a513d4032fec4063f.pdf,ICLR,2019, +nQxCYIFk7Rz,dGaG-eBYQk3,1601310000000.0,1614990000000.0,1341,Multiple Descent: Design Your Own Generalization Curve,"[""~Lin_Chen14"", ""~Yifei_Min1"", ""mbelkin@ucsd.edu"", ""~amin_karbasi1""]","[""Lin Chen"", ""Yifei Min"", ""Mikhail Belkin"", ""amin karbasi""]","[""multiple descent"", ""interpolation"", ""overparametrization""]","This paper explores the generalization loss of linear regression in variably parameterized families of models, both under-parameterized and over-parameterized. We show that the generalization curve can have an arbitrary number of peaks, and moreover, locations of those peaks can be explicitly controlled. Our results highlight the fact that both classical U-shaped generalization curve and the recently observed double descent curve are not intrinsic properties of the model family. Instead, their emergence is due to the interaction between the properties of the data and the inductive biases of learning algorithms. ",/pdf/be07260433ea6424e95e27b62684d3f1b100213c.pdf,ICLR,2021,"We show that the generalization curve can have an arbitrary number of peaks, and moreover, locations of those peaks can be explicitly controlled. " +jn1WDxmDe5P,VzC6f45FrZS,1601310000000.0,1614990000000.0,47,Meta-k: Towards Unsupervised Prediction of Number of Clusters,"[""~Azade_Farshad1"", ""hamidi.1732304@studenti.uniroma1.it"", ""~Nassir_Navab1""]","[""Azade Farshad"", ""Samin Hamidi"", ""Nassir Navab""]","[""Clustering"", ""Self-supervised learning"", ""Meta-learning""]","Data clustering is a well-known unsupervised learning approach. Despite the recent advances in clustering using deep neural networks, determining the number of clusters without any information about the given dataset remains an existing problem. There have been classical approaches based on data statistics that require the manual analysis of a data scientist to calculate the probable number of clusters in a dataset. 
In this work, we propose a new method for unsupervised prediction of the number of clusters in a dataset given only the data without any labels. We evaluate our method extensively on randomly generated datasets using the scikit-learn package and multiple computer vision datasets and show that our method is able to determine the number of classes in a dataset effectively without any supervision.",/pdf/550b3a31ad6336d9b267ce9b3fe8dc125a9f1b9e.pdf,ICLR,2021,Our work is an attempt to self-supervised prediction of number of clusters in a given data using policy gradient optimization. +BJgK6iA5KX,SJxgJXW5tX,1538090000000.0,1545360000000.0,821,AutoLoss: Learning Discrete Schedule for Alternate Optimization,"[""haowen.will.xu@gmail.com"", ""hao@cs.cmu.edu"", ""zhitingh@cs.cmu.edu"", ""xiaodan1@cs.cmu.edu"", ""rsalakhu@cs.cmu.edu"", ""eric.xing@petuum.com""]","[""Haowen Xu"", ""Hao Zhang"", ""Zhiting Hu"", ""Xiaodan Liang"", ""Ruslan Salakhutdinov"", ""Eric Xing""]","[""Meta Learning"", ""AutoML"", ""Optimization Schedule""]","Many machine learning problems involve iteratively and alternately optimizing different task objectives with respect to different sets of parameters. Appropriately scheduling the optimization of a task objective or a set of parameters is usually crucial to the quality of convergence. In this paper, we present AutoLoss, a meta-learning framework that automatically learns and determines the optimization schedule. AutoLoss provides a generic way to represent and learn the discrete optimization schedule from metadata, allows for a dynamic and data-driven schedule in ML problems that involve alternating updates of different parameters or from different loss objectives. + +We apply AutoLoss on four ML tasks: d-ary quadratic regression, classification using a multi-layer perceptron (MLP), image generation using GANs, and multi-task neural machine translation (NMT). 
We show that the AutoLoss controller is able to capture the distribution of better optimization schedules that result in higher quality of convergence on all four tasks. The trained AutoLoss controller is generalizable -- it can guide and improve the learning of a new task model with different specifications, or on different datasets.",/pdf/74f21881b35dcad935fadd9e7b79885923dd96d2.pdf,ICLR,2019,"We propose a unified formulation for iterative alternate optimization and develop AutoLoss, a framework to automatically learn and generate optimization schedules." +Q1jmmQz72M2,IdfrxIBOoKR,1601310000000.0,1615990000000.0,1188,Neural Delay Differential Equations,"[""~Qunxi_Zhu1"", ""~Yao_Guo3"", ""~Wei_Lin1""]","[""Qunxi Zhu"", ""Yao Guo"", ""Wei Lin""]","[""Delay differential equations"", ""neural networks""]"," Neural Ordinary Differential Equations (NODEs), a framework of continuous-depth neural networks, have been widely applied, showing exceptional efficacy in coping with some representative datasets. Recently, an augmented framework has been successfully developed for conquering some limitations emergent in application of the original framework. Here we propose a new class of continuous-depth neural networks with delay, named as Neural Delay Differential Equations (NDDEs), and, for computing the corresponding gradients, we use the adjoint sensitivity method to obtain the delayed dynamics of the adjoint. Since the differential equations with delays are usually seen as dynamical systems of infinite dimension possessing more fruitful dynamics, the NDDEs, compared to the NODEs, own a stronger capacity of nonlinear representations. Indeed, we analytically validate that the NDDEs are of universal approximators, and further articulate an extension of the NDDEs, where the initial function of the NDDEs is supposed to satisfy ODEs. 
More importantly, we use several illustrative examples to demonstrate the outstanding capacities of the NDDEs and the NDDEs with ODEs' initial value. More precisely, (1) we successfully model the delayed dynamics where the trajectories in the lower-dimensional phase space could be mutually intersected, while the traditional NODEs without any argumentation are not directly applicable for such modeling, and (2) we achieve lower loss and higher accuracy not only for the data produced synthetically by complex models but also for the real-world image datasets, i.e., CIFAR10, MNIST and SVHN. Our results on the NDDEs reveal that appropriately articulating the elements of dynamical systems into the network design is truly beneficial to promoting the network performance.",/pdf/39dbf04903dd5453991dacd8dcb2333de44b3a6e.pdf,ICLR,2021,"We propose a new class of continuous-depth neural networks with delay, named as Neural Delay Differential Equations and having better representation capability outperforming the Neural ODEs." +HJMHpjC9Ym,Hkg4DXnYYQ,1538090000000.0,1550030000000.0,800,Big-Little Net: An Efficient Multi-Scale Feature Representation for Visual and Speech Recognition,"[""chenrich@us.ibm.com"", ""qfan@us.ibm.com"", ""neil.r.mallinar@ibm.com"", ""tom.sercu1@ibm.com"", ""rsferis@us.ibm.com""]","[""Chun-Fu (Richard) Chen"", ""Quanfu Fan"", ""Neil Mallinar"", ""Tom Sercu"", ""Rogerio Feris""]","[""CNN"", ""multi-scale"", ""efficiency"", ""object recognition"", ""speech recognition""]","In this paper, we propose a novel Convolutional Neural Network (CNN) architecture for learning multi-scale feature representations with good tradeoffs between speed and accuracy. This is achieved by using a multi-branch network, which has different computational complexity at different branches with different resolutions. Through frequent merging of features from branches at distinct scales, our model obtains multi-scale features while using less computation. 
The proposed approach demonstrates improvement of model efficiency and performance on both object recognition and speech recognition tasks, using popular architectures including ResNet, ResNeXt and SEResNeXt. For object recognition, our approach reduces computation by 1/3 while improving accuracy significantly over 1% point than the baselines, and the computational savings can be higher up to 1/2 without compromising the accuracy. Our model also surpasses state-of-the-art CNN acceleration approaches by a large margin in terms of accuracy and FLOPs. On the task of speech recognition, our proposed multi-scale CNNs save 30% FLOPs with slightly better word error rates, showing good generalization across domains.",/pdf/f8d29f7c14d1021ae591f4ea44813918e7317e7f.pdf,ICLR,2019, +QcqsxI6rKDs,0dFg5RbjMGM,1601310000000.0,1614990000000.0,2438,Meta Gradient Boosting Neural Networks,"[""~Manqing_Dong1"", ""~Lina_Yao2"", ""xianzhi.wang@uts.edu.au"", ""xiwei.xu@data61.csiro.au"", ""liming.zhu@data61.csiro.au""]","[""Manqing Dong"", ""Lina Yao"", ""Xianzhi Wang"", ""Xiwei Xu"", ""Liming Zhu""]","[""meta learning"", ""deep learning""]","Meta-optimization is an effective approach that learns a shared set of parameters across tasks for parameter initialization in meta-learning. +A key challenge for meta-optimization based approaches is to determine whether an initialization condition can be generalized to tasks with diverse distributions to accelerate learning. +To address this issue, we design a meta-gradient boosting framework that uses a base learner to learn shared information across tasks and a series of gradient-boosted modules to capture task-specific information to fit diverse distributions. +We evaluate the proposed model on both regression and classification tasks with multi-mode distributions. 
+The results demonstrate both the effectiveness of our model in modulating task-specific meta-learned priors and its advantages on multi-mode distributions.",/pdf/4943864d76084cbafbab40eeb5331907a152fa7f.pdf,ICLR,2021, +PoP96DrBHnl,2J-0rlxG9z7,1601310000000.0,1614990000000.0,3260,Gradient descent temporal difference-difference learning,"[""~Rong_Zhu4"", ""jmurray9@uoregon.edu""]","[""Rong Zhu"", ""James Murray""]","[""temporal difference learning"", ""gradient-descent based temporal difference"", ""Off-policy"", ""regularization""]","Off-policy algorithms, in which a behavior policy differs from the target policy and is used to gain experience for learning, have proven to be of great practical value in reinforcement learning. However, even for simple convex problems such as linear value function approximation, these algorithms are not guaranteed to be stable. To address this, alternative algorithms that are provably convergent in such cases have been introduced, the most well known being gradient descent temporal difference (GTD) learning. This algorithm and others like it, however, tend to converge much more slowly than conventional temporal difference learning. +In this paper we propose gradient descent temporal difference-difference (Gradient-DD) learning in order to accelerate GTD learning by introducing second-order differences in successive parameter updates. +We investigate this algorithm in the framework of linear value function approximation and analytically showing its improvement over GTD learning. Studying the model empirically on the random walk and Boyan-chain prediction tasks, we find substantial improvement over GTD learning and, in several cases, better performance even than conventional TD learning. 
+",/pdf/a26d3b8f6274422bafedcfcbac55905ab325885c.pdf,ICLR,2021,We provide gradient descent temporal difference-difference learning in order to accelerate gradient descent temporal difference learning by introducing second-order differences in successive parameter updates. +BygfrANKvB,H1gr3KU_wr,1569440000000.0,1577170000000.0,1103,Learning to Make Generalizable and Diverse Predictions for Retrosynthesis,"[""bensonc@mit.edu"", ""tianxiao@mit.edu"", ""tommi@csail.mit.edu"", ""regina@csail.mit.edu""]","[""Benson Chen"", ""Tianxiao Shen"", ""Tommi S. Jaakkola"", ""Regina Barzilay""]","[""Chemistry"", ""Retrosynthesis"", ""Transformer"", ""Pre-training"", ""Diversity""]","We propose a new model for making generalizable and diverse retrosynthetic reaction predictions. Given a target compound, the task is to predict the likely chemical reactants to produce the target. This generative task can be framed as a sequence-to-sequence problem by using the SMILES representations of the molecules. Building on top of the popular Transformer architecture, we propose two novel pre-training methods that construct relevant auxiliary tasks (plausible reactions) for our problem. Furthermore, we incorporate a discrete latent variable model into the architecture to encourage the model to produce a diverse set of alternative predictions. On the 50k subset of reaction examples from the United States patent literature (USPTO-50k) benchmark dataset, our model greatly improves performance over the baseline, while also generating predictions that are more diverse.",/pdf/995c97f417013804a8791ec1f6f60e389c69169e.pdf,ICLR,2020,We propose a new model for making generalizable and diverse retrosynthetic reaction predictions. 
+#NAME?,Ua8pG_xlzZJQ,1601310000000.0,1614990000000.0,1294,Enhancing Visual Representations for Efficient Object Recognition during Online Distillation,"[""~Shashanka_Venkataramanan1"", ""~Bruce_W_McIntosh1"", ""~Abhijit_Mahalanobis1""]","[""Shashanka Venkataramanan"", ""Bruce W McIntosh"", ""Abhijit Mahalanobis""]",[],"We propose ENVISE, an online distillation framework that ENhances VISual representations for Efficient object recognition. We are motivated by the observation that in many real-world scenarios, the probability of occurrence of all classes is not the same and only a subset of classes occur frequently. Exploiting this fact, we aim to reduce the computations of our framework by employing a binary student network (BSN) to learn the frequently occurring classes using the pseudo-labels generated by the teacher network (TN) on an unlabeled image stream. To maintain overall accuracy, the BSN must also accurately determine when a rare (or unknown) class is present in the image stream so that the TN can be used in such cases. To achieve this, we propose an attention triplet loss which ensures that the BSN emphasizes the same semantically meaningful regions of the image as the TN. When the prior class probabilities in the image stream vary, we demonstrate that the BSN adapts to the TN faster than the real-valued student network. We also introduce Gain in Efficiency (GiE), a new metric which estimates the relative reduction in FLOPS based on the number of times the BSN and TN are used to process the image stream. We benchmark CIFAR-100 and tiny-imagenet datasets by creating meaningful inlier (frequent) and outlier (rare) class pairs that mimic real-world scenarios. 
We show that ENVISE outperforms state-of-the-art (SOTA) outlier detection methods in terms of GiE, and also achieves greater separation between inlier and outlier classes in the feature space.",/pdf/ec2ff3db1ada0ee31556cfa8ddeb61656db40b80.pdf,ICLR,2021, +r1gOe209t7,HJeOZ-CcFQ,1538090000000.0,1545360000000.0,1092,Reconciling Feature-Reuse and Overfitting in DenseNet with Specialized Dropout,"[""kun@cs.ucsb.edu"", ""boyuan@cs.ucsb.edu"", ""xielingwei@stu.xmu.edu.cn"", ""yufeiding@cs.ucsb.edu""]","[""Kun Wan"", ""Boyuan Feng"", ""Lingwei Xie"", ""Yufei Ding""]","[""Specialized dropout"", ""computer vision""]","Recently convolutional neural networks (CNNs) achieve great accuracy in visual recognition tasks. DenseNet becomes one of the most popular CNN models due to its effectiveness in feature-reuse. However, like other CNN models, DenseNets also face overfitting problem if not severer. Existing dropout method can be applied but not as effective due to the introduced nonlinear connections. In particular, the property of feature-reuse in DenseNet will be impeded, and the dropout effect will be weakened by the spatial correlation inside feature maps. To address these problems, we craft the design of a specialized dropout method from three aspects, dropout location, dropout granularity, and dropout probability. The insights attained here could potentially be applied as a general approach for boosting the accuracy of other CNN models with similar nonlinear connections. Experimental results show that DenseNets with our specialized dropout method yield better accuracy compared to vanilla DenseNet and state-of-the-art CNN models, and such accuracy boost increases with the model depth.",/pdf/2939c75d31519701b96f705bb886e29725429b45.pdf,ICLR,2019,"Realizing the drawbacks when applying original dropout on DenseNet, we craft the design of dropout method from three aspects, the idea of which could also be applied on other CNN models." 
+HyKZyYlRZ,B1_W1YxCW,1509100000000.0,1518730000000.0,324,Large Scale Multi-Domain Multi-Task Learning with MultiModel,"[""lukaszkaiser@google.com"", ""aidan.n.gomez@gmail.com"", ""noam@google.com"", ""avaswani@google.com"", ""nikip@google.com"", ""llion@google.com"", ""usz@google.com""]","[""Lukasz Kaiser"", ""Aidan N. Gomez"", ""Noam Shazeer"", ""Ashish Vaswani"", ""Niki Parmar"", ""Llion Jones"", ""Jakob Uszkoreit""]","[""multi-task learning"", ""transfer learning""]","Deep learning yields great results across many fields, +from speech recognition, image classification, to translation. +But for each problem, getting a deep model to work well involves +research into the architecture and a long period of tuning. + +We present a single model that yields good results on a number +of problems spanning multiple domains. In particular, this single model +is trained concurrently on ImageNet, multiple translation tasks, +image captioning (COCO dataset), a speech recognition corpus, +and an English parsing task. + +Our model architecture incorporates building blocks from multiple +domains. It contains convolutional layers, an attention mechanism, +and sparsely-gated layers. + +Each of these computational blocks is crucial for a subset of +the tasks we train on. Interestingly, even if a block is not +crucial for a task, we observe that adding it never hurts performance +and in most cases improves it on all tasks. + +We also show that tasks with less data benefit largely from joint +training with other tasks, while performance on large tasks degrades +only slightly if at all.",/pdf/6c94242c126f9439a8d33041de5fa16156117a12.pdf,ICLR,2018,Large scale multi-task architecture solves ImageNet and translation together and shows transfer learning. 
+Ogga20D2HO-,mKIM8i4zIA,1601310000000.0,1615340000000.0,2947,FedMix: Approximation of Mixup under Mean Augmented Federated Learning,"[""~Tehrim_Yoon1"", ""sym807@kaist.ac.kr"", ""~Sung_Ju_Hwang1"", ""~Eunho_Yang1""]","[""Tehrim Yoon"", ""Sumin Shin"", ""Sung Ju Hwang"", ""Eunho Yang""]","[""federated learning"", ""mixup""]","Federated learning (FL) allows edge devices to collectively learn a model without directly sharing data within each device, thus preserving privacy and eliminating the need to store data globally. While there are promising results under the assumption of independent and identically distributed (iid) local data, current state-of-the-art algorithms suffer a performance degradation as the heterogeneity of local data across clients increases. To resolve this issue, we propose a simple framework, \emph{Mean Augmented Federated Learning (MAFL)}, where clients send and receive \emph{averaged} local data, subject to the privacy requirements of target applications. Under our framework, we propose a new augmentation algorithm, named \emph{FedMix}, which is inspired by a phenomenal yet simple data augmentation method, Mixup, but does not require local raw data to be directly shared among devices. Our method shows greatly improved performance in the standard benchmark datasets of FL, under highly non-iid federated settings, compared to conventional algorithms.",/pdf/0258da18459084a22b881d20dbd411e7184bb3d3.pdf,ICLR,2021,"We introduce a new federated framework, Mean Augmented Federated Learning (MAFL), and propose an efficient algorithm, Federated Mixup (FedMix), which shows good performance on difficult non-iid situations." 
+AHOs7Sm5H7R,oIdMkZksrBz,1601310000000.0,1616060000000.0,3164,Towards Resolving the Implicit Bias of Gradient Descent for Matrix Factorization: Greedy Low-Rank Learning,"[""~Zhiyuan_Li2"", ""~Yuping_Luo1"", ""~Kaifeng_Lyu2""]","[""Zhiyuan Li"", ""Yuping Luo"", ""Kaifeng Lyu""]","[""matrix factorization"", ""gradient descent"", ""implicit regularization"", ""implicit bias""]","Matrix factorization is a simple and natural test-bed to investigate the implicit regularization of gradient descent. Gunasekar et al. (2017) conjectured that gradient flow with infinitesimal initialization converges to the solution that minimizes the nuclear norm, but a series of recent papers argued that the language of norm minimization is not sufficient to give a full characterization for the implicit regularization. In this work, we provide theoretical and empirical evidence that for depth-2 matrix factorization, gradient flow with infinitesimal initialization is mathematically equivalent to a simple heuristic rank minimization algorithm, Greedy Low-Rank Learning, under some reasonable assumptions. This generalizes the rank minimization view from previous works to a much broader setting and enables us to construct counter-examples to refute the conjecture from Gunasekar et al. (2017). We also extend the results to the case where depth >= 3, and we show that the benefit of being deeper is that the above convergence has a much weaker dependence over initialization magnitude so that this rank minimization is more likely to take effect for initialization with practical scale.",/pdf/e29b53584bc9017cb15b9394735cd51b56c32446.pdf,ICLR,2021,"We prove that for depth-2 matrix factorization, gradient flow with infinitesimal initialization is mathematically equivalent to a simple heuristic rank minimization algorithm, Greedy Low-Rank Learning, under some reasonable assumptions." 
+bkincnjT8zx,HuIsYkrU0A,1601310000000.0,1614990000000.0,2184,Neural Dynamical Systems: Balancing Structure and Flexibility in Physical Prediction,"[""~Viraj_Mehta1"", ""ichar@cs.cmu.edu"", ""~Willie_Neiswanger1"", ""youngsec@cs.cmu.edu"", ""anelson@pppl.gov"", ""mboyer@pppl.gov"", ""ekolemen@pppl.gov"", ""~Jeff_Schneider1""]","[""Viraj Mehta"", ""Ian Char"", ""Willie Neiswanger"", ""Youngseog Chung"", ""Andrew Oakleigh Nelson"", ""Mark D Boyer"", ""Egemen Kolemen"", ""Jeff Schneider""]","[""nuclear fusion"", ""physics"", ""differential equations"", ""dynamical systems"", ""control"", ""dynamics""]","We introduce Neural Dynamical Systems (NDS), a method of learning dynamical models in various gray-box settings which incorporates prior knowledge in the form of systems of ordinary differential equations. NDS uses neural networks to estimate free parameters of the system, predicts residual terms, and numerically integrates over time to predict future states. A key insight is that many real dynamical systems of interest are hard to model because the dynamics may vary across rollouts. We mitigate this problem by taking a trajectory of prior states as the input to NDS and train it to dynamically estimate system parameters using the preceding trajectory. We find that NDS learns dynamics with higher accuracy and fewer samples than a variety of deep learning methods that do not incorporate the prior knowledge and methods from the system identification literature which do. We demonstrate these advantages first on synthetic dynamical systems and then on real data captured from deuterium shots from a nuclear fusion reactor. Finally, we demonstrate that these benefits can be utilized for control in small-scale experiments.",/pdf/edcc880a4b09b9ec065f51aae3bc14feb689e8f3.pdf,ICLR,2021,We use prior knowledge in the form of differential equations to make predictions and do control more sample-efficiently. 
+HygQro05KX,BygGNz9yY7,1538090000000.0,1545360000000.0,75,$A^*$ sampling with probability matching,"[""vofhqn@gmail.com"", ""dcszj@mail.tsinghua.edu.cn""]","[""Yichi Zhou"", ""Jun Zhu""]",[],"Probabilistic methods often need to draw samples from a nontrivial distribution. $A^*$ sampling is a nice algorithm by building upon a top-down construction of a Gumbel process, where a large state space is divided into subsets and at each round $A^*$ sampling selects a subset to process. However, the selection rule depends on a bound function, which can be intractable. Moreover, we show that such a selection criterion can be inefficient. This paper aims to improve $A^*$ sampling by addressing these issues. To design a suitable selection rule, we apply \emph{Probability Matching}, a widely used method for decision making, to $A^*$ sampling. We provide insights into the relationship between $A^*$ sampling and probability matching by analyzing a nontrivial special case in which the state space is partitioned into two subsets. We show that in this case probability matching is optimal within a constant gap. Furthermore, as directly applying probability matching to $A^*$ sampling is time consuming, we design an approximate version based on Monte-Carlo estimators. We also present an efficient implementation by leveraging special properties of Gumbel distributions and well-designed balanced trees. 
Empirical results show that our method saves a significant amount of computational resources on suboptimal regions compared with $A^*$ sampling.",/pdf/b6b0788d4814feecb2129cc6fd45dc67877a7886.pdf,ICLR,2019, +B1x33sC9KQ,HJgR0CYqtm,1538090000000.0,1545360000000.0,749,ACIQ: Analytical Clipping for Integer Quantization of neural networks,"[""ron.banner@intel.com"", ""yury.nahshan@intel.com"", ""daniel.soudry@gmail.com"", ""elad.hoffer@gmail.com""]","[""Ron Banner"", ""Yury Nahshan"", ""Elad Hoffer"", ""Daniel Soudry""]","[""quantization"", ""reduced precision"", ""training"", ""inference"", ""activation""]","We analyze the trade-off between quantization noise and clipping distortion in low precision networks. We identify the statistics of various tensors, and derive exact expressions for the mean-square-error degradation due to clipping. By optimizing these expressions, we show marked improvements over standard quantization schemes that normally avoid clipping. For example, just by choosing the accurate clipping values, more than 40\% accuracy improvement is obtained for the quantization of VGG-16 to 4-bits of precision. Our results have many applications for the quantization of neural networks at both training and inference time. +",/pdf/0871d08996fbd469ba1b2bc9cf43c46de4cb18d3.pdf,ICLR,2019,"We analyze the trade-off between quantization noise and clipping distortion in low precision networks, and show marked improvements over standard quantization schemes that normally avoid clipping" +99M-4QlinPr,qr0nQHfBuo7,1601310000000.0,1614990000000.0,2219,Efficient Competitive Self-Play Policy Optimization,"[""~Yuanyi_Zhong1"", ""~Yuan_Zhou1"", ""~Jian_Peng1""]","[""Yuanyi Zhong"", ""Yuan Zhou"", ""Jian Peng""]","[""self-play"", ""policy optimization"", ""two-player zero-sum game"", ""multiagent""]","Reinforcement learning from self-play has recently reported many successes. 
Self-play, where the agents compete with themselves, is often used to generate training data for iterative policy improvement. In previous work, heuristic rules are designed to choose an opponent for the current learner. Typical rules include choosing the latest agent, the best agent, or a random historical agent. However, these rules may be inefficient in practice and sometimes do not guarantee convergence even in the simplest matrix games. This paper proposes a new algorithmic framework for competitive self-play reinforcement learning in two-player zero-sum games. We recognize the fact that the Nash equilibrium coincides with the saddle point of the stochastic payoff function, which motivates us to borrow ideas from classical saddle point optimization literature. Our method simultaneously trains several agents and intelligently takes each other as opponents based on a simple adversarial rule derived from a principled perturbation-based saddle optimization method. We prove theoretically that our algorithm converges to an approximate equilibrium with high probability in convex-concave games under standard assumptions. Beyond the theory, we further show the empirical superiority of our method over baseline methods relying on the aforementioned opponent-selection heuristics in matrix games, grid-world soccer, Gomoku, and simulated robot sumo, with neural net policy function approximators.",/pdf/4bc505dd96e2a3bc262fae922603d4b3fdfc7743.pdf,ICLR,2021,We present a population-based self-play policy optimization algorithm with a principled opponent-selection rule. 
+SJeqs6EFvB,rygAPB1dPr,1569440000000.0,1590180000000.0,754,HOPPITY: LEARNING GRAPH TRANSFORMATIONS TO DETECT AND FIX BUGS IN PROGRAMS,"[""edinella@seas.upenn.edu"", ""hadai@google.com"", ""liby99@seas.upenn.edu"", ""mhnaik@cis.upenn.edu"", ""lsong@cc.gatech.edu"", ""kewang@visa.com""]","[""Elizabeth Dinella"", ""Hanjun Dai"", ""Ziyang Li"", ""Mayur Naik"", ""Le Song"", ""Ke Wang""]","[""Bug Detection"", ""Program Repair"", ""Graph Neural Network"", ""Graph Transformation""]","We present a learning-based approach to detect and fix a broad range of bugs in Javascript programs. We frame the problem in terms of learning a sequence of graph transformations: given a buggy program modeled by a graph structure, our model makes a sequence of predictions including the position of bug nodes and corresponding graph edits to produce a fix. Unlike previous works that use deep neural networks, our approach targets bugs that are more complex and semantic in nature (i.e.~bugs that require adding or deleting statements to fix). We have realized our approach in a tool called HOPPITY. By training on 290,715 Javascript code change commits on Github, HOPPITY correctly detects and fixes bugs in 9,490 out of 36,361 programs in an end-to-end fashion. Given the bug location and type of the fix, HOPPITY also outperforms the baseline approach by a wide margin.",/pdf/9d37b18aba351f4294aa84e69ea330d1fa51c471.pdf,ICLR,2020,A learning-based approach for detecting and fixing bugs in Javascript +b7ZRqEFXdQ,Idv_zsexjxS,1601310000000.0,1614990000000.0,3452,Improving Sequence Generative Adversarial Networks with Feature Statistics Alignment,"[""~Yekun_Chai1"", ""~Qiyue_Yin1"", ""~Junge_Zhang1""]","[""Yekun Chai"", ""Qiyue Yin"", ""Junge Zhang""]",[],"Generative Adversarial Networks (GAN) are facing great challenges in synthesizing sequences of discrete elements, such as mode dropping and unstable training. 
The binary classifier in the discriminator may limit the capacity of learning signals and thus hinder the advance of adversarial training. To address such issues, apart from the binary classification feedback, we harness a Feature Statistics Alignment (FSA) paradigm to deliver fine-grained signals in the latent high-dimensional representation space. Specifically, FSA forces the mean statistics of the fake data distribution to approach that of real data as close as possible in a finite-dimensional feature space. Experiments on synthetic and real benchmark datasets show the superior performance in quantitative evaluation and demonstrate the effectiveness of our approach to discrete sequence generation. To the best of our knowledge, the proposed architecture is the first that employs feature alignment regularization in the Gumbel-Softmax based GAN framework for sequence generation. ",/pdf/1e3bdb8f26d24be04a7123941e95699837aa0577.pdf,ICLR,2021,The paper proposes a Feature Statistics Alignment method to improve the training of Gumbel-Softmax-based language GANs. +HyxgBerKwB,Bye0e_lFDB,1569440000000.0,1579110000000.0,2273,GraphQA: Protein Model Quality Assessment using Graph Convolutional Network,"[""baldassarre.fe@gmail.com"", ""david.menendez.hurtado@scilifelab.se"", ""arne@bioinfo.se"", ""azizpour@kth.se""]","[""Federico Baldassarre"", ""David Men\u00e9ndez Hurtado"", ""Arne Elofsson"", ""Hossein Azizpour""]","[""Protein Quality Assessment"", ""Graph Networks"", ""Representation Learning""]","Proteins are ubiquitous molecules whose function in biological processes is determined by their 3D structure. +Experimental identification of a protein's structure can be time-consuming, prohibitively expensive, and not always possible. +Alternatively, protein folding can be modeled using computational methods, which however are not guaranteed to always produce optimal results. 
+GraphQA is a graph-based method to estimate the quality of protein models, that possesses favorable properties such as representation learning, explicit modeling of both sequential and 3D structure, geometric invariance and computational efficiency. +In this work, we demonstrate significant improvements of the state-of-the-art for both hand-engineered and representation-learning approaches, as well as carefully evaluating the individual contributions of GraphQA.",/pdf/c7005f6ebf03998e2d54e2ae91a7e0018709b018.pdf,ICLR,2020,GraphQA is a graph-based method for protein Quality Assessment that improves the state-of-the-art for both hand-engineered and representation-learning approaches +Bk7wvW-C-,rJQww--AZ,1509130000000.0,1518730000000.0,686,Exploring Asymmetric Encoder-Decoder Structure for Context-based Sentence Representation Learning,"[""shuaitang93@ucsd.edu"", ""hljin@adobe.com"", ""cfang@adobe.com"", ""zhawang@adobe.com"", ""desa@ucsd.edu""]","[""Shuai Tang"", ""Hailin Jin"", ""Chen Fang"", ""Zhaowen Wang"", ""Virginia R. de Sa""]","[""asymmetric structure"", ""RNN-CNN"", ""fast"", ""unsupervised"", ""representation"", ""sentence""]","Context information plays an important role in human language understanding, and it is also useful for machines to learn vector representations of language. In this paper, we explore an asymmetric encoder-decoder structure for unsupervised context-based sentence representation learning. As a result, we build an encoder-decoder architecture with an RNN encoder and a CNN decoder, and we show that neither an autoregressive decoder nor an RNN decoder is required. We further combine a suite of effective designs to significantly improve model efficiency while also achieving better performance. Our model is trained on two different large unlabeled corpora, and in both cases transferability is evaluated on a set of downstream language understanding tasks. 
We empirically show that our model is simple and fast while producing rich sentence representations that excel in downstream tasks.",/pdf/d6eab07e93a7b695fd40140c8a59fad643fdf436.pdf,ICLR,2018,We proposed an RNN-CNN encoder-decoder model for fast unsupervised sentence representation learning. +BygANhA9tQ,HygbNPpcFm,1538090000000.0,1550860000000.0,1496,Cost-Sensitive Robustness against Adversarial Examples,"[""xz7bc@virginia.edu"", ""evans@virginia.edu""]","[""Xiao Zhang"", ""David Evans""]","[""Certified robustness"", ""Adversarial examples"", ""Cost-sensitive learning""]","Several recent works have developed methods for training classifiers that are certifiably robust against norm-bounded adversarial perturbations. These methods assume that all the adversarial transformations are equally important, which is seldom the case in real-world applications. We advocate for cost-sensitive robustness as the criterion for measuring the classifier's performance for tasks where some adversarial transformations are more important than others. We encode the potential harm of each adversarial transformation in a cost matrix, and propose a general objective function to adapt the robust training method of Wong & Kolter (2018) to optimize for cost-sensitive robustness. 
Our experiments on simple MNIST and CIFAR10 models with a variety of cost matrices show that the proposed approach can produce models with substantially reduced cost-sensitive robust error, while maintaining classification accuracy.",/pdf/f0d81f8022ac0e1a26ad7c36d4ad58e9731f72be.pdf,ICLR,2019,A general method for training certified cost-sensitive robust classifiers against adversarial perturbations +53WS781RzT9,dMCjjRt-NpI,1601310000000.0,1614990000000.0,3082,The Impact of the Mini-batch Size on the Dynamics of SGD: Variance and Beyond,"[""~Xin_Qian2"", ""~Diego_Klabjan1""]","[""Xin Qian"", ""Diego Klabjan""]",[],"We study mini-batch stochastic gradient descent (SGD) dynamics under linear regression and deep linear networks by focusing on the variance of the gradients only given the initial weights and mini-batch size, which is the first study of this nature. In the linear regression case, we show that in each iteration the norm of the gradient is a decreasing function of the mini-batch size $b$ and thus the variance of the stochastic gradient estimator is a decreasing function of $b$. For deep neural networks with $L_2$ loss we show that the variance of the gradient is a polynomial in $1/b$. The results theoretically back the important intuition that smaller batch sizes yield larger variance of the stochastic gradients and lower loss function values, which is a common belief among researchers. The proof techniques exhibit a relationship between stochastic gradient estimators and initial weights, which is useful for further research on the dynamics of SGD. We empirically provide insights to our results on various datasets and commonly used deep network structures. 
We further discuss possible extensions of the approaches we build in studying the generalization ability of the deep learning models.",/pdf/45eea402f0a41047ccbf528e0a518393c7b60093.pdf,ICLR,2021, +9GUTgHZgKCH,sPCrtq6Y2y_K,1601310000000.0,1614990000000.0,3700,Reducing the number of neurons of Deep ReLU Networks based on the current theory of Regularization,"[""~Jakob_Heiss1"", ""~Alexis_Stockinger1"", ""josef.teichmann@math.ethz.ch""]","[""Jakob Heiss"", ""Alexis Stockinger"", ""Josef Teichmann""]","[""Reduction"", ""Compression"", ""Regularization"", ""Theory"", ""Pruning"", ""Deep"", ""Interpretability"", ""Generalization""]","We introduce a new Reduction Algorithm which makes use of the properties of ReLU neurons to reduce significantly the number of neurons in a trained Deep Neural Network. This algorithm is based on the recent theory of implicit and explicit regularization in Deep ReLU Networks from (Maennel et al, 2018) and the authors. + +We discuss two experiments which illustrate the efficiency of the algorithm to reduce the number of neurons significantly with provably almost no change of the learned function within the training data (and therefore almost no loss in accuracy).",/pdf/6b145bbec5116bfdc0c697da3720cc1f0278afd1.pdf,ICLR,2021,An algorithm which reduces the number of neurons in a Deep ReLU Network and allows several important benefits is presented. +80FMcTSZ6J0,lxXT6ocdk0q,1601310000000.0,1614860000000.0,568,Noise against noise: stochastic label noise helps combat inherent label noise,"[""~Pengfei_Chen1"", ""gy.chen@siat.ac.cn"", ""kourenmu@gmail.com"", ""~jingwei_zhao1"", ""~Pheng-Ann_Heng1""]","[""Pengfei Chen"", ""Guangyong Chen"", ""Junjie Ye"", ""jingwei zhao"", ""Pheng-Ann Heng""]","[""Noisy Labels"", ""Robust Learning"", ""SGD noise"", ""Regularization""]","The noise in stochastic gradient descent (SGD) provides a crucial implicit regularization effect, previously studied in optimization by analyzing the dynamics of parameter updates. 
In this paper, we are interested in learning with noisy labels, where we have a collection of samples with potential mislabeling. We show that a previously rarely discussed SGD noise, induced by stochastic label noise (SLN), mitigates the effects of inherent label noise. In contrast, the common SGD noise directly applied to model parameters does not. We formalize the differences and connections of SGD noise variants, showing that SLN induces SGD noise dependent on the sharpness of output landscape and the confidence of output probability, which may help escape from sharp minima and prevent overconfidence. SLN not only improves generalization in its simplest form but also boosts popular robust training methods, including sample selection and label correction. Specifically, we present an enhanced algorithm by applying SLN to label correction. Our code is released.",/pdf/cb07afb92c9402f5b191a438058b6a911ae61ba1.pdf,ICLR,2021,"SGD noise induced by stochastic label noise helps escape sharp minima and prevents overconfidence, hence can mitigate the effects of inherent label noise and improve generalization." +HklkeR4KPB,SJgVBbmdwB,1569440000000.0,1583910000000.0,914,ReMixMatch: Semi-Supervised Learning with Distribution Matching and Augmentation Anchoring,"[""dberth@google.com"", ""ncarlini@google.com"", ""cubuk@google.com"", ""kurakin@google.com"", ""kihyuks@google.com"", ""zhanghan@google.com"", ""craffel@google.com""]","[""David Berthelot"", ""Nicholas Carlini"", ""Ekin D. Cubuk"", ""Alex Kurakin"", ""Kihyuk Sohn"", ""Han Zhang"", ""Colin Raffel""]","[""semi-supervised learning""]","We improve the recently-proposed ``MixMatch semi-supervised learning algorithm by introducing two new techniques: distribution alignment and augmentation anchoring. +- Distribution alignment encourages the marginal distribution of predictions on unlabeled data to be close to the marginal distribution of ground-truth labels. 
+- Augmentation anchoring feeds multiple strongly augmented versions of an input into the model and encourages each output to be close to the prediction for a weakly-augmented version of the same input. +To produce strong augmentations, we propose a variant of AutoAugment which learns the augmentation policy while the model is being trained. + +Our new algorithm, dubbed ReMixMatch, is significantly more data-efficient than prior work, requiring between 5 times and 16 times less data to reach the same accuracy. For example, on CIFAR-10 with 250 labeled examples we reach 93.73% accuracy (compared to MixMatch's accuracy of 93.58% with 4000 examples) and a median accuracy of 84.92% with just four labels per class. +",/pdf/7e0bce0c7b750533163a2782f6af5b039305918c.pdf,ICLR,2020,"We introduce Distribution Matching and Augmentation Anchoring, two improvements to MixMatch which produce state-of-the-art results and enable surprisingly strong performance with only 40 labels on CIFAR-10 and SVHN." +BJx1SsAcYQ,H1l2do1DFQ,1538090000000.0,1545360000000.0,50,Discovering Low-Precision Networks Close to Full-Precision Networks for Efficient Embedded Inference,"[""jlmckins@us.ibm.com"", ""sesser@us.ibm.com"", ""rappusw@us.ibm.com"", ""deepika.bablani@ibm.com"", ""arthurjo@us.ibm.com"", ""izzet.burak.yildiz@gmail.com"", ""dmodha@us.ibm.com""]","[""Jeffrey L. McKinstry"", ""Steven K. Esser"", ""Rathinakumar Appuswamy"", ""Deepika Bablani"", ""John V. Arthur"", ""Izzet B. Yildiz"", ""Dharmendra S. Modha""]","[""Deep Learning"", ""Convolutional Neural Networks"", ""Low-precision inference"", ""Network quantization""]","To realize the promise of ubiquitous embedded deep network inference, it is essential to seek limits of energy and area efficiency. To this end, low-precision networks offer tremendous promise because both energy and area scale down quadratically with the reduction in precision. 
Here, for the first time, we demonstrate ResNet-18, ResNet-34, ResNet-50, ResNet-152, Inception-v3, densenet-161, and VGG-16bn networks on the ImageNet classification benchmark that, at 8-bit precision exceed the accuracy of the full-precision baseline networks after one epoch of finetuning, thereby leveraging the availability of pretrained models. +We also demonstrate ResNet-18, ResNet-34, and ResNet-50 4-bit models that match the accuracy of the full-precision baseline networks -- the highest scores to date. Surprisingly, the weights of the low-precision networks are very close (in cosine similarity) to the weights of the corresponding baseline networks, making training from scratch unnecessary. + +We find that gradient noise due to quantization during training increases with reduced precision, and seek ways to overcome this noise. The number of iterations required by stochastic gradient descent to achieve a given training error is related to the square of (a) the distance of the initial solution from the final plus (b) the maximum variance of the gradient estimates. By drawing inspiration from this observation, we (a) reduce solution distance by starting with pretrained fp32 precision baseline networks and fine-tuning, and (b) combat noise introduced by quantizing weights and activations during training, by using larger batches along with matched learning rate annealing. Sensitivity analysis indicates that these techniques, coupled with proper activation function range calibration, offer a promising heuristic to discover low-precision networks, if they exist, close to fp32 precision baseline networks. +",/pdf/04e9aa53e41ed48dcff9226fc38ccb2f3b5352ab.pdf,ICLR,2019,Finetuning after quantization matches or exceeds full-precision state-of-the-art networks at both 8- and 4-bit quantization. 
+BJluxbWC-,ByQve-bCb,1509130000000.0,1518730000000.0,643,Unseen Class Discovery in Open-world Classification,"[""lshu3@uic.edu"", ""hxu48@uic.edu"", ""liub@uic.edu""]","[""Lei Shu"", ""Hu Xu"", ""Bing Liu""]",[],"This paper concerns open-world classification, where the classifier not only needs to classify test examples into seen classes that have appeared in training but also reject examples from unseen or novel classes that have not appeared in training. Specifically, this paper focuses on discovering the hidden unseen classes of the rejected examples. Clearly, without prior knowledge this is difficult. However, we do have the data from the seen training classes, which can tell us what kind of similarity/difference is expected for examples from the same class or from different classes. It is reasonable to assume that this knowledge can be transferred to the rejected examples and used to discover the hidden unseen classes in them. This paper aims to solve this problem. It first proposes a joint open classification model with a sub-model for classifying whether a pair of examples belongs to the same or different classes. This sub-model can serve as a distance function for clustering to discover the hidden classes of the rejected examples. Experimental results show that the proposed model is highly promising. +",/pdf/1834d6f3c93c2efadd23527fbd64e45f7061fa02.pdf,ICLR,2018, +S1e5YC4KPS,r1lvKHd_vr,1569440000000.0,1577170000000.0,1257,Winning Privately: The Differentially Private Lottery Ticket Mechanism,"[""lgondara@sfu.ca"", ""wang@sfu.ca"", ""ricardo_silva_carvalho@sfu.ca""]","[""Lovedeep Gondara"", ""Ke Wang"", ""Ricardo Silva Carvalho""]","[""Differentially private neural networks"", ""lottery ticket hypothesis"", ""differential privacy""]","We propose the differentially private lottery ticket mechanism (DPLTM). An end-to-end differentially private training paradigm based on the lottery ticket hypothesis. 
Using ``high-quality winners"", selected via our custom score function, DPLTM significantly outperforms state-of-the-art. We show that DPLTM converges faster, allowing for early stopping with reduced privacy budget consumption. We further show that the tickets from DPLTM are transferable across datasets, domains, and architectures. Our extensive evaluation on several public datasets provides evidence to our claims. ",/pdf/5fd431bbb380ac72ca3df0e7a44f3ddf7fa7003e.pdf,ICLR,2020,An end-to-end differentially private extension of the lottery ticket mechanism +Oe2XI-Aft-k,NKny1AazhSm,1601310000000.0,1614990000000.0,778,Perturbation Type Categorization for Multiple $\ell_p$ Bounded Adversarial Robustness,"[""~Pratyush_Maini1"", ""~Xinyun_Chen1"", ""~Bo_Li19"", ""~Dawn_Song1""]","[""Pratyush Maini"", ""Xinyun Chen"", ""Bo Li"", ""Dawn Song""]","[""adversarial examples"", ""robustness"", ""multiple perturbation types""]","Despite the recent advances in $\textit{adversarial training}$ based defenses, deep neural networks are still vulnerable to adversarial attacks outside the perturbation type they are trained to be robust against. Recent works have proposed defenses to improve the robustness of a single model against the union of multiple perturbation types. However, when evaluating the model against each individual attack, these methods still suffer significant trade-offs compared to the ones specifically trained to be robust against that perturbation type. In this work, we introduce the problem of categorizing adversarial examples based on their $\ell_p$ perturbation types. Based on our analysis, we propose $\textit{PROTECTOR}$, a two-stage pipeline to improve the robustness against multiple perturbation types. Instead of training a single predictor, $\textit{PROTECTOR}$ first categorizes the perturbation type of the input, and then utilizes a predictor specifically trained against the predicted perturbation type to make the final prediction. 
We first theoretically show that adversarial examples created by different perturbation types constitute different distributions, which makes it possible to distinguish them. Further, we show that at test time the adversary faces a natural trade-off between fooling the perturbation type classifier and the succeeding predictor optimized with perturbation specific adversarial training. This makes it challenging for an adversary to plant strong attacks against the whole pipeline. In addition, we demonstrate the realization of this trade-off in deep networks by adding random noise to the model input at test time, enabling enhanced robustness against strong adaptive attacks. Extensive experiments on MNIST and CIFAR-10 show that $\textit{PROTECTOR}$ outperforms prior adversarial training based defenses by over $5\%$, when tested against the union of $\ell_1, \ell_2, \ell_\infty$ attacks.",/pdf/03a9895c89930deb5d7fc789700e570957c0e56d.pdf,ICLR,2021,We introduce a method that performs Perturbation Type Categorization for Robustness against multiple perturbation types +Hke0V1rKPS,r1eRIkadvB,1569440000000.0,1583910000000.0,1673,Jacobian Adversarially Regularized Networks for Robustness,"[""guoweial001@e.ntu.edu.sg"", ""ytay017@e.ntu.edu.sg"", ""asysong@ntu.edu.sg"", ""jie.fu@polymtl.ca""]","[""Alvin Chan"", ""Yi Tay"", ""Yew Soon Ong"", ""Jie Fu""]","[""adversarial examples"", ""robust machine learning"", ""deep learning""]","Adversarial examples are crafted with imperceptible perturbations with the intent to fool neural networks. Against such attacks, adversarial training and its variants stand as the strongest defense to date. Previous studies have pointed out that robust models that have undergone adversarial training tend to produce more salient and interpretable Jacobian matrices than their non-robust counterparts. A natural question is whether a model trained with an objective to produce salient Jacobian can result in better robustness. 
This paper answers this question with affirmative empirical results. We propose Jacobian Adversarially Regularized Networks (JARN) as a method to optimize the saliency of a classifier's Jacobian by adversarially regularizing the model's Jacobian to resemble natural training images. Image classifiers trained with JARN show improved robust accuracy compared to standard models on the MNIST, SVHN and CIFAR-10 datasets, uncovering a new angle to boost robustness without using adversarial training.",/pdf/b71a044605b9fb104c10c4aeff0af76763e44138.pdf,ICLR,2020,We show that training classifiers to produce salient input Jacobian matrices with a GAN-like regularization can boost adversarial robustness. +S1EHOsC9tX,rylP2pxKYQ,1538090000000.0,1548770000000.0,353,Towards the first adversarially robust neural network model on MNIST,"[""lukas.schott@bethgelab.org"", ""jonas.rauber@bethgelab.org"", ""matthias.bethge@bethgelab.org"", ""wieland.brendel@bethgelab.org""]","[""Lukas Schott"", ""Jonas Rauber"", ""Matthias Bethge"", ""Wieland Brendel""]","[""adversarial examples"", ""MNIST"", ""robustness"", ""deep learning"", ""security""]","Despite much effort, deep neural networks remain highly susceptible to tiny input perturbations and even for MNIST, one of the most common toy datasets in computer vision, no neural network model exists for which adversarial perturbations are large and make semantic sense to humans. We show that even the widely recognized and by far most successful L-inf defense by Madry et~al. (1) has lower L0 robustness than undefended networks and still highly susceptible to L2 perturbations, (2) classifies unrecognizable images with high certainty, (3) performs not much better than simple input binarization and (4) features adversarial perturbations that make little sense to humans. These results suggest that MNIST is far from being solved in terms of adversarial robustness. 
We present a novel robust classification model that performs analysis by synthesis using learned class-conditional data distributions. We derive bounds on the robustness and go to great length to empirically evaluate our model using maximally effective adversarial attacks by (a) applying decision-based, score-based, gradient-based and transfer-based attacks for several different Lp norms, (b) by designing a new attack that exploits the structure of our defended model and (c) by devising a novel decision-based attack that seeks to minimize the number of perturbed pixels (L0). The results suggest that our approach yields state-of-the-art robustness on MNIST against L0, L2 and L-inf perturbations and we demonstrate that most adversarial examples are strongly perturbed towards the perceptual boundary between the original and the adversarial class.",/pdf/ba645fe0aec0af7d9a401c8eb0c7462a25da3dba.pdf,ICLR,2019, +rkeYL1SFvH,S1lpiPauPS,1569440000000.0,1577170000000.0,1738,WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia,"[""schwenk@fb.com"", ""vishrav@fb.com"", ""ssun32@jhu.edu"", ""hgong6@illinois.edu"", ""fguzman@fb.com""]","[""Holger Schwenk"", ""Vishrav Chaudhary"", ""Shuo Sun"", ""Hongyu Gong"", ""Francisco Guzm\u00e1n""]","[""multilinguality"", ""bitext mining"", ""neural MT"", ""Wikipedia"", ""low-resource languages"", ""joint sentence representation""]","We present an approach based on multilingual sentence embeddings to automatically extract parallel sentences from the content of Wikipedia articles in 85 languages, including several dialects or low-resource languages. We do not limit the extraction process to alignments with English, but systematically consider all possible language pairs. In total, we are able to extract 135M parallel sentences for 1620 different language pairs, out of which only 34M are aligned with English. 
This corpus of parallel sentences is freely available (URL anonymized). + +To get an indication of the quality of the extracted bitexts, we train neural MT baseline systems on the mined data only for 1886 language pairs, and evaluate them on the TED corpus, achieving strong BLEU scores for many language pairs. The WikiMatrix bitexts seem to be particularly interesting to train MT systems between distant languages without the need to pivot through English.",/pdf/8c4fd954d80878d600e3d55f9f2b57d9d5e1d546.pdf,ICLR,2020,"Large-scale bitext extraction from Wikipedia: 1620 language pairs in 85 languages, 135M parallel sentences, Systematic NMT evaluation on TED test set." +S1lRg0VKDr,HJlf0TXuwr,1569440000000.0,1577170000000.0,948,On summarized validation curves and generalization,"[""mohammad.hashir.khan@umontreal.ca"", ""yoshua.bengio@mila.quebec"", ""joseph@josephpcohen.com""]","[""Mohammad Hashir"", ""Yoshua Bengio"", ""Joseph Paul Cohen""]","[""model selection"", ""deep learning"", ""early stopping"", ""validation curves""]","The validation curve is widely used for model selection and hyper-parameter search with the curve usually summarized over all the training tasks. However, this summarization tends to lose the intricacies of the per-task curves and it isn't able to reflect if all the tasks are at their validation optimum even if the summarized curve might be. In this work, we explore this loss of information, how it affects the model at testing and how to detect it using interval plots. We propose two techniques as a proof-of-concept of the potential gain in the test performance when per-task validation curves are accounted for. Our experiments on three large datasets show up to a 2.5% increase (averaged over multiple trials) in the test accuracy rate when model selection uses the per-task validation maximums instead of the summarized validation maximum. 
This potential increase is not a result of any modification to the model but rather at what point of training the weights were selected from. This presents an exciting direction for new training and model selection techniques that rely on more than just averaged metrics. ",/pdf/d7a671f64111b00df78c9d0fc274087caacc7803.pdf,ICLR,2020, +SJeFNoRcFQ,rkeksuoOKm,1538090000000.0,1545360000000.0,22,Traditional and Heavy Tailed Self Regularization in Neural Network Models,"[""charles@calculationconsulting.com"", ""mmahoney@stat.berkeley.edu""]","[""Charles H. Martin"", ""Michael W. Mahoney""]","[""statistical mechanics"", ""self-regularization"", ""random matrix"", ""glassy behavior"", ""heavy-tailed""]","Random Matrix Theory (RMT) is applied to analyze the weight matrices of Deep Neural Networks (DNNs), including both production quality, pre-trained models such as AlexNet and Inception, and smaller models trained from scratch, such as LeNet5 and a miniature-AlexNet. Empirical and theoretical results clearly indicate that the empirical spectral density (ESD) of DNN layer matrices displays signatures of traditionally-regularized statistical models, even in the absence of exogenously specifying traditional forms of regularization, such as Dropout or Weight Norm constraints. Building on recent results in RMT, most notably its extension to Universality classes of Heavy-Tailed matrices, we develop a theory to identify 5+1 Phases of Training, corresponding to increasing amounts of Implicit Self-Regularization. For smaller and/or older DNNs, this Implicit Self-Regularization is like traditional Tikhonov regularization, in that there is a ""size scale"" separating signal from noise. For state-of-the-art DNNs, however, we identify a novel form of Heavy-Tailed Self-Regularization, similar to the self-organization seen in the statistical physics of disordered systems. This implicit Self-Regularization can depend strongly on the many knobs of the training process. 
By exploiting the generalization gap phenomena, we demonstrate that we can cause a small model to exhibit all 5+1 phases of training simply by changing the batch size.",/pdf/1d06ef44c0deafa874c86dbf29dbbb814ee58765.pdf,ICLR,2019,"See the abstract. (For the revision, the paper is identical, except for a 59 page Supplementary Material, which can serve as a stand-alone technical report version of the paper.)" +Hk2aImxAb,BJ9TUmlR-,1509070000000.0,1519360000000.0,229,Multi-Scale Dense Networks for Resource Efficient Image Classification,"[""gh349@cornell.edu"", ""taineleau@gmail.com"", ""lth14@mails.tsinghua.edu.cn"", ""fw245@cornell.edu"", ""lvdmaaten@fb.com"", ""kqw4@cornell.edu""]","[""Gao Huang"", ""Danlu Chen"", ""Tianhong Li"", ""Felix Wu"", ""Laurens van der Maaten"", ""Kilian Weinberger""]","[""efficient learning"", ""budgeted learning"", ""deep learning"", ""image classification"", ""convolutional networks""]","In this paper we investigate image classification with computational resource limits at test time. Two such settings are: 1. anytime classification, where the network’s prediction for a test example is progressively updated, facilitating the output of a prediction at any time; and 2. budgeted batch classification, where a fixed amount of computation is available to classify a set of examples that can be spent unevenly across “easier” and “harder” inputs. In contrast to most prior work, such as the popular Viola and Jones algorithm, our approach is based on convolutional neural networks. We train multiple classifiers with varying resource demands, which we adaptively apply during test time. To maximally re-use computation between the classifiers, we incorporate them as early-exits into a single deep convolutional neural network and inter-connect them with dense connectivity. To facilitate high quality classification early on, we use a two-dimensional multi-scale network architecture that maintains coarse and fine level features throughout the network. 
Experiments on three image-classification tasks demonstrate that our framework substantially improves the existing state-of-the-art in both settings.",/pdf/b92cc4191e13816cacce262b4e421aac1052ec10.pdf,ICLR,2018, +d-XzF81Wg1,BliX9GcJhTZ,1601310000000.0,1615900000000.0,1876,Deconstructing the Regularization of BatchNorm,"[""~Yann_Dauphin1"", ""~Ekin_Dogus_Cubuk1""]","[""Yann Dauphin"", ""Ekin Dogus Cubuk""]","[""deep learning"", ""batch normalization"", ""regularization"", ""understanding neural networks""]","Batch normalization (BatchNorm) has become a standard technique in deep learning. Its popularity is in no small part due to its often positive effect on generalization. Despite this success, the regularization effect of the technique is still poorly understood. This study aims to decompose BatchNorm into separate mechanisms that are much simpler. We identify three effects of BatchNorm and assess their impact directly with ablations and interventions. Our experiments show that preventing explosive growth at the final layer at initialization and during training can recover a large part of BatchNorm's generalization boost. This regularization mechanism can lift accuracy by $2.9\%$ for Resnet-50 on Imagenet without BatchNorm. We show it is linked to other methods like Dropout and recent initializations like Fixup. Surprisingly, this simple mechanism matches the improvement of $0.9\%$ of the more complex Dropout regularization for the state-of-the-art Efficientnet-B8 model on Imagenet. This demonstrates the underrated effectiveness of simple regularizations and sheds light on directions to further improve generalization for deep nets.",/pdf/a940f4ffe517172e86f0802bdffb5c6f0a602068.pdf,ICLR,2021,We deconstruct the regularization effect of batch normalization and show that preventing explosive growth at the final layer at initialization and during training can recover a large part of BatchNorm's generalization boost. 
+HJlTpCEKvS,HklnD4cOPB,1569440000000.0,1577170000000.0,1410,Which Tasks Should Be Learned Together in Multi-task Learning?,"[""tstand@cs.stanford.edu"", ""zamir@cs.stanford.edu"", ""sdawnchen@gmail.com"", ""guibas@cs.stanford.edu"", ""malik@eecs.berkeley.edu"", ""ssilvio@stanford.edu""]","[""Trevor Standley"", ""Amir R. Zamir"", ""Dawn Chen"", ""Leonidas Guibas"", ""Jitendra Malik"", ""Silvio Savarese""]","[""multi-task learning"", ""Computer Vision""]","Many computer vision applications require solving multiple tasks in real-time. A neural network can be trained to solve multiple tasks simultaneously using 'multi-task learning'. This saves computation at inference time as only a single network needs to be evaluated. Unfortunately, this often leads to inferior overall performance as task objectives compete, which consequently poses the question: which tasks should and should not be learned together in one network when employing multi-task learning? We systematically study task cooperation and competition and propose a framework for assigning tasks to a few neural networks such that cooperating tasks are computed by the same neural network, while competing tasks are computed by different networks. Our framework offers a time-accuracy trade-off and can produce better accuracy using less inference time than not only a single large multi-task neural network but also many single-task networks. +",/pdf/2149d3a2992bff379985b93ec6eb341b8059fc6e.pdf,ICLR,2020,"We analyze what tasks are best learned together in one network, and which are best to learn separately. 
" +Hyg1Ls0cKQ,SJgZ9V2Kt7,1538090000000.0,1545360000000.0,140,Learning Latent Semantic Representation from Pre-defined Generative Model,"[""seago0828@yonsei.ac.kr"", ""sbcho@yonsei.ac.kr""]","[""Jin-Young Kim"", ""Sung-Bae Cho""]","[""Latent space"", ""Generative adversarial network"", ""variational autoencoder"", ""conditioned generation""]","Learning representations of data is an important issue in machine learning. Though GAN has led to significant improvements in the data representations, it still has several problems such as unstable training, hidden manifold of data, and huge computational overhead. GAN tends to produce the data simply without any information about the manifold of the data, which hinders from controlling desired features to generate. Moreover, most of GAN’s have a large size of manifold, resulting in poor scalability. In this paper, we propose a novel GAN to control the latent semantic representation, called LSC-GAN, which allows us to produce desired data to generate and learns a representation of the data efficiently. Unlike the conventional GAN models with hidden distribution of latent space, we define the distributions explicitly in advance that are trained to generate the data based on the corresponding features by inputting the latent variables that follow the distribution. As the larger scale of latent space caused by deploying various distributions in one latent space makes training unstable while maintaining the dimension of latent space, we need to separate the process of defining the distributions explicitly and operation of generation. We prove that a VAE is proper for the former and modify a loss function of VAE to map the data into the pre-defined latent space so as to locate the reconstructed data as close to the input data according to its characteristics. Moreover, we add the KL divergence to the loss function of LSC-GAN to include this process. 
The decoder of VAE, which generates the data with the corresponding features from the pre-defined latent space, is used as the generator of the LSC-GAN. Several experiments on the CelebA dataset are conducted to verify the usefulness of the proposed method to generate desired data stably and efficiently, achieving a high compression ratio that can hold about 24 pixels of information in each dimension of latent space. Besides, our model learns the reverse of features such as not laughing (rather frowning) only with data of ordinary and smiling facial expression.",/pdf/35c7239a229fd9030f0d906a93428001abde9917.pdf,ICLR,2019,We propose a generative model that not only produces data with desired features from the pre-defined latent space but also fully understands the features of the data to create characteristics that are not in the dataset. +SkwAEQbAb,rJt6Emb0Z,1509140000000.0,1518730000000.0,1163,A novel method to determine the number of latent dimensions with SVD,"[""asana.neishabouri@polymtl.ca"", ""michel.desmarais@polymtl.ca""]","[""Asana Neishabouri"", ""Michel Desmarais""]","[""SVD"", ""Latent Dimensions"", ""Dimension Reductions"", ""Machine Learning""]","Determining the number of latent dimensions is a ubiquitous problem in machine +learning. In this study, we introduce a novel method that relies on SVD to discover +the number of latent dimensions. The general principle behind the method is to +compare the curve of singular values of the SVD decomposition of a data set with +the randomized data set curve. The inferred number of latent dimensions corresponds +to the crossing point of the two curves. To evaluate our methodology, we +compare it with competing methods such as Kaisers eigenvalue-greater-than-one +rule (K1), Parallel Analysis (PA), Velicers MAP test (Minimum Average Partial). +We also compare our method with the Silhouette Width (SW) technique which is +used in different clustering methods to determine the optimal number of clusters. 
+The results on synthetic data show that Parallel Analysis and our method give +similar results and are more accurate than the other methods, and that our method is +slightly better than Parallel Analysis on sparse data sets.",/pdf/5a5d920c9b7b9b39015b595683426873a38b3e8b.pdf,ICLR,2018,"In this study, we introduce a novel method that relies on SVD to discover the number of latent dimensions." +B1fpDsAqt7,B1gJqb3FK7,1538090000000.0,1548440000000.0,305,Visual Reasoning by Progressive Module Networks,"[""seung@cs.toronto.edu"", ""makarand@cs.toronto.edu"", ""fidler@cs.toronto.edu""]","[""Seung Wook Kim"", ""Makarand Tapaswi"", ""Sanja Fidler""]",[],"Humans learn to solve tasks of increasing complexity by building on top of previously acquired knowledge. Typically, there exists a natural progression in the tasks that we learn – most do not require completely independent solutions, but can be broken down into simpler subtasks. We propose to represent a solver for each task as a neural module that calls existing modules (solvers for simpler tasks) in a functional program-like manner. Lower modules are a black box to the calling module, and communicate only via a query and an output. Thus, a module for a new task learns to query existing modules and composes their outputs in order to produce its own output. Our model effectively combines previous skill-sets, does not suffer from forgetting, and is fully differentiable. We test our model in learning a set of visual reasoning tasks, and demonstrate improved performances in all tasks by learning progressively. By evaluating the reasoning process using human judges, we show that our model is more interpretable than an attention-based baseline. 
+",/pdf/7ba81b43cb43e5d88e3829f767d714f8de03f675.pdf,ICLR,2019, +B1IzH7cxl,,1478270000000.0,1484330000000.0,192,A Neural Stochastic Volatility Model,"[""r.luo@cs.ucl.ac.uk"", ""xuxj@apex.sjtu.edu.cn"", ""wnzhang@apex.sjtu.edu.cn"", ""j.wang@cs.ucl.ac.uk""]","[""Rui Luo"", ""Xiaojun Xu"", ""Weinan Zhang"", ""Jun Wang""]","[""Deep learning"", ""Supervised Learning""]","In this paper, we show that the recent integration of statistical models with recurrent neural networks provides a new way of formulating volatility models that have been popular in time series analysis and prediction. The model comprises a pair of complementary stochastic recurrent neural networks: the generative network models the joint distribution of the stochastic volatility process; the inference network approximates the conditional distribution of the latent variables given the observable ones. +Our focus in this paper is on the formulation of temporal dynamics of volatility over time under a stochastic recurrent neural network framework. Our derivations show that some popular volatility models are a special case of our proposed neural stochastic volatility model. Experiments demonstrate that the proposed model generates a smoother volatility estimation, and largely outperforms a widely used GARCH model on several metrics about the fitness of the volatility modelling and the accuracy of the prediction.",/pdf/04c83a73829a02e59e4813d7e5376a918d1d724f.pdf,ICLR,2017,A novel integration of statistical models with recurrent neural networks providing a new way of formulating volatility models. 
+Hyls7h05FQ,ryx6bfCcFQ,1538090000000.0,1545360000000.0,1389,A Differentiable Self-disambiguated Sense Embedding Model via Scaled Gumbel Softmax,"[""fenfeigo@cs.umd.edu"", ""miyyer@cs.umass.edu"", ""leahkf@uw.edu"", ""jbg@umiacs.umd.edu""]","[""Fenfei Guo"", ""Mohit Iyyer"", ""Leah Findlater"", ""Jordan Boyd-Graber""]","[""unsupervised representation learning"", ""sense embedding"", ""word sense disambiguation"", ""human evaluation""]","We present a differentiable multi-prototype word representation model that disentangles senses of polysemous words and produces meaningful sense-specific embeddings without external resources. It jointly learns how to disambiguate senses given local context and how to represent senses using hard attention. Unlike previous multi-prototype models, our model approximates discrete sense selection in a differentiable manner via a modified Gumbel softmax. We also propose a novel human evaluation task that quantitatively measures (1) how meaningful the learned sense groups are to humans and (2) how well the model is able to disambiguate senses given a context sentence. Our model outperforms competing approaches on both human evaluations and multiple word similarity tasks.",/pdf/be0888bf76fdedb1c5a86f752da632be3b213bb2.pdf,ICLR,2019,Disambiguate and embed word senses with a differentiable hard-attention model using Scaled Gumbel Softmax +ErrNJYcVRmS,DsYhBvNhDW_,1601310000000.0,1614990000000.0,1397,F^2ed-Learning: Good Fences Make Good Neighbors,"[""~Lun_Wang1"", ""~Qi_Pang1"", ""~Shuai_Wang7"", ""~Dawn_Song1""]","[""Lun Wang"", ""Qi Pang"", ""Shuai Wang"", ""Dawn Song""]","[""Byzantine-Robust Federated Learning"", ""Secure Aggregation""]","In this paper, we present F^2ed-Learning, the first federated learning protocol simultaneously defending against both semi-honest server and Byzantine malicious clients. 
Using a robust mean estimator called FilterL2, F^2ed-Learning is the first FL protocol with dimension-free estimation error against Byzantine malicious clients. Besides, F^2ed-Learning leverages secure aggregation to protect the clients from a semi-honest server who wants to infer the clients' information from the legitimate updates. The main challenge stems from the incompatibility between FilterL2 and secure aggregation. Specifically, to run FilterL2, the server needs to access individual updates from clients while secure aggregation hides those updates from it. We propose to split the clients into shards, securely aggregate each shard's updates and run FilterL2 on the updates from different shards. The evaluation shows that F^2ed-Learning consistently achieves optimal or sub-optimal performance under three attacks among five robust FL protocols. The code for evaluation is available in the supplementary material.",/pdf/270506f364685060272ae79a343b1a6796b66ceb.pdf,ICLR,2021,"We propose F^2ed-Learning, the first federated learning protocol defending against both semi-honest server and Byzantine malicious clients." +3AOj0RCNC2,wBgGX6jZ3RQ,1601310000000.0,1615840000000.0,857,Gradient Projection Memory for Continual Learning,"[""~Gobinda_Saha1"", ""~Isha_Garg1"", ""~Kaushik_Roy1""]","[""Gobinda Saha"", ""Isha Garg"", ""Kaushik Roy""]","[""Continual Learning"", ""Representation Learning"", ""Computer Vision"", ""Deep learning""]","The ability to learn continually without forgetting the past tasks is a desired attribute for artificial learning systems. Existing approaches to enable such learning in artificial neural networks usually rely on network growth, importance based weight update or replay of old data from the memory. In contrast, we propose a novel approach where a neural network learns new tasks by taking gradient steps in the orthogonal direction to the gradient subspaces deemed important for the past tasks. 
We find the bases of these subspaces by analyzing network representations (activations) after learning each task with Singular Value Decomposition (SVD) in a single shot manner and store them in the memory as Gradient Projection Memory (GPM). With qualitative and quantitative analyses, we show that such orthogonal gradient descent induces minimum to no interference with the past tasks, thereby mitigates forgetting. We evaluate our algorithm on diverse image classification datasets with short and long sequences of tasks and report better or on-par performance compared to the state-of-the-art approaches. ",/pdf/a65e5f689852fad25dc117881988232e0f95ed52.pdf,ICLR,2021,"To avoid catastrophic forgetting in continual learning, we propose a novel approach where a neural network learns new tasks by taking gradient steps in the orthogonal direction to the gradient subspaces deemed important for past tasks." +ONBPHFZ7zG4,EFq9Qct6Uly,1601310000000.0,1616010000000.0,3475,Temporally-Extended ε-Greedy Exploration,"[""~Will_Dabney1"", ""~Georg_Ostrovski1"", ""~Andre_Barreto1""]","[""Will Dabney"", ""Georg Ostrovski"", ""Andre Barreto""]","[""reinforcement learning"", ""exploration""]","Recent work on exploration in reinforcement learning (RL) has led to a series of increasingly complex solutions to the problem. This increase in complexity often comes at the expense of generality. Recent empirical studies suggest that, when applied to a broader set of domains, some sophisticated exploration methods are outperformed by simpler counterparts, such as ε-greedy. In this paper we propose an exploration algorithm that retains the simplicity of ε-greedy while reducing dithering. We build on a simple hypothesis: the main limitation of ε-greedy exploration is its lack of temporal persistence, which limits its ability to escape local optima. We propose a temporally extended form of ε-greedy that simply repeats the sampled action for a random duration. 
It turns out that, for many duration distributions, this suffices to improve exploration on a large set of domains. Interestingly, a class of distributions inspired by ecological models of animal foraging behaviour yields particularly strong performance.",/pdf/be288b1cdd527108548adea1d4d8319ce8a8eae8.pdf,ICLR,2021,"We discuss a new framework for option-based exploration, present a thorough empirical study of a simple, generally applicable set of options within this framework, and observe improved performance over state-of-the-art agents and exploration methods." +ryx4PJrtvS,SJxb7j6uPr,1569440000000.0,1577170000000.0,1762,A Copula approach for hyperparameter transfer learning,"[""david.salinas.pro@gmail.com"", ""huibishe@amazon.com"", ""vperrone@amazon.com""]","[""David Salinas"", ""Huibin Shen"", ""Valerio Perrone""]","[""Hyperparameter optimization"", ""Bayesian Optimization"", ""Gaussian Process"", ""Copula"", ""Transfer-learning""]","Bayesian optimization (BO) is a popular methodology to tune the hyperparameters of expensive black-box functions. Despite its success, standard BO focuses on a single task at a time and is not designed to leverage information from related functions, such as tuning performance metrics of the same algorithm across multiple datasets. In this work, we introduce a novel approach to achieve transfer learning across different datasets as well as different metrics. The main idea is to regress the mapping from hyperparameter to metric quantiles with a semi-parametric Gaussian Copula distribution, which provides robustness against different scales or outliers that can occur in different tasks. We introduce two methods to leverage this estimation: a Thompson sampling strategy as well as a Gaussian Copula process using such quantile estimate as a prior. 
We show that these strategies can combine the estimation of multiple metrics such as runtime and accuracy, steering the optimization toward cheaper hyperparameters for the same level of accuracy. Experiments on an extensive set of hyperparameter tuning tasks demonstrate significant improvements over state-of-the-art methods.",/pdf/73e160c8a92a4ab100dff5d33d6adb928e4c42e2.pdf,ICLR,2020,We show how using semi-parametric prior estimations can speed up HPO significantly across datasets and metrics. +HyI6s40a-,HyrpoE0T-,1508950000000.0,1518730000000.0,89,Towards Safe Deep Learning: Unsupervised Defense Against Generic Adversarial Attacks,"[""bita@ucsd.edu"", ""msamragh@ucsd.edu"", ""tjavidi@ucsd.edu"", ""farinaz@ucsd.edu""]","[""Bita Darvish Rouhani"", ""Mohammad Samragh"", ""Tara Javidi"", ""Farinaz Koushanfar""]","[""Adversarial Attacks"", ""Unsupervised Defense"", ""Deep Learning""]","Recent advances in adversarial Deep Learning (DL) have opened up a new and largely unexplored surface for malicious attacks jeopardizing the integrity of autonomous DL systems. We introduce a novel automated countermeasure called Parallel Checkpointing Learners (PCL) to thwart the potential adversarial attacks and significantly improve the reliability (safety) of a victim DL model. The proposed PCL methodology is unsupervised, meaning that no adversarial sample is leveraged to build/train parallel checkpointing learners. We formalize the goal of preventing adversarial attacks as an optimization problem to minimize the rarely observed regions in the latent feature space spanned by a DL network. To solve the aforementioned minimization problem, a set of complementary but disjoint checkpointing modules are trained and leveraged to validate the victim model execution in parallel. Each checkpointing learner explicitly characterizes the geometry of the input data and the corresponding high-level data abstractions within a particular DL layer. 
As such, the adversary is required to simultaneously deceive all the defender modules in order to succeed. We extensively evaluate the performance of the PCL methodology against the state-of-the-art attack scenarios, including Fast-Gradient-Sign (FGS), Jacobian Saliency Map Attack (JSMA), Deepfool, and Carlini&WagnerL2 algorithm. Extensive proof-of-concept evaluations for analyzing various data collections including MNIST, CIFAR10, and ImageNet corroborate the effectiveness of our proposed defense mechanism against adversarial samples. ",/pdf/5674dfa11daa2938917c20bde5dcde7899eeb011.pdf,ICLR,2018,Devising unsupervised defense mechanisms against adversarial attacks is crucial to ensure the generalizability of the defense. +rJg4J3CqFm,Skepida5tX,1538090000000.0,1550920000000.0,978,Learning Embeddings into Entropic Wasserstein Spaces,"[""frogner@mit.edu"", ""farzaneh@ibm.com"", ""jsolomon@mit.edu""]","[""Charlie Frogner"", ""Farzaneh Mirzazadeh"", ""Justin Solomon""]","[""Embedding"", ""Wasserstein"", ""Sinkhorn"", ""Optimal Transport""]","Despite their prevalence, Euclidean embeddings of data are fundamentally limited in their ability to capture latent semantic structures, which need not conform to Euclidean spatial assumptions. Here we consider an alternative, which embeds data as discrete probability distributions in a Wasserstein space, endowed with an optimal transport metric. Wasserstein spaces are much larger and more flexible than Euclidean spaces, in that they can successfully embed a wider variety of metric structures. We propose to exploit this flexibility by learning an embedding that captures the semantic information in the Wasserstein distance between embedded distributions. We examine empirically the representational capacity of such learned Wasserstein embeddings, showing that they can embed a wide variety of complex metric structures with smaller distortion than an equivalent Euclidean embedding. 
We also investigate an application to word embedding, demonstrating a unique advantage of Wasserstein embeddings: we can directly visualize the high-dimensional embedding, as it is a probability distribution on a low-dimensional space. This obviates the need for dimensionality reduction techniques such as t-SNE for visualization.",/pdf/7022cf0cd3a16ba1442467989f8bef9464050314.pdf,ICLR,2019,We show that Wasserstein spaces are good targets for embedding data with complex semantic structure. +B1xu6yStPH,H1eJ-ByYPr,1569440000000.0,1577170000000.0,1995,Using Explainabilty to Detect Adversarial Attacks,"[""amosy3@gmail.com"", ""gal.chechik@gmail.com""]","[""Ohad Amosy and Gal Chechik""]","[""adversarial"", ""detection"", ""explainability""]","Deep learning models are often sensitive to adversarial attacks, where carefully-designed input samples can cause the system to produce incorrect decisions. Here we focus on the problem of detecting attacks, rather than robust classification, since detecting that an attack occurs may be even more important than avoiding misclassification. We build on advances in explainability, where activity-map-like explanations are used to justify and validate decisions, by highlighting features that are involved with a classification decision. The key observation is that it is hard to create explanations for incorrect decisions. We propose EXAID, a novel attack-detection approach, which uses model explainability to identify images whose explanations are inconsistent with the predicted class. Specifically, we use SHAP, which uses Shapley values in the space of the input image, to identify which input features contribute to a class decision. Interestingly, this approach does not require to modify the attacked model, and it can be applied without modelling a specific attack. It can therefore be applied successfully to detect unfamiliar attacks, that were unknown at the time the detection model was designed. 
We evaluate EXAID on two benchmark datasets CIFAR-10 and SVHN, and against three leading attack techniques, FGSM, PGD and C&W. We find that EXAID improves over the SoTA detection methods by a large margin across a wide range of noise levels, improving detection from 70% to over 90% for small perturbations.",/pdf/8cee59947d1d42acff775ded49e79e9c52f396f5.pdf,ICLR,2020,"A novel adversarial detection approach, which uses explainability methods to identify images whose explanations are inconsistent with the predicted class. " +Syx79eBKwr,BygAo0xKDS,1569440000000.0,1583910000000.0,2466,A Mutual Information Maximization Perspective of Language Representation Learning,"[""lingpenk@google.com"", ""cyprien@google.com"", ""leiyu@google.com"", ""lingwang@google.com"", ""zihangd@google.com"", ""dyogatama@google.com""]","[""Lingpeng Kong"", ""Cyprien de Masson d'Autume"", ""Lei Yu"", ""Wang Ling"", ""Zihang Dai"", ""Dani Yogatama""]",[],"We show state-of-the-art word representation learning methods maximize an objective function that is a lower bound on the mutual information between different parts of a word sequence (i.e., a sentence). Our formulation provides an alternative perspective that unifies classical word embedding models (e.g., Skip-gram) and modern contextual embeddings (e.g., BERT, XLNet). In addition to enhancing our theoretical understanding of these methods, our derivation leads to a principled framework that can be used to construct new self-supervised tasks. We provide an example by drawing inspirations from related methods based on mutual information maximization that have been successful in computer vision, and introduce a simple self-supervised objective that maximizes the mutual information between a global sentence representation and n-grams in the sentence. 
Our analysis offers a holistic view of representation learning methods to transfer knowledge and translate progress across multiple domains (e.g., natural language processing, computer vision, audio processing).",/pdf/eb7c97ac3d4465bf40ae5911244e1fba78fd3314.pdf,ICLR,2020, +HJeLBpEFPB,HJeCtQIwPH,1569440000000.0,1577170000000.0,523,Unsupervised Universal Self-Attention Network for Graph Classification,"[""dai.nguyen@monash.edu"", ""tu.dinh.nguyen@monash.edu"", ""dinh.phung@monash.edu""]","[""Dai Quoc Nguyen"", ""Tu Dinh Nguyen"", ""Dinh Phung""]","[""Graph embedding"", ""graph classification"", ""universal self-attention network"", ""graph neural network""]","Existing graph embedding models often have weaknesses in exploiting graph structure similarities, potential dependencies among nodes and global network properties. To this end, we present U2GAN, a novel unsupervised model leveraging on the strength of the recently introduced universal self-attention network (Dehghani et al., 2019), to learn low-dimensional embeddings of graphs which can be used for graph classification. In particular, given an input graph, U2GAN first applies a self-attention computation, which is then followed by a recurrent transition to iteratively memorize its attention on vector representations of each node and its neighbors across each iteration. Thus, U2GAN can address the weaknesses in the existing models in order to produce plausible node embeddings whose sum is the final embedding of the whole graph. Experimental results show that our unsupervised U2GAN produces new state-of-the-art performances on a range of well-known benchmark datasets for the graph classification task. 
It even outperforms supervised methods in most of benchmark cases.",/pdf/890c4ec4c1c8106153bfa0fca28406f98ef8184d.pdf,ICLR,2020, +R6tNszN_QfA,btPqwRPGOQO,1601310000000.0,1614990000000.0,2118,Adversarial Problems for Generative Networks,"[""~Kalliopi_Basioti1"", ""moustaki@upatras.gr""]","[""Kalliopi Basioti"", ""George V. Moustakides""]","[""generative networks"", ""adversarial generative networks""]","We are interested in the design of generative networks. The training of these mathematical structures is mostly performed with the help of adversarial (min-max) optimization problems. We propose a simple methodology for constructing such problems assuring, at the same time, consistency of the corresponding solution. We give characteristic examples developed by our method, some of which can be recognized from other applications and some are introduced here for the first time. We compare various possibilities by applying them to well known datasets using neural networks of different configurations and sizes.",/pdf/7a8c91e241b11bc93bf9cb0d2f874cc5ce748920.pdf,ICLR,2021, +ry2YOrcge,,1478280000000.0,1488470000000.0,260,Learning a Natural Language Interface with Neural Programmer,"[""arvind@cs.umass.edu"", ""qvl@google.com"", ""abadi@google.com"", ""mccallum@cs.umass.edu"", ""damodei@openai.com""]","[""Arvind Neelakantan"", ""Quoc V. Le"", ""Martin Abadi"", ""Andrew McCallum"", ""Dario Amodei""]","[""Natural language processing"", ""Deep learning""]","Learning a natural language interface for database tables is a challenging task that involves deep language understanding and multi-step reasoning. The task is often approached by mapping natural language queries to logical forms or programs that provide the desired response when executed on the database. To our knowledge, this paper presents the first weakly supervised, end-to-end neural network model to induce such programs on a real-world dataset. 
We enhance the objective function of Neural Programmer, a neural network with built-in discrete operations, and apply it on WikiTableQuestions, a natural language question-answering dataset. The model is trained end-to-end with weak supervision of question-answer pairs, and does not require domain-specific grammars, rules, or annotations that are key elements in previous approaches to program induction. The main experimental result in this paper is that a single Neural Programmer model achieves 34.2% accuracy using only 10,000 examples with weak supervision. An ensemble of 15 models, with a trivial combination technique, achieves 37.7% accuracy, which is competitive to the current state-of-the-art accuracy of 37.1% obtained by a traditional natural language semantic parser.",/pdf/9aafb921d591ef83b4288e36d124c2fd0a9234c5.pdf,ICLR,2017,"To our knowledge, this paper presents the first weakly supervised, end-to-end neural network model to induce programs on a real-world dataset." +B1-q5Pqxl,,1478290000000.0,1489370000000.0,417,Machine Comprehension Using Match-LSTM and Answer Pointer,"[""shwang.2014@phdis.smu.edu.sg"", ""jingjiang@smu.edu.sg""]","[""Shuohang Wang"", ""Jing Jiang""]","[""Natural language processing"", ""Deep learning""]","Machine comprehension of text is an important problem in natural language processing. A recently released dataset, the Stanford Question Answering Dataset (SQuAD), offers a large number of real questions and their answers created by humans through crowdsourcing. SQuAD provides a challenging testbed for evaluating machine comprehension algorithms, partly because compared with previous datasets, in SQuAD the answers do not come from a small set of candidate answers and they have variable lengths. We propose an end-to-end neural architecture for the task. The architecture is based on match-LSTM, a model we proposed previously for textual entailment, and Pointer Net, a sequence-to-sequence model proposed by Vinyals et al. 
(2015) to constrain the output tokens to be from the input sequences. We propose two ways of using Pointer Net for our tasks. Our experiments show that both of our two models substantially outperform the best results obtained by Rajpurkar et al. (2016) using logistic regression and manually crafted features. Besides, our boundary model also achieves the best performance on the MSMARCO dataset (Nguyen et al. 2016).",/pdf/78402508a0a3ebc1c9078ef8ff48dccc6fef6cd8.pdf,ICLR,2017,Using Match-LSTM and Answer Pointer to select a variable length answer from a paragraph +SJu63o10b,Sywp3jyCb,1509040000000.0,1518730000000.0,163,UNSUPERVISED METRIC LEARNING VIA NONLINEAR FEATURE SPACE TRANSFORMATIONS,"[""pz335412@ohio.edu"", ""bibo.shi@duke.edu"", ""liuj1@ohio.edu""]","[""Pin Zhang"", ""Bibo Shi"", ""JundongLiu""]","[""Metric Learning"", ""K-means"", ""CPD"", ""Clustering""]","In this paper, we propose a nonlinear unsupervised metric learning framework to boost the performance of clustering algorithms. Under our framework, nonlinear distance metric learning and manifold embedding are integrated and conducted simultaneously to increase the natural separations among data samples. The metric learning component is implemented through feature space transformations, regulated by a nonlinear deformable model called Coherent Point Drifting (CPD). Driven by CPD, data points can get to a higher level of linear separability, which is subsequently picked up by the manifold embedding component to generate well-separable sample projections for clustering. Experimental results on synthetic and benchmark datasets show the effectiveness of our proposed approach over the state-of-the-art solutions in unsupervised metric learning. +",/pdf/70702223b00983b82d5f731a759a8b0147b2c721.pdf,ICLR,2018, a nonlinear unsupervised metric learning framework to boost the performance of clustering algorithms. 
+Bkxbrn0cYX,SyxCIe_cK7,1538090000000.0,1550900000000.0,1517,Selfless Sequential Learning,"[""rahaf.aljundi@gmail.com"", ""mrf@fb.com"", ""tinne.tuytelaars@esat.kuleuven.be""]","[""Rahaf Aljundi"", ""Marcus Rohrbach"", ""Tinne Tuytelaars""]","[""Lifelong learning"", ""Continual Learning"", ""Sequential learning"", ""Regularization""]","Sequential learning, also called lifelong learning, studies the problem of learning tasks in a sequence with access restricted to only the data of the current task. In this paper we look at a scenario with fixed model capacity, and postulate that the learning process should not be selfish, i.e. it should account for future tasks to be added and thus leave enough capacity for them. To achieve Selfless Sequential Learning we study different regularization strategies and activation functions. We find that +imposing sparsity at the level of the representation (i.e. neuron activations) is more beneficial for sequential learning than encouraging parameter sparsity. In particular, we propose a novel regularizer, that encourages representation sparsity by means of neural inhibition. It results in few active neurons which in turn leaves more free neurons to be utilized by upcoming tasks. As neural inhibition over an entire layer can be too drastic, especially for complex tasks requiring strong representations, +our regularizer only inhibits other neurons in a local neighbourhood, inspired by lateral inhibition processes in the brain. We combine our novel regularizer with state-of-the-art lifelong learning methods that penalize changes to important previously learned parts of the network. 
We show that our new regularizer leads to increased sparsity which translates in consistent performance improvement on diverse datasets.",/pdf/64e66d31f9f3a99f2109c369d875ca110f290d6b.pdf,ICLR,2019,A regularization strategy for improving the performance of sequential learning +TYXs_y84xRj,qF8M2lCoS57h,1601310000000.0,1616060000000.0,2432,PolarNet: Learning to Optimize Polar Keypoints for Keypoint Based Object Detection,"[""~Wu_Xiongwei1"", ""~Doyen_Sahoo1"", ""~Steven_HOI1""]","[""Wu Xiongwei"", ""Doyen Sahoo"", ""Steven HOI""]","[""Object Detection"", ""Deep Learning""]","A variety of anchor-free object detectors have been actively proposed as possible alternatives to the mainstream anchor-based detectors that often rely on complicated design of anchor boxes. Despite achieving promising performance on par with anchor-based detectors, the existing anchor-free detectors such as FCOS or CenterNet predict objects based on standard Cartesian coordinates, which often yield poor quality keypoints. Further, the feature representation is also scale-sensitive. In this paper, we propose a new anchor-free keypoint based detector ``PolarNet"", where keypoints are represented as a set of Polar coordinates instead of Cartesian coordinates. The ``PolarNet"" detector learns offsets pointing to the corners of objects in order to learn high quality keypoints. Additionally, PolarNet uses features of corner points to localize objects, making the localization scale-insensitive. Finally in our experiments, we show that PolarNet, an anchor-free detector, outperforms the existing anchor-free detectors, and it is able to achieve highly competitive result on COCO test-dev benchmark ($47.8\%$ and $50.3\%$ AP under the single-model single-scale and multi-scale testing) which is on par with the state-of-the-art two-stage anchor-based object detectors. 
The code and the models are available at https://github.com/XiongweiWu/PolarNetV1",/pdf/d08ca7f6d8b412afb77ae32d7522a517e41f4741.pdf,ICLR,2021, +SJlJegHFvH,ryllAjyKPr,1569440000000.0,1577170000000.0,2083,Address2vec: Generating vector embeddings for blockchain analytics,"[""ali.hussein@ronininstitute.org"", ""nsamiiha@gmail.com""]","[""Ali Hussein"", ""Samiiha Nalwooga""]","[""crypto-currency"", ""bitcoin"", ""blockchain"", ""2vec""]","Bitcoin is a virtual coinage system that enables users to trade virtually free of a central trusted authority. All transactions on the Bitcoin blockchain are publicly available for viewing, yet as Bitcoin is built mainly for security its original structure does not allow for direct analysis of address transactions. +Existing analysis methods of the Bitcoin blockchain can be complicated, computationally expensive or inaccurate. We propose a computationally efficient model to analyze bitcoin blockchain addresses and allow for their use with existing machine learning algorithms. We compare our approach against Multi Level Sequence Learners (MLSLs), one of the best performing models on bitcoin address data.",/pdf/4d4739f14ce77fe3b287d33c212ce20ec3d61db8.pdf,ICLR,2020,a 2vec model for cryptocurrency transaction graphs +BJbD_Pqlg,,1478290000000.0,1484660000000.0,403,Human perception in computer vision,"[""ron.dekel@weizmann.ac.il""]","[""Ron Dekel""]","[""Computer vision"", ""Transfer Learning""]","Computer vision has made remarkable progress in recent years. Deep neural network (DNN) models optimized to identify objects in images exhibit unprecedented task-trained accuracy and, remarkably, some generalization ability: new visual problems can now be solved more easily based on previous learning. Biological vision (learned in life and through evolution) is also accurate and general-purpose. Is it possible that these different learning regimes converge to similar problem-dependent optimal computations? 
We therefore asked whether the human system-level computation of visual perception has DNN correlates and considered several anecdotal test cases. We found that perceptual sensitivity to image changes has DNN mid-computation correlates, while sensitivity to segmentation, crowding and shape has DNN end-computation correlates. Our results quantify the applicability of using DNN computation to estimate perceptual loss, and are consistent with the fascinating theoretical view that properties of human perception are a consequence of architecture-independent visual learning.",/pdf/f7a47138dcb8ca903d2eb0305183df272b9b4db4.pdf,ICLR,2017,Correlates for several properties of human perception emerge in convolutional neural networks following image categorization learning. +HkIQH7qel,,1478270000000.0,1484270000000.0,193,Learning Recurrent Span Representations for Extractive Question Answering,"[""kentonl@cs.washington.edu"", ""tomkwiat@google.com"", ""aparikh@google.com"", ""dipanjand@google.com""]","[""Kenton Lee"", ""Tom Kwiatkowksi"", ""Ankur Parikh"", ""Dipanjan Das""]","[""Natural language processing""]","The reading comprehension task, that asks questions about a given evidence document, is a central problem in natural language understanding. Recent formulations of this task have typically focused on answer selection from a set of candidates pre-defined manually or through the use of an external NLP pipeline. However, Rajpurkar et al. (2016) recently released the SQUAD dataset in which the answers can be arbitrary strings from the supplied text. In this paper, we focus on this answer extraction task, presenting a novel model architecture that efficiently builds fixed length representations of all spans in the evidence document with a recurrent network. We show that scoring explicit span representations significantly improves performance over other approaches that factor the prediction into separate predictions about words or start and end markers. 
Our approach improves upon the best published results of Wang & Jiang (2016) by 5% and decreases the error of Rajpurkar et al.’s baseline by > 50%.",/pdf/fbd210da689f374d13a294133f9b9cc1fa9f671c.pdf,ICLR,2017,We present a globally normalized architecture for extractive question answering that contains explicit representations of all possible answer spans. +r1xQNlBYPS,S1eLQUxKvS,1569440000000.0,1577170000000.0,2241,Multichannel Generative Language Models,"[""hchan@cs.toronto.edu"", ""kiros@google.com"", ""williamchan@google.com""]","[""Harris Chan"", ""Jamie Kiros"", ""William Chan""]","[""text generation"", ""generative language models"", ""natural language processing""]","A channel corresponds to a viewpoint or transformation of an underlying meaning. A pair of parallel sentences in English and French express the same underlying meaning but through two separate channels corresponding to their languages. In this work, we present Multichannel Generative Language Models (MGLM), which models the joint distribution over multiple channels, and all its decompositions using a single neural network. MGLM can be trained by feeding it k way parallel-data, bilingual data, or monolingual data across pre-determined channels. MGLM is capable of both conditional generation and unconditional sampling. For conditional generation, the model is given a fully observed channel, and generates the k-1 channels in parallel. In the case of machine translation, this is akin to giving it one source, and the model generates k-1 targets. MGLM can also do partial conditional sampling, where the channels are seeded with prespecified words, and the model is asked to infill the rest. Finally, we can sample from MGLM unconditionally over all k channels. Our experiments on the Multi30K dataset containing English, French, Czech, and German languages suggest that the multitask training with the joint objective leads to improvements in bilingual translations. 
We provide a quantitative analysis of the quality-diversity trade-offs for different variants of the multichannel model for conditional generation, and a measurement of self-consistency during unconditional generation. We provide qualitative examples for parallel greedy decoding across languages and sampling from the joint distribution of the 4 languages.",/pdf/db58d3ed2153df0d6efed04a621cca3305d92e36.pdf,ICLR,2020,"we propose Multichannel Generative Language Models (MGLM), which models the joint distribution over multiple channels, and all its decompositions using a single neural network" +rkx35lHKwB,BJxn7kZKvr,1569440000000.0,1577170000000.0,2487,Generalizing Reinforcement Learning to Unseen Actions,"[""ayushj@usc.edu"", ""szot@usc.edu"", ""jinchenz@usc.edu"", ""limjj@usc.edu""]","[""Ayush Jain*"", ""Andrew Szot*"", ""Jincheng Zhou"", ""Joseph J. Lim""]","[""reinforcement learning"", ""unsupervised representation learning"", ""generalization""]","A fundamental trait of intelligence is the ability to achieve goals in the face of novel circumstances. In this work, we address one such setting which requires solving a task with a novel set of actions. Empowering machines with this ability requires generalization in the way an agent perceives its available actions along with the way it uses these actions to solve tasks. Hence, we propose a framework to enable generalization over both these aspects: understanding an action’s functionality, and using actions to solve tasks through reinforcement learning. Specifically, an agent interprets an action’s behavior using unsupervised representation learning over a collection of data samples reflecting the diverse properties of that action. We employ a reinforcement learning architecture which works over these action representations, and propose regularization metrics essential for enabling generalization in a policy. 
We illustrate the generalizability of the representation learning method and policy, to enable zero-shot generalization to previously unseen actions on challenging sequential decision-making environments. Our results and videos can be found at sites.google.com/view/action-generalization/",/pdf/7889025dadd5038935b45ca48b4f19de2a81eba4.pdf,ICLR,2020,We address the problem of generalization of reinforcement learning to unseen action spaces. +rke5R1SFwS,SyeSTPJKPB,1569440000000.0,1577170000000.0,2034,Learning to Remember from a Multi-Task Teacher,"[""yuwen@cs.toronto.edu"", ""mren@cs.toronto.edu"", ""urtasun@uber.com""]","[""Yuwen Xiong"", ""Mengye Ren"", ""Raquel Urtasun""]","[""Meta-learning"", ""sequential learning"", ""catastrophic forgetting""]","Recent studies on catastrophic forgetting during sequential learning typically focus on fixing the accuracy of the predictions for a previously learned task. In this paper we argue that the outputs of neural networks are subject to rapid changes when learning a new data distribution, and networks that appear to ""forget"" everything still contain useful representation towards previous tasks. We thus propose that, instead of enforcing the output accuracy to stay the same, we should aim to reduce the effect of catastrophic forgetting on the representation level, as the output layer can be quickly recovered later with a small number of examples. Towards this goal, we propose an experimental setup that measures the amount of representational forgetting, and develop a novel meta-learning algorithm to overcome this issue. The proposed meta-learner produces weight updates of a sequential learning network, mimicking a multi-task teacher network's representation. 
We show that our meta-learner can improve its learned representations on new tasks, while maintaining a good representation for old tasks.",/pdf/c4d0f24cf868bcbd6c20b0e7d0a0f77907b3dda8.pdf,ICLR,2020,We propose a new meta-learning algorithm for sequential representation learning +awOrpNtsCX,pflG89Hwu1,1601310000000.0,1614990000000.0,1300,Shape-Tailored Deep Neural Networks Using PDEs for Segmentation,"[""~Naeemullah_Khan1"", ""~Angira_Sharma1"", ""~Philip_Torr1"", ""~Ganesh_Sundaramoorthi1""]","[""Naeemullah Khan"", ""Angira Sharma"", ""Philip Torr"", ""Ganesh Sundaramoorthi""]","[""robustness"", ""covariance"", ""invariance"", ""convolutional neural nets"", ""PDEs"", ""segmentation""]","We present Shape-Tailored Deep Neural Networks (ST-DNN). ST-DNN extend convolutional networks, which aggregate data from fixed shape (square) neighborhoods to compute descriptors, to be defined on arbitrarily shaped regions. This is useful for segmentation applications, where it is desired to have descriptors that aggregate data only within regions of segmentation to avoid mixing data from different regions, otherwise, the descriptors are difficult to group to a unique region. We formulate these descriptors through partial differential equations (PDE) that naturally generalize convolution to arbitrary regions, and derive the methodology to jointly estimate the segmentation and ST-DNN descriptor. We also show that ST-DNN inherit covariance to translations and rotations from the PDE, a natural property of a segmentation method, which existing CNN based methods lack. ST-DNN are 3-4 orders of magnitude smaller than typical CNN. We empirically show that they exceed segmentation performance compared to state-of-the-art CNN-based descriptors using 2-3 orders of magnitude smaller training sets on the texture segmentation problem.",/pdf/2eb2d25ecae3d0c19a814ee737568649e59c9111.pdf,ICLR,2021,Partial differential equations in deep neural networks for better covariance/invariance properties. 
+xng0HoPDaFN,Q6WXYHaoqCa,1601310000000.0,1614990000000.0,1726,An Adversarial Attack via Feature Contributive Regions,"[""qianyaguan@zust.edu.cn"", ""~Jiamin_Wang1"", ""~Xiang_Ling1"", ""~Zhaoquan_Gu2"", ""wbin2006@gmail.com"", ""wuchunming@zju.edu.cn""]","[""Yaguan Qian"", ""Jiamin Wang"", ""Xiang Ling"", ""Zhaoquan Gu"", ""Bin Wang"", ""Chunming Wu""]","[""Adversarial example"", ""Feature contributive regions"", ""Local attack""]","Recently, to deal with the vulnerability to generate examples of CNNs, there are many advanced algorithms that have been proposed. These algorithms focus on modifying global pixels directly with small perturbations, and some work involves modifying local pixels. However, the global attacks have the problem of perturbations’ redundancy and the local attacks are not effective. To overcome this challenge, we achieve a trade-off between the perturbation power and the number of perturbed pixels in this paper. The key idea is to find the feature contributive regions (FCRs) of the images. Furthermore, in order to create an adversarial example similar to the corresponding clean image as much as possible, we redefine a loss function as the objective function of the optimization in this paper and then use a gradient descent optimization algorithm to find the efficient perturbations. Various experiments have been carried out on CIFAR-10 and ILSVRC2012 datasets, which show the excellence of this method, and in addition, the FCRs attack shows strong attack ability in both white-box and black-box settings.",/pdf/380af2ef79d0af605003c5eb5db345f7caffc281.pdf,ICLR,2021,This work explores the method of generating perturbations via the feature contribution regions and provides evidence to prove that the attack on the local semantics is the most effective. 
+kdm4Lm9rgB,FWhwA0xWc-p,1601310000000.0,1614990000000.0,3728,Monotonic Robust Policy Optimization with Model Discrepancy,"[""~Yuankun_Jiang1"", ""~Chenglin_Li2"", ""~Junni_Zou1"", ""~Wenrui_Dai1"", ""~Hongkai_Xiong1""]","[""Yuankun Jiang"", ""Chenglin Li"", ""Junni Zou"", ""Wenrui Dai"", ""Hongkai Xiong""]","[""Reinforcement Learning"", ""generalization""]","State-of-the-art deep reinforcement learning (DRL) algorithms tend to overfit in some specific environments due to the lack of data diversity in training. To mitigate the model discrepancy between training and target (testing) environments, domain randomization (DR) can generate plenty of environments with a sufficient diversity by randomly sampling environment parameters in simulator. Though standard DR using a uniform distribution improves the average performance on the whole range of environments, the worst-case environment is usually neglected without any performance guarantee. Since the average and worst-case performance are equally important for the generalization in RL, in this paper, we propose a policy optimization approach for concurrently improving the policy's performance in the average case (i.e., over all possible environments) and the worst-case environment. We theoretically derive a lower bound for the worst-case performance of a given policy over all environments. Guided by this lower bound, we formulate an optimization problem which aims to optimize the policy and sampling distribution together, such that the constrained expected performance of all environments is maximized. We prove that the worst-case performance is monotonically improved by iteratively solving this optimization problem. Based on the proposed lower bound, we develop a practical algorithm, named monotonic robust policy optimization (MRPO), and validate MRPO on several robot control tasks. 
By modifying the environment parameters in simulation, we obtain environments for the same task but with different transition dynamics for training and testing. We demonstrate that MRPO can improve both the average and worst-case performance in the training environments, and facilitate the learned policy with a better generalization capability in unseen testing environments.",/pdf/9e766d188175795a805e9636377508e9945ae758.pdf,ICLR,2021, +HygYmJBKwH,HyxG6vhuPr,1569440000000.0,1577170000000.0,1625,YaoGAN: Learning Worst-case Competitive Algorithms from Self-generated Inputs,"[""zuza777@gmail.com"", ""wadi@google.com"", ""aranyak@google.com"", ""siva@google.com""]","[""Goran Zuzic"", ""Di Wang"", ""Aranyak Mehta"", ""D. Sivakumar""]",[],"We tackle the challenge of using machine learning to find algorithms with strong worst-case guarantees for online combinatorial optimization problems. Whereas the previous approach along this direction (Kong et al., 2018) relies on significant domain expertise to provide hard distributions over input instances at training, we ask whether this can be accomplished from first principles, i.e., without any human-provided data beyond specifying the objective of the optimization problem. To answer this question, we draw insights from classic results in game theory, analysis of algorithms, and online learning to introduce a novel framework. At the high level, similar to a generative adversarial network (GAN), our framework has two components whose respective goals are to learn the optimal algorithm as well as a set of input instances that captures the essential difficulty of the given optimization problem. The two components are trained against each other and evolved simultaneously. We test our ideas on the ski rental problem and the fractional AdWords problem. 
For these well-studied problems, our preliminary results demonstrate that the framework is capable of finding algorithms as well as difficult input instances that are consistent with known optimal results. We believe our new framework points to a promising direction which can facilitate the research of algorithm design by leveraging ML to improve the state of the art both in theory and in practice. + ",/pdf/09646aa6fae9b8e0add8e241f95383cd2017b23e.pdf,ICLR,2020, +JFKR3WqwyXR,yJnaU-lAzOx,1601310000000.0,1618580000000.0,3695,Neural Jump Ordinary Differential Equations: Consistent Continuous-Time Prediction and Filtering,"[""~Calypso_Herrera1"", ""~Florian_Krach1"", ""jteichma@math.ethz.ch""]","[""Calypso Herrera"", ""Florian Krach"", ""Josef Teichmann""]","[""Neural ODE"", ""conditional expectation"", ""irregular-observed data modelling""]","Combinations of neural ODEs with recurrent neural networks (RNN), like GRU-ODE-Bayes or ODE-RNN are well suited to model irregularly observed time series. While those models outperform existing discrete-time approaches, no theoretical guarantees for their predictive capabilities are available. Assuming that the irregularly-sampled time series data originates from a continuous stochastic process, the $L^2$-optimal online prediction is the conditional expectation given the currently available information. We introduce the Neural Jump ODE (NJ-ODE) that provides a data-driven approach to learn, continuously in time, the conditional expectation of a stochastic process. Our approach models the conditional expectation between two observations with a neural ODE and jumps whenever a new observation is made. We define a novel training framework, which allows us to prove theoretical guarantees for the first time. In particular, we show that the output of our model converges to the $L^2$-optimal prediction. This can be interpreted as solution to a special filtering problem. 
We provide experiments showing that the theoretical results also hold empirically. Moreover, we experimentally show that our model outperforms the baselines in more complex learning tasks and give comparisons on real-world datasets.",/pdf/56667371832a16b261e1e17afd3b1d2f6e9bfb0e.pdf,ICLR,2021,Online prediction and filtering of irregularly-observed time series data using Neural Jump ODE with theoretical convergence guarantees. +H1gFuiA9KX,BJeMjCtqK7,1538090000000.0,1545360000000.0,371,Skip-gram word embeddings in hyperbolic space,"[""matthias@lateral.io"", ""benjamin@lateral.io""]","[""Matthias Leimeister"", ""Benjamin J. Wilson""]","[""word embeddings"", ""hyperbolic"", ""skip-gram""]","Embeddings of tree-like graphs in hyperbolic space were recently shown to surpass their Euclidean counterparts in performance by a large margin. +Inspired by these results, we present an algorithm for learning word embeddings in hyperbolic space from free text. An objective function based on the hyperbolic distance is derived and included in the skip-gram negative-sampling architecture from word2vec. The hyperbolic word embeddings are then evaluated on word similarity and analogy benchmarks. The results demonstrate the potential of hyperbolic word embeddings, particularly in low dimensions, though without clear superiority over their Euclidean counterparts. We further discuss subtleties in the formulation of the analogy task in curved spaces.",/pdf/21d686ecc03659e322bb33a334340db7a8c755ab.pdf,ICLR,2019, +r1w7Jdqxl,,1478290000000.0,1481280000000.0,449,Collaborative Deep Embedding via Dual Networks,"[""xy014@ie.cuhk.edu.hk"", ""dhlin@ie.cuhk.edu.hk"", ""niu.haoying@huawei.com"", ""cheng.jiefeng@huawei.com"", ""li.zhenguo@huawei.com""]","[""Yilei Xiong"", ""Dahua Lin"", ""Haoying Niu"", ""JIefeng Cheng"", ""Zhenguo Li""]",[],"Despite the long history of research on recommender systems, current approaches still face a number of challenges in practice, e.g. 
the difficulties in handling new items, the high diversity of user interests, and the noisiness and sparsity of observations. Many of such difficulties stem from the lack of expressive power to capture the complex relations between items and users. This paper presents a new method to tackle this problem, called Collaborative Deep Embedding. In this method, a pair of dual networks, one for encoding items and the other for users, are jointly trained in a collaborative fashion. +Particularly, both networks produce embeddings at multiple aligned levels, which, when combined together, can accurately predict the matching between items and users. Compared to existing methods, the proposed one not only provides greater expressive power to capture complex matching relations, but also generalizes better to unseen items or users. On multiple real-world datasets, this method outperforms the state of the art.",/pdf/c7ce590223d60f743d93d2450ad333ab8e1f00af.pdf,ICLR,2017, +HJzLdjR9FX,HJe0iJ8qt7,1538090000000.0,1545360000000.0,359,DeepTwist: Learning Model Compression via Occasional Weight Distortion,"[""dslee3@gmail.com"", ""kparichay@gmail.com"", ""quddnr145@gmail.com""]","[""Dongsoo Lee"", ""Parichay Kapoor"", ""Byeongwook Kim""]","[""deep learning"", ""model compression"", ""pruning"", ""quantization"", ""SVD"", ""regularization"", ""framework""]","Model compression has been introduced to reduce the required hardware resources while maintaining the model accuracy. Lots of techniques for model compression, such as pruning, quantization, and low-rank approximation, have been suggested along with different inference implementation characteristics. Adopting model compression is, however, still challenging because the design complexity of model compression is rapidly increasing due to additional hyper-parameters and computation overhead in order to achieve a high compression ratio. 
In this paper, we propose a simple and efficient model compression framework called DeepTwist which distorts weights in an occasional manner without modifying the underlying training algorithms. The ideas of designing weight distortion functions are intuitive and straightforward given formats of compressed weights. We show that our proposed framework improves compression rate significantly for pruning, quantization, and low-rank approximation techniques while the efforts of additional retraining and/or hyper-parameter search are highly reduced. Regularization effects of DeepTwist are also reported.",/pdf/6aac6a633567daaaaf9b203c091801b317547dc5.pdf,ICLR,2019,We propose a unified model compression framework for performing a variety of model compression techniques. +u846Bqhry_,Yh5_k5Am8UW,1601310000000.0,1614990000000.0,1796,Asynchronous Modeling: A Dual-phase Perspective for Long-Tailed Recognition,"[""~Hu_Zhang1"", ""~Linchao_Zhu1"", ""~Yi_Yang4""]","[""Hu Zhang"", ""Linchao Zhu"", ""Yi Yang""]","[""long-tailed classification"", ""gradient distortion"", ""asynchronous modeling""]","This work explores deep learning based classification model on real-world datasets with a long-tailed distribution. Most of previous works deal with the long-tailed classification problem by re-balancing the overall distribution within the whole dataset or directly transferring knowledge from data-rich classes to data-poor ones. In this work, we consider the gradient distortion in long-tailed classification when the gradient on data-rich classes and data-poor ones are incorporated simultaneously, i.e., shifted gradient direction towards data-rich classes as well as the enlarged variance by the gradient fluctuation on data-poor classes. Motivated by such phenomenon, we propose to disentangle the distinctive effects of data-rich and data-poor gradient and asynchronously train a model via a dual-phase learning process. The first phase only concerns the data-rich classes. 
In the second phase, besides the standard classification upon data-poor classes, we propose an exemplar memory bank to reserve representative examples and a memory-retentive loss via graph matching to retain the relation between two phases. The extensive experimental results on four commonly used long-tailed benchmarks including CIFAR100-LT, Places-LT, ImageNet-LT and iNaturalist 2018 highlight the excellent performance of our proposed method.",/pdf/55e989955f4dd3eb0abb91e8da4d991db1b4450b.pdf,ICLR,2021, +SJLy_SxC-,SJIkOBeRb,1509080000000.0,1518730000000.0,257,Log-DenseNet: How to Sparsify a DenseNet,"[""hanzhang@cs.cmu.edu"", ""dedey@microsoft.com"", ""adelgior@ri.cmu.edu"", ""hebert@ri.cmu.edu"", ""dbagnell@ri.cmu.edu""]","[""Hanzhang Hu"", ""Debadeepta Dey"", ""Allie Del Giorno"", ""Martial Hebert"", ""J. Andrew Bagnell""]","[""DenseNet"", ""sparse shortcut connections"", ""network architecture"", ""scene parsing"", ""image classification""]","Skip connections are increasingly utilized by deep neural networks to improve accuracy and cost-efficiency. In particular, the recent DenseNet is efficient in computation and parameters, and achieves state-of-the-art predictions by directly connecting each feature layer to all previous ones. However, DenseNet's extreme connectivity pattern may hinder its scalability to high depths, and in applications like fully convolutional networks, full DenseNet connections are prohibitively expensive. +This work first experimentally shows that one key advantage of skip connections is to have short distances among feature layers during backpropagation. Specifically, using a fixed number of skip connections, the connection patterns with shorter backpropagation distance among layers have more accurate predictions. 
Following this insight, we propose a connection template, Log-DenseNet, which, in comparison to DenseNet, only slightly increases the backpropagation distances among layers from 1 to ($1 + \log_2 L$), but uses only $L\log_2 L$ total connections instead of $O(L^2)$. Hence, Log-DenseNets are easier to scale than DenseNets, and no longer require careful GPU memory management. We demonstrate the effectiveness of our design principle by showing better performance than DenseNets on tabula rasa semantic segmentation, and competitive results on visual recognition.",/pdf/fd11ad29bc2400441cde591b9bbd502b8ef7c86b.pdf,ICLR,2018,"We show shortcut connections should be placed in patterns that minimize between-layer distances during backpropagation, and design networks that achieve log L distances using L log(L) connections." +SyxS0T4tvS,ByxavYZODS,1569440000000.0,1577170000000.0,855,RoBERTa: A Robustly Optimized BERT Pretraining Approach,"[""yinhanliu@fb.com"", ""myleott@fb.com"", ""namangoyal@instagram.com"", ""jingfeidu@fb.com"", ""mandar90@cs.washington.edu"", ""danqic@cs.princeton.edu"", ""omerlevy@gmail.com"", ""mikelewis@fb.com"", ""lsz@fb.com"", ""ves@fb.com""]","[""Yinhan Liu"", ""Myle Ott"", ""Naman Goyal"", ""Jingfei Du"", ""Mandar Joshi"", ""Danqi Chen"", ""Omer Levy"", ""Mike Lewis"", ""Luke Zettlemoyer"", ""Veselin Stoyanov""]","[""Deep learning"", ""language representation learning"", ""natural language understanding""]","Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. 
We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE, SQuAD, SuperGLUE and XNLI. These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements. We release our models and code.",/pdf/23ef66489352e1835915bb2c5e8dbc0a216e730c.pdf,ICLR,2020,We evaluate a number of design decisions when pretraining BERT models and propose an improved recipe that achieves state-of-the-art results on many natural language understanding tasks. +SJgaRA4FPH,rJldYqcODH,1569440000000.0,1583910000000.0,1448,"Generative Models for Effective ML on Private, Decentralized Datasets","[""saugenst@google.com"", ""mcmahan@google.com"", ""dramage@google.com"", ""swaroopram@google.com"", ""kairouz@google.com"", ""mingqing@google.com"", ""mathews@google.com"", ""blaisea@google.com""]","[""Sean Augenstein"", ""H. Brendan McMahan"", ""Daniel Ramage"", ""Swaroop Ramaswamy"", ""Peter Kairouz"", ""Mingqing Chen"", ""Rajiv Mathews"", ""Blaise Aguera y Arcas""]","[""generative models"", ""federated learning"", ""decentralized learning"", ""differential privacy"", ""privacy"", ""security"", ""GAN""]","To improve real-world applications of machine learning, experienced modelers develop intuition about their datasets, their models, and how the two interact. Manual inspection of raw data—of representative samples, of outliers, of misclassifications—is an essential tool in a) identifying and fixing problems in the data, b) generating new modeling hypotheses, +and c) assigning or refining human-provided labels. However, manual data inspection is risky for privacy-sensitive datasets, such as those representing the behavior of real-world individuals. 
Furthermore, manual data inspection is impossible in the increasingly important setting of federated learning, where raw examples are stored at the edge and the modeler may only access aggregated outputs such as metrics or model parameters. This paper demonstrates that generative models—trained using federated methods and with formal differential privacy guarantees—can be used effectively to debug data issues even +when the data cannot be directly inspected. We explore these methods in applications to text with differentially private federated RNNs and to images using a novel algorithm for differentially private federated GANs.",/pdf/ec8b5bcb57ec7ce967ce19be3e0aa3389526aca0.pdf,ICLR,2020,"Generative Models + Federated Learning + Differential Privacy gives data scientists a way to analyze private, decentralized data (e.g., on mobile devices) where direct inspection is prohibited." +ryxSrhC9KX,Bke09e6cKQ,1538090000000.0,1547060000000.0,1536,Revealing interpretable object representations from human behavior,"[""charles.zheng@nih.gov"", ""francisco.pereira@nih.gov"", ""bakerchris@mail.nih.gov"", ""martin.hebart@nih.gov""]","[""Charles Y. Zheng"", ""Francisco Pereira"", ""Chris I. Baker"", ""Martin N. Hebart""]","[""category representation"", ""sparse coding"", ""representation learning"", ""interpretable representations""]","To study how mental object representations are related to behavior, we estimated sparse, non-negative representations of objects using human behavioral judgments on images representative of 1,854 object categories. These representations predicted a latent similarity structure between objects, which captured most of the explainable variance in human behavioral judgments. Individual dimensions in the low-dimensional embedding were found to be highly reproducible and interpretable as conveying degrees of taxonomic membership, functionality, and perceptual attributes. 
We further demonstrated the predictive power of the embeddings for explaining other forms of human behavior, including categorization, typicality judgments, and feature ratings, suggesting that the dimensions reflect human conceptual representations of objects beyond the specific task.",/pdf/a46e0f965eaf30a542a1193a26ad70f9d036060f.pdf,ICLR,2019,Human behavioral judgments are used to obtain sparse and interpretable representations of objects that generalize to other tasks +rytstxWAW,SydjFxb0W,1509130000000.0,1518730000000.0,613,FastGCN: Fast Learning with Graph Convolutional Networks via Importance Sampling,"[""chenjie@us.ibm.com"", ""tengfei.ma1@ibm.com"", ""cxiao@us.ibm.com""]","[""Jie Chen"", ""Tengfei Ma"", ""Cao Xiao""]","[""Graph convolutional networks"", ""importance sampling""]","The graph convolutional networks (GCN) recently proposed by Kipf and Welling are an effective graph model for semi-supervised learning. Such a model, however, is transductive in nature because parameters are learned through convolutions with both training and test data. Moreover, the recursive neighborhood expansion across layers poses time and memory challenges for training with large, dense graphs. To relax the requirement of simultaneous availability of test data, we interpret graph convolutions as integral transforms of embedding functions under probability measures. Such an interpretation allows for the use of Monte Carlo approaches to consistently estimate the integrals, which in turn leads to a batched training scheme as we propose in this work---FastGCN. Enhanced with importance sampling, FastGCN not only is efficient for training but also generalizes well for inference. We show a comprehensive set of experiments to demonstrate its effectiveness compared with GCN and related models. In particular, training is orders of magnitude more efficient while predictions remain comparably accurate. 
+",/pdf/c5e296a072c3877ab807c0d4bf3a79cbfae5e224.pdf,ICLR,2018, +6zaTwpNSsQ2,QGR1YC6245P,1601310000000.0,1615900000000.0,3654,A Block Minifloat Representation for Training Deep Neural Networks,"[""~Sean_Fox2"", ""seyedramin.rasoulinezhad@sydney.edu.au"", ""~Julian_Faraone1"", ""david.boland@sydney.edu.au"", ""~Philip_Leong1""]","[""Sean Fox"", ""Seyedramin Rasoulinezhad"", ""Julian Faraone"", ""david boland"", ""Philip Leong""]",[],"Training Deep Neural Networks (DNN) with high efficiency can be difficult to achieve with native floating-point representations and commercially available hardware. Specialized arithmetic with custom acceleration offers perhaps the most promising alternative. Ongoing research is trending towards narrow floating-point representations, called minifloats, that pack more operations for a given silicon area and consume less power. In this paper, we introduce Block Minifloat (BM), a new spectrum of minifloat formats capable of training DNNs end-to-end with only 4-8 bit weight, activation and gradient tensors. While standard floating-point representations have two degrees of freedom, via the exponent and mantissa, BM exposes the exponent bias as an additional field for optimization. Crucially, this enables training with fewer exponent bits, yielding dense integer-like hardware for fused multiply-add (FMA) operations. For ResNet trained on ImageNet, 6-bit BM achieves almost no degradation in floating-point accuracy with FMA units that are $4.1\times(23.9\times)$ smaller and consume $2.3\times(16.1\times)$ less energy than FP8 (FP32). Furthermore, our 8-bit BM format matches floating-point accuracy while delivering a higher computational density and faster expected training times.",/pdf/7ed682ba5c220f98e96984a0b3bb08ed91c59ce7.pdf,ICLR,2021,"A new number representation, comparable to recently proposed 8-bit formats, for efficiently training a subset of DNN models." 
+BkluqlSFDS,BJlt0RgtwS,1569440000000.0,1583910000000.0,2476,Federated Learning with Matched Averaging,"[""hongyiwang@cs.wisc.edu"", ""mikhail.yurochkin@ibm.com"", ""yuekai@umich.edu"", ""dimitris@papail.io"", ""yasaman.khazaeni@us.ibm.com""]","[""Hongyi Wang"", ""Mikhail Yurochkin"", ""Yuekai Sun"", ""Dimitris Papailiopoulos"", ""Yasaman Khazaeni""]","[""federated learning""]","Federated learning allows edge devices to collaboratively learn a shared model while keeping the training data on device, decoupling the ability to do model training from the need to store the data in the cloud. We propose Federated matched averaging (FedMA) algorithm designed for federated learning of modern neural network architectures e.g. convolutional neural networks (CNNs) and LSTMs. FedMA constructs the shared global model in a layer-wise manner by matching and averaging hidden elements (i.e. channels for convolution layers; hidden states for LSTM; neurons for fully connected layers) with similar feature extraction signatures. Our experiments indicate that FedMA not only outperforms popular state-of-the-art federated learning algorithms on deep CNN and LSTM architectures trained on real world datasets, but also reduces the overall communication burden.",/pdf/6b9ef72b07bb3390dcf6145f41df02ceffbb916e.pdf,ICLR,2020,Communication efficient federated learning with layer-wise matching +Ske31kBtPr,HJe5KyouvS,1569440000000.0,1583910000000.0,1483,Mathematical Reasoning in Latent Space,"[""ldennis@google.com"", ""szegedy@google.com"", ""mrabe@google.com"", ""smoos@google.com"", ""kbk@google.com""]","[""Dennis Lee"", ""Christian Szegedy"", ""Markus Rabe"", ""Sarah Loos"", ""Kshitij Bansal""]","[""machine learning"", ""formal reasoning""]","We design and conduct a simple experiment to study whether neural networks can perform several steps of approximate reasoning in a fixed dimensional latent space. The set of rewrites (i.e. 
transformations) that can be successfully performed on a statement represents essential semantic features of the statement. We can compress this information by embedding the formula in a vector space, such that the vector associated with a statement can be used to predict whether a statement can be rewritten by other theorems. Predicting the embedding of a formula generated by some rewrite rule is naturally viewed as approximate reasoning in the latent space. In order to measure the effectiveness of this reasoning, we perform approximate deduction sequences in the latent space and use the resulting embedding to inform the semantic features of the corresponding formal statement (which is obtained by performing the corresponding rewrite sequence using real formulas). Our experiments show that graph neural networks can make non-trivial predictions about the rewrite-success of statements, even when they propagate predicted latent representations for several steps. Since our corpus of mathematical formulas includes a wide variety of mathematical disciplines, this experiment is a strong indicator for the feasibility of deduction in latent space in general.",/pdf/7f1f2bf386393e98651939a6b7fab8f4fdb99341.pdf,ICLR,2020,Learning to reason about higher order logic formulas in the latent space. +rkeeoeHYvr,Syl-I1ZKPB,1569440000000.0,1577170000000.0,2496,AdvCodec: Towards A Unified Framework for Adversarial Text Generation,"[""boxinw2@illinois.edu"", ""hzpei16@fudan.edu.cn"", ""hanliu@northwestern.edu"", ""lbo@illinois.edu""]","[""Boxin Wang"", ""Hengzhi Pei"", ""Han Liu"", ""Bo Li""]","[""adversarial text generation"", ""tree-autoencoder"", ""human evaluation""]","Machine learning (ML) especially deep neural networks (DNNs) have been widely applied to real-world applications. However, recent studies show that DNNs are vulnerable to carefully crafted \emph{adversarial examples} which only deviate from the original data by a small magnitude of perturbation. 
+While there has been great interest on generating imperceptible adversarial examples in continuous data domain (e.g. image and audio) to explore the model vulnerabilities, generating \emph{adversarial text} in the discrete domain is still challenging. +The main contribution of this paper is to propose a general targeted attack framework \advcodec for adversarial text generation which addresses the challenge of discrete input space and be easily adapted to general natural language processing (NLP) tasks. +In particular, we propose a tree based autoencoder to encode discrete text data into continuous vector space, upon which we optimize the adversarial perturbation. With the tree based decoder, it is possible to ensure the grammar correctness of the generated text; and the tree based encoder enables flexibility of making manipulations on different levels of text, such as sentence (\advcodecsent) and word (\advcodecword) levels. We consider multiple attacking scenarios, including appending an adversarial sentence or adding unnoticeable words to a given paragraph, to achieve arbitrary \emph{targeted attack}. To demonstrate the effectiveness of the proposed method, we consider two most representative NLP tasks: sentiment analysis and question answering (QA). Extensive experimental results show that \advcodec has successfully attacked both tasks. In particular, our attack causes a BERT-based sentiment classifier accuracy to drop from $0.703$ to $0.006$, and a BERT-based QA model's F1 score to drop from $88.62$ to $33.21$ (with best targeted attack F1 score as $46.54$). 
Furthermore, we show that the white-box generated adversarial texts can transfer across other black-box models, shedding light on an effective way to examine the robustness of existing NLP models.",/pdf/e2470d63c0d32be8fd17958f6bfcb1c526cadc98.pdf,ICLR,2020,"we propose a novel framework AdvCodec to generate adversarial text against general NLP tasks based on tree-autoencoder, and we show that AdvCodec outperforms other baselines and achieves high performance in human evaluation." +ZzwDy_wiWv,cdy4QE-dJQA,1601310000000.0,1616260000000.0,869,Knowledge distillation via softmax regression representation learning,"[""~Jing_Yang7"", ""~Brais_Martinez3"", ""~Adrian_Bulat1"", ""~Georgios_Tzimiropoulos1""]","[""Jing Yang"", ""Brais Martinez"", ""Adrian Bulat"", ""Georgios Tzimiropoulos""]",[],"This paper addresses the problem of model compression via knowledge distillation. We advocate for a method that optimizes the output feature of the penultimate layer of the student network and hence is directly related to representation learning. Previous distillation methods which typically impose direct feature matching between the student and the teacher do not take into account the classification problem at hand. On the contrary, our distillation method decouples representation learning and classification and utilizes the teacher's pre-trained classifier to train the student's penultimate layer feature. In particular, for the same input image, we wish the teacher's and student's feature to produce the same output when passed through the teacher's classifier which is achieved with a simple $L_2$ loss. Our method is extremely simple to implement and straightforward to train and is shown to consistently outperform previous state-of-the-art methods over a large set of experimental settings including different (a) network architectures, (b) teacher-student capacities, (c) datasets, and (d) domains. 
The code will be available at \url{https://github.com/jingyang2017/KD_SRRL}.",/pdf/94c700e0e7f29d564e416cc3b56b6b350c29aafb.pdf,ICLR,2021, +rJlwAa4YwS,r1xSUqbOwH,1569440000000.0,1577170000000.0,859,Lattice Representation Learning,"[""lastrasl@us.ibm.com""]","[""Luis A Lastras""]","[""lattices"", ""representation learning"", ""coding theory"", ""lossy source coding"", ""information theory""]","We introduce the notion of \emph{lattice representation learning}, in which the representation for some object of interest (e.g. a sentence or an image) is a lattice point in a Euclidean space. Our main contribution is a result for replacing an objective function which employs lattice quantization with an objective function in which quantization is absent, thus allowing optimization techniques based on gradient descent to apply; we call the resulting algorithms \emph{dithered stochastic gradient descent} algorithms as they are designed explicitly to allow for an optimization procedure where only local information is employed. We also argue that a technique commonly used in Variational Auto-Encoders (Gaussian priors and Gaussian approximate posteriors) is tightly connected with the idea of lattice representations, as the quantization error in good high dimensional lattices can be modeled as a Gaussian distribution. We use a traditional encoder/decoder architecture to explore the idea of lattice-valued representations, and provide experimental evidence of the potential of using lattice representations by modifying the \texttt{OpenNMT-py} generic \texttt{seq2seq} architecture so that it can implement not only Gaussian dithering of representations, but also the well known straight-through estimator and its application to vector quantization. +",/pdf/5144bcdd2035e322b3a20d8440aea63acaa7fa6e.pdf,ICLR,2020,We propose to use lattices to represent objects and prove a fundamental result on how to train networks that use them. 
+rJxG3pVKPB,Syg89TJ_wB,1569440000000.0,1577170000000.0,773,"Translation Between Waves, wave2wave","[""tsuyoshi.okita@gmail.com"", ""hirotaka.hachiya@riken.jp"", ""sozo.inoue@riken.jp"", ""naonori.ueda@riken.jp""]","[""Tsuyoshi Okita"", ""Hirotaka Hachiya"", ""Sozo Inoue"", ""Naonori Ueda""]","[""sequence to sequence model"", ""signal to signal"", ""deep learning"", ""RNN"", ""encoder-decoder model""]","The understanding of sensor data has been greatly improved by advanced deep learning methods with big data. However, available sensor data in the real world are still limited, which is called the opportunistic sensor problem. This paper proposes a new variant of neural machine translation seq2seq to deal with continuous signal waves by introducing the window-based (inverse-) representation to adaptively represent partial shapes of waves and the iterative back-translation model for high-dimensional data. Experimental results are shown for two real-life data: earthquake and activity translation. The performance improvement for one-dimensional data was about 46 % in test loss and that for high-dimensional data was about 1625 % in perplexity with regard to the original seq2seq. +",/pdf/c1cf16608e4afb5fc2dae7da5e2af9294ef47463.pdf,ICLR,2020, +rJeW1yHYwH,Hke13jq_PH,1569440000000.0,1583910000000.0,1457,Inductive representation learning on temporal graphs,"[""da.xu@walmartlabs.com"", ""ruanchuanwei@gmail.com"", ""ekorpeoglu@walmart.com"", ""skumar4@walmartlabs.com"", ""kachan@walmartlabs.com""]","[""da Xu"", ""chuanwei ruan"", ""evren korpeoglu"", ""sushant kumar"", ""kannan achan""]","[""temporal graph"", ""inductive representation learning"", ""functional time encoding"", ""self-attention""]","Inductive representation learning on temporal graphs is an important step toward scalable machine learning on real-world dynamic networks. The evolving nature of temporal dynamic graphs requires handling new nodes as well as capturing temporal patterns. 
The node embeddings, which are now functions of time, should represent both the static node features and the evolving topological structures. Moreover, node and topological features can be temporal as well, whose patterns the node embeddings should also capture. We propose the temporal graph attention (TGAT) layer to efficiently aggregate temporal-topological neighborhood features to learn the time-feature interactions. For TGAT, we use the self-attention mechanism as building block and develop a novel functional time encoding technique based on the classical Bochner's theorem from harmonic analysis. By stacking TGAT layers, the network recognizes the node embeddings as functions of time and is able to inductively infer embeddings for both new and observed nodes as the graph evolves. The proposed approach handles both node classification and link prediction task, and can be naturally extended to include the temporal edge features. We evaluate our method with transductive and inductive tasks under temporal settings with two benchmark and one industrial dataset. Our TGAT model compares favorably to state-of-the-art baselines as well as the previous temporal graph embedding approaches.",/pdf/35f0e0e0b42200e2c21ced03637e8b30f6e6b6fc.pdf,ICLR,2020, +BkgHWkrtPB,B1x8U9jdwr,1569440000000.0,1577170000000.0,1543,Where is the Information in a Deep Network?,"[""achille@cs.ucla.edu"", ""soatto@cs.ucla.edu""]","[""Alessandro Achille"", ""Stefano Soatto""]","[""Information"", ""Learning Dynamics"", ""PAC-Bayes"", ""Deep Learning""]","Whatever information a deep neural network has gleaned from past data is encoded in its weights. How this information affects the response of the network to future data is largely an open question. In fact, even how to define and measure information in a network entails some subtleties. 
We measure information in the weights of a deep neural network as the optimal trade-off between accuracy of the network and complexity of the weights relative to a prior. Depending on the prior, the definition reduces to known information measures such as Shannon Mutual Information and Fisher Information, but in general it affords added flexibility that enables us to relate it to generalization, via the PAC-Bayes bound, and to invariance. For the latter, we introduce a notion of effective information in the activations, which are deterministic functions of future inputs. We relate this to the Information in the Weights, and use this result to show that models of low (information) complexity not only generalize better, but are bound to learn invariant representations of future inputs. These relations hinge not only on the architecture of the model, but also on how it is trained.",/pdf/bd3b790baa96e955c781ff12b9360251fff62f83.pdf,ICLR,2020, +NsMLjcFaO8O,#NAME?,1601310000000.0,1616510000000.0,1311,WaveGrad: Estimating Gradients for Waveform Generation,"[""~Nanxin_Chen1"", ""~Yu_Zhang2"", ""~Heiga_Zen1"", ""~Ron_J_Weiss1"", ""~Mohammad_Norouzi1"", ""~William_Chan1""]","[""Nanxin Chen"", ""Yu Zhang"", ""Heiga Zen"", ""Ron J Weiss"", ""Mohammad Norouzi"", ""William Chan""]","[""vocoder"", ""diffusion"", ""score matching"", ""text-to-speech"", ""gradient estimation"", ""waveform generation""]","This paper introduces WaveGrad, a conditional model for waveform generation which estimates gradients of the data density. The model is built on prior work on score matching and diffusion probabilistic models. It starts from a Gaussian white noise signal and iteratively refines the signal via a gradient-based sampler conditioned on the mel-spectrogram. +WaveGrad offers a natural way to trade inference speed for sample quality by adjusting the number of refinement steps, and bridges the gap between non-autoregressive and autoregressive models in terms of audio quality. 
+We find that it can generate high fidelity audio samples using as few as six iterations. +Experiments reveal WaveGrad to generate high fidelity audio, outperforming adversarial non-autoregressive baselines and matching a strong likelihood-based autoregressive baseline using fewer sequential operations. Audio samples are available at https://wavegrad.github.io/.",/pdf/511dfd60f5250df72c5b324c7fb22552789093e1.pdf,ICLR,2021,"This paper introduces WaveGrad, a conditional model for waveform generation through estimating gradients of the data density." +BkevoJSYPB,ByxPoA0ODB,1569440000000.0,1583910000000.0,1919,Differentiation of Blackbox Combinatorial Solvers,"[""marin.vlastelica@tue.mpg.de"", ""anselm.paulus@tuebingen.mpg.de"", ""vejtek@atrey.karlin.mff.cuni.cz"", ""georg.martius@tuebingen.mpg.de"", ""michal.rolinek@tuebingen.mpg.de""]","[""Marin Vlastelica Pogan\u010di\u0107"", ""Anselm Paulus"", ""Vit Musil"", ""Georg Martius"", ""Michal Rolinek""]","[""combinatorial algorithms"", ""deep learning"", ""representation learning"", ""optimization""]","Achieving fusion of deep learning with combinatorial algorithms promises transformative changes to artificial intelligence. One possible approach is to introduce combinatorial building blocks into neural networks. Such end-to-end architectures have the potential to tackle combinatorial problems on raw input data such as ensuring global consistency in multi-object tracking or route planning on maps in robotics. In this work, we present a method that implements an efficient backward pass through blackbox implementations of combinatorial solvers with linear objective functions. We provide both theoretical and experimental backing. 
In particular, we incorporate the Gurobi MIP solver, Blossom V algorithm, and Dijkstra's algorithm into architectures that extract suitable features from raw inputs for the traveling salesman problem, the min-cost perfect matching problem and the shortest path problem.",/pdf/1d89c425702b17fe7ee5eae69086ccc14d7acc60.pdf,ICLR,2020," In this work, we present a method that implements an efficient backward pass through blackbox implementations of combinatorial solvers with linear objective functions." +yWkP7JuHX1,hPYUPqfLnBn,1601310000000.0,1615950000000.0,2477,Image GANs meet Differentiable Rendering for Inverse Graphics and Interpretable 3D Neural Rendering,"[""~Yuxuan_Zhang1"", ""~Wenzheng_Chen1"", ""~Huan_Ling1"", ""~Jun_Gao3"", ""~Yinan_Zhang2"", ""~Antonio_Torralba1"", ""~Sanja_Fidler1""]","[""Yuxuan Zhang"", ""Wenzheng Chen"", ""Huan Ling"", ""Jun Gao"", ""Yinan Zhang"", ""Antonio Torralba"", ""Sanja Fidler""]","[""Differentiable rendering"", ""inverse graphics"", ""GANs""]","Differentiable rendering has paved the way to training neural networks to perform “inverse graphics” tasks such as predicting 3D geometry from monocular photographs. To train high performing models, most of the current approaches rely on multi-view imagery which are not readily available in practice. Recent Generative Adversarial Networks (GANs) that synthesize images, in contrast, seem to acquire 3D knowledge implicitly during training: object viewpoints can be manipulated by simply manipulating the latent codes. However, these latent codes often lack further physical interpretation and thus GANs cannot easily be inverted to perform explicit 3D reasoning. In this paper, we aim to extract and disentangle 3D knowledge learned by generative models by utilizing differentiable renderers. 
Key to our approach is to exploit GANs as a multi-view data generator to train an inverse graphics network using an off-the-shelf differentiable renderer, and the trained inverse graphics network as a teacher to disentangle the GAN's latent code into interpretable 3D properties. The entire architecture is trained iteratively using cycle consistency losses. We show that our approach significantly outperforms state-of-the-art inverse graphics networks trained on existing datasets, both quantitatively and via user studies. We further showcase the disentangled GAN as a controllable 3D “neural renderer"", complementing traditional graphics renderers.",/pdf/ddea76d45b925517b5cf900df64b91dc3d44d918.pdf,ICLR,2021,We marry generative models with differentiable rendering to extract and disentangle 3D knowledge learned implicitly by generative image synthesis models +B1l9qsA5KQ,H1e788vOtX,1538090000000.0,1545360000000.0,559,Mental Fatigue Monitoring using Brain Dynamics Preferences,"[""yuangang.pan@student.uts.edu.au"", ""avinashsingh@outlook.com"", ""ivor.tsang@uts.edu.au"", ""chin-teng.lin@uts.edu.au""]","[""Yuangang Pan"", ""Avinash K Singh"", ""Ivor W. Tsang"", ""Chin-teng Lin""]","[""mental fatigue"", ""brain dynamics preference"", ""brain dynamics ranking"", ""channel reliability"", ""channel Selection""]","Driver's cognitive state of mental fatigue significantly affects driving performance and more importantly public safety. Previous studies leverage the response time (RT) as the metric for mental fatigue and aim at estimating the exact value of RT using electroencephalogram (EEG) signals within a regression model. However, due to the easily corrupted EEG signals and also non-smooth RTs during data collection, regular regression methods generally suffer from poor generalization performance. 
Considering that human response time is the reflection of brain dynamics preference rather than a single value, a novel model called Brain Dynamic ranking (BDrank) has been proposed. BDrank could learn from brain dynamics preferences using EEG data robustly and preserve the ordering corresponding to RTs. BDrank model is based on the regularized alternative ordinal classification compared to regular regression based practices. Furthermore, a transition matrix is introduced to characterize the reliability of each channel used in EEG data, which helps in learning brain dynamics preferences only from informative EEG channels. In order to handle large-scale EEG signals and obtain higher generalization, an online-generalized Expectation Maximum (OnlineGEM) algorithm also has been proposed to update BDrank in an online fashion. Comprehensive empirical analysis on EEG signals from 44 participants shows that BDrank together with OnlineGEM achieves substantial improvements in reliability while simultaneously detecting possible less informative and noisy EEG channels.",/pdf/9e31f37d81afabbe95666b63b1e598e46cf9cee2.pdf,ICLR,2019, +Uqu9yHvqlRf,4C9yeheVV8m,1601310000000.0,1614990000000.0,1081,What Preserves the Emergence of Language?,"[""~Ziluo_Ding1"", ""~Tiejun_Huang1"", ""~Zongqing_Lu2""]","[""Ziluo Ding"", ""Tiejun Huang"", ""Zongqing Lu""]","[""emergence of language"", ""reinforcement learning""]","The emergence of language is a mystery. One dominant theory is that cooperation boosts language to emerge. However, as a means of giving out information, language seems not to be an evolutionarily stable strategy. To ensure the survival advantage of many competitors, animals are selfish in nature. From the perspective of Darwinian, if an individual can obtain a higher benefit by deceiving the other party, why not deceive? For those who are cheated, once bitten and twice shy, cooperation will no longer be a good option. 
As a result, motivation for communication, as well as the emergence of language would perish. Then, what preserves the emergence of language? We aim to answer this question in a brand new framework of agent community, reinforcement learning, and natural selection. Empirically, we reveal that lying indeed dispels cooperation. Even with individual resistance to lying behaviors, liars can easily defeat truth tellers and survive during natural selection. However, social resistance eventually constrains lying and makes the emergence of language possible.",/pdf/786d024ac7892a332f304c9812d7f83f45edfae6.pdf,ICLR,2021, +rylwJxrYDS,Bkxa55ytPr,1569440000000.0,1583910000000.0,2066,vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations,"[""alexei.b@gmail.com"", ""stes@fb.com"", ""michael.auli@gmail.com""]","[""Alexei Baevski"", ""Steffen Schneider"", ""Michael Auli""]","[""speech recognition"", ""speech representation learning""]",We propose vq-wav2vec to learn discrete representations of audio segments through a wav2vec-style self-supervised context prediction task. The algorithm uses either a gumbel softmax or online k-means clustering to quantize the dense representations. Discretization enables the direct application of algorithms from the NLP community which require discrete inputs. Experiments show that BERT pre-training achieves a new state of the art on TIMIT phoneme classification and WSJ speech recognition.,/pdf/0a5ac6d85b01d047385eff9fc4507ef6fc067b1d.pdf,ICLR,2020,Learn how to quantize speech signal and apply algorithms requiring discrete inputs to audio data such as BERT. 
+5k8F6UU39V,odrzsGq-PRh,1601310000000.0,1615800000000.0,1115,Autoregressive Entity Retrieval,"[""~Nicola_De_Cao1"", ""~Gautier_Izacard1"", ""~Sebastian_Riedel1"", ""~Fabio_Petroni2""]","[""Nicola De Cao"", ""Gautier Izacard"", ""Sebastian Riedel"", ""Fabio Petroni""]","[""entity retrieval"", ""document retrieval"", ""autoregressive language model"", ""entity linking"", ""end-to-end entity linking"", ""entity disambiguation"", ""constrained beam search""]","Entities are at the center of how we represent and aggregate knowledge. For instance, Encyclopedias such as Wikipedia are structured by entities (e.g., one per Wikipedia article). The ability to retrieve such entities given a query is fundamental for knowledge-intensive tasks such as entity linking and open-domain question answering. One way to understand current approaches is as classifiers among atomic labels, one for each entity. Their weight vectors are dense entity representations produced by encoding entity meta information such as their descriptions. This approach leads to several shortcomings: (i) context and entity affinity is mainly captured through a vector dot product, potentially missing fine-grained interactions between the two; (ii) a large memory footprint is needed to store dense representations when considering large entity sets; (iii) an appropriately hard set of negative data has to be subsampled at training time. In this work, we propose GENRE, the first system that retrieves entities by generating their unique names, left to right, token-by-token in an autoregressive fashion and conditioned on the context. 
This enables us to mitigate the aforementioned technical issues since: (i) the autoregressive formulation allows us to directly capture relations between context and entity name, effectively cross encoding both; (ii) the memory footprint is greatly reduced because the parameters of our encoder-decoder architecture scale with vocabulary size, not entity count; (iii) the exact softmax loss can be efficiently computed without the need to subsample negative data. We show the efficacy of the approach, experimenting with more than 20 datasets on entity disambiguation, end-to-end entity linking and document retrieval tasks, achieving new state-of-the-art or very competitive results while using a tiny fraction of the memory footprint of competing systems. Finally, we demonstrate that new entities can be added by simply specifying their unambiguous name. Code and pre-trained models at https://github.com/facebookresearch/GENRE.",/pdf/921ba67c80871fda61a4c0cf8f889b1c381a2a78.pdf,ICLR,2021,"We address entity retrieval by generating their unique name identifiers, left to right, in an autoregressive fashion, and conditioned on the context showing SOTA results in more than 20 datasets with a tiny fraction of the memory of recent systems." +ByxCrerKvS,SkxEVYeFwB,1569440000000.0,1577170000000.0,2305,Set Functions for Time Series,"[""max.horn@bsse.ethz.ch"", ""michael.moor@bsse.ethz.ch"", ""christian.bock@bsse.ethz.ch"", ""bastian.rieck@bsse.ethz.ch"", ""karsten.borgwardt@bsse.ethz.ch""]","[""Max Horn"", ""Michael Moor"", ""Christian Bock"", ""Bastian Rieck"", ""Karsten Borgwardt""]","[""Time Series"", ""Set functions"", ""Irregularly sampling"", ""Medical Time series"", ""Dynamical Systems"", ""Time series classification""]","Despite the eminent successes of deep neural networks, many architectures are often hard to transfer to irregularly-sampled and asynchronous time series that occur in many real-world datasets, such as healthcare applications. 
This paper proposes a novel framework for classifying irregularly sampled time series with unaligned measurements, focusing on high scalability and data efficiency. +Our method SeFT (Set Functions for Time Series) is based on recent advances in differentiable set function learning, extremely parallelizable, and scales well to very large datasets and online monitoring scenarios. +We extensively compare our method to competitors on multiple healthcare time series datasets and show that it performs competitively whilst significantly reducing runtime.",/pdf/a6ef0768e668742dfeb1e5f46eb0b1f1c2110310.pdf,ICLR,2020,We propose a novel method for the scalable and interpretable classification of irregularly sampled time series. +_IM-AfFhna9,ayVB9wD4dU,1601310000000.0,1615960000000.0,3595,Generalized Variational Continual Learning,"[""~Noel_Loo1"", ""~Siddharth_Swaroop2"", ""~Richard_E_Turner1""]","[""Noel Loo"", ""Siddharth Swaroop"", ""Richard E Turner""]",[],"Continual learning deals with training models on new tasks and datasets in an online fashion. One strand of research has used probabilistic regularization for continual learning, with two of the main approaches in this vein being Online Elastic Weight Consolidation (Online EWC) and Variational Continual Learning (VCL). VCL employs variational inference, which in other settings has been improved empirically by applying likelihood-tempering. We show that applying this modification to VCL recovers Online EWC as a limiting case, allowing for interpolation between the two approaches. We term the general algorithm Generalized VCL (GVCL). In order to mitigate the observed overpruning effect of VI, we take inspiration from a common multi-task architecture, neural networks with task-specific FiLM layers, and find that this addition leads to significant performance gains, specifically for variational methods. In the small-data regime, GVCL strongly outperforms existing baselines. 
In larger datasets, GVCL with FiLM layers outperforms or is competitive with existing baselines in terms of accuracy, whilst also providing significantly better calibration.",/pdf/75e7423995a4eb4d239591596556bd1ac05f5e63.pdf,ICLR,2021,We generalize VCL and Online-EWC and combine with task-specific FiLM layers +ryex8CEKPr,HJx_6aIOwB,1569440000000.0,1577170000000.0,1125,Knockoff-Inspired Feature Selection via Generative Models,"[""mduarte@ecs.umass.edu"", ""siwei@umass.edu""]","[""Marco F. Duarte"", ""Siwei Feng""]","[""feature selection"", ""variable selection"", ""knockoff variables"", ""supervised learning""]","We propose a feature selection algorithm for supervised learning inspired by the recently introduced +knockoff framework for variable selection in statistical regression. While variable selection in statistics aims +to distinguish between true and false predictors, feature selection in machine learning aims to reduce the +dimensionality of the data while preserving the performance of the learning method. The knockoff framework +has attracted significant interest due to its strong control of false discoveries while preserving predictive +power. In contrast to the original approach and later variants that assume a given probabilistic model for the +variables, our proposed approach relies on data-driven generative models that learn mappings from data +space to a parametric space that characterizes the probability distribution of the data. Our approach +requires only the availability of mappings from data space to a distribution in parametric space and from +parametric space to a distribution in data space; thus, it can be integrated with multiple popular generative +models from machine learning. We provide example knockoff designs using a variational autoencoder and +a Gaussian process latent variable model. 
We also propose a knockoff score metric for a softmax classifier +that accounts for the contribution of each feature and its knockoff during supervised learning. Experimental +results with multiple benchmark datasets for feature selection showcase the advantages of our knockoff +designs and the knockoff framework with respect to existing approaches.",/pdf/0a4d33f04ee56e8abf99ca4f5d103880dbd7313e.pdf,ICLR,2020,We propose a feature selection algorithm for supervised learning inspired by the recently introduced knockoff framework for variable selection in statistical regression. +SyxhVkrYvr,BygWG1TdwS,1569440000000.0,1583910000000.0,1670,Towards Verified Robustness under Text Deletion Interventions,"[""johannes.welbl.14@ucl.ac.uk"", ""posenhuang@google.com"", ""stanforth@google.com"", ""sgowal@google.com"", ""dvij@google.com"", ""szummer@google.com"", ""pushmeet@google.com""]","[""Johannes Welbl"", ""Po-Sen Huang"", ""Robert Stanforth"", ""Sven Gowal"", ""Krishnamurthy (Dj) Dvijotham"", ""Martin Szummer"", ""Pushmeet Kohli""]","[""natural language processing"", ""specification"", ""verification"", ""model undersensitivity"", ""adversarial"", ""interval bound propagation""]","Neural networks are widely used in Natural Language Processing, yet despite their empirical successes, their behaviour is brittle: they are both over-sensitive to small input changes, and under-sensitive to deletions of large fractions of input text. This paper aims to tackle under-sensitivity in the context of natural language inference by ensuring that models do not become more confident in their predictions as arbitrary subsets of words from the input text are deleted. We develop a novel technique for formal verification of this specification for models based on the popular decomposable attention mechanism by employing the efficient yet effective interval bound propagation (IBP) approach. 
Using this method we can efficiently prove, given a model, whether a particular sample is free from the under-sensitivity problem. We compare different training methods to address under-sensitivity, and compare metrics to measure it. In our experiments on the SNLI and MNLI datasets, we observe that IBP training leads to a significantly improved verified accuracy. On the SNLI test set, we can verify 18.4% of samples, a substantial improvement over only 2.8% using standard training.",/pdf/60f5a5cd607e69be77c5ea713d4dd249c10e9b4a.pdf,ICLR,2020,Formal verification of a specification on a model's prediction undersensitivity using Interval Bound Propagation +ByBAl2eAZ,SJB0xhgCZ,1509110000000.0,1518730000000.0,379,Parameter Space Noise for Exploration,"[""matthiasplappert@me.com"", ""rein.houthooft@openai.com"", ""prafulla@openai.com"", ""szymon@openai.com"", ""richardchen@openai.com"", ""peter@openai.com"", ""asfour@kit.edu"", ""pabbeel@cs.berkeley.edu"", ""marcin@openai.com""]","[""Matthias Plappert"", ""Rein Houthooft"", ""Prafulla Dhariwal"", ""Szymon Sidor"", ""Richard Y. Chen"", ""Xi Chen"", ""Tamim Asfour"", ""Pieter Abbeel"", ""Marcin Andrychowicz""]","[""reinforcement learning"", ""exploration"", ""parameter noise""]","Deep reinforcement learning (RL) methods generally engage in exploratory behavior through noise injection in the action space. An alternative is to add noise directly to the agent's parameters, which can lead to more consistent exploration and a richer set of behaviors. Methods such as evolutionary strategies use parameter perturbations, but discard all temporal structure in the process and require significantly more samples. Combining parameter noise with traditional RL methods allows to combine the best of both worlds. 
We demonstrate that both off- and on-policy methods benefit from this approach through experimental comparison of DQN, DDPG, and TRPO on high-dimensional discrete action environments as well as continuous control tasks.",/pdf/1458cb5fd30844c3f193980654e0ee1915899975.pdf,ICLR,2018,"Parameter space noise allows reinforcement learning algorithms to explore by perturbing parameters instead of actions, often leading to significantly improved exploration performance." +BkeYSlrYwH,rkxZa_gtvS,1569440000000.0,1577170000000.0,2294,Collaborative Inter-agent Knowledge Distillation for Reinforcement Learning,"[""williamd4112@gapp.nthu.edu.tw"", ""prabhat@preferred.jp"", ""gjmaeda@preferred.jp""]","[""Zhang-Wei Hong"", ""Prabhat Nagarajan"", ""Guilherme Maeda""]","[""Reinforcement learning"", ""distillation""]","Reinforcement Learning (RL) has demonstrated promising results across several sequential decision-making tasks. However, reinforcement learning struggles to learn efficiently, thus limiting its pervasive application to several challenging problems. A typical RL agent learns solely from its own trial-and-error experiences, requiring many experiences to learn a successful policy. To alleviate this problem, we propose collaborative inter-agent knowledge distillation (CIKD). CIKD is a learning framework that uses an ensemble of RL agents to execute different policies in the environment while sharing knowledge amongst agents in the ensemble. Our experiments demonstrate that CIKD improves upon state-of-the-art RL methods in sample efficiency and performance on several challenging MuJoCo benchmark tasks. Additionally, we present an in-depth investigation on how CIKD leads to performance improvements. 
+",/pdf/70cb22581b3971586edcfe24cd1ddd308c0be0e3.pdf,ICLR,2020, +p5uylG94S68,mQFrCYJfAKg,1601310000000.0,1616030000000.0,3299,Model-based micro-data reinforcement learning: what are the crucial model properties and which model to choose?,"[""~Bal\u00e1zs_K\u00e9gl2"", ""gabriel.j.hurtado@gmail.com"", ""~Albert_Thomas1""]","[""Bal\u00e1zs K\u00e9gl"", ""Gabriel Hurtado"", ""Albert Thomas""]","[""model-based reinforcement learning"", ""generative models"", ""mixture density nets"", ""dynamic systems"", ""heteroscedasticity""]","We contribute to micro-data model-based reinforcement learning (MBRL) by rigorously comparing popular generative models using a fixed (random shooting) control agent. We find that on an environment that requires multimodal posterior predictives, mixture density nets outperform all other models by a large margin. When multimodality is not required, our surprising finding is that we do not need probabilistic posterior predictives: deterministic models are on par, in fact they consistently (although non-significantly) outperform their probabilistic counterparts. We also found that heteroscedasticity at training time, perhaps acting as a regularizer, improves predictions at longer horizons. At the methodological side, we design metrics and an experimental protocol which can be used to evaluate the various models, predicting their asymptotic performance when using them on the control problem. Using this framework, we improve the state-of-the-art sample complexity of MBRL on Acrobot by two to four folds, using an aggressive training schedule which is outside of the hyperparameter interval usually considered.",/pdf/04313ea0678f51bf6e97525219f5b92003b041b9.pdf,ICLR,2021,Crucial model properties for model-based reinforcement learning: multi-modal posterior predictives and heteroscedasticity. 
+HJlISCEKvB,B1gCl3IODH,1569440000000.0,1577170000000.0,1111,Improving Multi-Manifold GANs with a Learned Noise Prior,"[""matthew.amodio@yale.edu"", ""smita.krishnaswamy@yale.edu""]","[""Matthew Amodio"", ""Smita Krishnaswamy""]","[""GAN"", ""generative adversarial network"", ""ensemble""]","Generative adversarial networks (GANs) learn to map samples from a noise distribution to a chosen data distribution. Recent work has demonstrated that GANs are consequently sensitive to, and limited by, the shape of the noise distribution. For example, a single generator struggles to map continuous noise (e.g. a uniform distribution) to discontinuous output (e.g. separate Gaussians) or complex output (e.g. intersecting parabolas). We address this problem by learning to generate from multiple models such that the generator's output is actually the combination of several distinct networks. We contribute a novel formulation of multi-generator models where we learn a prior over the generators conditioned on the noise, parameterized by a neural network. Thus, this network not only learns the optimal rate to sample from each generator but also optimally shapes the noise received by each generator. The resulting Noise Prior GAN (NPGAN) achieves expressivity and flexibility that surpasses both single generator models and previous multi-generator models.",/pdf/effaae05b6f991786afda64089c03eaf1ef75010.pdf,ICLR,2020,A multi-generator GAN framework with an additional network to learn a prior over the input noise. +SkgKzh0cY7,Hyxm3vaqKQ,1538090000000.0,1545360000000.0,1283,Unsupervised Video-to-Video Translation,"[""dbash@bu.edu"", ""usmn@bu.edu"", ""saenko@bu.edu""]","[""Dina Bashkirova"", ""Ben Usman"", ""Kate Saenko""]","[""Generative Adversarial Networks"", ""Computer Vision"", ""Deep Learning""]","Unsupervised image-to-image translation is a recently proposed task of translating an image to a different style or domain given only unpaired image examples at training time. 
In this paper, we formulate a new task of unsupervised video-to-video translation, which poses its own unique challenges. Translating video implies learning not only the appearance of objects and scenes but also realistic motion and transitions between consecutive frames. We investigate the performance of per-frame video-to-video translation using existing image-to-image translation networks, and propose a spatio-temporal 3D translator as an alternative solution to this problem. We evaluate our 3D method on multiple synthetic datasets, such as moving colorized digits, as well as the realistic segmentation-to-video GTA dataset and a new CT-to-MRI volumetric images translation dataset. Our results show that frame-wise translation produces realistic results on a single frame level but underperforms significantly on the scale of the whole video compared to our three-dimensional translation approach, which is better able to learn the complex structure of video and motion and continuity of object appearance. ",/pdf/09123844b56e4f98240b819d1ba3c83b8cd7a6f2.pdf,ICLR,2019,"Proposed new task, datasets and baselines; 3D Conv CycleGAN preserves object properties across frames; batch structure in frame-level methods matters." +SkBcLugC-,rk49I_l0W,1509100000000.0,1518730000000.0,309,Fast and Accurate Inference with Adaptive Ensemble Prediction for Deep Networks,"[""inouehrs@jp.ibm.com""]","[""Hiroshi Inoue""]","[""ensemble"", ""confidence level""]","Ensembling multiple predictions is a widely-used technique to improve the accuracy of various machine learning tasks. In image classification tasks, for example, averaging the predictions for multiple patches extracted from the input image significantly improves accuracy. Using multiple networks trained independently to make predictions improves accuracy further. 
One obvious drawback of the ensembling technique is its higher execution cost during inference. If we average 100 local predictions, the execution cost will be 100 times as high as the cost without the ensemble. This higher cost limits the real-world use of ensembling. In this paper, we first describe our insights on relationship between the probability of the prediction and the effect of ensembling with current deep neural networks; ensembling does not help mispredictions for inputs predicted with a high probability, i.e. the output from the softmax. This finding motivates us to develop a new technique called adaptive ensemble prediction, which achieves the benefits of ensembling with much smaller additional execution costs. Hence, we calculate the confidence level of the prediction for each input from the probabilities of the local predictions during the ensembling computation. If the prediction for an input reaches a high enough probability on the basis of the confidence level, we stop ensembling for this input to avoid wasting computation power. We evaluated the adaptive ensembling by using various datasets and showed that it reduces the computation cost significantly while achieving similar accuracy to the naive ensembling. We also showed that our statistically rigorous confidence-level-based termination condition reduces the burden of the task-dependent parameter tuning compared to the naive termination based on the pre-defined threshold in addition to yielding a better accuracy with the same cost. 
+",/pdf/bcabcccd7c98a824640e4ee80fdf25f582d9a928.pdf,ICLR,2018, +H1zJ-v5xl,,1478290000000.0,1488570000000.0,355,Quasi-Recurrent Neural Networks,"[""james.bradbury@salesforce.com"", ""smerity@salesforce.com"", ""cxiong@salesforce.com"", ""rsocher@salesforce.com""]","[""James Bradbury"", ""Stephen Merity"", ""Caiming Xiong"", ""Richard Socher""]","[""Natural language processing"", ""Deep learning""]","Recurrent neural networks are a powerful tool for modeling sequential data, but the dependence of each timestep’s computation on the previous timestep’s output limits parallelism and makes RNNs unwieldy for very long sequences. We introduce quasi-recurrent neural networks (QRNNs), an approach to neural sequence modeling that alternates convolutional layers, which apply in parallel across timesteps, and a minimalist recurrent pooling function that applies in parallel across channels. Despite lacking trainable recurrent layers, stacked QRNNs have better predictive accuracy than stacked LSTMs of the same hidden size. Due to their increased parallelism, they are up to 16 times faster at train and test time. Experiments on language modeling, sentiment classification, and character-level neural machine translation demonstrate these advantages and underline the viability of QRNNs as a basic building block for a variety of sequence tasks.",/pdf/a69b67683af8de0cd62aae10bea94aaf3e2fa42b.pdf,ICLR,2017,"QRNNs, composed of convolutions and a recurrent pooling function, outperform LSTMs on a variety of sequence tasks and are up to 16 times faster." +ByzvHagA-,BJfvragRW,1509120000000.0,1518730000000.0,416,Disentangled activations in deep networks,"[""kageback@chalmers.se"", ""olof@mogren.one""]","[""Mikael K\u00e5geb\u00e4ck"", ""Olof Mogren""]","[""representation learning"", ""disentanglement"", ""regularization""]","Deep neural networks have been tremendously successful in a number of tasks. 
+One of the main reasons for this is their capability to automatically +learn representations of data in levels of abstraction, +increasingly disentangling the data as the internal transformations are applied. +In this paper we propose a novel regularization method that penalize covariance between dimensions of the hidden layers in a network, something that benefits the disentanglement. +This makes the network learn nonlinear representations that are linearly uncorrelated, yet allows the model to obtain good results on a number of tasks, as demonstrated by our experimental evaluation. +The proposed technique can be used to find the dimensionality of the underlying data, because it effectively disables dimensions that aren't needed. +Our approach is simple and computationally cheap, as it can be applied as a regularizer to any gradient-based learning model.",/pdf/1268ebf35db399a5cab1b4f19fefae16d8f1cd06.pdf,ICLR,2018,We propose a novel regularization method that penalize covariance between dimensions of the hidden layers in a network. +6puCSjH3hwA,mhuQjKePbls,1601310000000.0,1615960000000.0,751,A Good Image Generator Is What You Need for High-Resolution Video Synthesis,"[""~Yu_Tian2"", ""~Jian_Ren2"", ""~Menglei_Chai1"", ""~Kyle_Olszewski1"", ""~Xi_Peng1"", ""~Dimitris_N._Metaxas1"", ""~Sergey_Tulyakov1""]","[""Yu Tian"", ""Jian Ren"", ""Menglei Chai"", ""Kyle Olszewski"", ""Xi Peng"", ""Dimitris N. Metaxas"", ""Sergey Tulyakov""]","[""high-resolution video generation"", ""contrastive learning"", ""cross-domain video generation""]","Image and video synthesis are closely related areas aiming at generating content from noise. While rapid progress has been demonstrated in improving image-based models to handle large resolutions, high-quality renderings, and wide variations in image content, achieving comparable video generation results remains problematic. We present a framework that leverages contemporary image generators to render high-resolution videos. 
We frame the video synthesis problem as discovering a trajectory in the latent space of a pre-trained and fixed image generator. Not only does such a framework render high-resolution videos, but it also is an order of magnitude more computationally efficient. We introduce a motion generator that discovers the desired trajectory, in which content and motion are disentangled. With such a representation, our framework allows for a broad range of applications, including content and motion manipulation. Furthermore, we introduce a new task, which we call cross-domain video synthesis, in which the image and motion generators are trained on disjoint datasets belonging to different domains. This allows for generating moving objects for which the desired video data is not available. Extensive experiments on various datasets demonstrate the advantages of our methods over existing video generation techniques. Code will be released at https://github.com/snap-research/MoCoGAN-HD.",/pdf/08bf1c319723defae9a4e04ca258811da08d2ed3.pdf,ICLR,2021,Reuse a pre-trained image generator for high-resolution video synthesis +Hye_V0NKwr,ryliyrUuDB,1569440000000.0,1583910000000.0,1080,Locality and Compositionality in Zero-Shot Learning,"[""tristan.sylvain@gmail.com"", ""lindapetrini@gmail.com"", ""devon.hjelm@microsoft.com""]","[""Tristan Sylvain"", ""Linda Petrini"", ""Devon Hjelm""]","[""Zero-shot learning"", ""Compositionality"", ""Locality"", ""Deep Learning""]","In this work we study locality and compositionality in the context of learning representations for Zero Shot Learning (ZSL). +In order to well-isolate the importance of these properties in learned representations, we impose the additional constraint that, differently from most recent work in ZSL, no pre-training on different datasets (e.g. ImageNet) is performed. +The results of our experiment show how locality, in terms of small parts of the input, and compositionality, i.e. 
how well can the learned representations be expressed as a function of a smaller vocabulary, are both deeply related to generalization and motivate the focus on more local-aware models in future research directions for representation learning.",/pdf/ae21cd01627929186c1673ccc2f9b1c9cc1dfc2e.pdf,ICLR,2020,An analysis of the effects of compositionality and locality on representation learning for zero-shot learning. +ryeoxnRqKQ,HJgxQic9F7,1538090000000.0,1545360000000.0,1106,NATTACK: A STRONG AND UNIVERSAL GAUSSIAN BLACK-BOX ADVERSARIAL ATTACK,"[""lyndon.leeseu@outlook.com"", ""lilijun1990@buaa.edu.cn"", ""lwang@cs.ucf.edu"", ""bradymzhang@tencent.com"", ""boqinggo@outlook.com""]","[""Yandong Li"", ""Lijun Li"", ""Liqiang Wang"", ""Tong Zhang"", ""Boqing Gong""]","[""adversarial attack"", ""black-box"", ""evolutional strategy"", ""policy gradient""]","Recent works find that DNNs are vulnerable to adversarial examples, whose changes from the benign ones are imperceptible and yet lead DNNs to make wrong predictions. One can find various adversarial examples for the same input to a DNN using different attack methods. In other words, there is a population of adversarial examples, instead of only one, for any input to a DNN. By explicitly modeling this adversarial population with a Gaussian distribution, we propose a new black-box attack called NATTACK. The adversarial attack is hence formalized as an optimization problem, which searches the mean of the Gaussian under the guidance of increasing the target DNN's prediction error. NATTACK achieves 100% attack success rate on six out of eleven recently published defense methods (and greater than 90% for four), all using the same algorithm. Such results are on par with or better than powerful state-of-the-art white-box attacks. While the white-box attacks are often model-specific or defense-specific, the proposed black-box NATTACK is universally applicable to different defenses. 
",/pdf/6510e82cd870e51ee93428137d9f03a08f786426.pdf,ICLR,2019, +SJa9iHgAZ,Hy2qjBeRb,1509080000000.0,1519660000000.0,262,Residual Connections Encourage Iterative Inference,"[""staszek.jastrzebski@gmail.com"", ""devansharpit@gmail.com"", ""ballas.n@gmail.com"", ""vikasverma.iitm@gmail.com"", ""tongcheprivate@gmail.com"", ""yoshua.umontreal@gmail.com""]","[""Stanis\u0142aw Jastrzebski"", ""Devansh Arpit"", ""Nicolas Ballas"", ""Vikas Verma"", ""Tong Che"", ""Yoshua Bengio""]","[""residual network"", ""iterative inference"", ""deep learning""]","Residual networks (Resnets) have become a prominent architecture in deep learning. However, a comprehensive understanding of Resnets is still a topic of ongoing research. A recent view argues that Resnets perform iterative refinement of features. We attempt to further expose properties of this aspect. To this end, we study Resnets both analytically and empirically. We formalize the notion of iterative refinement in Resnets by showing that residual architectures naturally encourage features to move along the negative gradient of loss during the feedforward phase. In addition, our empirical analysis suggests that Resnets are able to perform both representation learning and iterative refinement. In general, a Resnet block tends to concentrate representation learning behavior in the first few layers while higher layers perform iterative refinement of features. 
Finally we observe that sharing residual layers naively leads to representation explosion and hurts generalization performance, and show that simple existing strategies can help alleviating this problem.",/pdf/0d7f7063b4ed063a4388aae42dea8a428413c092.pdf,ICLR,2018,Residual connections really perform iterative inference +GXJPLbB5P-y,il3jO9nYji,1601310000000.0,1614990000000.0,1883,Simplifying Models with Unlabeled Output Data,"[""~Sang_Michael_Xie1"", ""~Tengyu_Ma1"", ""~Percy_Liang1""]","[""Sang Michael Xie"", ""Tengyu Ma"", ""Percy Liang""]","[""semi-supervised learning"", ""structured prediction""]","We focus on prediction problems with high-dimensional outputs that are subject to output validity constraints, e.g. a pseudocode-to-code translation task where the code must compile. For these problems, labeled input-output pairs are expensive to obtain, but ""unlabeled"" outputs, i.e. outputs without corresponding inputs, are freely available and provide information about output validity (e.g. code on GitHub). In this paper, we present predict-and-denoise, a framework that can leverage unlabeled outputs. Specifically, we first train a denoiser to map possibly invalid outputs to valid outputs using synthetic perturbations of the unlabeled outputs. Second, we train a predictor composed with this fixed denoiser. We show theoretically that for a family of functions with a high-dimensional discrete valid output space, composing with a denoiser reduces the complexity of a 2-layer ReLU network needed to represent the function and that this complexity gap can be arbitrarily large. We evaluate the framework empirically on several datasets, including image generation from attributes and pseudocode-to-code translation. 
On the SPoC pseudocode-to-code dataset, our framework improves the proportion of code outputs that pass all test cases by 3-5% over a baseline Transformer.",/pdf/842da7f2047bd52445ca8427ce84960f15d12a08.pdf,ICLR,2021,"Composing a model with a denoiser learned on unlabeled output examples can offload the complexity of learning complex output structure and invariances onto the denoiser, improving generalization in structured prediction problems." +HJgcvJBFvB,B1gdtTpdDr,1569440000000.0,1583910000000.0,1778,Network Randomization: A Simple Technique for Generalization in Deep Reinforcement Learning,"[""kiminlee@kaist.ac.kr"", ""kibok@umich.edu"", ""jinwoos@kaist.ac.kr"", ""honglak@eecs.umich.edu""]","[""Kimin Lee"", ""Kibok Lee"", ""Jinwoo Shin"", ""Honglak Lee""]","[""Deep reinforcement learning"", ""Generalization in visual domains""]","Deep reinforcement learning (RL) agents often fail to generalize to unseen environments (yet semantically similar to trained agents), particularly when they are trained on high-dimensional state spaces, such as images. In this paper, we propose a simple technique to improve a generalization ability of deep RL agents by introducing a randomized (convolutional) neural network that randomly perturbs input observations. It enables trained agents to adapt to new domains by learning robust features invariant across varied and randomized environments. Furthermore, we consider an inference method based on the Monte Carlo approximation to reduce the variance induced by this randomization. We demonstrate the superiority of our method across 2D CoinRun, 3D DeepMind Lab exploration and 3D robotics control tasks: it significantly outperforms various regularization and data augmentation methods for the same purpose.",/pdf/e6b7edd03b697541e4457dfbba7ed1e361cc329d.pdf,ICLR,2020,We propose a simple randomization technique for improving generalization in deep reinforcement learning across tasks with various unseen visual patterns. 
+r-gPPHEjpmw,XqQ9Ss3D0R,1601310000000.0,1615420000000.0,548,Hierarchical Reinforcement Learning by Discovering Intrinsic Options,"[""~Jesse_Zhang3"", ""~Haonan_Yu5"", ""~Wei_Xu13""]","[""Jesse Zhang"", ""Haonan Yu"", ""Wei Xu""]","[""hierarchical reinforcement learning"", ""reinforcement learning"", ""options"", ""unsupervised skill discovery"", ""exploration""]","We propose a hierarchical reinforcement learning method, HIDIO, that can learn task-agnostic options in a self-supervised manner while jointly learning to utilize them to solve sparse-reward tasks. Unlike current hierarchical RL approaches that tend to formulate goal-reaching low-level tasks or pre-define ad hoc lower-level policies, HIDIO encourages lower-level option learning that is independent of the task at hand, requiring few assumptions or little knowledge about the task structure. These options are learned through an intrinsic entropy minimization objective conditioned on the option sub-trajectories. The learned options are diverse and task-agnostic. In experiments on sparse-reward robotic manipulation and navigation tasks, HIDIO achieves higher success rates with greater sample efficiency than regular RL baselines and two state-of-the-art hierarchical RL methods. Code at: https://github.com/jesbu1/hidio.",/pdf/8ab82acd2672b63eb1d694fcb5fc26a32c2f6d74.pdf,ICLR,2021,Hierarchical RL that discovers short-horizon task-agnostic options to perform well on sparse reward manipulation and navigation tasks. 
+Jr8XGtK04Pw,Lfj1vjd6f0b,1601310000000.0,1614990000000.0,3712,Hippocampal representations emerge when training recurrent neural networks on a memory dependent maze navigation task,"[""~Justin_Jude1"", ""m.hennig@ed.ac.uk""]","[""Justin Jude"", ""Matthias Hennig""]","[""recurrent neural network"", ""place cell"", ""hippocampus"", ""neural dynamics""]","Can neural networks learn goal-directed behaviour using similar strategies to the brain, by combining the relationships between the current state of the organism and the consequences of future actions? Recent work has shown that recurrent neural networks trained on goal based tasks can develop representations resembling those found in the brain, entorhinal cortex grid cells, for instance. Here we explore the evolution of the dynamics of their internal representations and compare this with experimental data. We observe that once a recurrent network is trained to learn the structure of its environment solely based on sensory prediction, an attractor based landscape forms in the network's representation, which parallels hippocampal place cells in structure and function. Next, we extend the predictive objective to include Q-learning for a reward task, where rewarding actions are dependent on delayed cue modulation. Mirroring experimental findings in hippocampus recordings in rodents performing the same task, this training paradigm causes nonlocal neural activity to sweep forward in space at decision points, anticipating the future path to a rewarded location. Moreover, prevalent choice and cue-selective neurons form in this network, again recapitulating experimental findings. 
Together, these results indicate that combining predictive, unsupervised learning of the structure of an environment with reinforcement learning can help understand the formation of hippocampus-like representations containing both spatial and task-relevant information.",/pdf/c0b1f19618baa14026ed1976b87d36ba9b80e383.pdf,ICLR,2021,Recurrent neural networks trained on a combined predictive and maze navigation task produce goal directed behaviour and neural dynamics reported experimentally in the hippocampus. +Bkfwyw5xg,,1478290000000.0,1485370000000.0,348,Investigating Different Context Types and Representations for Learning Word Embeddings,"[""libofang@ruc.edu.cn"", ""tliu@ruc.edu.cn"", ""helloworld@ruc.edu.cn"", ""tangbuzhou@gmail.com"", ""duyong@ruc.edu.cn""]","[""Bofang Li"", ""Tao Liu"", ""Zhe Zhao"", ""Buzhou Tang"", ""Xiaoyong Du""]","[""Unsupervised Learning"", ""Natural language processing""]","The number of word embedding models is growing every year. Most of them learn word embeddings based on the co-occurrence information of words and their context. However, it's still an open question what is the best definition of context. We provide the first systematical investigation of different context types and representations for learning word embeddings. We conduct comprehensive experiments to evaluate their effectiveness under 4 tasks (21 datasets), which give us some insights about context selection. We hope that this paper, along with the published code, can serve as a guideline of choosing context for our community. +",/pdf/8b3fe8e618d2741d0ed9a838a49bb979f12be66f.pdf,ICLR,2017,This paper investigate different context types and representations for learning word embeddings. 
+SyK00v5xx,,1478290000000.0,1544200000000.0,448,A Simple but Tough-to-Beat Baseline for Sentence Embeddings,"[""arora@cs.princeton.edu"", ""yingyul@cs.princeton.edu"", ""tengyu@cs.princeton.edu""]","[""Sanjeev Arora"", ""Yingyu Liang"", ""Tengyu Ma""]","[""Natural language processing"", ""Unsupervised Learning""]"," +The success of neural network methods for computing word embeddings has motivated methods for generating semantic embeddings of longer pieces of text, such as sentences and paragraphs. Surprisingly, Wieting et al (ICLR'16) showed that such complicated methods are outperformed, especially in out-of-domain (transfer learning) settings, by simpler methods involving mild retraining of word embeddings and basic linear regression. The method of Wieting et al. requires retraining with a substantial labeled dataset such as Paraphrase Database (Ganitkevitch et al., 2013). + +The current paper goes further, showing that the following completely unsupervised sentence embedding is a formidable baseline: Use word embeddings computed using one of the popular methods on unlabeled corpus like Wikipedia, represent the sentence by a weighted average of the word vectors, and then modify them a bit using PCA/SVD. This weighting improves performance by about 10% to 30% in textual similarity tasks, and beats sophisticated supervised methods including RNN's and LSTM's. It even improves Wieting et al.'s embeddings. + This simple method should be used as the baseline to beat in future, especially when labeled training data is scarce or nonexistent. + +The paper also gives a theoretical explanation of the success of the above unsupervised method using a latent variable generative model for sentences, which is a simple extension of the model in Arora et al. (TACL'16) with new ""smoothing"" terms that allow for +words occurring out of context, as well as high probabilities for words like and, not in all contexts. 
",/pdf/3cd3d0e6d510ec56313971e66701feb35abde02d.pdf,ICLR,2017,A simple unsupervised method for sentence embedding that can get results comparable to sophisticated models like RNN's and LSTM's +L-88RyVtXGr,vJ03OkIQy6H,1601310000000.0,1614990000000.0,1117,Learning Deeply Shared Filter Bases for Efficient ConvNets,"[""~Woochul_Kang1"", ""ssregibility@gmail.com""]","[""Woochul Kang"", ""Daeyeon Kim""]","[""Deep learning"", ""ConvNets"", ""parameter sharing"", ""model compression"", ""convolutional neural networks"", ""recursive networks""]","Recently, inspired by repetitive block structure of modern ConvNets, such as ResNets, parameter-sharing among repetitive convolution layers has been proposed to reduce the size of parameters. However, naive sharing of convolution filters poses many challenges such as overfitting and vanishing/exploding gradients, resulting in worse performance than non-shared counterpart models. Furthermore, sharing parameters often increases computational complexity due to additional operations for re-parameterization. In this work, we propose an efficient parameter-sharing structure and an effective training mechanism of deeply shared parameters. In the proposed ConvNet architecture, convolution layers are decomposed into a filter basis, that can be shared recursively, and layer-specific parts. We conjecture that a shared filter basis combined with a small amount of layer-specific parameters can retain, or further enhance, the representation power of individual layers, if a proper training method is applied. We show both theoretically and empirically that potential vanishing/exploding gradients problems can be mitigated by enforcing orthogonality to the shared filter bases. 
Experimental results demonstrate that our scheme effectively reduces redundancy by saving up to 63.8% of parameters while consistently outperforming non-shared counterpart networks even when a filter basis is deeply shared by up to 10 repetitive convolution layers.",/pdf/a5fd9412c06a5910e1ee44147a373ed20bdac36c.pdf,ICLR,2021,We propose an efficient recursive parameter-sharing structure and an effective training mechanism for ConvNets. +SlrqM9_lyju,GU91WRyNOqr,1601310000000.0,1616000000000.0,1900,AutoLRS: Automatic Learning-Rate Schedule by Bayesian Optimization on the Fly,"[""~Yuchen_Jin1"", ""~Tianyi_Zhou1"", ""liangyu@cs.washington.edu"", ""zhuyibo@bytedance.com"", ""guochuanxiong@bytedance.com"", ""~Marco_Canini1"", ""~Arvind_Krishnamurthy1""]","[""Yuchen Jin"", ""Tianyi Zhou"", ""Liangyu Zhao"", ""Yibo Zhu"", ""Chuanxiong Guo"", ""Marco Canini"", ""Arvind Krishnamurthy""]",[],"The learning rate (LR) schedule is one of the most important hyper-parameters needing careful tuning in training DNNs. However, it is also one of the least automated parts of machine learning systems and usually costs significant manual effort and computing. Though there are pre-defined LR schedules and optimizers with adaptive LR, they introduce new hyperparameters that need to be tuned separately for different tasks/datasets. In this paper, we consider the question: Can we automatically tune the LR over the course of training without human involvement? We propose an efficient method, AutoLRS, which automatically optimizes the LR for each training stage by modeling training dynamics. AutoLRS aims to find an LR that minimizes the validation loss, every $\tau$ steps. We formulate it as black-box optimization and solve it by Bayesian optimization (BO). However, collecting training instances for BO requires a system to evaluate each LR queried by BO's acquisition function for $\tau$ steps, which is prohibitively expensive in practice. 
Instead, we apply each candidate LR for only $\tau'\ll\tau$ steps and train an exponential model to predict the validation loss after $\tau$ steps. This mutual-training process between BO and the exponential model allows us to bound the number of training steps invested in the BO search. We demonstrate the advantages and the generality of AutoLRS through extensive experiments of training DNNs from diverse domains and using different optimizers. The LR schedules auto-generated by AutoLRS leads to a speedup of $1.22\times$, $1.43\times$, and $1.5\times$ when training ResNet-50, Transformer, and BERT, respectively, compared to the LR schedules in their original papers, and an average speedup of $1.31\times$ over state-of-the-art highly tuned LR schedules.",/pdf/2bf7cedff713d5a595539d7b724e1ab3e9d40b76.pdf,ICLR,2021, +ol_xwLR2uWD,SOlPIQQfwze,1601310000000.0,1614990000000.0,828,Reviving Autoencoder Pretraining,"[""~You_Xie1"", ""~Nils_Thuerey1""]","[""You Xie"", ""Nils Thuerey""]","[""unsupervised pretraining"", ""greedy layer-wise pretraining"", ""transfer learning"", ""orthogonality""]","The pressing need for pretraining algorithms has been diminished by numerous advances in terms of regularization, architectures, and optimizers. Despite this trend, we re-visit the classic idea of unsupervised autoencoder pretraining and propose a modified variant that relies on a full reverse pass trained in conjunction with a given training task. We establish links between SVD and pretraining and show how it can be leveraged for gaining insights about the learned structures. Most importantly, we demonstrate that our approach yields an improved performance for a wide variety of relevant learning and transfer tasks ranging from fully connected networks over ResNets to GANs. 
Our results demonstrate that unsupervised pretraining has not lost its practical relevance in today's deep learning environment.",/pdf/9e1a702ca6f54b4acb3a9ba3538a5ed0621a9fdc.pdf,ICLR,2021,We re-visit unsupervised autoencoder pretraining and propose a variant that relies on a full reverse pass trained in conjunction with a given training task. +3Wp8HM2CNdR,iaSSJt1dly,1601310000000.0,1614990000000.0,1203,Whitening for Self-Supervised Representation Learning,"[""~Aleksandr_Ermolov1"", ""~Aliaksandr_Siarohin1"", ""~Enver_Sangineto1"", ""~Nicu_Sebe1""]","[""Aleksandr Ermolov"", ""Aliaksandr Siarohin"", ""Enver Sangineto"", ""Nicu Sebe""]","[""self-supervised learning"", ""unsupervised learning"", ""contrastive loss"", ""triplet loss"", ""whitening""]","Most of the self-supervised representation learning methods are based on the contrastive loss and the instance-discrimination task, where augmented versions of the same image instance (""positives"") are contrasted with instances extracted from other images (""negatives""). For the learning to be effective, a lot of negatives should be compared with a positive pair, which is computationally demanding. In this paper, we propose a different direction and a new loss function for self-supervised representation learning which is based on the whitening of the latent-space features. The whitening operation has a ""scattering"" effect on the batch samples, which compensates the use of negatives, avoiding degenerate solutions where all the sample representations collapse to a single point. Our Whitening MSE (W-MSE) loss does not require special heuristics (e.g. additional networks) and it is conceptually simple. Since negatives are not needed, we can extract multiple positive pairs from the same image instance. We empirically show that W-MSE is competitive with respect to popular, more complex self-supervised methods. 
The source code of the method and all the experiments is included in the Supplementary Material.",/pdf/949ea77144dfa2008c1c3625f96c60c559678ac4.pdf,ICLR,2021,self-supervised loss based on whitening +9CG8RW_p3Y,zxL9s0gUyyZ,1601310000000.0,1614990000000.0,1827,Fundamental Limits and Tradeoffs in Invariant Representation Learning,"[""~Han_Zhao1"", ""~Chen_Dan1"", ""~Bryon_Aragam1"", ""~Tommi_S._Jaakkola1"", ""~Geoff_Gordon2"", ""~Pradeep_Kumar_Ravikumar1""]","[""Han Zhao"", ""Chen Dan"", ""Bryon Aragam"", ""Tommi S. Jaakkola"", ""Geoff Gordon"", ""Pradeep Kumar Ravikumar""]","[""Representation learning""]","Many machine learning applications involve learning representations that achieve two competing goals: To maximize information or accuracy with respect to a target while simultaneously maximizing invariance or independence with respect to a subset of features. Typical examples include privacy-preserving learning, domain adaptation, and algorithmic fairness, just to name a few. In fact, all of the above problems admit a common minimax game-theoretic formulation, whose equilibrium represents a fundamental tradeoff between accuracy and invariance. In this paper, we provide an information-theoretic analysis of this general and important problem under both classification and regression settings. In both cases, we analyze the inherent tradeoffs between accuracy and invariance by providing a geometric characterization of the feasible region in the information plane, where we connect the geometric properties of this feasible region to the fundamental limitations of the tradeoff problem. In the regression setting, we also derive a tight lower bound on the Lagrangian objective that quantifies the tradeoff between accuracy and invariance. Our results shed new light on this fundamental problem by providing insights on the interplay between accuracy and invariance. 
These results deepen our understanding of this fundamental problem and may be useful in guiding the design of adversarial representation learning algorithms. +",/pdf/eb0a4528b87b1c732cb3ccb8bbc426c8bba5e31d.pdf,ICLR,2021,We provide an information-theoretic analysis of the tradeoff problem between accuracy and invariance under both classification and regression settings. +Hy1d-ebAb,BkRwWg-Cb,1509130000000.0,1518730000000.0,556,Learning Deep Generative Models of Graphs,"[""yujiali@google.com"", ""vinyals@google.com"", ""cdyer@google.com"", ""razp@google.com"", ""peterbattaglia@google.com""]","[""Yujia Li"", ""Oriol Vinyals"", ""Chris Dyer"", ""Razvan Pascanu"", ""Peter Battaglia""]","[""Generative Model of Graphs""]","Graphs are fundamental data structures required to model many important real-world data, from knowledge graphs, physical and social interactions to molecules and proteins. In this paper, we study the problem of learning generative models of graphs from a dataset of graphs of interest. After learning, these models can be used to generate samples with similar properties as the ones in the dataset. Such models can be useful in a lot of applications, e.g. drug discovery and knowledge graph construction. The task of learning generative models of graphs, however, has its unique challenges. In particular, how to handle symmetries in graphs and ordering of its elements during the generation process are important issues. We propose a generic graph neural net based model that is capable of generating any arbitrary graph. We study its performance on a few graph generation tasks compared to baselines that exploit domain knowledge. We discuss potential issues and open problems for such generative models going forward.",/pdf/a5216c3df94438533331393c67a868d4bff54bfe.pdf,ICLR,2018,We study the graph generation problem and propose a powerful deep generative model capable of generating arbitrary graphs. 
+HyDAQl-AW,Sy807l-0-,1509130000000.0,1518730000000.0,572,Time Limits in Reinforcement Learning,"[""f.pardo@imperial.ac.uk"", ""a.tavakoli@imperial.ac.uk"", ""v.levdik@imperial.ac.uk"", ""p.kormushev@imperial.ac.uk""]","[""Fabio Pardo"", ""Arash Tavakoli"", ""Vitaly Levdik"", ""Petar Kormushev""]","[""reinforcement learning"", ""Markov decision processes"", ""deep learning""]","In reinforcement learning, it is common to let an agent interact with its environment for a fixed amount of time before resetting the environment and repeating the process in a series of episodes. The task that the agent has to learn can either be to maximize its performance over (i) that fixed amount of time, or (ii) an indefinite period where the time limit is only used during training. In this paper, we investigate theoretically how time limits could effectively be handled in each of the two cases. In the first one, we argue that the terminations due to time limits are in fact part of the environment, and propose to include a notion of the remaining time as part of the agent's input. In the second case, the time limits are not part of the environment and are only used to facilitate learning. We argue that such terminations should not be treated as environmental ones and propose a method, specific to value-based algorithms, that incorporates this insight by continuing to bootstrap at the end of each partial episode. To illustrate the significance of our proposals, we perform several experiments on a range of environments from simple few-state transition graphs to complex control tasks, including novel and standard benchmark domains. Our results show that the proposed methods improve the performance and stability of existing reinforcement learning algorithms.",/pdf/faf99a88038c55121ac8127968c16d0713519a76.pdf,ICLR,2018,We consider the problem of learning optimal policies in time-limited and time-unlimited domains using time-limited interactions. 
+Sy0GnUxCb,S1TM2LgR-,1509090000000.0,1519420000000.0,277,Emergent Complexity via Multi-Agent Competition,"[""tbansal@cs.umass.edu"", ""jakub@openai.com"", ""szymon@openai.com"", ""ilyasu@openai.com"", ""mordatch@openai.com""]","[""Trapit Bansal"", ""Jakub Pachocki"", ""Szymon Sidor"", ""Ilya Sutskever"", ""Igor Mordatch""]","[""multi-agent systems"", ""multi-agent competition"", ""self-play"", ""deep reinforcement learning""]","Reinforcement learning algorithms can train agents that solve problems in complex, interesting environments. Normally, the complexity of the trained agent is closely related to the complexity of the environment. This suggests that a highly capable agent requires a complex environment for training. In this paper, we point out that a competitive multi-agent environment trained with self-play can produce behaviors that are far more complex than the environment itself. We also point out that such environments come with a natural curriculum, because for any skill level, an environment full of agents of this level will have the right level of difficulty. +This work introduces several competitive multi-agent environments where agents compete in a 3D world with simulated physics. The trained agents learn a wide variety of complex and interesting skills, even though the environment themselves are relatively simple. The skills include behaviors such as running, blocking, ducking, tackling, fooling opponents, kicking, and defending using both arms and legs. 
A highlight of the learned behaviors can be found here: https://goo.gl/eR7fbX",/pdf/71a26e6a580fcfcccd1a1dbc408daaacb2a32327.pdf,ICLR,2018, +Byeq_xHtwS,SklKQalFvS,1569440000000.0,1577170000000.0,2408,Neural Video Encoding,"[""abelb@nvidia.com"", ""rdipietro@nvidia.com""]","[""Abel Brown"", ""Robert DiPietro""]","[""Kolmogorov complexity"", ""differentiable programming"", ""convolutional neural networks""]","Deep neural networks have had unprecedented success in computer vision, natural language processing, and speech largely due to the ability to search for suitable task algorithms via differentiable programming. In this paper, we borrow ideas from Kolmogorov complexity theory and normalizing flows to explore the possibilities of finding arbitrary algorithms that represent data. In particular, algorithms which encode sequences of video image frames. Ultimately, we demonstrate neural video encoded using convolutional neural networks to transform autoregressive noise processes and show that this method has surprising cryptographic analogs for information security.",/pdf/96c263d7fa038b57f56e9054c43d8a9b73d3b2c2.pdf,ICLR,2020,We explore applications of differentiable programming to Kolmogorov complexity in order to realize efficient programs that encode data. +SyX0IeWAW,BJ7RLgWRZ,1509130000000.0,1519360000000.0,594,META LEARNING SHARED HIERARCHIES,"[""kevinfrans2@gmail.com"", ""jonathanho@berkeley.edu"", ""c.xi@eecs.berkeley.edu"", ""pabbeel@cs.berkeley.edu"", ""joschu@openai.com""]","[""Kevin Frans"", ""Jonathan Ho"", ""Xi Chen"", ""Pieter Abbeel"", ""John Schulman""]","[""hierarchal reinforcement learning"", ""meta-learning""]","We develop a metalearning approach for learning hierarchically structured policies, improving sample efficiency on unseen tasks through the use of shared primitives—policies that are executed for large numbers of timesteps. 
Specifically, a set of primitives are shared within a distribution of tasks, and are switched between by task-specific policies. We provide a concrete metric for measuring the strength of such hierarchies, leading to an optimization problem for quickly reaching high reward on unseen tasks. We then present an algorithm to solve this problem end-to-end through the use of any off-the-shelf reinforcement learning method, by repeatedly sampling new tasks and resetting task-specific policies. We successfully discover meaningful motor primitives for the directional movement of four-legged robots, solely by interacting with distributions of mazes. We also demonstrate the transferability of primitives to solve long-timescale sparse-reward obstacle courses, and we enable 3D humanoid robots to robustly walk and crawl with the same policy.",/pdf/f2dc7f530129271d5a2d2a2ba21a201105b6f8c1.pdf,ICLR,2018,learn hierarchal sub-policies through end-to-end training over a distribution of tasks +SkeAaJrKDS,r1lLr8ytvB,1569440000000.0,1583910000000.0,2008,Combining Q-Learning and Search with Amortized Value Estimates,"[""jhamrick@google.com"", ""vbapst@google.com"", ""alvarosg@google.com"", ""tpfaff@google.com"", ""theophane@google.com"", ""lbuesing@google.com"", ""peterbattaglia@google.com""]","[""Jessica B. Hamrick"", ""Victor Bapst"", ""Alvaro Sanchez-Gonzalez"", ""Tobias Pfaff"", ""Theophane Weber"", ""Lars Buesing"", ""Peter W. Battaglia""]","[""model-based RL"", ""Q-learning"", ""MCTS"", ""search""]","We introduce ""Search with Amortized Value Estimates"" (SAVE), an approach for combining model-free Q-learning with model-based Monte-Carlo Tree Search (MCTS). In SAVE, a learned prior over state-action values is used to guide MCTS, which estimates an improved set of state-action values. The new Q-estimates are then used in combination with real experience to update the prior. 
This effectively amortizes the value computation performed by MCTS, resulting in a cooperative relationship between model-free learning and model-based search. SAVE can be implemented on top of any Q-learning agent with access to a model, which we demonstrate by incorporating it into agents that perform challenging physical reasoning tasks and Atari. SAVE consistently achieves higher rewards with fewer training steps, and---in contrast to typical model-based search approaches---yields strong performance with very small search budgets. By combining real experience with information computed during search, SAVE demonstrates that it is possible to improve on both the performance of model-free learning and the computational cost of planning.",/pdf/24f332c1667c84011b2b9eaa63a26cd70cf9e86f.pdf,ICLR,2020,"We propose a model-based method called ""Search with Amortized Value Estimates"" (SAVE) which leverages both real and planned experience by combining Q-learning with Monte-Carlo Tree Search, achieving strong performance with very small search budgets." +BkeWw6VFwr,B1l5nEcPDr,1569440000000.0,1583910000000.0,585,Certified Robustness for Top-k Predictions against Adversarial Perturbations via Randomized Smoothing,"[""jinyuan.jia@duke.edu"", ""xiaoyu.cao@duke.edu"", ""binghui.wang@duke.edu"", ""neil.gong@duke.edu""]","[""Jinyuan Jia"", ""Xiaoyu Cao"", ""Binghui Wang"", ""Neil Zhenqiang Gong""]","[""Certified Adversarial Robustness"", ""Randomized Smoothing"", ""Adversarial Examples""]","It is well-known that classifiers are vulnerable to adversarial perturbations. To defend against adversarial perturbations, various certified robustness results have been derived. However, existing certified robustnesses are limited to top-1 predictions. In many real-world applications, top-$k$ predictions are more relevant. In this work, we aim to derive certified robustness for top-$k$ predictions. 
In particular, our certified robustness is based on randomized smoothing, which turns any classifier to a new classifier via adding noise to an input example. We adopt randomized smoothing because it is scalable to large-scale neural networks and applicable to any classifier. We derive a tight robustness in $\ell_2$ norm for top-$k$ predictions when using randomized smoothing with Gaussian noise. We find that generalizing the certified robustness from top-1 to top-$k$ predictions faces significant technical challenges. We also empirically evaluate our method on CIFAR10 and ImageNet. For example, our method can obtain an ImageNet classifier with a certified top-5 accuracy of 62.8\% when the $\ell_2$-norms of the adversarial perturbations are less than 0.5 (=127/255). Our code is publicly available at: \url{https://github.com/jjy1994/Certify_Topk}. ",/pdf/993579d504d60a321258536f640b344729e5aea3.pdf,ICLR,2020,We study the certified robustness for top-k predictions via randomized smoothing under Gaussian noise and derive a tight robustness bound in L_2 norm. +MG8Zde0ip6u,eMwT-iysFMTj,1601310000000.0,1614990000000.0,1323,A Siamese Neural Network for Behavioral Biometrics Authentication,"[""~Jes\u00fas_Solano1"", ""esteban.rivera@appgate.com"", ""alejandra.castelblanco@appgate.com"", ""lizzy.tengana@appgate.com"", ""christian.lopez@appgate.com"", ""martin.ochoa@appgate.com""]","[""Jes\u00fas Solano"", ""Esteban Rivera"", ""Alejandra Castelblanco"", ""Lizzy Tengana"", ""Christian Lopez"", ""Martin Ochoa""]","[""Deep Learning"", ""Few-shot Learning"", ""Behavioral Biometrics"", ""Biometric Authentication""]","The raise in popularity of personalized web and mobile applications brings about a need of robust authentication systems. Although password authentication is the most popular authentication mechanism, it has also several drawbacks. 
Behavioral Biometrics Authentication has emerged as a complementary risk-based authentication approach which aims at profiling users based on their behavior while interacting with computers/smartphones. In this work we propose a novel Siamese Neural Network to perform a few-shot verification of user's behavior. We develop our approach to identify behavior from either human-computer or human-smartphone interaction. For computer interaction our approach learns from mouse and keyboard dynamics, while for smartphone interaction it learns from holding patterns and touch patterns. We show that our approach has a few-shot classification accuracy of up to 99.8% and 90.8% for mobile and web interactions, respectively. We also test our approach on a database that contains over 100K different web interactions collected in the wild.",/pdf/d1a1d5c3442ef67db75b417985aee9e8f2f3bf91.pdf,ICLR,2021, +FPpZrRfz6Ss,hmOkqFd_fIY,1601310000000.0,1614990000000.0,316,To Learn Effective Features: Understanding the Task-Specific Adaptation of MAML,"[""~Zhijie_Lin1"", ""~Zhou_Zhao2"", ""~Zhu_Zhang3"", ""huaibaoxing@huawei.com"", ""nicholas.yuan@huawei.com""]","[""Zhijie Lin"", ""Zhou Zhao"", ""Zhu Zhang"", ""Huai Baoxing"", ""Jing Yuan""]","[""Meta-Learning"", ""Few-Shot Learning"", ""Meta-initialization"", ""Task-specific Adaptation""]","Meta learning, an effective way for learning unseen tasks with few samples, is an important research +area in machine learning. +Model Agnostic Meta-Learning~(MAML)~(\cite{finn2017model}) is one of the most well-known gradient-based meta learning algorithms, that learns +the meta-initialization through the inner and outer optimization loop. +The inner loop is to perform fast adaptation in several gradient update steps with the support datapoints, +while the outer loop to generalize the updated model to the query datapoints. 
+Recently, it has been argued that instead of rapid learning and adaptation, the learned meta-initialization through MAML +has already absorbed the high-quality features prior, where the task-specific head at training +facilitates the feature learning. +In this work, we investigate the impact of the task-specific adaptation of MAML and discuss the general formula for +other gradient-based and metric-based meta-learning approaches. +From our analysis, we further devise the Random Decision Planes~(RDP) algorithm to find a suitable linear classifier +without any gradient descent step and the Meta Contrastive Learning~(MCL) algorithm to exploit the inter-samples relationship +instead of the expensive inner-loop adaptation. +We conduct sufficient experiments on various datasets to explore our proposed algorithms.",/pdf/e49dbab3a1fac5220d3ead80d518c688e5d293cc.pdf,ICLR,2021,We further study the impact of task-specific adaptation and devise an effective training paradigm with better results but less computation costs. +ucEXZQncukK,UqGdVUl8Kak,1601310000000.0,1614990000000.0,219,Bayesian Online Meta-Learning,"[""~Pau_Ching_Yap1"", ""~Hippolyt_Ritter1"", ""~David_Barber1""]","[""Pauching Yap"", ""Hippolyt Ritter"", ""David Barber""]","[""Bayesian online learning"", ""few-shot learning"", ""meta-learning""]","Neural networks are known to suffer from catastrophic forgetting when trained on sequential datasets. While there have been numerous attempts to solve this problem for large-scale supervised classification, little has been done to overcome catastrophic forgetting for few-shot classification problems. Few-shot meta-learning algorithms often require all few-shot tasks to be readily available in a batch for training. The popular gradient-based model-agnostic meta-learning algorithm (MAML) is a typical algorithm that suffers from these limitations. 
This work introduces a Bayesian online meta-learning framework to tackle the catastrophic forgetting and the sequential few-shot tasks problems. Our framework incorporates MAML into a Bayesian online learning algorithm with Laplace approximation or variational inference. This framework enables few-shot classification on a range of sequentially arriving datasets with a single meta-learned model and training on sequentially arriving few-shot tasks. The experimental evaluations demonstrate that our framework can effectively prevent catastrophic forgetting and is capable of online meta-learning in various few-shot classification settings.",/pdf/4bbd7ba186fdcc02823a43eeba04182768bea3df.pdf,ICLR,2021,"We introduce the BOML framework that can few-shot classify tasks originating from different distributions, and can handle few-shot online learning in sequential tasks setting." +rybDdHe0Z,H1xPdBgRb,1509080000000.0,1518730000000.0,259,Sequence Transfer Learning for Neural Decoding,"[""velango@eng.ucsd.edu"", ""anp054@eng.ucsd.edu"", ""kai.miller@stanford.edu"", ""vgilja@eng.ucsd.edu""]","[""Venkatesh Elango*"", ""Aashish N Patel*"", ""Kai J Miller"", ""Vikash Gilja""]","[""Transfer Learning"", ""Applications"", ""Neural decoding""]","A fundamental challenge in designing brain-computer interfaces (BCIs) is decoding behavior from time-varying neural oscillations. In typical applications, decoders are constructed for individual subjects and with limited data leading to restrictions on the types of models that can be utilized. Currently, the best performing decoders are typically linear models capable of utilizing rigid timing constraints with limited training data. Here we demonstrate the use of Long Short-Term Memory (LSTM) networks to take advantage of the temporal information present in sequential neural data collected from subjects implanted with electrocorticographic (ECoG) electrode arrays performing a finger flexion task. 
Our constructed models are capable of achieving accuracies that are comparable to existing techniques while also being robust to variation in sample data size. Moreover, we utilize the LSTM networks and an affine transformation layer to construct a novel architecture for transfer learning. We demonstrate that in scenarios where only the affine transform is learned for a new subject, it is possible to achieve results comparable to existing state-of-the-art techniques. The notable advantage is the increased stability of the model during training on novel subjects. Relaxing the constraint of only training the affine transformation, we establish our model as capable of exceeding performance of current models across all training data sizes. Overall, this work demonstrates that LSTMs are a versatile model that can accurately capture temporal patterns in neural data and can provide a foundation for transfer learning in neural decoding.",/pdf/e290626ef2812f50350a9d116003061a77b432a7.pdf,ICLR,2018, +Fn5wiAq2SR,rb-_kQpbfBD,1601310000000.0,1614990000000.0,720,Adversarial Training using Contrastive Divergence,"[""~Hongjun_Wang2"", ""~Guanbin_Li2"", ""~Liang_Lin1""]","[""Hongjun Wang"", ""Guanbin Li"", ""Liang Lin""]","[""Adversarial Training"", ""Contrastive Divergence""]","To protect the security of machine learning models against adversarial examples, adversarial training becomes the most popular and powerful strategy against various adversarial attacks by injecting adversarial examples into training data. However, it is time-consuming and requires high computation complexity to generate suitable adversarial examples for ensuring the robustness of models, which impedes the spread and application of adversarial training. In this work, we reformulate adversarial training as a combination of stationary distribution exploring, sampling, and training. 
Each updating of parameters of DNN is based on several transitions from the data samples as the initial states in a Hamiltonian system. Inspired by our new paradigm, we design a new generative method for adversarial training by using Contrastive Divergence (ATCD), which approaches the equilibrium distribution of adversarial examples with only few iterations by building from small modifications of the standard Contrastive Divergence (CD). Our adversarial training algorithm achieves much higher robustness than any other state-of-the-art adversarial training acceleration method on the ImageNet, CIFAR-10, and MNIST datasets and reaches a balance between performance and efficiency.",/pdf/f2522082ad50cb9a45e34b29455d1d9d542d1188.pdf,ICLR,2021,We design a new generative method for adversarial training by using Contrastive Divergence to reaches a balance of performance and efficiency. +HyXBcYg0b,H1fBqteAW,1509100000000.0,1518730000000.0,340,Residual Gated Graph ConvNets,"[""xbresson@ntu.edu.sg"", ""tlaurent@lmu.edu""]","[""Xavier Bresson"", ""Thomas Laurent""]","[""graph neural networks"", ""ConvNets"", ""RNNs"", ""pattern matching"", ""semi-supervised clustering""]","Graph-structured data such as social networks, functional brain networks, gene regulatory networks, communications networks have brought the interest in generalizing deep learning techniques to graph domains. In this paper, we are interested to design neural networks for graphs with variable length in order to solve learning problems such as vertex classification, graph classification, graph regression, and graph generative tasks. Most existing works have focused on recurrent neural networks (RNNs) to learn meaningful representations of graphs, and more recently new convolutional neural networks (ConvNets) have been introduced. In this work, we want to compare rigorously these two fundamental families of architectures to solve graph learning tasks. 
We review existing graph RNN and ConvNet architectures, and propose natural extension of LSTM and ConvNet to graphs with arbitrary size. Then, we design a set of analytically controlled experiments on two basic graph problems, i.e. subgraph matching and graph clustering, to test the different architectures. Numerical results show that the proposed graph ConvNets are 3-17% more accurate and 1.5-4x faster than graph RNNs. Graph ConvNets are also 36% more accurate than variational (non-learning) techniques. Finally, the most effective graph ConvNet architecture uses gated edges and residuality. Residuality plays an essential role to learn multi-layer architectures as they provide a 10% gain of performance.",/pdf/4ae1c5290208ac5496c0f2bae3944b8e4ce7f7ac.pdf,ICLR,2018,"We compare graph RNNs and graph ConvNets, and we consider the most generic class of graph ConvNets with residuality." +HklRwaEKwB,H1l8IFowPH,1569440000000.0,1583910000000.0,615,"Ridge Regression: Structure, Cross-Validation, and Sketching","[""sfliu@stanford.edu"", ""dobribanedgar@gmail.com""]","[""Sifan Liu"", ""Edgar Dobriban""]","[""ridge regression"", ""sketching"", ""random matrix theory"", ""cross-validation"", ""high-dimensional asymptotics""]","We study the following three fundamental problems about ridge regression: (1) what is the structure of the estimator? (2) how to correctly use cross-validation to choose the regularization parameter? and (3) how to accelerate computation without losing too much accuracy? We consider the three problems in a unified large-data linear model. We give a precise representation of ridge regression as a covariance matrix-dependent linear combination of the true parameter and the noise. +We study the bias of $K$-fold cross-validation for choosing the regularization parameter, and propose a simple bias-correction. We analyze the accuracy of primal and dual sketching for ridge regression, showing they are surprisingly accurate. 
Our results are illustrated by simulations and by analyzing empirical data.",/pdf/455671549ea589bd0c09c5457c52062150923fca.pdf,ICLR,2020,"We study the structure of ridge regression in a high-dimensional asymptotic framework, and get insights about cross-validation and sketching." +r1ZdKJ-0W,HygOtk-AZ,1509120000000.0,1519410000000.0,502,Deep Gaussian Embedding of Graphs: Unsupervised Inductive Learning via Ranking,"[""a.bojchevski@in.tum.de"", ""guennemann@in.tum.de""]","[""Aleksandar Bojchevski"", ""Stephan G\u00fcnnemann""]","[""node embeddings"", ""graphs"", ""unsupervised learning"", ""inductive learning"", ""uncertainty"", ""deep learning""]","Methods that learn representations of nodes in a graph play a critical role in network analysis since they enable many downstream learning tasks. We propose Graph2Gauss - an approach that can efficiently learn versatile node embeddings on large scale (attributed) graphs that show strong performance on tasks such as link prediction and node classification. Unlike most approaches that represent nodes as point vectors in a low-dimensional continuous space, we embed each node as a Gaussian distribution, allowing us to capture uncertainty about the representation. Furthermore, we propose an unsupervised method that handles inductive learning scenarios and is applicable to different types of graphs: plain/attributed, directed/undirected. By leveraging both the network structure and the associated node attributes, we are able to generalize to unseen nodes without additional training. To learn the embeddings we adopt a personalized ranking formulation w.r.t. the node distances that exploits the natural ordering of the nodes imposed by the network structure. Experiments on real world networks demonstrate the high performance of our approach, outperforming state-of-the-art network embedding methods on several different tasks. 
Additionally, we demonstrate the benefits of modeling uncertainty - by analyzing it we can estimate neighborhood diversity and detect the intrinsic latent dimensionality of a graph. ",/pdf/bce371f0e73f807ce29ec193e73ac5d49dd43cb0.pdf,ICLR,2018, We embed nodes in a graph as Gaussian distributions allowing us to capture uncertainty about their representation. +SJIMPr9eg,,1478280000000.0,1481130000000.0,256,Boosted Residual Networks,"[""a.mosca@dcs.bbk.ac.uk"", ""gmagoulas@dcs.bbk.ac.uk""]","[""Alan Mosca"", ""George D. Magoulas""]",[],"In this paper we present a new ensemble method, called Boosted Residual Networks, +which builds an ensemble of Residual Networks by growing the member +network at each round of boosting. The proposed approach combines recent developements +in Residual Networks - a method for creating very deep networks by +including a shortcut layer between different groups of layers - with the Deep Incremental +Boosting, which has been proposed as a methodology to train fast ensembles +of networks of increasing depth through the use of boosting. We demonstrate +that the synergy of Residual Networks and Deep Incremental Boosting has better +potential than simply boosting a Residual Network of fixed structure or using the +equivalent Deep Incremental Boosting without the shortcut layers.",/pdf/2615de05c9edd0e70aa0950f3f1979e73958033a.pdf,ICLR,2017, +o2N6AYOp31,zHH8yjJSNC4,1601310000000.0,1614990000000.0,2469,Augmentation-Interpolative AutoEncoders for Unsupervised Few-Shot Image Generation,"[""~Davis_Wertheimer1"", ""~Omid_Poursaeed2"", ""~Bharath_Hariharan3""]","[""Davis Wertheimer"", ""Omid Poursaeed"", ""Bharath Hariharan""]","[""Interpolation"", ""autoencoder"", ""reconstruction"", ""few-shot learning"", ""few-shot image generation"", ""generalization"", ""augmentation""]","We aim to build image generation models that generalize to new domains from few examples. 
To this end, we first investigate the generalization properties of classic image generators, and discover that autoencoders generalize extremely well to new domains, even when trained on highly constrained data. We leverage this insight to produce a robust, unsupervised few-shot image generation algorithm, and introduce a novel training procedure based on recovering an image from data augmentations. Our Augmentation-Interpolative AutoEncoders synthesize realistic images of novel objects from only a few reference images, and outperform both prior interpolative models and supervised few-shot image generators. Our procedure is simple and lightweight, generalizes broadly, and requires no category labels or other supervision during training.",/pdf/7d56765de4510fd060003161377a7bfc4cb62fb8.pdf,ICLR,2021,A reconstruction-based method with strong generalization capability for synthesizing images beyond the training domain +SJe9rh0cFX,HJgBavactX,1538090000000.0,1547330000000.0,1567,On the Universal Approximability and Complexity Bounds of Quantized ReLU Neural Networks,"[""yding5@nd.edu"", ""jliu16@nd.edu"", ""jinjun@us.ibm.com"", ""yshi4@nd.edu""]","[""Yukun Ding"", ""Jinglan Liu"", ""Jinjun Xiong"", ""Yiyu Shi""]","[""Quantized Neural Networks"", ""Universial Approximability"", ""Complexity Bounds"", ""Optimal Bit-width""]","Compression is a key step to deploy large neural networks on resource-constrained platforms. As a popular compression technique, quantization constrains the number of distinct weight values and thus reducing the number of bits required to represent and store each weight. In this paper, we study the representation power of quantized neural networks. First, we prove the universal approximability of quantized ReLU networks on a wide class of functions. Then we provide upper bounds on the number of weights and the memory size for a given approximation error bound and the bit-width of weights for function-independent and function-dependent structures. 
Our results reveal that, to attain an approximation error bound of $\epsilon$, the number of weights needed by a quantized network is no more than $\mathcal{O}\left(\log^5(1/\epsilon)\right)$ times that of an unquantized network. This overhead is of much lower order than the lower bound of the number of weights needed for the error bound, supporting the empirical success of various quantization techniques. To the best of our knowledge, this is the first in-depth study on the complexity bounds of quantized neural networks.",/pdf/b19c7c43db718baa927048e94b97f624fd0b35d1.pdf,ICLR,2019,This paper proves the universal approximability of quantized ReLU neural networks and puts forward the complexity bound given arbitrary error. +H1srNebAZ,rJ5rExZCW,1509130000000.0,1518730000000.0,576,Discovering the mechanics of hidden neurons,"[""simon.carbonnelle@uclouvain.be"", ""christophe.devleeschouwer@uclouvain.be""]","[""Simon Carbonnelle"", ""Christophe De Vleeschouwer""]","[""deep learning"", ""experimental analysis"", ""hidden neurons""]","Neural networks trained through stochastic gradient descent (SGD) have been around for more than 30 years, but they still escape our understanding. This paper takes an experimental approach, with a divide-and-conquer strategy in mind: we start by studying what happens in single neurons. While being the core building block of deep neural networks, the way they encode information about the inputs and how such encodings emerge is still unknown. We report experiments providing strong evidence that hidden neurons behave like binary classifiers during training and testing. During training, analysis of the gradients reveals that a neuron separates two categories of inputs, which are impressively constant across training. During testing, we show that the fuzzy, binary partition described above embeds the core information used by the network for its prediction. 
These observations bring to light some of the core internal mechanics of deep neural networks, and have the potential to guide the next theoretical and practical developments.",/pdf/ca36244900d5dbd718b5cba941a19001cf6bc119.pdf,ICLR,2018,We report experiments providing strong evidence that a neuron behaves like a binary classifier during training and testing +HOFxeCutxZR,gRNp4A6wRuV,1601310000000.0,1616010000000.0,315,Enjoy Your Editing: Controllable GANs for Image Editing via Latent Space Navigation,"[""~Peiye_Zhuang2"", ""~Oluwasanmi_O_Koyejo1"", ""~Alex_Schwing1""]","[""Peiye Zhuang"", ""Oluwasanmi O Koyejo"", ""Alex Schwing""]","[""Image manipulation"", ""GANs"", ""latent space of GANs""]","Controllable semantic image editing enables a user to change entire image attributes with a few clicks, e.g., gradually making a summer scene look like it was taken in winter. Classic approaches for this task use a Generative Adversarial Net (GAN) to learn a latent space and suitable latent-space transformations. However, current approaches often suffer from attribute edits that are entangled, global image identity changes, and diminished photo-realism. To address these concerns, we learn multiple attribute transformations simultaneously, integrate attribute regression into the training of transformation functions, and apply a content loss and an adversarial loss that encourages the maintenance of image identity and photo-realism. We propose quantitative evaluation strategies for measuring controllable editing performance, unlike prior work, which primarily focuses on qualitative evaluation. Our model permits better control for both single- and multiple-attribute editing while preserving image identity and realism during transformation. We provide empirical results for both natural and synthetic images, highlighting that our model achieves state-of-the-art performance for targeted image manipulation. 
",/pdf/818a65b722eab947d8c57665b571d535ce9ca68a.pdf,ICLR,2021,We propose a state-of-the-art approach to semantically edit images by transferring latent vectors towards meaningful latent space directions. +mxIEptSTK6Z,pQYdJAPlFU,1601310000000.0,1614990000000.0,470,Continual learning with neural activation importance,"[""soheekim@khu.ac.kr"", ""~Seungkyu_Lee1""]","[""Sohee Kim"", ""Seungkyu Lee""]",[],"Continual learning is a concept of online learning along with multiple sequential tasks. One of the critical barriers of continual learning is that a network should learn a new task keeping the knowledge of old tasks without access to any data of the old tasks. In this paper, we propose a neuron importance based regularization method for stable continual learning. +We propose a comprehensive experimental evaluation framework on existing benchmark data sets to evaluate not just the accuracy of a certain order of continual learning performance also the robustness of the accuracy along with the changes in the order of tasks.",/pdf/df732b71449b1ec6104f583d7c272467cdd79592.pdf,ICLR,2021, +Aoq37n5bhpJ,3KMu4s3QhLW,1601310000000.0,1614990000000.0,839,Federated learning using mixture of experts,"[""~Edvin_Listo_Zec1"", ""~John_Martinsson1"", ""~Olof_Mogren1"", ""~Leon_Ren\u00e9_S\u00fctfeld1"", ""~Daniel_Gillblad1""]","[""Edvin Listo Zec"", ""John Martinsson"", ""Olof Mogren"", ""Leon Ren\u00e9 S\u00fctfeld"", ""Daniel Gillblad""]","[""federated learning"", ""mixture of experts""]","Federated learning has received attention for its efficiency and privacy benefits,in settings where data is distributed among devices. Although federated learning shows significant promise as a key approach when data cannot be shared or centralized, current incarnations show limited privacy properties and have short-comings when applied to common real-world scenarios. One such scenario is heterogeneous data among devices, where data may come from different generating distributions. 
In this paper, we propose a federated learning framework using a mixture of experts to balance the specialist nature of a locally trained model with the generalist knowledge of a global model in a federated learning setting. Our results show that the mixture of experts model is better suited as a personalized model for devices when data is heterogeneous, outperforming both global and local models. Furthermore, our framework gives strict privacy guarantees, which allows clients to select parts of their data that may be excluded from the federation. The evaluation shows that the proposed solution is robust to the setting where some users require a strict privacy setting and do not disclose their models to a central server at all, opting out from the federation partially or entirely. The proposed framework is general enough to include any kind of machine learning models, and can even use combinations of different kinds.",/pdf/bc5372f4c5d2c99e25950c7bc51222daa6941127.pdf,ICLR,2021,We use a mixture of expert approach to learn personalized models in a federated learning setting with heterogeneous client data +rJeKjwvclx,,1478290000000.0,1487110000000.0,399,Dynamic Coattention Networks For Question Answering,"[""cxiong@salesforce.com"", ""vzhong@salesforce.com"", ""rsocher@salesforce.com""]","[""Caiming Xiong"", ""Victor Zhong"", ""Richard Socher""]","[""Natural language processing"", ""Deep learning"", ""Applications""]","Several deep learning models have been proposed for question answering. However, due to their single-pass nature, they have no way to recover from local maxima corresponding to incorrect answers. To address this problem, we introduce the Dynamic Coattention Network (DCN) for question answering. The DCN first fuses co-dependent representations of the question and the document in order to focus on relevant parts of both. Then a dynamic pointer decoder iterates over potential answer spans. 
This iterative procedure enables the model to recover from initial local maxima corresponding to incorrect answers. On the Stanford question answering dataset, a single DCN model improves the previous state of the art from 71.0% F1 to 75.9%, while a DCN ensemble obtains 80.4% F1.",/pdf/8db6d8f1072c3fb894057a8b06f6ba30c781540a.pdf,ICLR,2017,An end-to-end dynamic neural network model for question answering that achieves the state of the art and best leaderboard performance on the Stanford QA dataset. +rk4Qso0cKm,HJgGbfk5tQ,1538090000000.0,1556950000000.0,610,Adv-BNN: Improved Adversarial Defense through Robust Bayesian Neural Network,"[""xqliu@cs.ucla.edu"", ""yaoli@ucdavis.edu"", ""crwu@ucdavis.edu"", ""chohsieh@cs.ucla.edu""]","[""Xuanqing Liu"", ""Yao Li"", ""Chongruo Wu"", ""Cho-Jui Hsieh""]",[],"We present a new algorithm to train a robust neural network against adversarial attacks. +Our algorithm is motivated by the following two ideas. First, although recent work has demonstrated that fusing randomness can improve the robustness of neural networks (Liu 2017), we noticed that adding noise blindly to all the layers is not the optimal way to incorporate randomness. +Instead, we model randomness under the framework of Bayesian Neural Network (BNN) to formally learn the posterior distribution of models in a scalable way. Second, we formulate the mini-max problem in BNN to learn the best model distribution under adversarial attacks, leading to an adversarial-trained Bayesian neural net. Experiment results demonstrate that the proposed algorithm achieves state-of-the-art performance under strong attacks. 
On CIFAR-10 with VGG network, our model leads to 14% accuracy improvement compared with adversarial training (Madry 2017) and random self-ensemble (Liu, 2017) under PGD attack with 0.035 distortion, and the gap becomes even larger on a subset of ImageNet.",/pdf/2ef7dc041cc1292919d9be7aa9b8b70f3c0e6bb0.pdf,ICLR,2019,"We design an adversarial training method to Bayesian neural networks, showing a much stronger defense to white-box adversarial attacks" +Hyxtso0qtX,Hygm0zn9K7,1538090000000.0,1545360000000.0,640,Adversarial Exploration Strategy for Self-Supervised Imitation Learning,"[""williamd4112@gapp.nthu.edu.tw"", ""yesray0216@gmail.com"", ""ariel@shann.net"", ""shawn420@gapp.nthu.edu.tw"", ""cylee@cs.nthu.edu.tw""]","[""Zhang-Wei Hong"", ""Tsu-Jui Fu"", ""Tzu-Yun Shann"", ""Yi-Hsiang Chang"", ""Chun-Yi Lee""]","[""adversarial exploration"", ""self-supervised"", ""imitation learning""]","We present an adversarial exploration strategy, a simple yet effective imitation learning scheme that incentivizes exploration of an environment without any extrinsic reward or human demonstration. Our framework consists of a deep reinforcement learning (DRL) agent and an inverse dynamics model contesting with each other. The former collects training samples for the latter, and its objective is to maximize the error of the latter. The latter is trained with samples collected by the former, and generates rewards for the former when it fails to predict the actual action taken by the former. In such a competitive setting, the DRL agent learns to generate samples that the inverse dynamics model fails to predict correctly, and the inverse dynamics model learns to adapt to the challenging samples. We further propose a reward structure that ensures the DRL agent collects only moderately hard samples and not overly hard ones that prevent the inverse model from imitating effectively. 
We evaluate the effectiveness of our method on several OpenAI gym robotic arm and hand manipulation tasks against a number of baseline models. Experimental results show that our method is comparable to that directly trained with expert demonstrations, and superior to the other baselines even without any human priors.",/pdf/9909a2976fffd2e52765a85c8a6f92a8eac29532.pdf,ICLR,2019,A simple yet effective imitation learning scheme that incentivizes exploration of an environment without any extrinsic reward or human demonstration. +HkzOWnActX,S1liZgDndm,1538090000000.0,1545360000000.0,1186,Model-Agnostic Meta-Learning for Multimodal Task Distributions,"[""vuoristo@gmail.com"", ""shaohuas@usc.edu"", ""hexiangh@usc.edu"", ""limjj@usc.edu""]","[""Risto Vuorio"", ""Shao-Hua Sun"", ""Hexiang Hu"", ""Joseph J. Lim""]","[""Meta-learning"", ""gradient-based meta-learning"", ""model-based meta-learning""]","Gradient-based meta-learners such as MAML (Finn et al., 2017) are able to learn a meta-prior from similar tasks to adapt to novel tasks from the same distribution with few gradient updates. One important limitation of such frameworks is that they seek a common initialization shared across the entire task distribution, substantially limiting the diversity of the task distributions that they are able to learn from. In this paper, we augment MAML with the capability to identify tasks sampled from a multimodal task distribution and adapt quickly through gradient updates. Specifically, we propose a multimodal MAML algorithm that is able to modulate its meta-learned prior according to the identified task, allowing faster adaptation. We evaluate the proposed model on a diverse set of problems including regression, few-shot image classification, and reinforcement learning. 
The results demonstrate the effectiveness of our model in modulating the meta-learned prior in response to the characteristics of tasks sampled from a multimodal distribution.",/pdf/2d0826459e75d0a9b8f452edf91498849e79670a.pdf,ICLR,2019,"We proposed a meta-learner that generalizes across a multimodal task distribution by identifying the modes of a task distribution and modulating its meta-learned prior parameters accordingly, allowing faster adaptation through gradient updates." +S1ltg1rFDS,BklWnBsuPS,1569440000000.0,1583910000000.0,1515,Black-box Off-policy Estimation for Infinite-Horizon Reinforcement Learning,"[""ali.mousavi1988@gmail.com"", ""lihongli.cs@gmail.com"", ""dennyzhou@google.com"", ""lqiang@cs.utexas.edu""]","[""Ali Mousavi"", ""Lihong Li"", ""Qiang Liu"", ""Denny Zhou""]","[""reinforcement learning"", ""off-policy estimation"", ""importance sampling"", ""propensity score""]","Off-policy estimation for long-horizon problems is important in many real-life applications such as healthcare and robotics, where high-fidelity simulators may not be available and on-policy evaluation is expensive or impossible. Recently, \citet{liu18breaking} proposed an approach that avoids the curse of horizon suffered by typical importance-sampling-based methods. While showing promising results, this approach is limited in practice as it requires data being collected by a known behavior policy. In this work, we propose a novel approach that eliminates such limitations. In particular, we formulate the problem as solving for the fixed point of a ""backward flow"" operator and show that the fixed point solution gives the desired importance ratios of stationary distributions between the target and behavior policies. We analyze its asymptotic consistency and finite-sample +generalization. Experiments on benchmarks verify the effectiveness of our proposed approach. 
+",/pdf/70159d307dae110203ef877556fc5534c81b0eaa.pdf,ICLR,2020,We present a novel approach for the off-policy estimation problem in infinite-horizon RL. +BJmCKBqgl,,1478280000000.0,1481760000000.0,262,DyVEDeep: Dynamic Variable Effort Deep Neural Networks,"[""sanjaygana@gmail.com"", ""venkata0@purdue.edu"", ""ravi@cse.iitm.ac.in"", ""raghunathan@purdue.edu""]","[""Sanjay Ganapathy"", ""Swagath Venkataramani"", ""Balaraman Ravindran"", ""Anand Raghunathan""]",[],"Deep Neural Networks (DNNs) have advanced the state-of-the-art on a variety of machine learning tasks and are deployed widely in many real-world products. However, the compute and data requirements demanded by large-scale DNNs remains a significant challenge. In this work, we address this challenge in the context of DNN inference. We propose Dynamic Variable Effort Deep Neural Networks (DyVEDeep), which exploit the heterogeneity in the characteristics of inputs to DNNs to improve their compute efficiency while maintaining the same classification accuracy. DyVEDeep equips DNNs with dynamic effort knobs, which in course of processing an input, identify how critical a group of computations are to classify the input. DyVEDeep dynamically focuses its compute effort only on the critical computations, while the skipping/approximating the rest. We propose 3 effort knobs that operate at different levels of granularity viz. neuron, feature and layer levels. We build DyVEDeep versions for 5 popular image recognition benchmarks on 3 image datasets---MNIST, CIFAR and ImageNet. 
Across all benchmarks, DyVEDeep achieves 2.1X-2.6X reduction in number of scalar operations, which translates to 1.9X-2.3X performance improvement over a Caffe-based sequential software implementation, for negligible loss in accuracy.",https://www.dropbox.com/s/5ynhm1wy5vu2swq/DyVEDeep-ICLR2017.pdf,ICLR,2017, +CU0APx9LMaL,DgTpyGBq-_M,1601310000000.0,1616070000000.0,3528,NAS-Bench-ASR: Reproducible Neural Architecture Search for Speech Recognition,"[""~Abhinav_Mehrotra1"", ""a.gilramos@samsung.com"", ""~Sourav_Bhattacharya1"", ""~\u0141ukasz_Dudziak1"", ""~Ravichander_Vipperla1"", ""thomas.chau@samsung.com"", ""~Mohamed_S_Abdelfattah1"", ""s.ishtiaq@samsung.com"", ""~Nicholas_Donald_Lane1""]","[""Abhinav Mehrotra"", ""Alberto Gil C. P. Ramos"", ""Sourav Bhattacharya"", ""\u0141ukasz Dudziak"", ""Ravichander Vipperla"", ""Thomas Chau"", ""Mohamed S Abdelfattah"", ""Samin Ishtiaq"", ""Nicholas Donald Lane""]","[""NAS"", ""ASR"", ""Benchmark""]","Powered by innovations in novel architecture design, noise tolerance techniques and increasing model capacity, Automatic Speech Recognition (ASR) has made giant strides in reducing word-error-rate over the past decade. ASR models are often trained with tens of thousand hours of high quality speech data to produce state-of-the-art (SOTA) results. Industry-scale ASR model training thus remains computationally heavy and time-consuming, and consequently has attracted little attention in adopting automatic techniques. On the other hand, Neural Architecture Search (NAS) has gained a lot of interest in the recent years thanks to its successes in discovering efficient architectures, often outperforming handcrafted alternatives. However, by changing the standard training process into a bi-level optimisation problem, NAS approaches often require significantly more time and computational power compared to single-model training, and at the same time increase complexity of the overall process. 
As a result, NAS has been predominately applied to problems which do not require as extensive training as ASR, and even then reproducibility of NAS algorithms is often problematic. Lately, a number of benchmark datasets has been introduced to address reproducibility issues by providing NAS researchers with information about performance of different models obtained through exhaustive evaluation. However, these datasets focus mainly on computer vision and NLP tasks and thus suffer from limited coverage of application domains. In order to increase diversity in the existing NAS benchmarks, and at the same time provide systematic study of the effects of architectural choices for ASR, we release NAS-Bench-ASR – the first NAS benchmark for ASR models. The dataset consists of 8,242 unique models trained on the TIMIT audio dataset for three different target epochs, and each starting from three different initializations. The dataset also includes runtime measurements of all the models on a diverse set of hardware platforms. Lastly, we show that identified good cell structures in our search space for TIMIT transfer well to a much larger LibriSpeech dataset.",/pdf/77fdff265261021a568029907ab02c0f1f6e4639.pdf,ICLR,2021,"The first NAS benchmark for ASR comprising of 8,242 unique models trained on the TIMIT audio dataset." +HyxBpoR5tm,HygoMYp9Km,1538090000000.0,1545360000000.0,802,Adversarially Robust Training through Structured Gradient Regularization,"[""kevin.roth@inf.ethz.ch"", ""aurelien.lucchi@inf.ethz.ch"", ""sebastian.nowozin@microsoft.com"", ""thomas.hofmann@inf.ethz.ch""]","[""Kevin Roth"", ""Aurelien Lucchi"", ""Sebastian Nowozin"", ""Thomas Hofmann""]","[""Adversarial Training"", ""Gradient Regularization"", ""Deep Learning""]","We propose a novel data-dependent structured gradient regularizer to increase the robustness of neural networks vis-a-vis adversarial perturbations. 
Our regularizer can be derived as a controlled approximation from first principles, leveraging the fundamental link between training with noise and regularization. It adds very little computational overhead during learning and is simple to implement generically in standard deep learning frameworks. Our experiments provide strong evidence that structured gradient regularization can act as an effective first line of defense against attacks based on long-range correlated signal corruptions.",/pdf/8454382d83397a557a905e87a4960eb55db75c0f.pdf,ICLR,2019,We propose a novel data-dependent structured gradient regularizer to increase the robustness of neural networks against adversarial perturbations. +BklHF6VtPB,ryxfZJpwPB,1569440000000.0,1577170000000.0,668,Modeling Winner-Take-All Competition in Sparse Binary Projections,"[""wyli@cuhk.edu.cn""]","[""Wenye Li""]","[""Sparse Representation"", ""Sparse Binary Projection"", ""Winner-Take-All""]","Inspired by the advances in biological science, the study of sparse binary projection models has attracted considerable recent research attention. The models project dense input samples into a higher-dimensional space and output sparse binary data representations after Winner-Take-All competition, subject to the constraint that the projection matrix is also sparse and binary. Following the work along this line, we developed a supervised-WTA model when training samples with both input and output representations are available, from which the optimal projection matrix can be obtained with a simple, efficient yet effective algorithm. We further extended the model and the algorithm to an unsupervised setting where only the input representation of the samples is available. In a series of empirical evaluation on similarity search tasks, the proposed models reported significantly improved results over the state-of-the-art methods in both search accuracy and running time. 
The successful results give us strong confidence that the work provides a highly practical tool to real world applications. +",/pdf/a3cfd4eeff773ab8befefe6418716c43a547cf06.pdf,ICLR,2020,"We developed a Winner-Take-All model that learns a sparse binary representation for input samples, with significantly improved speed and accuracy." +v5gjXpmR8J,EqHG_II0-J,1601310000000.0,1616080000000.0,2654,SSD: A Unified Framework for Self-Supervised Outlier Detection,"[""~Vikash_Sehwag1"", ""~Mung_Chiang2"", ""~Prateek_Mittal1""]","[""Vikash Sehwag"", ""Mung Chiang"", ""Prateek Mittal""]","[""Outlier detection"", ""Out-of-distribution detection in deep learning"", ""Anomaly detection with deep neural networks"", ""Self-supervised learning""]","We ask the following question: what training information is required to design an effective outlier/out-of-distribution (OOD) detector, i.e., detecting samples that lie far away from training distribution? Since unlabeled data is easily accessible for many applications, the most compelling approach is to develop detectors based on only unlabeled in-distribution data. However, we observe that most existing detectors based on unlabeled data perform poorly, often equivalent to a random prediction. In contrast, existing state-of-the-art OOD detectors achieve impressive performance but require access to fine-grained data labels for supervised training. We propose SSD, an outlier detector based on only unlabeled in-distribution data. We use self-supervised representation learning followed by a Mahalanobis distance based detection in the feature space. We demonstrate that SSD outperforms most existing detectors based on unlabeled data by a large margin. Additionally, SSD even achieves performance on par, and sometimes even better, with supervised training based detectors. Finally, we expand our detection framework with two key extensions. 
First, we formulate few-shot OOD detection, in which the detector has access to only one to five samples from each class of the targeted OOD dataset. Second, we extend our framework to incorporate training data labels, if available. We find that our novel detection framework based on SSD displays enhanced performance with these extensions, and achieves state-of-the-art performance. Our code is publicly available at https://github.com/inspire-group/SSD.",/pdf/89219c090f5f217510ca46c6b68a0b62df071e81.pdf,ICLR,2021,We achieve competitive performance on outlier/out-of-distribution detection using only unlabeled training data. +rI3RMgDkZqJ,qR8io72QdDH,1601310000000.0,1614990000000.0,2221,A Primal Approach to Constrained Policy Optimization: Global Optimality and Finite-Time Analysis,"[""~Tengyu_Xu1"", ""~Yingbin_Liang1"", ""~Guanghui_Lan1""]","[""Tengyu Xu"", ""Yingbin Liang"", ""Guanghui Lan""]","[""safe reinforcement learning"", ""constrained markov decision process"", ""policy optimization""]","Safe reinforcement learning (SRL) problems are typically modeled as constrained Markov Decision Process (CMDP), in which an agent explores the environment to maximize the expected total reward and meanwhile avoids violating certain constraints on a number of expected total costs. In general, such SRL problems have nonconvex objective functions subject to multiple nonconvex constraints, and hence are very challenging to solve, particularly to provide a globally optimal policy. Many popular SRL algorithms adopt a primal-dual structure which utilizes the updating of dual variables for satisfying the constraints. In contrast, we propose a primal approach, called constraint-rectified policy optimization (CRPO), which updates the policy alternatingly between objective improvement and constraint satisfaction. CRPO provides a primal-type algorithmic framework to solve SRL problems, where each policy update can take any variant of policy optimization step. 
To demonstrate the theoretical performance of CRPO, we adopt natural policy gradient (NPG) for each policy update step and show that CRPO achieves an $\mathcal{O}(1/\sqrt{T})$ convergence rate to the global optimal policy in the constrained policy set and an $\mathcal{O}(1/\sqrt{T})$ error bound on constraint satisfaction. This is the first finite-time analysis of SRL algorithms with global optimality guarantee. Our empirical results demonstrate that CRPO can outperform the existing primal-dual baseline algorithms significantly.",/pdf/8d4bb39eaa31bda4664e472f72860c82c0d96c34.pdf,ICLR,2021,This paper proposes an easy-to-implement approach for safe RL problems and establishes its global optimality guarantee. +EArH-0iHhIq,mKPIKbtQ3w-,1601310000000.0,1614990000000.0,3312,ON NEURAL NETWORK GENERALIZATION VIA PROMOTING WITHIN-LAYER ACTIVATION DIVERSITY,"[""~Firas_Laakom1"", ""~Jenni_Raitoharju1"", ""~Alexandros_Iosifidis2"", ""~Moncef_Gabbouj1""]","[""Firas Laakom"", ""Jenni Raitoharju"", ""Alexandros Iosifidis"", ""Moncef Gabbouj""]","[""Deep learning""]","During the last decade, neural networks have been intensively used to tackle various problems and they have often led to state-of-the-art results. These networks are composed of multiple jointly optimized layers arranged in a hierarchical structure. At each layer, the aim is to learn to extract hidden patterns needed to solve the problem at hand and forward it to the next layers. In the standard form, a neural network is trained with gradient-based optimization, where the errors are back-propagated from the last layer back to the first one. Thus at each optimization step, neurons at a given layer receive feedback from neurons belonging to higher layers of the hierarchy. In this paper, we propose to complement this traditional 'between-layer' feedback with additional 'within-layer' feedback to encourage diversity of the activations within the same layer. 
To this end, we measure the pairwise similarity between the outputs of the neurons and use it to model the layer's overall diversity. By penalizing similarities and promoting diversity, we encourage each neuron to learn a distinctive representation and, thus, to enrich the data representation learned within the layer and to increase the total capacity of the model. We theoretically study how the within-layer activation diversity affects the generalization performance of a neural network in a supervised context and we prove that increasing the diversity of hidden activations reduces the estimation error. In addition to the theoretical guarantees, we present an empirical study confirming that the proposed approach enhances the performance of neural networks.",/pdf/ea9a1e58a8ac48bfa2b4106cee821485c8de2965.pdf,ICLR,2021,We propose an additional loss for neural network training promoting within-layer neurons' diversity and provide a theoretical analysis of its impact on the generalization error. +Byx5BTilg,,1478380000000.0,1481920000000.0,598,Exploring the Application of Deep Learning for Supervised Learning Problems,"[""jmrozanec@gmail.com"", ""giladk@berkeley.edu"", ""ricshin@berkeley.edu"", ""dawnsong@eecs.berkeley.edu""]","[""Jose Rozanec"", ""Gilad Katz"", ""Eui Chul Richard Shin"", ""Dawn Song""]","[""Deep learning"", ""Supervised Learning""]","One of the main difficulties in applying deep neural nets (DNNs) to new domains is the need to explore multiple architectures in order to discover ones that perform well. We analyze a large set of DNNs across multiple domains and derive insights regarding their effectiveness. We also analyze the characteristics of various DNNs and the general effect they may have on performance. Finally, we explore the application of meta-learning to the problem of architecture ranking. 
We demonstrate that by using topological features and modeling the changes in its weights, biases and activation functions layers of the initial training steps, we are able to rank architectures based on their predicted performance. We consider this work to be a first step in the important and challenging direction of exploring the space of different neural network architectures. +",/pdf/2282dfd38ddb3862d97adcea4fdb685118cbb855.pdf,ICLR,2017,We explore the multiple DNN architectures on a large set of general supervised datasets. We also propose a meta-learning approach for DNN performance prediction and ranking +S1eik6EtPB,BJlzB6qHvH,1569440000000.0,1577170000000.0,315,Towards A Unified Min-Max Framework for Adversarial Exploration and Robustness,"[""wangjksjtu@gmail.com"", ""tzhan120@syr.edu"", ""sijia.liu@ibm.com"", ""pin-yu.chen@ibm.com"", ""coldstudy@sjtu.edu.cn"", ""makan@syr.edu"", ""lxbosky@gmail.com""]","[""Jingkang Wang"", ""Tianyun Zhang"", ""Sijia Liu"", ""Pin-Yu Chen"", ""Jiacen Xu"", ""Makan Fardad"", ""Bo Li""]","[""Ensemble attack"", ""adversarial training"", ""diversity promotion""]","The worst-case training principle that minimizes the maximal adversarial loss, also known as adversarial training (AT), has been shown to be a state-of-the-art approach for enhancing adversarial robustness against norm-ball bounded input perturbations. Nonetheless, min-max optimization beyond the purpose of AT has not been rigorously explored in the research of adversarial attack and defense. In particular, given a set of risk sources (domains), minimizing the maximal loss induced from the domain set can be reformulated as a general min-max problem that is different from AT. Examples of this general formulation include attacking model ensembles, devising universal perturbation under multiple inputs or data transformations, and generalized AT over different types of attack models. 
We show that these problems can be solved under a unified and theoretically principled min-max optimization framework. We also show that the self-adjusted domain weights learned from our method provide a means to explain the difficulty level of attack and defense over multiple domains. Extensive experiments show that our approach leads to substantial performance improvement over the conventional averaging strategy.",/pdf/b17d35e4a2a9a4fa2b017871f52cdd45675778c3.pdf,ICLR,2020,A unified min-max optimization framework for adversarial attack and defense +MP0LhG4YiiC,NN-lsoUvnAv,1601310000000.0,1614990000000.0,1247,Analogical Reasoning for Visually Grounded Compositional Generalization,"[""~Bo_Wu6"", ""~Haoyu_Qin1"", ""~Alireza_Zareian2"", ""~Carl_Vondrick2"", ""~Shih-Fu_Chang3""]","[""Bo Wu"", ""Haoyu Qin"", ""Alireza Zareian"", ""Carl Vondrick"", ""Shih-Fu Chang""]",[],"Children acquire language subconsciously by observing the surrounding world and listening to descriptions. They can discover the meaning of words even without explicit language knowledge, and generalize to novel compositions effortlessly. In this paper, we bring this ability to AI, by studying the task of multimodal compositional generalization within the context of visually grounded language acquisition. We propose a multimodal transformer model augmented with a novel mechanism for analogical reasoning, which approximates novel compositions by learning semantic mapping and reasoning operations from previously seen compositions. Our proposed method, Analogical Reasoning Transformer Networks (ARTNet), is trained on raw multimedia data (video frames and transcripts), and after observing a set of compositions such as ""washing apple"" or ""cutting carrot"", it can generalize and recognize new compositions in new video frames, such as ""washing carrot"" or ""cutting apple"". 
To this end, ARTNet refers to relevant instances in the training data and uses their visual features and captions to establish analogies with the query image. Then it chooses a suitable verb and noun to create a new composition that describes the new image best. Extensive experiments on an instructional video dataset demonstrate that the proposed method achieves significantly better generalization capability and recognition accuracy compared to state-of-the-art transformer models.",/pdf/977466c215834d865ac53b4286576a0f4880dd1e.pdf,ICLR,2021, +BJij4yg0Z,SkcsNyeRZ,1509060000000.0,1518730000000.0,201,A Bayesian Perspective on Generalization and Stochastic Gradient Descent,"[""slsmith@google.com"", ""qvl@google.com""]","[""Samuel L. Smith and Quoc V. Le""]","[""generalization"", ""stochastic gradient descent"", ""stochastic differential equations"", ""scaling rules"", ""large batch training"", ""bayes theorem"", ""batch size""]","We consider two questions at the heart of machine learning; how can we predict if a minimum will generalize to the test set, and why does stochastic gradient descent find minima that generalize well? Our work responds to \citet{zhang2016understanding}, who showed deep neural networks can easily memorize randomly labeled training data, despite generalizing well on real labels of the same inputs. We show that the same phenomenon occurs in small linear models. These observations are explained by the Bayesian evidence, which penalizes sharp minima but is invariant to model parameterization. We also demonstrate that, when one holds the learning rate fixed, there is an optimum batch size which maximizes the test set accuracy. We propose that the noise introduced by small mini-batches drives the parameters towards minima whose evidence is large. 
Interpreting stochastic gradient descent as a stochastic differential equation, we identify the ``noise scale"" $g = \epsilon (\frac{N}{B} - 1) \approx \epsilon N/B$, where $\epsilon$ is the learning rate, $N$ the training set size and $B$ the batch size. Consequently the optimum batch size is proportional to both the learning rate and the size of the training set, $B_{opt} \propto \epsilon N$. We verify these predictions empirically.",/pdf/79862cef4ea997e07936532dcb53794f6bca59ff.pdf,ICLR,2018,"Generalization is strongly correlated with the Bayesian evidence, and gradient noise drives SGD towards minima whose evidence is large." +ryEGFD9gl,,1478290000000.0,1484360000000.0,410,Submodular Sum-product Networks for Scene Understanding,"[""afriesen@cs.washington.edu"", ""pedrod@cs.washington.edu""]","[""Abram L. Friesen"", ""Pedro Domingos""]","[""Computer vision"", ""Structured prediction""]","Sum-product networks (SPNs) are an expressive class of deep probabilistic models in which inference takes time linear in their size, enabling them to be learned effectively. However, for certain challenging problems, such as scene understanding, the corresponding SPN has exponential size and is thus intractable. In this work, we introduce submodular sum-product networks (SSPNs), an extension of SPNs in which sum-node weights are defined by a submodular energy function. SSPNs combine the expressivity and depth of SPNs with the ability to efficiently compute the MAP state of a combinatorial number of labelings afforded by submodular energies. SSPNs for scene understanding can be understood as representing all possible parses of an image over arbitrary region shapes with respect to an image grammar. Despite this complexity, we develop an efficient and convergent algorithm based on graph cuts for computing the (approximate) MAP state of an SSPN, greatly increasing the expressivity of the SPN model class. 
Empirically, we show exponential improvements in parsing time compared to traditional inference algorithms such as alpha-expansion and belief propagation, while returning comparable minima. +",/pdf/19b8870324c9064a9760d89f616c8af8b7c9651e.pdf,ICLR,2017,"A novel extension of sum-product networks that incorporates submodular Markov random fields into the sum nodes, resulting in a highly expressive class of models in which efficient inference is still possible." +SyVU6s05K7,HylI4dNFF7,1538090000000.0,1550850000000.0,807,Deep Frank-Wolfe For Neural Network Optimization,"[""lberrada@robots.ox.ac.uk"", ""az@robots.ox.ac.uk"", ""pawan@robots.ox.ac.uk""]","[""Leonard Berrada"", ""Andrew Zisserman"", ""M. Pawan Kumar""]","[""optimization"", ""conditional gradient"", ""Frank-Wolfe"", ""SVM""]","Learning a deep neural network requires solving a challenging optimization problem: it is a high-dimensional, non-convex and non-smooth minimization problem with a large number of terms. The current practice in neural network optimization is to rely on the stochastic gradient descent (SGD) algorithm or its adaptive variants. However, SGD requires a hand-designed schedule for the learning rate. In addition, its adaptive variants tend to produce solutions that generalize less well on unseen data than SGD with a hand-designed schedule. We present an optimization method that offers empirically the best of both worlds: our algorithm yields good generalization performance while requiring only one hyper-parameter. Our approach is based on a composite proximal framework, which exploits the compositional nature of deep neural networks and can leverage powerful convex optimization algorithms by design. Specifically, we employ the Frank-Wolfe (FW) algorithm for SVM, which computes an optimal step-size in closed-form at each time-step. We further show that the descent direction is given by a simple backward pass in the network, yielding the same computational cost per iteration as SGD. 
We present experiments on the CIFAR and SNLI data sets, where we demonstrate the significant superiority of our method over Adam, Adagrad, as well as the recently proposed BPGrad and AMSGrad. Furthermore, we compare our algorithm to SGD with a hand-designed learning rate schedule, and show that it provides similar generalization while often converging faster. The code is publicly available at https://github.com/oval-group/dfw.",/pdf/ce7e434aabd4bf9e453cdfbd301971686c81da17.pdf,ICLR,2019,We train neural networks by locally linearizing them and using a linear SVM solver (Frank-Wolfe) at each iteration. +Skgeip4FPr,HkeQm_CvDr,1569440000000.0,1577170000000.0,730,Neural networks are a priori biased towards Boolean functions with low entropy,"[""christopher.mingard@hertford.ox.ac.uk"", ""joar.skalse@hertford.ox.ac.uk"", ""guillermo.valle@dtc.ox.ac.uk"", ""david.martinez@cs.ox.ac.uk"", ""vladimir.mikulik@hertford.ox.ac.uk"", ""ard.louis@physics.ox.ac.uk""]","[""Chris Mingard"", ""Joar Skalse"", ""Guillermo Valle-P\u00e9rez"", ""David Mart\u00ednez-Rubio"", ""Vladimir Mikulik"", ""Ard A. Louis""]","[""class imbalance"", ""perceptron"", ""inductive bias"", ""simplicity bias"", ""initialization""]","Understanding the inductive bias of neural networks is critical to explaining their ability to generalise. Here, +for one of the simplest neural networks -- a single-layer perceptron with $n$ input neurons, one output neuron, and no threshold bias term -- we prove that upon random initialisation of weights, the a priori probability $P(t)$ that it represents a Boolean function that classifies $t$ points in $\{0,1\}^n$ as $1$ has a remarkably simple form: $ +P(t) = 2^{-n} \,\, {\rm for} \,\, 0\leq t < 2^n$. +Since a perceptron can express far fewer Boolean functions with small or large values of $t$ (low ""entropy"") than with intermediate values of $t$ (high ""entropy"") there is, on average, a strong intrinsic a-priori bias towards individual functions with low entropy. 
Furthermore, within a class of functions with fixed $t$, we often observe a further intrinsic bias towards functions of lower complexity. +Finally, we prove that, regardless of the distribution of inputs, the bias towards low entropy becomes monotonically stronger upon adding ReLU layers, and empirically show that increasing the variance of the bias term has a similar effect.",/pdf/dee69b9601aeeffbc32192a42da2a7a54b51af41.pdf,ICLR,2020,"We show that neural networks are biased towards functions with high class imbalance (low entropy) at initialization; we prove the exact form of the bias for the perceptron, and some properties for multi-layer networks" +14nC8HNd4Ts,WJuwoyJBjDc,1601310000000.0,1614990000000.0,252,Synthesising Realistic Calcium Traces of Neuronal Populations Using GAN,"[""~Bryan_M._Li1"", ""t.amvrosiadis@ed.ac.uk"", ""n.rochefort@ed.ac.uk"", ""~Arno_Onken1""]","[""Bryan M. Li"", ""Theoklitos Amvrosiadis"", ""Nathalie Rochefort"", ""Arno Onken""]","[""calcium imaging"", ""calcium traces"", ""generative adversarial networks"", ""spike train analysis""]","Calcium imaging has become a powerful and popular technique to monitor the activity of large populations of neurons in vivo. However, for ethical considerations and despite recent technical developments, recordings are still constrained to a limited number of trials and animals. This limits the amount of data available from individual experiments and hinders the development of analysis techniques and models for more realistic sizes of neuronal populations. The ability to artificially synthesize realistic neuronal calcium signals could greatly alleviate this problem by scaling up the number of trials. Here, we propose a Generative Adversarial Network (GAN) model to generate realistic calcium signals as seen in neuronal somata with calcium imaging. To this end, we propose CalciumGAN, a model based on the WaveGAN architecture and train it on calcium fluorescent signals with the Wasserstein distance. 
We test the model on artificial data with known ground-truth and show that the distribution of the generated signals closely resembles the underlying data distribution. Then, we train the model on real calcium traces recorded from the primary visual cortex of behaving mice and confirm that the deconvolved spike trains match the statistics of the recorded data. Together, these results demonstrate that our model can successfully generate realistic calcium traces, thereby providing the means to augment existing datasets of neuronal activity for enhanced data exploration and modelling.",/pdf/4ce35dba37e4f0554a9a4bda596c3418f0313281.pdf,ICLR,2021,Synthesising in vivo calcium traces from live animal using GAN +HkxZigSYwS,ryxPwJ-tDH,1569440000000.0,1577170000000.0,2499,Universal Safeguarded Learned Convex Optimization with Guaranteed Convergence,"[""heaton@math.ucla.edu"", ""chernxh@tamu.edu"", ""atlaswang@tamu.edu"", ""wotao.yin@alibaba-inc.com""]","[""Howard Heaton"", ""Xiaohan Chen"", ""Zhangyang Wang"", ""Wotao Yin""]","[""L2O"", ""learn to optimize"", ""fixed point"", ""machine learning"", ""neural network"", ""ADMM"", ""LADMM"", ""ALISTA"", ""D-LADMM""]","Many applications require quickly and repeatedly solving a certain type of optimization problem, each time with new (but similar) data. However, state of the art general-purpose optimization methods may converge too slowly for real-time use. This shortcoming is addressed by “learning to optimize” (L2O) schemes, which construct neural networks from parameterized forms of the update operations of general-purpose methods. Inferences by each network form solution estimates, and networks are trained to optimize these estimates for a particular distribution of data. This results in task-specific algorithms (e.g., LISTA, ALISTA, and D-LADMM) that can converge order(s) of magnitude faster than general-purpose counterparts. 
We provide the first general L2O convergence theory by wrapping all L2O schemes for convex optimization within a single framework. Existing L2O schemes form special cases, and we give a practical guide for applying our L2O framework to other problems. Using safeguarding, our theory proves, as the number of network layers increases, the distance between inferences and the solution set goes to zero, i.e., each cluster point is a solution. Our numerical examples demonstrate the efficacy of our approach for both existing and new L2O methods. ",/pdf/791725fb3196b12cb4324af2d831b7a1d7092903.pdf,ICLR,2020,"We provide the first general framework, with convergence guarantees, for applying learning to optimize schemes to any convex optimization problem." +r1x63grFvH,BygXpx-FvB,1569440000000.0,1577170000000.0,2555,Limitations for Learning from Point Clouds,"[""christianbueno@ucsb.edu"", ""alan.g.hylton@nasa.gov""]","[""Christian Bueno"", ""Alan G. Hylton""]","[""universal approximation"", ""point clouds"", ""deep learning"", ""hausdorff metric"", ""wasserstein metric""]","In this paper we prove new universal approximation theorems for deep learning on point clouds that do not assume fixed cardinality. We do this by first generalizing the classical universal approximation theorem to general compact Hausdorff spaces and then applying this to the permutation-invariant architectures presented in 'PointNet' (Qi et al) and 'Deep Sets' (Zaheer et al). Moreover, though both architectures operate on the same domain, we show that the constant functions are the only functions they can mutually uniformly approximate. In particular, DeepSets architectures cannot uniformly approximate the diameter function but can uniformly approximate the center of mass function but it is the other way around for PointNet. ",/pdf/9aa12b515c21360f533c1bf85e8028feb8ab03d4.pdf,ICLR,2020,We prove new universal approximation theorems for PointNets and DeepSets and demonstrate new limitations. 
+HygaikBKvS,HJl7F1yFwS,1569440000000.0,1577170000000.0,1932,Off-Policy Actor-Critic with Shared Experience Replay,"[""suschmitt@google.com"", ""mtthss@google.com"", ""simonyan@google.com""]","[""Simon Schmitt"", ""Matteo Hessel"", ""Karen Simonyan""]","[""Reinforcement Learning"", ""Off-Policy Learning"", ""Experience Replay""]","We investigate the combination of actor-critic reinforcement learning algorithms with uniform large-scale experience replay and propose solutions for two challenges: (a) efficient actor-critic learning with experience replay (b) stability of very off-policy learning. We employ those insights to accelerate hyper-parameter sweeps in which all participating agents run concurrently and share their experience via a common replay module. + +To this end we analyze the bias-variance tradeoffs in V-trace, a form of importance sampling for actor-critic methods. Based on our analysis, we then argue for mixing experience sampled from replay with on-policy experience, and propose a new trust region scheme that scales effectively to data distributions where V-trace becomes unstable. + +We provide extensive empirical validation of the proposed solution. We further show the benefits of this setup by demonstrating state-of-the-art data efficiency on Atari among agents trained up until 200M environment frames.",/pdf/5136fa07d34aa4201c27a59ac871c3adf632d1f8.pdf,ICLR,2020,We investigate and propose solutions for two challenges in reinforcement learning: (a) efficient actor-critic learning with experience replay (b) stability of very off-policy learning. +S1g_EsActm,SkgvAc6OKX,1538090000000.0,1545360000000.0,12,ATTENTION INCORPORATE NETWORK: A NETWORK CAN ADAPT VARIOUS DATA SIZE,"[""heliangbo@tsinghua.edu.cn"", ""sh759811581@tsinghua.edu.cn""]","[""Liangbo He"", ""Hao Sun""]","[""attention mechanism"", ""various image size""]","In traditional neural networks for image processing, the inputs of the neural networks should be the same size such as 224×224×3. 
But how can we train a neural network on inputs of different sizes? A common approach is image deformation (e.g., cropping or warping), which entails a loss of information. In this paper we propose a new network structure called Attention Incorporate Network (AIN). It handles input images of different sizes and extracts the key features of the inputs with an attention mechanism, allocating attention according to the importance of the features rather than the data size. Experimentally, AIN achieves higher accuracy and better convergence than other network structures of the same size.",/pdf/3760fb7e13f529371f30a46e9c3fa552297df62b.pdf,ICLR,2019, +r1glehC5tQ,H1xf1loYYX,1538090000000.0,1545360000000.0,1041,Distinguishability of Adversarial Examples,"[""yiqin@mines.edu"", ""ryhunt@mines.edu"", ""chuanyue@mines.edu""]","[""Yi Qin"", ""Ryan Hunt"", ""Chuan Yue""]","[""Adversarial Examples"", ""Machine Learning"", ""Neural Networks"", ""Distinguishability"", ""Defense""]","Machine learning models including traditional models and neural networks can be easily fooled by adversarial examples which are generated from the natural examples with small perturbations. This poses a critical challenge to machine learning security, and impedes the wide application of machine learning in many important domains such as computer vision and malware detection. Unfortunately, even state-of-the-art defense approaches such as adversarial training and defensive distillation still suffer from major limitations and can be circumvented. From a unique angle, we propose to investigate two important research questions in this paper: Are adversarial examples distinguishable from natural examples? Are adversarial examples generated by different methods distinguishable from each other? These two questions concern the distinguishability of adversarial examples. 
Answering them will potentially lead to a simple yet effective approach, termed as defensive distinction in this paper under the formulation of multi-label classification, for protecting against adversarial examples. We design and perform experiments using the MNIST dataset to investigate these two questions, and obtain highly positive results demonstrating the strong distinguishability of adversarial examples. We recommend that this unique defensive distinction approach should be seriously considered to complement other defense approaches.",/pdf/dc04eae63abcb4edcbe6a17b372b4fb03a501a31.pdf,ICLR,2019,We propose a defensive distinction protection approach and demonstrate the strong distinguishability of adversarial examples. +yrDEUYauOMd,IRt-gs87mM0,1601310000000.0,1614990000000.0,1990,Attainability and Optimality: The Equalized-Odds Fairness Revisited,"[""~Zeyu_Tang1"", ""~Kun_Zhang1""]","[""Zeyu Tang"", ""Kun Zhang""]","[""algorithmic fairness""]","Fairness of machine learning algorithms has been of increasing interest. In order to suppress or eliminate discrimination in prediction, various notions as well as approaches to impose fairness have been proposed. However, in different scenarios, whether or not the chosen notion of fairness can always be attained, even if with unlimited amount of data, is not well addressed. In this paper, focusing on the Equalized Odds notion of fairness, we consider the attainability of this criterion, and furthermore, if attainable, the optimality of the prediction performance under various settings. In particular, for classification with a deterministic prediction function of the input, we give the condition under which Equalized Odds can hold true; if randomized prediction is acceptable, we show that under mild assumptions, fair classifiers can always be derived. 
Moreover, we prove that compared to enforcing fairness by post-processing, one can always benefit from exploiting all available features during training and get better prediction performance while remaining fair. However, for regression tasks, Equalized Odds is not always attainable if certain conditions on the joint distribution of the features and the target variable are not met. This indicates the inherent difficulty in achieving fairness in certain cases and suggests a broader class of prediction methods might be needed for fairness.",/pdf/c3cbe2df0779d9e7dd28de19895661d08b925cba.pdf,ICLR,2021, +Pzj6fzU6wkj,ozklVYYNYVS,1601310000000.0,1615910000000.0,2324,IsarStep: a Benchmark for High-level Mathematical Reasoning,"[""~Wenda_Li1"", ""~Lei_Yu4"", ""~Yuhuai_Wu1"", ""lp15@cam.ac.uk""]","[""Wenda Li"", ""Lei Yu"", ""Yuhuai Wu"", ""Lawrence C. Paulson""]","[""mathematical reasoning"", ""dataset"", ""benchmark"", ""reasoning"", ""transformer""]","A well-defined benchmark is essential for measuring and accelerating research progress of machine learning models. In this paper, we present a benchmark for high-level mathematical reasoning and study the reasoning capabilities of neural sequence-to-sequence models. We build a non-synthetic dataset from the largest repository of proofs written by human experts in a theorem prover. The dataset has a broad coverage of undergraduate and research-level mathematical and computer science theorems. In our defined task, a model is required to fill in a missing intermediate proposition given surrounding proofs. This task provides a starting point for the long-term goal of having machines generate human-readable proofs automatically. Our experiments and analysis reveal that while the task is challenging, neural models can capture non-trivial mathematical reasoning. We further design a hierarchical transformer that outperforms the transformer baseline. 
",/pdf/c9fb7dd359102a00d8676684bd704c54961a5285.pdf,ICLR,2021,We present a benchmark for high-level mathematical reasoning and study the reasoning capabilities of neural sequence-to-sequence models. +sjuuTm4vj0,aG4sguy8zg9,1601310000000.0,1616080000000.0,1918,Using latent space regression to analyze and leverage compositionality in GANs,"[""~Lucy_Chai1"", ""~Jonas_Wulff1"", ""~Phillip_Isola1""]","[""Lucy Chai"", ""Jonas Wulff"", ""Phillip Isola""]","[""Image Synthesis"", ""Composition"", ""Generative Adversarial Networks"", ""Image Editing"", ""Interpretability""]","In recent years, Generative Adversarial Networks have become ubiquitous in both research and public perception, but how GANs convert an unstructured latent code to a high quality output is still an open question. In this work, we investigate regression into the latent space as a probe to understand the compositional properties of GANs. We find that combining the regressor and a pretrained generator provides a strong image prior, allowing us to create composite images from a collage of random image parts at inference time while maintaining global consistency. To compare compositional properties across different generators, we measure the trade-offs between reconstruction of the unrealistic input and image quality of the regenerated samples. We find that the regression approach enables more localized editing of individual image parts compared to direct editing in the latent space, and we conduct experiments to quantify this independence effect. Our method is agnostic to the semantics of edits, and does not require labels or predefined concepts during training. Beyond image composition, our method extends to a number of related applications, such as image inpainting or example-based image editing, which we demonstrate on several GANs and datasets, and +because it uses only a single forward pass, it can operate in real-time. 
Code is available on our project page: https://chail.github.io/latent-composition/.",/pdf/9d4e6357960a5481c3a1771887727b0277c2de37.pdf,ICLR,2021,We use a latent regressor network to investigate compositional properties of image synthesis with GANs. +Syg6fxrKDB,r1lzKXeKDB,1569440000000.0,1577170000000.0,2191,A Graph Neural Network Assisted Monte Carlo Tree Search Approach to Traveling Salesman Problem,"[""xingzhihao@sjtu.edu.cn"", ""tushikui@sjtu.edu.cn""]","[""Zhihao Xing"", ""Shikui Tu""]","[""Traveling Salesman Problem"", ""Graph Neural Network"", ""Monte Carlo Tree Search""]","We present a graph neural network assisted Monte Carlo Tree Search approach for the classical traveling salesman problem (TSP). We adopt a greedy algorithm framework to construct the optimal solution to TSP by adding the nodes successively. A graph neural network (GNN) is trained to capture the local and global graph structure and give the prior probability of selecting each vertex every step. The prior probability provides a heuristics for MCTS, and the MCTS output is an improved probability for selecting the successive vertex, as it is the feedback information by fusing the prior with the scouting procedure. Experimental results on TSP up to 100 nodes demonstrate that the proposed method obtains shorter tours than other learning-based methods.",/pdf/18affa3076e2280fe529827a055d61dfca37020b.pdf,ICLR,2020,A Graph Neural Network Assisted Monte Carlo Tree Search Approach to Traveling Salesman Problem +Hyx6Bi0qYm,rJexke5FYX,1538090000000.0,1547570000000.0,129,Adversarial Domain Adaptation for Stable Brain-Machine Interfaces,"[""a-farshchiansadegh@northwestern.edu"", ""juan.gallego@northwestern.edu"", ""joseph@josephpcohen.com"", ""yoshua.bengio@umontreal.ca"", ""lm@northwestern.edu"", ""solla@northwestern.edu""]","[""Ali Farshchian"", ""Juan A. Gallego"", ""Joseph P. Cohen"", ""Yoshua Bengio"", ""Lee E. Miller"", ""Sara A. 
Solla""]","[""Brain-Machine Interfaces"", ""Domain Adaptation"", ""Adversarial Networks""]","Brain-Machine Interfaces (BMIs) have recently emerged as a clinically viable option +to restore voluntary movements after paralysis. These devices are based on the +ability to extract information about movement intent from neural signals recorded +using multi-electrode arrays chronically implanted in the motor cortices of the +brain. However, the inherent loss and turnover of recorded neurons requires repeated +recalibrations of the interface, which can potentially alter the day-to-day +user experience. The resulting need for continued user adaptation interferes with +the natural, subconscious use of the BMI. Here, we introduce a new computational +approach that decodes movement intent from a low-dimensional latent representation +of the neural data. We implement various domain adaptation methods +to stabilize the interface over significantly long times. This includes Canonical +Correlation Analysis used to align the latent variables across days; this method +requires prior point-to-point correspondence of the time series across domains. +Alternatively, we match the empirical probability distributions of the latent variables +across days through the minimization of their Kullback-Leibler divergence. +These two methods provide a significant and comparable improvement in the performance +of the interface. However, implementation of an Adversarial Domain +Adaptation Network trained to match the empirical probability distribution of the +residuals of the reconstructed neural signals outperforms the two methods based +on latent variables, while requiring remarkably few data points to solve the domain +adaptation problem.",/pdf/42b1af648df0cfebe9b454256f895b390736f949.pdf,ICLR,2019,We implement an adversarial domain adaptation network to stabilize a fixed Brain-Machine Interface against gradual changes in the recorded neural signals. 
+Hke_f0EYPH,Hkeb5CEdvr,1569440000000.0,1577170000000.0,1006,Efficient Training of Robust and Verifiable Neural Networks,"[""akhilan@mit.edu"", ""twweng@mit.edu"", ""sijia.liu@ibm.com"", ""pin-yu.chen@ibm.com"", ""dluca@mit.edu""]","[""Akhilan Boopathy"", ""Lily Weng"", ""Sijia Liu"", ""Pin-Yu Chen"", ""Luca Daniel""]",[],"Recent works have developed several methods of defending neural networks against adversarial attacks with certified guarantees. We propose that many common certified defenses can be viewed under a unified framework of regularization. This unified framework provides a technique for comparing different certified defenses with respect to robust generalization. In addition, we develop a new regularizer that is both more efficient than existing certified defenses and can be used to train networks with higher certified accuracy. Our regularizer also extends to an L0 threat model and ensemble models. Through experiments on MNIST, CIFAR-10 and GTSRB, we demonstrate improvements in training speed and certified accuracy compared to state-of-the-art certified defenses.",/pdf/8be2d30b90ff65b9cdd41dbedd032dcceabd7e12.pdf,ICLR,2020, +rkfbLilAb,B1zWIoxA-,1509110000000.0,1518730000000.0,370,Improving Search Through A3C Reinforcement Learning Based Conversational Agent,"[""milan.ag1994@gmail.com"", ""aarushi.arora043@gmail.com"", ""sshagunsodhani@gmail.com"", ""kbalaji@adobe.com""]","[""Milan Aggarwal"", ""Aarushi Arora"", ""Shagun Sodhani"", ""Balaji Krishnamurthy""]","[""Subjective search"", ""Reinforcement Learning"", ""Conversational Agent"", ""Virtual user model"", ""A3C"", ""Context aggregation""]",We develop a reinforcement learning based search assistant which can assist users through a set of actions and sequence of interactions to enable them realize their intent. 
Our approach caters to subjective search where the user is seeking digital assets such as images which is fundamentally different from the tasks which have objective and limited search modalities. Labeled conversational data is generally not available in such search tasks and training the agent through human interactions can be time consuming. We propose a stochastic virtual user which impersonates a real user and can be used to sample user behavior efficiently to train the agent which accelerates the bootstrapping of the agent. We develop A3C algorithm based context preserving architecture which enables the agent to provide contextual assistance to the user. We compare the A3C agent with Q-learning and evaluate its performance on average rewards and state values it obtains with the virtual user in validation episodes. Our experiments show that the agent learns to achieve higher rewards and better states.,/pdf/dd05139862d45cbb8c760614169ed92cd7630697.pdf,ICLR,2018,A Reinforcement Learning based conversational search assistant which provides contextual assistance in subjective search (like digital assets). +r1evOhEKvH,H1eO8TkUIr,1569440000000.0,1583910000000.0,45,Graph inference learning for semi-supervised classification,"[""cyx@njust.edu.cn"", ""zhen.cui@njust.edu.cn"", ""xbhong@njust.edu.cn"", ""tong.zhang@njust.edu.cn"", ""csjyang@njust.edu.cn"", ""wl2223@columbia.edu""]","[""Chunyan Xu"", ""Zhen Cui"", ""Xiaobin Hong"", ""Tong Zhang"", ""Jian Yang"", ""Wei Liu""]","[""semi-supervised classification"", ""graph inference learning""]","In this work, we address the semi-supervised classification of graph data, where the categories of those unlabeled nodes are inferred from labeled nodes as well as graph structures. Recent works often solve this problem with the advanced graph convolution in a conventional supervised manner, but the performance could be heavily affected when labeled data is scarce. 
Here we propose a Graph Inference Learning (GIL) framework to boost the performance of node classification by learning the inference of node labels on graph topology. To bridge the connection of two nodes, we formally define a structure relation by encapsulating node attributes, between-node paths and local topological structures together, which can make inference conveniently deduced from one node to another node. For learning the inference process, we further introduce meta-optimization on structure relations from training nodes to validation nodes, such that the learnt graph inference capability can be better self-adapted into test nodes. Comprehensive evaluations on four benchmark datasets (including Cora, Citeseer, Pubmed and NELL) demonstrate the superiority of our GIL when compared with other state-of-the-art methods in the semi-supervised node classification task.",/pdf/03d35e4c92edd9cb8ba89a592f7fe50ec7b372eb.pdf,ICLR,2020, We propose a novel graph inference learning framework by building structure relations to infer unknown node labels from those labeled nodes in an end-to-end way. +SyxMWh09KX,r1gYibA5YQ,1538090000000.0,1545360000000.0,1148,Attentive Task-Agnostic Meta-Learning for Few-Shot Text Classification,"[""xiang.jiang@dal.ca"", ""mohammad@imagia.com"", ""gabriel@imagia.com"", ""hassan.chouaib@imagia.com"", ""thomas.vincent@imagia.com"", ""andrew.jesson@imagia.com"", ""nic@imagia.com"", ""stan@cs.dal.ca""]","[""Xiang Jiang"", ""Mohammad Havaei"", ""Gabriel Chartrand"", ""Hassan Chouaib"", ""Thomas Vincent"", ""Andrew Jesson"", ""Nicolas Chapados"", ""Stan Matwin""]","[""meta-learning"", ""learning to learn"", ""few-shot learning""]","Current deep learning based text classification methods are limited by their ability to achieve fast learning and generalization when the data is scarce. 
We address this problem by integrating a meta-learning procedure that uses the knowledge learned across many tasks as an inductive bias towards better natural language understanding. Inspired by the Model-Agnostic Meta-Learning framework (MAML), we introduce the Attentive Task-Agnostic Meta-Learning (ATAML) algorithm for text classification. The proposed ATAML is designed to encourage task-agnostic representation learning by way of task-agnostic parameterization and facilitate task-specific adaptation via attention mechanisms. We provide evidence to show that the attention mechanism in ATAML has a synergistic effect on learning performance. Our experimental results reveal that, for few-shot text classification tasks, gradient-based meta-learning approaches outperform popular transfer learning methods. In comparisons with models trained from random initialization, pretrained models and meta trained MAML, our proposed ATAML method generalizes better on single-label and multi-label classification tasks in miniRCV1 and miniReuters-21578 datasets.",/pdf/b2e1422643e66521ec17675ea6c013fc7d58c0c0.pdf,ICLR,2019,Meta-learning task-agnostic representations with attention. +r1l-5pEtDr,HklVDF6Dwr,1569440000000.0,1577170000000.0,697,AdaX: Adaptive Gradient Descent with Exponential Long Term Memory,"[""li3549@purdue.edu"", ""zhaoyangzhang@link.cuhk.edu.hk"", ""swanxinjiang@gmail.com"", ""pluo.lhi@gmail.com""]","[""Wenjie Li"", ""Zhaoyang Zhang"", ""Xinjiang Wang"", ""Ping Luo""]","[""Optimization Algorithm"", ""Machine Learning"", ""Deep Learning"", ""Adam""]","Adaptive optimization algorithms such as RMSProp and Adam have fast convergence and smooth learning process. Despite their successes, they are proven to have non-convergence issue even in convex optimization problems as well as weak performance compared with the first order gradient methods such as stochastic gradient descent (SGD). 
Several other algorithms, for example AMSGrad and AdaShift, have been proposed to alleviate these issues but only minor effect has been observed. This paper further analyzes the performance of such algorithms in a non-convex setting by extending their non-convergence issue into a simple non-convex case and show that Adam's design of update steps would possibly lead the algorithm to local minimums. To address the above problems, we propose a novel adaptive gradient descent algorithm, named AdaX, which accumulates the long-term past gradient information exponentially. We prove the convergence of AdaX in both convex and non-convex settings. Extensive experiments show that AdaX outperforms Adam in various tasks of computer vision and natural language processing and can catch up with SGD. +",/pdf/6a6dc5c31d5a57b8fe48696a4d5e0256c0107447.pdf,ICLR,2020,A novel adaptive algorithm with extraordinary performance in deep learning tasks. +QoWatN-b8T,S5ydj4ubvuv,1601310000000.0,1616630000000.0,1802,"Kanerva++: Extending the Kanerva Machine With Differentiable, Locally Block Allocated Latent Memory","[""~Jason_Ramapuram1"", ""~Yan_Wu1"", ""~Alexandros_Kalousis1""]","[""Jason Ramapuram"", ""Yan Wu"", ""Alexandros Kalousis""]","[""memory"", ""generative model"", ""latent variable"", ""heap allocation""]","Episodic and semantic memory are critical components of the human memory model. The theory of complementary learning systems (McClelland et al., 1995) suggests that the compressed representation produced by a serial event (episodic memory) is later restructured to build a more generalized form of reusable knowledge (semantic memory). In this work, we develop a new principled Bayesian memory allocation scheme that bridges the gap between episodic and semantic memory via a hierarchical latent variable model. 
We take inspiration from traditional heap allocation and extend the idea of locally contiguous memory to the Kanerva Machine, enabling a novel differentiable block allocated latent memory. In contrast to the Kanerva Machine, we simplify the process of memory writing by treating it as a fully feed forward deterministic process, relying on the stochasticity of the read key distribution to disperse information within the memory. We demonstrate that this allocation scheme improves performance in memory conditional image generation, resulting in new state-of-the-art conditional likelihood values on binarized MNIST (≤41.58 nats/image), binarized Omniglot (≤66.24 nats/image), as well as presenting competitive performance on CIFAR10, DMLab Mazes, Celeb-A and ImageNet32×32.",/pdf/70c8013c1c3775393cfb1b86a14218089c000691.pdf,ICLR,2021,Differentiable block allocated latent memory model for generative modeling. +SJeq9JBFvH,S1gMMoA_wH,1569440000000.0,1583910000000.0,1887,Deep probabilistic subsampling for task-adaptive compressed sensing,"[""i.a.m.huijben@tue.nl"", ""basveeling@gmail.com"", ""r.j.g.v.sloun@tue.nl""]","[""Iris A.M. Huijben"", ""Bastiaan S. Veeling"", ""Ruud J.G. van Sloun""]",[],"The field of deep learning is commonly concerned with optimizing predictive models using large pre-acquired datasets of densely sampled datapoints or signals. In this work, we demonstrate that the deep learning paradigm can be extended to incorporate a subsampling scheme that is jointly optimized under a desired minimum sample rate. We present Deep Probabilistic Subsampling (DPS), a widely applicable framework for task-adaptive compressed sensing that enables end-to-end optimization of an optimal subset of signal samples with a subsequent model that performs a required task. We demonstrate strong performance on reconstruction and classification tasks of a toy dataset, MNIST, and CIFAR10 under stringent subsampling rates in both the pixel and the spatial frequency domain. 
Due to the task-agnostic nature of the framework, DPS is directly applicable to all real-world domains that benefit from sample rate reduction.",/pdf/46be97c7ee83cc19aa7e13c83340c04aa4fd5c5a.pdf,ICLR,2020, +H1l3s6NtvH,SJlqzvJdvr,1569440000000.0,1577170000000.0,758,A Bayes-Optimal View on Adversarial Examples,"[""eitan.richardson@gmail.com"", ""yweiss@cs.huji.ac.il""]","[""Eitan Richardson"", ""Yair Weiss""]","[""Adversarial Examples"", ""Generative Models""]","Adversarial attacks on CNN classifiers can make an imperceptible change to an input image and alter the classification result. The source of these failures is still poorly understood, and many explanations invoke the ""unreasonably linear extrapolation"" used by CNNs along with the geometry of high dimensions. +In this paper we show that similar attacks can be used against the Bayes-Optimal classifier for certain class distributions, while for others the optimal classifier is robust to such attacks. We present analytical results showing conditions on the data distribution under which all points can be made arbitrarily close to the optimal decision boundary and show that this can happen even when the classes are easy to separate, when the ideal classifier has a smooth decision surface and when the data lies in low dimensions. We introduce new datasets of realistic images of faces and digits where the Bayes-Optimal classifier can be calculated efficiently and show that for some of these datasets the optimal classifier is robust and for others it is vulnerable to adversarial examples. In systematic experiments with many such datasets, we find that standard CNN training consistently finds a vulnerable classifier even when the optimal classifier is robust while large-margin methods often find a robust classifier with the exact same training data. 
Our results suggest that adversarial vulnerability is not an unavoidable consequence of machine learning in high dimensions, and may often be a result of suboptimal training methods used in current practice.",/pdf/a7a7144c73007a8b4c47ba4ab45907fb3f03507b.pdf,ICLR,2020,"We show analytically and empirically that the Bayes-optimal classifiers are, in some settings, vulnerable to adversarial examples. We then show that even when the optimal classifier is robust, trained CNNs are vulnerable." +H1gx1CNKPH,B1gr0QGuvS,1569440000000.0,1577170000000.0,880,Augmenting Transformers with KNN-Based Composite Memory,"[""angelafan@fb.com"", ""claire.gardent@loria.fr"", ""chloe.braud@loria.fr"", ""abordes@fb.com""]","[""Angela Fan"", ""Claire Gardent"", ""Chloe Braud"", ""Antoine Bordes""]","[""knn"", ""memory-augmented networks"", ""language generation"", ""dialogue""]","Various machine learning tasks can benefit from access to external information of different modalities, such as text and images. Recent work has focused on learning architectures with large memories capable of storing this knowledge. We propose augmenting Transformer neural networks with KNN-based Information Fetching (KIF) modules. Each KIF module learns a read operation to access fixed external knowledge. We apply these modules to generative dialogue modeling, a challenging task where information must be flexibly retrieved and incorporated to maintain the topic and flow of conversation. 
We demonstrate the effectiveness of our approach by identifying relevant knowledge from Wikipedia, images, and human-written dialogue utterances, and show that leveraging this retrieved information improves model performance, measured by automatic and human evaluation.",/pdf/2cd41c6f094fc342c666d990e3fb141ee1b12d8c.pdf,ICLR,2020,augment transformers with KNN-based search modules to read from multi-modal external information +BJlEEaEFDS,HJxpW2GPPH,1569440000000.0,1577170000000.0,482,Towards an Adversarially Robust Normalization Approach,"[""awais@khu.ac.kr"", ""fahad.shamshad@itu.edu.pk"", ""shbae@khu.ac.kr""]","[""Muhammad Awais"", ""Fahad Shamshad"", ""Sung-Ho Bae""]","[""robustness"", ""BatchNorm"", ""adversarial""]","Batch Normalization (BatchNorm) has shown to be effective for improving and accelerating the training of deep neural networks. However, recently it has been shown that it is also vulnerable to adversarial perturbations. In this work, we aim to investigate the cause of adversarial vulnerability of the BatchNorm. We hypothesize that the use of different normalization statistics during training and inference (mini-batch statistics for training and moving average of these values at inference) is the main cause of this adversarial vulnerability in the BatchNorm layer. We empirically proved this by experiments on various neural network architectures and datasets. Furthermore, we introduce Robust Normalization (RobustNorm) and experimentally show that it is not only resilient to adversarial perturbation but also inherit the benefits of BatchNorm.",/pdf/0efe8a4f4d061e6ba829ba602c56b88eb48b15b0.pdf,ICLR,2020,Investigation of how BatchNorm causes adversarial vulnerability and how to avoid it. 
+fpJX0O5bWKJ,yIBQuPR724,1601310000000.0,1614990000000.0,1321,Estimating Example Difficulty using Variance of Gradients,"[""~Chirag_Agarwal1"", ""~Sara_Hooker1""]","[""Chirag Agarwal"", ""Sara Hooker""]","[""interpretability"", ""human in the loop learning"", ""atypical examples""]","In machine learning, a question of great interest is understanding what examples are challenging for a model to classify. Identifying atypical examples helps inform safe deployment of models, isolates examples that require further human inspection, and provides interpretability into model behavior. In this work, we propose the Variance of Gradients (VOG) as a valuable and efficient proxy metric for detecting outliers in the data distribution. We provide quantitative and qualitative support that VOG is a meaningful way to rank data by difficulty and to surface a tractable subset of the most challenging examples for human-in-the-loop auditing. Data points with high VOG scores are more difficult for the model to learn and over-index on examples that require memorization.",/pdf/fe973856668852d29765087f7beacef9ef711c6d.pdf,ICLR,2021,The Variance of Gradients (VoG) metric can be used to identify atypical examples from a distribution +HJlQ96EtPr,Hyl0ycTDPH,1569440000000.0,1577170000000.0,700,FleXOR: Trainable Fractional Quantization,"[""dslee3@gmail.com"", ""mogndrewk@gmail.com"", ""quddnr145@gmail.com"", ""dragwon.jeon@gmail.com"", ""qkrqotjd91@gmail.com"", ""yji6373@naver.com"", ""gywei@g.harvard.edu""]","[""Dongsoo Lee"", ""Se Jung Kwon"", ""Byeongwook Kim"", ""Yongkweon Jeon"", ""Baeseong Park"", ""Jeongin Yun"", ""Gu-Yeon Wei""]","[""Quantization"", ""Model Compression"", ""Trainable Compression"", ""XOR"", ""Encryption""]","Parameter quantization is a popular model compression technique due to its regular form and high compression ratio. 
In particular, quantization based on binary codes is gaining attention because each quantized bit can be directly utilized for computations without dequantization using look-up tables. Previous attempts, however, only allow for integer numbers of quantization bits, which ends up restricting the search space for compression ratio and accuracy. Moreover, quantization bits are usually obtained by minimizing quantization loss in a local manner that does not directly correspond to minimizing the loss function. In this paper, we propose an encryption algorithm/architecture to compress quantized weights in order to achieve fractional numbers of bits per weight and new compression configurations further optimize accuracy/compression trade-offs. Decryption is implemented using XOR gates added into the neural network model and described as $\tanh(x)$, which enable gradient calculations superior to the straight-through gradient method. We perform experiments using MNIST, CIFAR-10, and ImageNet to show that inserting XOR gates learns quantization/encrypted bit decisions through training and obtains high accuracy even for fractional sub 1-bit weights.",/pdf/6b7e44432aff63964bf47c7957697066041e4fc3.pdf,ICLR,2020,We propose an encryption algorithm/architecture to compress quantized weights in order to achieve fractional numbers of bits per weight +HbZTcIuiMAG,bDdjotcFnlG,1601310000000.0,1614990000000.0,354,Fusion 360 Gallery: A Dataset and Environment for Programmatic CAD Reconstruction,"[""~Karl_Willis1"", ""~Yewen_Pu1"", ""~Jieliang_Luo1"", ""~Hang_Chu4"", ""~Tao_Du1"", ""joseph.lambourne@autodesk.com"", ""~Armando_Solar-Lezama1"", ""~Wojciech_Matusik2""]","[""Karl Willis"", ""Yewen Pu"", ""Jieliang Luo"", ""Hang Chu"", ""Tao Du"", ""Joseph Lambourne"", ""Armando Solar-Lezama"", ""Wojciech Matusik""]","[""CAD"", ""dataset"", ""3D"", ""reconstruction"", ""environment"", ""design"", ""sequence""]","Parametric computer-aided design (CAD) is a standard paradigm used for the design 
of manufactured objects. CAD designers perform modeling operations, such as sketch and extrude, to form a construction sequence that makes up a final design. Despite the pervasiveness of parametric CAD and growing interest from the research community, a dataset of human designed 3D CAD construction sequences has not been available to-date. In this paper we present the Fusion 360 Gallery reconstruction dataset and environment for learning CAD reconstruction. We provide a dataset of 8,625 designs, comprising sequential sketch and extrude modeling operations, together with a complementary environment called the Fusion 360 Gym, to assist with performing CAD reconstruction. We outline a standard CAD reconstruction task, together with evaluation metrics, and present results from a novel method using neurally guided search to recover a construction sequence from a target geometry.",/pdf/3ec59d0f61896944fa3f9101bf9f3be31f7eac11.pdf,ICLR,2021,The Fusion 360 Gallery reconstruction dataset and environment for learning CAD reconstruction. +rJ8rHkWRb,ryBrH1-CW,1509120000000.0,1518730000000.0,481,A Simple Fully Connected Network for Composing Word Embeddings from Characters,"[""mike.sk.traynor@gmail.com"", ""trappenberg@gmail.com""]","[""Michael Traynor"", ""Thomas Trappenberg""]","[""natural language processing"", ""word embeddings"", ""language models"", ""neural network"", ""deep learning"", ""sparsity"", ""dropout""]","This work introduces a simple network for producing character aware word embeddings. Position agnostic and position aware character embeddings are combined to produce an embedding vector for each word. The learned word representations are shown to be very sparse and facilitate improved results on language modeling tasks, despite using markedly fewer parameters, and without the need to apply dropout. 
A final experiment suggests that weight sharing contributes to sparsity, increases performance, and prevents overfitting.",/pdf/bad9d4f9f13e8d0e641a83b65c7504461ffbd5a2.pdf,ICLR,2018,"A fully connected architecture is used to produce word embeddings from character representations, outperforms traditional embeddings and provides insight into sparsity and dropout." +SkgRW64twr,rklkDbK8PH,1569440000000.0,1577170000000.0,394,Deep Multi-View Learning via Task-Optimal CCA,"[""heather@pixelscientia.com"", ""roland.kwitt@gmail.com"", ""marron@unc.edu"", ""troester@unc.edu"", ""chuck_perou@med.unc.edu"", ""mn@cs.unc.edu""]","[""Heather D. Couture"", ""Roland Kwitt"", ""J.S. Marron"", ""Melissa Troester"", ""Charles M. Perou"", ""Marc Niethammer""]","[""multi-view"", ""components analysis"", ""CCA"", ""representation learning"", ""deep learning""]","Canonical Correlation Analysis (CCA) is widely used for multimodal data analysis and, more recently, for discriminative tasks such as multi-view learning; however, it makes no use of class labels. Recent CCA methods have started to address this weakness but are limited in that they do not simultaneously optimize the CCA projection for discrimination and the CCA projection itself, or they are linear only. We address these deficiencies by simultaneously optimizing a CCA-based and a task objective in an end-to-end manner. Together, these two objectives learn a non-linear CCA projection to a shared latent space that is highly correlated and discriminative. 
Our method shows a significant improvement over previous state-of-the-art (including deep supervised approaches) for cross-view classification (8.5% increase), regularization with a second view during training when only one view is available at test time (2.2-3.2%), and semi-supervised learning (15%) on real data.",/pdf/9f0dd7c5a2734ee4324879316d30a11273dc4659.pdf,ICLR,2020,"Learn a projection to a shared latent space that is also discriminative, improving cross-view classification, regularization with a second view during training, and multi-view prediction." +rJeBJJBYDB,B1g2j35dDH,1569440000000.0,1577170000000.0,1467,Chart Auto-Encoders for Manifold Structured Data,"[""schons@rpi.edu"", ""chenjie@us.ibm.com"", ""lair@rpi.edu""]","[""Stephan Schonsheck"", ""Jie Chen"", ""Rongjie Lai""]","[""Auto-encoder"", ""differential manifolds"", ""multi-charted latent space""]"," Auto-encoding and generative models have made tremendous successes in image and signal representation learning and generation. These models, however, generally employ the full Euclidean space or a bounded subset (such as $[0,1]^l$) as the latent space, whose trivial geometry is often too simplistic to meaningfully reflect the structure of the data. This paper aims at exploring a nontrivial geometric structure of the latent space for better data representation. Inspired by differential geometry, we propose \textbf{Chart Auto-Encoder (CAE)}, which captures the manifold structure of the data with multiple charts and transition functions among them. CAE translates the mathematical definition of manifold through parameterizing the entire data set as a collection of overlapping charts, creating local latent representations. These representations are an enhancement of the single-charted latent space commonly employed in auto-encoding models, as they reflect the intrinsic structure of the manifold. Therefore, CAE achieves a more accurate approximation of data and generates realistic new ones. 
We conduct experiments with synthetic and real-life data to demonstrate the effectiveness of the proposed CAE. ",/pdf/92f8580042809cd8a8a363b0b53db1c8851f3d12.pdf,ICLR,2020,Manifold-structured latent space for generative models +B1lMMx1CW,SJ1MzlJAW,1509000000000.0,1518730000000.0,114,THE EFFECTIVENESS OF A TWO-LAYER NEURAL NETWORK FOR RECOMMENDATIONS,"[""rybakovo@amazon.com"", ""vijaim@amazon.com"", ""avishkar@gmail.com"", ""slegrand@a9.com"", ""rgeorgej@amazon.com"", ""kiuk@amazon.com"", ""singsidd@amazon.com"", ""qian.you@snapchat.com"", ""enalisni@uci.edu"", ""leodirac@amazon.com"", ""rluo@pstat.ucsb.edu""]","[""Oleg Rybakov"", ""Vijai Mohan"", ""Avishkar Misra"", ""Scott LeGrand"", ""Rejith Joseph"", ""Kiuk Chung"", ""Siddharth Singh"", ""Qian You"", ""Eric Nalisnick"", ""Leo Dirac"", ""Runfei Luo""]","[""Recommender systems"", ""deep learning"", ""personalization""]","We present a personalized recommender system using neural network for recommending +products, such as eBooks, audio-books, Mobile Apps, Video and Music. +It produces recommendations based on customer’s implicit feedback history such +as purchases, listens or watches. Our key contribution is to formulate recommendation +problem as a model that encodes historical behavior to predict the future +behavior using soft data split, combining predictor and auto-encoder models. We +introduce convolutional layer for learning the importance (time decay) of the purchases +depending on their purchase date and demonstrate that the shape of the time +decay function can be well approximated by a parametrical function. We present +offline experimental results showing that neural networks with two hidden layers +can capture seasonality changes, and at the same time outperform other modeling +techniques, including our recommender in production. Most importantly, we +demonstrate that our model can be scaled to all digital categories, and we observe +significant improvements in an online A/B test. 
We also discuss key enhancements +to the neural network model and describe our production pipeline. Finally +we open-sourced our deep learning library which supports multi-gpu model parallel +training. This is an important feature in building neural network based recommenders +with large dimensionality of input and output data.",/pdf/e3318147d3d09b8e963df1dad356545175c1a555.pdf,ICLR,2018,Improving recommendations using time sensitive modeling with neural networks in multiple product categories on a retail website +ryxgsCVYPr,BkeNwgFuDS,1569440000000.0,1583910000000.0,1307,NeurQuRI: Neural Question Requirement Inspector for Answerability Prediction in Machine Reading Comprehension,"[""scv.back@samsung.com"", ""sai.chetan@samsung.com"", ""akhil.kedia@samsung.com"", ""haejun82.lee@samsung.com"", ""jchoo@korea.ac.kr""]","[""Seohyun Back"", ""Sai Chetan Chinthakindi"", ""Akhil Kedia"", ""Haejun Lee"", ""Jaegul Choo""]","[""Question Answering"", ""Machine Reading Comprehension"", ""Answerability Prediction"", ""Neural Checklist""]","Real-world question answering systems often retrieve potentially relevant documents to a given question through a keyword search, followed by a machine reading comprehension (MRC) step to find the exact answer from them. In this process, it is essential to properly determine whether an answer to the question exists in a given document. This task often becomes complicated when the question involves multiple different conditions or requirements which are to be met in the answer. For example, in a question ""What was the projection of sea level increases in the fourth assessment report?"", the answer should properly satisfy several conditions, such as ""increases"" (but not decreases) and ""fourth"" (but not third). To address this, we propose a neural question requirement inspection model called NeurQuRI that extracts a list of conditions from the question, each of which should be satisfied by the candidate answer generated by an MRC model. 
To check whether each condition is met, we propose a novel, attention-based loss function. We evaluate our approach on SQuAD 2.0 dataset by integrating the proposed module with various MRC models, demonstrating the consistent performance improvements across a wide range of state-of-the-art methods.",/pdf/0473567aa27752ffefa7b6da332a1739b5cf0c06.pdf,ICLR,2020,"We propose a neural question requirement inspection model called NeurQuRI that extracts a list of conditions from the question, each of which should be satisfied by the candidate answer generated by an MRC model." +SkeVsiAcYm,B1lwW6yqFQ,1538090000000.0,1550870000000.0,616,Generative predecessor models for sample-efficient imitation learning,"[""yannickschroecker@gatech.edu"", ""vec@google.com"", ""jscholz@google.com""]","[""Yannick Schroecker"", ""Mel Vecerik"", ""Jon Scholz""]","[""Imitation Learning"", ""Generative Models"", ""Deep Learning""]","We propose Generative Predecessor Models for Imitation Learning (GPRIL), a novel imitation learning algorithm that matches the state-action distribution to the distribution observed in expert demonstrations, using generative models to reason probabilistically about alternative histories of demonstrated states. We show that this approach allows an agent to learn robust policies using only a small number of expert demonstrations and self-supervised interactions with the environment. 
We derive this approach from first principles and compare it empirically to a state-of-the-art imitation learning method, showing that it outperforms or matches its performance on two simulated robot manipulation tasks and demonstrate significantly higher sample efficiency by applying the algorithm on a real robot.",/pdf/5343d14726be4ca86832e318fbcf79837d295112.pdf,ICLR,2019, +uys9OcmXNtU,LCLpN1-yZRg,1601310000000.0,1614990000000.0,569,MQTransformer: Multi-Horizon Forecasts with Context Dependent and Feedback-Aware Attention,"[""~Carson_Eisenach1"", ""~Yagna_Patel1"", ""~Dhruv_Madeka1""]","[""Carson Eisenach"", ""Yagna Patel"", ""Dhruv Madeka""]",[],"Recent advances in neural forecasting have produced major improvements in accuracy for probabilistic demand prediction. In this work, we propose novel improvements to the current state of the art by incorporating changes inspired by recent advances in Transformer architectures for Natural Language Processing. We develop a novel decoder-encoder attention for context-alignment, improving forecasting accuracy by allowing the network to study its own history based on the context for which it is producing a forecast. We also present a novel positional encoding that allows the neural network to learn context-dependent seasonality functions as well as arbitrary holiday distances. Finally we show that the current state of the art MQ-Forecaster (Wen et al., 2017) models display excess variability by failing to leverage previous errors in the forecast to improve accuracy. 
We propose a novel decoder-self attention scheme for forecasting that produces significant improvements in the excess variation of the forecast.",/pdf/ff9f7f7ade66f34dd9095d8dc1ddd04a602a2107.pdf,ICLR,2021, +Oos98K9Lv-k,iHrhjrfU9Yu,1601310000000.0,1616120000000.0,1005,Neural Topic Model via Optimal Transport,"[""~He_Zhao1"", ""~Dinh_Phung2"", ""~Viet_Huynh1"", ""~Trung_Le2"", ""~Wray_Buntine1""]","[""He Zhao"", ""Dinh Phung"", ""Viet Huynh"", ""Trung Le"", ""Wray Buntine""]","[""topic modelling"", ""optimal transport"", ""document analysis""]","Recently, Neural Topic Models (NTMs) inspired by variational autoencoders have obtained increasingly research interest due to their promising results on text analysis. However, it is usually hard for existing NTMs to achieve good document representation and coherent/diverse topics at the same time. Moreover, they often degrade their performance severely on short documents. The requirement of reparameterisation could also comprise their training quality and model flexibility. To address these shortcomings, we present a new neural topic model via the theory of optimal transport (OT). Specifically, we propose to learn the topic distribution of a document by directly minimising its OT distance to the document's word distributions. Importantly, the cost matrix of the OT distance models the weights between topics and words, which is constructed by the distances between topics and words in an embedding space. Our proposed model can be trained efficiently with a differentiable loss. 
Extensive experiments show that our framework significantly outperforms the state-of-the-art NTMs on discovering more coherent and diverse topics and deriving better document representations for both regular and short texts.",/pdf/7be7e3b207a273ccbe61f42c2358cc4fb090748f.pdf,ICLR,2021,"This paper presents a neural topic model via optimal transport, which can discover more coherent and diverse topics and derive better document representations for both regular and short texts." +Hyg_X2C5FX,rkgTbxR9Ym,1538090000000.0,1550880000000.0,1371,GAN Dissection: Visualizing and Understanding Generative Adversarial Networks,"[""davidbau@csail.mit.edu"", ""junyanz@csail.mit.edu"", ""hendrik.strobelt@ibm.com"", ""bzhou@csail.mit.edu"", ""jbt@csail.mit.edu"", ""billf@csail.mit.edu"", ""torralba@csail.mit.edu""]","[""David Bau"", ""Jun-Yan Zhu"", ""Hendrik Strobelt"", ""Bolei Zhou"", ""Joshua B. Tenenbaum"", ""William T. Freeman"", ""Antonio Torralba""]","[""GANs"", ""representation"", ""interpretability"", ""causality""]","Generative Adversarial Networks (GANs) have recently achieved impressive results for many real-world applications, and many GAN variants have emerged with improvements in sample quality and training stability. However, visualization and understanding of GANs is largely missing. How does a GAN represent our visual world internally? What causes the artifacts in GAN results? How do architectural choices affect GAN learning? Answering such questions could enable us to develop new insights and better models. + +In this work, we present an analytic framework to visualize and understand GANs at the unit-, object-, and scene-level. We first identify a group of interpretable units that are closely related to object concepts with a segmentation-based network dissection method. Then, we quantify the causal effect of interpretable units by measuring the ability of interventions to control objects in the output. 
Finally, we examine the contextual relationship between these units and their surrounding by inserting the discovered object concepts into new images. We show several practical applications enabled by our framework, from comparing internal representations across different layers, models, and datasets, to improving GANs by locating and removing artifact-causing units, to interactively manipulating objects in the scene. We provide open source interpretation tools to help peer researchers and practitioners better understand their GAN models.",/pdf/850f5beb9fbbedefecd6a9a8753e693cc9a0aa37.pdf,ICLR,2019,"GAN representations are examined in detail, and sets of representation units are found that control the generation of semantic concepts in the output." +LXMSvPmsm0g,AapT48LlvMf,1601310000000.0,1614210000000.0,2110,Long Live the Lottery: The Existence of Winning Tickets in Lifelong Learning,"[""~Tianlong_Chen1"", ""~Zhenyu_Zhang4"", ""~Sijia_Liu1"", ""~Shiyu_Chang2"", ""~Zhangyang_Wang1""]","[""Tianlong Chen"", ""Zhenyu Zhang"", ""Sijia Liu"", ""Shiyu Chang"", ""Zhangyang Wang""]","[""lottery tickets"", ""winning tickets"", ""lifelong learning""]","The lottery ticket hypothesis states that a highly sparsified sub-network can be trained in isolation, given the appropriate weight initialization. This paper extends that hypothesis from one-shot task learning, and demonstrates for the first time that such extremely compact and independently trainable sub-networks can be also identified in the lifelong learning scenario, which we call lifelong tickets. We show that the resulting lifelong ticket can further be leveraged to improve the performance of learning over continual tasks. However, it is highly non-trivial to conduct network pruning in the lifelong setting. 
Two critical roadblocks arise: i) As many tasks now arrive sequentially, finding tickets in a greedy weight pruning fashion will inevitably suffer from the intrinsic bias, that the earlier emerging tasks impact more; ii) As lifelong learning is consistently challenged by catastrophic forgetting, the compact network capacity of tickets might amplify the risk of forgetting. In view of those, we introduce two pruning options, e.g., top-down and bottom-up, for finding lifelong tickets. Compared to the top-down pruning that extends vanilla (iterative) pruning over sequential tasks, we show that the bottom-up one, which can dynamically shrink and (re-)expand model capacity, effectively avoids the undesirable excessive pruning in the early stage. We additionally introduce lottery teaching that further overcomes forgetting via knowledge distillation aided by external unlabeled data. Unifying those ingredients, we demonstrate the existence of very competitive lifelong tickets, e.g., achieving 3-8% of the dense model size with even higher accuracy, compared to strong class-incremental learning baselines on CIFAR-10/CIFAR-100/Tiny-ImageNet datasets. Codes available at https://github.com/VITA-Group/Lifelong-Learning-LTH.",/pdf/b8a21c081c59f4808478ea25d46d2743c4d51298.pdf,ICLR,2021,"Proposed novel bottom-up lifelong pruning effectively identify the winning tickets, which significantly improve the performance of learning over continual tasks" +yEnaS6yOkxy,WZbhU0aQlsdL,1601310000000.0,1614990000000.0,3415,Class Balancing GAN with a Classifier in the Loop,"[""~Harsh_Rangwani1"", ""~Konda_Reddy_Mopuri3"", ""~Venkatesh_Babu_Radhakrishnan2""]","[""Harsh Rangwani"", ""Konda Reddy Mopuri"", ""Venkatesh Babu Radhakrishnan""]","[""Long-tailed Learning"", ""GAN"", ""Universal Adversarial Perturbations""]","Generative Adversarial Networks (GANs) have swiftly evolved to imitate increasingly complex image distributions. 
However, majority of the developments focus on performance of GANs on balanced datasets. We find that the existing GANs and their training regimes which work well on balanced datasets fail to be effective in case of imbalanced (i.e. long-tailed) datasets. In this work we introduce a novel and theoretically motivated Class Balancing regularizer for training GANs. Our regularizer makes use of the knowledge from a pre-trained classifier to ensure balanced learning of all the classes in the dataset. This is achieved via modelling the effective class frequency based on the exponential forgetting observed in neural networks and encouraging the GAN to focus on underrepresented classes. We demonstrate the utility of our contribution in two diverse scenarios: (i) Learning representations for long-tailed distributions, where we achieve better performance than existing approaches, and (ii) Generation of Universal Adversarial Perturbations (UAPs) in the data-free scenario for the large scale datasets, where we bridge the gap between data-driven and data-free approaches for crafting UAPs.",/pdf/51e8ab249e5ffb2285eab09de7658b4cc370eeab.pdf,ICLR,2021,A regularizer for balancing class distributions of GAN samples with application to long-tailed learning and data-free adversarial attacks. +BycCx8qex,,1478280000000.0,1478280000000.0,287,DRAGNN: A Transition-Based Framework for Dynamically Connected Neural Networks,"[""lingpenk@cs.cmu.edu"", ""chrisalberti@google.com"", ""andor@google.com"", ""bogatyy@google.com"", ""djweiss@google.com""]","[""Lingpeng Kong"", ""Chris Alberti"", ""Daniel Andor"", ""Ivan Bogatyy"", ""David Weiss""]","[""Natural language processing"", ""Deep learning"", ""Multi-modal learning"", ""Structured prediction""]","In this work, we present a compact, modular framework for constructing new recurrent neural architectures. Our basic module is a new generic unit, the Transition Based Recurrent Unit (TBRU). 
In addition to hidden layer activations, TBRUs have discrete state dynamics that allow network connections to be built dynamically as a function of intermediate activations. By connecting multiple TBRUs, we can extend and combine commonly used architectures such as sequence-to-sequence, attention mechanisms, and recursive tree-structured models. A TBRU can also serve as both an {\em encoder} for downstream tasks and as a {\em decoder} for its own task simultaneously, resulting in more accurate multi-task learning. We call our approach Dynamic Recurrent Acyclic Graphical Neural Networks, or DRAGNN. We show that DRAGNN is significantly more accurate and efficient than seq2seq with attention for syntactic dependency parsing and yields more accurate multi-task learning for extractive summarization tasks. +",/pdf/c50d4ca22fbc45f74f428cb4772daea52cab438a.pdf,ICLR,2017,Modular framework for dynamically unrolled neural architectures improves structured prediction tasks +Rhsu5qD36cL,1QtKg-zQP1o,1601310000000.0,1612590000000.0,95,Sequential Density Ratio Estimation for Simultaneous Optimization of Speed and Accuracy,"[""~Akinori_F_Ebihara1"", ""miyagawataik@nec.com"", ""k-sakurai-bq@nec.com"", ""h-imaoka_cb@nec.com""]","[""Akinori F Ebihara"", ""Taiki Miyagawa"", ""Kazuyuki Sakurai"", ""Hitoshi Imaoka""]","[""Sequential probability ratio test"", ""Early classification"", ""Density ratio estimation""]","Classifying sequential data as early and as accurately as possible is a challenging yet critical problem, especially when a sampling cost is high. One algorithm that achieves this goal is the sequential probability ratio test (SPRT), which is known as Bayes-optimal: it can keep the expected number of data samples as small as possible, given the desired error upper-bound. 
However, the original SPRT makes two critical assumptions that limit its application in real-world scenarios: (i) samples are independently and identically distributed, and (ii) the likelihood of the data being derived from each class can be calculated precisely. Here, we propose the SPRT-TANDEM, a deep neural network-based SPRT algorithm that overcomes the above two obstacles. The SPRT-TANDEM sequentially estimates the log-likelihood ratio of two alternative hypotheses by leveraging a novel Loss function for Log-Likelihood Ratio estimation (LLLR) while allowing correlations up to $N (\in \mathbb{N})$ preceding samples. In tests on one original and two public video databases, Nosaic MNIST, UCF101, and SiW, the SPRT-TANDEM achieves statistically significantly better classification accuracy than other baseline classifiers, with a smaller number of data samples. The code and Nosaic MNIST are publicly available at https://github.com/TaikiMiyagawa/SPRT-TANDEM.",/pdf/61baa81a79a2975a98aad96ab59d3ca65685492b.pdf,ICLR,2021,"With a novel sequential density estimation algorithm, we relax critical assumptions of the classical Sequential Probability Ratio Test to be applicable in various real-world scenarios." +Bkgq9ANKvB,HJxXF6d_Dr,1569440000000.0,1577170000000.0,1293,Peer Loss Functions: Learning from Noisy Labels without Knowing Noise Rates,"[""yangliu@ucsc.edu"", ""guohongyi@sjtu.edu.cn""]","[""Yang Liu"", ""Hongyi Guo""]","[""learning with noisy labels"", ""empirical risk minimization"", ""peer loss""]","Learning with noisy labels is a common problem in supervised learning. Existing approaches require practitioners to specify noise rates, i.e., a set of parameters controlling the severity of label noises in the problem. In this work, we introduce a technique to learn from noisy labels that does not require a priori specification of the noise rates. In particular, we introduce a new family of loss functions that we name as peer loss functions. 
Our approach then uses a standard empirical risk minimization (ERM) framework with peer loss functions. Peer loss functions associate each training sample with a certain form of ""peer"" samples, which evaluate a classifier' predictions jointly. We show that, under mild conditions, performing ERM with peer loss functions on the noisy dataset leads to the optimal or a near optimal classifier as if performing ERM over the clean training data, which we do not have access to. To our best knowledge, this is the first result on ""learning with noisy labels without knowing noise rates"" with theoretical guarantees. We pair our results with an extensive set of experiments, where we compare with state-of-the-art techniques of learning with noisy labels. Our results show that peer loss functions based method consistently outperforms the baseline benchmarks. Peer loss provides a way to simplify model development when facing potentially noisy training labels, and can be promoted as a robust candidate loss function in such situations. ",/pdf/9e474fc24931bc633f06bfdf3d0f0a94b052f187.pdf,ICLR,2020,"This paper introduces peer loss, a family of loss functions that enables training a classifier over noisy labels, but without using explicit knowledge of the noise rates of labels." +rylNJlStwB,rJldstyYPH,1569440000000.0,1577170000000.0,2057,Learning to Infer User Interface Attributes from Images,"[""pschlatt@ethz.ch"", ""pavol.bielik@inf.ethz.ch"", ""martin.vechev@inf.ethz.ch""]","[""Philippe Schlattner"", ""Pavol Bielik"", ""Martin Vechev""]",[],"We present a new approach that helps developers automate the process of user interface implementation. Concretely, given an input image created by a designer (e.g, using a vector graphics editor), we learn to infer its implementation which when rendered (e.g., on the Android platform), looks visually the same as the input image. 
To achieve this, we take a black box rendering engine and a set of attributes it supports (e.g., colors, border radius, shadow or text properties), use it to generate a suitable synthetic training dataset, and then train specialized neural models to predict each of the attribute values. To improve pixel-level accuracy, we also use imitation learning to train a neural policy that refines the predicted attribute values by learning to compute the similarity of the original and rendered images in their attribute space, rather than based on the difference of pixel values. +",/pdf/425aef8bfc0583d8333a66947e1b7493ecba7a89.pdf,ICLR,2020, +HkezfhA5Y7,rkezZeC5YX,1538090000000.0,1545360000000.0,1242,A Rate-Distortion Theory of Adversarial Examples,"[""gallowaa@uoguelph.ca"", ""agolubeva@perimeterinstitute.ca"", ""gwtaylor@uoguelph.ca""]","[""Angus Galloway"", ""Anna Golubeva"", ""Graham W. Taylor""]","[""adversarial examples"", ""information bottleneck"", ""robustness""]","The generalization ability of deep neural networks (DNNs) is intertwined with model complexity, robustness, and capacity. Through establishing an equivalence between a DNN and a noisy communication channel, we characterize generalization and fault tolerance for unbounded adversarial attacks in terms of information-theoretic quantities. Invoking rate-distortion theory, we suggest that excess capacity is a significant cause of vulnerability to adversarial examples.",/pdf/7d24f22c80386a4d9d433942573f6ba0f49a36a7.pdf,ICLR,2019,We argue that excess capacity is a significant cause of susceptibility to adversarial examples. 
+B1lwSsC5KX,SklUzKUKKm,1538090000000.0,1545360000000.0,93,Déjà Vu: An Empirical Evaluation of the Memorization Properties of Convnets,"[""asablayrolles@fb.com"", ""matthijs@fb.com"", ""cordelia.schmid@inria.fr"", ""rvj@fb.com""]","[""Alexandre Sablayrolles"", ""Matthijs Douze"", ""Cordelia Schmid"", ""Herv\u00e9 J\u00e9gou""]","[""membership inference"", ""memorization"", ""attack"", ""privacy""]","Convolutional neural networks memorize part of their training data, which is why strategies such as data augmentation and drop-out are employed to mitigate overfitting. This paper considers the related question of “membership inference”, where the goal is to determine if an image was used during training. We consider membership tests over either ensembles of samples or individual samples. +First, we show how to detect if a dataset was used to train a model, and in particular whether some validation images were used at train time. Then, we introduce a new approach to infer membership when a few of the top layers are not available or have been fine-tuned, and show that lower layers still carry information about the training samples. To support our findings, we conduct large-scale experiments on Imagenet and subsets of YFCC-100M with modern architectures such as VGG and Resnet. +",/pdf/fb6fb600a6d51143c997585750ac1802ce2cd8aa.pdf,ICLR,2019,We analyze the memorization properties by a convnet of the training set and propose several use-cases where we can extract some information about the training set. +Sy6iJDqlx,,1478290000000.0,1488650000000.0,349,"Attend, Adapt and Transfer: Attentive Deep Architecture for Adaptive Transfer from multiple sources in the same domain","[""rjana@umich.edu"", ""aravindsrinivas@gmail.com"", ""miteshk@cse.iitm.ac.in"", ""prasanna.p@cs.mcgill.ca"", ""ravi@cse.iitm.ac.in""]","[""Janarthanan Rajendran"", ""Aravind Lakshminarayanan"", ""Mitesh M. 
Khapra"", ""Prasanna P"", ""Balaraman Ravindran""]","[""Deep learning"", ""Reinforcement Learning"", ""Transfer Learning""]","Transferring knowledge from prior source tasks in solving a new target task can be useful in several learning applications. The application of transfer poses two serious challenges which have not been adequately addressed. First, the agent should be able to avoid negative transfer, which happens when the transfer hampers or slows down the learning instead of helping it. Second, the agent should be able to selectively transfer, which is the ability to select and transfer from different and multiple source tasks for different parts of the state space of the target task. We propose A2T (Attend Adapt and Transfer), an attentive deep architecture which adapts and transfers from these source tasks. Our model is generic enough to effect transfer of either policies or value functions. Empirical evaluations on different learning algorithms show that A2T is an effective architecture for transfer by being able to avoid negative transfer while transferring selectively from multiple source tasks in the same domain.",/pdf/5ab63afda67c68cd39a6dbe0fb9402dfe5f451fd.pdf,ICLR,2017,We propose a general architecture for transfer that can avoid negative transfer and transfer selectively from multiple source tasks in the same domain. +SkF2D7g0b,Sy_nwmlR-,1509070000000.0,1518730000000.0,230,Exploring the Space of Black-box Attacks on Deep Neural Networks,"[""abhagoji@princeton.edu"", ""_w@eecs.berkeley.edu"", ""lxbosky@gmail.com"", ""dawnsong@gmail.com""]","[""Arjun Nitin Bhagoji"", ""Warren He"", ""Bo Li"", ""Dawn Song""]","[""adversarial machine learning"", ""black-box attacks""]","Existing black-box attacks on deep neural networks (DNNs) so far have largely focused on transferability, where an adversarial instance generated for a locally trained model can “transfer” to attack other learning models. 
In this paper, we propose novel Gradient Estimation black-box attacks for adversaries with query access to the target model’s class probabilities, which do not rely on transferability. We also propose strategies to decouple the number of queries required to generate each adversarial sample from the dimensionality of the input. An iterative variant of our attack achieves close to 100% adversarial success rates for both targeted and untargeted attacks on DNNs. We carry out extensive experiments for a thorough comparative evaluation of black-box attacks and show that the proposed Gradient Estimation attacks outperform all transferability based black-box attacks we tested on both MNIST and CIFAR-10 datasets, achieving adversarial success rates similar to well known, state-of-the-art white-box attacks. We also apply the Gradient Estimation attacks successfully against a real-world content moderation classifier hosted by Clarifai. Furthermore, we evaluate black-box attacks against state-of-the-art defenses. We show that the Gradient Estimation attacks are very effective even against these defenses.",/pdf/5f3238f70f31480b9668afbd0ece2e7adfe28480.pdf,ICLR,2018,Query-based black-box attacks on deep neural networks with adversarial success rates matching white-box attacks +4q8qGBf4Zxb,VWKATgOcu0S,1601310000000.0,1614990000000.0,2274,Network Architecture Search for Domain Adaptation,"[""~Yichen_Li2"", ""~Xingchao_Peng1""]","[""Yichen Li"", ""Xingchao Peng""]",[],"Deep networks have been used to learn transferable representations for domain adaptation. Existing deep domain adaptation methods systematically employ popular hand-crafted networks designed specifically for image-classification tasks, leading to sub-optimal domain adaptation performance. 
In this paper, we present Neural Architecture Search for Domain Adaptation (NASDA), a principle framework that leverages differentiable neural architecture search to derive the optimal network architecture for domain adaptation task. NASDA is designed with two novel training strategies: neural architecture search with multi-kernel Maximum Mean Discrepancy to derive the optimal architecture, and adversarial training between a feature generator and a batch of classifiers to consolidate the feature generator. We demonstrate experimentally that NASDA leads to state-of-the-art performance on several domain adaptation benchmarks.",/pdf/cfe84ef619e7ce866bad304ed87e0990d51705cb.pdf,ICLR,2021, +HJe3TsR5K7,ryxkdPtcKm,1538090000000.0,1545360000000.0,840,Learning Joint Wasserstein Auto-Encoders for Joint Distribution Matching,"[""secaojiezhang@mail.scut.edu.cn"", ""guoyongcs@gmail.com"", ""selangyuanmo@mail.scut.edu.cn"", ""peilinzhao@hotmail.com"", ""jzhuang@uta.edu"", ""mingkuitan@scut.edu.cn""]","[""Jiezhang Cao"", ""Yong Guo"", ""Langyuan Mo"", ""Peilin Zhao"", ""Junzhou Huang"", ""Mingkui Tan""]","[""joint distribution matching"", ""image-to-image translation"", ""video-to-video synthesis"", ""Wasserstein distance""]","We study the joint distribution matching problem which aims at learning bidirectional mappings to match the joint distribution of two domains. This problem occurs in unsupervised image-to-image translation and video-to-video synthesis tasks, which, however, has two critical challenges: (i) it is difficult to exploit sufficient information from the joint distribution; (ii) how to theoretically and experimentally evaluate the generalization performance remains an open question. To address the above challenges, we propose a new optimization problem and design a novel Joint Wasserstein Auto-Encoders (JWAE) to minimize the Wasserstein distance of the joint distributions in two domains. 
We theoretically prove that the generalization ability of the proposed method can be guaranteed by minimizing the Wasserstein distance of joint distributions. To verify the generalization ability, we apply our method to unsupervised video-to-video synthesis by performing video frame interpolation and producing visually smooth videos in two domains, simultaneously. Both qualitative and quantitative comparisons demonstrate the superiority of our method over several state-of-the-arts.",/pdf/8ac61dce09026e98e13b601e5b43721595ddf6da.pdf,ICLR,2019,"We propose a novel Joint Wasserstein Auto-Encoders (JWAE) for Joint Distribution Matching problem, and apply it to image-to-image translation and video-to-video synthesis tasks." +HyeEIyBtvr,B1l8rIa_DB,1569440000000.0,1577170000000.0,1724,BETANAS: Balanced Training and selective drop for Neural Architecture Search,"[""fangmuyuan@huawei.com"", ""wangqiang168@huawei.com"", ""zhangjian157@huawei.com"", ""zorro.zhongzhao@huawei.com""]","[""Muyuan Fang"", ""Qiang Wang"", ""Jian Zhang"", ""Zhao Zhong""]","[""neural architecture search"", ""weight sharing"", ""auto machine learning"", ""deep learning"", ""CNN""]","Automatic neural architecture search techniques are becoming increasingly important in machine learning area recently. Especially, weight sharing methods have shown remarkable potentials on searching good network architectures with few computational resources. However, existing weight sharing methods mainly suffer limitations on searching strategies: these methods either uniformly train all network paths to convergence which introduces conflicts between branches and wastes a large amount of computation on unpromising candidates, or selectively train branches with different frequency which leads to unfair evaluation and comparison among paths. 
To address these issues, we propose a novel neural architecture search method with balanced training strategy to ensure fair comparisons and a selective drop mechanism to reduces conflicts among candidate paths. The experimental results show that our proposed method can achieve a leading performance of 79.0% on ImageNet under mobile settings, which outperforms other state-of-the-art methods in both accuracy and efficiency.",/pdf/8e2d0f358dbd2f0b23e6aa7317ceb893757f5284.pdf,ICLR,2020,A novel method to search for neural architectures via weight sharing. +rkx3-04FwB,ryxDLvE_DH,1569440000000.0,1577170000000.0,980,MONET: Debiasing Graph Embeddings via the Metadata-Orthogonal Training Unit,"[""johnpalowitch@gmail.com"", ""bperozzi@acm.org""]","[""John Palowitch"", ""Bryan Perozzi""]","[""Graph Embeddings"", ""Representation Learning""]","Are Graph Neural Networks (GNNs) fair? In many real world graphs, the formation of edges is related to certain node attributes (e.g. gender, community, reputation). In this case, any GNN using these edges will be biased by this information, as it is encoded in the structure of the adjacency matrix itself. In this paper, we show that when metadata is correlated with the formation of node neighborhoods, unsupervised node embedding dimensions learn this metadata. This bias implies an inability to control for important covariates in real-world applications, such as recommendation systems. + +To solve these issues, we introduce the Metadata-Orthogonal Node Embedding Training (MONET) unit, a general model for debiasing embeddings of nodes in a graph. MONET achieves this by ensuring that the node embeddings are trained on a hyperplane orthogonal to that of the node metadata. This effectively organizes unstructured embedding dimensions into an interpretable topology-only, metadata-only division with no linear interactions. 
We illustrate the effectiveness of MONET though our experiments on a variety of real world graphs, which shows that our method can learn and remove the effect of arbitrary covariates in tasks such as preventing the leakage of political party affiliation in a blog network, and thwarting the gaming of embedding-based recommendation systems.",/pdf/ca4f13a67f969d7f1835d82c72c48c403525bedc.pdf,ICLR,2020,Introduces a novel graph neural network method for debiasing graph embeddings from metadata and embedding the metadata effect. +S1xFl64tDr,HJg87EmUwH,1569440000000.0,1583910000000.0,346,Interpretable Complex-Valued Neural Networks for Privacy Protection,"[""xiangliyao08@sjtu.edu.cn"", ""1603023-zh@sjtu.edu.cn"", ""11612807@mail.sustc.edu.cn"", ""zhangyf_sjtu@sjtu.edu.cn"", ""ariesrj@sjtu.edu.cn"", ""zqs1022@sjtu.edu.cn""]","[""Liyao Xiang"", ""Hao Zhang"", ""Haotian Ma"", ""Yifan Zhang"", ""Jie Ren"", ""Quanshi Zhang""]","[""Deep Learning"", ""Privacy Protection"", ""Complex-Valued Neural Networks""]","Previous studies have found that an adversary attacker can often infer unintended input information from intermediate-layer features. We study the possibility of preventing such adversarial inference, yet without too much accuracy degradation. We propose a generic method to revise the neural network to boost the challenge of inferring input attributes from features, while maintaining highly accurate outputs. In particular, the method transforms real-valued features into complex-valued ones, in which the input is hidden in a randomized phase of the transformed features. The knowledge of the phase acts like a key, with which any party can easily recover the output from the processing result, but without which the party can neither recover the output nor distinguish the original input. 
Preliminary experiments on various datasets and network structures have shown that our method significantly diminishes the adversary's ability to infer the input while largely preserving the resulting accuracy.",/pdf/de30952f44a977af1588f9a66f607e702d9bffcb.pdf,ICLR,2020, +p65lWYKpqKz,NtmFJu79tO,1601310000000.0,1614990000000.0,2883,Physics-aware Spatiotemporal Modules with Auxiliary Tasks for Meta-Learning,"[""~Sungyong_Seo1"", ""~Chuizheng_Meng1"", ""~Sirisha_Rambhatla1"", ""~Yan_Liu1""]","[""Sungyong Seo"", ""Chuizheng Meng"", ""Sirisha Rambhatla"", ""Yan Liu""]","[""physics-aware learning"", ""spatiotemporal graph signals"", ""few shot learning""]","Modeling the dynamics of real-world physical systems is critical for spatiotemporal prediction tasks, but challenging when data is limited. The scarcity of real-world data and the difficulty in reproducing the data distribution hinder directly applying meta-learning techniques. Although the knowledge of governing partial differential equations (PDE) of the data can be helpful for the fast adaptation to few observations, it is mostly infeasible to exactly find the equation for observations in real-world physical systems. In this work, we propose a framework, physics-aware meta-learning with auxiliary tasks whose spatial modules incorporate PDE-independent knowledge and temporal modules utilize the generalized features from the spatial modules to be adapted to the limited data, respectively. The framework is inspired by a local conservation law expressed mathematically as a continuity equation and does not require the exact form of governing equation to model the spatiotemporal observations. The proposed method mitigates the need for a large number of real-world tasks for meta-learning by leveraging spatial information in simulated data to meta-initialize the spatial modules. 
We apply the proposed framework to both synthetic and real-world spatiotemporal prediction tasks and demonstrate its superior performance with limited observations.",/pdf/9d2f396ba0976048ae07a7cfa6bf37e1abad9cab.pdf,ICLR,2021,We propose physics-aware modules designed for meta-learning to tackle the few sample challenges in spatiotemporal physical observations in the real-world. +BJlVhsA5KX,BylINMa5Ym,1538090000000.0,1545360000000.0,701,Sequenced-Replacement Sampling for Deep Learning,"[""chiuman100@gmail.com"", ""pdhvip@gmail.com"", ""wei.yang2@huawei.com"", ""yichang@acm.org""]","[""Chiu Man Ho"", ""Dae Hoon Park"", ""Wei Yang"", ""Yi Chang""]","[""deep neural networks"", ""stochastic gradient descent"", ""sequenced-replacement sampling""]","We propose sequenced-replacement sampling (SRS) for training deep neural networks. The basic idea is to assign a fixed sequence index to each sample in the dataset. Once a mini-batch is randomly drawn in each training iteration, we refill the original dataset by successively adding samples according to their sequence index. Thus we carry out replacement sampling but in a batched and sequenced way. In a sense, SRS could be viewed as a way of performing ""mini-batch augmentation"". It is particularly useful for a task where we have a relatively small images-per-class such as CIFAR-100. Together with a longer period of initial large learning rate, it significantly improves the classification accuracy in CIFAR-100 over the current state-of-the-art results. Our experiments indicate that training deeper networks with SRS is less prone to over-fitting. In the best case, we achieve an error rate as low as 10.10%.",/pdf/d5e0c948a797ea18533ba50e7c7f8e167e9b71f1.pdf,ICLR,2019,"Proposed a novel way (without adding new parameters) of training deep neural network in order to improve generalization, especially for the case where we have relatively small images-per-class." 
+B1eZweHFwr,HkeHWjlYvH,1569440000000.0,1577170000000.0,2349,Statistical Verification of General Perturbations by Gaussian Smoothing,"[""marcfisc@student.ethz.ch"", ""mbaader@inf.ethz.ch"", ""martin.vechev@inf.ethz.ch""]","[""Marc Fischer"", ""Maximilian Baader"", ""Martin Vechev""]","[""adversarial robustness"", ""certified network"", ""randomised smoothing"", ""geometric perturbations""]","We present a novel statistical certification method that generalizes prior work based on smoothing to handle richer perturbations. Concretely, our method produces a provable classifier which can establish statistical robustness against geometric perturbations (e.g., rotations, translations) as well as volume changes and pitch shifts on audio data. The generalization is non-trivial and requires careful handling of operations such as interpolation. Our method is agnostic to the choice of classifier and scales to modern architectures such as ResNet-50 on ImageNet.",/pdf/2e40c75db521acb822a80777100b08127c5b4b8a.pdf,ICLR,2020,"We present a statistical certification method to certify robustness for rotations, translations and other transformations." +HkedQp4tPr,HkgZXh1wPB,1569440000000.0,1577170000000.0,454,Parallel Scheduled Sampling,"[""duckworthd@google.com"", ""aneelakantan@google.com"", ""bgoodrich@google.com"", ""lukaszkaiser@google.com"", ""bengio@google.com""]","[""Daniel Duckworth"", ""Arvind Neelakantan"", ""Ben Goodrich"", ""Lukasz Kaiser"", ""Samy Bengio""]","[""deep learning"", ""generative models"", ""teacher forcing"", ""scheduled sampling""]","Auto-regressive models are widely used in sequence generation problems. The output sequence is typically generated in a predetermined order, one discrete unit(pixel or word or character) at a time. The models are trained by teacher-forcing where ground-truth history is fed to the model as input, which at test time is replaced by the model prediction. 
Scheduled Sampling (Bengio et al., 2015) aims to mitigate this discrepancy between train and test time by randomly replacing some discrete units in the history with the model’s prediction. While teacher-forced training works well with ML accelerators as the computation can be parallelized across time, Scheduled Sampling involves undesirable sequential processing. In this paper, we introduce a simple technique to parallelize Scheduled Sampling across time. Experimentally, we find the proposed technique leads to equivalent or better performance on image generation, summarization, dialog generation, and translation compared to teacher-forced training. In the dialog response generation task, Parallel Scheduled Sampling achieves a 1.6 BLEU score (11.5%) improvement over teacher-forcing, while in image generation it achieves 20% and 13.8% improvement in Frechet Inception Distance (FID) and Inception Score (IS) respectively. Further, we discuss the effects of different hyper-parameters associated with Scheduled Sampling on the model performance.",/pdf/a4ecc028a218deeb1f8998e2dd6864505b8c75cd.pdf,ICLR,2020,We describe a simple technique to parallelize Scheduled Sampling across time which gives better sample quality and trains almost as fast as teacher-forcing. +S1gBz2C9tX,r1esu-Cqtm,1538090000000.0,1545360000000.0,1263,Importance Resampling for Off-policy Policy Evaluation,"[""mkschleg@ualberta.ca"", ""wchung@ualberta.ca"", ""daniel.graves@huawei.com"", ""whitem@ualberta.ca""]","[""Matthew Schlegel"", ""Wesley Chung"", ""Daniel Graves"", ""Martha White""]","[""Reinforcement Learning"", ""Off-policy policy evaluation"", ""importance resampling"", ""importance sampling""]","Importance sampling is a common approach to off-policy learning in reinforcement learning. While it is consistent and unbiased, it can result in high variance updates to the parameters for the value function. 
Weighted importance sampling (WIS) has been explored to reduce variance for off-policy policy evaluation, but only for linear value function approximation. In this work, we explore a resampling strategy to reduce variance, rather than a reweighting strategy. We propose Importance Resampling (IR) for off-policy learning, which resamples experience from the replay buffer and applies a standard on-policy update. The approach avoids using importance sampling ratios directly in the update, instead correcting the distribution over transitions before the update. We characterize the bias and consistency of our estimator, particularly compared to WIS. We then demonstrate in several toy domains that IR has improved sample efficiency and parameter sensitivity, as compared to several baseline WIS estimators and to IS. We conclude with a demonstration showing IR improves over IS for learning a value function from images in a racing car simulator.",/pdf/09ca8d72f9333afe9e6f9c094720ee3722b00756.pdf,ICLR,2019,A resampling approach for off-policy policy evaluation in reinforcement learning. +R0a0kFI3dJx,RDktu65qOr8,1601310000000.0,1616030000000.0,2265,Adaptive Extra-Gradient Methods for Min-Max Optimization and Games,"[""~Kimon_Antonakopoulos1"", ""~Veronica_Belmega1"", ""~Panayotis_Mertikopoulos1""]","[""Kimon Antonakopoulos"", ""Veronica Belmega"", ""Panayotis Mertikopoulos""]","[""min-max optimization"", ""games"", ""mirror-prox"", ""adaptive methods"", ""regime agnostic methods""]","We present a new family of min-max optimization algorithms that automatically exploit the geometry of the gradient data observed at earlier iterations to perform more informative extra-gradient steps in later ones. +Thanks to this adaptation mechanism, the proposed method automatically detects whether the problem is smooth or not, without requiring any prior tuning by the optimizer. 
+As a result, the algorithm simultaneously achieves order-optimal convergence rates, i.e., it converges to an $\varepsilon$-optimal solution within $\mathcal{O}(1/\varepsilon)$ iterations in smooth problems, and within $\mathcal{O}(1/\varepsilon^2)$ iterations in non-smooth ones. Importantly, these guarantees do not require any of the standard boundedness or Lipschitz continuity conditions that are typically assumed in the literature; in particular, they apply even to problems with singularities (such as resource allocation problems and the like). This adaptation is achieved through the use of a geometric apparatus based on Finsler metrics and a suitably chosen mirror-prox template that allows us to derive sharp convergence rates for the methods at hand.",/pdf/b85ffd0f421c8180b9a511a825ac3f10fc824b9b.pdf,ICLR,2021,We develop an adaptive mirror-prox method for min-max problems and games that achieves order-optimal rates in both smooth and non-smooth problems. +#NAME?,77U4g9JQuoq,1601310000000.0,1616060000000.0,595,A Discriminative Gaussian Mixture Model with Sparsity,"[""~Hideaki_Hayashi1"", ""~Seiichi_Uchida1""]","[""Hideaki Hayashi"", ""Seiichi Uchida""]","[""classification"", ""sparse Bayesian learning"", ""Gaussian mixture model""]","In probabilistic classification, a discriminative model based on the softmax function has a potential limitation in that it assumes unimodality for each class in the feature space. The mixture model can address this issue, although it leads to an increase in the number of parameters. We propose a sparse classifier based on a discriminative GMM, referred to as a sparse discriminative Gaussian mixture (SDGM). In the SDGM, a GMM-based discriminative model is trained via sparse Bayesian learning. 
Using this sparse learning framework, we can simultaneously remove redundant Gaussian components and reduce the number of parameters used in the remaining components during learning; this learning method reduces the model complexity, thereby improving the generalization capability. Furthermore, the SDGM can be embedded into neural networks (NNs), such as convolutional NNs, and can be trained in an end-to-end manner. Experimental results demonstrated that the proposed method outperformed the existing softmax-based discriminative models.",/pdf/6fecf8af857ca0e108abee0d2dd9710cf7c3ad37.pdf,ICLR,2021,"A sparse classifier based on a discriminative Gaussian mixture model, which can also be embedded into a neural network." +rJxYMCEFDr,HyeSvJr_PS,1569440000000.0,1577170000000.0,1009,Leveraging Adversarial Examples to Obtain Robust Second-Order Representations,"[""mohit.p@gatech.edu"", ""gukyeong.kwon@gatech.edu"", ""cantemel@gatech.edu"", ""alregib@gatech.edu""]","[""Mohit Prabhushankar"", ""Gukyeong Kwon"", ""Dogancan Temel"", ""Ghassan AlRegib""]","[""Second-order representation"", ""adversarial examples"", ""robustness"", ""gradients""]","Deep neural networks represent data as projections on trained weights in a high dimensional manifold. This is a first-order based absolute representation that is widely used due to its interpretable nature and simple mathematical functionality. However, in the application of visual recognition, first-order representations trained on pristine images have shown a vulnerability to distortions. Visual distortions including imaging acquisition errors and challenging environmental conditions like blur, exposure, snow and frost cause incorrect classification in first-order neural nets. To eliminate vulnerabilities under such distortions, we propose representing data points by their relative positioning in a high dimensional manifold instead of their absolute positions. Such a positioning scheme is based on a data point’s second-order property. 
We obtain a data point’s second-order representation by creating adversarial examples to all possible decision boundaries and tracking the movement of corresponding boundaries. We compare our representation against first-order methods and show that there is an increase of more than 14% under severe distortions for ResNet-18. We test the generalizability of the proposed representation on larger networks and on 19 complex and real-world distortions from CIFAR-10-C. Furthermore, we show how our proposed representation can be used as a plug-in approach on top of any network. We also provide methodologies to scale our proposed representation to larger datasets.",/pdf/21c1e33c42e4120dc2aa6adad421d9e283da6f7b.pdf,ICLR,2020,We introduce a robust plug-in representation based on relative positioning derived from targeted multi-class adversarial image generation. +H1gdAC4KDB,H1x57K9_PH,1569440000000.0,1577170000000.0,1436,Adversarially Robust Generalization Just Requires More Unlabeled Data,"[""zhairuntian@pku.edu.cn"", ""caitianle1998@pku.edu.cn"", ""dihe@microsoft.com"", ""cdan@cs.cmu.edu"", ""brooklet60@hust.edu.cn"", ""jeh17@cornell.edu"", ""wanglw@cis.pku.edu.cn""]","[""Runtian Zhai"", ""Tianle Cai"", ""Di He"", ""Chen Dan"", ""Kun He"", ""John E. Hopcroft"", ""Liwei Wang""]","[""Adversarial Robustness"", ""Semi-supervised Learning""]","Neural network robustness has recently been highlighted by the existence of adversarial examples. Many previous works show that the learned networks do not perform well on perturbed test data, and significantly more labeled data is required to achieve adversarially robust generalization. In this paper, we theoretically and empirically show that with just more unlabeled data, we can learn a model with better adversarially robust generalization. 
The key insight of our results is based on a risk decomposition theorem, in which the expected robust risk is separated into two parts: the stability part which measures the prediction stability in the presence of perturbations, and the accuracy part which evaluates the standard classification accuracy. As the stability part does not depend on any label information, we can optimize this part using unlabeled data. We further prove that for a specific Gaussian mixture problem, adversarially robust generalization can be almost as easy as the standard generalization in supervised learning if a sufficiently large amount of unlabeled data is provided. Inspired by the theoretical findings, we further show that a practical adversarial training algorithm that leverages unlabeled data can improve adversarial robust generalization on MNIST and Cifar-10.",/pdf/10d9a4892e8e0efbfee5c97af3e72aee018f2705.pdf,ICLR,2020, +SJx4O34YvS,Syxe_KzX8r,1569440000000.0,1577170000000.0,37,Semantics Preserving Adversarial Attacks,"[""ousmane@elementai.com"", ""elnaz.barshan@elementai.com"", ""babanezhad@gmail.com""]","[""Ousmane Amadou Dia"", ""Elnaz Barshan"", ""Reza Babanezhad""]","[""black-box adversarial attacks"", ""stein variational inference"", ""adversarial images and tex""]","While progress has been made in crafting visually imperceptible adversarial examples, constructing semantically meaningful ones remains a challenge. In this paper, we propose a framework to generate semantics preserving adversarial examples. First, we present a manifold learning method to capture the semantics of the inputs. The motivating principle is to learn the low-dimensional geometric summaries of the inputs via statistical inference. Then, we perturb the elements of the learned manifold using the Gram-Schmidt process to induce the perturbed elements to remain in the manifold. 
To produce adversarial examples, we propose an efficient algorithm whereby we leverage the semantics of the inputs as a source of knowledge upon which we impose adversarial constraints. We apply our approach on toy data, images and text, and show its effectiveness in producing semantics preserving adversarial examples which evade existing defenses against adversarial attacks.",/pdf/c0945141d9f57d1009d0976377bb423caf31a387.pdf,ICLR,2020,Generating semantically meaningful adversarial examples beyond simple norm balls in an efficient and effective way using generative models. +S191YzbRZ,HkrytM-AW,1509140000000.0,1518730000000.0,877,Prototype Matching Networks for Large-Scale Multi-label Genomic Sequence Classification,"[""jjl5sw@virginia.edu"", ""as5cu@virginia.edu"", ""rs3zz@virginia.edu"", ""yq2h@virginia.edu""]","[""Jack Lanchantin"", ""Arshdeep Sekhon"", ""Ritambhara Singh"", ""Yanjun Qi""]","[""bioinformatics"", ""multi-label classification"", ""matching networks"", ""prototypes"", ""memory networks"", ""attention""]","One of the fundamental tasks in understanding genomics is the problem of predicting Transcription Factor Binding Sites (TFBSs). With more than hundreds of Transcription Factors (TFs) as labels, genomic-sequence based TFBS prediction is a challenging multi-label classification task. There are two major biological mechanisms for TF binding: (1) sequence-specific binding patterns on genomes known as “motifs” and (2) interactions among TFs known as co-binding effects. In this paper, we propose a novel deep architecture, the Prototype Matching Network (PMN) to mimic the TF binding mechanisms. Our PMN model automatically extracts prototypes (“motif”-like features) for each TF through a novel prototype-matching loss. Borrowing ideas from few-shot matching models, we use the notion of support set of prototypes and an LSTM to learn how TFs interact and bind to genomic sequences. 
On a reference TFBS dataset with 2.1 million genomic sequences, PMN significantly outperforms baselines and validates our design choices empirically. To our knowledge, this is the first deep learning architecture that introduces prototype learning and considers TF-TF interactions for large scale TFBS prediction. Not only is the proposed architecture accurate, but it also models the underlying biology.",/pdf/8db6330c303015eb2a9d1abf59d77afe7964bed6.pdf,ICLR,2018,We combine the matching network framework for few shot learning into a large scale multi-label model for genomic sequence classification. +WAISmwsqDsb,PpU5Mhlj73H,1601310000000.0,1615920000000.0,509,DINO: A Conditional Energy-Based GAN for Domain Translation,"[""~Konstantinos_Vougioukas1"", ""~Stavros_Petridis1"", ""~Maja_Pantic1""]","[""Konstantinos Vougioukas"", ""Stavros Petridis"", ""Maja Pantic""]","[""Generative Modelling"", ""Domain Translation"", ""Conditional GANs"", ""Energy-Based GANs""]","Domain translation is the process of transforming data from one domain to another while preserving the common semantics. Some of the most popular domain translation systems are based on conditional generative adversarial networks, which use source domain data to drive the generator and as an input to the discriminator. However, this approach does not enforce the preservation of shared semantics since the conditional input can often be ignored by the discriminator. We propose an alternative method for conditioning and present a new framework, where two networks are simultaneously trained, in a supervised manner, to perform domain translation in opposite directions. Our method is not only better at capturing the shared information between two domains but is more generic and can be applied to a broader range of problems. 
The proposed framework performs well even in challenging cross-modal translations, such as video-driven speech reconstruction, for which other systems struggle to maintain correspondence.",/pdf/1770fc1a0716d2fde0cefb49d59d540311331789.pdf,ICLR,2021,A framework for domain translation which uses a novel mechanism for conditioning energy-based GANs. +BJxkOlSYDH,HyeSUhlYPr,1569440000000.0,1583910000000.0,2382,Provable Filter Pruning for Efficient Neural Networks,"[""lucasl@mit.edu"", ""baykal@mit.edu"", ""hlang08@gmail.com"", ""dannyf.post@gmail.com"", ""rus@csail.mit.edu""]","[""Lucas Liebenwein"", ""Cenk Baykal"", ""Harry Lang"", ""Dan Feldman"", ""Daniela Rus""]","[""theory"", ""compression"", ""filter pruning"", ""neural networks""]","We present a provable, sampling-based approach for generating compact Convolutional Neural Networks (CNNs) by identifying and removing redundant filters from an over-parameterized network. Our algorithm uses a small batch of input data points to assign a saliency score to each filter and constructs an importance sampling distribution where filters that highly affect the output are sampled with correspondingly high probability. +In contrast to existing filter pruning approaches, our method is simultaneously data-informed, exhibits provable guarantees on the size and performance of the pruned network, and is widely applicable to varying network architectures and data sets. Our analytical bounds bridge the notions of compressibility and importance of network structures, which gives rise to a fully-automated procedure for identifying and preserving filters in layers that are essential to the network's performance. Our experimental evaluations on popular architectures and data sets show that our algorithm consistently generates sparser and more efficient models than those constructed by existing filter pruning approaches. 
",/pdf/7ac4cb1b260593c92731885f950f67112b1f21b8.pdf,ICLR,2020,A sampling-based filter pruning approach for convolutional neural networks exhibiting provable guarantees on the size and performance of the pruned network. +euDnVs0Ynts,gO4z9mykBTi,1601310000000.0,1616060000000.0,3006,Robust Learning of Fixed-Structure Bayesian Networks in Nearly-Linear Time,"[""~Yu_Cheng2"", ""~Honghao_Lin1""]","[""Yu Cheng"", ""Honghao Lin""]","[""Bayesian networks"", ""robust statistics"", ""learning theory""]","We study the problem of learning Bayesian networks where an $\epsilon$-fraction of the samples are adversarially corrupted. We focus on the fully-observable case where the underlying graph structure is known. In this work, we present the first nearly-linear time algorithm for this problem with a dimension-independent error guarantee. Previous robust algorithms with comparable error guarantees are slower by at least a factor of $(d/\epsilon)$, where $d$ is the number of variables in the Bayesian network and $\epsilon$ is the fraction of corrupted samples. + +Our algorithm and analysis are considerably simpler than those in previous work. We achieve this by establishing a direct connection between robust learning of Bayesian networks and robust mean estimation. As a subroutine in our algorithm, we develop a robust mean estimation algorithm whose runtime is nearly-linear in the number of nonzeros in the input samples, which may be of independent interest.",/pdf/01c090bb63e775869f6bc2d003ebf3cd5e79df67.pdf,ICLR,2021,We give the first nearly-linear time algorithm for the robust learning of fixed-structure Bayesian networks. 
+K5a_QFEUzA1,RH-U-vSnT5y,1601310000000.0,1614990000000.0,1747,Cross-model Back-translated Distillation for Unsupervised Machine Translation,"[""~Phi_Xuan_Nguyen1"", ""~Shafiq_Joty1"", ""wuk@i2r.a-star.edu.sg"", ""~AiTi_Aw1""]","[""Phi Xuan Nguyen"", ""Shafiq Joty"", ""Kui Wu"", ""AiTi Aw""]","[""unsupervised machine translation"", ""NMT"", ""machine translation""]","Recent unsupervised machine translation (UMT) systems usually employ three main principles: initialization, language modeling and iterative back-translation, though they may apply them differently. Crucially, iterative back-translation and denoising auto-encoding for language modeling provide data diversity to train the UMT systems. However, these diversification processes may have reached their limit. We introduce a novel component to the standard UMT framework called Cross-model Back-translated Distillation (CBD), that is aimed to induce another level of data diversification that existing principles lack. CBD is applicable to all previous UMT approaches. In our experiments, it boosts the performance of the standard UMT methods by 1.5-2.0 BLEU. In particular, in WMT'14 English-French, WMT'16 German-English and English-Romanian, CBD outperforms cross-lingual masked language model (XLM) by 2.3, 2.2 and 1.6 BLEU, respectively. It also yields 1.5-3.3 BLEU improvements in IWSLT English-French and English-German tasks. Through extensive experimental analyses, we show that CBD is effective because it embraces data diversity while other similar variants do not.",/pdf/7188c1b2b5af74e92001927d2f7750e0e22447b5.pdf,ICLR,2021,The paper introduces a method to improve unsupervised machine translation using two unsupervised agents to produce diverse data and conduct knowledge distillation. 
+AM0PBmqmojH,p3aRIfqisd,1601310000000.0,1614990000000.0,3231,"Warpspeed Computation of Optimal Transport, Graph Distances, and Embedding Alignment","[""~Johannes_Klicpera1"", ""marten.lienen@in.tum.de"", ""~Stephan_G\u00fcnnemann1""]","[""Johannes Klicpera"", ""Marten Lienen"", ""Stephan G\u00fcnnemann""]","[""Optimal transport"", ""sinkhorn distance"", ""locality sensitive hashing"", ""nystr\u00f6m method"", ""graph neural networks"", ""embedding alignment""]","Optimal transport (OT) is a cornerstone of many machine learning tasks. The current best practice for computing OT is via entropy regularization and Sinkhorn iterations. This algorithm runs in quadratic time and requires calculating the full pairwise cost matrix, which is prohibitively expensive for large sets of objects. To alleviate this limitation we propose to instead use a sparse approximation of the cost matrix based on locality sensitive hashing (LSH). Moreover, we fuse this sparse approximation with the Nyström method, resulting in the locally corrected Nyström method (LCN). These approximations enable general log-linear time algorithms for entropy-regularized OT that perform well even in complex, high-dimensional spaces. We thoroughly demonstrate these advantages via a theoretical analysis and by evaluating multiple approximations both directly and as a component of two real-world models. Using approximate Sinkhorn for unsupervised word embedding alignment enables us to train the model full-batch in a fraction of the time while improving upon the original on average by 3.1 percentage points without any model changes. For graph distance regression we propose the graph transport network (GTN), which combines graph neural networks (GNNs) with enhanced Sinkhorn and outcompetes previous models by 48%. 
LCN-Sinkhorn enables GTN to achieve this while still scaling log-linearly in the number of nodes.",/pdf/789d60b05160bb7cfb43897e75a7a1fe26aefa04.pdf,ICLR,2021,"We propose the locally corrected Nyström (LCN) method for kernels, develop two fast approximations of entropy-regularized optimal transport (sparse Sinkhorn and LCN-Sinkhorn) and evaluate them for embedding alignment and graph distance regression." +0pxiMpCyBtr,bXNYXBtFBj,1601310000000.0,1611610000000.0,1044,Monotonic Kronecker-Factored Lattice,"[""~William_Taylor_Bakst1"", ""~Nobuyuki_Morioka1"", ""~Erez_Louidor1""]","[""William Taylor Bakst"", ""Nobuyuki Morioka"", ""Erez Louidor""]","[""Theory"", ""Regularization"", ""Algorithms"", ""Classification"", ""Regression"", ""Matrix and Tensor Factorization"", ""Fairness"", ""Evaluation"", ""Efficiency"", ""Machine Learning""]","It is computationally challenging to learn flexible monotonic functions that guarantee model behavior and provide interpretability beyond a few input features, and in a time where minimizing resource use is increasingly important, we must be able to learn such models that are still efficient. In this paper we show how to effectively and efficiently learn such functions using Kronecker-Factored Lattice ($\mathrm{KFL}$), an efficient reparameterization of flexible monotonic lattice regression via Kronecker product. Both computational and storage costs scale linearly in the number of input features, which is a significant improvement over existing methods that grow exponentially. We also show that we can still properly enforce monotonicity and other shape constraints. The $\mathrm{KFL}$ function class consists of products of piecewise-linear functions, and the size of the function class can be further increased through ensembling. We prove that the function class of an ensemble of $M$ base $\mathrm{KFL}$ models strictly increases as $M$ increases up to a certain threshold. 
Beyond this threshold, every multilinear interpolated lattice function can be expressed. Our experimental results demonstrate that $\mathrm{KFL}$ trains faster with fewer parameters while still achieving accuracy and evaluation speeds comparable to or better than the baseline methods and preserving monotonicity guarantees on the learned model.",/pdf/86fd2e9cf39fac3f9c7f37e71683f60eb7e3e575.pdf,ICLR,2021,"We show how to effectively and efficiently learn flexible and interpretable monotonic functions using Kronecker-Factored Lattice, an efficient reparameterization of flexible monotonic lattice regression via Kronecker product." +_MxHo0GHsH6,sfgWQw5Lg2W,1601310000000.0,1614990000000.0,1489,Once Quantized for All: Progressively Searching for Quantized Compact Models,"[""~Mingzhu_Shen1"", ""~Feng_Liang3"", ""~Chuming_Li1"", ""~Chen_Lin2"", ""~Ming_Sun4"", ""~Junjie_Yan4"", ""~Wanli_Ouyang1""]","[""Mingzhu Shen"", ""Feng Liang"", ""Chuming Li"", ""Chen Lin"", ""Ming Sun"", ""Junjie Yan"", ""Wanli Ouyang""]","[""quantized neural networks"", ""network architecture search"", ""compact models""]","Automatic search of Quantized Neural Networks (QNN) has attracted a lot of attention. However, the existing quantization-aware Neural Architecture Search (NAS) approaches inherit a two-stage search-retrain schema, which is not only time-consuming but also adversely affected by the unreliable ranking of architectures during the search. To avoid the undesirable effect of the search-retrain schema, we present Once Quantized for All (OQA), a novel framework that searches for quantized compact models and deploys their quantized weights at the same time without additional post-process. While supporting a huge architecture search space, our OQA can produce a series of quantized compact models under ultra-low bit-widths(e.g. 4/3/2 bit). A progressive bit inheritance procedure is introduced to support ultra-low bit-width. 
Our searched model family, OQANets, achieves a new state-of-the-art (SOTA) on quantized compact models compared with various quantization methods and bit-widths. In particular, OQA2bit-L achieves 64.0\% ImageNet Top-1 accuracy, outperforming its 2 bit counterpart EfficientNet-B0@QKD by a large margin of 14\% using 30\% less computation cost. ",/pdf/fe550d0a8a5bf9b246a3b578132364a9703b4edf.pdf,ICLR,2021, +BylVcTNtDS,B1gzt9pDPB,1569440000000.0,1583910000000.0,702,A Target-Agnostic Attack on Deep Models: Exploiting Security Vulnerabilities of Transfer Learning,"[""srezaei@ucdavis.edu"", ""xinliu@ucdavis.edu""]","[""Shahbaz Rezaei"", ""Xin Liu""]","[""Machine learning security"", ""Transfer learning"", ""deep learning security"", ""Softmax Vulnerability"", ""Transfer learning Security""]","Due to insufficient training data and the high computational cost to train a deep neural network from scratch, transfer learning has been extensively used in many deep-neural-network-based applications. A commonly used transfer learning approach involves taking a part of a pre-trained model, adding a few layers at the end, and re-training the new layers with a small dataset. This approach, while efficient and widely used, imposes a security vulnerability because the pre-trained model used in transfer learning is usually publicly available, including to potential attackers. In this paper, we show that without any additional knowledge other than the pre-trained model, an attacker can launch an effective and efficient brute force attack that can craft instances of input to trigger each target class with high confidence. We assume that the attacker has no access to any target-specific information, including samples from target classes, re-trained model, and probabilities assigned by Softmax to each class, and thus making the attack target-agnostic. These assumptions render all previous attack models inapplicable, to the best of our knowledge. 
To evaluate the proposed attack, we perform a set of experiments on face recognition and speech recognition tasks and show the effectiveness of the attack. Our work reveals a fundamental security weakness of the Softmax layer when used in transfer learning settings.",/pdf/114cb3e7e93573002951610e4809837571a3939f.pdf,ICLR,2020, +7I12hXRi8F,Pn8tcQA1l0x,1601310000000.0,1616030000000.0,1394,ANOCE: Analysis of Causal Effects with Multiple Mediators via Constrained Structural Learning,"[""~Hengrui_Cai1"", ""~Rui_Song2"", ""wlu4@ncsu.edu""]","[""Hengrui Cai"", ""Rui Song"", ""Wenbin Lu""]","[""Causal network"", ""Constrained optimization"", ""COVID-19"", ""Individual mediation effects"", ""Structure learning""]","In the era of causal revolution, identifying the causal effect of an exposure on the outcome of interest is an important problem in many areas, such as epidemics, medicine, genetics, and economics. Under a general causal graph, the exposure may have a direct effect on the outcome and also an indirect effect regulated by a set of mediators. An analysis of causal effects that interprets the causal mechanism contributed through mediators is hence challenging but on demand. To the best of our knowledge, there are no feasible algorithms that give an exact decomposition of the indirect effect on the level of individual mediators, due to common interaction among mediators in the complex graph. In this paper, we establish a new statistical framework to comprehensively characterize causal effects with multiple mediators, namely, ANalysis Of Causal Effects (ANOCE), with a newly introduced definition of the mediator effect, under the linear structure equation model. We further propose a constrained causal structure learning method by incorporating a novel identification constraint that specifies the temporal causal relationship of variables. 
The proposed algorithm is applied to investigate the causal effects of 2020 Hubei lockdowns on reducing the spread of the coronavirus in Chinese major cities out of Hubei. ",/pdf/8142413a92e7df5fb79598dee863640346d53f5b.pdf,ICLR,2021,"Analysis of causal effects on the level of individual mediators via constrained structural learning, with application to the COVID-19 Spread in China." +r1d-lFmO-cM,a6jDpCdQ5s7,1601310000000.0,1614990000000.0,2583,Pointwise Binary Classification with Pairwise Confidence Comparisons,"[""~Lei_Feng1"", ""ssl2018@email.swu.edu.cn"", ""~Nan_Lu1"", ""~Bo_Han1"", ""~Miao_Xu1"", ""~Gang_Niu1"", ""~Bo_An2"", ""~Masashi_Sugiyama1""]","[""Lei Feng"", ""Senlin Shu"", ""Nan Lu"", ""Bo Han"", ""Miao Xu"", ""Gang Niu"", ""Bo An"", ""Masashi Sugiyama""]","[""Binary classification"", ""pairwise comparisons"", ""unbiased risk estimator""]","Ordinary (pointwise) binary classification aims to learn a binary classifier from pointwise labeled data. However, such pointwise labels may not be directly accessible due to privacy, confidentiality, or security considerations. In this case, can we still learn an accurate binary classifier? This paper proposes a novel setting, namely pairwise comparison (Pcomp) classification, where we are given only pairs of unlabeled data that we know one is more likely to be positive than the other, instead of pointwise labeled data. Compared with pointwise labels, pairwise comparisons are easier to collect, and Pcomp classification is useful for subjective classification tasks. To solve this problem, we present a mathematical formulation for the generation process of pairwise comparison data, based on which we exploit an unbiased risk estimator (URE) to train a binary classifier by empirical risk minimization and establish an estimation error bound. We first prove that a URE can be derived and improve it using correction functions. 
Then, we start from the noisy-label learning perspective to introduce a progressive URE and improve it by imposing consistency regularization. Finally, experiments validate the effectiveness of our proposed solutions for Pcomp classification.",/pdf/a03325082d6b6d7c946f2b7caecb9a811e4a1df8.pdf,ICLR,2021,We can successfully learn a binary classifier from only pairwise comparison data. +ByCPHrgCW,SJAwSSl0W,1509080000000.0,1518730000000.0,255,Deep Learning Inferences with Hybrid Homomorphic Encryption,"[""anthonymeehan@anthonymeehan.com"", ""ryan.ko@waikato.ac.nz"", ""geoff@waikato.ac.nz""]","[""Anthony Meehan"", ""Ryan K L Ko"", ""Geoff Holmes""]","[""deep learning"", ""homomorphic encryption"", ""hybrid homomorphic encryption"", ""privacy preserving"", ""representation learning"", ""neural networks""]","When deep learning is applied to sensitive data sets, many privacy-related implementation issues arise. These issues are especially evident in the healthcare, finance, law and government industries. Homomorphic encryption could allow a server to make inferences on inputs encrypted by a client, but to our best knowledge, there has been no complete implementation of common deep learning operations, for arbitrary model depths, using homomorphic encryption. This paper demonstrates a novel approach, efficiently implementing many deep learning functions with bootstrapped homomorphic encryption. As part of our implementation, we demonstrate Single and Multi-Layer Neural Networks, for the Wisconsin Breast Cancer dataset, as well as a Convolutional Neural Network for MNIST. Our results give promising directions for privacy-preserving representation learning, and the return of data control to users. + +",/pdf/43dae54cad481416104f335c5fb25cd47ca4a89d.pdf,ICLR,2018,"We made a feature-rich system for deep learning with encrypted inputs, producing encrypted outputs, preserving privacy." 
+BJeguTEKDB,rygXI5oPwB,1569440000000.0,1577170000000.0,620,INSTANCE CROSS ENTROPY FOR DEEP METRIC LEARNING,"[""xwang39@qub.ac.uk"", ""elyor@anyvision.co"", ""y.hua@qub.ac.uk"", ""n.robertson@qub.ac.uk""]","[""Xinshao Wang"", ""Elyor Kodirov"", ""Yang Hua"", ""Neil M. Robertson""]","[""Deep Metric Learning"", ""Instance Cross Entropy"", ""Sample Mining/Weighting"", ""Image Retrieval""]","Loss functions play a crucial role in deep metric learning thus a variety of them have been proposed. Some supervise the learning process by pairwise or tripletwise similarity constraints while others take the advantage of structured similarity information among multiple data points. In this work, we approach deep metric learning from a novel perspective. We propose instance cross entropy (ICE) which measures the difference between an estimated instance-level matching distribution and its ground-truth one. ICE has three main appealing properties. Firstly, similar to categorical cross entropy (CCE), ICE has clear probabilistic interpretation and exploits structured semantic similarity information for learning supervision. Secondly, ICE is scalable to infinite training data as it learns on mini-batches iteratively and is independent of the training set size. Thirdly, motivated by our relative weight analysis, seamless sample reweighting is incorporated. It rescales samples’ gradients to control the differentiation degree over training examples instead of truncating them by sample mining. In addition to its simplicity and intuitiveness, extensive experiments on three real-world benchmarks demonstrate the superiority of ICE.",/pdf/40457401cfa8b248e2c81a59fda28f0daf87343d.pdf,ICLR,2020,We propose instance cross entropy (ICE) which measures the difference between an estimated instance-level matching distribution and its ground-truth one. 
+yN18f9V1Onp,NS-JjF_9c0M,1601310000000.0,1614990000000.0,2462,Adaptive Learning Rates for Multi-Agent Reinforcement Learning,"[""~Jiechuan_Jiang1"", ""~Zongqing_Lu2""]","[""Jiechuan Jiang"", ""Zongqing Lu""]",[],"In multi-agent reinforcement learning (MARL), the learning rates of actors and critic are mostly hand-tuned and fixed. This not only requires heavy tuning but more importantly limits the learning. With adaptive learning rates according to gradient patterns, some optimizers have been proposed for general optimizations, which however do not take into consideration the characteristics of MARL. In this paper, we propose AdaMa to bring adaptive learning rates to cooperative MARL. AdaMa evaluates the contribution of actors' updates to the improvement of Q-value and adaptively updates the learning rates of actors to the direction of maximally improving the Q-value. AdaMa could also dynamically balance the learning rates between the critic and actors according to their varying effects on the learning. Moreover, AdaMa can incorporate the second-order approximation to capture the contribution of pairwise actors' updates and thus more accurately updates the learning rates of actors. 
Empirically, we show that AdaMa could accelerate the learning and improve the performance in a variety of multi-agent scenarios, and the visualizations of learning rates during training clearly explain how and why AdaMa works.",/pdf/99ec715da1bcb90a88454b995eb8b88aa40767b3.pdf,ICLR,2021, +rJeqeCEtvH,BkeI0tXuDr,1569440000000.0,1583910000000.0,939,Semi-Supervised Generative Modeling for Controllable Speech Synthesis,"[""raza.habib@cs.ucl.ac.uk"", ""soroosh@google.com"", ""mattshannon@google.com"", ""ebattenberg@google.com"", ""rjryan@google.com"", ""daisy@google.com"", ""davidkao@google.com"", ""tombagby@google.com""]","[""Raza Habib"", ""Soroosh Mariooryad"", ""Matt Shannon"", ""Eric Battenberg"", ""RJ Skerry-Ryan"", ""Daisy Stanton"", ""David Kao"", ""Tom Bagby""]","[""TTS"", ""Speech Synthesis"", ""Semi-supervised Models"", ""VAE"", ""disentanglement""]","We present a novel generative model that combines state-of-the-art neural text-to-speech (TTS) with semi-supervised probabilistic latent variable models. By providing partial supervision to some of the latent variables, we are able to force them to take on consistent and interpretable purposes, which previously hasn’t been possible with purely unsupervised methods. We demonstrate that our model is able to reliably discover and control important but rarely labelled attributes of speech, such as affect and speaking rate, with as little as 1% (30 minutes) supervision. Even at such low supervision levels we do not observe a degradation of synthesis quality compared to a state-of-the-art baseline. 
We will release audio samples at https://google.github.io/tacotron/publications/semisupervised_generative_modeling_for_controllable_speech_synthesis/.",/pdf/a116d6d9c6bc6d9d5854d7f6af9e9191ca298d7f.pdf,ICLR,2020, +HyxG3p4twS,BygZBayuPS,1569440000000.0,1583910000000.0,772,Quantifying the Cost of Reliable Photo Authentication via High-Performance Learned Lossy Representations,"[""pkorus@nyu.edu"", ""memon@nyu.edu""]","[""Pawel Korus"", ""Nasir Memon""]","[""image forensics"", ""photo manipulation detection"", ""learned compression"", ""lossy compression"", ""image compression"", ""entropy estimation""]","Detection of photo manipulation relies on subtle statistical traces, notoriously removed by aggressive lossy compression employed online. We demonstrate that end-to-end modeling of complex photo dissemination channels allows for codec optimization with explicit provenance objectives. We design a lightweight trainable lossy image codec, that delivers competitive rate-distortion performance, on par with best hand-engineered alternatives, but has lower computational footprint on modern GPU-enabled platforms. Our results show that significant improvements in manipulation detection accuracy are possible at fractional costs in bandwidth/storage. Our codec improved the accuracy from 37% to 86% even at very low bit-rates, well below the practicality of JPEG (QF 20). ",/pdf/32e1296b257a0494d8ad34f00f7f657df8ab9455.pdf,ICLR,2020,We learn an efficient lossy image codec that can be optimized to facilitate reliable photo manipulation detection at fractional cost in payload/quality and even at low bitrates. 
+S1lslCEYPB,BkxNfc7dvH,1569440000000.0,1577170000000.0,940,Improved Mutual Information Estimation,"[""mroueh@us.ibm.com"", ""igor.melnyk@ibm.com"", ""pdognin@us.ibm.com"", ""rossja@us.ibm.com"", ""tom.sercu@gmail.com""]","[""Youssef Mroueh*"", ""Igor Melnyk*"", ""Pierre Dognin*"", ""Jerret Ross*"", ""Tom Sercu*""]","[""mutual information"", ""variational bound"", ""kernel methods"", ""Neural estimators"", ""mutual information maximization"", ""self-supervised learning""]","We propose a new variational lower bound on the KL divergence and show that the Mutual Information (MI) can be estimated by maximizing this bound using a witness function on a hypothesis function class and an auxiliary scalar variable. If the function class is in a Reproducing Kernel Hilbert Space (RKHS), this leads to a jointly convex problem. We analyze the bound by deriving its dual formulation and show its connection to a likelihood ratio estimation problem. We show that the auxiliary variable introduced in our variational form plays the role of a Lagrange multiplier that enforces a normalization constraint on the likelihood ratio. By extending the function space to neural networks, we propose an efficient neural MI estimator, and validate its performance on synthetic examples, showing advantage over the existing baselines. We then demonstrate the strength of our estimator in large-scale self-supervised representation learning through MI maximization.",/pdf/ba795fa0154b3b9759161fbbb7af45dcfb6aa749.pdf,ICLR,2020,we propose a new variational bound for estimating mutual information and show the strength of our estimator in large-scale self-supervised representation learning through MI maximization. 
+WZnVnlFBKFj,XuIxTKaAcxA,1601310000000.0,1614990000000.0,2472,Federated Learning With Quantized Global Model Updates,"[""~Mohammad_Mohammadi_Amiri1"", ""d.gunduz@imperial.ac.uk"", ""~Sanjeev_Kulkarni1"", ""~H._Vincent_Poor1""]","[""Mohammad Mohammadi Amiri"", ""Deniz Gunduz"", ""Sanjeev Kulkarni"", ""H. Vincent Poor""]","[""Federated learning"", ""lossy broadcasting""]","We study federated learning (FL), which enables mobile devices to utilize their local datasets to collaboratively train a global model with the help of a central server, while keeping data localized. At each iteration, the server broadcasts the current global model to the devices for local training, and aggregates the local model updates from the devices to update the global model. Previous work on the communication efficiency of FL has mainly focused on the aggregation of model updates from the devices, assuming perfect broadcasting of the global model. In this paper, we instead consider broadcasting a compressed version of the global model. This is to further reduce the communication cost of FL, which can be particularly limited when the global model is to be transmitted over a wireless medium. We introduce a lossy FL (LFL) algorithm, in which both the global model and the local model updates are quantized before being transmitted. We analyze the convergence behavior of the proposed LFL algorithm assuming the availability of accurate local model updates at the server. Numerical experiments show that the proposed LFL scheme, which quantizes the global model update (with respect to the global model estimate at the devices) rather than the global model itself, significantly outperforms other existing schemes studying quantization of the global model at the PS-to-device direction. 
Also, the performance loss of the proposed scheme is marginal compared to the fully lossless approach, where the PS and the devices transmit their messages entirely without any quantization.",/pdf/90aa6a2ba00aaa9e52a6416d6784fda394168940.pdf,ICLR,2021,We study federated learning where the global model update with significantly less variability/variance compared to the global model itself is quantized and broadcast by the parameter server to the devices for local training. +rkgOlCVYvB,SylyCuX_Pr,1569440000000.0,1583910000000.0,935,Pure and Spurious Critical Points: a Geometric Study of Linear Networks,"[""matthew.trager@cims.nyu.edu"", ""kathlen.korn@gmail.com"", ""bruna@cims.nyu.edu""]","[""Matthew Trager"", ""Kathl\u00e9n Kohn"", ""Joan Bruna""]","[""Loss landscape"", ""linear networks"", ""algebraic geometry""]","The critical locus of the loss function of a neural network is determined by the geometry of the functional space and by the parameterization of this space by the network's weights. We introduce a natural distinction between pure critical points, which only depend on the functional space, and spurious critical points, which arise from the parameterization. We apply this perspective to revisit and extend the literature on the loss function of linear neural networks. For this type of network, the functional space is either the set of all linear maps from input to output space, or a determinantal variety, i.e., a set of linear maps with bounded rank. We use geometric properties of determinantal varieties to derive new results on the landscape of linear networks with different loss functions and different parameterizations. 
Our analysis clearly illustrates that the absence of ""bad"" local minima in the loss landscape of linear networks is due to two distinct phenomena that apply in different settings: it is true for arbitrary smooth convex losses in the case of architectures that can express all linear maps (""filling architectures"") but it holds only for the quadratic loss when the functional space is a determinantal variety (""non-filling architectures""). Without any assumption on the architecture, smooth convex losses may lead to landscapes with many bad minima.",/pdf/f203c690f327bdbb7ad7f08538a96a545db208e5.pdf,ICLR,2020, +MbM_gvIB3Y4,8gA_pBtio3m,1601310000000.0,1614990000000.0,446,Which Mutual-Information Representation Learning Objectives are Sufficient for Control?,"[""~Kate_Rakelly1"", ""~Abhishek_Gupta1"", ""~Carlos_Florensa1"", ""~Sergey_Levine1""]","[""Kate Rakelly"", ""Abhishek Gupta"", ""Carlos Florensa"", ""Sergey Levine""]","[""representation learning"", ""reinforcement learning"", ""information theory""]","Mutual information maximization provides an appealing formalism for learning representations of data. In the context of reinforcement learning, such representations can accelerate learning by discarding irrelevant and redundant information, while retaining the information necessary for control. Much of the prior work on these methods has addressed the practical difficulties of estimating mutual information from samples of high-dimensional observations, while comparatively less is understood about \emph{which} mutual information objectives are sufficient for RL from a theoretical perspective. In this paper we identify conditions under which representations that maximize specific mutual-information objectives are theoretically sufficient for learning and representing the optimal policy. Somewhat surprisingly, we find that several popular objectives can yield insufficient representations given mild and common assumptions on the structure of the MDP. 
We corroborate our theoretical results with deep RL experiments on a simulated game environment with visual observations.",/pdf/649f2e2e61154201d47262476a777ccadb967c02.pdf,ICLR,2021,We examine whether popular MI-based representation learning objectives for RL yield state representations sufficient for learning and representing optimal control policies +EoVmlONgI9e,tMtp4ANoL8H,1601310000000.0,1614990000000.0,3011,The Emergence of Individuality in Multi-Agent Reinforcement Learning,"[""~Jiechuan_Jiang1"", ""~Zongqing_Lu2""]","[""Jiechuan Jiang"", ""Zongqing Lu""]",[],"Individuality is essential in human society, which induces the division of labor and thus improves the efficiency and productivity. Similarly, it should also be a key to multi-agent cooperation. Inspired by that individuality is of being an individual separate from others, we propose a simple yet efficient method for the emergence of individuality (EOI) in multi-agent reinforcement learning (MARL). EOI learns a probabilistic classifier that predicts a probability distribution over agents given their observation and gives each agent an intrinsic reward of being correctly predicted by the classifier. The intrinsic reward encourages the agents to visit their own familiar observations, and learning the classifier by such observations makes the intrinsic reward signals stronger and in turn makes the agents more identifiable. To further enhance the intrinsic reward and promote the emergence of individuality, two regularizers are proposed to increase the discriminability of the classifier. We implement EOI on top of popular MARL algorithms. 
Empirically, we show that EOI outperforms existing methods in a variety of multi-agent cooperative scenarios.",/pdf/41daabb792b66e7f5890b043219257882e5dd716.pdf,ICLR,2021, +ryf7ioRqFX,SylUa_35tQ,1538090000000.0,1547580000000.0,609,h-detach: Modifying the LSTM Gradient Towards Better Optimization,"[""bhargavkanuparthi25@gmail.com"", ""devansharpit@gmail.com"", ""giancarlo.kerg@gmail.com"", ""rosemary.nan.ke@gmail.com"", ""ioannis@iro.umontreal.ca"", ""yoshua.umontreal@gmail.com""]","[""Bhargav Kanuparthi"", ""Devansh Arpit"", ""Giancarlo Kerg"", ""Nan Rosemary Ke"", ""Ioannis Mitliagkas"", ""Yoshua Bengio""]","[""LSTM"", ""Optimization"", ""Long term dependencies"", ""Back-propagation through time""]","Recurrent neural networks are known for their notorious exploding and vanishing gradient problem (EVGP). This problem becomes more evident in tasks where the information needed to correctly solve them exist over long time scales, because EVGP prevents important gradient components from being back-propagated adequately over a large number of steps. We introduce a simple stochastic algorithm (\textit{h}-detach) that is specific to LSTM optimization and targeted towards addressing this problem. Specifically, we show that when the LSTM weights are large, the gradient components through the linear path (cell state) in the LSTM computational graph get suppressed. Based on the hypothesis that these components carry information about long term dependencies (which we show empirically), their suppression can prevent LSTMs from capturing them. Our algorithm\footnote{Our code is available at https://github.com/bhargav104/h-detach.} prevents gradients flowing through this path from getting suppressed, thus allowing the LSTM to capture such dependencies better. 
We show significant improvements over vanilla LSTM gradient based training in terms of convergence speed, robustness to seed and learning rate, and generalization using our modification of LSTM gradient on various benchmark datasets.",/pdf/12ad6196127f084a8e473d83361f0697b9673d4b.pdf,ICLR,2019,A simple algorithm to improve optimization and handling of long term dependencies in LSTM +rJfW5oA5KQ,H1xvmhCFtm,1538090000000.0,1550710000000.0,509,Approximability of Discriminators Implies Diversity in GANs,"[""yub@stanford.edu"", ""tengyuma@stanford.edu"", ""risteski@mit.edu""]","[""Yu Bai"", ""Tengyu Ma"", ""Andrej Risteski""]","[""Theory"", ""Generative adversarial networks"", ""Mode collapse"", ""Generalization""]","While Generative Adversarial Networks (GANs) have empirically produced impressive results on learning complex real-world distributions, recent works have shown that they suffer from lack of diversity or mode collapse. The theoretical work of Arora et al. (2017a) suggests a dilemma about GANs’ statistical properties: powerful discriminators cause overfitting, whereas weak discriminators cannot detect mode collapse. +By contrast, we show in this paper that GANs can in principle learn distributions in Wasserstein distance (or KL-divergence in many cases) with polynomial sample complexity, if the discriminator class has strong distinguishing power against the particular generator class (instead of against all possible generators). For various generator classes such as mixture of Gaussians, exponential families, and invertible and injective neural networks generators, we design corresponding discriminators (which are often neural nets of specific architectures) such that the Integral Probability Metric (IPM) induced by the discriminators can provably approximate the Wasserstein distance and/or KL-divergence. 
This implies that if the training is successful, then the learned distribution is close to the true distribution in Wasserstein distance or KL divergence, and thus cannot drop modes. Our preliminary experiments show that on synthetic datasets the test IPM is well correlated with KL divergence or the Wasserstein distance, indicating that the lack of diversity in GANs may be caused by the sub-optimality in optimization instead of statistical inefficiency.",/pdf/5c3e41124e34d561123b5b406123f00e8408343f.pdf,ICLR,2019,"GANs can in principle learn distributions sample-efficiently, if the discriminator class is compact and has strong distinguishing power against the particular generator class." +M9hdyCNlWaf,4S4liBBuwe,1601310000000.0,1614990000000.0,3271,Sparse Uncertainty Representation in Deep Learning with Inducing Weights,"[""~Hippolyt_Ritter1"", ""~Martin_Kukla1"", ""~Cheng_Zhang1"", ""~Yingzhen_Li1""]","[""Hippolyt Ritter"", ""Martin Kukla"", ""Cheng Zhang"", ""Yingzhen Li""]","[""Bayesian neural networks"", ""uncertainty estimation"", ""memory efficiency""]","Bayesian neural networks and deep ensembles represent two modern paradigms of uncertainty quantification in deep learning. Yet these approaches struggle to scale mainly due to memory inefficiency issues, since they require parameter storage several times higher than their deterministic counterparts. To address this, we augment the weight matrix of each layer with a small number of inducing weights, thereby projecting the uncertainty quantification into such low dimensional spaces. We further extend Matheron's conditional Gaussian sampling rule to enable fast weight sampling, which enables our inference method to maintain reasonable run-time as compared with ensembles. 
Importantly, our approach achieves competitive performance to the state-of-the-art in prediction and uncertainty estimation tasks with fully connected neural networks and ResNets, while reducing the parameter size to $\leq 47.9\%$ of that of a single neural network. ",/pdf/a3dd664869d916612de797754361eb503a27d4db.pdf,ICLR,2021,"We introduce a parameter-efficient uncertainty quantification framework for deep neural net, results show competitive performances, but the model size is reduced significantly to < half of a single network." +Nc3TJqbcl3,veu_evhUVo8,1601310000000.0,1616000000000.0,1844,Extracting Strong Policies for Robotics Tasks from Zero-Order Trajectory Optimizers,"[""~Cristina_Pinneri1"", ""shambhuraj.sawant@tuebingen.mpg.de"", ""sebastian.blaes@tuebingen.mpg.de"", ""~Georg_Martius1""]","[""Cristina Pinneri"", ""Shambhuraj Sawant"", ""Sebastian Blaes"", ""Georg Martius""]","[""reinforcement learning"", ""zero-order optimization"", ""policy learning"", ""model-based learning"", ""robotics"", ""model predictive control""]","Solving high-dimensional, continuous robotic tasks is a challenging optimization problem. Model-based methods that rely on zero-order optimizers like the cross-entropy method (CEM) have so far shown strong performance and are considered state-of-the-art in the model-based reinforcement learning community. However, this success comes at the cost of high computational complexity, being therefore not suitable for real-time control. In this paper, we propose a technique to jointly optimize the trajectory and distill a policy, which is essential for fast execution in real robotic systems. Our method builds upon standard approaches, like guidance cost and dataset aggregation, and introduces a novel adaptive factor which prevents the optimizer from collapsing to the learner's behavior at the beginning of the training. 
The extracted policies reach unprecedented performance on challenging tasks as making a humanoid stand up and opening a door without reward shaping",/pdf/1e36a9b55c2b184bab1395be47101e4beb882f41.pdf,ICLR,2021,We propose an adaptively guided imitation learning method that is able to extract strong policies for hard robotic tasks from zero-order trajectory optimizers. +Kz42iQirPJI,e_RNnd2W3kY,1601310000000.0,1614990000000.0,2306,Towards Learning to Remember in Meta Learning of Sequential Domains,"[""~Zhenyi_Wang1"", ""~Tiehang_Duan1"", ""~Donglin_Zhan1"", ""~Changyou_Chen1""]","[""Zhenyi Wang"", ""Tiehang Duan"", ""Donglin Zhan"", ""Changyou Chen""]","[""Meta learning"", ""Continual Learning"", ""Sequential Domain Learning""]","Meta-learning has made rapid progress in past years, with recent extensions made to avoid catastrophic forgetting in the learning process, namely continual meta learning. It is desirable to generalize the meta learner’s ability to continuously learn in sequential domains, which is largely unexplored to-date. We found through extensive empirical verification that significant improvement +is needed for current continual learning techniques to be applied in the sequential domain meta learning setting. To tackle the problem, we adapt existing dynamic learning rate adaptation techniques to meta learn both model parameters and learning rates. Adaptation on parameters ensures good generalization performance, while adaptation on learning rates is made to avoid +catastrophic forgetting of past domains. Extensive experiments on a sequence of commonly used real-domain data demonstrate the effectiveness of our proposed method, outperforming current strong baselines in continual learning. 
Our code is made publicly available online (anonymous)",/pdf/0fbddd5480d4c77de368254d4e4c423aa5ac56ae.pdf,ICLR,2021,"First work to investigate learning to remember in meta learning of *sequential domains*, achieving state of the art compared with existing continual learning techniques." +193sEnKY1ij,RzrN6QbRAjF,1601310000000.0,1616060000000.0,1873,No Cost Likelihood Manipulation at Test Time for Making Better Mistakes in Deep Networks,"[""~Shyamgopal_Karthik1"", ""~Ameya_Prabhu1"", ""~Puneet_K._Dokania1"", ""~Vineet_Gandhi2""]","[""Shyamgopal Karthik"", ""Ameya Prabhu"", ""Puneet K. Dokania"", ""Vineet Gandhi""]","[""Hierarchy-Aware Classification"", ""Conditional Risk Minimization"", ""Post-Hoc Correction""]","There has been increasing interest in building deep hierarchy-aware classifiers that aim to quantify and reduce the severity of mistakes, and not just reduce the number of errors. The idea is to exploit the label hierarchy (e.g., the WordNet ontology) and consider graph distances as a proxy for mistake severity. Surprisingly, on examining mistake-severity distributions of the top-1 prediction, we find that current state-of-the-art hierarchy-aware deep classifiers do not always show practical improvement over the standard cross-entropy baseline in making better mistakes. The reason for the reduction in average mistake-severity can be attributed to the increase in low-severity mistakes, which may also explain the noticeable drop in their accuracy. To this end, we use the classical Conditional Risk Minimization (CRM) framework for hierarchy-aware classification. Given a cost matrix and a reliable estimate of likelihoods (obtained from a trained network), CRM simply amends mistakes at inference time; it needs no extra hyperparameters and requires adding just a few lines of code to the standard cross-entropy baseline. 
It significantly outperforms the state-of-the-art and consistently obtains large reductions in the average hierarchical distance of top-$k$ predictions across datasets, with very little loss in accuracy. CRM, because of its simplicity, can be used with any off-the-shelf trained model that provides reliable likelihood estimates.",/pdf/1a3cb262b1a7bf8a9f26a4734b5d77ea7f544937.pdf,ICLR,2021,Conditional risk framework exploiting the label hierarchy outperforms state of the art and makes a strong baseline for future explorations. +HJxiMAVtPH,rJg-0ZSdwS,1569440000000.0,1577170000000.0,1014,Multi-scale Attributed Node Embedding,"[""benedek.rozemberczki@gmail.com"", ""carl.allen@ed.ac.uk"", ""rsarkar@inf.ed.ac.uk""]","[""Benedek Rozemberczki"", ""Carl Allen"", ""Rik Sarkar""]","[""network embedding"", ""graph embedding"", ""node embedding"", ""network science"", ""graph representation learning""]","We present network embedding algorithms that capture information about a node from the local distribution over node attributes around it, as observed over random walks following an approach similar to Skip-gram. Observations from neighborhoods of different sizes are either pooled (AE) or encoded distinctly in a multi-scale approach (MUSAE). Capturing attribute-neighborhood relationships over multiple scales is useful for a diverse range of applications, including latent feature identification across disconnected networks with similar attributes. We prove theoretically that matrices of node-feature pointwise mutual information are implicitly factorized by the embeddings. Experiments show that our algorithms are robust, computationally efficient and outperform comparable models on social, web and citation network datasets.",/pdf/125239ee14eac85f8e0cff16d108f9e76d91b735.pdf,ICLR,2020,We develop efficient multi-scale approximate attributed network embedding procedures with provable properties. 
+H1lJJnR5Ym,S1x-r2T5KX,1538090000000.0,1545360000000.0,950,Exploration by random network distillation,"[""yburda@openai.com"", ""h.l.edwards@sms.ed.ac.uk"", ""a.storkey@ed.ac.uk"", ""oleg@openai.com""]","[""Yuri Burda"", ""Harrison Edwards"", ""Amos Storkey"", ""Oleg Klimov""]","[""reinforcement learning"", ""exploration"", ""curiosity""]","We introduce an exploration bonus for deep reinforcement learning methods that is easy to implement and adds minimal overhead to the computation performed. The bonus is the error of a neural network predicting features of the observations given by a fixed randomly initialized neural network. We also introduce a method to flexibly combine intrinsic and extrinsic rewards. We find that the random network distillation (RND) bonus combined with this increased flexibility enables significant progress on several hard exploration Atari games. In particular we establish state of the art performance on Montezuma's Revenge, a game famously difficult for deep reinforcement learning methods. To the best of our knowledge, this is the first method that achieves better than average human performance on this game without using demonstrations or having access the underlying state of the game, and occasionally completes the first level. This suggests that relatively simple methods that scale well can be sufficient to tackle challenging exploration problems.",/pdf/b8342c1b9d42eacdcc9b94d8e49f990fad834709.pdf,ICLR,2019,A simple exploration bonus is introduced and achieves state of the art performance in 3 hard exploration Atari games. 
+lNrtNGkr-vw,HPXtGdYxNNp,1601310000000.0,1614990000000.0,2858,Linear Representation Meta-Reinforcement Learning for Instant Adaptation,"[""~Matt_Peng1"", ""~Banghua_Zhu1"", ""~Jiantao_Jiao1""]","[""Matt Peng"", ""Banghua Zhu"", ""Jiantao Jiao""]","[""meta reinforcement learning"", ""out-of-distribution"", ""reinforcement learning""]","This paper introduces Fast Linearized Adaptive Policy (FLAP), a new meta-reinforcement learning (meta-RL) method that is able to extrapolate well to out-of-distribution tasks without the need to reuse data from training, and adapt almost instantaneously with the need of only a few samples during testing. FLAP builds upon the idea of learning a shared linear representation of the policy so that when adapting to a new task, it suffices to predict a set of linear weights. A separate adapter network is trained simultaneously with the policy such that during adaptation, we can directly use the adapter network to predict these linear weights instead of updating a meta-policy via gradient descent such as in prior Meta-RL algorithms like MAML to obtain the new policy. The application of the separate feed-forward network not only speeds up the adaptation run-time significantly, but also generalizes extremely well to very different tasks that prior Meta-RL methods fail to generalize to. Experiments on standard continuous-control meta-RL benchmarks show FLAP presenting significantly stronger performance on out-of-distribution tasks with up to double the average return and up to 8X faster adaptation run-time speeds when compared to prior methods.",/pdf/df99a19b6493d498583022684bdef264e98344c2.pdf,ICLR,2021,Our paper proposes a meta-reinforcement learning algorithm that generalizes well to highly extrapolated test tasks with an adaptation process that showcases a significantly reduced run-time. 
+rJl0r3R9KX,HkeaLA39Ym,1538090000000.0,1550890000000.0,1592,Regularized Learning for Domain Adaptation under Label Shifts,"[""kazizzad@uci.edu"", ""anqiliu@caltech.edu"", ""fan.yang@stat.math.ethz.ch"", ""anima@caltech.edu""]","[""Kamyar Azizzadenesheli"", ""Anqi Liu"", ""Fanny Yang"", ""Animashree Anandkumar""]","[""Deep Learning"", ""Domain Adaptation"", ""Label Shift"", ""Importance Weights"", ""Generalization""]","We propose Regularized Learning under Label shifts (RLLS), a principled and a practical domain-adaptation algorithm to correct for shifts in the label distribution between a source and a target domain. We first estimate importance weights using labeled source data and unlabeled target data, and then train a classifier on the weighted source samples. We derive a generalization bound for the classifier on the target domain which is independent of the (ambient) data dimensions, and instead only depends on the complexity of the function class. To the best of our knowledge, this is the first generalization bound for the label-shift problem where the labels in the target domain are not available. Based on this bound, we propose a regularized estimator for the small-sample regime which accounts for the uncertainty in the estimated weights. 
Experiments on the CIFAR-10 and MNIST datasets show that RLLS improves classification accuracy, especially in the low sample and large-shift regimes, compared to previous methods.",/pdf/c208d9ebf9a9e827c96b8052567fe6c5878f4b3e.pdf,ICLR,2019,A practical and provably guaranteed approach for training efficiently classifiers in the presence of label shifts between Source and Target data sets +R2ZlTVPx0Gk,nNhOlHRKKrX,1601310000000.0,1614800000000.0,205,DICE: Diversity in Deep Ensembles via Conditional Redundancy Adversarial Estimation,"[""~Alexandre_Rame1"", ""~Matthieu_Cord1""]","[""Alexandre Rame"", ""Matthieu Cord""]","[""Deep Learning"", ""Deep Ensembles"", ""Information Theory"", ""Information Bottleneck"", ""Adversarial Learning""]","Deep ensembles perform better than a single network thanks to the diversity among their members. Recent approaches regularize predictions to increase diversity; however, they also drastically decrease individual members’ performances. In this paper, we argue that learning strategies for deep ensembles need to tackle the trade-off between ensemble diversity and individual accuracies. Motivated by arguments from information theory and leveraging recent advances in neural estimation of conditional mutual information, we introduce a novel training criterion called DICE: it increases diversity by reducing spurious correlations among features. The main idea is that features extracted from pairs of members should only share information useful for target class prediction without being conditionally redundant. Therefore, besides the classification loss with information bottleneck, we adversarially prevent features from being conditionally predictable from each other. We manage to reduce simultaneous errors while protecting class information. We obtain state-of-the-art accuracy results on CIFAR-10/100: for example, an ensemble of 5 networks trained with DICE matches an ensemble of 7 networks trained independently. 
We further analyze the consequences on calibration, uncertainty estimation, out-of-distribution detection and online co-distillation.",/pdf/d1c240af3d07df28ebfbd88feefb18b1e55be44e.pdf,ICLR,2021,"Driven by arguments from information theory, we introduce a new learning strategy for deep ensembles that increases diversity among members: we adversarially prevent features from being conditionally redundant, i.e., predictable from each other." +B1l2bp4YwS,ryec85_UvH,1569440000000.0,1583910000000.0,390,What graph neural networks cannot learn: depth vs width,"[""andreas.loukas@epfl.ch""]","[""Andreas Loukas""]","[""graph neural networks"", ""capacity"", ""impossibility results"", ""lower bounds"", ""expressive power""]","This paper studies the expressive power of graph neural networks falling within the message-passing framework (GNNmp). Two results are presented. First, GNNmp are shown to be Turing universal under sufficient conditions on their depth, width, node attributes, and layer expressiveness. Second, it is discovered that GNNmp can lose a significant portion of their power when their depth and width is restricted. The proposed impossibility statements stem from a new technique that enables the repurposing of seminal results from distributed computing and leads to lower bounds for an array of decision, optimization, and estimation problems involving graphs. Strikingly, several of these problems are deemed impossible unless the product of a GNNmp's depth and width exceeds a polynomial of the graph size; this dependence remains significant even for tasks that appear simple or when considering approximation.",/pdf/1deea7b0fa3c20f142b229f06af2471086407471.pdf,ICLR,2020,Several graph problems are impossible unless the product of a graph neural network's depth and width exceeds a polynomial of the graph size. 
+HJeIU0VYwB,SkxTebwuwr,1569440000000.0,1577170000000.0,1140,ADA+: A GENERIC FRAMEWORK WITH MORE ADAPTIVE EXPLICIT ADJUSTMENT FOR LEARNING RATE,"[""oasis.random.time@gmail.com"", ""xiangsheng.huang@ia.ac.cn"", ""2015019051@mail.buct.edu.cn""]","[""Yue Zhao"", ""Xiangsheng Huang"", ""Ludan Kou""]","[""Optimization"", ""Adaptive Methods"", ""Convergence"", ""Convolutional Neural Network""]","Although adaptive algorithms have achieved significant success in training deep neural networks with faster training speed, they tend to have poor generalization performance compared to SGD with Momentum (SGDM). One of the state-of-the-art algorithms, PADAM, is proposed to close the generalization gap of adaptive methods while lacking an internal explanation. This work proposes a general framework, in which we use an explicit function Φ(·) as an adjustment to the actual step size, and present a more adaptive specific form AdaPlus (Ada+). Based on this framework, we analyze various behaviors brought by different types of Φ(·), such as a constant function in SGDM, a linear function in Adam, a concave function in Padam and a concave function with offset term in AdaPlus. Empirically, we conduct experiments on classic benchmarks both in CNN and RNN architectures and achieve better performance (even than SGDM). +",/pdf/275ad1a553639cbc285ab8fa660944b8a10b0715.pdf,ICLR,2020,"This work proposes a novel generic framework, in which we explicitly analyze different behaviors brought by various types of Φ(·), and based on the framework we propose a more adaptive optimization algorithm." 
+Bkeuz20cYm,BklK1fqcY7,1538090000000.0,1545360000000.0,1280,Double Neural Counterfactual Regret Minimization,"[""ken.lh@antfin.com"", ""hkl163251@antfin.com"", ""zhibang.zg@antfin.com"", ""lvshan.jt@antfin.com"", ""yuan.qi@antfin.com"", ""lsong@cc.gatech.edu""]","[""Hui Li"", ""Kailiang Hu"", ""Zhibang Ge"", ""Tao Jiang"", ""Yuan Qi"", ""Le Song""]","[""Counterfactual Regret Minimization"", ""Imperfect Information game""]","Counterfactual regret minimization (CRF) is a fundamental and effective technique for solving imperfect information games. However, the original CRF algorithm only works for discrete state and action spaces, and the resulting strategy is maintained as a tabular representation. Such tabular representation limits the method from being directly applied to large games and continuing to improve from a poor strategy profile. In this paper, we propose a double neural representation for the Imperfect Information Games, where one neural network represents the cumulative regret, and the other represents the average strategy. Furthermore, we adopt the counterfactual regret minimization algorithm to optimize this double neural representation. To make neural learning efficient, we also developed several novel techniques including a robust sampling method, mini-batch Monte Carlo counterfactual regret minimization (MCCFR) and Monte Carlo counterfactual regret minimization plus (MCCFR+) which may be of independent interests. Experimentally, we demonstrate that the proposed double neural algorithm converges significantly better than the reinforcement learning counterpart. ",/pdf/67caca7efc9ecaa8d75848c9f3036d4460bf5624.pdf,ICLR,2019,We proposed a double neural CFR which can match the performance of tabular based CFR and opens up the possibility for a purely neural approach to directly solve large imperfect information game. 
+Ux5zdAir9-U,HNXMnKYDaYc,1601310000000.0,1614990000000.0,2236,GraphLog: A Benchmark for Measuring Logical Generalization in Graph Neural Networks,"[""~Koustuv_Sinha1"", ""~Shagun_Sodhani1"", ""~Joelle_Pineau1"", ""~William_L._Hamilton1""]","[""Koustuv Sinha"", ""Shagun Sodhani"", ""Joelle Pineau"", ""William L. Hamilton""]","[""graph neural networks"", ""dataset"", ""benchmark"", ""logic""]","Relational inductive biases have a key role in building learning agents that can generalize and reason in a compositional manner. While relational learning algorithms such as graph neural networks (GNNs) show promise, we do not understand their effectiveness to adapt to new tasks. In this work, we study the logical generalization capabilities of GNNs by designing a benchmark suite grounded in first-order logic. Our benchmark suite, GraphLog, requires that learning algorithms perform rule induction in different synthetic logics, represented as knowledge graphs. +GraphLog consists of relation prediction tasks on 57 distinct procedurally generated logical worlds. We use GraphLog to evaluate GNNs in three different setups: single-task supervised learning, multi-task (with pretraining), and continual learning. Unlike previous benchmarks, GraphLog enables us to precisely control the logical relationship between the different worlds by controlling the underlying first-order logic rules. We find that models' ability to generalize and adapt strongly correlates to the availability of diverse sets of logical rules during multi-task training. We also find the severe catastrophic forgetting effect in continual learning scenarios, and GraphLog provides a precise mechanism to control the distribution shift. 
Overall, our results highlight new challenges for the design of GNN models, opening up an exciting area of research in generalization using graph-structured data.",/pdf/8e08a66555cca2c5187d83292e6e394e24798dc4.pdf,ICLR,2021, +HJF3iD9xe,,1478290000000.0,1484690000000.0,426,Deep Learning with Sets and Point Clouds,"[""mravanba@cs.cmu.edu"", ""bapoczos@cs.cmu.edu"", ""jeff.schneider@cs.cmu.edu""]","[""Siamak Ravanbakhsh"", ""Jeff Schneider"", ""Barnabas Poczos""]","[""Deep learning"", ""Structured prediction"", ""Computer vision"", ""Supervised Learning"", ""Semi-Supervised Learning""]","We introduce a simple permutation equivariant layer for deep learning with set structure. This type of layer, obtained by parameter-sharing, has a simple implementation and linear-time complexity in the size of each set. We use deep permutation-invariant networks to perform point-cloud classification and MNIST digit summation, where in both cases the output is invariant to permutations of the input. In a semi-supervised setting, where the goal is to make predictions for each instance within a set, we demonstrate the usefulness of this type of layer in set-outlier detection as well as semi-supervised learning with clustering side-information.",/pdf/62d4b9419407ca08fcbb011eaf4daca8fbb7bda6.pdf,ICLR,2017,Parameter-sharing for permutation-equivariance and invariance with applications to point-cloud classification. 
+BJeRykBKDH,BJeByxiuvS,1569440000000.0,1577170000000.0,1487,Empowering Graph Representation Learning with Paired Training and Graph Co-Attention,"[""deacandr@mila.quebec"", ""huang.yu-hsiang@courrier.uqam.ca"", ""petar.velickovic@cst.cam.ac.uk"", ""pl219@cam.ac.uk"", ""jian.tang@hec.ca""]","[""Andreea Deac"", ""Yu-Hsiang Huang"", ""Petar Velickovic"", ""Pietro Lio"", ""Jian Tang""]","[""graph neural networks"", ""graph co-attention"", ""paired graphs"", ""molecular properties"", ""drug-drug interaction""]","Through many recent advances in graph representation learning, performance achieved on tasks involving graph-structured data has substantially increased in recent years---mostly on tasks involving node-level predictions. The setup of prediction tasks over entire graphs (such as property prediction for a molecule, or side-effect prediction for a drug), however, proves to be more challenging, as the algorithm must combine evidence about several structurally relevant patches of the graph into a single prediction. +Most prior work attempts to predict these graph-level properties while considering only one graph at a time---not allowing the learner to directly leverage structural similarities and motifs across graphs. Here we propose a setup in which a graph neural network receives pairs of graphs at once, and extend it with a co-attentional layer that allows node representations to easily exchange structural information across them. We first show that such a setup provides natural benefits on a pairwise graph classification task (drug-drug interaction prediction), and then expand to a more generic graph regression setup: enhancing predictions over QM9, a standard molecular prediction benchmark. 
Our setup is flexible, powerful and makes no assumptions about the underlying dataset properties, beyond anticipating the existence of multiple training graphs.",/pdf/3121a2282e0036dcec026041303b77761c59e75f.pdf,ICLR,2020,We use graph co-attention in a paired graph training system for graph classification and regression. +r1gVqsA9tQ,B1lyHutqt7,1538090000000.0,1545360000000.0,528,ChainGAN: A sequential approach to GANs,"[""safwan.hossain@mail.utoronto.ca"", ""kiarash.jamali@mail.utoronto.ca"", ""ychnlgy.li@utoronto.ca"", ""frank@spoclab.com""]","[""Safwan Hossain"", ""Kiarash Jamali"", ""Yuchen Li"", ""Frank Rudzicz""]","[""Machine Learning"", ""Sequential Models"", ""GANs""]","We propose a new architecture and training methodology for generative adversarial networks. Current approaches attempt to learn the transformation from a noise sample to a generated data sample in one shot. Our proposed generator architecture, called ChainGAN, uses a two-step process. It first attempts to transform a noise vector into a crude sample, similar to a traditional generator. Next, a chain of networks, called editors, attempt to sequentially enhance this sample. We train each of these units independently, instead of with end-to-end backpropagation on the entire chain. Our model is robust, efficient, and flexible as we can apply it to various network architectures. 
We provide rationale for our choices and experimentally evaluate our model, achieving competitive results on several datasets.",/pdf/4e3ae95216a1bf315e3be4c48796e5cd69abc7f2.pdf,ICLR,2019,Multistep generation process for GANs +YwpZmcAehZ,BuZ97X62jl,1601310000000.0,1615230000000.0,181,Revisiting Dynamic Convolution via Matrix Decomposition,"[""~Yunsheng_Li1"", ""~Yinpeng_Chen1"", ""~Xiyang_Dai2"", ""mengcliu@microsoft.com"", ""~Dongdong_Chen1"", ""yu.ye@microsoft.com"", ""~Lu_Yuan1"", ""~Zicheng_Liu1"", ""~Mei_Chen2"", ""~Nuno_Vasconcelos1""]","[""Yunsheng Li"", ""Yinpeng Chen"", ""Xiyang Dai"", ""mengchen liu"", ""Dongdong Chen"", ""Ye Yu"", ""Lu Yuan"", ""Zicheng Liu"", ""Mei Chen"", ""Nuno Vasconcelos""]","[""supervised representation learning"", ""efficient network"", ""dynamic network"", ""matrix decomposition""]","Recent research in dynamic convolution shows substantial performance boost for efficient CNNs, due to the adaptive aggregation of K static convolution kernels. It has two limitations: (a) it increases the number of convolutional weights by K-times, and (b) the joint optimization of dynamic attention and static convolution kernels is challenging. In this paper, we revisit it from a new perspective of matrix decomposition and reveal the key issue is that dynamic convolution applies dynamic attention over channel groups after projecting into a higher dimensional latent space. To address this issue, we propose dynamic channel fusion to replace dynamic attention over channel groups. Dynamic channel fusion not only enables significant dimension reduction of the latent space, but also mitigates the joint optimization difficulty. As a result, our method is easier to train and requires significantly fewer parameters without sacrificing accuracy. 
Source code is at https://github.com/liyunsheng13/dcd.",/pdf/e60d43801b1591f24c4abdca3a995b42eecd1fdf.pdf,ICLR,2021,Efficient network with dynamic matrix decomposition +jXe91kq3jAq,i1CGWRnL7ie,1601310000000.0,1616070000000.0,1425,Latent Skill Planning for Exploration and Transfer,"[""~Kevin_Xie1"", ""~Homanga_Bharadhwaj1"", ""~Danijar_Hafner1"", ""~Animesh_Garg1"", ""~Florian_Shkurti1""]","[""Kevin Xie"", ""Homanga Bharadhwaj"", ""Danijar Hafner"", ""Animesh Garg"", ""Florian Shkurti""]","[""Model-Based Reinforcement Learning"", ""World Models"", ""Skill Discovery"", ""Mutual Information"", ""Planning"", ""Model Predictive Control"", ""Partial Amortization""]","To quickly solve new tasks in complex environments, intelligent agents need to build up reusable knowledge. For example, a learned world model captures knowledge about the environment that applies to new tasks. Similarly, skills capture general behaviors that can apply to new tasks. In this paper, we investigate how these two approaches can be integrated into a single reinforcement learning agent. Specifically, we leverage the idea of partial amortization for fast adaptation at test time. For this, actions are produced by a policy that is learned over time while the skills it conditions on are chosen using online planning. We demonstrate the benefits of our design decisions across a suite of challenging locomotion tasks and demonstrate improved sample efficiency in single tasks as well as in transfer from one task to another, as compared to competitive baselines. Videos are available at: https://sites.google.com/view/latent-skill-planning/",/pdf/fe4b11b44759390e66cce050dfae61286ab48d51.pdf,ICLR,2021,Partially amortized planning through hierarchy helps learn skills for complex control tasks +BJK3Xasel,,1478380000000.0,1488410000000.0,595,Nonparametric Neural Networks,"[""george.philipp@email.de"", ""jgc@cs.cmu.edu""]","[""George Philipp"", ""Jaime G. 
Carbonell""]","[""Deep learning"", ""Supervised Learning""]","Automatically determining the optimal size of a neural network for a given task without prior information currently requires an expensive global search and training many networks from scratch. In this paper, we address the problem of automatically finding a good network size during a single training cycle. We introduce {\it nonparametric neural networks}, a non-probabilistic framework for conducting optimization over all possible network sizes and prove its soundness when network growth is limited via an $\ell_p$ penalty. We train networks under this framework by continuously adding new units while eliminating redundant units via an $\ell_2$ penalty. We employ a novel optimization algorithm, which we term ``Adaptive Radial-Angular Gradient Descent'' or {\it AdaRad}, and obtain promising results.",/pdf/1dce8a44a0276dce16fe0daf3247a2948925ad87.pdf,ICLR,2017,We automatically set the size of an MLP by adding and removing units during training as appropriate. +Hkee1JBKwB,rkg4wiquvS,1569440000000.0,1577170000000.0,1454,Convolutional Tensor-Train LSTM for Long-Term Video Prediction,"[""jiahaosu@terpmail.umd.edu"", ""wonmin.byeon@gmail.com"", ""furongh@cs.umd.edu"", ""jkautz@nvidia.com"", ""animakumar@gmail.com""]","[""Jiahao Su"", ""Wonmin Byeon"", ""Furong Huang"", ""Jan Kautz"", ""Animashree Anandkumar""]","[""Tensor decomposition"", ""Video prediction""]","Long-term video prediction is highly challenging since it entails simultaneously capturing spatial and temporal information across a long range of image frames. Standard recurrent models are ineffective since they are prone to error propagation and cannot effectively capture higher-order correlations. A potential solution is to extend to higher-order spatio-temporal recurrent models. However, such a model requires a large number of parameters and operations, making it intractable to learn in practice and is prone to overfitting. 
In this work, we propose convolutional tensor-train LSTM (Conv-TT-LSTM), which learns higher-order Convolutional LSTM (ConvLSTM) efficiently using convolutional tensor-train decomposition (CTTD). Our proposed model naturally incorporates higher-order spatio-temporal information at a small cost of memory and computation by using efficient low-rank tensor representations. We evaluate our model on Moving-MNIST and KTH datasets and show improvements over standard ConvLSTM and better/comparable results to other ConvLSTM-based approaches, but with much fewer parameters.",/pdf/921c7a245bef64ed33d4cc827be4db59c547fbac.pdf,ICLR,2020,"we propose convolutional tensor-train LSTM, which learns higher-order Convolutional LSTM efficiently using convolutional tensor-train decomposition. " +SyxCqGbRZ,B1oTqGbCW,1509140000000.0,1518730000000.0,911,Learning to Treat Sepsis with Multi-Output Gaussian Process Deep Recurrent Q-Networks,"[""jfutoma14@gmail.com"", ""anthony.lin@duke.edu"", ""mark.sendak@duke.edu"", ""armando.bedoya@duke.edu"", ""meredith.edwards@duke.edu"", ""cara.obrien@duke.edu"", ""kheller@gmail.com""]","[""Joseph Futoma"", ""Anthony Lin"", ""Mark Sendak"", ""Armando Bedoya"", ""Meredith Clement"", ""Cara O'Brien"", ""Katherine Heller""]","[""Healthcare"", ""Gaussian Process"", ""Deep Reinforcement Learning""]","Sepsis is a life-threatening complication from infection and a leading cause of mortality in hospitals. While early detection of sepsis improves patient outcomes, there is little consensus on exact treatment guidelines, and treating septic patients remains an open problem. In this work we present a new deep reinforcement learning method that we use to learn optimal personalized treatment policies for septic patients. We model patient continuous-valued physiological time series using multi-output Gaussian processes, a probabilistic model that easily handles missing values and irregularly spaced observation times while maintaining estimates of uncertainty. 
The Gaussian process is directly tied to a deep recurrent Q-network that learns clinically interpretable treatment policies, and both models are learned together end-to-end. We evaluate our approach on a heterogeneous dataset of septic spanning 15 months from our university health system, and find that our learned policy could reduce patient mortality by as much as 8.2\% from an overall baseline mortality rate of 13.3\%. Our algorithm could be used to make treatment recommendations to physicians as part of a decision support tool, and the framework readily applies to other reinforcement learning problems that rely on sparsely sampled and frequently missing multivariate time series data. +",/pdf/b38ae962f024e0b70a031a91f61e56b9b3b37a00.pdf,ICLR,2018,"We combine Multi-output Gaussian processes with deep recurrent Q-networks to learn optimal treatments for sepsis and show improved performance over standard deep reinforcement learning methods," +SklVEnR5K7,S1xhXLpcKm,1538090000000.0,1545360000000.0,1440,Making Convolutional Networks Shift-Invariant Again,"[""rich.zhang@eecs.berkeley.edu""]","[""Richard Zhang""]","[""convolutional networks"", ""signal processing"", ""shift"", ""translation"", ""invariance"", ""equivariance""]","Modern convolutional networks are not shift-invariant, despite their convolutional nature: small shifts in the input can cause drastic changes in the internal feature maps and output. In this paper, we isolate the cause -- the downsampling operation in convolutional and pooling layers -- and apply the appropriate signal processing fix -- low-pass filtering before downsampling. This simple architectural modification boosts the shift-equivariance of the internal representations and consequently, shift-invariance of the output. Importantly, this is achieved while maintaining downstream classification performance. In addition, incorporating the inductive bias of shift-invariance largely removes the need for shift-based data augmentation. 
Lastly, we observe that the modification induces spatially-smoother learned convolutional kernels. Our results suggest that this classical signal processing technique has a place in modern deep networks.",/pdf/050a9c62da62d220aadcccd8525187e42dfe78ff.pdf,ICLR,2019,"Modern networks are not shift-invariant, due to naive downsampling; we apply a signal processing tool -- anti-aliasing low-pass filtering before downsampling -- to improve shift-invariance" +m1CD7tPubNy,61G6JF60NuK,1601310000000.0,1616040000000.0,64,Mind the Pad -- CNNs Can Develop Blind Spots,"[""~Bilal_Alsallakh1"", ""narine@fb.com"", ""vivekm@fb.com"", ""junyuan@nyu.edu"", ""orionr@fb.com""]","[""Bilal Alsallakh"", ""Narine Kokhlikyan"", ""Vivek Miglani"", ""Jun Yuan"", ""Orion Reblitz-Richardson""]","[""CNN"", ""convolution"", ""spatial bias"", ""blind spots"", ""foveation"", ""padding"", ""exposition"", ""debugging"", ""visualization""]","We show how feature maps in convolutional networks are susceptible to spatial bias. Due to a combination of architectural choices, the activation at certain locations is systematically elevated or weakened. The major source of this bias is the padding mechanism. Depending on several aspects of convolution arithmetic, this mechanism can apply the padding unevenly, leading to asymmetries in the learned weights. We demonstrate how such bias can be detrimental to certain tasks such as small object detection: the activation is suppressed if the stimulus lies in the impacted area, leading to blind spots and misdetection. We explore alternative padding methods and propose solutions for analyzing and mitigating spatial bias. +",/pdf/70ef163f0737eab414d51c5c352b8292272c77d4.pdf,ICLR,2021,"The padding mechanism in CNNs can induce harmful spatial bias in the learned weights and in the feature maps, which can be mitigated with careful architectural choices." 
+rkglZyHtvH,HkxpM_jdDS,1569440000000.0,1577170000000.0,1531,Refining the variational posterior through iterative optimization,"[""mh740@cam.ac.uk"", ""jsnoek@google.com"", ""trandustin@google.com"", ""jg801@cam.ac.uk"", ""jmh233@cam.ac.uk""]","[""Marton Havasi"", ""Jasper Snoek"", ""Dustin Tran"", ""Jonathan Gordon"", ""Jos\u00e9 Miguel Hern\u00e1ndez-Lobato""]","[""uncertainty estimation"", ""variational inference"", ""auxiliary variables"", ""Bayesian neural networks""]","Variational inference (VI) is a popular approach for approximate Bayesian inference that is particularly promising for highly parameterized models such as deep neural networks. A key challenge of variational inference is to approximate the posterior over model parameters with a distribution that is simpler and tractable yet sufficiently expressive. In this work, we propose a method for training highly flexible variational distributions by starting with a coarse approximation and iteratively refining it. Each refinement step makes cheap, local adjustments and only requires optimization of simple variational families. We demonstrate theoretically that our method always improves a bound on the approximation (the Evidence Lower BOund) and observe this empirically across a variety of benchmark tasks. In experiments, our method consistently outperforms recent variational inference methods for deep learning in terms of log-likelihood and the ELBO. We see that the gains are further amplified on larger scale models, significantly outperforming standard VI and deep ensembles on residual networks on CIFAR10.",/pdf/f06213d96b7b8bbadaa5ab52f169597ae8808974.pdf,ICLR,2020,The paper proposes an algorithm to increase the flexibility of the variational posterior in Bayesian neural networks through iterative optimization. 
+B1g79grKPr,r1xOoRltwS,1569440000000.0,1577170000000.0,2465,Goal-Conditioned Video Prediction,"[""oleh@seas.upenn.edu"", ""pertsch@usc.edu"", ""febert@berkeley.edu"", ""dineshjayaraman@berkeley.edu"", ""cbfinn@cs.stanford.edu"", ""svlevine@eecs.berkeley.edu""]","[""Oleh Rybkin"", ""Karl Pertsch"", ""Frederik Ebert"", ""Dinesh Jayaraman"", ""Chelsea Finn"", ""Sergey Levine""]","[""predictive models"", ""video prediction"", ""latent variable models""]","Many processes can be concisely represented as a sequence of events leading from a starting state to an end state. Given raw ingredients, and a finished cake, an experienced chef can surmise the recipe. Building upon this intuition, we propose a new class of visual generative models: goal-conditioned predictors (GCP). Prior work on video generation largely focuses on prediction models that only observe frames from the beginning of the video. GCP instead treats videos as start-goal transformations, making video generation easier by conditioning on the more informative context provided by the first and final frames. Not only do existing forward prediction approaches synthesize better and longer videos when modified to become goal-conditioned, but GCP models can also utilize structures that are not linear in time, to accomplish hierarchical prediction. To this end, we study both auto-regressive GCP models and novel tree-structured GCP models that generate frames recursively, splitting the video iteratively into finer and finer segments delineated by subgoals. In experiments across simulated and real datasets, our GCP methods generate high-quality sequences over long horizons. Tree-structured GCPs are also substantially easier to parallelize than auto-regressive GCPs, making training and inference very efficient, and allowing the model to train on sequences that are thousands of frames in length. Finally, we demonstrate the utility of GCP approaches for imitation learning in the setting without access to expert actions. 
Videos are on the supplementary website: https://sites.google.com/view/video-gcp",/pdf/bdca1a749a11f6cf7f6931bb8fbe74dcaac5e30c.pdf,ICLR,2020,We propose a new class of visual generative models: goal-conditioned predictors. We show experimentally that conditioning on the goal makes it possible to reduce uncertainty and produce predictions over much longer horizons. +BkxRRkSKwr,HJeA4u1tvS,1569440000000.0,1583910000000.0,2043,Towards Hierarchical Importance Attribution: Explaining Compositional Semantics for Neural Sequence Models,"[""xisenjin@usc.edu"", ""zywei@fudan.edu.cn"", ""junyidu@usc.edu"", ""xyxue@fudan.edu.cn"", ""xiangren@usc.edu""]","[""Xisen Jin"", ""Zhongyu Wei"", ""Junyi Du"", ""Xiangyang Xue"", ""Xiang Ren""]","[""natural language processing"", ""interpretability""]","The impressive performance of neural networks on natural language processing tasks is attributed to their ability to model complicated word and phrase compositions. To explain how the model handles semantic compositions, we study hierarchical explanation of neural network predictions. We identify non-additivity and context independent importance attributions within hierarchies as two desirable properties for highlighting word and phrase compositions. We show some prior efforts on hierarchical explanations, e.g. contextual decomposition, do not satisfy the desired properties mathematically, leading to inconsistent explanation quality in different models. In this paper, we start by proposing a formal and general way to quantify the importance of each word and phrase. Following the formulation, we propose Sampling and Contextual Decomposition (SCD) algorithm and Sampling and Occlusion (SOC) algorithm. Human and metrics evaluation on both LSTM models and BERT Transformer models on multiple datasets show that our algorithms outperform prior hierarchical explanation algorithms. 
Our algorithms help to visualize semantic composition captured by models, extract classification rules and improve human trust of models.",/pdf/cf05abd1e4d4f8871bbcd84fcf80052e5a33a093.pdf,ICLR,2020,We propose measurement of phrase importance and algorithms for hierarchical explanation of neural sequence model predictions +S1giro05t7,Hyl5Pf6Ht7,1538090000000.0,1545360000000.0,115,Reducing Overconfident Errors outside the Known Distribution,"[""zli115@illinois.edu"", ""dhoiem@illinois.edu""]","[""Zhizhong Li"", ""Derek Hoiem""]","[""Machine learning safety"", ""confidence"", ""overconfidence"", ""unknown domain"", ""novel distribution"", ""generalization"", ""distillation"", ""ensemble"", ""underrepresentation""]","Intuitively, unfamiliarity should lead to lack of confidence. In reality, current algorithms often make highly confident yet wrong predictions when faced with unexpected test samples from an unknown distribution different from training. Unlike domain adaptation methods, we cannot gather an ""unexpected dataset"" prior to test, and unlike novelty detection methods, a best-effort original task prediction is still expected. We compare a number of methods from related fields such as calibration and epistemic uncertainty modeling, as well as two proposed methods that reduce overconfident errors of samples from an unknown novel distribution without drastically increasing evaluation time: (1) G-distillation, training an ensemble of classifiers and then distill into a single model using both labeled and unlabeled examples, or (2) NCR, reducing prediction confidence based on its novelty detection score. Experimentally, we investigate the overconfidence problem and evaluate our solution by creating ""familiar"" and ""novel"" test splits, where ""familiar"" are identically distributed with training and ""novel"" are not. 
We discover that calibrating using temperature scaling on familiar data is the best single-model method for improving novel confidence, followed by our proposed methods. In addition, some methods' NLL performance is roughly equivalent to that of a regularly trained model with a certain degree of smoothing. Calibrating can also reduce confident errors, for example, in gender recognition by 95% on demographic groups different from the training data.",/pdf/b5b978858840a87007c978ae15a41303bb959dcc.pdf,ICLR,2019,"Deep networks are more likely to be confidently wrong when testing on unexpected data. We propose an experimental methodology to study the problem, and two methods to reduce confident errors on unknown input distributions." +7aogOj_VYO0,CE1Rck_DhF,1601310000000.0,1614260000000.0,1722,Do not Let Privacy Overbill Utility: Gradient Embedding Perturbation for Private Learning,"[""~Da_Yu1"", ""~Huishuai_Zhang3"", ""~Wei_Chen1"", ""~Tie-Yan_Liu1""]","[""Da Yu"", ""Huishuai Zhang"", ""Wei Chen"", ""Tie-Yan Liu""]","[""privacy preserving machine learning"", ""differentially private deep learning"", ""gradient redundancy""]","The privacy leakage of the model about the training data can be bounded in the differential privacy mechanism. However, for meaningful privacy parameters, a differentially private model degrades the utility drastically when the model comprises a large number of trainable parameters. In this paper, we propose an algorithm \emph{Gradient Embedding Perturbation (GEP)} towards training differentially private deep models with decent accuracy. Specifically, in each gradient descent step, GEP first projects individual private gradient into a non-sensitive anchor subspace, producing a low-dimensional gradient embedding and a small-norm residual gradient. Then, GEP perturbs the low-dimensional embedding and the residual gradient separately according to the privacy budget. 
Such a decomposition permits a small perturbation variance, which greatly helps to break the dimensional barrier of private learning. With GEP, we achieve decent accuracy with low computational cost and modest privacy guarantee for deep models. Especially, with privacy bound $\epsilon=8$, we achieve $74.9\%$ test accuracy on CIFAR10 and $95.1\%$ test accuracy on SVHN, significantly improving over existing results.",/pdf/d70f5682ba3c5d03cd82bb7d287b19e150c1f9a5.pdf,ICLR,2021,A new algorithm for differentially private learning that advances state-of-the-art performance on several benchmark datasets. Code: https://github.com/dayu11/Gradient-Embedding-Perturbation +gZ9hCDWe6ke,BznQpRqf6yB,1601310000000.0,1616040000000.0,1041,Deformable DETR: Deformable Transformers for End-to-End Object Detection,"[""~Xizhou_Zhu1"", ""~Weijie_Su2"", ""luotto@sensetime.com"", ""binli@ustc.edu.cn"", ""~Xiaogang_Wang2"", ""~Jifeng_Dai1""]","[""Xizhou Zhu"", ""Weijie Su"", ""Lewei Lu"", ""Bin Li"", ""Xiaogang Wang"", ""Jifeng Dai""]","[""Efficient Attention Mechanism"", ""Deformation Modeling"", ""Multi-scale Representation"", ""End-to-End Object Detection""]","DETR has been recently proposed to eliminate the need for many hand-designed components in object detection while demonstrating good performance. However, it suffers from slow convergence and limited feature spatial resolution, due to the limitation of Transformer attention modules in processing image feature maps. To mitigate these issues, we proposed Deformable DETR, whose attention modules only attend to a small set of key sampling points around a reference. Deformable DETR can achieve better performance than DETR (especially on small objects) with 10$\times$ less training epochs. Extensive experiments on the COCO benchmark demonstrate the effectiveness of our approach. 
Code is released at https://github.com/fundamentalvision/Deformable-DETR.",/pdf/758d4b5c0d63033d526ff8744d872a03543bb674.pdf,ICLR,2021,Deformable DETR is an efficient and fast-converging end-to-end object detector. It mitigates the high complexity and slow convergence issues of DETR via a novel sampling-based efficient attention mechanism. +H1lBj2VFPS,ByeF2ekWwr,1569440000000.0,1583910000000.0,150,Linear Symmetric Quantization of Neural Networks for Low-precision Integer Hardware,"[""zhaoxiandong@ict.ac.cn"", ""wangying2009@ict.ac.cn"", ""caixuyi18s@ict.ac.cn"", ""liucheng@ict.ac.cn"", ""zlei@ict.ac.cn""]","[""Xiandong Zhao"", ""Ying Wang"", ""Xuyi Cai"", ""Cheng Liu"", ""Lei Zhang""]","[""quantization"", ""integer-arithmetic-only DNN accelerator"", ""acceleration""]","With the proliferation of specialized neural network processors that operate on low-precision integers, the performance of Deep Neural Network inference becomes increasingly dependent on the result of quantization. Despite plenty of prior work on the quantization of weights or activations for neural networks, there is still a wide gap between the software quantizers and the low-precision accelerator implementation, which degrades either the efficiency of networks or that of the hardware for the lack of software and hardware coordination at design-phase. In this paper, we propose a learned linear symmetric quantizer for integer neural network processors, which not only quantizes neural parameters and activations to low-bit integer but also accelerates hardware inference by using batch normalization fusion and low-precision accumulators (e.g., 16-bit) and multipliers (e.g., 4-bit). We use a unified way to quantize weights and activations, and the results outperform many previous approaches for various networks such as AlexNet, ResNet, and lightweight models like MobileNet while keeping friendly to the accelerator architecture. 
Additionally, we also apply the method to object detection models and witness high performance and accuracy in YOLO-v2. Finally, we deploy the quantized models on our specialized integer-arithmetic-only DNN accelerator to show the effectiveness of the proposed quantizer. We show that even with linear symmetric quantization, the results can be better than asymmetric or non-linear methods in 4-bit networks. In evaluation, the proposed quantizer induces less than 0.4\% accuracy drop in ResNet18, ResNet34, and AlexNet when quantizing the whole network as required by the integer processors.",/pdf/e33b3876da00219dcd4120fa7a53b8654123c0aa.pdf,ICLR,2020,We introduce an efficient quantization process that allows for performance acceleration on a specialized integer-only neural network accelerator. +SklEhlHtPr,rJeDHg-Fvr,1569440000000.0,1577170000000.0,2534,DeepPCM: Predicting Protein-Ligand Binding using Unsupervised Learned Representations,"[""paul.kim@bayer.com"", ""robin.winter@bayer.com"", ""djork-arne.clevert@bayer.com""]","[""Paul Kim"", ""Robin Winter"", ""Djork-Arn\u00e9 Clevert""]","[""Unsupervised Representation Learning"", ""Computational biology"", ""computational chemistry"", ""protein-ligand binding""]","In-silico protein-ligand binding prediction is an ongoing area of research in computational chemistry and machine learning based drug discovery, as an accurate predictive model could greatly reduce the time and resources necessary for the detection and prioritization of possible drug candidates. Proteochemometric modeling (PCM) attempts to make an accurate model of the protein-ligand interaction space by combining explicit protein and ligand descriptors. This requires the creation of information-rich, uniform and computer interpretable representations of proteins and ligands. 
Previous work in PCM modeling relies on pre-defined, handcrafted feature extraction methods, and many methods use protein descriptors that require alignment or are otherwise specific to a particular group of related proteins. However, recent advances in representation learning have shown that unsupervised machine learning can be used to generate embeddings which outperform complex, human-engineered representations. We apply this reasoning to propose a novel proteochemometric modeling methodology which, for the first time, uses embeddings generated via unsupervised representation learning for both the protein and ligand descriptors. We evaluate performance on various splits of a benchmark dataset, including a challenging split that tests the model’s ability to generalize to proteins for which bioactivity data is greatly limited, and we find that our method consistently outperforms state-of-the-art methods.",/pdf/e326b281ebf9a5abec38bf3efcf3093e917f9681.pdf,ICLR,2020,We report a new methodological framework which uses unsupervised-learned representations of proteins and compounds to significantly outperform methods based on handcrafted features for the impactful protein-ligand binding task. +S1Dh8Tg0-,ryIn8pg0W,1509120000000.0,1521540000000.0,417,Fix your classifier: the marginal value of training the last weight layer,"[""elad.hoffer@gmail.com"", ""itayhubara@gmail.com"", ""daniel.soudry@gmail.com""]","[""Elad Hoffer"", ""Itay Hubara"", ""Daniel Soudry""]",[],"Neural networks are commonly used as models for classification for a wide variety of tasks. Typically, a learned affine transformation is placed at the end of such models, yielding a per-class value used for classification. This classifier can have a vast number of parameters, which grows linearly with the number of possible classes, thus requiring increasingly more resources. 
+ +In this work we argue that this classifier can be fixed, up to a global scale constant, with little or no loss of accuracy for most tasks, allowing memory and computational benefits. Moreover, we show that by initializing the classifier with a Hadamard matrix we can speed up inference as well. We discuss the implications for current understanding of neural network models. +",/pdf/4a0ef598a79f22a5bab9cc41627496e4f99fcb2b.pdf,ICLR,2018,You can fix the classifier in neural networks without losing accuracy +FKotzp6PZJw,nwWUM6izcDa,1601310000000.0,1614990000000.0,1410,On the Estimation Bias in Double Q-Learning,"[""~Zhizhou_Ren1"", ""~Guangxiang_Zhu1"", ""~Beining_Han1"", ""~Jianglun_Chen2"", ""~Chongjie_Zhang1""]","[""Zhizhou Ren"", ""Guangxiang Zhu"", ""Beining Han"", ""Jianglun Chen"", ""Chongjie Zhang""]","[""Reinforcement learning"", ""Q-learning"", ""Estimation bias""]","Double Q-learning is a classical method for reducing overestimation bias, which is caused by taking maximum estimated values in the Bellman operator. Its variants in the deep Q-learning paradigm have shown great promise in producing reliable value prediction and improving learning performance. However, as shown by prior work, double Q-learning is not fully unbiased and still suffers from underestimation bias. In this paper, we show that such underestimation bias may lead to multiple non-optimal fixed points under an approximated Bellman operation. To address the concerns of converging to non-optimal stationary solutions, we propose a simple and effective approach as a partial fix for underestimation bias in double Q-learning. This approach leverages real returns to bound the target value. 
We extensively evaluate the proposed method in the Atari benchmark tasks and demonstrate its significant improvement over baseline algorithms.",/pdf/6fb0a4fc6e24a731856dc7727580d5e9cd0ec8fb.pdf,ICLR,2021,"We prove that double Q-learning may have multiple non-optimal fixed points, and propose a simple approach to address this issue." +ryxnY3NYPS,SygenxDpIB,1569440000000.0,1583910000000.0,93,Diverse Trajectory Forecasting with Determinantal Point Processes,"[""yyuan2@cs.cmu.edu"", ""kkitani@cs.cmu.edu""]","[""Ye Yuan"", ""Kris M. Kitani""]","[""Diverse Inference"", ""Generative Models"", ""Trajectory Forecasting""]","The ability to forecast a set of likely yet diverse possible future behaviors of an agent (e.g., future trajectories of a pedestrian) is essential for safety-critical perception systems (e.g., autonomous vehicles). In particular, a set of possible future behaviors generated by the system must be diverse to account for all possible outcomes in order to take necessary safety precautions. It is not sufficient to maintain a set of the most likely future outcomes because the set may only contain perturbations of a dominating single outcome (major mode). While generative models such as variational autoencoders (VAEs) have been shown to be a powerful tool for learning a distribution over future trajectories, randomly drawn samples from the learned implicit likelihood model may not be diverse -- the likelihood model is derived from the training data distribution and the samples will concentrate around the major mode of the data. In this work, we propose to learn a diversity sampling function (DSF) that generates a diverse yet likely set of future trajectories. The DSF maps forecasting context features to a set of latent codes which can be decoded by a generative model (e.g., VAE) into a set of diverse trajectory samples. Concretely, the process of identifying the diverse set of samples is posed as DSF parameter estimation. 
To learn the parameters of the DSF, the diversity of the trajectory samples is evaluated by a diversity loss based on a determinantal point process (DPP). Gradient descent is performed over the DSF parameters, which in turn moves the latent codes of the sample set to find an optimal set of diverse yet likely trajectories. Our method is a novel application of DPPs to optimize a set of items (forecasted trajectories) in continuous space. We demonstrate the diversity of the trajectories produced by our approach on both low-dimensional 2D trajectory data and high-dimensional human motion data.",/pdf/dbe4a1eab1eabba68d5e92b6ae29819c3f7f7999.pdf,ICLR,2020,We learn a diversity sampling function with DPPs to obtain a diverse set of samples from a generative model. +8q_ca26L1fz,#NAME?,1601310000000.0,1614990000000.0,1885,Revisiting Graph Neural Networks for Link Prediction,"[""~Muhan_Zhang1"", ""~Pan_Li2"", ""yxia@fb.com"", ""wangkai@fb.com"", ""longjin@fb.com""]","[""Muhan Zhang"", ""Pan Li"", ""Yinglong Xia"", ""Kai Wang"", ""Long Jin""]","[""Graph Neural Networks"", ""Link Prediction""]","Graph neural networks (GNNs) have achieved great success in recent years. Three most common applications include node classification, link prediction, and graph classification. While there is rich literature on node classification and graph classification, GNNs for link prediction is relatively less studied and less understood. Two representative classes of methods exist: GAE and SEAL. GAE (Graph Autoencoder) first uses a GNN to learn node embeddings for all nodes, and then aggregates the embeddings of the source and target nodes as their link representation. SEAL extracts a subgraph around the source and target nodes, labels the nodes in the subgraph, and then uses a GNN to learn a link representation from the labeled subgraph. 
In this paper, we thoroughly discuss the differences between these two classes of methods, and conclude that simply aggregating \textit{node} embeddings does not lead to effective \textit{link} representations, while learning from \textit{properly labeled subgraphs} around links provides highly expressive and generalizable link representations. Experiments on the recent large-scale OGB link prediction datasets show that SEAL has up to 195\% performance gains over GAE methods, achieving new state-of-the-art results on 3 out of 4 datasets.",/pdf/f9da78c757e3bcaef9b9f11b79047e256c5dd587.pdf,ICLR,2021,"We showed that simply aggregating pairwise node embeddings learned by a GNN does not lead to effective link representations, and proposed a labeling trick to address this issue." +H1I3M7Z0b,SkAsMXZRW,1509140000000.0,1518730000000.0,1147,WSNet: Learning Compact and Efficient Networks with Weight Sampling,"[""xiaojie.jin@u.nus.edu"", ""superyyzg@gmail.com"", ""ning.xu@snap.com"", ""jiachao.yang@snap.com"", ""elefjia@nus.edu.sg"", ""yanshuicheng@360.com""]","[""Xiaojie Jin"", ""Yingzhen Yang"", ""Ning Xu"", ""Jianchao Yang"", ""Jiashi Feng"", ""Shuicheng Yan""]","[""Deep learning"", ""model compression""]"," We present a new approach and a novel architecture, termed WSNet, for learning compact and efficient deep neural networks. Existing approaches conventionally learn full model parameters independently and then compress them via \emph{ad hoc} processing such as model pruning or filter factorization. Alternatively, WSNet proposes learning model parameters by sampling from a compact set of learnable parameters, which naturally enforces {parameter sharing} throughout the learning process. We demonstrate that such a novel weight sampling approach (and induced WSNet) promotes both weights and computation sharing favorably. 
By employing this method, we can more efficiently learn much smaller networks with competitive performance compared to baseline networks with equal numbers of convolution filters. Specifically, we consider learning compact and efficient 1D convolutional neural networks for audio classification. Extensive experiments on multiple audio classification datasets verify the effectiveness of WSNet. Combined with weight quantization, the resulting models are up to \textbf{180$\times$} smaller and theoretically up to \textbf{16$\times$} faster than the well-established baselines, without noticeable performance drop.",/pdf/bec3a5d58827cadae79c5f9723e7e514ce07a5f4.pdf,ICLR,2018,We present a novel network architecture for learning compact and efficient deep neural networks +K5YasWXZT3O,hnz2qluCe-,1601310000000.0,1619140000000.0,647,Tilted Empirical Risk Minimization,"[""~Tian_Li1"", ""~Ahmad_Beirami1"", ""~Maziar_Sanjabi1"", ""~Virginia_Smith1""]","[""Tian Li"", ""Ahmad Beirami"", ""Maziar Sanjabi"", ""Virginia Smith""]","[""exponential tilting"", ""models of learning and generalization"", ""label noise robustness"", ""fairness""]","Empirical risk minimization (ERM) is typically designed to perform well on the average loss, which can result in estimators that are sensitive to outliers, generalize poorly, or treat subgroups unfairly. While many methods aim to address these problems individually, in this work, we explore them through a unified framework---tilted empirical risk minimization (TERM). In particular, we show that it is possible to flexibly tune the impact of individual losses through a straightforward extension to ERM using a hyperparameter called the tilt. 
We provide several interpretations of the resulting framework: We show that TERM can increase or decrease the influence of outliers, respectively, to enable fairness or robustness; has variance-reduction properties that can benefit generalization; and can be viewed as a smooth approximation to a superquantile method. We develop batch and stochastic first-order optimization methods for solving TERM, and show that the problem can be efficiently solved relative to common alternatives. Finally, we demonstrate that TERM can be used for a multitude of applications, such as enforcing fairness between subgroups, mitigating the effect of outliers, and handling class imbalance. TERM is not only competitive with existing solutions tailored to these individual problems, but can also enable entirely new applications, such as simultaneously addressing outliers and promoting fairness.",/pdf/292aa65cc32390e3e8557ebe28fa70380977f416.pdf,ICLR,2021,"We show that tilted empirical risk minimization (TERM) can be used for enforcing fairness between subgroups, mitigating the effect of outliers, and handling class imbalance, all in a unified framework." +H1W1UN9gg,,1478280000000.0,1488580000000.0,215,Deep Information Propagation,"[""schsam@google.com"", ""gilmer@google.com"", ""sganguli@stanford.edu"", ""jaschasd@google.com""]","[""Samuel S. Schoenholz"", ""Justin Gilmer"", ""Surya Ganguli"", ""Jascha Sohl-Dickstein""]","[""Theory"", ""Deep learning""]","We study the behavior of untrained neural networks whose weights and biases are randomly distributed using mean field theory. We show the existence of depth scales that naturally limit the maximum depth of signal propagation through these random networks. Our main practical result is to show that random networks may be trained precisely when information can travel through them. Thus, the depth scales that we identify provide bounds on how deep a network may be trained for a specific choice of hyperparameters. 
As a corollary to this, we argue that in networks at the edge of chaos, one of these depth scales diverges. Thus arbitrarily deep networks may be trained only sufficiently close to criticality. We show that the presence of dropout destroys the order-to-chaos critical point and therefore strongly limits the maximum trainable depth for random networks. Finally, we develop a mean field theory for backpropagation and we show that the ordered and chaotic phases correspond to regions of vanishing and exploding gradient respectively.",/pdf/49b3f245e912ffb5845d079da65751b3e4fe4ad1.pdf,ICLR,2017,We predict whether randomly initialized neural networks can be trained by studying whether or not information can travel through them. +B1GAUs0cKQ,SkgocDLcK7,1538090000000.0,1550490000000.0,222,Variance Networks: When Expectation Does Not Meet Your Expectations,"[""k.necludov@gmail.com"", ""dmolch111@gmail.com"", ""ars.ashuha@gmail.com"", ""vetrovd@yandex.ru""]","[""Kirill Neklyudov"", ""Dmitry Molchanov"", ""Arsenii Ashukha"", ""Dmitry Vetrov""]","[""deep learning"", ""variational inference"", ""variational dropout""]","Ordinary stochastic neural networks mostly rely on the expected values of their weights to make predictions, whereas the induced noise is mostly used to capture the uncertainty, prevent overfitting and slightly boost the performance through test-time averaging. In this paper, we introduce variance layers, a different kind of stochastic layers. Each weight of a variance layer follows a zero-mean distribution and is only parameterized by its variance. It means that each object is represented by a zero-mean distribution in the space of the activations. We show that such layers can learn surprisingly well, can serve as an efficient exploration tool in reinforcement learning tasks and provide a decent defense against adversarial attacks. We also show that a number of conventional Bayesian neural networks naturally converge to such zero-mean posteriors. 
We observe that in these cases such zero-mean parameterization leads to a much better training objective than more flexible conventional parameterizations where the mean is being learned.",/pdf/3befe31b06f3e6adab32966922f1df56500e8c08.pdf,ICLR,2019,"It is possible to learn a zero-centered Gaussian distribution over the weights of a neural network by learning only variances, and it works surprisingly well." +S1e_ssC5F7,rklcE4nctQ,1538090000000.0,1545360000000.0,634,Hyper-Regularization: An Adaptive Choice for the Learning Rate in Gradient Descent,"[""smsxgz@pku.edu.cn"", ""jin.hao@pku.edu.cn"", ""lindachao@pku.edu.cn"", ""zhzhang@math.pku.edu.cn""]","[""Guangzeng Xie"", ""Hao Jin"", ""Dachao Lin"", ""Zhihua Zhang""]","[""Adaptive learning rate"", ""novel framework""]","We present a novel approach for adaptively selecting the learning rate in gradient descent methods. Specifically, we impose a regularization term on the learning rate via a generalized distance, and cast the joint updating process of the parameter and the learning rate into a maxmin problem. Some existing schemes such as AdaGrad (diagonal version) and WNGrad can be rederived from our approach. Based on our approach, the updating rules for the learning rate do not rely on the smoothness constant of optimization problems and are robust to the initial learning rate. 
We theoretically analyze our approach in full batch and online learning settings, which achieves comparable performances with other first-order gradient-based algorithms in terms of accuracy as well as convergence rate.",/pdf/345c0e2dd15a7399bd29975ac041e0a301e82231.pdf,ICLR,2019, +HJe_yR4Fwr,S1l-GqMuvB,1569440000000.0,1583910000000.0,898,Improved Sample Complexities for Deep Neural Networks and Robust Classification via an All-Layer Margin,"[""colinwei@stanford.edu"", ""tengyuma@cs.stanford.edu""]","[""Colin Wei"", ""Tengyu Ma""]","[""deep learning theory"", ""generalization bounds"", ""adversarially robust generalization"", ""data-dependent generalization bounds""]","For linear classifiers, the relationship between (normalized) output margin and generalization is captured in a clear and simple bound – a large output margin implies good generalization. Unfortunately, for deep models, this relationship is less clear: existing analyses of the output margin give complicated bounds which sometimes depend exponentially on depth. In this work, we propose to instead analyze a new notion of margin, which we call the “all-layer margin.” Our analysis reveals that the all-layer margin has a clear and direct relationship with generalization for deep models. This enables the following concrete applications of the all-layer margin: 1) by analyzing the all-layer margin, we obtain tighter generalization bounds for neural nets which depend on Jacobian and hidden layer norms and remove the exponential dependency on depth 2) our neural net results easily translate to the adversarially robust setting, giving the first direct analysis of robust test error for deep networks, and 3) we present a theoretically inspired training algorithm for increasing the all-layer margin. 
Our algorithm improves both clean and adversarially robust test performance over strong baselines in practice.",/pdf/68a22f3da586dac673478e301763f5516fdf3f10.pdf,ICLR,2020,"We propose a new notion of margin that has a direct relationship with neural net generalization, and obtain improved generalization bounds for neural nets and robust classification by analyzing this margin." +H1VjBebR-,r1XoHxb0b,1509130000000.0,1519320000000.0,585,The Role of Minimal Complexity Functions in Unsupervised Learning of Semantic Mappings,"[""tomer22g@gmail.com"", ""liorwolf@gmail.com"", ""sagiebenaim@gmail.com""]","[""Tomer Galanti"", ""Lior Wolf"", ""Sagie Benaim""]","[""Unsupervised learning"", ""cross-domain mapping"", ""Kolmogorov complexity"", ""Occam's razor""]","We discuss the feasibility of the following learning problem: given unmatched samples from two domains and nothing else, learn a mapping between the two, which preserves semantics. Due to the lack of paired samples and without any definition of the semantic information, the problem might seem ill-posed. Specifically, in typical cases, it seems possible to build infinitely many alternative mappings from every target mapping. This apparent ambiguity stands in sharp contrast to the recent empirical success in solving this problem. + +We identify the abstract notion of aligning two domains in a semantic way with concrete terms of minimal relative complexity. A theoretical framework for measuring the complexity of compositions of functions is developed in order to show that it is reasonable to expect the minimal complexity mapping to be unique. The measured complexity used is directly related to the depth of the neural networks being learned and a semantically aligned mapping could then be captured simply by learning using architectures that are not much bigger than the minimal architecture. + +Various predictions are made based on the hypothesis that semantic alignment can be captured by the minimal mapping. 
These are verified extensively. In addition, a new mapping algorithm is proposed and shown to lead to better mapping results.",/pdf/28602396bbae8435a659b18b701d49a96f937048.pdf,ICLR,2018,"Our hypothesis is that given two domains, the lowest complexity mapping that has a low discrepancy approximates the target mapping." +B1liIlBKvS,HJepO9gtPS,1569440000000.0,1577170000000.0,2336,Selfish Emergent Communication,"[""michael.noukhovitch@umontreal.ca"", ""tlacroix@uci.edu"", ""aaron.courville@gmail.com""]","[""Michael Noukhovitch"", ""Travis LaCroix"", ""Aaron Courville""]","[""multi agent reinforcement learning"", ""emergent communication"", ""game theory""]","Current literature in machine learning holds that unaligned, self-interested agents do not learn to use an emergent communication channel. We introduce a new sender-receiver game to study emergent communication for this spectrum of partially-competitive scenarios and put special care into evaluation. We find that communication can indeed emerge in partially-competitive scenarios, and we discover three things that are tied to improving it. First, that selfish communication is proportional to cooperation, and it naturally occurs for situations that are more cooperative than competitive. Second, that stability and performance are improved by using LOLA (Foerster et al, 2018), especially in more competitive scenarios. And third, that discrete protocols lend themselves better to learning cooperative communication than continuous ones. 
",/pdf/e26ed372e5390f3a8f211b87befe8dda63dfa06d.pdf,ICLR,2020,"We manage to emerge communication with selfish agents, contrary to the current view in ML" +YUGG2tFuPM,48EKFPfBsq_,1601310000000.0,1616040000000.0,2210,Deep Partition Aggregation: Provable Defenses against General Poisoning Attacks,"[""~Alexander_Levine2"", ""~Soheil_Feizi2""]","[""Alexander Levine"", ""Soheil Feizi""]","[""bagging"", ""ensemble"", ""robustness"", ""certificate"", ""poisoning"", ""smoothing""]","Adversarial poisoning attacks distort training data in order to corrupt the test-time behavior of a classifier. A provable defense provides a certificate for each test sample, which is a lower bound on the magnitude of any adversarial distortion of the training set that can corrupt the test sample's classification. +We propose two novel provable defenses against poisoning attacks: (i) Deep Partition Aggregation (DPA), a certified defense against a general poisoning threat model, defined as the insertion or deletion of a bounded number of samples to the training set --- by implication, this threat model also includes arbitrary distortions to a bounded number of images and/or labels; and (ii) Semi-Supervised DPA (SS-DPA), a certified defense against label-flipping poisoning attacks. DPA is an ensemble method where base models are trained on partitions of the training set determined by a hash function. DPA is related to both subset aggregation, a well-studied ensemble method in classical machine learning, as well as to randomized smoothing, a popular provable defense against evasion (inference) attacks. Our defense against label-flipping poison attacks, SS-DPA, uses a semi-supervised learning algorithm as its base classifier model: each base classifier is trained using the entire unlabeled training set in addition to the labels for a partition. 
SS-DPA significantly outperforms the existing certified defense for label-flipping attacks (Rosenfeld et al., 2020) on both MNIST and CIFAR-10: provably tolerating, for at least half of test images, over 600 label flips (vs. < 200 label flips) on MNIST and over 300 label flips (vs. 175 label flips) on CIFAR-10. Against general poisoning attacks where no prior certified defenses exists, DPA can certify $\geq$ 50% of test images against over 500 poison image insertions on MNIST, and nine insertions on CIFAR-10. These results establish new state-of-the-art provable defenses against general and label-flipping poison attacks. Code is available at https://github.com/alevine0/DPA",/pdf/d7ef1d2223ddeebc34ef648c6db9ba088ba871bb.pdf,ICLR,2021,We propose novel certified defenses against label-flipping and general adversarial poisoning attacks. +ByePEC4KDS,HygvD4LdPH,1569440000000.0,1577170000000.0,1078,Situating Sentence Embedders with Nearest Neighbor Overlap,"[""lucylin@cs.washington.edu"", ""nasmith@cs.washington.edu""]","[""Lucy H. Lin"", ""Noah A. Smith""]","[""sentence embeddings"", ""nearest neighbors"", ""semantic similarity""]","As distributed approaches to natural language semantics have developed and diversified, embedders for linguistic units larger than words (e.g., sentences) have come to play an increasingly important role. To date, such embedders have been evaluated using benchmark tasks (e.g., GLUE) and linguistic probes. We propose a comparative approach, nearest neighbor overlap (N2O), that quantifies similarity between embedders in a task-agnostic manner. N2O requires only a collection of examples and is simple to understand: two embedders are more similar if, for the same set of inputs, there is greater overlap between the inputs' nearest neighbors. 
We use N2O to compare 21 sentence embedders and show the effects of different design choices and architectures.",/pdf/74799ce8c0253ae859d56cb64466b7a70d599771.pdf,ICLR,2020,"We propose nearest neighbor overlap, a procedure which quantifies similarity between embedders in a task-agnostic manner, and use it to compare 21 sentence embedders." +SyxrxR4KPS,S1lpMw7dPB,1569440000000.0,1583910000000.0,928,Deep neuroethology of a virtual rodent,"[""jsmerel@google.com"", ""diegoaldarondo@g.harvard.edu"", ""jesse_d_marshall@fas.harvard.edu"", ""tassa@google.com"", ""gregwayne@google.com"", ""olveczky@fas.harvard.edu""]","[""Josh Merel"", ""Diego Aldarondo"", ""Jesse Marshall"", ""Yuval Tassa"", ""Greg Wayne"", ""Bence Olveczky""]","[""computational neuroscience"", ""motor control"", ""deep RL""]","Parallel developments in neuroscience and deep learning have led to mutually productive exchanges, pushing our understanding of real and artificial neural networks in sensory and cognitive systems. However, this interaction between fields is less developed in the study of motor control. In this work, we develop a virtual rodent as a platform for the grounded study of motor activity in artificial models of embodied control. We then use this platform to study motor activity across contexts by training a model to solve four complex tasks. Using methods familiar to neuroscientists, we describe the behavioral representations and algorithms employed by different layers of the network using a neuroethological approach to characterize motor activity relative to the rodent's behavior and goals. We find that the model uses two classes of representations which respectively encode the task-specific behavioral strategies and task-invariant behavioral kinematics. These representations are reflected in the sequential activity and population dynamics of neural subpopulations. 
Overall, the virtual rodent facilitates grounded collaborations between deep reinforcement learning and motor neuroscience.",/pdf/31957327a576d1964b20f9e1881af36e76d59d6e.pdf,ICLR,2020,"We built a physical simulation of a rodent, trained it to solve a set of tasks, and analyzed the resulting networks." +ryRh0bb0Z,Hkp3AWbRZ,1509130000000.0,1528370000000.0,745,Multi-View Data Generation Without View Supervision,"[""mickael.chen@lip6.fr"", ""ludovic.denoyer@lip6.fr"", ""thierry.artieres@lif.univ-mrs.fr""]","[""Mickael Chen"", ""Ludovic Denoyer"", ""Thierry Arti\u00e8res""]","[""multi-view"", ""adversarial learning"", ""generative model""]","The development of high-dimensional generative models has recently gained a great surge of interest with the introduction of variational auto-encoders and generative adversarial neural networks. Different variants have been proposed where the underlying latent space is structured, for example, based on attributes describing the data to generate. We focus on a particular problem where one aims at generating samples corresponding to a number of objects under various views. We assume that the distribution of the data is driven by two independent latent factors: the content, which represents the intrinsic features of an object, and the view, which stands for the settings of a particular observation of that object. Therefore, we propose a generative model and a conditional variant built on such a disentangled latent space. This approach allows us to generate realistic samples corresponding to various objects in a high variety of views. Unlike many multi-view approaches, our model doesn't need any supervision on the views but only on the content. Compared to other conditional generation approaches that are mostly based on binary or categorical attributes, we make no such assumption about the factors of variations. Our model can be used on problems with a huge, potentially infinite, number of categories. 
We evaluate it on four image datasets, on which we demonstrate the effectiveness of the model and its ability to generalize. ",/pdf/a9118b028383e33ddd307f714ca1ecb1ec094415.pdf,ICLR,2018,"We describe a novel multi-view generative model that can generate multiple views of the same object, or multiple objects in the same view, with no need for labels on views." +SkGNrnC9FQ,SylB1y05K7,1538090000000.0,1545360000000.0,1535,Manifold Alignment via Feature Correspondence,"[""jay.stanley@yale.edu"", ""guy.wolf@yale.edu"", ""smita.krishnaswamy@yale.edu""]","[""Jay S. Stanley III"", ""Guy Wolf"", ""Smita Krishnaswamy""]","[""graph signal processing"", ""graph alignment"", ""manifold alignment"", ""spectral graph wavelet transform"", ""diffusion geometry"", ""harmonic analysis""]","We propose a novel framework for combining datasets via alignment of their associated intrinsic dimensions. Our approach assumes that the two datasets are sampled from a common latent space, i.e., they measure equivalent systems. Thus, we expect there to exist a natural (albeit unknown) alignment of the data manifolds associated with the intrinsic geometry of these datasets, which are perturbed by measurement artifacts in the sampling process. Importantly, we do not assume any individual correspondence (partial or complete) between data points. Instead, we rely on our assumption that a subset of data features have correspondence across datasets. We leverage this assumption to estimate relations between intrinsic manifold dimensions, which are given by diffusion map coordinates over each of the datasets. We compute a correlation matrix between diffusion coordinates of the datasets by considering graph (or manifold) Fourier coefficients of corresponding data features. We then orthogonalize this correlation matrix to form an isometric transformation between the diffusion maps of the datasets. 
Finally, we apply this transformation to the diffusion coordinates and construct a unified diffusion geometry of the datasets together. We show that this approach successfully corrects misalignment artifacts, and allows for integrated data.",/pdf/6ef1462d5fc32677006cbffe389e1874946c266d.pdf,ICLR,2019,We propose a method for aligning the latent features learned from different datasets using harmonic correlations. +Skl4LTEtDS,SklnBwuwPS,1569440000000.0,1577170000000.0,556,Growing Action Spaces,"[""gregory.farquhar@cs.ox.ac.uk"", ""lgustafson@fb.com"", ""zlin@fb.com"", ""shimon.whiteson@cs.ox.ac.uk"", ""usunier@fb.com"", ""gab@fb.com""]","[""Gregory Farquhar"", ""Laura Gustafson"", ""Zeming Lin"", ""Shimon Whiteson"", ""Nicolas Usunier"", ""Gabriel Synnaeve""]","[""reinforcement learning"", ""curriculum learning"", ""multi-agent reinforcement learning""]","In complex tasks, such as those with large combinatorial action spaces, random exploration may be too inefficient to achieve meaningful learning progress. In this work, we use a curriculum of progressively growing action spaces to accelerate learning. We assume the environment is out of our control, but that the agent may set an internal curriculum by initially restricting its action space. Our approach uses off-policy reinforcement learning to estimate optimal value functions for multiple action spaces simultaneously and efficiently transfers data, value estimates, and state representations from restricted action spaces to the full task. 
We show the efficacy of our approach in proof-of-concept control tasks and on challenging large-scale StarCraft micromanagement tasks with large, multi-agent action spaces.",/pdf/287b5fdd453652e7a70515a37bad5d464443628a.pdf,ICLR,2020,Progressively growing the available action space is a great curriculum for learning agents +ByxZdj09tX,BygnC_L-F7,1538090000000.0,1545360000000.0,327,"FROM DEEP LEARNING TO DEEP DEDUCING: AUTOMATICALLY TRACKING DOWN NASH EQUILIBRIUM THROUGH AUTONOMOUS NEURAL AGENT, A POSSIBLE MISSING STEP TOWARD GENERAL A.I.","[""brownwang0426@gmail.com""]","[""Brown Wang""]","[""Reinforcement Learning"", ""Deep Feed-forward Neural Network"", ""Recurrent Neural Network"", ""Game Theory"", ""Control Theory"", ""Nash Equilibrium"", ""Optimization""]","Contrary to most reinforcement learning studies, which emphasize training a deep neural network to approximate its output layer to certain strategies, this paper proposes a reversed method for reinforcement learning. We call this “Deep Deducing”. In short, after adequately training a deep neural network according to a strategy-environment-to-payoff table, we initialize a randomized strategy +input and propagate the error between the actual output and the desired output back to the initially-randomized strategy input in the “input layer” of the trained deep neural network gradually to perform a task similar to “human deduction”. 
And we view the final strategy input in the “input layer” as the fittest strategy for a neural network when confronting the observed environment input from the world outside.",/pdf/bae28453950d6cc5d571551031e04ea73f2b4442.pdf,ICLR,2019,FROM DEEP LEARNING TO DEEP DEDUCING +rALA0Xo6yNJ,6Xdo80FBEWY,1601310000000.0,1616520000000.0,1331,Learning to Reach Goals via Iterated Supervised Learning,"[""~Dibya_Ghosh1"", ""~Abhishek_Gupta1"", ""~Ashwin_Reddy1"", ""~Justin_Fu1"", ""~Coline_Manon_Devin1"", ""~Benjamin_Eysenbach1"", ""~Sergey_Levine1""]","[""Dibya Ghosh"", ""Abhishek Gupta"", ""Ashwin Reddy"", ""Justin Fu"", ""Coline Manon Devin"", ""Benjamin Eysenbach"", ""Sergey Levine""]","[""goal reaching"", ""reinforcement learning"", ""behavior cloning"", ""goal-conditioned RL""]","Current reinforcement learning (RL) algorithms can be brittle and difficult to use, especially when learning goal-reaching behaviors from sparse rewards. Although supervised imitation learning provides a simple and stable alternative, it requires access to demonstrations from a human supervisor. In this paper, we study RL algorithms that use imitation learning to acquire goal reaching policies from scratch, without the need for expert demonstrations or a value function. In lieu of demonstrations, we leverage the property that any trajectory is a successful demonstration for reaching the final state in that same trajectory. We propose a simple algorithm in which an agent continually relabels and imitates the trajectories it generates to progressively learn goal-reaching behaviors from scratch. Each iteration, the agent collects new trajectories using the latest policy, and maximizes the likelihood of the actions along these trajectories under the goal that was actually reached, so as to improve the policy. 
We formally show that this iterated supervised learning procedure optimizes a bound on the RL objective, derive performance bounds of the learned policy, and empirically demonstrate improved goal-reaching performance and robustness over current RL algorithms in several benchmark tasks. ",/pdf/34b6d953408a7aaff3549569738b80162e8e6dbc.pdf,ICLR,2021,"We present GCSL, a simple RL method that uses supervised learning to learn goal-reaching policies." +Hkx7_1rKwS,HyxyUyAuwr,1569440000000.0,1583910000000.0,1797,On Solving Minimax Optimization Locally: A Follow-the-Ridge Approach,"[""yuanhao-16@mails.tsinghua.edu.cn"", ""gdzhang@cs.toronto.edu"", ""jba@cs.toronto.edu""]","[""Yuanhao Wang*"", ""Guodong Zhang*"", ""Jimmy Ba""]","[""minimax optimization"", ""smooth differentiable games"", ""local convergence"", ""generative adversarial networks"", ""optimization""]","Many tasks in modern machine learning can be formulated as finding equilibria in sequential games. In particular, two-player zero-sum sequential games, also known as minimax optimization, have received growing interest. It is tempting to apply gradient descent to solve minimax optimization given its popularity and success in supervised learning. However, it has been noted that naive application of gradient descent fails to find some local minimax and can converge to non-local-minimax points. In this paper, we propose Follow-the-Ridge (FR), a novel algorithm that provably converges to and only converges to local minimax. We show theoretically that the algorithm addresses the notorious rotational behaviour of gradient dynamics, and is compatible with preconditioning and positive momentum. Empirically, FR solves toy minimax problems and improves the convergence of GAN training compared to the recent minimax optimization algorithms. 
",/pdf/6a3d8fe978f41b939fae6fb8dd86e6bd7c2ecf57.pdf,ICLR,2020, +EKb4Z0aSNf,rglNG24F0C,1601310000000.0,1614990000000.0,1642,CLOPS: Continual Learning of Physiological Signals,"[""~Dani_Kiyasseh1"", ""tingting.zhu@eng.ox.ac.uk"", ""~David_A._Clifton1""]","[""Dani Kiyasseh"", ""Tingting Zhu"", ""David A. Clifton""]","[""Continual learning"", ""physiological signals"", ""healthcare""]","Deep learning algorithms are known to experience destructive interference when instances violate the assumption of being independent and identically distributed (i.i.d). This violation, however, is ubiquitous in clinical settings where data are streamed temporally and from a multitude of physiological sensors. To overcome this obstacle, we propose CLOPS, a replay-based continual learning strategy. In three continual learning scenarios based on three publically-available datasets, we show that CLOPS can outperform the state-of-the-art methods, GEM and MIR. Moreover, we propose end-to-end trainable parameters, which we term task-instance parameters, that can be used to quantify task difficulty and similarity. 
This quantification yields insights into both network interpretability and clinical applications, where task difficulty is poorly quantified.",/pdf/abbad498238f1d6a353d71e52a4bf1703fc29065.pdf,ICLR,2021, +rJxX8T4Kvr,r1ltx4uwwH,1569440000000.0,1629880000000.0,553,Learning Efficient Parameter Server Synchronization Policies for Distributed SGD,"[""red.zr@alibaba-inc.com"", ""yangsheng@hit.edu.cn"", ""andreaswernerrober@alibaba-inc.com"", ""zhengping.qzp@alibaba-inc.com"", ""jingren.zhou@alibaba-inc.com""]","[""Rong Zhu"", ""Sheng Yang"", ""Andreas Pfadler"", ""Zhengping Qian"", ""Jingren Zhou""]","[""Distributed SGD"", ""Paramter-Server"", ""Synchronization Policy"", ""Reinforcement Learning""]","We apply a reinforcement learning (RL) based approach to learning optimal synchronization policies used for Parameter Server-based distributed training of machine learning models with Stochastic Gradient Descent (SGD). Utilizing a formal synchronization policy description in the PS-setting, we are able to derive a suitable and compact description of states and actions, allowing us to efficiently use the standard off-the-shelf deep Q-learning algorithm. As a result, we are able to learn synchronization policies which generalize to different cluster environments, different training datasets and small model variations and (most importantly) lead to considerable decreases in training time when compared to standard policies such as bulk synchronous parallel (BSP), asynchronous parallel (ASP), or stale synchronous parallel (SSP). To support our claims we present extensive numerical results obtained from experiments performed in simulated cluster environments. 
In our experiments, training time is reduced by 44 on average and learned policies generalize to multiple unseen circumstances.",/pdf/df5227746e38e8b3290b9c3bed548d248060f48c.pdf,ICLR,2020,We apply a reinforcement learning based approach to learning optimal synchronization policies used for Parameter Server-based distributed training of SGD. +hx1IXFHAw7R,1qlPe3QKepu,1601310000000.0,1616060000000.0,2317,Provable Rich Observation Reinforcement Learning with Combinatorial Latent States,"[""~Dipendra_Misra1"", ""~Qinghua_Liu1"", ""~Chi_Jin1"", ""jcl@microsoft.com""]","[""Dipendra Misra"", ""Qinghua Liu"", ""Chi Jin"", ""John Langford""]","[""Reinforcement learning theory"", ""Rich observation"", ""Noise-contrastive learning"", ""State abstraction"", ""Factored MDP""]","We propose a novel setting for reinforcement learning that combines two common real-world difficulties: presence of observations (such as camera images) and factored states (such as location of objects). In our setting, the agent receives observations generated stochastically from a ""latent"" factored state. These observations are ""rich enough"" to enable decoding of the latent state and remove partial observability concerns. Since the latent state is combinatorial, the size of the state space is exponential in the number of latent factors. We create a learning algorithm FactoRL (Fact-o-Rel) for this setting, which uses noise-contrastive learning to identify latent structures in emission processes and discover a factorized state space. We derive polynomial sample complexity guarantees for FactoRL, which depend polynomially on the number of factors and only very weakly on the size of the observation space. 
We also provide a guarantee of polynomial time complexity when given access to an efficient planning algorithm.",/pdf/6a01a542edf09482d75550c673ddcb462727111a.pdf,ICLR,2021,We introduce a problem setup and a provable reinforcement learning algorithm for rich-observation problems with latent combinatorially large state space. +r1LXit5ee,,1478300000000.0,1488400000000.0,519,Episodic Exploration for Deep Deterministic Policies for StarCraft Micromanagement,"[""usunier@fb.com"", ""gab@fb.com"", ""zlin@fb.com"", ""soumith@fb.com""]","[""Nicolas Usunier"", ""Gabriel Synnaeve"", ""Zeming Lin"", ""Soumith Chintala""]","[""Deep learning"", ""Reinforcement Learning"", ""Games""]","We consider scenarios from the real-time strategy game StarCraft as benchmarks for reinforcement learning algorithms. We focus on micromanagement, that is, the short-term, low-level control of team members during a battle. We propose several scenarios that are challenging for reinforcement learning algorithms because the state- action space is very large, and there is no obvious feature representation for the value functions. We describe our approach to tackle the micromanagement scenarios with deep neural network controllers from raw state features given by the game engine. We also present a heuristic reinforcement learning algorithm which combines direct exploration in the policy space and backpropagation. This algorithm collects traces for learning using deterministic policies, which appears much more efficient than, e.g., ε-greedy exploration. Experiments show that this algorithm allows to successfully learn non-trivial strategies for scenarios with armies of up to 15 agents, where both Q-learning and REINFORCE struggle.",/pdf/a436dd86fa64bf2053b3323a072669d6b505e675.pdf,ICLR,2017,"We propose a new reinforcement learning algorithm based on zero order optimization, that we evaluate on StarCraft micromanagement scenarios." 
+rkgFXR4KPr,SygYworODS,1569440000000.0,1577170000000.0,1046,A Simple Recurrent Unit with Reduced Tensor Product Representations,"[""shuaitang93@ucsd.edu"", ""paul.smolensky@gmail.com"", ""desa@ucsd.edu""]","[""Shuai Tang"", ""Paul Smolensky"", ""Virginia R. de Sa""]","[""RNNs"", ""TPRs""]","Widely used recurrent units, including Long-short Term Memory (LSTM) and Gated Recurrent Unit (GRU), perform well on natural language tasks, but their ability to learn structured representations is still questionable. Exploiting reduced Tensor Product Representations (TPRs) --- distributed representations of symbolic structure in which vector-embedded symbols are bound to vector-embedded structural positions --- we propose the TPRU, a simple recurrent unit that, at each time step, explicitly executes structural-role binding and unbinding operations to incorporate structural information into learning. The gradient analysis of our proposed TPRU is conducted to support our model design, and its performance on multiple datasets shows the effectiveness of it. Furthermore, observations on linguistically grounded study demonstrate the interpretability of our TPRU.",/pdf/293e9e4eea0f8016f36d647d52237fbed21214e5.pdf,ICLR,2020, +H1eLVxrKwS,BkxJjUlFwB,1569440000000.0,1577170000000.0,2249,Removing input features via a generative model to explain their attributions to classifier's decisions,"[""chiragagarwall12@gmail.com"", ""dans@uic.edu"", ""anh.ng8@gmail.com""]","[""Chirag Agarwal"", ""Dan Schonfeld"", ""Anh Nguyen""]","[""attribution maps"", ""generative models"", ""inpainting"", ""counterfactual"", ""explanations"", ""interpretability"", ""explainability""]","Interpretability methods often measure the contribution of an input feature to an image classifier's decisions by heuristically removing it via e.g. blurring, adding noise, or graying out, which often produce unrealistic, out-of-samples. 
Instead, we propose to integrate a generative inpainter into three representative attribution map methods as a mechanism for removing input features. Compared to the original counterparts, our methods (1) generate more plausible counterfactual samples under the true data generating process; (2) are more robust to hyperparameter settings; and (3) localize objects more accurately. Our findings were consistent across both ImageNet and Places365 datasets and two different pairs of classifiers and inpainters.",/pdf/495fce6eba96e79aff572966e92899c4ef9af75c.pdf,ICLR,2020, +HkeyZhC9F7,rJx_Yj15KQ,1538090000000.0,1545360000000.0,1133,Learning Heuristics for Automated Reasoning through Reinforcement Learning,"[""gilled@berkeley.edu"", ""markus.norman.rabe@gmail.com"", ""eal@berkeley.edu"", ""sseshia@eecs.berkeley.edu""]","[""Gil Lederman"", ""Markus N. Rabe"", ""Edward A. Lee"", ""Sanjit A. Seshia""]","[""reinforcement learning"", ""deep learning"", ""logics"", ""formal methods"", ""automated reasoning"", ""backtracking search"", ""satisfiability"", ""quantified Boolean formulas""]","We demonstrate how to learn efficient heuristics for automated reasoning algorithms through deep reinforcement learning. We focus on backtracking search algorithms for quantified Boolean logics, which already can solve formulas of impressive size - up to 100s of thousands of variables. The main challenge is to find a representation of these formulas that lends itself to making predictions in a scalable way. For challenging problems, the heuristic learned through our approach reduces execution time by a factor of 10 compared to the existing handwritten heuristics.",/pdf/0cd39db88fadedf14c8a841fd359f62a53297a39.pdf,ICLR,2019,RL finds better heuristics for automated reasoning algorithms. 
+8cpHIfgY4Dj,rRumf43bxv9,1601310000000.0,1616840000000.0,405,FOCAL: Efficient Fully-Offline Meta-Reinforcement Learning via Distance Metric Learning and Behavior Regularization,"[""~Lanqing_Li1"", ""yangrui19@mails.tsinghua.edu.cn"", ""~Dijun_Luo1""]","[""Lanqing Li"", ""Rui Yang"", ""Dijun Luo""]","[""offline/batch reinforcement learning"", ""meta-reinforcement learning"", ""multi-task reinforcement learning"", ""distance metric learning"", ""contrastive learning""]","We study the offline meta-reinforcement learning (OMRL) problem, a paradigm which enables reinforcement learning (RL) algorithms to quickly adapt to unseen tasks without any interactions with the environments, making RL truly practical in many real-world applications. This problem is still not fully understood, for which two major challenges need to be addressed. First, offline RL usually suffers from bootstrapping errors of out-of-distribution state-actions which leads to divergence of value functions. Second, meta-RL requires efficient and robust task inference learned jointly with control policy. In this work, we enforce behavior regularization on learned policy as a general approach to offline RL, combined with a deterministic context encoder for efficient task inference. We propose a novel negative-power distance metric on bounded context embedding space, whose gradients propagation is detached from the Bellman backup. We provide analysis and insight showing that some simple design choices can yield substantial improvements over recent approaches involving meta-RL and distance metric learning. 
To the best of our knowledge, our method is the first model-free and end-to-end OMRL algorithm, which is computationally efficient and demonstrated to outperform prior algorithms on several meta-RL benchmarks.",/pdf/44984a3c82f19e6fc4db9819ab9140e0cc3ca7e0.pdf,ICLR,2021,"A novel model-free, end-to-end fully-offline meta-RL algorithm designed to maximize practicality, performance and sample/computational efficiency." +r6I3EvB9eDO,tqoS7a89r-q,1601310000000.0,1614990000000.0,79,Necessary and Sufficient Conditions for Compositional Representations,"[""~Yuanpeng_Li2""]","[""Yuanpeng Li""]","[""Compositionality""]","Humans leverage compositionality for flexible and efficient learning, but current machine learning algorithms lack such ability. Despite many efforts in specific cases, there is still absence of theories and tools to study it systematically. In this paper, we leverage group theory to mathematically prove necessary and sufficient conditions for two fundamental questions of compositional representations. (1) What are the properties for a set of components to be expressed compositionally. (2) What are the properties for mappings between compositional and entangled representations. We provide examples to better understand the conditions and how to apply them. E.g., we use the theory to give a new explanation of why attention mechanism helps compositionality. We hope this work will help to advance understanding of compositionality and improvement of artificial intelligence towards human level. 
+",/pdf/a563f504df9d308b7c2fd18dd84786ddec6a88ca.pdf,ICLR,2021, +r1uOhfb0W,Sky7hfZRZ,1509140000000.0,1518730000000.0,963,Learning Sparse Structured Ensembles with SG-MCMC and Network Pruning,"[""zhangyic17@mails.tsinghua.edu.cn"", ""ozj@tsinghua.edu.cn""]","[""Yichi Zhang"", ""Zhijian Ou""]","[""ensemble learning"", ""SG-MCMC"", ""group sparse prior"", ""network pruning""]","An ensemble of neural networks is known to be more robust and accurate than an individual network, however usually with linearly-increased cost in both training and testing. +In this work, we propose a two-stage method to learn Sparse Structured Ensembles (SSEs) for neural networks. +In the first stage, we run SG-MCMC with group sparse priors to draw an ensemble of samples from the posterior distribution of network parameters. In the second stage, we apply weight-pruning to each sampled network and then perform retraining over the remained connections. +In this way of learning SSEs with SG-MCMC and pruning, we not only achieve high prediction accuracy since SG-MCMC enhances exploration of the model-parameter space, but also reduce memory and computation cost significantly in both training and testing of NN ensembles. +This is thoroughly evaluated in the experiments of learning SSE ensembles of both FNNs and LSTMs. +For example, in LSTM based language modeling (LM), we obtain 21\% relative reduction in LM perplexity by learning a SSE of 4 large LSTM models, which has only 30\% of model parameters and 70\% of computations in total, as compared to the baseline large LSTM LM. 
+To the best of our knowledge, this work represents the first methodology and empirical study of integrating SG-MCMC, group sparse prior and network pruning together for learning NN ensembles.",/pdf/ae5cb0a74b0fdd0c7daf8c0513cd259d79cb9fbd.pdf,ICLR,2018,"Propose a novel method by integrating SG-MCMC sampling, group sparse prior and network pruning to learn Sparse Structured Ensemble (SSE) with improved performance and significantly reduced cost than traditional methods. " +rkxciiC9tm,ryg-tz6wtX,1538090000000.0,1550120000000.0,645,NADPEx: An on-policy temporally consistent exploration method for deep reinforcement learning,"[""xiesirui@sensetime.com"", ""huangjunning@sensetime.com"", ""leilanxin@sensetime.com"", ""liuchunxiao@sensetime.com"", ""mazheng@sensetime.com"", ""wayne.zhang@sensetime.com"", ""linliang@ieee.org""]","[""Sirui Xie"", ""Junning Huang"", ""Lanxin Lei"", ""Chunxiao Liu"", ""Zheng Ma"", ""Wei Zhang"", ""Liang Lin""]","[""Reinforcement learning"", ""exploration""]","Reinforcement learning agents need exploratory behaviors to escape from local optima. These behaviors may include both immediate dithering perturbation and temporally consistent exploration. To achieve these, a stochastic policy model that is inherently consistent through a period of time is in desire, especially for tasks with either sparse rewards or long term information. In this work, we introduce a novel on-policy temporally consistent exploration strategy - Neural Adaptive Dropout Policy Exploration (NADPEx) - for deep reinforcement learning agents. Modeled as a global random variable for conditional distribution, dropout is incorporated to reinforcement learning policies, equipping them with inherent temporal consistency, even when the reward signals are sparse. Two factors, gradients' alignment with the objective and KL constraint in policy space, are discussed to guarantee NADPEx policy's stable improvement. 
Our experiments demonstrate that NADPEx solves tasks with sparse reward while naive exploration and parameter noise fail. It yields as well or even faster convergence in the standard mujoco benchmark for continuous control. ",/pdf/96c6dc86a4d7b858db86595f4a4de5790b9d92e7.pdf,ICLR,2019, +UcoXdfrORC,QASrYOF0U8I,1601310000000.0,1615950000000.0,1330,Model-Based Visual Planning with Self-Supervised Functional Distances,"[""~Stephen_Tian1"", ""~Suraj_Nair1"", ""~Frederik_Ebert1"", ""~Sudeep_Dasari2"", ""~Benjamin_Eysenbach1"", ""~Chelsea_Finn1"", ""~Sergey_Levine1""]","[""Stephen Tian"", ""Suraj Nair"", ""Frederik Ebert"", ""Sudeep Dasari"", ""Benjamin Eysenbach"", ""Chelsea Finn"", ""Sergey Levine""]","[""planning"", ""model learning"", ""distance learning"", ""reinforcement learning"", ""robotics""]","A generalist robot must be able to complete a variety of tasks in its environment. One appealing way to specify each task is in terms of a goal observation. However, learning goal-reaching policies with reinforcement learning remains a challenging problem, particularly when hand-engineered reward functions are not available. Learned dynamics models are a promising approach for learning about the environment without rewards or task-directed data, but planning to reach goals with such a model requires a notion of functional similarity between observations and goal states. We present a self-supervised method for model-based visual goal reaching, which uses both a visual dynamics model as well as a dynamical distance function learned using model-free reinforcement learning. Our approach learns entirely using offline, unlabeled data, making it practical to scale to large and diverse datasets. In our experiments, we find that our method can successfully learn models that perform a variety of tasks at test-time, moving objects amid distractors with a simulated robotic arm and even learning to open and close a drawer using a real-world robot. 
In comparisons, we find that this approach substantially outperforms both model-free and model-based prior methods.",/pdf/e7d9842e2ee26ac0242d6efcf7a863c669541594.pdf,ICLR,2021,"We combine model-based planning with dynamical distance learning to solve visual goal-reaching tasks, using random, unlabeled, experience." +SyYYPdg0-,HkdYvOg0-,1509100000000.0,1518730000000.0,311,Counterfactual Image Networks,"[""denizokt@mit.edu"", ""vondrick@google.com"", ""torralba@mit.edu""]","[""Deniz Oktay"", ""Carl Vondrick"", ""Antonio Torralba""]","[""computer vision"", ""image segmentation"", ""generative models"", ""adversarial networks"", ""unsupervised learning""]","We capitalize on the natural compositional structure of images in order to learn object segmentation with weakly labeled images. The intuition behind our approach is that removing objects from images will yield natural images, however removing random patches will yield unnatural images. We leverage this signal to develop a generative model that decomposes an image into layers, and when all layers are combined, it reconstructs the input image. However, when a layer is removed, the model learns to produce a different image that still looks natural to an adversary, which is possible by removing objects. Experiments and visualizations suggest that this model automatically learns object segmentation on images labeled only by scene better than baselines.",/pdf/db4ebde3ae5b93c15c0f39da74857a14279b2b06.pdf,ICLR,2018,Weakly-supervised image segmentation using compositional structure of images and generative models. +Cb54AMqHQFP,6Rjx_AiuQz3,1601310000000.0,1615850000000.0,69,Network Pruning That Matters: A Case Study on Retraining Variants,"[""~Duong_Hoang_Le2"", ""~Binh-Son_Hua1""]","[""Duong Hoang Le"", ""Binh-Son Hua""]","[""Network Pruning""]","Network pruning is an effective method to reduce the computational expense of over-parameterized neural networks for deployment on low-resource systems. 
Recent state-of-the-art techniques for retraining pruned networks such as weight rewinding and learning rate rewinding have been shown to outperform the traditional fine-tuning technique in recovering the lost accuracy (Renda et al., 2020), but so far it is unclear what accounts for such performance. In this work, we conduct extensive experiments to verify and analyze the uncanny effectiveness of learning rate rewinding. We find that the reason behind the success of learning rate rewinding is the usage of a large learning rate. Similar phenomenon can be observed in other learning rate schedules that involve large learning rates, e.g., the 1-cycle learning rate schedule (Smith et al., 2019). By leveraging the right learning rate schedule in retraining, we demonstrate a counter-intuitive phenomenon in that randomly pruned networks could even achieve better performance than methodically pruned networks (fine-tuned with the conventional approach). Our results emphasize the cruciality of the learning rate schedule in pruned network retraining - a detail often overlooked by practitioners during the implementation of network pruning. ",/pdf/cd800d481759bd2472d081c050f8be1d94f91760.pdf,ICLR,2021,We study the effective of different retraining mechanisms while doing pruning +rkezdaEtvH,SJllrhsDvS,1569440000000.0,1577170000000.0,624,Hyperbolic Discounting and Learning Over Multiple Horizons,"[""liam.fedus@gmail.com"", ""carlesgelada@hotmail.com"", ""yoshua.bengio@mila.quebec"", ""bellemare@google.com"", ""hugolarochelle@google.com""]","[""William Fedus"", ""Carles Gelada"", ""Yoshua Bengio"", ""Marc G. Bellemare"", ""Hugo Larochelle""]","[""Deep learning"", ""reinforcement learning"", ""discounting"", ""hyperbolic discounting"", ""auxiliary tasks""]","Reinforcement learning (RL) typically defines a discount factor as part of the Markov Decision Process. 
The discount factor values future rewards by an exponential scheme that leads to theoretical convergence guarantees of the Bellman equation. However, evidence from psychology, economics and neuroscience suggests that humans and animals instead have hyperbolic time-preferences. Here we extend earlier work of Kurth-Nelson and Redish and propose an efficient deep reinforcement learning agent that acts via hyperbolic discounting and other non-exponential discount mechanisms. We demonstrate that a simple approach approximates hyperbolic discount functions while still using familiar temporal-difference learning techniques in RL. Additionally, and independent of hyperbolic discounting, we make a surprising discovery that simultaneously learning value functions over multiple time-horizons is an effective auxiliary task which often improves over state-of-the-art methods.",/pdf/30949b5bd4239375b53f85a4754bfe71b57bea13.pdf,ICLR,2020,A deep RL agent that learns hyperbolic (and other non-exponential) Q-values and a new multi-horizon auxiliary task. +LuyryrCs6Ez,Dl5dc4epnl,1601310000000.0,1614990000000.0,404,CURI: A Benchmark for Productive Concept Learning Under Uncertainty,"[""~Shanmukha_Ramakrishna_Vedantam1"", ""~Arthur_Szlam1"", ""~Maximilian_Nickel1"", ""~Ari_S._Morcos1"", ""~Brenden_M._Lake1""]","[""Shanmukha Ramakrishna Vedantam"", ""Arthur Szlam"", ""Maximilian Nickel"", ""Ari S. Morcos"", ""Brenden M. Lake""]","[""compositional learning"", ""meta-learning"", ""systematicity"", ""reasoning""]","Humans can learn and reason under substantial uncertainty in a space of infinitely many concepts, including structured relational concepts (“a scene with objects that have the same color”) and ad-hoc categories defined through goals (“objects that could fall on one’s head”). 
In contrast, standard classification benchmarks: 1) consider only a fixed set of category labels, 2) do not evaluate compositional concept learning and 3) do not explicitly capture a notion of reasoning under uncertainty. We introduce a new few-shot, meta-learning benchmark, Compositional Reasoning Under Uncertainty (CURI) to bridge this gap. CURI evaluates different aspects of productive and systematic generalization, including abstract understandings of disentangling, productive generalization, learning boolean operations, variable binding, etc. Importantly, it also defines a model-independent “compositionality gap” to evaluate difficulty of generalizing out-of-distribution along each of these axes. Extensive evaluations across a range of modeling choices spanning different modalities (image, schemas, and sounds), splits, privileged auxiliary concept information, and choices of negatives reveal substantial scope for modeling advances on the proposed task. All code and datasets will be available online.",/pdf/07a8bb21964ebf4914b375eda496ed7fb226050d.pdf,ICLR,2021,A novel benchmark that tests compositional reasoning about concepts under uncertainty +SkZxCk-0Z,rJlg0JZAW,1509130000000.0,1519390000000.0,531,Can Neural Networks Understand Logical Entailment?,"[""richardevans@google.com"", ""saxton@google.com"", ""davidamos@google.com"", ""pushmeet@google.com"", ""etg@google.com""]","[""Richard Evans"", ""David Saxton"", ""David Amos"", ""Pushmeet Kohli"", ""Edward Grefenstette""]","[""structure"", ""neural networks"", ""logic"", ""dataset""]","We introduce a new dataset of logical entailments for the purpose of measuring models' ability to capture and exploit the structure of logical expressions against an entailment prediction task. We use this task to compare a series of architectures which are ubiquitous in the sequence-processing literature, in addition to a new model class---PossibleWorldNets---which computes entailment as a ``convolution over possible worlds''. 
Results show that convolutional networks present the wrong inductive bias for this class of problems relative to LSTM RNNs, tree-structured neural networks outperform LSTM RNNs due to their enhanced ability to exploit the syntax of logic, and PossibleWorldNets outperform all benchmarks.",/pdf/2d561803fec6819ea8b9fe1e19f154297d97dd95.pdf,ICLR,2018,We introduce a new dataset of logical entailments for the purpose of measuring models' ability to capture and exploit the structure of logical expressions against an entailment prediction task. +rJgPFgHFwr,BylcW0xYDB,1569440000000.0,1577170000000.0,2439,Laconic Image Classification: Human vs. Machine Performance,"[""jaco_1031@hotmail.com"", ""aidhog@gmail.com"", ""jorge.perez.rojas@gmail.com""]","[""Javier Carrasco"", ""Aidan Hogan"", ""Jorge P\u00e9rez""]","[""minimal images"", ""entropy"", ""human vs. machine performance""]","We propose laconic classification as a novel way to understand and compare the performance of diverse image classifiers. The goal in this setting is to minimise the amount of information (aka. entropy) required in individual test images to maintain correct classification. Given a classifier and a test image, we compute an approximate minimal-entropy positive image for which the classifier provides a correct classification, becoming incorrect upon any further reduction. The notion of entropy offers a unifying metric that allows to combine and compare the effects of various types of reductions (e.g., crop, colour reduction, resolution reduction) on classification performance, in turn generalising similar methods explored in previous works. 
Proposing two complementary frameworks for computing the minimal-entropy positive images of both human and machine classifiers, in experiments over the ILSVRC test-set, we find that machine classifiers are more sensitive entropy-wise to reduced resolution (versus cropping or reduced colour for machines, as well as reduced resolution for humans), supporting recent results suggesting a texture bias in the ILSVRC-trained models used. We also find, in the evaluated setting, that humans classify the minimal-entropy positive images of machine models with higher precision than machines classify those of humans.",/pdf/9df6c5ed4c6fd3a172d7a34ff503862e94d95ce1.pdf,ICLR,2020,A framework for minimal entropy image classification and a comparison between machines and humans +c5klJN-Bpq1,u3viAIOZRCn,1601310000000.0,1614990000000.0,1709,Generalizing Tree Models for Improving Prediction Accuracy,"[""~Jaemin_Yoo1"", ""sael@ajou.ac.kr""]","[""Jaemin Yoo"", ""Lee Sael""]",[],"Can we generalize and improve the representation power of tree models? Tree models are often favored over deep neural networks due to their interpretable structures in problems where the interpretability is required, such as in the classification of feature-based data where each feature is meaningful. However, most tree models have low accuracies and easily overfit to training data. In this work, we propose Decision Transformer Network (DTN), our highly accurate and interpretable tree model based on our generalized framework of tree models, decision transformers. Decision transformers allow us to describe tree models in the context of deep learning. Our DTN is proposed based on improving the generalizable components of the decision transformer, which increases the representation power of tree models while preserving the inherent interpretability of the tree structure. 
Our extensive experiments on 121 feature-based datasets show that DTN outperforms the state-of-the-art tree models and even deep neural networks.",/pdf/951fb28bd0446328fedb42bd572c9c4b0e05f9a1.pdf,ICLR,2021, +ryh9pmcee,,1478270000000.0,1488840000000.0,200,Energy-based Generative Adversarial Networks,"[""jakezhao@cs.nyu.edu"", ""mathieu@cs.nyu.edu"", ""yann@cs.nyu.edu""]","[""Junbo Zhao"", ""Michael Mathieu"", ""Yann LeCun""]","[""Deep learning"", ""Unsupervised Learning"", ""Semi-Supervised Learning""]","We introduce the ""Energy-based Generative Adversarial Network"" model (EBGAN) which views the discriminator as an energy function that attributes low energies to the regions near the data manifold and higher energies to other regions. Similar to the probabilistic GANs, a generator is seen as being trained to produce contrastive samples with minimal energies, while the discriminator is trained to assign high energies to these generated samples. Viewing the discriminator as an energy function allows to use a wide variety of architectures and loss functionals in addition to the usual binary classifier with logistic output. Among them, we show one instantiation of EBGAN framework as using an auto-encoder architecture, with the energy being the reconstruction error, in place of the discriminator. We show that this form of EBGAN exhibits more stable behavior than regular GANs during training. We also show that a single-scale architecture can be trained to generate high-resolution images.",/pdf/b42ae537853626ee99448b99bd9ba4486bbe2481.pdf,ICLR,2017,"We introduce the ""Energy-based Generative Adversarial Network"" (EBGAN) model." 
+SJx0PAEFDS,HklfAqw_Pr,1569440000000.0,1577170000000.0,1193,Underwhelming Generalization Improvements From Controlling Feature Attribution,"[""joseph@viviano.ca"", ""becks.simpson@imagia.com"", ""francis.dutil@imagia.com"", ""yoshua.bengio@mila.quebec"", ""joseph@josephpcohen.com""]","[""Joseph D Viviano"", ""Becks Simpson"", ""Francis Dutil"", ""Yoshua Bengio"", ""Joseph Paul Cohen""]","[""interpretability"", ""medical"", ""generalization"", ""saliency""]","Overfitting is a common issue in machine learning, which can arise when the model learns to predict class membership using convenient but spuriously-correlated image features instead of the true image features that denote a class. These are typically visualized using saliency maps. In some object classification tasks such as for medical images, one may have some images with masks, indicating a region of interest, i.e., which part of the image contains the most relevant information for the classification. We describe a simple method for taking advantage of such auxiliary labels, by training networks to ignore the distracting features which may be extracted outside of the region of interest, on the training images for which such masks are available. This mask information is only used during training and has an impact on generalization accuracy in a dataset-dependent way. We observe an underwhelming relationship between controlling saliency maps and improving generalization performance.",/pdf/5bbab21af0d591c439a435e5649193e8bf595936.pdf,ICLR,2020,"There is hope that one can diagnose and fix overfitting in classifiers by studying and guiding their saliency maps, but we developed multiple methods to do this well and only see a minor positive effect on generalization." 
+KUDUoRsEphu,GHM-6siuDwh,1601310000000.0,1614690000000.0,1754,"Learning Incompressible Fluid Dynamics from Scratch - Towards Fast, Differentiable Fluid Models that Generalize","[""~Nils_Wandel2"", ""~Michael_Weinmann1"", ""~Reinhard_Klein1""]","[""Nils Wandel"", ""Michael Weinmann"", ""Reinhard Klein""]","[""Unsupervised Learning"", ""Fluid Dynamics"", ""U-Net""]","Fast and stable fluid simulations are an essential prerequisite for applications ranging from computer-generated imagery to computer-aided design in research and development. However, solving the partial differential equations of incompressible fluids is a challenging task and traditional numerical approximation schemes come at high computational costs. Recent deep learning based approaches promise vast speed-ups but do not generalize to new fluid domains, require fluid simulation data for training, or rely on complex pipelines that outsource major parts of the fluid simulation to traditional methods. + +In this work, we propose a novel physics-constrained training approach that generalizes to new fluid domains, requires no fluid simulation data, and allows convolutional neural networks to map a fluid state from time-point t to a subsequent state at time t+dt in a single forward pass. This simplifies the pipeline to train and evaluate neural fluid models. After training, the framework yields models that are capable of fast fluid simulations and can handle various fluid phenomena including the Magnus effect and Kármán vortex streets. We present an interactive real-time demo to show the speed and generalization capabilities of our trained models. Moreover, the trained neural networks are efficient differentiable fluid solvers as they offer a differentiable update step to advance the fluid simulation in time. We exploit this fact in a proof-of-concept optimal control experiment. 
Our models significantly outperform a recent differentiable fluid solver in terms of computational speed and accuracy.",/pdf/a304e46c25faf8b1991a7580669d779b3c3e2cd6.pdf,ICLR,2021,"We present an unsupervised training framework for incompressible fluid dynamics that allows neural networks to perform fast, accurate, differentiable fluid simulations and generalize to new domain geometries." +pbXQtKXwLS,Sm2XPa5r2P5,1601310000000.0,1614990000000.0,836,Guiding Neural Network Initialization via Marginal Likelihood Maximization,"[""~Anthony_Tai1"", ""~Chunfeng_Huang1""]","[""Anthony Tai"", ""Chunfeng Huang""]","[""Neural networks"", ""Gaussian processes"", ""model initialization"", ""marginal likelihood""]","We propose a simple, data-driven approach to help guide hyperparameter selection for neural network initialization. We leverage the relationship between neural network and Gaussian process models having corresponding activation and covariance functions to infer the hyperparameter values desirable for model initialization. Our experiment shows that marginal likelihood maximization provides recommendations that yield near-optimal prediction performance on MNIST classification task under experiment constraints. Furthermore, our empirical results indicate consistency in the proposed technique, suggesting that computation cost for the procedure could be significantly reduced with smaller training sets. ",/pdf/798a9f1f3775b13b85fe92b00c6d62ec5aa39d19.pdf,ICLR,2021,We propose using Gaussian process marginal likelihood maximization to recommend hyperparameter values for initialization of the corresponding neural network. 
+rJvJXZb0W,rkU1Qb-AZ,1509130000000.0,1519360000000.0,661,An efficient framework for learning sentence representations,"[""llajan@umich.edu"", ""honglak@eecs.umich.edu""]","[""Lajanugen Logeswaran"", ""Honglak Lee""]","[""sentence"", ""embeddings"", ""unsupervised"", ""representations"", ""learning"", ""efficient""]","In this work we propose a simple and efficient framework for learning sentence representations from unlabelled data. Drawing inspiration from the distributional hypothesis and recent work on learning sentence representations, we reformulate the problem of predicting the context in which a sentence appears as a classification problem. Given a sentence and the context in which it appears, a classifier distinguishes context sentences from other contrastive sentences based on their vector representations. This allows us to efficiently learn different types of encoding functions, and we show that the model learns high-quality sentence representations. We demonstrate that our sentence representations outperform state-of-the-art unsupervised and supervised representation learning methods on several downstream NLP tasks that involve understanding sentence semantics while achieving an order of magnitude speedup in training time.",/pdf/f81f89708e8a59c721d319ae71d188bd04ce8bcd.pdf,ICLR,2018,A framework for learning high-quality sentence representations efficiently. +RSn0s-T-qoy,MWajYsYF0Vq,1601310000000.0,1614990000000.0,3074,Multi-View Disentangled Representation,"[""~Zongbo_Han1"", ""~Changqing_Zhang1"", ""~Huazhu_Fu4"", ""~Qinghua_Hu1"", ""~Joey_Tianyi_Zhou1""]","[""Zongbo Han"", ""Changqing Zhang"", ""Huazhu Fu"", ""Qinghua Hu"", ""Joey Tianyi Zhou""]",[],"Learning effective representations for data with multiple views is crucial in machine learning and pattern recognition. Recently great efforts have focused on learning unified or latent representations to integrate information from different views for specific tasks. 
These approaches generally assume simple or implicit relationships between different views and as a result are not able to flexibly and explicitly depict the correlations among these views. To address this, we firstly propose the definition and conditions for multi-view disentanglement providing general instructions for disentangling representations between different views. Furthermore, a novel objective function is derived to explicitly disentangle the multi-view data into a shared part across different views and a (private) exclusive part within each view. Experiments on a variety of multi-modal datasets demonstrate that our objective can effectively disentangle information from different views while satisfying the disentangling conditions.",/pdf/408a3026f62529c7dbcff93c4f2ce77a5c4744d4.pdf,ICLR,2021, +HJxDugSFDB,Hklix6xtDr,1569440000000.0,1577170000000.0,2402,Stochastic Latent Actor-Critic: Deep Reinforcement Learning with a Latent Variable Model,"[""alexlee_gk@cs.berkeley.edu"", ""nagaban2@berkeley.edu"", ""pabbeel@cs.berkeley.edu"", ""svlevine@eecs.berkeley.edu""]","[""Alex X. Lee"", ""Anusha Nagabandi"", ""Pieter Abbeel"", ""Sergey Levine""]",[],"Deep reinforcement learning (RL) algorithms can use high-capacity deep networks to learn directly from image observations. However, these kinds of observation spaces present a number of challenges in practice, since the policy must now solve two problems: a representation learning problem, and a task learning problem. In this paper, we aim to explicitly learn representations that can accelerate reinforcement learning from images. We propose the stochastic latent actor-critic (SLAC) algorithm: a sample-efficient and high-performing RL algorithm for learning policies for complex continuous control tasks directly from high-dimensional image inputs. SLAC learns a compact latent representation space using a stochastic sequential latent variable model, and then learns a critic model within this latent space. 
By learning a critic within a compact state space, SLAC can learn much more efficiently than standard RL methods. The proposed model improves performance substantially over alternative representations as well, such as variational autoencoders. In fact, our experimental evaluation demonstrates that the sample efficiency of our resulting method is comparable to that of model-based RL methods that directly use a similar type of model for control. Furthermore, our method outperforms both model-free and model-based alternatives in terms of final performance and sample efficiency, on a range of difficult image-based control tasks. Our code and videos of our results are available at our website.",/pdf/48c6e1132313e6a0a58b5d6a63dc04b7985c18b0.pdf,ICLR,2020, +SkRsFSRpb,S1piKBR6Z,1508950000000.0,1518730000000.0,91,GeoSeq2Seq: Information Geometric Sequence-to-Sequence Networks,"[""alessandro.bay@cortexica.com"", ""biswasengupta@yahoo.com""]","[""Alessandro Bay"", ""Biswa Sengupta""]",[],"The Fisher information metric is an important foundation of information geometry, wherein it allows us to approximate the local geometry of a probability distribution. Recurrent neural networks such as the Sequence-to-Sequence (Seq2Seq) networks that have lately been used to yield state-of-the-art performance on speech translation or image captioning have so far ignored the geometry of the latent embedding, that they iteratively learn. We propose the information geometric Seq2Seq (GeoSeq2Seq) network which abridges the gap between deep recurrent neural networks and information geometry. Specifically, the latent embedding offered by a recurrent network is encoded as a Fisher kernel of a parametric Gaussian Mixture Model, a formalism common in computer vision. 
We utilise such a network to predict the shortest routes between two nodes of a graph by learning the adjacency matrix using the GeoSeq2Seq formalism; our results show that for such a problem the probabilistic representation of the latent embedding supersedes the non-probabilistic embedding by 10-15\%.",/pdf/7f663e8970562c400c33f84c08ee3ad1359f7826.pdf,ICLR,2018, +BkoCeqgR-,Sy9Ae9xRb,1509100000000.0,1518730000000.0,345,On the Construction and Evaluation of Color Invariant Networks,"[""konrad.groh@de.bosch.com""]","[""Konrad Groh""]","[""deep learning"", ""invariance"", ""data set"", ""evaluation""]","This is an empirical paper which constructs color invariant networks and evaluates their performances on a realistic data set. The paper studies the simplest possible case of color invariance: invariance under pixel-wise permutation of the color channels. Thus the network is aware not of the specific color object, but its colorfulness. The data set introduced in the paper consists of images showing crashed cars from which ten classes were extracted. An additional annotation was done which labeled whether the car shown was red or non-red. The networks were evaluated by their performance on the classification task. With the color annotation we altered the color ratios in the training data and analyzed the generalization capabilities of the networks on the unaltered test data. We further split the test data in red and non-red cars and did a similar evaluation. It is shown in the paper that an pixel-wise ordering of the rgb-values of the images performs better or at least similarly for small deviations from the true color ratios. 
The limits of these networks are also discussed.",/pdf/63ed0e265747fc86ee64fe0cb928930bf0d1cd0b.pdf,ICLR,2018,We construct and evaluate color invariant neural nets on a novel realistic data set +cO1IH43yUF,COLrwv6zCZ_,1601310000000.0,1615480000000.0,1284,Revisiting Few-sample BERT Fine-tuning,"[""~Tianyi_Zhang2"", ""~Felix_Wu1"", ""~Arzoo_Katiyar1"", ""~Kilian_Q_Weinberger1"", ""~Yoav_Artzi1""]","[""Tianyi Zhang"", ""Felix Wu"", ""Arzoo Katiyar"", ""Kilian Q Weinberger"", ""Yoav Artzi""]","[""Fine-tuning"", ""Optimization"", ""BERT""]","This paper is a study of fine-tuning of BERT contextual representations, with focus on commonly observed instabilities in few-sample scenarios. We identify several factors that cause this instability: the common use of a non-standard optimization method with biased gradient estimation; the limited applicability of significant parts of the BERT network for down-stream tasks; and the prevalent practice of using a pre-determined, and small number of training iterations. We empirically test the impact of these factors, and identify alternative practices that resolve the commonly observed instability of the process. In light of these observations, we re-visit recently proposed methods to improve few-sample fine-tuning with BERT and re-evaluate their effectiveness. Generally, we observe the impact of these methods diminishes significantly with our modified process. ",/pdf/b9891ff9bbfe3c50cf752711eb45ea789cd534aa.pdf,ICLR,2021, +r1aPbsFle,,1478240000000.0,1489260000000.0,121,Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling,"[""inanh@stanford.edu"", ""khosravi@stanford.edu"", ""rsocher@salesforce.com""]","[""Hakan Inan"", ""Khashayar Khosravi"", ""Richard Socher""]","[""Natural language processing"", ""Deep learning""]","Recurrent neural networks have been very successful at predicting sequences of words in tasks such as language modeling. 
However, all such models are based on the conventional classification framework, where the model is trained against one-hot targets, and each word is represented both as an input and as an output in isolation. This causes inefficiencies in learning both in terms of utilizing all of the information and in terms of the number of parameters needed to train. We introduce a novel theoretical framework that facilitates better learning in language modeling, and show that our framework leads to tying together the input embedding and the output projection matrices, greatly reducing the number of trainable variables. Our framework leads to state of the art performance on the Penn Treebank with a variety of network models.",/pdf/e230e2228ecf9bd54251ab8e25b453c110f7028e.pdf,ICLR,2017, +OLOr1K5zbDu,zAomNFRUf4C,1601310000000.0,1614990000000.0,2589,"Triple-Search: Differentiable Joint-Search of Networks, Precision, and Accelerators","[""~Yonggan_Fu1"", ""~Yongan_Zhang1"", ""~Haoran_You1"", ""~Yingyan_Lin1""]","[""Yonggan Fu"", ""Yongan Zhang"", ""Haoran You"", ""Yingyan Lin""]","[""neural architecture search"", ""network hardware co-design""]","The record-breaking performance and prohibitive complexity of deep neural networks (DNNs) have ignited a substantial need for customized DNN accelerators which have the potential to boost DNN acceleration efficiency by orders-of-magnitude. While it has been recognized that maximizing DNNs' acceleration efficiency requires a joint design/search for three different yet highly coupled aspects, including the networks, adopted precision, and their accelerators, the challenges associated with such a joint search have not yet been fully discussed and addressed. First, to jointly search for a network and its precision via differentiable search, there exists a dilemma of whether to explode the memory consumption or achieve sub-optimal designs. 
Second, a generic and differentiable joint search of the networks and their accelerators is non-trivial due to (1) the discrete nature of the accelerator space and (2) the difficulty of obtaining operation-wise hardware cost penalties because some accelerator parameters are determined by the whole network. To this end, we propose a Triple-Search (TRIPS) framework to address the aforementioned challenges towards jointly searching for the network structure, precision, and accelerator in a differentiable manner, to efficiently and effectively explore the huge joint search space. Our TRIPS addresses the first challenge above via a heterogeneous sampling strategy to achieve unbiased search with constant memory consumption, and tackles the latter one using a novel co-search pipeline that integrates a generic differentiable accelerator search engine. Extensive experiments and ablation studies validate that both TRIPS generated networks and accelerators consistently outperform state-of-the-art (SOTA) designs (including co-search/exploration techniques, hardware-aware NAS methods, and DNN accelerators), in terms of search time, task accuracy, and accelerator efficiency. All codes will be released upon acceptance.",/pdf/c9df07505e92200c2596abb6beac79c1ccae9ef3.pdf,ICLR,2021,"We propose the Triple-Search framework to jointly search network structure, precision and hardware architecture in a differentiable manner." +ryH_bShhW,rJNObHn2W,1507770000000.0,1518730000000.0,8,DOUBLY STOCHASTIC ADVERSARIAL AUTOENCODER,"[""mazarafrooz@cylance.com""]","[""Mahdi Azarafrooz""]","[""Generative adversarial Networks"", ""Deep Generative models"", ""Kernel Methods""]","Any autoencoder network can be turned into a generative model by imposing an arbitrary prior distribution on its hidden code vector. Variational Autoencoder uses a KL divergence penalty to impose the prior, whereas Adversarial Autoencoder uses generative adversarial networks. 
A straightforward modification of Adversarial Autoencoder can be achieved by replacing the adversarial network with maximum mean discrepancy (MMD) network. This replacement leads to a new set of probabilistic autoencoder which is also discussed in our paper. + +However, an essential challenge remains in both of these probabilistic autoencoders, namely that the only source of randomness at the output of encoder, is the training data itself. Lack of enough stochasticity can make the optimization problem non-trivial. As a result, they can lead to degenerate solutions where the generator collapses into sampling only a few modes. + +Our proposal is to replace the adversary of the adversarial autoencoder by a space of {\it stochastic} functions. This replacement introduces a new source of randomness which can be considered as a continuous control for encouraging {\it explorations}. This prevents the adversary from fitting too closely to the generator and therefore leads to more diverse set of generated samples. Consequently, the decoder serves as a better generative network which unlike MMD nets scales linearly with the amount of data. We provide mathematical and empirical evidence on how this replacement outperforms the pre-existing architectures. ",/pdf/13ec102384aceb61d2b5723b10040d38ce2d4952.pdf,ICLR,2018, +B1eXygBFPH,rklZjF1tvr,1569440000000.0,1577170000000.0,2056,Attacking Graph Convolutional Networks via Rewiring,"[""mayao4@msu.edu"", ""szw494@psu.edu"", ""derrtyle@msu.edu"", ""wuli@us.ibm.com"", ""tangjili@msu.edu""]","[""Yao Ma"", ""Suhang Wang"", ""Tyler Derr"", ""Lingfei Wu"", ""Jiliang Tang""]","[""Graph Neural Networks"", ""Rewiring"", ""Adversarial Attacks""]","Graph Neural Networks (GNNs) have boosted the performance of many graph related tasks such as node classification and graph classification. 
Recent research shows that graph neural networks are vulnerable to adversarial attacks, which deliberately add carefully created unnoticeable perturbation to the graph structure. The perturbation is usually created by adding/deleting a few edges, which might be noticeable even when the number of edges modified is small. In this paper, we propose a graph rewiring operation which affects the graph in a less noticeable way compared to adding/deleting edges. We then use reinforcement learning to learn the attack strategy based on the proposed rewiring operation. Experiments on real world graphs demonstrate the effectiveness of the proposed framework. To understand the proposed framework, we further analyze how its generated perturbation to the graph structure affects the output of the target model.",/pdf/2406427c3b09e14318f6a5dc2aeff3c47f26ffc2.pdf,ICLR,2020,Using rewiring operation to conduct adversarial attacks on graph structured data. +HJxwAo09KQ,H1ey6H0KY7,1538090000000.0,1545360000000.0,900,Learned optimizers that outperform on wall-clock and validation loss,"[""lmetz@google.com"", ""nirum@google.com"", ""jeremynixon@google.com"", ""cdfreeman@google.com"", ""jaschasd@google.com""]","[""Luke Metz"", ""Niru Maheswaranathan"", ""Jeremy Nixon"", ""Daniel Freeman"", ""Jascha Sohl-dickstein""]","[""Learned Optimizers"", ""Meta-Learning""]","Deep learning has shown that learned functions can dramatically outperform hand-designed functions on perceptual tasks. Analogously, this suggests that learned update functions may similarly outperform current hand-designed optimizers, especially for specific tasks. However, learned optimizers are notoriously difficult to train and have yet to demonstrate wall-clock speedups over hand-designed optimizers, and thus are rarely used in practice. Typically, learned optimizers are trained by truncated backpropagation through an unrolled optimization process. 
The resulting gradients are either strongly biased (for short truncations) or have exploding norm (for long truncations). In this work we propose a training scheme which overcomes both of these difficulties, by dynamically weighting two unbiased gradient estimators for a variational loss on optimizer performance. This allows us to train neural networks to perform optimization faster than well tuned first-order methods. Moreover, by training the optimizer against validation loss, as opposed to training loss, we are able to use it to train models which generalize better than those trained by first order methods. We demonstrate these results on problems where our learned optimizer trains convolutional networks in a fifth of the wall-clock time compared to tuned first-order methods, and with an improvement",/pdf/2c94269a1648cac4ba5ccf7f1c3b5c505a09e1be.pdf,ICLR,2019,"We analyze problems when training learned optimizers, address those problems via variational optimization using two complementary gradient estimators, and train optimizers that are 5x faster in wall-clock time than baseline optimizers (e.g. Adam)." +Byg_vREtvB,rJeNO_PdDr,1569440000000.0,1577170000000.0,1180,Generalized Bayesian Posterior Expectation Distillation for Deep Neural Networks,"[""mvadera@cs.umass.edu"", ""marlin@cs.umass.edu""]","[""Meet P. Vadera"", ""Benjamin M. Marlin""]","[""Bayesian Neural Networks"", ""Distillation""]","In this paper, we present a general framework for distilling expectations with respect to the Bayesian posterior distribution of a deep neural network, significantly extending prior work on a method known as ``Bayesian Dark Knowledge."" Our generalized framework applies to the case of classification models and takes as input the architecture of a ``teacher"" network, a general posterior expectation of interest, and the architecture of a ``student"" network. 
The distillation method performs an online compression of the selected posterior expectation using iteratively generated Monte Carlo samples from the parameter posterior of the teacher model. We further consider the problem of optimizing the student model architecture with respect to an accuracy-speed-storage trade-off. We present experimental results investigating multiple data sets, distillation targets, teacher model architectures, and approaches to searching for student model architectures. We establish the key result that distilling into a student model with an architecture that matches the teacher, as is done in Bayesian Dark Knowledge, can lead to sub-optimal performance. Lastly, we show that student architecture search methods can identify student models with significantly improved performance. ",/pdf/5db49d69762684595f766d570924fc6c30047939.pdf,ICLR,2020,A general framework for distilling Bayesian posterior expectations for deep neural networks. +SJeLIgBKPS,HklQm9ltwB,1569440000000.0,1583910000000.0,2324,Gradient Descent Maximizes the Margin of Homogeneous Neural Networks,"[""vfleaking@gmail.com"", ""lijian83@mail.tsinghua.edu.cn""]","[""Kaifeng Lyu"", ""Jian Li""]","[""margin"", ""homogeneous"", ""gradient descent""]","In this paper, we study the implicit regularization of the gradient descent algorithm in homogeneous neural networks, including fully-connected and convolutional neural networks with ReLU or LeakyReLU activations. In particular, we study the gradient descent or gradient flow (i.e., gradient descent with infinitesimal step size) optimizing the logistic loss or cross-entropy loss of any homogeneous model (possibly non-smooth), and show that if the training loss decreases below a certain threshold, then we can define a smoothed version of the normalized margin which increases over time. 
We also formulate a natural constrained optimization problem related to margin maximization, and prove that both the normalized margin and its smoothed version converge to the objective value at a KKT point of the optimization problem. Our results generalize the previous results for logistic regression with one-layer or multi-layer linear networks, and provide more quantitative convergence results with weaker assumptions than previous results for homogeneous smooth neural networks. We conduct several experiments to justify our theoretical finding on MNIST and CIFAR-10 datasets. Finally, as margin is closely related to robustness, we discuss potential benefits of training longer for improving the robustness of the model.",/pdf/1961d0a01f9b41951a88793dcc0e818a80d108fe.pdf,ICLR,2020,We study the implicit bias of gradient descent and prove under a minimal set of assumptions that the parameter direction of homogeneous models converges to KKT points of a natural margin maximization problem. +r1DPFCyA-,HJ8wYRyRb,1509050000000.0,1518730000000.0,193,Discriminative k-shot learning using probabilistic models,"[""msb55@cam.ac.uk"", ""mrojascarulla@gmail.com"", ""kuba.swiatkowski@gmail.com"", ""bs@tuebingen.mpg.de"", ""ret26@cam.ac.uk""]","[""Matthias Bauer"", ""Mateo Rojas-Carulla"", ""Jakub Bart\u0142omiej \u015awi\u0105tkowski"", ""Bernhard Sch\u00f6lkopf"", ""Richard E. Turner""]","[""discriminative k-shot learning"", ""probabilistic inference""]","This paper introduces a probabilistic framework for k-shot image classification. The goal is to generalise from an initial large-scale classification task to a separate task comprising new classes and small numbers of examples. The new approach not only leverages the feature-based representation learned by a neural network from the initial task (representational transfer), but also information about the classes (concept transfer). 
The concept information is encapsulated in a probabilistic model for the final layer weights of the neural network which acts as a prior for probabilistic k-shot learning. We show that even a simple probabilistic model achieves state-of-the-art on a standard k-shot learning dataset by a large margin. Moreover, it is able to accurately model uncertainty, leading to well calibrated classifiers, and is easily extensible and flexible, unlike many recent approaches to k-shot learning.",/pdf/c98e564bd99e7a5e1ca2dec57f97dfd703ee986f.pdf,ICLR,2018,This paper introduces a probabilistic framework for k-shot image classification that achieves state-of-the-art results +B1gcblSKwB,SJetkZgYwB,1569440000000.0,1577170000000.0,2147,Meta-Learning with Network Pruning for Overfitting Reduction,"[""hongduan_tian@nuist.edu.cn"", ""kfliubo@gmail.com"", ""xtyuan1980@gmail.com"", ""qsliu@nuist.edu.cn""]","[""Hongduan Tian"", ""Bo Liu"", ""Xiao-Tong Yuan"", ""Qingshan Liu""]","[""Meta-Learning"", ""Few-shot Learning"", ""Network Pruning"", ""Generalization Analysis""]","Meta-Learning has achieved great success in few-shot learning. However, the existing meta-learning models have been evidenced to overfit on meta-training tasks when using deeper and wider convolutional neural networks. This means that we cannot improve the meta-generalization performance by merely deepening or widening the networks. To remedy such a deficiency of meta-overfitting, we propose in this paper a sparsity constrained meta-learning approach to learn from meta-training tasks a subnetwork from which first-order optimization methods can quickly converge towards the optimal network in meta-testing tasks. Our theoretical analysis shows the benefit of sparsity for improving the generalization gap of the learned meta-initialization network. 
We have implemented our approach on top of the widely applied Reptile algorithm assembled with varying network pruning routines including Dense-Sparse-Dense (DSD) and Iterative Hard Thresholding (IHT). Extensive experimental results on benchmark datasets with different over-parameterized deep networks demonstrate that our method can not only effectively ease meta-overfitting but also in many cases improve the meta-generalization performance when applied to few-shot classification tasks.",/pdf/f19d0a7d01d6ace60179c4091baaa36ea977e52a.pdf,ICLR,2020, +pavee2r1N01,wZ4LXY_njHO,1601310000000.0,1614990000000.0,2735,Provable Robustness by Geometric Regularization of ReLU Networks,"[""~Chester_Holtz1"", ""~Changhao_Shi1"", ""~Gal_Mishne1""]","[""Chester Holtz"", ""Changhao Shi"", ""Gal Mishne""]","[""deep learning"", ""adversarial attack"", ""robust certification""]","Recent work has demonstrated that neural networks are vulnerable to small, adversarial perturbations of their input. In this paper, we propose an efficient regularization scheme inspired by convex geometry and barrier methods to improve the robustness of feedforward ReLU networks. Since such networks are piecewise linear, they partition the input space into polyhedral regions (polytopes). Our regularizer is designed to minimize the distance between training samples and the \textit{analytical centers} of their respective polytopes so as to push points away from the boundaries. Our regularizer \textit{provably} improves a lower bound on the necessary adversarial perturbation required to switch an example's label. The addition of a second regularizer that encourages linear decision boundaries improves robustness while avoiding over-regularization of the classifier. We demonstrate the robustness of our approach with respect to $\ell_\infty$ and $\ell_2$ adversarial perturbations on multiple datasets. Our method is competitive with state-of-the-art algorithms for learning robust networks. 
Moreover, applying our algorithm in conjunction with adversarial training boosts the robustness of classifiers even further. +",/pdf/fb893aadc8b77eca051f246cb19df47143e13fdb.pdf,ICLR,2021,We propose a novel geometric regularization term which provably improves the robustness of neural networks. +9y4qOAIfA9r,exHMluvle1,1601310000000.0,1614990000000.0,3548,Does injecting linguistic structure into language models lead to better alignment with brain recordings?,"[""~Mostafa_Abdou2"", ""ana@di.ku.dk"", ""~Mariya_K_Toneva1"", ""~Daniel_Hershcovich1"", ""~Anders_S\u00f8gaard1""]","[""Mostafa Abdou"", ""Ana Valeria Gonz\u00e1lez"", ""Mariya K Toneva"", ""Daniel Hershcovich"", ""Anders S\u00f8gaard""]","[""neurolinguistics"", ""natural language processing"", ""computational neuroscience""]","Neuroscientists evaluate deep neural networks for natural language processing as possible candidate models for how language is processed in the brain. These models are often trained without explicit linguistic supervision, but have been shown to learn some linguistic structure in the absence of such supervision (Manning et. al, 2020), potentially questioning the relevance of symbolic linguistic theories in modeling such cognitive processes (Warstadt & Bowman, 2020). We evaluate across two fMRI datasets whether language models align better with brain recordings, if their attention is biased by annotations from syntactic or semantic formalisms. Using structure from dependency or minimal recursion semantic annotations, we find alignments improve significantly for one of the datasets. For another dataset, we see more mixed results. We present an extensive analysis of these results. Our proposed approach enables the evaluation of more targeted hypotheses about the composition of meaning in the brain, expanding the range of possible scientific inferences a neuroscientist could make, and opens up new opportunities for cross-pollination between computational neuroscience and linguistics. 
+ +",/pdf/67135e0f56815707b5d27d32dd3732242f79141e.pdf,ICLR,2021, +HJgODj05KX,Hkg72m_5KX,1538090000000.0,1545360000000.0,279,A preconditioned accelerated stochastic gradient descent algorithm,"[""alexandru.onose@asml.com"", ""iman.mossavat@asml.com"", ""henk-jan.smilde@asml.com""]","[""Alexandru Onose"", ""Seyed Iman Mossavat"", ""Henk-Jan H. Smilde""]","[""stochastic optimization"", ""neural network"", ""preconditioned accelerated stochastic gradient descent""]","We propose a preconditioned accelerated stochastic gradient method suitable for large scale optimization. We derive sufficient convergence conditions for the minimization of convex functions using a generic class of diagonal preconditioners and provide a formal convergence proof based on a framework originally used for on-line learning. Inspired by recent popular adaptive per-feature algorithms, we propose a specific preconditioner based on the second moment of the gradient. The sufficient convergence conditions motivate a critical adaptation of the per-feature updates in order to ensure convergence. We show empirical results for the minimization of convex and non-convex cost functions, in the context of neural network training. The method compares favorably with respect to current, first order, stochastic optimization methods.",/pdf/d88c2736e0868ec8f2874192f75bfac34872df1e.pdf,ICLR,2019,"We propose a preconditioned accelerated gradient method that combines Nesterov’s accelerated gradient descent with a class of diagonal preconditioners, in a stochastic setting." 
+9wHe4F-lpp,v9HDxptXNB,1601310000000.0,1614990000000.0,865,FTBNN: Rethinking Non-linearity for 1-bit CNNs and Going Beyond,"[""~Zhuo_Su2"", ""~Linpu_Fang1"", ""~Deke_Guo1"", ""~Dewen_Hu2"", ""~Matti_Pietik\u00e4inen2"", ""~Li_Liu9""]","[""Zhuo Su"", ""Linpu Fang"", ""Deke Guo"", ""Dewen Hu"", ""Matti Pietik\u00e4inen"", ""Li Liu""]","[""Binary neural networks"", ""network quantization"", ""network compression""]","Binary neural networks (BNNs), where both weights and activations are binarized into 1 bit, have been widely studied in recent years due to its great benefit of highly accelerated computation and substantially reduced memory footprint that appeal to the development of resource constrained devices. In contrast to previous methods tending to reduce the quantization error for training BNN structures, we argue that the binarized convolution process owns an increasing linearity towards the target of minimizing such error, which in turn hampers BNN's discriminative ability. In this paper, we re-investigate and tune proper non-linear modules to fix that contradiction, leading to a strong baseline which achieves state-of-the-art performance on the large-scale ImageNet dataset in terms of accuracy and training efficiency. To go further, we find that the proposed BNN model still has much potential to be compressed by making a better use of the efficient binary operations, without losing accuracy. In addition, the limited capacity of the BNN model can also be increased with the help of group execution. Based on these insights, we are able to improve the baseline with an additional 4$\sim$5% top-1 accuracy gain even with less computational cost. Our code and all trained models will be made public.",/pdf/4c1b54e4d6eed5b47d0a15a8fbb7a3867a245b59.pdf,ICLR,2021,"We proposed a highly efficient and effective binary neural network by introducing proper non-linearities, and further enhanced that with binary purification and group execution." 
+Xb8xvrtB8Ce,odr4UCmVkVr,1601310000000.0,1616590000000.0,160,Bag of Tricks for Adversarial Training,"[""~Tianyu_Pang1"", ""~Xiao_Yang4"", ""~Yinpeng_Dong2"", ""~Hang_Su3"", ""~Jun_Zhu2""]","[""Tianyu Pang"", ""Xiao Yang"", ""Yinpeng Dong"", ""Hang Su"", ""Jun Zhu""]","[""Adversarial Training"", ""Robustness"", ""Adversarial Examples""]","Adversarial training (AT) is one of the most effective strategies for promoting model robustness. However, recent benchmarks show that most of the proposed improvements on AT are less effective than simply early stopping the training procedure. This counter-intuitive fact motivates us to investigate the implementation details of tens of AT methods. Surprisingly, we find that the basic settings (e.g., weight decay, training schedule, etc.) used in these methods are highly inconsistent. In this work, we provide comprehensive evaluations on CIFAR-10, focusing on the effects of mostly overlooked training tricks and hyperparameters for adversarially trained models. Our empirical observations suggest that adversarial robustness is much more sensitive to some basic training settings than we thought. For example, a slightly different value of weight decay can reduce the model robust accuracy by more than $7\%$, which is probable to override the potential promotion induced by the proposed methods. We conclude a baseline training setting and re-implement previous defenses to achieve new state-of-the-art results. 
These facts also appeal to more concerns on the overlooked confounders when benchmarking defenses.",/pdf/cd34c578852869a39b386b91bdb51a8e91255111.pdf,ICLR,2021,Empirical evaluation of basic training tricks used in adversarial training +_O9YLet0wvN,AQ-mGBa4g3E,1601310000000.0,1614990000000.0,1524,Closing the Generalization Gap in One-Shot Object Detection,"[""~Claudio_Michaelis1"", ""~Matthias_Bethge1"", ""~Alexander_S_Ecker1""]","[""Claudio Michaelis"", ""Matthias Bethge"", ""Alexander S Ecker""]","[""One-Shot Learning"", ""Few-Shot Learning"", ""Object Detection"", ""One-Shot Object Detection"", ""Generalization""]","Despite substantial progress in object detection and few-shot learning, detecting objects based on a single example - one-shot object detection - remains a challenge. A central problem is the generalization gap: Object categories used during training are detected much more reliably than novel ones. We here show that this generalization gap can be nearly closed by increasing the number of object categories used during training. Doing so allows us to beat the state-of-the-art on COCO by 5.4 %AP50 (from 22.0 to 27.5) and improve generalization from seen to unseen classes from 45% to 89%. We verify that the effect is caused by the number of categories and not the amount of data and that it holds for different models, backbones and datasets. This result suggests that the key to strong few-shot detection models may not lie in sophisticated metric learning approaches, but instead simply in scaling the number of categories. We hope that our findings will help to better understand the challenges of few-shot learning and encourage future data annotation efforts to focus on wider datasets with a broader set of categories rather than gathering more samples per category.",/pdf/61e090e70fb052b86e7d25afd2126a4042cb94a1.pdf,ICLR,2021,The generalization gap in one-shot object detection can be closed using datasets with sufficient categories. 
+mLcmdlEUxy-,2ldwwa8JaML,1601310000000.0,1616020000000.0,2010,Recurrent Independent Mechanisms,"[""~Anirudh_Goyal1"", ""~Alex_Lamb1"", ""~Jordan_Hoffmann1"", ""~Shagun_Sodhani1"", ""~Sergey_Levine1"", ""~Yoshua_Bengio1"", ""~Bernhard_Sch\u00f6lkopf1""]","[""Anirudh Goyal"", ""Alex Lamb"", ""Jordan Hoffmann"", ""Shagun Sodhani"", ""Sergey Levine"", ""Yoshua Bengio"", ""Bernhard Sch\u00f6lkopf""]","[""modular representations"", ""better generalization"", ""learning mechanisms""]","We explore the hypothesis that learning modular structures which reflect the dynamics of the environment can lead to better generalization and robustness to changes that only affect a few of the underlying causes. We propose Recurrent Independent Mechanisms (RIMs), a new recurrent architecture in which multiple groups of recurrent cells operate with nearly independent transition dynamics, communicate only sparingly through the bottleneck of attention, and compete with each other so they are updated only at time steps where they are most relevant. We show that this leads to specialization amongst the RIMs, which in turn allows for remarkably improved generalization on tasks where some factors of variation differ systematically between training and evaluation. +",/pdf/a58297a8b9fc47b94fee061cf119f991c6bb2a87.pdf,ICLR,2021,"Learning recurrent mechanisms which operate independently, and sparingly interact can lead to better generalization to out of distribution samples." +7qmQNB6Wn_B,3L6LM2q86hn,1601310000000.0,1614990000000.0,1185,Diversity Actor-Critic: Sample-Aware Entropy Regularization for Sample-Efficient Exploration,"[""~Seungyul_Han1"", ""~Youngchul_Sung1""]","[""Seungyul Han"", ""Youngchul Sung""]","[""Reinforcement Learning"", ""Entropy Regularization"", ""Exploration""]","Policy entropy regularization is commonly used for better exploration in deep reinforcement learning (RL). 
However, policy entropy regularization is sample-inefficient in off-policy learning since it does not take the distribution of previous samples stored in the replay buffer into account. In order to take advantage of the previous sample distribution from the replay buffer for sample-efficient exploration, we propose sample-aware entropy regularization which maximizes the entropy of weighted sum of the policy action distribution and the sample action distribution from the replay buffer. We formulate the problem of sample-aware entropy regularized policy iteration, prove its convergence, and provide a practical algorithm named diversity actor-critic (DAC) which is a generalization of soft actor-critic (SAC). Numerical results show that DAC significantly outperforms SAC baselines and other state-of-the-art RL algorithms.",/pdf/a245aeface9ac9d4ee00e349c3c95fce09a3a5e6.pdf,ICLR,2021,The paper introduces sample-aware entropy regularization for sample-efficient exploration and the corresponding diversity actor-critic algorithm that generalizes SAC. +CLYe1Yke1r,0-44lpbNTGV,1601310000000.0,1614990000000.0,3684,Box-To-Box Transformation for Modeling Joint Hierarchies,"[""~Shib_Sankar_Dasgupta2"", ""~Xiang_Li2"", ""~Michael_Boratko1"", ""~Dongxu_Zhang1"", ""~Andrew_McCallum1""]","[""Shib Sankar Dasgupta"", ""Xiang Li"", ""Michael Boratko"", ""Dongxu Zhang"", ""Andrew McCallum""]","[""Box embeddings"", ""Representation Learning"", ""Joint Hierarchy"", ""transitive relations"", ""knowledge graph embedding"", ""relational learning.""]","Learning representations of entities and relations in knowledge graphs is an active area of research, with much emphasis placed on choosing the appropriate geometry to capture tree-like structures. Box embeddings (Vilnis et al., 2018; Li et al., 2019; Dasgupta et al., 2020), which represent concepts as n-dimensional hyperrectangles, are capable of embedding trees by training on a subset of the transitive closure. In Patel et al. 
(2020), the authors demonstrate that only the transitive reduction is required, and further extend box embeddings to capture joint hierarchies by augmenting the graph with new nodes. While it is possible to represent joint hierarchies with this method, the parameters for each hierarchy are decoupled, making generalization between hierarchies infeasible. In this work, we introduce a learned box-to-box transformation which respects the geometric structure of the box embeddings. We demonstrate that this not only improves the capability of modeling cross-hierarchy compositional edges but is also capable of generalizing from a subset of the transitive reduction.",/pdf/92b7e0af66c8fa274e5a5b5f90927659df57ab51.pdf,ICLR,2021,Learning transformation on box embedding space to generalize over multiple hierarchies +SJeF_h4FwB,SJxE720ILB,1569440000000.0,1577170000000.0,49,Label Cleaning with Likelihood Ratio Test,"[""zheng.songzhu@stonybrook.edu"", ""pxiangwu@gmail.com"", ""ag77in@gmail.com"", ""mayank.isi@gmail.com"", ""dnm@cs.rutgers.edu"", ""chao.chen.1@stonybrook.edu""]","[""Songzhu Zheng"", ""Pengxiang Wu"", ""Aman Goswami"", ""Mayank Goswami"", ""Dimitris Metaxas"", ""Chao Chen""]","[""Deep Learning""]","To collect large scale annotated data, it is inevitable to introduce label noise, i.e., incorrect class labels. A major challenge is to develop robust deep learning models that achieve high test performance despite training set label noise. We introduce a novel approach that directly cleans labels in order to train a high quality model. Our method leverages statistical principles to correct data labels and has a theoretical guarantee of the correctness. In particular, we use a likelihood ratio test(LRT) to flip the labels of training data. We prove that our LRT label correction algorithm is guaranteed to flip the label so it is consistent with the true Bayesian optimal decision rule with high probability. 
We incorporate our label correction algorithm into the training of deep neural networks and train models that achieve superior testing performance on multiple public datasets.",/pdf/47d9b11c13e7fb8122bd70be4364e5ab81d38449.pdf,ICLR,2020,Use likelihood ratio test to perform label correction +r1lFIiR9tQ,B1xPo34qtQ,1538090000000.0,1545360000000.0,195,Training generative latent models by variational f-divergence minimization,"[""mingtian.zhang.17@ucl.ac.uk"", ""thomas.bird@cs.ucl.ac.uk"", ""raza.habib@cs.ucl.ac.uk"", ""t.xu12@lse.ac.uk"", ""david.barber@ucl.ac.uk""]","[""Mingtian Zhang"", ""Thomas Bird"", ""Raza Habib"", ""Tianlin Xu"", ""David Barber""]","[""variational inference"", ""generative model"", ""f divergence""]","Probabilistic models are often trained by maximum likelihood, which corresponds to minimizing a specific form of f-divergence between the model and data distribution. We derive an upper bound that holds for all f-divergences, showing the intuitive result that the divergence between two joint distributions is at least as great as the divergence between their corresponding marginals. Additionally, the f-divergence is not formally defined when two distributions have different supports. We thus propose a noisy version of f-divergence which is well defined in such situations. We demonstrate how the bound and the new version of f-divergence can be readily used to train complex probabilistic generative models of data and that the fitted model can depend significantly on the particular divergence used.",/pdf/9cc42d11aa2483056869d06180949c7feab06c0d.pdf,ICLR,2019,Training generative models using an upper bound of the f divergence. 
+SylkYeHtwr,H1lhvplKDB,1569440000000.0,1583910000000.0,2419,SUMO: Unbiased Estimation of Log Marginal Probability for Latent Variable Models,"[""luoyc15@mails.tsinghua.edu.cn"", ""abeatson@cs.princeton.edu"", ""mnorouzi@google.com"", ""dcszj@mail.tsinghua.edu.cn"", ""duvenaud@cs.toronto.edu"", ""rpa@princeton.edu"", ""rtqichen@cs.toronto.edu""]","[""Yucen Luo"", ""Alex Beatson"", ""Mohammad Norouzi"", ""Jun Zhu"", ""David Duvenaud"", ""Ryan P. Adams"", ""Ricky T. Q. Chen""]",[],"Standard variational lower bounds used to train latent variable models produce biased estimates of most quantities of interest. We introduce an unbiased estimator of the log marginal likelihood and its gradients for latent variable models based on randomized truncation of infinite series. If parameterized by an encoder-decoder architecture, the parameters of the encoder can be optimized to minimize its variance of this estimator. We show that models trained using our estimator give better test-set likelihoods than a standard importance-sampling based approach for the same average computational cost. This estimator also allows use of latent variable models for tasks where unbiased estimators, rather than marginal likelihood lower bounds, are preferred, such as minimizing reverse KL divergences and estimating score functions.",/pdf/c9d5463e8efbe7421dd2b2c72ef7c7b504cb1f96.pdf,ICLR,2020,"We create an unbiased estimator for the log probability of latent variable models, extending such models to a larger scope of applications." 
+SklR6aEtwH,BJxl-EWuvr,1569440000000.0,1577170000000.0,837,Neural Architecture Search by Learning Action Space for Monte Carlo Tree Search,"[""linnan_wang@brown.edu"", ""s9xie@fb.com"", ""yuandong@fb.com"", ""tengli@fb.com""]","[""Linnan Wang"", ""Saining Xie"", ""Teng Li"", ""Rodrigo Fonseca"", ""Yuandong Tian""]","[""MCTS"", ""Neural Architecture Search"", ""Search""]","Neural Architecture Search (NAS) has emerged as a promising technique for automatic neural network design. However, existing NAS approaches often utilize manually designed action space, which is not directly related to the performance metric to be optimized (e.g., accuracy). As a result, using manually designed action space to perform NAS often leads to sample-inefficient explorations of architectures and thus can be sub-optimal. In order to improve sample efficiency, this paper proposes Latent Action Neural Architecture Search (LaNAS) that learns actions to recursively partition the search space into good or bad regions that contain networks with concentrated performance metrics, i.e., low variance. During the search phase, as different architecture search action sequences lead to regions of different performance, the search efficiency can be significantly improved by biasing towards the good regions. On the largest NAS dataset NASBench-101, our experimental results demonstrated that LaNAS is 22x, 14.6x, 12.4x, 6.8x, 16.5x more sample-efficient than Random Search, Regularized Evolution, Monte Carlo Tree Search, Neural Architecture Optimization, and Bayesian Optimization, respectively. 
When applied to the open domain, LaNAS achieves 98.0% accuracy on CIFAR-10 and 75.0% top1 accuracy on ImageNet in only 803 samples, outperforming SOTA AmoebaNet with 33x fewer samples.",/pdf/113f1d5b21c911c5208e1b6a4e6533c1a84c593d.pdf,ICLR,2020,A new model that learns latent actions for MCTS to the application of neural architecture search +S1eBzhRqK7,ByxjA-0ctm,1538090000000.0,1545360000000.0,1261,Evolutionary-Neural Hybrid Agents for Architecture Search,"[""kmaziarz@google.com"", ""akhorlin@google.com"", ""underflow@google.com"", ""agesmundo@google.com""]","[""Krzysztof Maziarz"", ""Andrey Khorlin"", ""Quentin de Laroussilhe"", ""Andrea Gesmundo""]","[""Evolutionary"", ""Architecture Search"", ""NAS""]","Neural Architecture Search has recently shown potential to automate the design of Neural Networks. The use of Neural Network agents trained with Reinforcement Learning can offer the possibility to learn complex patterns, as well as the ability to explore a vast and compositional search space. On the other hand, evolutionary algorithms offer the greediness and sample efficiency needed for such an application, as each sample requires a considerable amount of resources. We propose a class of Evolutionary-Neural hybrid agents (Evo-NAS), that retain the best qualities of the two approaches. We show that the Evo-NAS agent can outperform both Neural and Evolutionary agents, both on a synthetic task, and on architecture search for a suite of text classification datasets.",/pdf/9974bcace4c723c6b05b6bda8c6fbdcfc0a0ab61.pdf,ICLR,2019,"We propose a class of Evolutionary-Neural hybrid agents, that retain the best qualities of the two approaches." 
+HJgLLyrYwB,ByeSgvTdPS,1569440000000.0,1583910000000.0,1731,State-only Imitation with Transition Dynamics Mismatch,"[""gangwan2@illinois.edu"", ""jianpeng@illinois.edu""]","[""Tanmay Gangwani"", ""Jian Peng""]","[""Imitation learning"", ""Reinforcement Learning"", ""Inverse Reinforcement Learning""]","Imitation Learning (IL) is a popular paradigm for training agents to achieve complicated goals by leveraging expert behavior, rather than dealing with the hardships of designing a correct reward function. With the environment modeled as a Markov Decision Process (MDP), most of the existing IL algorithms are contingent on the availability of expert demonstrations in the same MDP as the one in which a new imitator policy is to be learned. This is uncharacteristic of many real-life scenarios where discrepancies between the expert and the imitator MDPs are common, especially in the transition dynamics function. Furthermore, obtaining expert actions may be costly or infeasible, making the recent trend towards state-only IL (where expert demonstrations constitute only states or observations) ever so promising. Building on recent adversarial imitation approaches that are motivated by the idea of divergence minimization, we present a new state-only IL algorithm in this paper. It divides the overall optimization objective into two subproblems by introducing an indirection step and solves the subproblems iteratively. We show that our algorithm is particularly effective when there is a transition dynamics mismatch between the expert and imitator MDPs, while the baseline IL methods suffer from performance degradation. 
To analyze this, we construct several interesting MDPs by modifying the configuration parameters for the MuJoCo locomotion tasks from OpenAI Gym.",/pdf/d0b1a39c252097c699c922036af72b5546249ea5.pdf,ICLR,2020,Algorithm for imitation with state-only expert demonstrations; builds on adversarial-IRL; experiments with transition dynamics mismatch b/w expert and imitator +rJQDjk-0b,SJGDiyWC-,1509120000000.0,1519330000000.0,514,Unbiased Online Recurrent Optimization,"[""corentin.tallec@polytechnique.edu"", ""yann@yann-ollivier.org""]","[""Corentin Tallec"", ""Yann Ollivier""]","[""RNN""]","The novel \emph{Unbiased Online Recurrent Optimization} (UORO) algorithm allows for online learning of general recurrent computational graphs such as recurrent network models. It works in a streaming fashion and avoids backtracking through past activations and inputs. UORO is computationally as costly as \emph{Truncated Backpropagation Through Time} (truncated BPTT), a widespread algorithm for online learning of recurrent networks \cite{jaeger2002tutorial}. UORO is a modification of \emph{NoBackTrack} \cite{DBLP:journals/corr/OllivierC15} that bypasses the need for model sparsity and makes implementation easy in current deep learning frameworks, even for complex models. Like NoBackTrack, UORO provides unbiased gradient estimates; unbiasedness is the core hypothesis in stochastic gradient descent theory, without which convergence to a local optimum is not guaranteed. On the contrary, truncated BPTT does not provide this property, leading to possible divergence. On synthetic tasks where truncated BPTT is shown to diverge, UORO converges. For instance, when a parameter has a positive short-term but negative long-term influence, truncated BPTT diverges unless the truncation span is very significantly longer than the intrinsic temporal range of the interactions, while UORO performs well thanks to the unbiasedness of its gradients. 
+",/pdf/f62e3b51de771fad3829cd4234c48da4a1764b04.pdf,ICLR,2018,"Introduces an online, unbiased and easily implementable gradient estimate for recurrent models." +arNvQ7QRyVb,dleMZibaQj,1601310000000.0,1614990000000.0,2330,Sharing Less is More: Lifelong Learning in Deep Networks with Selective Layer Transfer,"[""~Seungwon_Lee2"", ""~Sima_Behpour1"", ""~ERIC_EATON1""]","[""Seungwon Lee"", ""Sima Behpour"", ""ERIC EATON""]","[""lifelong learning"", ""continual learning"", ""architecture search""]","Effective lifelong learning across diverse tasks requires diverse knowledge, yet transferring irrelevant knowledge may lead to interference and catastrophic forgetting. In deep networks, transferring the appropriate granularity of knowledge is as important as the transfer mechanism, and must be driven by the relationships among tasks. We first show that the lifelong learning performance of several current deep learning architectures can be significantly improved by transfer at the appropriate layers. We then develop an expectation-maximization (EM) method to automatically select the appropriate transfer configuration and optimize the task network weights. This EM-based selective transfer is highly effective, as demonstrated on three algorithms in several lifelong object classification scenarios.",/pdf/eaa1a6ccd33370c32ef4b8c4f8937254cf77d98f.pdf,ICLR,2021,"Starting from the observation that performance of a lifelong learning architecture is significantly improved by transferring at appropriate layers, EM-based algorithm for selective transfer between tasks is proposed and evaluated in this paper." 
+HJvvRoe0W,SJUD0ogAb,1509110000000.0,1519310000000.0,377,An image representation based convolutional network for DNA classification,"[""yinbojian93@gmail.com"", ""m.balvert@cwi.nl"", ""d.zambrano@cwi.nl"", ""a.schoenhuth@cwi.nl"", ""s.m.bohte@cwi.nl""]","[""Bojian Yin"", ""Marleen Balvert"", ""Davide Zambrano"", ""Alexander Schoenhuth"", ""Sander Bohte""]","[""DNA sequences"", ""Hilbert curves"", ""Convolutional neural networks"", ""chromatin structure""]","The folding structure of the DNA molecule combined with helper molecules, also referred to as the chromatin, is highly relevant for the functional properties of DNA. The chromatin structure is largely determined by the underlying primary DNA sequence, though the interaction is not yet fully understood. In this paper we develop a convolutional neural network that takes an image-representation of primary DNA sequence as its input, and predicts key determinants of chromatin structure. The method is developed such that it is capable of detecting interactions between distal elements in the DNA sequence, which are known to be highly relevant. Our experiments show that the method outperforms several existing methods both in terms of prediction accuracy and training time.",/pdf/c7aa95612b12a9155bc1b99ef53d0302d4d5cce4.pdf,ICLR,2018,A method to transform DNA sequences into 2D images using space-filling Hilbert Curves to enhance the strengths of CNNs +Hki-ZlbA-,Bym-Zxb0Z,1509130000000.0,1518730000000.0,554,Ground-Truth Adversarial Examples,"[""nicholas@carlini.com"", ""katz911@gmail.com"", ""barrett@cs.stanford.edu"", ""dill@cs.stanford.edu""]","[""Nicholas Carlini"", ""Guy Katz"", ""Clark Barrett"", ""David L. Dill""]","[""adversarial examples"", ""neural networks"", ""formal verification"", ""ground truths""]","The ability to deploy neural networks in real-world, safety-critical systems is severely limited by the presence of adversarial examples: slightly perturbed inputs that are misclassified by the network. 
In recent years, several techniques have been proposed for training networks that are robust to such examples; and each time stronger attacks have been devised, demonstrating the shortcomings of existing defenses. This highlights a key difficulty in designing an effective defense: the inability to assess a network's robustness against future attacks. We propose to address this difficulty through formal verification techniques. We construct ground truths: adversarial examples with a provably-minimal distance from a given input point. We demonstrate how ground truths can serve to assess the effectiveness of attack techniques, by comparing the adversarial examples produced by those attacks to the ground truths; and also of defense techniques, by computing the distance to the ground truths before and after the defense is applied, and measuring the improvement. We use this technique to assess recently suggested attack and defense techniques. +",/pdf/19527fda347a2eeeba62f04101f97200d2be43dc.pdf,ICLR,2018,We use formal verification to assess the effectiveness of techniques for finding adversarial examples or for defending against adversarial examples. +ByeL1R4FvS,Hkxnb_MdDH,1569440000000.0,1577170000000.0,894,Unsupervised Data Augmentation for Consistency Training,"[""qizhex@cs.cmu.edu"", ""dzihang@cs.cmu.edu"", ""hovy@cs.cmu.edu"", ""thangluong@google.com"", ""qvl@google.com""]","[""Qizhe Xie"", ""Zihang Dai"", ""Eduard Hovy"", ""Minh-Thang Luong"", ""Quoc V. Le""]","[""Semi-supervised learning"", ""computer vision"", ""natural language processing""]","Semi-supervised learning lately has shown much promise in improving deep learning models when labeled data is scarce. Common among recent approaches is the use of consistency training on a large amount of unlabeled data to constrain model predictions to be invariant to input noise. 
In this work, we present a new perspective on how to effectively noise unlabeled examples and argue that the quality of noising, specifically those produced by advanced data augmentation methods, plays a crucial role in semi-supervised learning. By substituting simple noising operations with advanced data augmentation methods, our method brings substantial improvements across six language and three vision tasks under the same consistency training framework. On the IMDb text classification dataset, with only 20 labeled examples, our method achieves an error rate of 4.20, outperforming the state-of-the-art model trained on 25,000 labeled examples. On a standard semi-supervised learning benchmark, CIFAR-10, our method outperforms all previous approaches and achieves an error rate of 2.7% with only 4,000 examples, nearly matching the performance of models trained on 50,000 labeled examples. Our method also combines well with transfer learning, e.g., when finetuning from BERT, and yields improvements in high-data regime, such as ImageNet, whether when there is only 10% labeled data or when a full labeled set with 1.3M extra unlabeled examples is used.",/pdf/d8d552f970cb61b05f5c1d2d5f07ba24cba89a7d.pdf,ICLR,2020,A semi-supervised learning method that enforces a model's prediction to be robust to advanced data augmentations. +rJXTf9Bxg,,1477970000000.0,1477970000000.0,27,Conditional Image Synthesis With Auxiliary Classifier GANs,"[""augustusodena@google.com"", ""colah@google.com"", ""shlens@google.com""]","[""Augustus Odena"", ""Christopher Olah"", ""Jonathon Shlens""]","[""Deep learning""]","Synthesizing high resolution photorealistic images has been a long-standing challenge in machine learning. In this paper we introduce new methods for the improved training of generative adversarial networks (GANs) for image synthesis. We construct a variant of GANs employing label conditioning that results in 128x128 resolution image samples exhibiting global coherence. 
We expand on previous work for image quality assessment to provide two new analyses for assessing the discriminability and diversity of samples from class-conditional image synthesis models. These analyses demonstrate that high resolution samples provide class information not present in low resolution samples. Across 1000 ImageNet classes, 128x128 samples are more than twice as discriminable as artificially resized 32x32 samples. In addition, 84.7% of the classes have samples exhibiting diversity comparable to real ImageNet data.",https://arxiv.org/pdf/1610.09585v1.pdf,ICLR,2017,We introduce a special GAN architecture that results in high quality 128x128 ImageNet samples; we introduce 2 new quantitative metrics of sample quality. +HylvleBtPB,ByeHIp1tDr,1569440000000.0,1577170000000.0,2103,Language-independent Cross-lingual Contextual Representations,"[""xzhang19@mails.tsinghua.edu.cn"", ""wangsong16@mails.tsinghua.edu.cn"", ""dou@cs.uoregon.edu"", ""xeliu@mail.tsinghua.edu.cn"", ""thien@cs.uoregon.edu"", ""wuji_ee@mail.tsinghua.edu.cn""]","[""Xiao Zhang"", ""Song Wang"", ""Dejing Dou"", ""Xien Liu"", ""Thien Huu Nguyen"", ""Ji Wu""]","[""contextual representation"", ""cross-lingual"", ""transfer learning""]","Contextual representation models like BERT have achieved state-of-the-art performance on a diverse range of NLP tasks. We propose a cross-lingual contextual representation model that generates language-independent contextual representations. This helps to enable zero-shot cross-lingual transfer of a wide range of NLP models, on top of contextual representation models like BERT. We provide a formulation of language-independent cross-lingual contextual representation based on mono-lingual representations. Our formulation takes three steps to align sequences of vectors: transform, extract, and reorder. 
We present a detailed discussion about the process of learning cross-lingual contextual representations, also about the performance in cross-lingual transfer learning and its implications.",/pdf/8c874dee490dfee0c5d4bfa73f9db2ba215db6d8.pdf,ICLR,2020,A language-independent contextual text representation for zero-shot cross-lingual transfer learning. +0OlrLvrsHwQ,LEbVP_N6ane,1601310000000.0,1615210000000.0,1178,Learning Parametrised Graph Shift Operators,"[""~George_Dasoulas1"", ""~Johannes_F._Lutzeyer1"", ""~Michalis_Vazirgiannis1""]","[""George Dasoulas"", ""Johannes F. Lutzeyer"", ""Michalis Vazirgiannis""]","[""graph neural networks"", ""graph shift operators"", ""graph classification"", ""node classification"", ""graph representation learning""]","In many domains data is currently represented as graphs and therefore, the graph representation of this data becomes increasingly important in machine learning. Network data is, implicitly or explicitly, always represented using a graph shift operator (GSO) with the most common choices being the adjacency, Laplacian matrices and their normalisations. In this paper, a novel parametrised GSO (PGSO) is proposed, where specific parameter values result in the most commonly used GSOs and message-passing operators in graph neural network (GNN) frameworks. The PGSO is suggested as a replacement of the standard GSOs that are used in state-of-the-art GNN architectures and the optimisation of the PGSO parameters is seamlessly included in the model training. It is proved that the PGSO has real eigenvalues and a set of real eigenvectors independent of the parameter values and spectral bounds on the PGSO are derived. PGSO parameters are shown to adapt to the sparsity of the graph structure in a study on stochastic blockmodel networks, where they are found to automatically replicate the GSO regularisation found in the literature. 
On several real-world datasets the accuracy of state-of-the-art GNN architectures is improved by the inclusion of the PGSO in both node- and graph-classification tasks. ",/pdf/e6905a755d143d191885c3b5fef08eb1af58ebb1.pdf,ICLR,2021,"We propose a parametrised graph shift operator (PGSO) to encode graph structure, providing a unified view of the most common GSOs, and improve GNN performance by incorporating the PGSO into the model training in an end-to-end manner." +rJl6M2C5Y7,BJeEYCa5KQ,1538090000000.0,1545360000000.0,1307,Online Hyperparameter Adaptation via Amortized Proximal Optimization,"[""pvicol@cs.toronto.edu"", ""zhc15@mails.tsinghua.edu.cn"", ""rgrosse@cs.toronto.edu""]","[""Paul Vicol"", ""Jeffery Z. HaoChen"", ""Roger Grosse""]","[""hyperparameters"", ""optimization"", ""learning rate adaptation""]","Effective performance of neural networks depends critically on effective tuning of optimization hyperparameters, especially learning rates (and schedules thereof). We present Amortized Proximal Optimization (APO), which takes the perspective that each optimization step should approximately minimize a proximal objective (similar to the ones used to motivate natural gradient and trust region policy optimization). Optimization hyperparameters are adapted to best minimize the proximal objective after one weight update. We show that an idealized version of APO (where an oracle minimizes the proximal objective exactly) achieves global convergence to stationary point and locally second-order convergence to global optimum for neural networks. APO incurs minimal computational overhead. We experiment with using APO to adapt a variety of optimization hyperparameters online during training, including (possibly layer-specific) learning rates, damping coefficients, and gradient variance exponents. 
For a variety of network architectures and optimization algorithms (including SGD, RMSprop, and K-FAC), we show that with minimal tuning, APO performs competitively with carefully tuned optimizers.",/pdf/f36d302acbed9d211fe7668d338b8098ffd670c7.pdf,ICLR,2019,"We introduce amortized proximal optimization (APO), a method to adapt a variety of optimization hyperparameters online during training, including learning rates, damping coefficients, and gradient variance exponents." +S1xtAjR5tX,ByxXipK5KQ,1538090000000.0,1551890000000.0,913,Improving Sequence-to-Sequence Learning via Optimal Transport,"[""liqun.chen@duke.edu"", ""yizhe.zhang@microsoft.com"", ""rz68@duke.edu"", ""chenyang.tao@duke.edu"", ""zhe.gan@microsoft.com"", ""hczhang1@gmail.com"", ""bai.li@duke.edu"", ""dinghan.shen@duke.edu"", ""cchangyou@gmail.com"", ""lcarin@duke.edu""]","[""Liqun Chen"", ""Yizhe Zhang"", ""Ruiyi Zhang"", ""Chenyang Tao"", ""Zhe Gan"", ""Haichao Zhang"", ""Bai Li"", ""Dinghan Shen"", ""Changyou Chen"", ""Lawrence Carin""]","[""NLP"", ""optimal transport"", ""sequence to sequence"", ""natural language processing""]","Sequence-to-sequence models are commonly trained via maximum likelihood estimation (MLE). However, standard MLE training considers a word-level objective, predicting the next word given the previous ground-truth partial sentence. This procedure focuses on modeling local syntactic patterns, and may fail to capture long-range semantic structure. We present a novel solution to alleviate these issues. Our approach imposes global sequence-level guidance via new supervision based on optimal transport, enabling the overall characterization and preservation of semantic features. We further show that this method can be understood as a Wasserstein gradient flow trying to match our model to the ground truth sequence distribution. 
Extensive experiments are conducted to validate the utility of the proposed approach, showing consistent improvements over a wide variety of NLP tasks, including machine translation, abstractive text summarization, and image captioning.",/pdf/860ff616aa1db79afa58aee7b00352ef17bee264.pdf,ICLR,2019, +ryx6WgStPB,BylxV-xKwS,1569440000000.0,1583910000000.0,2153,Hypermodels for Exploration,"[""vikranthd@google.com"", ""lxlu@google.com"", ""mibrahimi@google.com"", ""iosband@google.com"", ""zhengwen@google.com"", ""benvanroy@google.com""]","[""Vikranth Dwaracherla"", ""Xiuyuan Lu"", ""Morteza Ibrahimi"", ""Ian Osband"", ""Zheng Wen"", ""Benjamin Van Roy""]","[""exploration"", ""hypermodel"", ""reinforcement learning""]","We study the use of hypermodels to represent epistemic uncertainty and guide exploration. +This generalizes and extends the use of ensembles to approximate Thompson sampling. The computational cost of training an ensemble grows with its size, and as such, prior work has typically been limited to ensembles with tens of elements. We show that alternative hypermodels can enjoy dramatic efficiency gains, enabling behavior that would otherwise require hundreds or thousands of elements, and even succeed in situations where ensemble methods fail to learn regardless of size. +This allows more accurate approximation of Thompson sampling as well as use of more sophisticated exploration schemes. In particular, we consider an approximate form of information-directed sampling and demonstrate performance gains relative to Thompson sampling. As alternatives to ensembles, we consider linear and neural network hypermodels, also known as hypernetworks. 
+We prove that, with neural network base models, a linear hypermodel can represent essentially any distribution over functions, and as such, hypernetworks do not extend what can be represented.",/pdf/9e6d021fdb54902bcc899e0974dfa15f3b18a1a2.pdf,ICLR,2020,Hypermodels can encode posterior distributions similar to large ensembles at much smaller computational cost. This can facilitate significant improvements in exploration. +Bk6qQGWRb,HJd97G-Ab,1509140000000.0,1518730000000.0,810,Efficient Exploration through Bayesian Deep Q-Networks,"[""kazizzad@uci.edu"", ""ebrun@cs.stanford.edu"", ""animakumar@gmail.com""]","[""Kamyar Azizzadenesheli"", ""Emma Brunskill"", ""Animashree Anandkumar""]","[""Deep RL"", ""Thompson Sampling"", ""Posterior update""]","We propose Bayesian Deep Q-Network (BDQN), a practical Thompson sampling based Reinforcement Learning (RL) Algorithm. Thompson sampling allows for targeted exploration in high dimensions through posterior sampling but is usually computationally expensive. We address this limitation by introducing uncertainty only at the output layer of the network through a Bayesian Linear Regression (BLR) model, which can be trained with fast closed-form updates and its samples can be drawn efficiently through the Gaussian distribution. We apply our method to a wide range of Atari Arcade Learning Environments. 
Since BDQN carries out more efficient exploration, it is able to reach higher rewards substantially faster than a key baseline, DDQN.",/pdf/b8252cd7f7fce3fe5d9f9f1c54a91e8ff44cf650.pdf,ICLR,2018,Using Bayesian regression to estimate the posterior over Q-functions and deploy Thompson Sampling as a targeted exploration strategy with an efficient trade-off between exploration and exploitation +SklnVAEFDB,r1lDeDIuwr,1569440000000.0,1577170000000.0,1090,BERT-AL: BERT for Arbitrarily Long Document Understanding,"[""903276268@pku.edu.cn"", ""zhuoyu.wei@microsoft.com"", ""yushi@microsoft.com"", ""yining.chen@microsoft.com""]","[""Ruixuan Zhang"", ""Zhuoyu Wei"", ""Yu Shi"", ""Yining Chen""]",[],"Pretrained language models attract a lot of attention, and they take advantage of the two-stage training process: pretraining on a huge corpus and finetuning on specific tasks. Among them, BERT (Devlin et al., 2019) is a Transformer (Vaswani et al., 2017) based model and has been the state-of-the-art for many kinds of Natural Language Processing (NLP) tasks. However, BERT cannot take text longer than the maximum length as input since the maximum length is predefined during pretraining. When we apply BERT to long text tasks, e.g., document-level text summarization: 1) Truncating inputs by the maximum sequence length will decrease performance, since the model cannot capture long dependency and global information spanning the whole document. 2) Extending the maximum length requires re-pretraining which will cost a great deal of time and computing resources. What's even worse is that the computational complexity will increase quadratically with the length, which will result in an unacceptable training time. To resolve these problems, we propose to apply Transformer to only model local dependency and recurrently capture long dependency by inserting multi-channel LSTM into each layer of BERT. 
The proposed model is named as BERT-AL (BERT for Arbitrarily Long Document Understanding) and it can accept arbitrarily long input without re-pretraining from scratch. We demonstrate BERT-AL's effectiveness on text summarization by conducting experiments on the CNN/Daily Mail dataset. Furthermore, our method can be adapted to other Transformer based models, e.g., XLNet (Yang et al., 2019) and RoBERTa (Liu et al., 2019), for various NLP tasks with long text.",/pdf/92fa664b2e445ced111dd2f25a5807318c13a351.pdf,ICLR,2020, +Hygvln09K7,SygkcCs9FQ,1538090000000.0,1545360000000.0,1085,Meta Learning with Fast/Slow Learners,"[""chengzhuoyuan07@gmail.com""]","[""zhuoyuan@fb.com""]","[""computer vision"", ""meta learning""]","Meta-learning has recently achieved success in many optimization problems. In general, a meta learner g(.) could be learned for a base model f(.) on a variety of tasks, such that it can be more efficient on a new task. In this paper, we make some key modifications to enhance the performance of meta-learning models. (1) we leverage different meta-strategies for different modules to optimize them separately: we use conservative “slow learners” on low-level basic feature representation layers and “fast learners” on high-level task-specific layers; (2) Furthermore, we provide theoretical analysis on why the proposed approach works, based on a case study on a two-layer MLP. We evaluate our model on synthetic MLP regression, as well as low-shot learning tasks on Omniglot and ImageNet benchmarks. We demonstrate that our approach is able to achieve state-of-the-art performance.",/pdf/50464de80c7928c82c356173530565720a58249a.pdf,ICLR,2019,We applied multiple meta-strategy to improve meta-learning performance on base CNNs. 
+4IwieFS44l,F0xWfhiBxK,1601310000000.0,1614680000000.0,1961,Fooling a Complete Neural Network Verifier,"[""zomborid@inf.u-szeged.hu"", ""banhelyi@inf.u-szeged.hu"", ""csendes@inf.u-szeged.hu"", ""imegyeri@inf.u-szeged.hu"", ""~M\u00e1rk_Jelasity1""]","[""D\u00e1niel Zombori"", ""Bal\u00e1zs B\u00e1nhelyi"", ""Tibor Csendes"", ""Istv\u00e1n Megyeri"", ""M\u00e1rk Jelasity""]","[""adversarial examples"", ""complete verifiers"", ""numerical errors""]","The efficient and accurate characterization of the robustness of neural networks to input perturbation is an important open problem. Many approaches exist including heuristic and exact (or complete) methods. Complete methods are expensive but their mathematical formulation guarantees that they provide exact robustness metrics. However, this guarantee is valid only if we assume that the verified network applies arbitrary-precision arithmetic and the verifier is reliable. In practice, however, both the networks and the verifiers apply limited-precision floating point arithmetic. In this paper, we show that numerical roundoff errors can be exploited to craft adversarial networks, in which the actual robustness and the robustness computed by a state-of-the-art complete verifier radically differ. We also show that such adversarial networks can be used to insert a backdoor into any network in such a way that the backdoor is completely missed by the verifier. The attack is easy to detect in its naive form but, as we show, the adversarial network can be transformed to make its detection less trivial. We offer a simple defense against our particular attack based on adding a very small perturbation to the network weights. However, our conjecture is that other numerical attacks are possible, and exact verification has to take into account all the details of the computation executed by the verified networks, which makes the problem significantly harder. 
+ +",/pdf/77a36c9a49f48b9ddb12530a2f5dd127064dfae1.pdf,ICLR,2021,We propose an attack (along with a defense) to fool complete verification based on exploiting numerical errors. +BJxnIxSKDr,H1xVK9gtwr,1569440000000.0,1577170000000.0,2338,Mint: Matrix-Interleaving for Multi-Task Learning,"[""tianheyu@cs.stanford.edu"", ""szk@stanford.edu"", ""eric.anthony.mitchell95@gmail.com"", ""abhigupta@berkeley.edu"", ""hausmankarol@gmail.com"", ""svlevine@eecs.berkeley.edu"", ""cbfinn@cs.stanford.edu""]","[""Tianhe Yu"", ""Saurabh Kumar"", ""Eric Mitchell"", ""Abhishek Gupta"", ""Karol Hausman"", ""Sergey Levine"", ""Chelsea Finn""]","[""multi-task learning""]","Deep learning enables training of large and flexible function approximators from scratch at the cost of large amounts of data. Applications of neural networks often consider learning in the context of a single task. However, in many scenarios what we hope to learn is not just a single task, but a model that can be used to solve multiple different tasks. Such multi-task learning settings have the potential to improve data efficiency and generalization by sharing data and representations across tasks. However, in some challenging multi-task learning settings, particularly in reinforcement learning, it is very difficult to learn a single model that can solve all the tasks while realizing data efficiency and performance benefits. Learning each of the tasks independently from scratch can actually perform better in such settings, but it does not benefit from the representation sharing that multi-task learning can potentially provide. In this work, we develop an approach that endows a single model with the ability to represent both extremes: joint training and independent training. To this end, we introduce matrix-interleaving (Mint), a modification to standard neural network models that projects the activations for each task into a different learned subspace, represented by a per-task and per-layer matrix. 
By learning these matrices jointly with the other model parameters, the optimizer itself can decide how much to share representations between tasks. On three challenging multi-task supervised learning and reinforcement learning problems with varying degrees of shared task structure, we find that this model consistently matches or outperforms joint training and independent training, combining the best elements of both.",/pdf/9952c44c316b179940819cf5e805b9a3a954a0d9.pdf,ICLR,2020,"We propose an approach that endows a single model with the ability to represent both extremes: joint training and independent training, which leads to effective multi-task learning." +LkFG3lB13U5,xmNQ3C17pzG,1601310000000.0,1616020000000.0,2569,Adaptive Federated Optimization,"[""~Sashank_J._Reddi1"", ""~Zachary_Charles1"", ""~Manzil_Zaheer1"", ""zachgarrett@google.com"", ""krush@google.com"", ""~Jakub_Kone\u010dn\u00fd1"", ""~Sanjiv_Kumar1"", ""~Hugh_Brendan_McMahan1""]","[""Sashank J. Reddi"", ""Zachary Charles"", ""Manzil Zaheer"", ""Zachary Garrett"", ""Keith Rush"", ""Jakub Kone\u010dn\u00fd"", ""Sanjiv Kumar"", ""Hugh Brendan McMahan""]","[""Federated learning"", ""optimization"", ""adaptive optimization"", ""distributed optimization""]","Federated learning is a distributed machine learning paradigm in which a large number of clients coordinate with a central server to learn a model without sharing their own training data. Standard federated optimization methods such as Federated Averaging (FedAvg) are often difficult to tune and exhibit unfavorable convergence behavior. In non-federated settings, adaptive optimization methods have had notable success in combating such issues. In this work, we propose federated versions of adaptive optimizers, including Adagrad, Adam, and Yogi, and analyze their convergence in the presence of heterogeneous data for general non-convex settings. Our results highlight the interplay between client heterogeneity and communication efficiency. 
We also perform extensive experiments on these methods and show that the use of adaptive optimizers can significantly improve the performance of federated learning.",/pdf/d3f38daf93af27b20819fe19a4c3ca3f2635d9b1.pdf,ICLR,2021,"We propose adaptive federated optimization techniques, and highlight their improved performance over popular methods such as FedAvg." +r1g87C4KwB,B1xQTYHOPB,1569440000000.0,1588090000000.0,1039,The Break-Even Point on Optimization Trajectories of Deep Neural Networks,"[""staszek.jastrzebski@gmail.com"", ""msz93@o2.pl"", ""stanislav.fort@gmail.com"", ""devansharpit@gmail.com"", ""jcktbr@gmail.com"", ""kyunghyun.cho@nyu.edu"", ""k.j.geras@nyu.edu""]","[""Stanislaw Jastrzebski"", ""Maciej Szymczak"", ""Stanislav Fort"", ""Devansh Arpit"", ""Jacek Tabor"", ""Kyunghyun Cho*"", ""Krzysztof Geras*""]","[""generalization"", ""sgd"", ""learning rate"", ""batch size"", ""hessian"", ""curvature"", ""trajectory"", ""optimization""]","The early phase of training of deep neural networks is critical for their final performance. In this work, we study how the hyperparameters of stochastic gradient descent (SGD) used in the early phase of training affect the rest of the optimization trajectory. We argue for the existence of the ""break-even"" point on this trajectory, beyond which the curvature of the loss surface and noise in the gradient are implicitly regularized by SGD. In particular, we demonstrate on multiple classification tasks that using a large learning rate in the initial phase of training reduces the variance of the gradient, and improves the conditioning of the covariance of gradients. These effects are beneficial from the optimization perspective and become visible after the break-even point. Complementing prior work, we also show that using a low learning rate results in bad conditioning of the loss surface even for a neural network with batch normalization layers. 
In short, our work shows that key properties of the loss surface are strongly influenced by SGD in the early phase of training. We argue that studying the impact of the identified effects on generalization is a promising future direction.",/pdf/614bb41ca81c9195f40d67ce8f058d3a88ee87e2.pdf,ICLR,2020,"In the early phase of training of deep neural networks there exists a ""break-even point"" which determines properties of the entire optimization trajectory." +Hk9Xc_lR-,HyFm9Oe0Z,1509100000000.0,1519420000000.0,315,On the Discrimination-Generalization Tradeoff in GANs,"[""penzhan@microsoft.com"", ""qiang.liu@dartmouth.edu"", ""dennyzhou@google.com"", ""tax313@lehigh.edu"", ""xiaohe@microsoft.com""]","[""Pengchuan Zhang"", ""Qiang Liu"", ""Dengyong Zhou"", ""Tao Xu"", ""Xiaodong He""]","[""generative adversarial network"", ""discrimination"", ""generalization""]","Generative adversarial training can be generally understood as minimizing certain moment matching loss defined by a set of discriminator functions, typically neural networks. The discriminator set should be large enough to be able to uniquely identify the true distribution (discriminative), and also be small enough to go beyond memorizing samples (generalizable). In this paper, we show that a discriminator set is guaranteed to be discriminative whenever its linear span is dense in the set of bounded continuous functions. This is a very mild condition satisfied even by neural networks with a single neuron. Further, we develop generalization bounds between the learned distribution and true distribution under different evaluation metrics. When evaluated with neural distance, our bounds show that generalization is guaranteed as long as the discriminator set is small enough, regardless of the size of the generator or hypothesis set. When evaluated with KL divergence, our bound provides an explanation on the counter-intuitive behaviors of testing likelihood in GAN training. 
Our analysis sheds light on understanding the practical performance of GANs.",/pdf/486413de3bde7298335338e62841bf1cedf202ff.pdf,ICLR,2018,This paper studies the discrimination and generalization properties of GANs when the discriminator set is a restricted function class like neural networks. +ascdLuNQY4J,N36_8eyQvlA,1601310000000.0,1614990000000.0,2698,Searching for Convolutions and a More Ambitious NAS,"[""~Nicholas_Carl_Roberts1"", ""~Mikhail_Khodak1"", ""~Tri_Dao1"", ""~Liam_Li1"", ""~Nina_Balcan1"", ""~Christopher_Re1"", ""~Ameet_Talwalkar1""]","[""Nicholas Carl Roberts"", ""Mikhail Khodak"", ""Tri Dao"", ""Liam Li"", ""Nina Balcan"", ""Christopher Re"", ""Ameet Talwalkar""]","[""neural architecture search"", ""automated machine learning"", ""convolutional neural networks""]","An important goal of neural architecture search (NAS) is to automate-away the design of neural networks on new tasks in under-explored domains, thus helping to democratize machine learning. However, current NAS research largely focuses on search spaces consisting of existing operations---such as different types of convolution---that are already known to work well on well-studied problems---often in computer vision. Our work is motivated by the following question: can we enable users to build their own search spaces and discover the right neural operations given data from their specific domain? We make progress towards this broader vision for NAS by introducing a space of operations generalizing the convolution that enables search over a large family of parameterizable linear-time matrix-vector functions. Our flexible construction allows users to design their own search spaces adapted to the nature and shape of their data, to warm-start search methods using convolutions when they are known to perform well, or to discover new operations from scratch when they do not. 
We evaluate our approach on several novel search spaces over vision and text data, on all of which simple NAS search algorithms can find operations that perform better than baseline layers.",/pdf/23a94f8f507bc0222c9136777c4fd292f3202a6b.pdf,ICLR,2021,A general-purpose search space for neural architecture search that enables discovering operations that beat convolutions on image data. +SkgSXUKxx,,1478220000000.0,1484180000000.0,100,An Analysis of Feature Regularization for Low-shot Learning,"[""chenzhuoyuan@baidu.com"", ""liuxiao12@baidu.com"", ""wei.xu@baidu.com"", ""han.zhao@cs.cmu.edu""]","[""Zhuoyuan Chen"", ""Han Zhao"", ""Xiao Liu"", ""Wei Xu""]","[""Deep learning"", ""Computer vision""]","Low-shot visual learning, the ability to recognize novel object categories from very few, or even one example, is a hallmark of human visual intelligence. Though successful on many tasks, deep learning approaches tend to be notoriously data-hungry. Recently, feature penalty regularization has been proved effective on capturing new concepts. In this work, we provide both empirical evidence and theoretical analysis on how and why these methods work. We also propose a better design of cost function with improved performance. Close scrutiny reveals the centering effect of feature representation, as well as the intrinsic connection with batch normalization. 
Extensive experiments on synthetic datasets, the one-shot learning benchmark “Omniglot”, and large-scale ImageNet validate our analysis.",/pdf/d888edb052d293dcbc4b1983aeace80c350a1558.pdf,ICLR,2017,An analysis of adding regularization for low-shot learning +BkxackSKvH,Sygl6oAuPr,1569440000000.0,1577170000000.0,1895,Learning Entailment-Based Sentence Embeddings from Natural Language Inference,"[""rkarimi@idiap.ch"", ""florian.mai@idiap.ch"", ""james.henderson@idiap.ch""]","[""Rabeeh Karimi Mahabadi*"", ""Florian Mai*"", ""James Henderson""]","[""sentence embeddings"", ""textual entailment"", ""natural language inference"", ""interpretability""]","Large datasets on natural language inference are a potentially valuable resource for inducing semantic representations of natural language sentences. But in many such models the embeddings computed by the sentence encoder go through an MLP-based interaction layer before the label is predicted, and thus some of the information about textual entailment is encoded in the interpretation of sentence embeddings given by this parameterised MLP. +In this work we propose a simple interaction layer based on predefined entailment and contradiction scores applied directly to the sentence embeddings. This parameter-free interaction model achieves results on natural language inference competitive with MLP-based models, demonstrating that the trained sentence embeddings directly represent the information needed for textual entailment, and the inductive bias of this model leads to better generalisation to other related datasets.",/pdf/452f307847a4c61c5045518c6587b03468d0af5b.pdf,ICLR,2020,We propose a natural language inference model whose interaction layer imposes a direct interpretation of the induced sentence embeddings in terms of entailment and contradiction. 
+Bym0cU1CZ,BkQCqLJR-,1509020000000.0,1518730000000.0,127,Towards Interpretable Chit-chat: Open Domain Dialogue Generation with Dialogue Acts,"[""wuwei@microsoft.com"", ""can.xu@microsoft.com"", ""wumark@126.com"", ""lizj@buaa.edu.cn""]","[""Wei Wu"", ""Can Xu"", ""Yu Wu"", ""Zhoujun Li""]","[""dialogue generation"", ""dialogue acts"", ""open domain conversation"", ""supervised learning"", ""reinforcement learning""]","Conventional methods model open domain dialogue generation as a black box through end-to-end learning from large scale conversation data. In this work, we make the first step to open the black box by introducing dialogue acts into open domain dialogue generation. The dialogue acts are generally designed and reveal how people engage in social chat. Inspired by analysis on real data, we propose jointly modeling dialogue act selection and response generation, and perform learning with human-human conversations tagged with a dialogue act classifier and a reinforcement approach to further optimizing the model for long-term conversation. With the dialogue acts, we not only achieve significant improvement over state-of-the-art methods on response quality for given contexts and long-term conversation in both machine-machine simulation and human-machine conversation, but also are capable of explaining why such achievements can be made.",/pdf/672df9ae22ffca8bac44e03b31f7fded8e27628f.pdf,ICLR,2018,open domain dialogue generation with dialogue acts +HJlk-eHFwH,S1lbDAJKDS,1569440000000.0,1577170000000.0,2120,AdaGAN: Adaptive GAN for Many-to-Many Non-Parallel Voice Conversion,"[""maitreya_patel@daiict.ac.in"", ""purohit_mirali@daiict.ac.in"", ""mihirparmar@asu.edu"", ""nirmesh88_shah@daiict.ac.in"", ""hemant_patil@daiict.ac.in""]","[""Maitreya Patel"", ""Mirali Purohit"", ""Mihir Parmar"", ""Nirmesh J. Shah"", ""Hemant A. 
Patil""]","[""Voice Conversion"", ""Deep Learning"", ""Non parallel"", ""GAN"", ""AdaGAN"", ""AdaIN""]","Voice Conversion (VC) is a task of converting perceived speaker identity from a source speaker to a particular target speaker. Earlier approaches in the literature primarily find a mapping between the given source-target speaker-pairs. Developing mapping techniques for many-to-many VC using non-parallel data, including zero-shot learning, remains a less explored area in VC. Most of the many-to-many VC architectures require training data from all the target speakers for whom we want to convert the voices. In this paper, we propose a novel style transfer architecture, which can also be extended to generate voices even for target speakers whose data were not used in the training (i.e., case of zero-shot learning). In particular, we propose Adaptive Generative Adversarial Network (AdaGAN), a new architecture and training procedure that helps in learning a normalized speaker-independent latent representation, which will be used to generate speech with different speaking styles in the context of VC. We compare our results with the state-of-the-art StarGAN-VC architecture. In particular, the AdaGAN achieves 31.73%, and 10.37% relative improvement compared to the StarGAN in MOS tests for speech quality and speaker similarity, respectively. The key strength of the proposed architecture is that it yields these results with less computational complexity. AdaGAN is 88.6% less complex than StarGAN-VC in terms of FLoating Operation Per Second (FLOPS), and 85.46% less complex in terms of trainable parameters. ",/pdf/b3900224fc95bb09431a9b70f1845bdeb8de5094.pdf,ICLR,2020,Novel adaptive instance normalization based GAN framework for non parallel many-to-many and zero-shot VC. 
+B1x9ITVYDr,BJe6QFtvvr,1569440000000.0,1577170000000.0,569,"Compressive Recovery Defense: A Defense Framework for $\ell_0, \ell_2$ and $\ell_\infty$ norm attacks.","[""jasjeet.dhaliwal@sjsu.edu"", ""kyle.hambrook@sjsu.edu""]","[""Jasjeet Dhaliwal"", ""Kyle Hambrook""]","[""adversarial input"", ""adversarial machine learning"", ""neural networks"", ""compressive sensing.""]","We provide recovery guarantees for compressible signals that have been corrupted with noise and extend the framework introduced in \cite{bafna2018thwarting} to defend neural networks against $\ell_0$, $\ell_2$, and $\ell_{\infty}$-norm attacks. In the case of $\ell_0$-norm noise, we provide recovery guarantees for Iterative Hard Thresholding (IHT) and Basis Pursuit (BP). For $\ell_2$-norm bounded noise, we provide recovery guarantees for BP, and for the case of $\ell_\infty$-norm bounded noise, we provide recovery guarantees for Dantzig Selector (DS). These guarantees theoretically bolster the defense framework introduced in \cite{bafna2018thwarting} for defending neural networks against adversarial inputs. Finally, we experimentally demonstrate the effectiveness of this defense framework against an array of $\ell_0$, $\ell_2$ and $\ell_\infty$-norm attacks. ",/pdf/4afdb3539972a76c50b7e3195242e6e87bcfa928.pdf,ICLR,2020, +SJxTroR9F7,BkgY9KhtYm,1538090000000.0,1546370000000.0,130,Supervised Policy Update for Deep Reinforcement Learning,"[""quan.hovuong@gmail.com"", ""yiming.zhang@nyu.edu"", ""keithwross@nyu.edu""]","[""Quan Vuong"", ""Yiming Zhang"", ""Keith W. Ross""]","[""Deep Reinforcement Learning""]","We propose a new sample-efficient methodology, called Supervised Policy Update (SPU), for deep reinforcement learning. Starting with data generated by the current policy, SPU formulates and solves a constrained optimization problem in the non-parameterized proximal policy space. 
Using supervised regression, it then converts the optimal non-parameterized policy to a parameterized policy, from which it draws new samples. The methodology is general in that it applies to both discrete and continuous action spaces, and can handle a wide variety of proximity constraints for the non-parameterized optimization problem. We show how the Natural Policy Gradient and Trust Region Policy Optimization (NPG/TRPO) problems, and the Proximal Policy Optimization (PPO) problem can be addressed by this methodology. The SPU implementation is much simpler than TRPO. In terms of sample efficiency, our extensive experiments show SPU outperforms TRPO in Mujoco simulated robotic tasks and outperforms PPO in Atari video game tasks.",/pdf/39290ac8296a92b670e01d99790fb7d45a3a265e.pdf,ICLR,2019,"first posing and solving the sample efficiency optimization problem in the non-parameterized policy space, and then solving a supervised regression problem to find a parameterized policy that is near the optimal non-parameterized policy." +rkeZIJBYvr,SJeqGBa_wr,1569440000000.0,1583910000000.0,1719,Learning to Balance: Bayesian Meta-Learning for Imbalanced and Out-of-distribution Tasks,"[""haebeom.lee@kaist.ac.kr"", ""hayeon926@kaist.ac.kr"", ""donghyun.na@kaist.ac.kr"", ""shkim@aitrics.com"", ""mike_seop@aitrics.com"", ""eunhoy@kaist.ac.kr"", ""sjhwang82@kaist.ac.kr""]","[""Hae Beom Lee"", ""Hayeon Lee"", ""Donghyun Na"", ""Saehoon Kim"", ""Minseop Park"", ""Eunho Yang"", ""Sung Ju Hwang""]","[""meta-learning"", ""few-shot learning"", ""Bayesian neural network"", ""variational inference"", ""learning to learn"", ""imbalanced and out-of-distribution tasks for few-shot learning""]","While tasks could come with varying the number of instances and classes in realistic settings, the existing meta-learning approaches for few-shot classification assume that number of instances per task and class is fixed. 
Due to such restriction, they learn to equally utilize the meta-knowledge across all the tasks, even when the number of instances per task and class largely varies. Moreover, they do not consider distributional difference in unseen tasks, on which the meta-knowledge may have less usefulness depending on the task relatedness. To overcome these limitations, we propose a novel meta-learning model that adaptively balances the effect of the meta-learning and task-specific learning within each task. Through the learning of the balancing variables, we can decide whether to obtain a solution by relying on the meta-knowledge or task-specific learning. We formulate this objective into a Bayesian inference framework and tackle it using variational inference. We validate our Bayesian Task-Adaptive Meta-Learning (Bayesian TAML) on two realistic task- and class-imbalanced datasets, on which it significantly outperforms existing meta-learning approaches. Further ablation study confirms the effectiveness of each balancing component and the Bayesian learning framework. ",/pdf/64280d376d87b0ac0bb11499e1f5ecf2b3822784.pdf,ICLR,2020,"A novel meta-learning model that adaptively balances the effect of the meta-learning and task-specific learning, and also class-specific learning within each task." +SkfMWhAqYQ,HkgToZ05Ym,1538090000000.0,1549550000000.0,1150,Approximating CNNs with Bag-of-local-Features models works surprisingly well on ImageNet,"[""wieland.brendel@bethgelab.org"", ""matthias.bethge@uni-tuebingen.de""]","[""Wieland Brendel"", ""Matthias Bethge""]","[""interpretability"", ""representation learning"", ""bag of features"", ""deep learning"", ""object recognition""]","Deep Neural Networks (DNNs) excel on many complex perceptual tasks but it has proven notoriously difficult to understand how they reach their decisions. We here introduce a high-performance DNN architecture on ImageNet whose decisions are considerably easier to explain. 
Our model, a simple variant of the ResNet-50 architecture called BagNet, classifies an image based on the occurrences of small local image features without taking into account their spatial ordering. This strategy is closely related to the bag-of-feature (BoF) models popular before the onset of deep learning and reaches a surprisingly high accuracy on ImageNet (87.6% top-5 for 32 x 32 px features and Alexnet performance for 16 x16 px features). The constraint on local features makes it straight-forward to analyse how exactly each part of the image influences the classification. Furthermore, the BagNets behave similar to state-of-the art deep neural networks such as VGG-16, ResNet-152 or DenseNet-169 in terms of feature sensitivity, error distribution and interactions between image parts. This suggests that the improvements of DNNs over previous bag-of-feature classifiers in the last few years is mostly achieved by better fine-tuning rather than by qualitatively different decision strategies.",/pdf/6080072819789a32a4ac79377f6705667d053541.pdf,ICLR,2019,"Aggregating class evidence from many small image patches suffices to solve ImageNet, yields more interpretable models and can explain aspects of the decision-making of popular DNNs." +SJl47yBYPS,BygRBI3uwH,1569440000000.0,1577170000000.0,1614,Towards Simplicity in Deep Reinforcement Learning: Streamlined Off-Policy Learning,"[""cw1681@nyu.edu"", ""yanqiu.wu@nyu.edu"", ""quan.hovuong@gmail.com"", ""keithwross@nyu.edu""]","[""Che Wang"", ""Yanqiu Wu"", ""Quan Vuong"", ""Keith Ross""]","[""Deep Reinforcement Learning"", ""Sample Efficiency"", ""Off-Policy Algorithms""]","The field of Deep Reinforcement Learning (DRL) has recently seen a surge in the popularity of maximum entropy reinforcement learning algorithms. Their popularity stems from the intuitive interpretation of the maximum entropy objective and their superior sample efficiency on standard benchmarks. 
In this paper, we seek to understand the primary contribution of the entropy term to the performance of maximum entropy algorithms. For the Mujoco benchmark, we demonstrate that the entropy term in Soft Actor Critic (SAC) principally addresses the bounded nature of the action spaces. With this insight, we propose a simple normalization scheme which allows a streamlined algorithm without entropy maximization to match the performance of SAC. Our experimental results demonstrate a need to revisit the benefits of entropy regularization in DRL. We also propose a simple non-uniform sampling method for selecting transitions from the replay buffer during training. We further show that the streamlined algorithm with the simple non-uniform sampling scheme outperforms SAC and achieves state-of-the-art performance on challenging continuous control tasks.",/pdf/e5f23c639da3d2aa14eb8540800034dcf07f230c.pdf,ICLR,2020,We propose a new DRL off-policy algorithm achieving state-of-the-art performance. +3InxcRQsYLf,CX1fieLryq,1601310000000.0,1614990000000.0,3200,VideoGen: Generative Modeling of Videos using VQ-VAE and Transformers,"[""yunzhi@berkeley.edu"", ""~Wilson_Yan1"", ""~Pieter_Abbeel2"", ""~Aravind_Srinivas1""]","[""Yunzhi Zhang"", ""Wilson Yan"", ""Pieter Abbeel"", ""Aravind Srinivas""]","[""video generation"", ""vqvae"", ""transformers"", ""gpt""]","We present VideoGen: a conceptually simple architecture for scaling likelihood based generative modeling to natural videos. VideoGen uses VQ-VAE that learns downsampled discrete latent representations of a video by employing 3D convolutions and axial self-attention. A simple GPT-like architecture is then used to autoregressively model the discrete latents using spatio-temporal position encodings. 
Despite the simplicity in formulation, ease of training and a light compute requirement, our architecture is able to generate samples competitive with state-of-the-art GAN models for video generation on the BAIR Robot dataset, and generate coherent action-conditioned samples based on experiences gathered from the VizDoom simulator. We hope our proposed architecture serves as a reproducible reference for a minimalistic implementation of transformer based video generation models without requiring industry scale compute resources. Samples are available at https://sites.google.com/view/videogen",/pdf/1024deb2b0763c2d1186197d65b8d8ef5b3fce56.pdf,ICLR,2021,Video generation model with latent space autoregressive transformer +HJgEe1SKPr,BJeqemoOvr,1569440000000.0,1577170000000.0,1501,GAN-based Gaussian Mixture Model Responsibility Learning,"[""wanming.huang@student.uts.edu.au"", ""shuai.jiang-1@student.uts.edu.au"", ""xuan.liang@student.uts.edu.au"", ""ianopper@outlook.com"", ""yida.xu@uts.edu.au""]","[""Wanming Huang"", ""Shuai Jiang"", ""Xuan Liang"", ""Ian Oppermann"", ""Richard Yi Da Xu""]","[""Generative Adversarial Networks""]","Mixture Model (MM) is a probabilistic framework which allows us to define a dataset containing K different modes. When each of the modes is associated with a Gaussian distribution, we refer to it as Gaussian MM, or GMM. Given a data point x, GMM may assume the existence of a random index k ∈ {1, ..., K} identifying which Gaussian the particular data is associated with. In a traditional GMM paradigm, it is straightforward to compute in closed-form, the conditional likelihood p(x|k, θ), as well as responsibility probability p(k|x, θ) which describes the distribution index corresponds to the data. Computing the responsibility allows us to retrieve many important statistics of the overall dataset, including the weights of each of the modes. 
Modern large datasets often contain multiple unlabelled modes, such as paintings dataset containing several styles; fashion images containing several unlabelled categories. In its raw representation, the Euclidean distances between the data do not allow them to form mixtures naturally, nor is it feasible to compute the responsibility distribution, making GMM inapplicable. In this paper, we utilize the Generative Adversarial Network (GAN) framework to achieve an alternative plausible method to compute these probabilities at the data’s latent space z instead of x. Instead of defining p(x|k, θ) explicitly, we devised a modified GAN to allow us to define the distribution using p(z|k, θ), where z is the corresponding latent representation of x, as well as p(k|x, θ) through an additional classification network which is trained with the GAN in an “end-to-end” fashion. These techniques allow us to discover interesting properties of an unsupervised dataset, including dataset segments as well as generating new “out-distribution” data by smooth linear interpolation across any combinations of the modes in a completely unsupervised manner.",/pdf/23edf307dc7ed72cc5c6a937faed3ef40cd7e782.pdf,ICLR,2020, +XSLF1XFq5h,zr2ZL1sUni,1601310000000.0,1616010000000.0,1848,Getting a CLUE: A Method for Explaining Uncertainty Estimates,"[""~Javier_Antoran1"", ""~Umang_Bhatt1"", ""~Tameem_Adel1"", ""~Adrian_Weller1"", ""~Jos\u00e9_Miguel_Hern\u00e1ndez-Lobato1""]","[""Javier Antoran"", ""Umang Bhatt"", ""Tameem Adel"", ""Adrian Weller"", ""Jos\u00e9 Miguel Hern\u00e1ndez-Lobato""]","[""interpretability"", ""uncertainty"", ""explainability""]","Both uncertainty estimation and interpretability are important factors for trustworthy machine learning systems. However, there is little work at the intersection of these two areas. We address this gap by proposing a novel method for interpreting uncertainty estimates from differentiable probabilistic models, like Bayesian Neural Networks (BNNs). 
Our method, Counterfactual Latent Uncertainty Explanations (CLUE), indicates how to change an input, while keeping it on the data manifold, such that a BNN becomes more confident about the input's prediction. We validate CLUE through 1) a novel framework for evaluating counterfactual explanations of uncertainty, 2) a series of ablation experiments, and 3) a user study. Our experiments show that CLUE outperforms baselines and enables practitioners to better understand which input patterns are responsible for predictive uncertainty.",/pdf/bb1896f36e6eb8c78e3080ebea185ff4537fc95b.pdf,ICLR,2021,We introduce a method to help explain uncertainties of any differentiable probabilistic model by perturbing input features. +BkGakb9lx,,1478260000000.0,1484230000000.0,167,RenderGAN: Generating Realistic Labeled Data,"[""leon.sixt@fu-berlin.de"", ""benjamin.wild@fu-berlin.de"", ""tim.landgraf@fu-berlin.de""]","[""Leon Sixt"", ""Benjamin Wild"", ""Tim Landgraf""]","[""Unsupervised Learning"", ""Computer vision"", ""Deep learning"", ""Applications""]","Deep Convolutional Neuronal Networks (DCNNs) are showing remarkable performance on many computer vision tasks. Due to their large parameter space, they require many labeled samples when trained in a supervised setting. The costs of annotating data manually can render the use of DCNNs infeasible. We present a novel framework called RenderGAN that can generate large amounts of realistic, labeled images by combining a 3D model and the Generative Adversarial Network framework. In our approach, image augmentations (e.g. lighting, background, and detail) are learned from unlabeled data such that the generated images are strikingly realistic while preserving the labels known from the 3D model. We apply the RenderGAN framework to generate images of barcode-like markers that are attached to honeybees. Training a DCNN on data generated by the RenderGAN yields considerably better performance than training it on various baselines. 
",/pdf/1eec1561c58f263ff8ff3fb0fac669559d09b991.pdf,ICLR,2017,"We embed a 3D model in the GAN framework to generate realistic, labeled data." +Byg1v1HKDB,rkgnQ9TuPr,1569440000000.0,1583910000000.0,1752,Abductive Commonsense Reasoning,"[""chandrab@allenai.org"", ""ronanlb@allenai.org"", ""chaitanyam@allenai.org"", ""keisukes@allenai.org"", ""arih@allenai.org"", ""hrashkin@uw.edu"", ""dougd@allenai.org"", ""scottyih@fb.com"", ""yejinc@allenai.org""]","[""Chandra Bhagavatula"", ""Ronan Le Bras"", ""Chaitanya Malaviya"", ""Keisuke Sakaguchi"", ""Ari Holtzman"", ""Hannah Rashkin"", ""Doug Downey"", ""Wen-tau Yih"", ""Yejin Choi""]","[""Abductive Reasoning"", ""Commonsense Reasoning"", ""Natural Language Inference"", ""Natural Language Generation""]","Abductive reasoning is inference to the most plausible explanation. For example, if Jenny finds her house in a mess when she returns from work, and remembers that she left a window open, she can hypothesize that a thief broke into her house and caused the mess, as the most plausible explanation. While abduction has long been considered to be at the core of how people interpret and read between the lines in natural language (Hobbs et al., 1988), there has been relatively little research in support of abductive natural language inference and generation. We present the first study that investigates the viability of language-based abductive reasoning. We introduce a challenge dataset, ART, that consists of over 20k commonsense narrative contexts and 200k explanations. Based on this dataset, we conceptualize two new tasks – (i) Abductive NLI: a multiple-choice question answering task for choosing the more likely explanation, and (ii) Abductive NLG: a conditional generation task for explaining given observations in natural language. On Abductive NLI, the best model achieves 68.9% accuracy, well below human performance of 91.4%. 
On Abductive NLG, the current best language generators struggle even more, as they lack reasoning capabilities that are trivial for humans. Our analysis leads to new insights into the types of reasoning that deep pre-trained language models fail to perform—despite their strong performance on the related but more narrowly defined task of entailment NLI—pointing to interesting avenues for future research.",/pdf/48fb3337a97a1f5f932d55dcebbbadf13ff197e8.pdf,ICLR,2020, +r1RQdCg0W,r1TQ_Al0-,1509120000000.0,1518730000000.0,455,"MACH: Embarrassingly parallel $K$-class classification in $O(d\log{K})$ memory and $O(K\log{K} + d\log{K})$ time, instead of $O(Kd)$","[""qh5@rice.edu"", ""anshumali@rice.edu"", ""yiqiu.wang@rice.edu""]","[""Qixuan Huang"", ""Anshumali Shrivastava"", ""Yiqiu Wang""]","[""Extreme Classification"", ""Large-scale learning"", ""hashing"", ""GPU"", ""High Performance Computing""]","We present Merged-Averaged Classifiers via Hashing (MACH) for $K$-classification with large $K$. Compared to traditional one-vs-all classifiers that require $O(Kd)$ memory and inference cost, MACH only need $O(d\log{K})$ memory while only requiring $O(K\log{K} + d\log{K})$ operation for inference. MACH is the first generic $K$-classification algorithm, with provably theoretical guarantees, which requires $O(\log{K})$ memory without any assumption on the relationship between classes. MACH uses universal hashing to reduce classification with a large number of classes to few independent classification task with very small (constant) number of classes. We provide theoretical quantification of accuracy-memory tradeoff by showing the first connection between extreme classification and heavy hitters. With MACH we can train ODP dataset with 100,000 classes and 400,000 features on a single Titan X GPU (12GB), with the classification accuracy of 19.28\%, which is the best-reported accuracy on this dataset. 
Before this work, the best performing baseline is a one-vs-all classifier that requires 40 billion parameters (320 GB model size) and achieves 9\% accuracy. In contrast, MACH can achieve 9\% accuracy with 480x reduction in the model size (of mere 0.6GB). With MACH, we also demonstrate complete training of fine-grained imagenet dataset (compressed size 104GB), with 21,000 classes, on a single GPU.",/pdf/fdd401137e72fbd1a9d4f9e24386ba8f35cac598.pdf,ICLR,2018,"How to train 100,000 classes on a single GPU" +B1D6ty-A-,HkP6Y1-Rb,1509120000000.0,1518730000000.0,506,Training Autoencoders by Alternating Minimization,"[""cs14btech11020@iith.ac.in"", ""cs14resch11001@iith.ac.in"", ""cs13b1028@iith.ac.in"", ""vineethnb@iith.ac.in"", ""purushot@cse.iitk.ac.in""]","[""Sneha Kudugunta"", ""Adepu Shankar"", ""Surya Chavali"", ""Vineeth Balasubramanian"", ""Purushottam Kar""]","[""Deep Learning"", ""Autoencoders"", ""Alternating Optimization""]","We present DANTE, a novel method for training neural networks, in particular autoencoders, using the alternating minimization principle. DANTE provides a distinct perspective in lieu of traditional gradient-based backpropagation techniques commonly used to train deep networks. It utilizes an adaptation of quasi-convex optimization techniques to cast autoencoder training as a bi-quasi-convex optimization problem. We show that for autoencoder configurations with both differentiable (e.g. sigmoid) and non-differentiable (e.g. ReLU) activation functions, we can perform the alternations very effectively. DANTE effortlessly extends to networks with multiple hidden layers and varying network configurations. 
In experiments on standard datasets, autoencoders trained using the proposed method were found to be very promising when compared to those trained using traditional backpropagation techniques, both in terms of training speed, as well as feature extraction and reconstruction performance.",/pdf/2ed171a95dd0d97455d55134b9e994037a7603fb.pdf,ICLR,2018,We utilize the alternating minimization principle to provide an effective novel technique to train deep autoencoders. +ADwLLmSda3,fKKlN0NTel,1601310000000.0,1614990000000.0,2206,Neural Nonnegative CP Decomposition for Hierarchical Tensor Analysis,"[""jvendrow@math.ucla.edu"", ""~Jamie_Haddock1"", ""~Deanna_Needell2""]","[""Joshua Vendrow"", ""Jamie Haddock"", ""Deanna Needell""]","[""nonnegative tensor decompositions"", ""topic modeling"", ""hierarchical model"", ""CP decomposition"", ""neural network"", ""backpropagation""]","There is a significant demand for topic modeling on large-scale data with complex multi-modal structure in applications such as multi-layer network analysis, temporal document classification, and video data analysis; frequently this multi-modal data has latent hierarchical structure. We propose a new hierarchical nonnegative CANDECOMP/PARAFAC (CP) decomposition (hierarchical NCPD) model and a training method, Neural NCPD, for performing hierarchical topic modeling on multi-modal tensor data. Neural NCPD utilizes a neural network architecture and backpropagation to mitigate error propagation through hierarchical NCPD. ",/pdf/5ff44b48db470aff2d5a68a8ad106eee6fb065a6.pdf,ICLR,2021,"We propose a new hierarchical nonnegative CANDECOMP/PARAFAC (CP) decomposition (hierarchical NCPD) model and a training method, Neural NCPD, for performing hierarchical topic modeling on multi-modal tensor data." 
+Hkes0iR9KX,HkgK-kCtF7,1538090000000.0,1545360000000.0,926,DEEP GEOMETRICAL GRAPH CLASSIFICATION,"[""rahmani.sut@gmail.com"", ""pingli98@gmail.com""]","[""Mostafa Rahmani"", ""Ping Li""]","[""Graph classification"", ""Deep Learning"", ""Graph pooling"", ""Embedding""]","Most of the existing Graph Neural Networks (GNNs) are the mere extension of the Convolutional Neural Networks (CNNs) to graphs. Generally, they consist of several steps of message passing between the nodes followed by a global indiscriminate feature pooling function. In many data-sets, however, the nodes are unlabeled or their labels provide no information about the similarity between the nodes and the locations of the nodes in the graph. Accordingly, message passing may not propagate helpful information throughout the graph. We show that this conventional approach can fail to learn to perform even simple graph classification tasks. We alleviate this serious shortcoming of the GNNs by making them a two step method. In the first step of the proposed approach, a graph embedding algorithm is utilized to obtain a continuous feature vector for each node of the graph. The embedding algorithm represents the graph as a point-cloud in the embedding space. In the second step, the GNN is applied to the point-cloud representation of the graph provided by the embedding method. The GNN learns to perform the given task by inferring the topological structure of the graph encoded in the spatial distribution of the embedded vectors. In addition, we extend the proposed approach to the graph clustering problem and a new architecture for graph clustering is proposed. Moreover, the spatial representation of the graph is utilized to design a graph pooling algorithm. We turn the problem of graph down-sampling into a column sampling problem, i.e., the sampling algorithm selects a subset of the nodes whose feature vectors preserve the spatial distribution of all the feature vectors. 
We apply the proposed approach to several popular benchmark data-sets and it is shown that the proposed geometrical approach strongly improves the state-of-the-art result for several data-sets. For instance, for the PTC data-set, we improve the state-of-the-art result by more than 22%.",/pdf/0c47c1e0d879fd2d3c1ebb04c518168659a7e371.pdf,ICLR,2019,The graph analysis problem is transformed into a point cloud analysis problem. +rkg8FJBYDS,S1gi7I0_vB,1569440000000.0,1577170000000.0,1842,Variational Diffusion Autoencoders with Random Walk Sampling,"[""henryli@eng.ucsd.edu"", ""ofir.lindenbaum@yale.edu"", ""xiuyuan.cheng@duke.edu"", ""acloninger@ucsd.edu""]","[""Henry Li"", ""Ofir Lindenbaum"", ""Xiuyuan Cheng"", ""Alexander Cloninger""]","[""generative models"", ""variational inference"", ""manifold learning"", ""diffusion maps""]","Variational inference (VI) methods and especially variational autoencoders (VAEs) specify scalable generative models that enjoy an intuitive connection to manifold learning --- with many default priors the posterior/likelihood pair $q(z|x)$/$p(x|z)$ can be viewed as an approximate homeomorphism (and its inverse) between the data manifold and a latent Euclidean space. However, these approximations are well-documented to become degenerate in training. Unless the subjective prior is carefully chosen, the topologies of the prior and data distributions often will not match. +Conversely, diffusion maps (DM) automatically \textit{infer} the data topology and enjoy a rigorous connection to manifold learning, but do not scale easily or provide the inverse homeomorphism. +In this paper, we propose \textbf{a)} a principled measure for recognizing the mismatch between data and latent distributions and \textbf{b)} a method that combines the advantages of variational inference and diffusion maps to learn a homeomorphic generative model. 
The measure, the \textit{locally bi-Lipschitz property}, is a sufficient condition for a homeomorphism and easy to compute and interpret. The method, the \textit{variational diffusion autoencoder} (VDAE), is a novel generative algorithm that first infers the topology of the data distribution, then models a diffusion random walk over the data. To achieve efficient computation in VDAEs, we use stochastic versions of both variational inference and manifold learning optimization. We prove approximation theoretic results for the dimension dependence of VDAEs, and that locally isotropic sampling in the latent space results in a random walk over the reconstructed manifold. +Finally, we demonstrate the utility of our method on various real and synthetic datasets, and show that it exhibits performance superior to other generative models.",/pdf/212f06df21e428a6b04d85dc0f8dfa01b841ef9e.pdf,ICLR,2020,We combine variational inference and manifold learning (specifically VAEs and diffusion maps) to build a generative model based on a diffusion random walk on a data manifold; we generate samples by drawing from the walk's stationary distribution. +rJecSyHtDS,r1gNbm6OwB,1569440000000.0,1577170000000.0,1702,Learning to Recognize the Unseen Visual Predicates,"[""zhudefa@iie.ac.cn"", ""liusi@buaa.edu.cn"", ""jiangwentao@buaa.edu.cn"", ""liguanbin@mail.sysu.edu.cn"", ""wutianyi01@baidu.com"", ""guoguodong01@baidu.com""]","[""Defa Zhu"", ""Si Liu"", ""Wentao Jiang"", ""Guanbin Li"", ""Tianyi Wu"", ""Guodong Guo""]","[""Visual Relationship Detection"", ""Scene Graph Generation"", ""Knowledge"", ""Zero-shot Learning""]","Visual relationship recognition models are limited in the ability to generalize from finite seen predicates to unseen ones. We propose a new problem setting named predicate zero-shot learning (PZSL): learning to recognize the predicates without training data. 
It is unlike the previous zero-shot learning problem on visual relationship recognition which learns to recognize the unseen relationship triplets () but requires all components (subject, predicate, and object) to be seen in the training set. For the PZSL problem, however, the models are expected to recognize the diverse even unseen predicates, which is meaningful for many downstream high-level tasks, like visual question answering, to handle complex scenes and open questions. The PZSL is a very challenging task since the predicates are very abstract and follow an extreme long-tail distribution. To address the PZSL problem, we present a model that performs compatibility learning leveraging the linguistic priors from the corpus and knowledge base. An unbalanced sampled-softmax is further developed to tackle the extreme long-tail distribution of predicates. Finally, the experiments are conducted to analyze the problem and verify the effectiveness of our methods. The dataset and source code will be released for further study. ",/pdf/609650a9248225a75d9b6e0ad8763f252a304fcb.pdf,ICLR,2020,We propose and address a new problem named predicate zero-shot learning in visual relationship recognition. +4HGL3H9eL9U,uaLNal06O5l,1601310000000.0,1614990000000.0,3499,AT-GAN: An Adversarial Generative Model for Non-constrained Adversarial Examples,"[""~Xiaosen_Wang1"", ""~Kun_He1"", ""~Chuanbiao_Song2"", ""~Liwei_Wang1"", ""~John_E._Hopcroft1""]","[""Xiaosen Wang"", ""Kun He"", ""Chuanbiao Song"", ""Liwei Wang"", ""John E. Hopcroft""]","[""adversarial examples"", ""adversarial attack"", ""generation-based attack"", ""adversarial generative model"", ""non-constrained adversarial examples""]","With the rapid development of adversarial machine learning, numerous adversarial attack methods have been proposed. Typical attacks are based on a search in the neighborhood of input image to generate a perturbed adversarial example. 
Since 2017, generative models are adopted for adversarial attacks, and most of them focus on generating adversarial perturbations from input noise or input image. Thus the output is restricted by input for these works. A recent work targets unrestricted adversarial example using generative model but their method is based on a search in the neighborhood of input noise, so actually their output is still constrained by input. In this work, we propose AT-GAN (Adversarial Transfer on Generative Adversarial Net) to train an adversarial generative model that can directly produce adversarial examples. Different from previous works, we aim to learn the distribution of adversarial examples so as to generate semantically meaningful adversaries. AT-GAN achieves this goal by first learning a generative model for real data, followed by transfer learning to obtain the desired generative model. Once trained and transferred, AT-GAN could generate adversarial examples directly and quickly for any input noise, denoted as non-constrained adversarial examples. Extensive experiments and visualizations show that AT-GAN can efficiently generate diverse adversarial examples that are realistic to human perception, and yields higher attack success rates against adversarially trained models. + + ",/pdf/d82ee47f908c32573b75d681aa950a0b82aab0b0.pdf,ICLR,2021,"We propose to train an adversarial generative model called AT-GAN that aims to learn the distribution of adversarial examples, and can directly produce adversarial examples once trained." 
+rJsiFTYex,,1478250000000.0,1481920000000.0,149,A Way out of the Odyssey: Analyzing and Combining Recent Insights for LSTMs,"[""slongpre@cs.stanford.edu"", ""sabeekp@cs.stanford.edu"", ""cxiong@salesforce.com"", ""rsocher@salesforce.com""]","[""Shayne Longpre"", ""Sabeek Pradhan"", ""Caiming Xiong"", ""Richard Socher""]","[""Natural language processing"", ""Deep learning"", ""Supervised Learning""]","LSTMs have become a basic building block for many deep NLP models. In recent years, many improvements and variations have been proposed for deep sequence models in general, and LSTMs in particular. We propose and analyze a series of architectural modifications for LSTM networks resulting in improved performance for text classification datasets. We observe compounding improvements on traditional LSTMs using Monte Carlo test-time model averaging, deep vector averaging (DVA), and residual connections, along with four other suggested modifications. Our analysis provides a simple, reliable, and high quality baseline model.",/pdf/4cb2b5080ff414bc0ab0b750be4943baaad466ee.pdf,ICLR,2017,"Relatively simple augmentations to the LSTM, such as Monte Carlo test time averaging, deep vector averaging, and residual connections, can yield massive accuracy improvements on text classification datasets." +26WnoE4hjS,sDUnop-dmk,1601310000000.0,1614990000000.0,2131,Measuring and mitigating interference in reinforcement learning,"[""~Vincent_Liu3"", ""~Adam_M_White1"", ""~Hengshuai_Yao2"", ""~Martha_White1""]","[""Vincent Liu"", ""Adam M White"", ""Hengshuai Yao"", ""Martha White""]","[""Reinforcement Learning"", ""Representation Learning""]","Catastrophic interference is common in many network-based learning systems, and many proposals exist for mitigating it. But, before we overcome interference we must understand it better. In this work, we first provide a definition and novel measure of interference for value-based control methods such as Fitted Q Iteration and DQN. 
We systematically evaluate our measure of interference, showing that it correlates with forgetting, across a variety of network architectures. Our new interference measure allows us to ask novel scientific questions about commonly used deep learning architectures and develop new learning algorithms. In particular we show that updates on the last layer result in significantly higher interference than updates internal to the network. Lastly, we introduce a novel online-aware representation learning algorithm to minimize interference, and we empirically demonstrate that it improves stability and has lower interference.",/pdf/d1a3396c295958962a1140f29ce508997b2d77b1.pdf,ICLR,2021, +HyMRaoAqKX,HJeStoT5KQ,1538090000000.0,1545360000000.0,852,Implicit Autoencoders,"[""a.makhzani@gmail.com""]","[""Alireza Makhzani""]","[""Unsupervised Learning"", ""Generative Models"", ""Variational Inference"", ""Generative Adversarial Networks.""]","In this paper, we describe the ""implicit autoencoder"" (IAE), a generative autoencoder in which both the generative path and the recognition path are parametrized by implicit distributions. We use two generative adversarial networks to define the reconstruction and the regularization cost functions of the implicit autoencoder, and derive the learning rules based on maximum-likelihood learning. Using implicit distributions allows us to learn more expressive posterior and conditional likelihood distributions for the autoencoder. Learning an expressive conditional likelihood distribution enables the latent code to only capture the abstract and high-level information of the data, while the remaining information is captured by the implicit conditional likelihood distribution. For example, we show that implicit autoencoders can disentangle the global and local information, and perform deterministic or stochastic reconstructions of the images. 
We further show that implicit autoencoders can disentangle discrete underlying factors of variation from the continuous factors in an unsupervised fashion, and perform clustering and semi-supervised learning.",/pdf/23f4d501c32b71d7080422a2fcf058c18f572689.pdf,ICLR,2019,"We propose a generative autoencoder that can learn expressive posterior and conditional likelihood distributions using implicit distributions, and train the model using a new formulation of the ELBO." +BvrKnFq_454,88O_CGLOVQQ,1601310000000.0,1614990000000.0,2079,Expectigrad: Fast Stochastic Optimization with Robust Convergence Properties,"[""~Brett_Daley1"", ""~Christopher_Amato1""]","[""Brett Daley"", ""Christopher Amato""]","[""deep learning"", ""gradient descent"", ""optimization""]","Many popular adaptive gradient methods such as Adam and RMSProp rely on an exponential moving average (EMA) to normalize their stepsizes. While the EMA makes these methods highly responsive to new gradient information, recent research has shown that it also causes divergence on at least one convex optimization problem. We propose a novel method called Expectigrad, which adjusts stepsizes according to a per-component unweighted mean of all historical gradients and computes a bias-corrected momentum term jointly between the numerator and denominator. We prove that Expectigrad cannot diverge on every instance of the optimization problem known to cause Adam to diverge. We also establish a regret bound in the general stochastic nonconvex setting that suggests Expectigrad is less susceptible to gradient variance than existing methods are. Testing Expectigrad on several high-dimensional machine learning tasks, we find it often performs favorably to state-of-the-art methods with little hyperparameter tuning.",/pdf/3bef92915fc63896867f8526256addcb64c2258e.pdf,ICLR,2021,Expectigrad is an adaptive method that converges on a wider class of optimization problems and performs well in practice. 
+Byl1W1rtvH,S1llfvj_PH,1569440000000.0,1577170000000.0,1529,Recurrent Hierarchical Topic-Guided Neural Language Models,"[""gdd_xidian@126.com"", ""bchen@mail.xidian.edu.cn"", ""ruiyinglu_xidian@163.com"", ""mingyuan.zhou@mccombs.utexas.edu""]","[""Dandan Guo"", ""Bo Chen"", ""Ruiying Lu"", ""Mingyuan Zhou""]","[""Bayesian deep learning"", ""recurrent gamma belief net"", ""larger-context language model"", ""variational inference"", ""sentence generation"", ""paragraph generation""]","To simultaneously capture syntax and semantics from a text corpus, we propose a new larger-context language model that extracts recurrent hierarchical semantic structure via a dynamic deep topic model to guide natural language generation. Moving beyond a conventional language model that ignores long-range word dependencies and sentence order, the proposed model captures not only intra-sentence word dependencies, but also temporal transitions between sentences and inter-sentence topic dependences. For inference, we develop a hybrid of stochastic-gradient MCMC and recurrent autoencoding variational Bayes. Experimental results on a variety of real-world text corpora demonstrate that the proposed model not only outperforms state-of-the-art larger-context language models, but also learns interpretable recurrent multilayer topics and generates diverse sentences and paragraphs that are syntactically correct and semantically coherent.",/pdf/93442ce17f366643c5015e0ff2ded8fafc3cbc22.pdf,ICLR,2020,"We introduce a novel larger-context language model to simultaneously captures syntax and semantics, making it capable of generating highly interpretable sentences and paragraphs" +HkcdHtqlx,,1478300000000.0,1482380000000.0,487,Gated-Attention Readers for Text Comprehension,"[""bdhingra@cs.cmu.edu"", ""hanxiaol@cs.cmu.edu"", ""zhiliny@cs.cmu.edu"", ""wcohen@cs.cmu.edu"", ""rsalakhu@cs.cmu.edu""]","[""Bhuwan Dhingra"", ""Hanxiao Liu"", ""Zhilin Yang"", ""William W. 
Cohen"", ""Ruslan Salakhutdinov""]","[""Natural language processing"", ""Deep learning"", ""Supervised Learning""]","In this paper we study the problem of answering cloze-style questions over documents. Our model, the Gated-Attention (GA) Reader, integrates a multi-hop architecture with a novel attention mechanism, which is based on multiplicative interactions between the query embedding and the intermediate states of a recurrent neural network document reader. This enables the reader to build query-specific representations of tokens in the document for accurate answer selection. The GA Reader obtains state-of-the-art results on three benchmarks for this task--the CNN \& Daily Mail news stories and the Who Did What dataset. The effectiveness of multiplicative interaction is demonstrated by an ablation study, and by comparing to alternative compositional operators for implementing the gated-attention. ",/pdf/0c5f85d9571de0ff009f42643c4e76af418e3869.pdf,ICLR,2017, +SygONjRqKm,HygAwRc7FX,1538090000000.0,1545360000000.0,16,Amortized Context Vector Inference for Sequence-to-Sequence Networks,"[""sotirios.chatzis@cut.ac.cy"", ""k.v.tolias@edu.cut.ac.cy"", ""aristotelis.charalampous@edu.cut.ac.cy""]","[""Sotirios Chatzis"", ""Kyriacos Tolias"", ""Aristotelis Charalampous""]","[""neural attention"", ""sequence-to-sequence"", ""variational inference""]","Neural attention (NA) has become a key component of sequence-to-sequence models that yield state-of-the-art performance in as hard tasks as abstractive document summarization (ADS), machine translation (MT), and video captioning (VC). NA mechanisms perform inference of context vectors; these constitute weighted sums of deterministic input sequence encodings, adaptively sourced over long temporal horizons. 
Inspired from recent work in the field of amortized variational inference (AVI), in this work we consider treating the context vectors generated by soft-attention (SA) models as latent variables, with approximate finite mixture model posteriors inferred via AVI. We posit that this formulation may yield stronger generalization capacity, in line with the outcomes of existing applications of AVI to deep networks. To illustrate our method, we implement it and experimentally evaluate it considering challenging ADS, VC, and MT benchmarks. This way, we exhibit its improved effectiveness over state-of-the-art alternatives.",/pdf/80cf8de2a68386b75d0313a0b458b9e40bd84a1d.pdf,ICLR,2019,A generalisation of context representation in neural attention under the variational inference rationale. +5fJ0qcwBNr0,SBGg8L185y0,1601310000000.0,1614990000000.0,3243,A Gradient-based Kernel Approach for Efficient Network Architecture Search,"[""~Jingjing_Xu1"", ""~Liang_Zhao8"", ""junyang.ljy@alibaba-inc.com"", ""~Xu_Sun1"", ""~Hongxia_Yang2""]","[""Jingjing Xu"", ""Liang Zhao"", ""Junyang Lin"", ""Xu Sun"", ""Hongxia Yang""]","[""NAS""]","It is widely accepted that vanishing and exploding gradient values are the main reason behind the difficulty of deep network training. +In this work, we take a further step to understand the optimization of deep networks and find that both gradient correlations and gradient values have strong impacts on model training. +Inspired by our new finding, we explore a simple yet effective network architecture search (NAS) approach that leverages gradient correlation and gradient values to find well-performing architectures. To be specific, we first formulate these two terms into a unified gradient-based kernel and then select architectures with the largest kernels at initialization as the final networks. 
+The new approach replaces the expensive ``train-then-test'' evaluation paradigm with a new lightweight function according to the gradient-based kernel at initialization. +Experiments show that our approach achieves competitive results with orders of magnitude faster than ``train-then-test'' paradigms on image classification tasks. Furthermore, the extremely low search cost enables its wide applications. It also obtains performance improvements on two text classification tasks.",/pdf/fb485373e4361341feb4bf589781b241cc3adfb7.pdf,ICLR,2021, +rJeXCo0cYX,S1gFtMvqt7,1538090000000.0,1575050000000.0,877,BabyAI: A Platform to Study the Sample Efficiency of Grounded Language Learning,"[""maximechevalierb@gmail.com"", ""dimabgv@gmail.com"", ""salemlahlou9@gmail.com"", ""lcswillems@gmail.com"", ""chitwaniit@gmail.com"", ""thien@cs.uoregon.edu"", ""yoshua.bengio@umontreal.ca""]","[""Maxime Chevalier-Boisvert"", ""Dzmitry Bahdanau"", ""Salem Lahlou"", ""Lucas Willems"", ""Chitwan Saharia"", ""Thien Huu Nguyen"", ""Yoshua Bengio""]","[""language"", ""learning"", ""efficiency"", ""imitation learning"", ""reinforcement learning""]","Allowing humans to interactively train artificial agents to understand language instructions is desirable for both practical and scientific reasons. Though, given the lack of sample efficiency in current learning methods, reaching this goal may require substantial research efforts. We introduce the BabyAI research platform, with the goal of supporting investigations towards including humans in the loop for grounded language learning. The BabyAI platform comprises an extensible suite of 19 levels of increasing difficulty. Each level gradually leads the agent towards acquiring a combinatorially rich synthetic language, which is a proper subset of English. The platform also provides a hand-crafted bot agent, which simulates a human teacher. 
We report estimated amount of supervision required for training neural reinforcement and behavioral-cloning agents on some BabyAI levels. We put forward strong evidence that current deep learning methods are not yet sufficiently sample-efficient in the context of learning a language with compositional properties.",/pdf/9adbab84c4e547679f96a091273b3ff518fa6972.pdf,ICLR,2019,We present the BabyAI platform for studying data efficiency of language learning with a human in the loop +BkeqATVYwr,rkxIVCWdwS,1569440000000.0,1577170000000.0,865,GRAPH NEIGHBORHOOD ATTENTIVE POOLING,"[""zekarias@kth.se"", ""sarunasg@kth.se""]","[""Zekarias Tilahun Kefato"", ""Sarunas Girdzijauskas""]","[""Network Representation Learning"", ""Attentive Pooling Networks"", ""Context-sensitive Embedding"", ""Mutual Attention"", ""Link Prediction"", ""Node Clustering""]","Network representation learning (NRL) is a powerful technique for learning low-dimensional vector representation of high-dimensional and sparse graphs. Most studies explore the structure and meta data associated with the graph using random walks and employ a unsupervised or semi-supervised learning schemes. Learning in these methods is context-free, because only a single representation per node is learned. Recently studies have argued on the sufficiency of a single representation and proposed a context-sensitive approach that proved to be highly effective in applications such as link prediction and ranking. +However, most of these methods rely on additional textual features that require RNNs or CNNs to capture high-level features or rely on a community detection algorithm to identifying multiple contexts of a node. +In this study, without requiring additional features nor a community detection algorithm, we propose a novel context-sensitive algorithm called GAP that learns to attend on different part of a node’s neighborhood using attentive pooling networks. 
We show the efficacy of GAP using three real-world datasets on link prediction and node clustering tasks and compare it against 10 popular and state-of-the-art (SOTA) baselines. GAP consistently outperforms them and achieves up to ≈9% and ≈20% gain over the best performing methods on link prediction and clustering tasks, respectively.",/pdf/99b5bb527c624fdddd1407b47ebf76448b7a96a1.pdf,ICLR,2020, +HJxdAoCcYX,r1lwtgyqtQ,1538090000000.0,1545360000000.0,907,Characterizing Malicious Edges targeting on Graph Neural Networks,"[""xuxiaojun1005@gmail.com"", ""yue9yu@gmail.com"", ""lxbosky@gmail.com"", ""lsong@cc.gatech.edu"", ""windsonliu@tencent.com"", ""cgunter@illinois.edu""]","[""Xiaojun Xu"", ""Yue Yu"", ""Bo Li"", ""Le Song"", ""Chengfeng Liu"", ""Carl Gunter""]",[],"Deep neural networks on graph structured data have shown increasing success in various applications. However, due to recent studies about vulnerabilities of machine learning models, researchers are encouraged to explore the robustness of graph neural networks (GNNs). So far there are two work targeting to attack GNNs by adding/deleting edges to fool graph based classification tasks. Such attacks are challenging to be detected since the manipulation is very subtle compared with traditional graph attacks. In this paper we propose the first detection mechanism against these two proposed attacks. Given a perturbed graph, we propose a novel graph generation method together with link prediction as preprocessing to detect potential malicious edges. We also propose novel features which can be leveraged to perform outlier detection when the number of added malicious edges are large. Different detection components are proposed and tested, and we also evaluate the performance of final detection pipeline. Extensive experiments are conducted to show that the proposed detection mechanism can achieve AUC above 90% against the two attack strategies on both Cora and Citeseer datasets. 
We also provide in-depth analysis of different attack strategies and corresponding suitable detection methods. Our results shed light on several principles for detecting different types of attacks.",/pdf/0537498a9f7831740e1aec1f49daf5d7be45e036.pdf,ICLR,2019, +ZPa2SyGcbwh,zIsA0P5mEC,1601310000000.0,1616820000000.0,2811,Learning with Feature-Dependent Label Noise: A Progressive Approach,"[""~Yikai_Zhang1"", ""~Songzhu_Zheng1"", ""~Pengxiang_Wu1"", ""~Mayank_Goswami1"", ""~Chao_Chen1""]","[""Yikai Zhang"", ""Songzhu Zheng"", ""Pengxiang Wu"", ""Mayank Goswami"", ""Chao Chen""]","[""Noisy Label"", ""Deep Learning"", ""Classification""]","Label noise is frequently observed in real-world large-scale datasets. The noise is introduced due to a variety of reasons; it is heterogeneous and feature-dependent. Most existing approaches to handling noisy labels fall into two categories: they either assume an ideal feature-independent noise, or remain heuristic without theoretical guarantees. In this paper, we propose to target a new family of feature-dependent label noise, which is much more general than commonly used i.i.d. label noise and encompasses a broad spectrum of noise patterns. Focusing on this general noise family, we propose a progressive label correction algorithm that iteratively corrects labels and refines the model. We provide theoretical guarantees showing that for a wide variety of (unknown) noise patterns, a classifier trained with this strategy converges to be consistent with the Bayes classifier. In experiments, our method outperforms SOTA baselines and is robust to various noise types and levels.",/pdf/2223330f69a5f2c06d8e54f79e3935f8da43de13.pdf,ICLR,2021,We propose a progressive label correction approach for noisy label learning task. 
+Bkf4XgrKvS,H1g9b4xFvS,1569440000000.0,1577170000000.0,2208,Unsupervised Learning of Graph Hierarchical Abstractions with Differentiable Coarsening and Optimal Transport,"[""tengfei.ma1@ibm.com"", ""chenjie@us.ibm.com""]","[""Tengfei Ma"", ""Jie Chen""]","[""Unsupervised learning"", ""hierarchical representation learning"", ""graph neural networks""]","Hierarchical abstractions are a methodology for solving large-scale graph problems in various disciplines. Coarsening is one such approach: it generates a pyramid of graphs whereby the one in the next level is a structural summary of the prior one. With a long history in scientific computing, many coarsening strategies were developed based on mathematically driven heuristics. Recently, resurgent interests exist in deep learning to design hierarchical methods learnable through differentiable parameterization. These approaches are paired with downstream tasks for supervised learning. In this work, we propose an unsupervised approach, coined \textsc{OTCoarsening}, with the use of optimal transport. Both the coarsening matrix and the transport cost matrix are parameterized, so that an optimal coarsening strategy can be learned and tailored for a given set of graphs. 
We demonstrate that the proposed approach produces meaningful coarse graphs and yields competitive performance compared with supervised methods for graph classification.",/pdf/480a1e2a2459842b0b1b1a67109715498481b832.pdf,ICLR,2020, +szUsQ3NcQwV,TdIfbDxqNPm,1601310000000.0,1614990000000.0,1939,Randomized Entity-wise Factorization for Multi-Agent Reinforcement Learning,"[""~Shariq_Iqbal1"", ""~Christian_Schroeder_de_Witt1"", ""~Bei_Peng2"", ""~Wendelin_Boehmer1"", ""~Shimon_Whiteson1"", ""~Fei_Sha3""]","[""Shariq Iqbal"", ""Christian Schroeder de Witt"", ""Bei Peng"", ""Wendelin Boehmer"", ""Shimon Whiteson"", ""Fei Sha""]","[""MARL"", ""multi-agent reinforcement learning"", ""value function factorization"", ""attention""]","Real world multi-agent tasks often involve varying types and quantities of agents and non-agent entities; however, agents within these tasks rarely need to consider all others at all times in order to act effectively. Factored value function approaches have historically leveraged such independences to improve learning efficiency, but these approaches typically rely on domain knowledge to select fixed subsets of state features to include in each factor. We propose to utilize value function factoring with random subsets of entities in each factor as an auxiliary objective in order to disentangle value predictions from irrelevant entities. This factoring approach is instantiated through a simple attention mechanism masking procedure. We hypothesize that such an approach helps agents learn more effectively in multi-agent settings by discovering common trajectories across episodes within sub-groups of agents/entities. 
Our approach, Randomized Entity-wise Factorization for Imagined Learning (REFIL), outperforms all strong baselines by a significant margin in challenging StarCraft micromanagement tasks.",/pdf/f995c4ccbb9635272a00b2f12b6d2afc11f6bd0c.pdf,ICLR,2021,We propose an auxiliary objective to disentangle value function predictions from irrelevant entities in cooperative MARL and find that it significantly improves performance in settings with varying quantities and types of agents. +BJl-5pNKDB,H1xfpuTwvS,1569440000000.0,1583910000000.0,696,On Computation and Generalization of Generative Adversarial Imitation Learning,"[""mchen393@gatech.edu"", ""wyzjack990122@gmail.com"", ""tianyiliu@gatech.edu"", ""zy6@princeton.edu"", ""xingguol@princeton.edu"", ""zhaoran.wang@northwestern.edu"", ""tourzhao@gatech.edu""]","[""Minshuo Chen"", ""Yizhou Wang"", ""Tianyi Liu"", ""Zhuoran Yang"", ""Xingguo Li"", ""Zhaoran Wang"", ""Tuo Zhao""]",[],"Generative Adversarial Imitation Learning (GAIL) is a powerful and practical approach for learning sequential decision-making policies. Different from Reinforcement Learning (RL), GAIL takes advantage of demonstration data by experts (e.g., human), and learns both the policy and reward function of the unknown environment. Despite the significant empirical progresses, the theory behind GAIL is still largely unknown. The major difficulty comes from the underlying temporal dependency of the demonstration data and the minimax computational formulation of GAIL without convex-concave structure. To bridge such a gap between theory and practice, this paper investigates the theoretical properties of GAIL. 
Specifically, we show: (1) For GAIL with general reward parameterization, the generalization can be guaranteed as long as the class of the reward functions is properly controlled; (2) For GAIL, where the reward is parameterized as a reproducing kernel function, GAIL can be efficiently solved by stochastic first order optimization algorithms, which attain sublinear convergence to a stationary solution. To the best of our knowledge, these are the first results on statistical and computational guarantees of imitation learning with reward/policy function approximation. Numerical experiments are provided to support our analysis. +",/pdf/380211993dcb604e40da970774090070dc0b6f02.pdf,ICLR,2020, +S1MAriC5F7,SJliZHaYYQ,1538090000000.0,1545360000000.0,133,Massively Parallel Hyperparameter Tuning,"[""jamieson@cs.washington.edu"", ""rostami@google.com"", ""kgonina@google.com"", ""hardt@berkeley.edu"", ""brecht@berkeley.edu"", ""talwalkar@cmu.edu""]","[""Liam Li"", ""Kevin Jamieson"", ""Afshin Rostamizadeh"", ""Ekaterina Gonina"", ""Moritz Hardt"", ""Ben Recht"", ""Ameet Talwalkar""]","[""hyperparameter optimization"", ""automl""]","Modern learning models are characterized by large hyperparameter spaces. In order to adequately explore these large spaces, we must evaluate a large number of configurations, typically orders of magnitude more configurations than available parallel workers. Given the growing costs of model training, we would ideally like to perform this search in roughly the same wall-clock time needed to train a single model. In this work, we tackle this challenge by introducing ASHA, a simple and robust hyperparameter tuning algorithm with solid theoretical underpinnings that exploits parallelism and aggressive early-stopping. 
Our extensive empirical results show that ASHA outperforms state-of-the-art hyperparameter tuning methods; scales linearly with the number of workers in distributed settings; converges to a high quality configuration in half the time taken by Vizier (Google's internal hyperparameter tuning service) in an experiment with 500 workers; and beats the published result for a near state-of-the-art LSTM architecture in under $2\times$ the time to train a single model.",/pdf/ef50518fcca7d3f60c4af4b94275363c0cea8cd8.pdf,ICLR,2019, +S1xWh1rYwB,S1efdeyYvS,1569440000000.0,1588880000000.0,1940,Restricting the Flow: Information Bottlenecks for Attribution,"[""karl.schulz@tum.de"", ""leon.sixt@fu-berlin.de"", ""tombari@in.tum.de"", ""tim.landgraf@fu-berlin.de""]","[""Karl Schulz"", ""Leon Sixt"", ""Federico Tombari"", ""Tim Landgraf""]","[""Attribution"", ""Informational Bottleneck"", ""Interpretable Machine Learning"", ""Explainable AI""]","Attribution methods provide insights into the decision-making of machine learning models like artificial neural networks. For a given input sample, they assign a relevance score to each individual input variable, such as the pixels of an image. In this work, we adopt the information bottleneck concept for attribution. By adding noise to intermediate feature maps, we restrict the flow of information and can quantify (in bits) how much information image regions provide. We compare our method against ten baselines using three different metrics on VGG-16 and ResNet-50, and find that our methods outperform all baselines in five out of six settings. The method’s information-theoretic foundation provides an absolute frame of reference for attribution values (bits) and a guarantee that regions scored close to zero are not necessary for the network's decision. ",/pdf/fa83870e660a55419088e307e2571dfea2df9116.pdf,ICLR,2020,We apply the informational bottleneck concept to attribution. 
+SJvrXqvaZ,BJISm5vT-,1508510000000.0,1518730000000.0,24,Adversary A3C for Robust Reinforcement Learning,"[""guzhaoyuan14@gmail.com"", ""zhenzhong.jia@gmail.com"", ""choset@cs.cmu.edu""]","[""Zhaoyuan Gu"", ""Zhenzhong Jia"", ""Howie Choset""]","[""Adversary"", ""Robust"", ""Reinforcement Learning"", ""A3C""]","Asynchronous Advantage Actor Critic (A3C) is an effective Reinforcement Learning (RL) algorithm for a wide range of tasks, such as Atari games and robot control. The agent learns policies and value function through trial-and-error interactions with the environment until converging to an optimal policy. Robustness and stability are critical in RL; however, neural networks can be vulnerable to noise from unexpected sources and are not likely to withstand very slight disturbances. We note that agents generated from mild environments using A3C are not able to handle challenging environments. Learning from adversarial examples, we propose an algorithm called Adversary Robust A3C (AR-A3C) to improve the agent’s performance under noisy environments. In this algorithm, an adversarial agent is introduced to the learning process to make it more robust against adversarial disturbances, thereby making it more adaptive to noisy environments. Both simulations and real-world experiments are carried out to illustrate the stability of the proposed algorithm. The AR-A3C algorithm outperforms A3C in both clean and noisy environments. 
",/pdf/c25bf035784d1264c1a41f3000bfe219c202b361.pdf,ICLR,2018, +Hk-FlMbAZ,ry1KxGZRW,1509130000000.0,1518730000000.0,770,The Manifold Assumption and Defenses Against Adversarial Perturbations,"[""xiwu@cs.wisc.edu"", ""wjang@cs.wisc.edu"", ""lchen@cs.wisc.edu"", ""jha@cs.wisc.edu""]","[""Xi Wu"", ""Uyeong Jang"", ""Lingjiao Chen"", ""Somesh Jha""]","[""the manifold assumption"", ""adversarial perturbation"", ""neural networks""]","In the adversarial-perturbation problem of neural networks, an adversary starts with a neural network model $F$ and a point $\bfx$ that $F$ classifies correctly, and applies a \emph{small perturbation} to $\bfx$ to produce another point $\bfx'$ that $F$ classifies \emph{incorrectly}. In this paper, we propose taking into account \emph{the inherent confidence information} produced by models when studying adversarial perturbations, where a natural measure of ``confidence'' is $\|F(\bfx)\|_\infty$ (i.e. how confident $F$ is about its prediction?). Motivated by a thought experiment based on the manifold assumption, we propose a ``goodness property'' of models which states that \emph{confident regions of a good model should be well separated}. We give formalizations of this property and examine existing robust training objectives in view of them. Interestingly, we find that a recent objective by Madry et al. encourages training a model that satisfies well our formal version of the goodness property, but has a weak control of points that are wrong but with low confidence. However, if Madry et al.'s model is indeed a good solution to their objective, then good and bad points are now distinguishable and we can try to embed uncertain points back to the closest confident region to get (hopefully) correct predictions. We thus propose embedding objectives and algorithms, and perform an empirical study using this method. 
Our experimental results are encouraging: Madry et al.'s model wrapped with our embedding procedure achieves almost perfect success rate in defending against attacks that the base model fails on, while retaining good generalization behavior. +",/pdf/f57d780f3311d322c02e9fb2ab4e43f33151c0c1.pdf,ICLR,2018,Defending against adversarial perturbations of neural networks from manifold assumption +SkxQp1StDH,HJx1gVktwS,1569440000000.0,1583910000000.0,1982,Low-dimensional statistical manifold embedding of directed graphs,"[""fun@biba.uni-bremen.de"", ""tian.guo0980@gmail.com"", ""alen.lancic@math.hr"", ""nino.antulov@gess.ethz.ch""]","[""Thorben Funke"", ""Tian Guo"", ""Alen Lancic"", ""Nino Antulov-Fantulin""]","[""graph embedding"", ""information geometry"", ""graph representations""]","We propose a novel node embedding of directed graphs to statistical manifolds, which is based on a global minimization of pairwise relative entropy and graph geodesics in a non-linear way. Each node is encoded with a probability density function over a measurable space. Furthermore, we analyze the connection of the geometrical properties of such embedding and their efficient learning procedure. Extensive experiments show that our proposed embedding is better preserving the global geodesic information of graphs, as well as outperforming existing embedding models on directed graphs in a variety of evaluation metrics, in an unsupervised setting.",/pdf/5505a3164f03ecb8799366aee7409994bdcf64b1.pdf,ICLR,2020,"We propose a novel node embedding of directed graphs to statistical manifolds and analyze connections to divergence, geometry and efficient learning procedure." 
+rJl_NhR9K7,rylSAfn5tX,1538090000000.0,1545360000000.0,1466,ISA-VAE: Independent Subspace Analysis with Variational Autoencoders,"[""t-jastuh@microsoft.com"", ""ret26@cam.ac.uk"", ""senowozi@microsoft.com""]","[""Jan St\u00fchmer"", ""Richard Turner"", ""Sebastian Nowozin""]","[""representation learning"", ""disentanglement"", ""interpretability"", ""variational autoencoders""]","Recent work has shown increased interest in using the Variational Autoencoder (VAE) framework to discover interpretable representations of data in an unsupervised way. These methods have focussed largely on modifying the variational cost function to achieve this goal. However, we show that methods like beta-VAE simplify the tendency of variational inference to underfit causing pathological over-pruning and over-orthogonalization of learned components. In this paper we take a complementary approach: to modify the probabilistic model to encourage structured latent variable representations to be discovered. Specifically, the standard VAE probabilistic model is unidentifiable: the likelihood of the parameters is invariant under rotations of the latent space. This means there is no pressure to identify each true factor of variation with a latent variable. +We therefore employ a rich prior distribution, akin to the ICA model, that breaks the rotational symmetry. +Extensive quantitative and qualitative experiments demonstrate that the proposed prior mitigates the trade-off introduced by modified cost functions like beta-VAE and TCVAE between reconstruction loss and disentanglement. The proposed prior allows to improve these approaches with respect to both disentanglement and reconstruction quality significantly over the state of the art.",/pdf/0fbbd986c4d7485dd449815014b09ef789e8da44.pdf,ICLR,2019,We present structured priors for unsupervised learning of disentangled representations in VAEs that significantly mitigate the trade-off between disentanglement and reconstruction loss. 
+Qm7R_SdqTpT,FM8J39BG3Env,1601310000000.0,1615920000000.0,1046,Diverse Video Generation using a Gaussian Process Trigger,"[""~Gaurav_Shrivastava1"", ""~Abhinav_Shrivastava2""]","[""Gaurav Shrivastava"", ""Abhinav Shrivastava""]","[""video synthesis"", ""future frame generation"", ""video generation"", ""gaussian process priors"", ""diverse video generation""]","Generating future frames given a few context (or past) frames is a challenging task. It requires modeling the temporal coherence of videos as well as multi-modality in terms of diversity in the potential future states. Current variational approaches for video generation tend to marginalize over multi-modal future outcomes. Instead, we propose to explicitly model the multi-modality in the future outcomes and leverage it to sample diverse futures. Our approach, Diverse Video Generator, uses a GP to learn priors on future states given the past and maintains a probability distribution over possible futures given a particular sample. We leverage the changes in this distribution over time to control the sampling of diverse future states by estimating the end of on-going sequences. In particular, we use the variance of GP over the output function space to trigger a change in the action sequence. We achieve state-of-the-art results on diverse future frame generation in terms of reconstruction quality and diversity of the generated sequences.",/pdf/fa17e9ba0b11fb7c0b424feda300e370d82df3eb.pdf,ICLR,2021,"Diverse future frame synthesis by modeling the diversity of future states using a Gaussian Process, and using Bayesian inference to sample diverse future states." +_QdvdkxOii6,KEcoRarKl96,1601310000000.0,1614990000000.0,616,Measuring Progress in Deep Reinforcement Learning Sample Efficiency ,"[""~Florian_E._Dorner1""]","[""Florian E. Dorner""]","[""Deep Reinforcement Learning"", ""Sample Efficiency""]","Sampled environment transitions are a critical input to deep reinforcement learning (DRL) algorithms. 
Current DRL benchmarks often allow for the cheap and easy generation of large amounts of samples such that perceived progress in DRL does not necessarily correspond to improved sample efficiency. As simulating real world processes is often prohibitively hard and collecting real world experience is costly, sample efficiency is an important indicator for economically relevant applications of DRL. We investigate progress in sample efficiency on Atari games and continuous control tasks by comparing the amount of samples that a variety of algorithms need to reach a given performance level according to training curves in the corresponding publications. We find exponential progress in sample efficiency with estimated doubling times of around 10 to 18 months on Atari, 5 to 24 months on state-based continuous control and of around 4 to 9 months on pixel-based continuous control depending on the specific task and performance level.",/pdf/d2b5b622c8c995a07f56c05166b348b135314fc3.pdf,ICLR,2021,We measure progress in deep reinforcement learning sample efficiency using training curves from published papers. +HyM7AiA5YX,HJgb2VkcK7,1538090000000.0,1553190000000.0,880,Complement Objective Training,"[""haoyunchen@gapp.nthu.edu.tw"", ""peihsin@gapp.nthu.edu.tw"", ""newgod1992@gapp.nthu.edu.tw"", ""scchang@cs.nthu.edu.tw"", ""jypan@google.com"", ""yutingchen@google.com"", ""wewei@google.com"", ""dacheng@google.com""]","[""Hao-Yun Chen"", ""Pei-Hsin Wang"", ""Chun-Hao Liu"", ""Shih-Chieh Chang"", ""Jia-Yu Pan"", ""Yu-Ting Chen"", ""Wei Wei"", ""Da-Cheng Juan""]","[""optimization"", ""entropy"", ""image recognition"", ""natural language understanding"", ""adversarial attacks"", ""deep learning""]","Learning with a primary objective, such as softmax cross entropy for classification and sequence generation, has been the norm for training deep neural networks for years. 
Although being a widely-adopted approach, using cross entropy as the primary objective exploits mostly the information from the ground-truth class for maximizing data likelihood, and largely ignores information from the complement (incorrect) classes. We argue that, in addition to the primary objective, training also using a complement objective that leverages information from the complement classes can be effective in improving model performance. This motivates us to study a new training paradigm that maximizes the likelihood of the ground-truth class while neutralizing the probabilities of the complement classes. We conduct extensive experiments on multiple tasks ranging from computer vision to natural language understanding. The experimental results confirm that, compared to the conventional training with just one primary objective, training also with the complement objective further improves the performance of the state-of-the-art models across all tasks. In addition to the accuracy improvement, we also show that models trained with both primary and complement objectives are more robust to single-step adversarial attacks. +",/pdf/ea9f2fdb5e102291ee5e41f1fa4a4689f2ed0344.pdf,ICLR,2019,"We propose Complement Objective Training (COT), a new training paradigm that optimizes both the primary and complement objectives for effectively learning the parameters of neural networks." +ZW0yXJyNmoG,iXc6xNYABp-,1601310000000.0,1613310000000.0,1684,Taming GANs with Lookahead-Minmax,"[""~Tatjana_Chavdarova2"", ""~Matteo_Pagliardini1"", ""~Sebastian_U_Stich1"", ""~Fran\u00e7ois_Fleuret2"", ""~Martin_Jaggi1""]","[""Tatjana Chavdarova"", ""Matteo Pagliardini"", ""Sebastian U Stich"", ""Fran\u00e7ois Fleuret"", ""Martin Jaggi""]","[""Minmax"", ""Generative Adversarial Networks""]","Generative Adversarial Networks are notoriously challenging to train. 
The underlying minmax optimization is highly susceptible to the variance of the stochastic gradient and the rotational component of the associated game vector field. To tackle these challenges, we propose the Lookahead algorithm for minmax optimization, originally developed for single objective minimization only. The backtracking step of our Lookahead–minmax naturally handles the rotational game dynamics, a property which was identified to be key for enabling gradient ascent descent methods to converge on challenging examples often analyzed in the literature. Moreover, it implicitly handles high variance without using large mini-batches, known to be essential for reaching state of the art performance. Experimental results on MNIST, SVHN, CIFAR-10, and ImageNet demonstrate a clear advantage of combining Lookahead–minmax with Adam or extragradient, in terms of performance and improved stability, for negligible memory and computational cost. Using 30-fold fewer parameters and 16-fold smaller minibatches we outperform the reported performance of the class-dependent BigGAN on CIFAR-10 by obtaining FID of 12.19 without using the class labels, bringing state-of-the-art GAN training within reach of common computational resources.",/pdf/1e1819f3ec50eb22f92cb7e0e6a5ff0d1c7d458a.pdf,ICLR,2021,A novel optimizer for GANs and games in general. 
+ryl5khRcKm,rkgDBzwcFQ,1538090000000.0,1550830000000.0,1008,Human-level Protein Localization with Convolutional Neural Networks,"[""rumetshofer@ml.jku.at"", ""hofmarcher@ml.jku.at"", ""clemens.roehrl@meduniwien.ac.at"", ""hochreit@ml.jku.at"", ""klambauer@ml.jku.at""]","[""Elisabeth Rumetshofer"", ""Markus Hofmarcher"", ""Clemens R\u00f6hrl"", ""Sepp Hochreiter"", ""G\u00fcnter Klambauer""]","[""Convolutional Neural Networks"", ""High-resolution images"", ""Multiple-Instance Learning"", ""Microscopy Imaging"", ""Protein Localization""]","Localizing a specific protein in a human cell is essential for understanding cellular functions and biological processes of underlying diseases. A promising, low-cost, and time-efficient biotechnology for localizing proteins is high-throughput fluorescence microscopy imaging (HTI). This imaging technique stains the protein of interest in a cell with fluorescent antibodies and subsequently takes a microscopic image. Together with images of other stained proteins or cell organelles and the annotation by the Human Protein Atlas project, these images provide a rich source of information on the protein location which can be utilized by computational methods. It is yet unclear how precise such methods are and whether they can compete with human experts. We here focus on deep learning image analysis methods and, in particular, on Convolutional Neural Networks (CNNs) since they showed overwhelming success across different imaging tasks. We propose a novel CNN architecture “GapNet-PL” that has been designed to tackle the characteristics of HTI data and uses global averages of filters at different abstraction levels. We present the largest comparison of CNN architectures including GapNet-PL for protein localization in HTI images of human cells. GapNet-PL outperforms all other competing methods and reaches close to perfect localization in all 13 tasks with an average AUC of 98% and F1 score of 78%. 
On a separate test set the performance of GapNet-PL was compared with three human experts and 25 scholars. GapNet-PL achieved an accuracy of 91%, significantly (p-value 1.1e−6) outperforming the best human expert with an accuracy of 72%.",/pdf/4b51acaed74c639f4c5fa93681e60e45bd63512e.pdf,ICLR,2019, +Hye00pVtPS,HygxRzzOwB,1569440000000.0,1577170000000.0,875,CONFEDERATED MACHINE LEARNING ON HORIZONTALLY AND VERTICALLY SEPARATED MEDICAL DATA FOR LARGE-SCALE HEALTH SYSTEM INTELLIGENCE,"[""dianbo.liu@childrens.harvard.edu"", ""timothy.miller@childrens.harvard.edu"", ""kenneth.mandl@childrens.harvard.edu""]","[""Dianbo Liu"", ""Tim Miller"", ""Kenneth Mandl""]","[""Confederated learning"", ""siloed medical data"", ""representation joining""]","A patient’s health information is generally fragmented across silos. Though it is technically feasible to unite data for analysis in a manner that underpins a rapid learning healthcare system, privacy concerns and regulatory barriers limit data centralization. Machine learning can be conducted in a federated manner on patient datasets with the same set of variables, but separated across sites of care. But federated learning cannot handle the situation where different data types for a given +patient are separated vertically across different organizations. 
We call methods that enable machine learning model training on data separated by two or more degrees “confederated machine learning.” We built and evaluated a confederated machine +learning model to stratify the risk of accidental falls among the elderly.",/pdf/2428e958406cfbe518e55e084403c2e316b51fa6.pdf,ICLR,2020,a confederated learning method that train model from horizontally and vertically separated medical data +Hy8hkYeRb,ByrnkYlA-,1509100000000.0,1518730000000.0,329,A Deep Predictive Coding Network for Learning Latent Representations,"[""shirin.dora@gmail.com"", ""c.m.a.pennartz@uva.nl"", ""s.m.bohte@cwi.nl""]","[""Shirin Dora"", ""Cyriel Pennartz"", ""Sander Bohte""]","[""Predictive coding"", ""deep neural network"", ""generative model"", ""unsupervised learning"", ""learning latent representations""]","It has been argued that the brain is a prediction machine that continuously learns how to make better predictions about the stimuli received from the external environment. For this purpose, it builds a model of the world around us and uses this model to infer the external stimulus. Predictive coding has been proposed as a mechanism through which the brain might be able to build such a model of the external environment. However, it is not clear how predictive coding can be used to build deep neural network models of the brain while complying with the architectural constraints imposed by the brain. In this paper, we describe an algorithm to build a deep generative model using predictive coding that can be used to infer latent representations about the stimuli received from external environment. Specifically, we used predictive coding to train a deep neural network on real-world images in an unsupervised learning paradigm. To understand the capacity of the network with regards to modeling the external environment, we studied the latent representations generated by the model on images of objects that are never presented to the model during training. 
Despite the novel features of these objects the model is able to infer the latent representations for them. Furthermore, the reconstructions of the original images obtained from these latent representations preserve the important details of these objects.",/pdf/3f6f1173dec159cae89cc5b8b3670ce2df0e3697.pdf,ICLR,2018,A predictive coding based learning algorithm for building deep neural network models of the brain +SkgZNnR5tX,BkgF5atqt7,1538090000000.0,1545360000000.0,1424,Uncovering Surprising Behaviors in Reinforcement Learning via Worst-case Analysis,"[""aruderman@google.com"", ""reverett@google.com"", ""bristy@google.com"", ""soyer@google.com"", ""juesato@google.com"", ""skywalker94@gmail.com"", ""cbeattie@google.com"", ""pushmeet@google.com""]","[""Avraham Ruderman"", ""Richard Everett"", ""Bristy Sikder"", ""Hubert Soyer"", ""Jonathan Uesato"", ""Ananya Kumar"", ""Charlie Beattie"", ""Pushmeet Kohli""]","[""Reinforcement learning"", ""Adversarial examples"", ""Navigation"", ""Evaluation"", ""Analysis""]","Reinforcement learning agents are typically trained and evaluated according to their performance averaged over some distribution of environment settings. But does the distribution over environment settings contain important biases, and do these lead to agents that fail in certain cases despite high average-case performance? In this work, we consider worst-case analysis of agents over environment settings in order to detect whether there are directions in which agents may have failed to generalize. Specifically, we consider a 3D first-person task where agents must navigate procedurally generated mazes, and where reinforcement learning agents have recently achieved human-level average-case performance. By optimizing over the structure of mazes, we find that agents can suffer from catastrophic failures, failing to find the goal even on surprisingly simple mazes, despite their impressive average-case performance. 
Additionally, we find that these failures transfer between different agents and even significantly different architectures. We believe our findings highlight an important role for worst-case analysis in identifying whether there are directions in which agents have failed to generalize. Our hope is that the ability to automatically identify failures of generalization will facilitate development of more general and robust agents. To this end, we report initial results on enriching training with settings causing failure.",/pdf/0dca5a4d8233c8b3ec61568466577e7498ea97e6.pdf,ICLR,2019,We find environment settings in which SOTA agents trained on navigation tasks display extreme failures suggesting failures in generalization. +I6QHpMdZD5k,WBLwSyF-Fb,1601310000000.0,1614990000000.0,749,Learning to Solve Nonlinear Partial Differential Equation Systems To Accelerate MOSFET Simulation,"[""~Seungcheol_Han1"", ""~Jonghyun_Choi1"", ""~Sung-Min_Hong1""]","[""Seungcheol Han"", ""Jonghyun Choi"", ""Sung-Min Hong""]","[""Partial differential equation"", ""nonlinear equation"", ""Newton-Raphson method"", ""convolutional neural network""]","Semiconductor device simulation uses numerical analysis, where a set of coupled nonlinear partial differential equations is solved with the iterative Newton-Raphson method. Since an appropriate initial guess to start the Newton-Raphson method is not available, a solution of practical importance with desired boundary conditions cannot be trivially achieved. Instead, several solutions with intermediate boundary conditions should be calculated to address the nonlinearity and introducing intermediate boundary conditions significantly increases the computation time. In order to accelerate the semiconductor device simulation, we propose to use a neural network to learn an approximate solution for desired boundary conditions. 
With an initial solution sufficiently close to the final one by a trained neural network, computational cost to calculate several unnecessary solutions is significantly reduced. Specifically, a convolutional neural network for MOSFET (Metal-Oxide-Semiconductor Field-Effect Transistor), the most widely used semiconductor device, is trained in a supervised manner to compute the initial solution. Particularly, we propose to consider device grids with varying size and spacing and derive a compact expression of the solution based upon the electrostatic potential. We empirically show that the proposed method accelerates the simulation by more than 12 times. Results from the local linear regression and a fully-connected network are compared and extension to a complex two-dimensional domain is sketched.",/pdf/2a529b9bb23c7fe1aa5ca639122b01266ed49619.pdf,ICLR,2021,Learning a convolutional neural network to approximately solve nonlinear PDE systems to accelerate MOSFET simulation by more than 12x. +S1zlmnA5K7,Bkeo84aqFm,1538090000000.0,1545360000000.0,1328,Where Off-Policy Deep Reinforcement Learning Fails,"[""scott.fujimoto@mail.mcgill.ca"", ""david.meger@mcgill.ca"", ""dprecup@cs.mcgill.ca""]","[""Scott Fujimoto"", ""David Meger"", ""Doina Precup""]","[""reinforcement learning"", ""off-policy"", ""imitation"", ""batch reinforcement learning""]","This work examines batch reinforcement learning--the task of maximally exploiting a given batch of off-policy data, without further data collection. We demonstrate that due to errors introduced by extrapolation, standard off-policy deep reinforcement learning algorithms, such as DQN and DDPG, are only capable of learning with data correlated to their current policy, making them ineffective for most off-policy applications. 
We introduce a novel class of off-policy algorithms, batch-constrained reinforcement learning, which restricts the action space to force the agent towards behaving on-policy with respect to a subset of the given data. We extend this notion to deep reinforcement learning, and to the best of our knowledge, present the first continuous control deep reinforcement learning algorithm which can learn effectively from uncorrelated off-policy data.",/pdf/acd2a7443c6404a7eba7782db12fdeec27723d80.pdf,ICLR,2019,We describe conditions where off-policy deep reinforcement learning algorithms fail and present a solution. +rke2HRVYvH,rJeeip8dwS,1569440000000.0,1577170000000.0,1122,Stochastic Prototype Embeddings,"[""tysc7237@colorado.edu"", ""karl.ridgeway@colorado.edu"", ""mcmozer@google.com""]","[""Tyler R. Scott"", ""Karl Ridgeway"", ""Michael C. Mozer""]","[""deep embeddings"", ""stochastic embeddings"", ""probabilistic embeddings"", ""deep metric learning"", ""few-shot learning""]","Supervised deep-embedding methods project inputs of a domain to a representational space in which same-class instances lie near one another and different-class instances lie far apart. We propose a probabilistic method that treats embeddings as random variables. Extending a state-of-the-art deterministic method, Prototypical Networks (Snell et al., 2017), our approach supposes the existence of a class prototype around which class instances are Gaussian distributed. The prototype posterior is a product distribution over labeled instances, and query instances are classified by marginalizing relative prototype proximity over embedding uncertainty. We describe an efficient sampler for approximate inference that allows us to train the model at roughly the same space and time cost as its deterministic sibling. Incorporating uncertainty improves performance on few-shot learning and gracefully handles label noise and out-of-distribution inputs. 
Compared to the state-of-the-art stochastic method, Hedged Instance Embeddings (Oh et al., 2019), we achieve superior large- and open-set classification accuracy. Our method also aligns class-discriminating features with the axes of the embedding space, yielding an interpretable, disentangled representation.",/pdf/c2f23d162d85e4d5117d59a674f94f5aa2afb3d8.pdf,ICLR,2020,"The paper proposes a probabilistic extension of Prototypical Networks that achieves superior few-shot, large-, and open-set classification performance, while gracefully handling label noise and out-of-distribution inputs." +B1e3OlStPB,ryeSHplKDH,1569440000000.0,1583910000000.0,2413,DeepSphere: a graph-based spherical CNN,"[""michael.defferrard@epfl.ch"", ""martino.milani@epfl.ch"", ""frederick.gusset@epfl.ch"", ""nathanael.perraudin@sdsc.ethz.ch""]","[""Micha\u00ebl Defferrard"", ""Martino Milani"", ""Fr\u00e9d\u00e9rick Gusset"", ""Nathana\u00ebl Perraudin""]","[""spherical cnns"", ""graph neural networks"", ""geometric deep learning""]","Designing a convolution for a spherical neural network requires a delicate tradeoff between efficiency and rotation equivariance. DeepSphere, a method based on a graph representation of the discretized sphere, strikes a controllable balance between these two desiderata. This contribution is twofold. First, we study both theoretically and empirically how equivariance is affected by the underlying graph with respect to the number of pixels and neighbors. Second, we evaluate DeepSphere on relevant problems. Experiments show state-of-the-art performance and demonstrates the efficiency and flexibility of this formulation. Perhaps surprisingly, comparison with previous work suggests that anisotropic filters might be an unnecessary price to pay. Our code is available at https://github.com/deepsphere.",/pdf/4062fbb4d40a4cef410b9b9b99e84e62ade36d63.pdf,ICLR,2020,A graph-based spherical CNN that strikes an interesting balance of trade-offs for a wide variety of applications. 
+ryeyti0qKX,r1xBuU5qtX,1538090000000.0,1545360000000.0,406,On the Statistical and Information Theoretical Characteristics of DNN Representations,"[""choid@snu.ac.kr"", ""wrhee@snu.ac.kr"", ""ruddms0415@snu.ac.kr"", ""chshin@encoredtech.com""]","[""Daeyoung Choi"", ""Wonjong Rhee"", ""Kyungeun Lee"", ""Changho Shin""]","[""learned representation"", ""statistical characteristics"", ""information theoretical characteristics"", ""deep network""]","It has been common to argue or imply that a regularizer can be used to alter a statistical property of a hidden layer's representation and thus improve generalization or performance of deep networks. For instance, dropout has been known to improve performance by reducing co-adaptation, and representational sparsity has been argued as a good characteristic because many data-generation processes have only a small number of factors that are independent. In this work, we analytically and empirically investigate the popular characteristics of learned representations, including correlation, sparsity, dead unit, rank, and mutual information, and disprove many of the \textit{conventional wisdom}. We first show that infinitely many Identical Output Networks (IONs) can be constructed for any deep network with a linear layer, where any invertible affine transformation can be applied to alter the layer's representation characteristics. The existence of ION proves that the correlation characteristics of representation can be either low or high for a well-performing network. Extensions to ReLU layers are provided, too. Then, we consider sparsity, dead unit, and rank to show that only loose relationships exist among the three characteristics. It is shown that a higher sparsity or additional dead units do not imply a better or worse performance when the rank of representation is fixed. 
We also develop a rank regularizer and show that neither representation sparsity nor lower rank is helpful for improving performance even when the data-generation process has only a small number of independent factors. Mutual information $I(\z_l;\x)$ and $I(\z_l;\y)$ are investigated as well, and we show that regularizers can affect $I(\z_l;\x)$ and thus indirectly influence the performance. Finally, we explain how a rich set of regularizers can be used as a powerful tool for performance tuning. ",/pdf/0a1530e51dc082457cbba8cf9dfb78255f52382e.pdf,ICLR,2019, +H1tSsb-AW,HkuBsbb0b,1509130000000.0,1519540000000.0,712,Variance Reduction for Policy Gradient with Action-Dependent Factorized Baselines,"[""cathywu@eecs.berkeley.edu"", ""aravraj@cs.washington.edu"", ""rockyduan@eecs.berkeley.edu"", ""vikash@cs.washington.edu"", ""bayen@berkeley.edu"", ""sham@cs.washington.edu"", ""igor.mordatch@gmail.com"", ""pabbeel@cs.berkeley.edu""]","[""Cathy Wu"", ""Aravind Rajeswaran"", ""Yan Duan"", ""Vikash Kumar"", ""Alexandre M Bayen"", ""Sham Kakade"", ""Igor Mordatch"", ""Pieter Abbeel""]","[""reinforcement learning"", ""policy gradient"", ""variance reduction"", ""baseline"", ""control variates""]","Policy gradient methods have enjoyed great success in deep reinforcement learning but suffer from high variance of gradient estimates. The high variance problem is particularly exasperated in problems with long horizons or high-dimensional action spaces. To mitigate this issue, we derive a bias-free action-dependent baseline for variance reduction which fully exploits the structural form of the stochastic policy itself and does not make any additional assumptions about the MDP. We demonstrate and quantify the benefit of the action-dependent baseline through both theoretical analysis as well as numerical results, including an analysis of the suboptimality of the optimal state-dependent baseline. 
The result is a computationally efficient policy gradient algorithm, which scales to high-dimensional control problems, as demonstrated by a synthetic 2000-dimensional target matching task. Our experimental results indicate that action-dependent baselines allow for faster learning on standard reinforcement learning benchmarks and high-dimensional hand manipulation and synthetic tasks. Finally, we show that the general idea of including additional information in baselines for improved variance reduction can be extended to partially observed and multi-agent tasks.",/pdf/9b96677031b946ffbd1bc1632375cf5e2a190309.pdf,ICLR,2018,Action-dependent baselines can be bias-free and yield greater variance reduction than state-only dependent baselines for policy gradient methods. +dlEJsyHGeaL,LOdsN94MFZk,1601310000000.0,1615710000000.0,408,Graph Edit Networks,"[""~Benjamin_Paassen1"", ""~Daniele_Grattarola1"", ""~Daniele_Zambon1"", ""cesare.alippi@usi.ch"", ""~Barbara_Eva_Hammer1""]","[""Benjamin Paassen"", ""Daniele Grattarola"", ""Daniele Zambon"", ""Cesare Alippi"", ""Barbara Eva Hammer""]","[""graph neural networks"", ""graph edit distance"", ""time series prediction"", ""structured prediction""]","While graph neural networks have made impressive progress in classification and regression, few approaches to date perform time series prediction on graphs, and those that do are mostly limited to edge changes. We suggest that graph edits are a more natural interface for graph-to-graph learning. In particular, graph edits are general enough to describe any graph-to-graph change, not only edge changes; they are sparse, making them easier to understand for humans and more efficient computationally; and they are local, avoiding the need for pooling layers in graph neural networks. In this paper, we propose a novel output layer - the graph edit network - which takes node embeddings as input and generates a sequence of graph edits that transform the input graph to the output graph. 
We prove that a mapping between the node sets of two graphs is sufficient to construct training data for a graph edit network and that an optimal mapping yields edit scripts that are almost as short as the graph edit distance between the graphs. We further provide a proof-of-concept empirical evaluation on several graph dynamical systems, which are difficult to learn for baselines from the literature.",/pdf/febbb3ef6ff460d897054b7c7d79d3a0083df6a2.pdf,ICLR,2021,We show that graph neural networks can predict graph edits and are connected to the graph edit distance via graph mappings +B1spAqUp-,SJqT0cI6W,1508450000000.0,1518730000000.0,20,Pixel Deconvolutional Networks,"[""hongyang.gao@wsu.edu"", ""hao.yuan@wsu.edu"", ""zwang6@eecs.wsu.edu"", ""sji@eecs.wsu.edu""]","[""Hongyang Gao"", ""Hao Yuan"", ""Zhengyang Wang"", ""Shuiwang Ji""]","[""Deep Learning"", ""Deconvolutional Layer"", ""Pixel CNN""]","Deconvolutional layers have been widely used in a variety of deep +models for up-sampling, including encoder-decoder networks for +semantic segmentation and deep generative models for unsupervised +learning. One of the key limitations of deconvolutional operations +is that they result in the so-called checkerboard problem. This is +caused by the fact that no direct relationship exists among adjacent +pixels on the output feature map. To address this problem, we +propose the pixel deconvolutional layer (PixelDCL) to establish +direct relationships among adjacent pixels on the up-sampled feature +map. Our method is based on a fresh interpretation of the regular +deconvolution operation. The resulting PixelDCL can be used to +replace any deconvolutional layer in a plug-and-play manner without +compromising the fully trainable capabilities of original models. +The proposed PixelDCL may result in slight decrease in efficiency, +but this can be overcome by an implementation trick. 
Experimental +results on semantic segmentation demonstrate that PixelDCL can +consider spatial features such as edges and shapes and yields more +accurate segmentation outputs than deconvolutional layers. When used +in image generation tasks, our PixelDCL can largely overcome the +checkerboard problem suffered by regular deconvolution operations.",/pdf/00f4ebdee239818f1429fc1a600522710f3502e0.pdf,ICLR,2018,Solve checkerboard problem in Deconvolutional layer by building dependencies between pixels +B1eoyAVFwH,Skxa1az_vS,1569440000000.0,1577170000000.0,905,Feature Partitioning for Efficient Multi-Task Architectures,"[""anewell@cs.princeton.edu"", ""lujiang@google.com"", ""chong.wang@bytedance.com"", ""lijiali@cs.stanford.edu"", ""jiadeng@princeton.edu""]","[""Alejandro Newell"", ""Lu Jiang"", ""Chong Wang"", ""Li-Jia Li"", ""Jia Deng""]","[""multi-task learning"", ""neural architecture search"", ""multi-task architecture search""]","Multi-task learning promises to use less data, parameters, and time than training separate single-task models. But realizing these benefits in practice is challenging. In particular, it is difficult to define a suitable architecture that has enough capacity to support many tasks while not requiring excessive compute for each individual task. There are difficult trade-offs when deciding how to allocate parameters and layers across a large set of tasks. To address this, we propose a method for automatically searching over multi-task architectures that accounts for resource constraints. We define a parameterization of feature sharing strategies for effective coverage and sampling of architectures. We also present a method for quick evaluation of such architectures with feature distillation. Together these contributions allow us to quickly optimize for parameter-efficient multi-task models. 
We benchmark on Visual Decathlon, demonstrating that we can automatically search for and identify architectures that effectively make trade-offs between task resource requirements while maintaining a high level of final performance.",/pdf/2574e0f684308965daaeb5dc51a801604f3d2e7f.pdf,ICLR,2020,automatic search for multi-task architectures that reduce per-task feature use +HyeYJ1SKDH,HJguPRq_PH,1569440000000.0,1577170000000.0,1476,FLUID FLOW MASS TRANSPORT FOR GENERATIVE NETWORKS,"[""jlin@eoas.ubc.ca"", ""klensink@eoas.ubc.ca"", ""ehaber@eoas.ubc.ca""]","[""Jingrong Lin"", ""Keegan Lensink"", ""Eldad Haber""]","[""generative network"", ""optimal mass transport"", ""gaussian mixture"", ""model matching""]","Generative Adversarial Networks have been shown to be powerful tools for generating content resulting in them being intensively studied in recent years. Training these networks requires maximizing a generator loss and minimizing a discriminator loss, leading to a difficult saddle point problem that is slow and difficult to converge. Motivated by techniques in the registration of point clouds and the fluid flow formulation of mass transport, we investigate a new formulation that is based on strict minimization, without the need for the maximization. This formulation views the problem as a matching problem rather than an adversarial one, and thus allows us to quickly converge and obtain meaningful metrics in the optimization path.",/pdf/799f4932fb4012160374b650b39d3079a117c8e5.pdf,ICLR,2020, +S1xLuRVFvr,rJeIq6w_PB,1569440000000.0,1577170000000.0,1213,Visual Explanation for Deep Metric Learning,"[""szhu3@uncc.edu"", ""tyang30@uncc.edu"", ""chen.chen@uncc.edu""]","[""Sijie Zhu"", ""Taojiannan Yang"", ""Chen Chen""]","[""Metric Learning"", ""Visual Explanation""]","This work explores the visual explanation for deep metric learning and its applications. 
As an important problem for learning representation, metric learning has attracted much attention recently, while the interpretation of such model is not as well studied as classification. To this end, we propose an intuitive idea to show where contributes the most to the overall similarity of two input images by decomposing the final activation. Instead of only providing the overall activation map of each image, we propose to generate point-to-point activation intensity between two images so that the relationship between different regions is uncovered. We show that the proposed framework can be directly deployed to a large range of metric learning applications and provides valuable information for understanding the model. Furthermore, our experiments show its effectiveness on two potential applications, i.e. cross-view pattern discovery and interactive retrieval. ",/pdf/c59432531b2034ed5f601e2e1adbbdb45aae8405.pdf,ICLR,2020, +HJg3HyStwB,Byl9mX6uPB,1569440000000.0,1577170000000.0,1706,Perturbations are not Enough: Generating Adversarial Examples with Spatial Distortions,"[""ethanhezhao@gmail.com"", ""trunglm@monash.edu"", ""paul.montague@dst.defence.gov.au"", ""olivier.devel@dst.defence.gov.au"", ""tamas.abraham@dst.defence.gov.au"", ""dinh.phung@monash.edu""]","[""He Zhao"", ""Trung Le"", ""Paul Montague"", ""Olivier De Vel"", ""Tamas Abraham"", ""Dinh Phung""]",[],"Deep neural network image classifiers are reported to be susceptible to adversarial evasion attacks, which use carefully crafted images created to mislead a classifier. Recently, various kinds of adversarial attack methods have been proposed, most of which focus on adding small perturbations to input images. Despite the success of existing approaches, the way to generate realistic adversarial images with small perturbations remains a challenging problem. 
In this paper, we aim to address this problem by proposing a novel adversarial method, which generates adversarial examples by imposing not only perturbations but also spatial distortions on input images, including scaling, rotation, shear, and translation. As humans are less susceptible to small spatial distortions, the proposed approach can produce visually more realistic attacks with smaller perturbations, able to deceive classifiers without affecting human predictions. We learn our method by amortized techniques with neural networks and generate adversarial examples efficiently by a forward pass of the networks. Extensive experiments on attacking different types of non-robustified classifiers and robust classifiers with defence show that our method has state-of-the-art performance in comparison with advanced attack parallels.",/pdf/170cfc3bc6beae2308e16b2690cba26fcd11f984.pdf,ICLR,2020,A new adversarial attack for images with both perturbations and spatial distortions +SygLu0VtPH,ByxuDpPuDB,1569440000000.0,1577170000000.0,1212,Deep Innovation Protection,"[""sebr@itu.dk"", ""kstanley@uber.com""]","[""Sebastian Risi"", ""Kenneth O. Stanley""]","[""Neuroevolution"", ""innovation protection"", ""world models"", ""genetic algorithm""]","Evolutionary-based optimization approaches have recently shown promising results in domains such as Atari and robot locomotion but less so in solving 3D tasks directly from pixels. This paper presents a method called Deep Innovation Protection (DIP) that allows training complex world models end-to-end for such 3D environments. The main idea behind the approach is to employ multiobjective optimization to temporally reduce the selection pressure on specific components in a world model, allowing other components to adapt. We investigate the emergent representations of these evolved networks, which learn a model of the world without the need for a specific forward-prediction loss. 
",/pdf/eb365a54cd2d1d8308d14048f8f1b0b565e5d878.pdf,ICLR,2020,Deep Innovation Protection allows evolving complex world models end-to-end for 3D tasks. +HyGLy2RqtQ,rJeOCTicFX,1538090000000.0,1545360000000.0,988,Over-parameterization Improves Generalization in the XOR Detection Problem,"[""brutzkus@gmail.com"", ""amir.globerson@gmail.com""]","[""Alon Brutzkus"", ""Amir Globerson""]","[""deep learning"", ""theory"", ""non convex optimization"", ""over-parameterization""]","Empirical evidence suggests that neural networks with ReLU activations generalize better with over-parameterization. However, there is currently no theoretical analysis that explains this observation. In this work, we study a simplified learning task with over-parameterized convolutional networks that empirically exhibits the same qualitative phenomenon. For this setting, we provide a theoretical analysis of the optimization and generalization performance of gradient descent. Specifically, we prove data-dependent sample complexity bounds which show that over-parameterization improves the generalization performance of gradient descent.",/pdf/547a9dd0b1e70ab29d1b53fa5d50793431f4e3f6.pdf,ICLR,2019,We show in a simplified learning task that over-parameterization improves generalization of a convnet that is trained with gradient descent. +1OCwJdJSnSA,orV3LX-cU10,1601310000000.0,1614990000000.0,896,Disentangled cyclic reconstruction for domain adaptation,"[""~David_Bertoin1"", ""~Emmanuel_Rachelson1""]","[""David Bertoin"", ""Emmanuel Rachelson""]","[""Domain adaptation"", ""Disentanglement""]","The domain adaptation problem involves learning a unique classification or regression model capable of performing on both a source and a target domain. Although the labels for the source data are available during training, the labels in the target domain are unknown. An effective way to tackle this problem lies in extracting insightful features invariant to the source and target domains. 
In this work, we propose splitting the information for each domain into a task-related representation and its complementary context representation. We propose an original method to disentangle these two representations in the single-domain supervised case. We then adapt this method to the unsupervised domain adaptation problem. In particular, our method allows disentanglement in the target domain, despite the absence of training labels. This enables the isolation of task-specific information from both domains and a projection into a common representation. The task-specific representation allows efficient transfer of knowledge acquired from the source domain to the target domain. We validate the proposed method on several classical domain adaptation benchmarks and illustrate the benefits of disentanglement for domain adaptation.",/pdf/822f394f6484d5e1dcbecd4864aeee5fef7177d5.pdf,ICLR,2021,"We tackle unsupervised domain adaptation by intra-domain and cross-domain cyclic reconstruction and achieve efficient representation disentanglement, including in the target domain." +jcN7a3yZeQc,cQuqMEPyU1b,1601310000000.0,1614990000000.0,2391,Decorrelated Double Q-learning,"[""~GANG_CHEN1""]","[""GANG CHEN""]","[""q-learning"", ""control variates"", ""reinforcement learning""]","Q-learning with value function approximation may perform poorly because of overestimation bias and imprecise estimates. Specifically, overestimation bias is from the maximum operator over noisy estimates, which is exaggerated using the estimate of a subsequent state. Inspired by the recent advance of deep reinforcement learning and Double Q-learning, we introduce the decorrelated double Q-learning (D2Q). Specifically, we introduce the decorrelated regularization term to reduce the correlation between value function approximators, which can lead to less biased estimation and low variance. 
The experimental results on a suite of MuJoCo continuous control tasks demonstrate that our decorrelated double Q-learning can effectively improve the performance.",/pdf/b0fb7882290e819c004f6b91a03606e5326dde8d.pdf,ICLR,2021,This paper proposes a decorrelated Double Q-learning for continuous task control +C5kn825mU19,MTN5BHGV3Nf,1601310000000.0,1614990000000.0,2657,A Coach-Player Framework for Dynamic Team Composition,"[""~Bo_Liu13"", ""~qiang_liu4"", ""~Peter_Stone1"", ""~Animesh_Garg1"", ""~Yuke_Zhu1"", ""~Anima_Anandkumar1""]","[""Bo Liu"", ""qiang liu"", ""Peter Stone"", ""Animesh Garg"", ""Yuke Zhu"", ""Anima Anandkumar""]","[""Multiagent reinforcement learning""]","In real-world multi-agent teams, agents with different capabilities may join or leave ""on the fly"" without altering the team's overarching goals. Coordinating teams with such dynamic composition remains a challenging problem: the optimal team strategy may vary with its composition. Inspired by real-world team sports, we propose a coach-player framework to tackle this problem. We assume that the players only have a partial view of the environment, while the coach has a complete view. The coach coordinates the players by distributing individual strategies. Specifically, we 1) propose an attention mechanism for both the players and the coach; 2) incorporate a variational objective to regularize learning; and 3) design an adaptive communication method to let the coach decide when to communicate with different players. Our attention mechanism on the players and the coach allows for a varying number of heterogeneous agents, and can thus tackle the dynamic team composition. We validate our methods on resource collection tasks in multi-agent particle environment. We demonstrate zero-shot generalization to new team compositions with varying numbers of heterogeneous agents. 
The performance of our method is comparable to or even better than that in the setting where all players have a full view of the environment, but no coach. Moreover, we see that the performance stays nearly the same even when the coach communicates as little as 13% of the time using our adaptive communication strategy. These results demonstrate the significance of a coach to coordinate players in dynamic teams.",/pdf/e826f510885e9ff0953a26edbd102a90b00726e0.pdf,ICLR,2021,We design a coach-player hierarchy with mixed observability to tackle the multi-agent coordination problem where both the number of team members as well as their capabilities are subject to change. +HJxfm2CqKm,ByxtLbC9F7,1538090000000.0,1545360000000.0,1336,Discovering General-Purpose Active Learning Strategies,"[""ksenia.konyushkova@epfl.ch"", ""raphael.sznitman@artorg.unibe.ch"", ""pascal.fua@epfl.ch""]","[""Ksenia Konyushkova"", ""Raphael Sznitman"", ""Pascal Fua""]","[""active learning"", ""meta learning"", ""reinforcement learning""]","We propose a general-purpose approach to discovering active learning (AL) strategies from data. These strategies are transferable from one domain to another and can be used in conjunction with many machine learning models. To this end, we formalize the annotation process as a Markov decision process, design universal state and action spaces and introduce a new reward function that precisely reflects the AL objective of minimizing the annotation cost. We seek to find an optimal (non-myopic) AL strategy using reinforcement learning. 
We evaluate the learned strategies on multiple unrelated domains and show that they consistently outperform state-of-the-art baselines.",/pdf/c5606ca035e31277e33bafb977343a4f175f3742.pdf,ICLR,2019, +AjrRA6WYSW,LBd5Woskjf5,1601310000000.0,1614990000000.0,873,Estimation of Number of Communities in Assortative Sparse Networks,"[""neilhwang@gmail.com"", ""xujiar@oregonstate.edu"", ""shirshendu@ccny.cuny.edu"", ""~Sharmodeep_Bhattacharyya1""]","[""Neil Hwang"", ""Jiarui Xu"", ""Shirshendu Chatterjee"", ""Sharmodeep Bhattacharyya""]","[""networks"", ""number of communities"", ""Bethe-Hessian"", ""sparse networks"", ""stochastic block model""]","Most community detection algorithms assume the number of communities, $K$, to be known \textit{a priori}. Among various approaches that have been proposed to estimate $K$, the non-parametric methods based on the spectral properties of the Bethe Hessian matrices have garnered much popularity for their simplicity, computational efficiency, and robust performance irrespective of the sparsity of the input data. Recently, one such method has been shown to estimate $K$ consistently if the input network is generated from the (semi-dense) stochastic block model, when the average of the expected degrees ($\tilde{d}$) of all the nodes in the network satisfies $\tilde{d} \gg \log(N)$ ($N$ being the number of nodes in the network). In this paper, we prove some finite sample results that hold for $\tilde{d} = o(\log(N))$, which in turn show that the estimation of $K$ based on the spectra of the Bethe Hessian matrices is consistent not only for the semi-dense regime, but also for the sub-logarithmic sparse regime when $1 \ll \tilde{d} \ll \log(N)$. 
Thus, our estimation procedure is a robust method for a wide range of problem settings, regardless of the sparsity of the network input.",/pdf/8649e7c12e8e8fc786ac16649dd72f3eb0cda342.pdf,ICLR,2021,Estimation of number of communities in sparse networks using eigenvalues of the Bethe-Hessian Matrix +yuXQOhKRjBr,5516ml4giKV,1601310000000.0,1614990000000.0,589,Towards Powerful Graph Neural Networks: Diversity Matters,"[""~Xu_Bingbing1"", ""~Huawei_Shen1"", ""~Qi_Cao1"", ""liuyuanhao20z@ict.ac.cn"", ""cenketing@ict.ac.cn"", ""~Xueqi_Cheng1""]","[""Xu Bingbing"", ""Huawei Shen"", ""Qi Cao"", ""Yuanhao Liu"", ""Keting Cen"", ""Xueqi Cheng""]","[""GNNs"", ""Expressive power"", ""Diverse sampling"", ""Injective""]","Graph neural networks (GNNs) offer us an effective framework for graph representation learning via layer-wise neighborhood aggregation. Their success is attributed to their expressive power at learning representation of nodes and graphs. To achieve GNNs with high expressive power, existing methods mainly resort to complex neighborhood aggregation functions, e.g., designing injective aggregation function or using multiple aggregation functions. Consequently, their expressive power is limited by the capability of aggregation function, which is tricky to determine in practice. To combat this problem, we propose a novel framework, namely diverse sampling, to improve the expressive power of GNNs. For a target node, diverse sampling offers it diverse neighborhoods, i.e., rooted sub-graphs, and the representation of target node is finally obtained via aggregating the representation of diverse neighborhoods obtained using any GNN model. High expressive power is guaranteed by the diversity of different neighborhoods. We use classical GNNs (i.e., GCN and GAT) as base models to evaluate the effectiveness of the proposed framework. 
Experiments are conducted on a multi-class node classification task on three benchmark datasets and a multi-label node classification task on a dataset collected in this paper. Extensive experiments demonstrate that the proposed method consistently improves the performance of base GNN models. The proposed framework is applicable to any GNN model and thus is general for improving the expressive power of GNNs.",/pdf/2baeb3175f2ffe72444d49966cd1fa23815e815e.pdf,ICLR,2021,"We propose a novel framework to improve the expressive power of GNNs via diverse subgraph sampling, without depending on layer-wise injective aggregation functions." +tnq_O52RVbR,IjjnXKEnh9,1601310000000.0,1614990000000.0,472,SHADOWCAST: Controllable Graph Generation with Explainability,"[""~Wesley_Joon-Wie_Tann1"", ""~Ee-Chien_Chang1"", ""~Bryan_Hooi1""]","[""Wesley Joon-Wie Tann"", ""Ee-Chien Chang"", ""Bryan Hooi""]","[""Controllable Graph Generation"", ""Explainability"", ""Conditional Generative Adversarial Network""]","We introduce the problem of explaining graph generation, formulated as controlling the generative process to produce desired graphs with explainable structures. By directing this generative process, we can explain the observed outcomes. We propose SHADOWCAST, a controllable generative model capable of mimicking networks and directing the generation, as an approach to this novel problem. The proposed model is based on a conditional generative adversarial network for graph data. We design it with the capability to control the conditions using a simple and transparent Markov model. Comprehensive experiments on three real-world network datasets demonstrate our model's competitive performance in the graph generation task. Furthermore, we control SHADOWCAST to generate graphs of different structures to show its effective controllability and explainability. 
As the first work to pose the problem of explaining generated graphs by controlling the generation, SHADOWCAST paves the way for future research in this exciting area.",/pdf/0cb5fc204e7832b6659885e700b23b36efc23a3f.pdf,ICLR,2021,We introduce the problem of controlling the graph generation process and propose a novel approach based on a conditional generative adversarial network to produce deliberate graphs with explainable structures. +rylJkpEtwS,rkx3N2MrPB,1569440000000.0,1583910000000.0,287,Learning the Arrow of Time for Problems in Reinforcement Learning,"[""nasim.rahaman@tuebingen.mpg.de"", ""steffen.wolf@iwr.uni-heidelberg.de"", ""anirudhgoyal9119@gmail.com"", ""roman.remme@iwr.uni-heidelberg.de"", ""yoshua.bengio@mila.quebec""]","[""Nasim Rahaman"", ""Steffen Wolf"", ""Anirudh Goyal"", ""Roman Remme"", ""Yoshua Bengio""]","[""Arrow of Time"", ""Reinforcement Learning"", ""AI-Safety""]","We humans have an innate understanding of the asymmetric progression of time, which we use to efficiently and safely perceive and manipulate our environment. Drawing inspiration from that, we approach the problem of learning an arrow of time in a Markov (Decision) Process. We illustrate how a learned arrow of time can capture salient information about the environment, which in turn can be used to measure reachability, detect side-effects and to obtain an intrinsic reward signal. Finally, we propose a simple yet effective algorithm to parameterize the problem at hand and learn an arrow of time with a function approximator (here, a deep neural network). Our empirical results span a selection of discrete and continuous environments, and demonstrate for a class of stochastic processes that the learned arrow of time agrees reasonably well with a well known notion of an arrow of time due to Jordan, Kinderlehrer and Otto (1998). 
",/pdf/aa03eaa098f6306a36c2dbe6df194e1d33ac2fdb.pdf,ICLR,2020,"We learn the arrow of time for MDPs and use it to measure reachability, detect side-effects and obtain a curiosity reward signal. " +H1xQVn09FX,B1ex_t6ctQ,1538090000000.0,1550880000000.0,1437,GANSynth: Adversarial Neural Audio Synthesis,"[""jesseengel@google.com"", ""kumarkagrawal@gmail.com"", ""chenshuo@google.com"", ""igul222@gmail.com"", ""christopherdonahue@gmail.com"", ""adarob@google.com""]","[""Jesse Engel"", ""Kumar Krishna Agrawal"", ""Shuo Chen"", ""Ishaan Gulrajani"", ""Chris Donahue"", ""Adam Roberts""]","[""GAN"", ""Audio"", ""WaveNet"", ""NSynth"", ""Music""]","Efficient audio synthesis is an inherently difficult machine learning task, as human perception is sensitive to both global structure and fine-scale waveform coherence. Autoregressive models, such as WaveNet, model local structure at the expense of global latent structure and slow iterative sampling, while Generative Adversarial Networks (GANs), have global latent conditioning and efficient parallel sampling, but struggle to generate locally-coherent audio waveforms. Herein, we demonstrate that GANs can in fact generate high-fidelity and locally-coherent audio by modeling log magnitudes and instantaneous frequencies with sufficient frequency resolution in the spectral domain. Through extensive empirical investigations on the NSynth dataset, we demonstrate that GANs are able to outperform strong WaveNet baselines on automated and human evaluation metrics, and efficiently generate audio several orders of magnitude faster than their autoregressive counterparts. 
+",/pdf/0f046c130a3228cda053e1f35b7e1efc06e2adf6.pdf,ICLR,2019,High-quality audio synthesis with GANs +rJTKKKqeg,,1478300000000.0,1488580000000.0,512,Tracking the World State with Recurrent Entity Networks,"[""mbh305@nyu.edu"", ""jase@fb.com"", ""azslam@fb.com"", ""abordes@fb.com"", ""yann@fb.com""]","[""Mikael Henaff"", ""Jason Weston"", ""Arthur Szlam"", ""Antoine Bordes"", ""Yann LeCun""]","[""Natural language processing"", ""Deep learning""]","We introduce a new model, the Recurrent Entity Network (EntNet). It is equipped +with a dynamic long-term memory which allows it to maintain and update a rep- +resentation of the state of the world as it receives new data. For language under- +standing tasks, it can reason on-the-fly as it reads text, not just when it is required +to answer a question or respond as is the case for a Memory Network (Sukhbaatar +et al., 2015). Like a Neural Turing Machine or Differentiable Neural Computer +(Graves et al., 2014; 2016) it maintains a fixed size memory and can learn to +perform location and content-based read and write operations. However, unlike +those models it has a simple parallel architecture in which several memory loca- +tions can be updated simultaneously. The EntNet sets a new state-of-the-art on +the bAbI tasks, and is the first method to solve all the tasks in the 10k training +examples setting. We also demonstrate that it can solve a reasoning task which +requires a large number of supporting facts, which other methods are not able to +solve, and can generalize past its training horizon. It can also be practically used +on large scale datasets such as Children’s Book Test, where it obtains competitive +performance, reading the story in a single pass.",/pdf/2492f18cfd03efe076f230140b0523b03e033c86.pdf,ICLR,2017,"A new memory-augmented model which learns to track the world state, obtaining SOTA on the bAbI tasks amongst other results." 
+KYPz4YsCPj,kz8dT1sX5zS,1601310000000.0,1616010000000.0,2646,Inductive Representation Learning in Temporal Networks via Causal Anonymous Walks,"[""~Yanbang_Wang1"", ""yenyu@stanford.edu"", ""~Yunyu_Liu1"", ""~Jure_Leskovec1"", ""~Pan_Li2""]","[""Yanbang Wang"", ""Yen-Yu Chang"", ""Yunyu Liu"", ""Jure Leskovec"", ""Pan Li""]","[""temporal networks"", ""inductive representation learning"", ""anonymous walk"", ""network motif""]","Temporal networks serve as abstractions of many real-world dynamic systems. These networks typically evolve according to certain laws, such as the law of triadic closure, which is universal in social networks. Inductive representation learning of temporal networks should be able to capture such laws and further be applied to systems that follow the same laws but have not been seen during the training stage. Previous works in this area depend on either network node identities or rich edge attributes and typically fail to extract these laws. Here, we propose {\em Causal Anonymous Walks (CAWs)} to inductively represent a temporal network. CAWs are extracted by temporal random walks and work as automatic retrieval of temporal network motifs to represent network dynamics while avoiding the time-consuming selection and counting of those motifs. CAWs adopt a novel anonymization strategy that replaces node identities with the hitting counts of the nodes based on a set of sampled walks to keep the method inductive, and simultaneously establish the correlation between motifs. We further propose a neural-network model CAW-N to encode CAWs, and pair it with a CAW sampling strategy with constant memory and time cost to support online training and inference. CAW-N is evaluated to predict links over 6 real temporal networks and uniformly outperforms previous SOTA methods by an average 15\% AUC gain in the inductive setting. 
CAW-N also outperforms previous methods in 5 out of the 6 networks in the transductive setting.",/pdf/6cc011fa593c3860f0afcadd1e157f9160471ce6.pdf,ICLR,2021,"The paper proposes Causal Anonymous Walks (CAW) as an effective way to encode the dynamic laws that govern the evolution of temporal networks, which significantly improves inductive representation learning on those networks." +IUaOP8jQfHn,cN5tWY7cO0,1601310000000.0,1614990000000.0,3040,Benchmarking Unsupervised Object Representations for Video Sequences,"[""~Marissa_A._Weis1"", ""~Kashyap_Chitta1"", ""~Yash_Sharma1"", ""~Wieland_Brendel1"", ""~Matthias_Bethge2"", ""~Andreas_Geiger3"", ""~Alexander_S_Ecker1""]","[""Marissa A. Weis"", ""Kashyap Chitta"", ""Yash Sharma"", ""Wieland Brendel"", ""Matthias Bethge"", ""Andreas Geiger"", ""Alexander S Ecker""]","[""Unsupervised learning"", ""object-centric representations"", ""benchmark"", ""tracking""]","Perceiving the world in terms of objects and tracking them through time is a crucial prerequisite for reasoning and scene understanding. Recently, several methods have been proposed for unsupervised learning of object-centric representations. However, since these models have been evaluated with respect to different downstream tasks, it remains unclear how they compare in terms of basic perceptual abilities such as detection, figure-ground segmentation and tracking of individual objects. To close this gap, we design a benchmark with three datasets of varying complexity and seven additional test sets which feature challenging tracking scenarios relevant for natural videos. Using this benchmark, we compare the perceptual abilities of four unsupervised object-centric learning approaches: ViMON, a video-extension of MONet, based on a recurrent spatial attention mechanism, OP3, which exploits clustering via spatial mixture models, as well as TBA and SCALOR, which use an explicit factorization via spatial transformers. 
Our results suggest that architectures with unconstrained latent representations and full-image object masks such as ViMON and OP3 are able to learn more powerful representations in terms of object detection, segmentation and tracking than the explicitly parameterized spatial transformer based architecture of TBA and SCALOR. We also observe that none of the methods are able to gracefully handle the most challenging tracking scenarios despite their synthetic nature, suggesting that our benchmark may provide fruitful guidance towards learning more robust object-centric video representations.",/pdf/6f9d7fac9271c0a287b97957f3b552388cc5e98a.pdf,ICLR,2021,We quantitatively analyze the performance of four object-centric representation learning models for video sequences on challenging tracking scenarios. +BJxRVnC5Fm,HyxQbinFKQ,1538090000000.0,1545360000000.0,1501,Mean Replacement Pruning ,"[""evcu@google.com"", ""nicolas@le-roux.name"", ""psc@google.com"", ""leon@bottou.org""]","[""Utku Evci"", ""Nicolas Le Roux"", ""Pablo Castro"", ""Leon Bottou""]","[""pruning"", ""saliency"", ""neural networks"", ""optimization"", ""redundancy"", ""model compression""]","Pruning units in a deep network can help speed up inference and training as well as reduce the size of the model. We show that bias propagation is a pruning technique which consistently outperforms the common approach of merely removing units, regardless of the architecture and the dataset. We also show how a simple adaptation to an existing scoring function allows us to select the best units to prune. 
Finally, we show that the units selected by the best performing scoring functions are somewhat consistent over the course of training, implying the dead parts of the network appear during the stages of training.",/pdf/6785651a4ea307b2c53974eff9c443a0b2d2f24b.pdf,ICLR,2019,Mean Replacement is an efficient method to improve the loss after pruning and Taylor approximation based scoring functions work better with absolute values. +Byx4NkrtDS,rJl_V33dDB,1569440000000.0,1583910000000.0,1651,Implementing Inductive bias for different navigation tasks through diverse RNN attrractors,"[""fexutie@gmail.com"", ""omri.barak@gmail.com""]","[""Tie XU"", ""Omri Barak""]","[""navigation"", ""Recurrent Neural Networks"", ""dynamics"", ""inductive bias"", ""pre-training"", ""reinforcement learning""]","Navigation is crucial for animal behavior and is assumed to require an internal representation of the external environment, termed a cognitive map. The precise form of this representation is often considered to be a metric representation of space. An internal representation, however, is judged by its contribution to performance on a given task, and may thus vary between different types of navigation tasks. Here we train a recurrent neural network that controls an agent performing several navigation tasks in a simple environment. To focus on internal representations, we split learning into a task-agnostic pre-training stage that modifies internal connectivity and a task-specific Q learning stage that controls the network's output. We show that pre-training shapes the attractor landscape of the networks, leading to either a continuous attractor, discrete attractors or a disordered state. These structures induce bias onto the Q-Learning phase, leading to a performance pattern across the tasks corresponding to metric and topological regularities. 
Our results show that, in recurrent networks, inductive bias takes the form of attractor landscapes -- which can be shaped by pre-training and analyzed using dynamical systems methods. Furthermore, we demonstrate that non-metric representations are useful for navigation tasks. ",/pdf/2b8a8ec45a40a84d5b5653027d320986ea8a67ec.pdf,ICLR,2020,"Task agnostic pre-training can shape RNN's attractor landscape, and form diverse inductive bias for different navigation tasks " +SJme6-ZR-,rJ7e6ZbAZ,1509130000000.0,1518730000000.0,727,A Deep Learning Approach for Survival Clustering without End-of-life Signals,"[""chandr@purdue.edu"", ""ribeiro@cs.purdue.edu"", ""neville@cs.purdue.edu""]","[""S Chandra Mouli"", ""Bruno Ribeiro"", ""Jennifer Neville""]","[""Survival Analysis"", ""Kuiper statistics"", ""model-free""]","The goal of survival clustering is to map subjects (e.g., users in a social network, patients in a medical study) to $K$ clusters ranging from low-risk to high-risk. Existing survival methods assume the presence of clear \textit{end-of-life} signals or introduce them artificially using a pre-defined timeout. In this paper, we forego this assumption and introduce a loss function that differentiates between the empirical lifetime distributions of the clusters using a modified Kuiper statistic. We learn a deep neural network by optimizing this loss, that performs a soft clustering of users into survival groups. We apply our method to a social network dataset with over 1M subjects, and show significant improvement in C-index compared to alternatives.",/pdf/a48a9659ab1469b879da198b68e85436e719e3a9.pdf,ICLR,2018,"The goal of survival clustering is to map subjects into clusters. Without end-of-life signals, this is a challenging task. To address this task we propose a new loss function by modifying the Kuiper statistics." 
+S1xtORNFwH,rkxXZkO_vH,1569440000000.0,1583910000000.0,1220,FSNet: Compression of Deep Convolutional Neural Networks by Filter Summary,"[""superyyzg@gmail.com"", ""jyu79@illinois.edu"", ""jojic@microsoft.com"", ""lukehuan@shenshangtech.com"", ""t-huang1@illinois.edu""]","[""Yingzhen Yang"", ""Jiahui Yu"", ""Nebojsa Jojic"", ""Jun Huan"", ""Thomas S. Huang""]","[""Compression of Convolutional Neural Networks"", ""Filter Summary CNNs"", ""Weight Sharing""]","We present a novel method of compression of deep Convolutional Neural Networks (CNNs) by weight sharing through a new representation of convolutional filters. The proposed method reduces the number of parameters of each convolutional layer by learning a $1$D vector termed Filter Summary (FS). The convolutional filters are located in FS as overlapping $1$D segments, and nearby filters in FS share weights in their overlapping regions in a natural way. The resultant neural network based on such weight sharing scheme, termed Filter Summary CNNs or FSNet, has a FS in each convolution layer instead of a set of independent filters in the conventional convolution layer. FSNet has the same architecture as that of the baseline CNN to be compressed, and each convolution layer of FSNet has the same number of filters from FS as that of the basline CNN in the forward process. With compelling computational acceleration ratio, the parameter space of FSNet is much smaller than that of the baseline CNN. In addition, FSNet is quantization friendly. FSNet with weight quantization leads to even higher compression ratio without noticeable performance loss. We further propose Differentiable FSNet where the way filters share weights is learned in a differentiable and end-to-end manner. 
Experiments demonstrate the effectiveness of FSNet in compression of CNNs for computer vision tasks including image classification and object detection, and the effectiveness of DFSNet is evidenced by the task of Neural Architecture Search.",/pdf/d9f4bc50582a915bf44d952b56a6e0d494d9b863.pdf,ICLR,2020,We present a novel method of compression of deep Convolutional Neural Networks (CNNs) by weight sharing through a new representation of convolutional filters. +Syld53NtvH,Bkxxzndywr,1569440000000.0,1577170000000.0,120,Expected Tight Bounds for Robust Deep Neural Network Training,"[""salman.subaihi@kaust.edu.sa"", ""adel.bibi@kaust.edu.sa"", ""modar.alfadly@kaust.edu.sa"", ""abdullah.hamdi@kaust.edu.sa"", ""bernard.ghanem@kaust.edu.sa""]","[""Salman Alsubaihi"", ""Adel Bibi"", ""Modar Alfadly"", ""Abdullah Hamdi"", ""Bernard Ghanem""]","[""network robustness"", ""network verification"", ""interval bound propagation""]","Training Deep Neural Networks (DNNs) that are robust to norm bounded adversarial attacks remains an elusive problem. While verification based methods are generally too expensive to robustly train large networks, it was demonstrated by Gowal et. al. that bounded input intervals can be inexpensively propagated from layer to layer through deep networks. This interval bound propagation (IBP) approach led to high robustness and was the first to be employed on large networks. However, due to the very loose nature of the IBP bounds, particularly for large/deep networks, the required training procedure is complex and involved. In this paper, we closely examine the bounds of a block of layers composed of an affine layer, followed by a ReLU, followed by another affine layer. To this end, we propose \emph{expected} bounds (true bounds in expectation), which are provably tighter than IBP bounds in expectation. We then extend this result to deeper networks through blockwise propagation and show that we can achieve orders of magnitudes tighter bounds compared to IBP. 
Using these tight bounds, we demonstrate that a simple standard training procedure can achieve impressive robustness-accuracy trade-off across several architectures on both MNIST and CIFAR10.",/pdf/bb7c15178db0aefe33f6250fd9ed2d8b1b31e170.pdf,ICLR,2020,"For networks with ReLU activations, we derive output interval bounds, which are tight and true (in expectation) and easy to use in robust training." +H1Xw62kRZ,Bk7vpnyAb,1509050000000.0,1519650000000.0,175,Leveraging Grammar and Reinforcement Learning for Neural Program Synthesis,"[""rudy@robots.ox.ac.uk"", ""mahauskn@microsoft.com"", ""jacobdevlin@google.com"", ""risin@microsoft.com"", ""pushmeet@google.com""]","[""Rudy Bunel"", ""Matthew Hausknecht"", ""Jacob Devlin"", ""Rishabh Singh"", ""Pushmeet Kohli""]","[""Program Synthesis"", ""Reinforcement Learning"", ""Language Model""]","Program synthesis is the task of automatically generating a program consistent with +a specification. Recent years have seen proposal of a number of neural approaches +for program synthesis, many of which adopt a sequence generation paradigm similar +to neural machine translation, in which sequence-to-sequence models are trained to +maximize the likelihood of known reference programs. While achieving impressive +results, this strategy has two key limitations. First, it ignores Program Aliasing: the +fact that many different programs may satisfy a given specification (especially with +incomplete specifications such as a few input-output examples). By maximizing +the likelihood of only a single reference program, it penalizes many semantically +correct programs, which can adversely affect the synthesizer performance. Second, +this strategy overlooks the fact that programs have a strict syntax that can be +efficiently checked. To address the first limitation, we perform reinforcement +learning on top of a supervised model with an objective that explicitly maximizes +the likelihood of generating semantically correct programs. 
For addressing the +second limitation, we introduce a training procedure that directly maximizes the +probability of generating syntactically correct programs that fulfill the specification. +We show that our contributions lead to improved accuracy of the models, especially +in cases where the training data is limited.",/pdf/f58500ef2f2e08832b2b72e534cc740ee50ac0b0.pdf,ICLR,2018,Using the DSL grammar and reinforcement learning to improve synthesis of programs with complex control flow. +B1eP504YDr,BkeCl3OdvH,1569440000000.0,1611060000000.0,1286,Independence-aware Advantage Estimation,"[""zpschang@gmail.com"", ""lizo@microsoft.com"", ""lgq1001@mail.ustc.edu.cn"", ""jiang.bian@microsoft.com"", ""aihuang@mails.tsinghua.edu.cn"", ""taoqin@microsoft.com"", ""tie-yan.liu@microsoft.com""]","[""Pushi Zhang"", ""Li Zhao"", ""Guoqing Liu"", ""Jiang Bian"", ""Minglie Huang"", ""Tao Qin"", ""Tie-Yan Liu""]","[""Reinforcement Learning"", ""Advantage Estimation""]","Most of existing advantage function estimation methods in reinforcement learning suffer from the problem of high variance, which scales unfavorably with the time horizon. To address this challenge, we propose to identify the independence property between current action and future states in environments, which can be further leveraged to effectively reduce the variance of the advantage estimation. In particular, the recognized independence property can be naturally utilized to construct a novel importance sampling advantage estimator with close-to-zero variance even when the Monte-Carlo return signal yields a large variance. To further remove the risk of the high variance introduced by the new estimator, we combine it with existing Monte-Carlo estimator via a reward decomposition model learned by minimizing the estimation variance. Experiments demonstrate that our method achieves higher sample efficiency compared with existing advantage estimation methods in complex environments. 
",/pdf/8cf38e2a4db1bae863ba9653dd181e20ac9d51fd.pdf,ICLR,2020, +HJrDIpiee,,1478380000000.0,1481580000000.0,599,Investigating Recurrence and Eligibility Traces in Deep Q-Networks,"[""jharb@cs.mcgill.ca"", ""dprecup@cs.mcgill.ca""]","[""Jean Harb"", ""Doina Precup""]","[""Reinforcement Learning"", ""Deep learning""]","Eligibility traces in reinforcement learning are used as a bias-variance trade-off and can often speed up training time by propagating knowledge back over time-steps in a single update. We investigate the use of eligibility traces in combination with recurrent networks in the Atari domain. We illustrate the benefits of both recurrent nets and eligibility traces in some Atari games, and highlight also the importance of the optimization used in the training.",/pdf/0ef00b69b455f1ef1bf93bfacd23b16660df34ba.pdf,ICLR,2017,Analyze the effects of using eligibility traces different optimizations in Deep Recurrent Q-Networks +5L8XMh667qz,T23lx19sue,1601310000000.0,1614990000000.0,3302,Encoded Prior Sliced Wasserstein AutoEncoder for learning latent manifold representations,"[""~Sanjukta_Krishnagopal1"", ""jacob@math.umd.edu""]","[""Sanjukta Krishnagopal"", ""Jacob Bedrossian""]","[""VAE"", ""sliced Wasserstein distance"", ""latent representation"", ""interpolation"", ""manifold embedding"", ""geodesics"", ""network algorithm""]","While variational autoencoders have been successful in a variety of tasks, the use of conventional Gaussian or Gaussian mixture priors are limited in their ability to encode underlying structure of data in the latent representation. +In this work, we introduce an Encoded Prior Sliced Wasserstein AutoEncoder (EPSWAE) wherein an additional prior-encoder network facilitates learns an embedding of the data manifold which preserves topological and geometric properties of the data, thus improving the structure of latent space. 
+The autoencoder and prior-encoder networks are iteratively trained using the Sliced Wasserstein (SW) distance, which efficiently measures the distance between two \textit{arbitrary} sampleable distributions without being constrained to a specific form as in the KL divergence, and without requiring expensive adversarial training. +To improve the representation, we use (1) a structural consistency term in the loss that encourages isometry between feature space and latent space and (2) a nonlinear variant of the SW distance which averages over random nonlinear shearing. +The effectiveness of the learned manifold encoding is best explored by traversing the latent space through interpolations along \textit{geodesics} which generate samples that lie on the manifold and hence are advantageous compared to standard Euclidean interpolation. +To this end, we introduce a graph-based algorithm for interpolating along network-geodesics in latent space by maximizing the density of samples along the path while minimizing total energy. We use the 3D-spiral data to show that the prior does indeed encode the geometry underlying the data and to demonstrate the advantages of the network-algorithm for interpolation. +Additionally, we apply our framework to MNIST, and CelebA datasets, and show that outlier generations, latent representations, and geodesic interpolations are comparable to the state of the art.",/pdf/3e81b80c342a5d0dba7563917270717bcf743170.pdf,ICLR,2021,"A novel VAE-like architecture that uses an encoded-prior network to match the prior to the encoded data manifold using nonlinear sliced Wasserstein distances, and a graph-based algorithm for network-geodesic interpolations along the latent manifold." 
+MyHwDabUHZm,nPNu-UKEVLf,1601310000000.0,1615830000000.0,2341,Beyond Categorical Label Representations for Image Classification,"[""~Boyuan_Chen1"", ""yl4019@columbia.edu"", ""sr3587@columbia.edu"", ""~Hod_Lipson1""]","[""Boyuan Chen"", ""Yu Li"", ""Sunand Raghupathi"", ""Hod Lipson""]","[""Label Representation"", ""Image Classification"", ""Representation Learning""]","We find that the way we choose to represent data labels can have a profound effect on the quality of trained models. For example, training an image classifier to regress audio labels rather than traditional categorical probabilities produces a more reliable classification. This result is surprising, considering that audio labels are more complex than simpler numerical probabilities or text. We hypothesize that high dimensional, high entropy label representations are generally more useful because they provide a stronger error signal. We support this hypothesis with evidence from various label representations including constant matrices, spectrograms, shuffled spectrograms, Gaussian mixtures, and uniform random matrices of various dimensionalities. Our experiments reveal that high dimensional, high entropy labels achieve comparable accuracy to text (categorical) labels on standard image classification tasks, but features learned through our label representations exhibit more robustness under various adversarial attacks and better effectiveness with a limited amount of training data. These results suggest that label representation may play a more important role than previously thought.",/pdf/14e605cccc7af2ba01dc51b23e624ff89dbeff7c.pdf,ICLR,2021,We study the role of label representations for standard image classification tasks and find that high-dimensional high-entropy labels generally lead to more robust and data-efficient networks. 
+vNw0Gzw8oki,IhY-1UkbphC,1601310000000.0,1614990000000.0,946,Physics Informed Deep Kernel Learning,"[""~Zheng_Wang2"", ""wxing@sci.utah.edu"", ""~Robert_Kirby1"", ""~Shandian_Zhe1""]","[""Zheng Wang"", ""Wei Xing"", ""Robert Kirby"", ""Shandian Zhe""]","[""Deep Kernel"", ""Bayesian Learning""]","Deep kernel learning is a promising combination of deep neural networks and nonparametric function estimation. However, as a data-driven approach, the performance of deep kernel learning can still be restricted by scarce or insufficient data, especially in extrapolation tasks. To address these limitations, we propose Physics Informed Deep Kernel Learning (PI-DKL) that exploits physics knowledge represented by differential equations with latent sources. Specifically, we use the posterior function sample of the Gaussian process as the surrogate for the solution of the differential equation, and construct a generative component to integrate the equation in a principled Bayesian hybrid framework. For efficient and effective inference, we marginalize out the latent variables in the joint probability and derive a simple model evidence lower bound (ELBO), based on which we develop a stochastic collapsed inference algorithm. Our ELBO can be viewed as a nice, interpretable posterior regularization objective. On synthetic datasets and real-world applications, we show the advantage of our approach in both prediction accuracy and uncertainty quantification. 
",/pdf/9a276d918d6f250eaba65a9273870c52c30bfb29.pdf,ICLR,2021, +B1ZvaaeAZ,H1gDpalRZ,1509120000000.0,1519170000000.0,434,WRPN: Wide Reduced-Precision Networks,"[""asit.k.mishra@intel.com"", ""eriko.nurvitadhi@intel.com"", ""jeffrey.j.cook@intel.com"", ""debbie.marr@intel.com""]","[""Asit Mishra"", ""Eriko Nurvitadhi"", ""Jeffrey J Cook"", ""Debbie Marr""]","[""Low precision"", ""binary"", ""ternary"", ""4-bits networks""]","For computer vision applications, prior works have shown the efficacy of reducing numeric precision of model parameters (network weights) in deep neural networks. Activation maps, however, occupy a large memory footprint during both the training and inference step when using mini-batches of inputs. One way to reduce this large memory footprint is to reduce the precision of activations. However, past works have shown that reducing the precision of activations hurts model accuracy. We study schemes to train networks from scratch using reduced-precision activations without hurting accuracy. We reduce the precision of activation maps (along with model parameters) and increase the number of filter maps in a layer, and find that this scheme matches or surpasses the accuracy of the baseline full-precision network. As a result, one can significantly improve the execution efficiency (e.g. reduce dynamic memory footprint, memory band- width and computational energy) and speed up the training and inference process with appropriate hardware support. We call our scheme WRPN -- wide reduced-precision networks. We report results and show that WRPN scheme is better than previously reported accuracies on ILSVRC-12 dataset while being computationally less expensive compared to previously reported reduced-precision networks.",/pdf/17a7d81e64b77525832536d00c682478d6d7eec3.pdf,ICLR,2018,"Lowering precision (to 4-bits, 2-bits and even binary) and widening the filter banks gives as accurate network as those obtained with FP32 weights and activations." 
+SkluFgrFwH,H1es-CxKDS,1569440000000.0,1577170000000.0,2440,Learning Mahalanobis Metric Spaces via Geometric Approximation Algorithms,"[""dihara@gmail.com"", ""nmoham24@uic.edu"", ""sidiropo@uic.edu""]","[""Diego Ihara"", ""Neshat Mohammadi"", ""Anastasios Sidiropoulos""]","[""Metric Learning"", ""Geometric Algorithms"", ""Approximation Algorithms""]","Learning Mahalanobis metric spaces is an important problem that has found numerous applications. Several algorithms have been designed for this problem, including Information Theoretic Metric Learning (ITML) [Davis et al. 2007] and Large Margin Nearest Neighbor (LMNN) classification [Weinberger and Saul 2009]. We consider a formulation of Mahalanobis metric learning as an optimization problem, where the objective is to minimize the number of violated similarity/dissimilarity constraints. We show that for any fixed ambient dimension, there exists a fully polynomial time approximation scheme (FPTAS) with nearly-linear running time. This result is obtained using tools from the theory of linear programming in low dimensions. We also discuss improvements of the algorithm in practice, and present experimental results on synthetic and real-world data sets. Our algorithm is fully parallelizable and performs favorably in the presence of adversarial noise.",/pdf/342399b3129eb3fbf5d5c78e96d8d0a4bbe9c1b0.pdf,ICLR,2020,Fully parallelizable and adversarial-noise resistant metric learning algorithm with theoretical guarantees. 
+9Y7_c5ZAd5i,6ic-SOdA64s,1601310000000.0,1614990000000.0,374,A Sharp Analysis of Model-based Reinforcement Learning with Self-Play,"[""~Qinghua_Liu1"", ""~Tiancheng_Yu1"", ""~Yu_Bai1"", ""~Chi_Jin1""]","[""Qinghua Liu"", ""Tiancheng Yu"", ""Yu Bai"", ""Chi Jin""]","[""Reinforcement learning theory"", ""Markov games"", ""model-based RL"", ""task-agnostic RL"", ""multi-agent RL""]","Model-based algorithms---algorithms that explore the environment through building and utilizing an estimated model---are widely used in reinforcement learning practice and theoretically shown to achieve optimal sample efficiency for single-agent reinforcement learning in Markov Decision Processes (MDPs). However, for multi-agent reinforcement learning in Markov games, the current best known sample complexity for model-based algorithms is rather suboptimal and compares unfavorably against recent model-free approaches. In this paper, we present a sharp analysis of model-based self-play algorithms for multi-agent Markov games. We design an algorithm \emph{Optimistic Nash Value Iteration} (Nash-VI) for two-player zero-sum Markov games that is able to output an $\epsilon$-approximate Nash policy in $\tilde{\mathcal{O}}(H^3SAB/\epsilon^2)$ episodes of game playing, where $S$ is the number of states, $A,B$ are the number of actions for the two players respectively, and $H$ is the horizon length. This significantly improves over the best known model-based guarantee of $\tilde{\mathcal{O}}(H^4S^2AB/\epsilon^2)$, and is the first that matches the information-theoretic lower bound $\Omega(H^3S(A+B)/\epsilon^2)$ except for a $\min\{A,B\}$ factor. In addition, our guarantee compares favorably against the best known model-free algorithm if $\min\{A,B\}=o(H^3)$, and outputs a single Markov policy while existing sample-efficient model-free algorithms output a nested mixture of Markov policies that is in general non-Markov and rather inconvenient to store and execute. 
We further adapt our analysis to designing a provably efficient task-agnostic algorithm for zero-sum Markov games, and designing the first line of provably sample-efficient algorithms for multi-player general-sum Markov games.",/pdf/b64882258cb3e6e447dde367e42481353fa67bec.pdf,ICLR,2021,We design a new model-based algorithm for Markov games that achieves nearly optimal regret bound and PAC sample complexity. +Z3XVHSbSawb,WTgnzzFwd6,1601310000000.0,1614990000000.0,1768,Daylight: Assessing Generalization Skills of Deep Reinforcement Learning Agents,"[""~Ezgi_Korkmaz1""]","[""Ezgi Korkmaz""]","[""deep reinforcement learning"", ""generalization""]","Deep reinforcement learning algorithms have recently achieved significant success in learning high-performing policies from purely visual observations. The ability to perform end-to-end learning from raw high dimensional input alone has led to deep reinforcement learning algorithms being deployed in a variety of fields. Thus, understanding and improving the ability of deep reinforcement learning agents to generalize to unseen data distributions is of critical importance. Much recent work has focused on assessing the generalization of deep reinforcement learning agents by introducing specifically crafted adversarial perturbations to their inputs. In this paper, we propose another approach that we call daylight: a framework to assess the generalization skills of trained deep reinforcement learning agents. Rather than focusing on worst-case analysis of distribution shift, our approach is based on black-box perturbations that correspond to semantically meaningful changes to the environment or the agent's visual observation system ranging from brightness to compression artifacts. 
We demonstrate that even the smallest changes in the environment cause the performance of the agents to degrade significantly in various games from the Atari environment despite having orders of magnitude lower perceptual similarity distance compared to state-of-the-art adversarial attacks. We show that our framework captures a diverse set of bands in the Fourier spectrum, giving a better overall understanding of the agent's generalization capabilities. We believe our work can be crucial towards building resilient and generalizable deep reinforcement learning agents.",/pdf/34ae36de0694e14ed903e22a21bb179b4620968f.pdf,ICLR,2021, +S1M6Z2Cctm,HklRMC_FYm,1538090000000.0,1550290000000.0,1216,Harmonic Unpaired Image-to-image Translation,"[""zhangrui@ict.ac.cn"", ""tpfister@google.com"", ""lijiali@google.com""]","[""Rui Zhang"", ""Tomas Pfister"", ""Jia Li""]","[""unpaired image-to-image translation"", ""cyclegan"", ""smoothness constraint""]","The recent direction of unpaired image-to-image translation is on one hand very exciting as it alleviates the big burden in obtaining label-intensive pixel-to-pixel supervision, but it is on the other hand not fully satisfactory due to the presence of artifacts and degenerated transformations. In this paper, we take a manifold view of the problem by introducing a smoothness term over the sample graph to attain harmonic functions to enforce consistent mappings during the translation. We develop HarmonicGAN to learn bi-directional translations between the source and the target domains. With the help of similarity-consistency, the inherent self-consistency property of samples can be maintained. Distance metrics defined on two types of features including histogram and CNN are exploited. 
Under an identical problem setting as CycleGAN, without additional manual inputs and only at a small training-time cost, HarmonicGAN demonstrates a significant qualitative and quantitative improvement over the state of the art, as well as improved interpretability. We show experimental results in a number of applications including medical imaging, object transfiguration, and semantic labeling. We outperform the competing methods in all tasks, and for a medical imaging task in particular our method turns CycleGAN from a failure to a success, halving the mean-squared error, and generating images that radiologists prefer over competing methods in 95% of cases.",/pdf/4494c3a6acaa114426879aac747d0fc9236546ad.pdf,ICLR,2019,Smooth regularization over sample graph for unpaired image-to-image translation results in significantly improved consistency +SJev6JBtvH,HJxv241FPS,1569440000000.0,1577170000000.0,1992,Testing For Typicality with Respect to an Ensemble of Learned Distributions,"[""forrest.laine@berkeley.edu"", ""tomlin@eecs.berkeley.edu""]","[""Forrest Laine"", ""Claire Tomlin""]","[""anomaly detection"", ""density estimation"", ""generative models""]","Good methods of performing anomaly detection on high-dimensional data sets are +needed, since algorithms which are trained on data are only expected to perform +well on data that is similar to the training data. There are theoretical results on the +ability to detect if a population of data is likely to come from a known base distribution, +which is known as the goodness-of-fit problem, but those results require +knowing a model of the base distribution. The ability to correctly reject anomalous +data hinges on the accuracy of the model of the base distribution. For high dimensional +data, learning an accurate-enough model of the base distribution such that +anomaly detection works reliably is very challenging, as many researchers have +noted in recent years. 
Existing methods for the goodness-of-fit problem do not +account for the fact that a model of the base distribution is learned. To address that +gap, we offer a theoretically motivated approach to account for the density learning +procedure. In particular, we propose training an ensemble of density models, +considering data to be anomalous if the data is anomalous with respect to any +member of the ensemble. We provide a theoretical justification for this approach, +proving first that a test on typicality is a valid approach to the goodness-of-fit +problem, and then proving that for a correctly constructed ensemble of models, +the intersection of typical sets of the models lies in the interior of the typical set +of the base distribution. We present our method in the context of an example on +synthetic data in which the effects we consider can easily be seen.",/pdf/af3c816601680872cf2cc60886fe15f2c0aa3d93.pdf,ICLR,2020,We show theoretically and empirically that testing for typicality with respect to an ensemble of learned distributions can account for learning error in the hypothesis testing. +EZ8aZaCt9k,P9wQImshmQX,1601310000000.0,1614990000000.0,345,No Spurious Local Minima: on the Optimization Landscapes of Wide and Deep Neural Networks,"[""~Johannes_Lederer1""]","[""Johannes Lederer""]",[],"Empirical studies suggest that wide neural networks are comparably easy to optimize, but mathematical support for this observation is scarce. In this paper, we analyze the optimization landscapes of deep learning with wide networks. We prove especially that constraint and unconstraint empirical-risk minimization over such networks has no spurious local minima. 
Hence, our theories substantiate the common belief that increasing network widths not only improves the expressiveness of deep-learning pipelines but also facilitates their optimizations.",/pdf/110b5c0cd3c39f32b2060edb3826255e090570f2.pdf,ICLR,2021, +IpPQmzj4T_,KDpwi-9nnuG,1601310000000.0,1614990000000.0,353,Teleport Graph Convolutional Networks,"[""~Hongyang_Gao1"", ""~Shuiwang_Ji1""]","[""Hongyang Gao"", ""Shuiwang Ji""]","[""over-smoothing""]","We consider the limitations in message-passing graph neural networks. In message-passing operations, each node aggregates information from its neighboring nodes. To enlarge the receptive field, graph neural networks need to stack multiple message-passing graph convolution layers, which leads to the over-fitting issue and over-smoothing issue. To address these limitations, we propose a teleport graph convolution layer (TeleGCL) that uses teleport functions to enable each node to aggregate information from a much larger neighborhood. For each node, teleport functions select relevant nodes beyond the local neighborhood, thereby resulting in a larger receptive field. To apply our structure-aware teleport function, we propose a novel method to construct structural features for nodes in the graph. Based on our TeleGCL, we build a family of teleport graph convolutional networks. The empirical results on graph and node classification tasks demonstrate the effectiveness of our proposed methods.",/pdf/eaa5b7ce99ebba473b22000852c03e48b0f31bdb.pdf,ICLR,2021,We propose a teleport graph convolution layer to address the over-smoothing limitations in graph neural networks. +HyxOIoRqFQ,S1eB6ftIYQ,1538090000000.0,1545360000000.0,188,Discrete flow posteriors for variational inference in discrete dynamical systems,"[""laurence.aitchison@gmail.com"", ""vincent.adam@prowler.io"", ""turagas@janelia.hhmi.org""]","[""Laurence Aitchison"", ""Vincent Adam"", ""Srinivas C. 
Turaga""]","[""normalising flow"", ""variational inference"", ""discrete latent variable""]","Each training step for a variational autoencoder (VAE) requires us to sample from the approximate posterior, so we usually choose simple (e.g. factorised) approximate posteriors in which sampling is an efficient computation that fully exploits GPU parallelism. However, such simple approximate posteriors are often insufficient, as they eliminate statistical dependencies in the posterior. While it is possible to use normalizing flow approximate posteriors for continuous latents, there is nothing analogous for discrete latents. The most natural approach to model discrete dependencies is an autoregressive distribution, but sampling from such distributions is inherently sequential and thus slow. We develop a fast, parallel sampling procedure for autoregressive distributions based on fixed-point iterations which enables efficient and accurate variational inference in discrete state-space models. To optimize the variational bound, we considered two ways to evaluate probabilities: inserting the relaxed samples directly into the pmf for the discrete distribution, or converting to continuous logistic latent variables and interpreting the K-step fixed-point iterations as a normalizing flow. We found that converting to continuous latent variables gave considerable additional scope for mismatch between the true and approximate posteriors, which resulted in biased inferences, we thus used the former approach. We tested our approach on the neuroscience problem of inferring discrete spiking activity from noisy calcium-imaging data, and found that it gave accurate connectivity estimates in an order of magnitude less time.",/pdf/23c6900cad7a39c3d5bca6e26ec50c7ec825f51e.pdf,ICLR,2019,We give a fast normalising-flow like sampling procedure for discrete latent variable models. 
+BygNqoR9tm,H1gjFHlqt7,1538090000000.0,1545360000000.0,525,Sinkhorn AutoEncoders,"[""patrinig@hotmail.com"", ""marcello.carioni@uni-graz.at"", ""patrickforre@gmail.com"", ""samarth.bhargav@student.uva.nl"", ""welling.max@gmail.com"", ""riannevdberg@gmail.com"", ""tim.genewein@de.bosch.com"", ""nielsen@lix.polytechnique.fr""]","[""Giorgio Patrini"", ""Marcello Carioni"", ""Patrick Forr\u00e9"", ""Samarth Bhargav"", ""Max Welling"", ""Rianne van den Berg"", ""Tim Genewein"", ""Frank Nielsen""]","[""generative models"", ""autoencoders"", ""optimal transport"", ""sinkhorn algorithm""]","Optimal Transport offers an alternative to maximum likelihood for learning generative autoencoding models. We show how this principle dictates the minimization of the Wasserstein distance between the encoder aggregated posterior and the prior, plus a reconstruction error. We prove that in the non-parametric limit the autoencoder generates the data distribution if and only if the two distributions match exactly, and that the optimum can be obtained by deterministic autoencoders. +We then introduce the Sinkhorn AutoEncoder (SAE), which casts the problem into Optimal Transport on the latent space. The resulting Wasserstein distance is minimized by backpropagating through the Sinkhorn algorithm. +SAE models the aggregated posterior as an implicit distribution and therefore does not need a reparameterization trick for gradients estimation. Moreover, it requires virtually no adaptation to different prior distributions. We demonstrate its flexibility by considering models with hyperspherical and Dirichlet priors, as well as a simple case of probabilistic programming. SAE matches or outperforms other autoencoding models in visual quality and FID scores. 
",/pdf/42599469eba7e21b02b2a18c15d47d120f613f04.pdf,ICLR,2019, +HyerxgHYvH,SJePfa1tPH,1569440000000.0,1577170000000.0,2097,Neural Arithmetic Unit by reusing many small pre-trained networks,"[""ammarahmad977@gmail.com"", ""oneebalibabar@gmail.com"", ""murtaza.taj@lums.edu.pk""]","[""Ammar Ahmad"", ""Oneeb Babar"", ""Murtaza Taj""]","[""NALU"", ""feed forward NN""]","We propose a solution for evaluation of mathematical expression. However, instead of designing a single end-to-end model we propose a Lego bricks style architecture. In this architecture instead of training a complex end-to-end neural network, many small networks can be trained independently each accomplishing one specific operation and acting a single lego brick. More difficult or complex task can then be solved using a combination of these smaller network. In this work we first identify 8 fundamental operations that are commonly used to solve arithmetic operations (such as 1 digit multiplication, addition, subtraction, sign calculator etc). These fundamental operations are then learned using simple feed forward neural networks. We then shows that different operations can be designed simply by reusing these smaller networks. As an example we reuse these smaller networks to develop larger and a more complex network to solve n-digit multiplication, n-digit division, and cross product. This bottom-up strategy not only introduces reusability, we also show that it allows to generalize for computations involving n-digits and we show results for up to 7 digit numbers. 
Unlike existing methods, our solution also generalizes for both positive as well as negative numbers.",/pdf/fdbea97f07135cd8a74697b40863f2f77a52bbab.pdf,ICLR,2020,"We train many small networks each for a specific operation, these are then combined to perform complex operations" +B1l1b205KX,HJxST2XqY7,1538090000000.0,1545360000000.0,1132,Unsupervised Disentangling Structure and Appearance,"[""wuwenyan@sensetime.com"", ""kaidicao@cs.stanford.edu"", ""chengli@sensetime.com"", ""qianchen@sensetime.com"", ""ccloy225@gmail.com""]","[""Wayne Wu"", ""Kaidi Cao"", ""Cheng Li"", ""Chen Qian"", ""Chen Change Loy""]","[""disentangled representations"", ""VAE"", ""generative models"", ""unsupervised learning""]","It is challenging to disentangle an object into two orthogonal spaces of structure and appearance since each can influence the visual observation in a different and unpredictable way. It is rare for one to have access to a large number of data to help separate the influences. In this paper, we present a novel framework to learn this disentangled representation in a completely unsupervised manner. We address this problem in a two-branch Variational Autoencoder framework. For the structure branch, we project the latent factor into a soft structured point tensor and constrain it with losses derived from prior knowledge. This encourages the branch to distill geometry information. Another branch learns the complementary appearance information. The two branches form an effective framework that can disentangle object's structure-appearance representation without any human annotation. We evaluate our approach on four image datasets, on which we demonstrate the superior disentanglement and visual analogy quality both in synthesis and real-world data. 
We are able to generate photo-realistic images with 256*256 resolution that are clearly disentangled in structure and appearance.",/pdf/e4e29d156e204dca1bf7c15a20486b2687fa4fa8.pdf,ICLR,2019,We present a novel framework to learn the disentangled representation of structure and appearance in a completely unsupervised manner. +Bkg75aVKDH,HkgYrq6vPr,1569440000000.0,1577170000000.0,701,Training Provably Robust Models by Polyhedral Envelope Regularization,"[""chen.liu@epfl.ch"", ""mathieu.salzmann@epfl.ch"", ""sabine.susstrunk@epfl.ch""]","[""Chen Liu"", ""Mathieu Salzmann"", ""Sabine S\u00fcsstrunk""]","[""deep learning"", ""adversarial attack"", ""robust certification""]","Training certifiable neural networks enables one to obtain models with robustness guarantees against adversarial attacks. In this work, we use a linear approximation to bound model’s output given an input adversarial budget. This allows us to bound the adversary-free region in the data neighborhood by a polyhedral envelope and yields finer-grained certified robustness than existing methods. We further exploit this certifier to introduce a framework called polyhedral envelope regularization (PER), which encourages larger polyhedral envelopes and thus improves the provable robustness of the models. 
We demonstrate the flexibility and effectiveness of our framework on standard benchmarks; it applies to networks with general activation functions and obtains comparable or better robustness guarantees than state-of-the-art methods, with very little cost in clean accuracy, i.e., without over-regularizing the model.",/pdf/c3a20ec3ec0bef8069ff86d7f22ae4c763fda922.pdf,ICLR,2020, +H1ldNoC9tX,SkljryRvKX,1538090000000.0,1545360000000.0,14,"Classification from Positive, Unlabeled and Biased Negative Data","[""yu-guan.hsieh@ens.fr"", ""gang.niu@riken.jp"", ""sugi@k.u-tokyo.ac.jp""]","[""Yu-Guan Hsieh"", ""Gang Niu"", ""Masashi Sugiyama""]","[""positive-unlabeled learning"", ""dataset shift"", ""empirical risk minimization""]","Positive-unlabeled (PU) learning addresses the problem of learning a binary classifier from positive (P) and unlabeled (U) data. It is often applied to situations where negative (N) data are difficult to be fully labeled. However, collecting a non-representative N set that contains only a small portion of all possible N data can be much easier in many practical situations. This paper studies a novel classification framework which incorporates such biased N (bN) data in PU learning. The fact that the training N data are biased also makes our work very different from those of standard semi-supervised learning. We provide an empirical risk minimization-based method to address this PUbN classification problem. Our approach can be regarded as a variant of traditional example-reweighting algorithms, with the weight of each example computed through a preliminary step that draws inspiration from PU learning. We also derive an estimation error bound for the proposed method. 
Experimental results demonstrate the effectiveness of our algorithm in not only PUbN learning scenarios but also ordinary PU leaning scenarios on several benchmark datasets.",/pdf/8ddc2bdb66afa624688e31ff3ef6369653cd821c.pdf,ICLR,2019,"This paper studied the PUbN classification problem, where we incorporate biased negative (bN) data, i.e., negative data that is not fully representative of the true underlying negative distribution, into positive-unlabeled (PU) learning." +BIwkgTsSp_8,C5fhgQohgei,1601310000000.0,1614990000000.0,3455,Learning to Noise: Application-Agnostic Data Sharing with Local Differential Privacy,"[""~Alex_Mansbridge1"", ""~Gregory_Barbour1"", ""~Davide_Piras1"", ""~Christopher_Frye1"", ""~Ilya_Feige1"", ""~David_Barber2""]","[""Alex Mansbridge"", ""Gregory Barbour"", ""Davide Piras"", ""Christopher Frye"", ""Ilya Feige"", ""David Barber""]","[""Differential Privacy"", ""Representation Learning"", ""Variational Inference"", ""Generative Modelling""]","In recent years, the collection and sharing of individuals’ private data has become commonplace in many industries. Local differential privacy (LDP) is a rigorous approach which uses a randomized algorithm to preserve privacy even from the database administrator, unlike the more standard central differential privacy. For LDP, when applying noise directly to high-dimensional data, the level of noise required all but entirely destroys data utility. In this paper we introduce a novel, application-agnostic privatization mechanism that leverages representation learning to overcome the prohibitive noise requirements of direct methods, while maintaining the strict guarantees of LDP. We further demonstrate that data privatized with this mechanism can be used to train machine learning algorithms. Applications of this model include private data collection, private novel-class classification, and the augmentation of clean datasets with additional privatized features. 
We achieve significant gains in performance on downstream classification tasks relative to benchmarks that noise the data directly, which are state-of-the-art in the context of application-agnostic LDP mechanisms for high-dimensional data sharing tasks.",/pdf/d75134f9887c9665417296f82cde939c7fb6d3cb.pdf,ICLR,2021,"Using representation learning to induce local differential privacy on high-dimensional data, via an application-agnostic privatization mechanism." +eyXknI5scWu,UtGseJVpjR3,1601310000000.0,1614990000000.0,2163,Investigating and Simplifying Masking-based Saliency Methods for Model Interpretability,"[""~Jason_Phang1"", ""~Jungkyu_Park1"", ""~Krzysztof_J._Geras1""]","[""Jason Phang"", ""Jungkyu Park"", ""Krzysztof J. Geras""]","[""saliency maps"", ""interpretability"", ""explainable AI"", ""image recognition"", ""image masking"", ""adversarial training""]","Saliency maps that identify the most informative regions of an image for a classifier are valuable for model interpretability. A common approach to creating saliency maps involves generating input masks that mask out portions of an image to maximally deteriorate classification performance, or mask in an image to preserve classification performance. Many variants of this approach have been proposed in the literature, such as counterfactual generation and optimizing over a Gumbel-Softmax distribution. Using a general formulation of masking-based saliency methods, we conduct an extensive evaluation study of a number of recently proposed variants to understand which elements of these methods meaningfully improve performance. Surprisingly, we find that a well-tuned, relatively simple formulation of a masking-based saliency model outperforms many more complex approaches. We find that the most important ingredients for high quality saliency map generation are (1) using both masked-in and masked-out objectives and (2) training the classifier alongside the masking model. 
Strikingly, we show that a masking model can be trained with as few as 10 examples per class and still generate saliency maps with only a 0.7-point increase in localization error.",/pdf/3af94971648ab2fb9623d091854a5786ff5ab341.pdf,ICLR,2021,"In a large evaluation study, we show that a simple formulation of masking-based saliency map generation outperforms many recently proposed improvements, and that comparable results can be obtained by training on as few as 10 examples per class." +SkgS2lBFPS,Byg7IebYvS,1569440000000.0,1577170000000.0,2536,A Bilingual Generative Transformer for Semantic Sentence Embedding,"[""jwieting@cs.cmu.edu"", ""gneubig@cs.cmu.edu"", ""tberg@eng.ucsd.edu""]","[""John Wieting"", ""Graham Neubig"", ""Taylor Berg-Kirkpatrick""]","[""sentence embedding"", ""semantic similarity"", ""multilingual"", ""latent variables"", ""vae""]","Semantic sentence embedding models take natural language sentences and turn them into vectors, such that similar vectors indicate similarity in the semantics between the sentences. Bilingual data offers a useful signal for learning such embeddings: properties shared by both sentences in a translation pair are likely semantic, while divergent properties are likely stylistic or language-specific. We propose a deep latent variable model that attempts to perform source separation on parallel sentences, isolating what they have in common in a latent semantic vector, and explaining what is left over with language-specific latent vectors. Our proposed approach differs from past work on semantic sentence encoding in two ways. First, by using a variational probabilistic framework, we introduce priors that encourage source separation, and can use our model’s posterior to predict sentence embeddings for monolingual data at test time. Second, we use high-capacity transformers as both data generating distributions and inference networks – contrasting with most past work on sentence embeddings. 
In experiments, our approach substantially outperforms the state-of-the-art on a standard suite of semantic similarity evaluations. Further, we demonstrate that our approach yields the largest gains on more difficult subsets of test where simple word overlap is not a good indicator of similarity.",/pdf/548a93908b8f6959063784f2e44fbdeca3a9f2af.pdf,ICLR,2020, +Ske25sC9FQ,SJeJqG2ct7,1538090000000.0,1545360000000.0,567,Robustness and Equivariance of Neural Networks,"[""amitdesh@microsoft.com"", ""ksandeshk@cmi.ac.in"", ""kv@cmi.ac.in""]","[""Amit Deshpande"", ""Sandesh Kamath"", ""K.V.Subrahmanyam""]","[""robust"", ""adversarial"", ""equivariance"", ""rotations"", ""GCNNs"", ""CNNs"", ""steerable"", ""neural networks""]","Neural networks models are known to be vulnerable to geometric transformations +as well as small pixel-wise perturbations of input. Convolutional Neural Networks +(CNNs) are translation-equivariant but can be easily fooled using rotations and +small pixel-wise perturbations. Moreover, CNNs require sufficient translations in +their training data to achieve translation-invariance. Recent work by Cohen & +Welling (2016), Worrall et al. (2016), Kondor & Trivedi (2018), Cohen & Welling +(2017), Marcos et al. (2017), and Esteves et al. (2018) has gone beyond translations, +and constructed rotation-equivariant or more general group-equivariant +neural network models. In this paper, we do an extensive empirical study of various +rotation-equivariant neural network models to understand how effectively they +learn rotations. This includes Group-equivariant Convolutional Networks (GCNNs) +by Cohen & Welling (2016), Harmonic Networks (H-Nets) by Worrall et al. +(2016), Polar Transformer Networks (PTN) by Esteves et al. (2018) and Rotation +equivariant vector field networks by Marcos et al. (2017). 
We empirically compare +the ability of these networks to learn rotations efficiently in terms of their +number of parameters, sample complexity, rotation augmentation used in training. +We compare them against each other as well as Standard CNNs. We observe +that as these rotation-equivariant neural networks learn rotations, they instead become +more vulnerable to small pixel-wise adversarial attacks, e.g., Fast Gradient +Sign Method (FGSM) and Projected Gradient Descent (PGD), in comparison with +Standard CNNs. In other words, robustness to geometric transformations in these +models comes at the cost of robustness to small pixel-wise perturbations.",/pdf/ac4876584db2c7d7c97f3c09caf67a0bc4cbcdb5.pdf,ICLR,2019,Robustness to rotations comes at the cost of robustness of pixel-wise adversarial perturbations. +rye7IMbAZ,Bk17UfbRZ,1509140000000.0,1518730000000.0,844, Explicit Induction Bias for Transfer Learning with Convolutional Networks,"[""xuhong.li@utc.fr"", ""yves.grandvalet@utc.fr"", ""franck.davoine@utc.fr""]","[""Xuhong LI"", ""Yves GRANDVALET"", ""Franck DAVOINE""]","[""transfer Learning"", ""convolutional networks"", ""fine-tuning"", ""regularization"", ""induction bias""]","In inductive transfer learning, fine-tuning pre-trained convolutional networks substantially outperforms training from scratch. +When using fine-tuning, the underlying assumption is that the pre-trained model extracts generic features, which are at least partially relevant for solving the target task, but would be difficult to extract from the limited amount of data available on the target task. +However, besides the initialization with the pre-trained model and the early stopping, there is no mechanism in fine-tuning for retaining the features learned on the source task. +In this paper, we investigate several regularization schemes that explicitly promote the similarity of the final solution with the initial model. 
+We eventually recommend a simple $L^2$ penalty using the pre-trained model as a reference, and we show that this approach behaves much better than the standard scheme using weight decay on a partially frozen network.",/pdf/165f67b1ff4e62a8ddc596d70306ef90093bc059.pdf,ICLR,2018,"In inductive transfer learning, fine-tuning pre-trained convolutional networks substantially outperforms training from scratch." +BJfvknCqFQ,ryxRiJAqFQ,1538090000000.0,1545360000000.0,993,A Rotation and a Translation Suffice: Fooling CNNs with Simple Transformations,"[""engstrom@mit.edu"", ""btran115@mit.edu"", ""tsipras@mit.edu"", ""ludwigs@mit.edu"", ""madry@mit.edu""]","[""Logan Engstrom"", ""Brandon Tran"", ""Dimitris Tsipras"", ""Ludwig Schmidt"", ""Aleksander Madry""]","[""robustness"", ""spatial transformations"", ""invariance"", ""rotations"", ""data augmentation"", ""robust optimization""]","We show that simple spatial transformations, namely translations and rotations alone, suffice to fool neural networks on a significant fraction of their inputs in multiple image classification tasks. Our results are in sharp contrast to previous work in adversarial robustness that relied on more complicated optimization approaches unlikely to appear outside a truly adversarial context. Moreover, the misclassifying rotations and translations are easy to find and require only a few black-box queries to the target model. Overall, our findings emphasize the need to design robust classifiers even for natural input transformations in benign settings. +",/pdf/3a95ee20b9313f1ab539108bda763e4511ba6e2a.pdf,ICLR,2019,We show that CNNs are not robust to simple rotations and translation and explore methods of improving this. 
+HJeABnCqKQ,BJe1HWOtK7,1538090000000.0,1545360000000.0,1589,Generative Adversarial Self-Imitation Learning,"[""junhyuk@umich.edu"", ""guoyijie@umich.edu"", ""baveja@umich.edu"", ""honglak@google.com""]","[""Junhyuk Oh"", ""Yijie Guo"", ""Satinder Singh"", ""Honglak Lee""]",[],"This paper explores a simple regularizer for reinforcement learning by proposing Generative Adversarial Self-Imitation Learning (GASIL), which encourages the agent to imitate past good trajectories via generative adversarial imitation learning framework. Instead of directly maximizing rewards, GASIL focuses on reproducing past good trajectories, which can potentially make long-term credit assignment easier when rewards are sparse and delayed. GASIL can be easily combined with any policy gradient objective by using GASIL as a learned reward shaping function. Our experimental results show that GASIL improves the performance of proximal policy optimization on 2D Point Mass and MuJoCo environments with delayed reward and stochastic dynamics.",/pdf/abda7f26819f3dc793bd452d231156dfd89742b6.pdf,ICLR,2019, +ByxmXnA9FQ,Bkgt9qZ9K7,1538090000000.0,1545360000000.0,1345,A Variational Dirichlet Framework for Out-of-Distribution Detection,"[""wenhuchen@ucsb.edu"", ""yilin.shen@samsung.com"", ""william@cs.ucsb.edu"", ""hongxia.jin@samsung.com""]","[""Wenhu Chen"", ""Yilin Shen"", ""William Wang"", ""Hongxia Jin""]","[""out-of-distribution detection"", ""variational inference"", ""Dirichlet distribution"", ""deep learning"", ""uncertainty measure""]","With the recently rapid development in deep learning, deep neural networks have been widely adopted in many real-life applications. However, deep neural networks are also known to have very little control over its uncertainty for test examples, which potentially causes very harmful and annoying consequences in practical scenarios. 
In this paper, we are particularly interested in designing a higher-order uncertainty metric for deep neural networks and investigate its performance on the out-of-distribution detection task proposed by~\cite{hendrycks2016baseline}. Our method first assumes there exists a underlying higher-order distribution $\mathcal{P}(z)$, which generated label-wise distribution $\mathcal{P}(y)$ over classes on the K-dimension simplex, and then approximate such higher-order distribution via parameterized posterior function $p_{\theta}(z|x)$ under variational inference framework, finally we use the entropy of learned posterior distribution $p_{\theta}(z|x)$ as uncertainty measure to detect out-of-distribution examples. However, we identify the overwhelming over-concentration issue in such a framework, which greatly hinders the detection performance. Therefore, we further design a log-smoothing function to alleviate such issue to greatly increase the robustness of the proposed entropy-based uncertainty measure. 
Through comprehensive experiments on various datasets and architectures, our proposed variational Dirichlet framework with entropy-based uncertainty measure is consistently observed to yield significant improvements over many baseline systems.",/pdf/ac533e42dbd3e9d2457d95410c4780957cedbec6.pdf,ICLR,2019,A new framework based variational inference for out-of-distribution detection +S1gDCiCqtQ,Skgcrnp5FQ,1538090000000.0,1545360000000.0,904,Learning Representations in Model-Free Hierarchical Reinforcement Learning,"[""jrafatiheravi@ucmerced.edu"", ""dnoelle@ucmerced.edu""]","[""Jacob Rafati"", ""David Noelle""]","[""Reinforcement Learning"", ""Model-Free Hierarchical Reinforcement Learning"", ""Subgoal Discovery"", ""Unsupervised Learning"", ""Temporal Difference"", ""Temporal Abstraction"", ""Intrinsic Motivation"", ""Markov Decision Processes"", ""Deep Reinforcement Learning"", ""Optimization""]","Common approaches to Reinforcement Learning (RL) are seriously challenged by large-scale applications involving huge state spaces and sparse delayed reward feedback. Hierarchical Reinforcement Learning (HRL) methods attempt to address this scalability issue by learning action selection policies at multiple levels of temporal abstraction. Abstraction can be had by identifying a relatively small set of states that are likely to be useful as subgoals, in concert with the learning of corresponding skill policies to achieve those subgoals. Many approaches to subgoal discovery in HRL depend on the analysis of a model of the environment, but the need to learn such a model introduces its own problems of scale. Once subgoals are identified, skills may be learned through intrinsic motivation, introducing an internal reward signal marking subgoal attainment. In this paper, we present a novel model-free method for subgoal discovery using incremental unsupervised learning over a small memory of the most recent experiences of the agent. 
When combined with an intrinsic motivation learning mechanism, this method learns subgoals and skills together, based on experiences in the environment. Thus, we offer an original approach to HRL that does not require the acquisition of a model of the environment, suitable for large-scale applications. We demonstrate the efficiency of our method on two RL problems with sparse delayed feedback: a variant of the rooms environment and the ATARI 2600 game called Montezuma's Revenge. +",/pdf/e4b525030cb6bdfaf77fc3492532d0ba948049a4.pdf,ICLR,2019,"We offer an original approach to model-free deep hierarchical reinforcement learning, including unsupervised subgoal discovery and unified temporal abstraction and intrinsic motivation learning. " +rJzaDdYxx,,1478230000000.0,1484170000000.0,106,Gradients of Counterfactuals,"[""mukunds@google.com"", ""ataly@google.com"", ""qiqiyan@google.com""]","[""Mukund Sundararajan"", ""Ankur Taly"", ""Qiqi Yan""]","[""Deep learning"", ""Computer vision"", ""Theory""]","Gradients have been used to quantify feature importance in machine learning models. Unfortunately, in nonlinear deep networks, not only individual neurons but also the whole network can saturate, and as a result an important input feature can have a tiny gradient. We study various networks, and observe that this phenomenon is indeed widespread, across many inputs. + +We propose to examine interior gradients, which are gradients of counterfactual inputs constructed by scaling down the original input. We apply our method to the GoogleNet architecture for object recognition in images, as well as a ligand-based virtual screening network with categorical features and an LSTM based language model for the Penn Treebank dataset. We visualize how interior gradients better capture feature importance. Furthermore, interior gradients are applicable to a wide variety of deep networks, and have the attribution property that the feature importance scores sum to the prediction score. 
+ +Best of all, interior gradients can be computed just as easily as gradients. In contrast, previous methods are complex to implement, which hinders practical adoption.",/pdf/95894e6fa8b21d5e355bb0167096107df64f71d5.pdf,ICLR,2017,A method for identifying feature importance in deep networks using gradients of counterfactual inputs +HJlmHoR5tQ,H1eIaq_xKX,1538090000000.0,1550780000000.0,72,Adversarial Imitation via Variational Inverse Reinforcement Learning,"[""a1quresh@eng.ucsd.edu"", ""bboots@cc.gatech.edu"", ""yip@ucsd.edu""]","[""Ahmed H. Qureshi"", ""Byron Boots"", ""Michael C. Yip""]","[""Inverse Reinforcement Learning"", ""Imitation learning"", ""Variational lnference"", ""Learning from demonstrations""]","We consider a problem of learning the reward and policy from expert examples under unknown dynamics. Our proposed method builds on the framework of generative adversarial networks and introduces the empowerment-regularized maximum-entropy inverse reinforcement learning to learn near-optimal rewards and policies. Empowerment-based regularization prevents the policy from overfitting to expert demonstrations, which advantageously leads to more generalized behaviors that result in learning near-optimal rewards. Our method simultaneously learns empowerment through variational information maximization along with the reward and policy under the adversarial learning formulation. We evaluate our approach on various high-dimensional complex control tasks. We also test our learned rewards in challenging transfer learning problems where training and testing environments are made to be different from each other in terms of dynamics or structure. 
The results show that our proposed method not only learns near-optimal rewards and policies that are matching expert behavior but also performs significantly better than state-of-the-art inverse reinforcement learning algorithms.",/pdf/0ae7c8b4017f1562558e6082f5c0e72e62256b03.pdf,ICLR,2019,Our method introduces the empowerment-regularized maximum-entropy inverse reinforcement learning to learn near-optimal rewards and policies from expert demonstrations. +SJGPL9Dex,,1478100000000.0,1488500000000.0,47,Understanding Trainable Sparse Coding with Matrix Factorization,"[""thomas.moreau@cmla.ens-cachan.fr"", ""joan.bruna@berkeley.edu""]","[""Thomas Moreau"", ""Joan Bruna""]","[""Theory"", ""Deep learning"", ""Optimization""]","Sparse coding is a core building block in many data analysis and machine learning pipelines. Typically it is solved by relying on generic optimization techniques, such as the Iterative Soft Thresholding Algorithm and its accelerated version (ISTA, FISTA). These methods are optimal in the class of first-order methods for non-smooth, convex functions. However, they do not exploit the particular structure of the problem at hand nor the input data distribution. An acceleration using neural networks, coined LISTA, was proposed in \cite{Gregor10}, which showed empirically that one could achieve high quality estimates with few iterations by modifying the parameters of the proximal splitting appropriately. + +In this paper we study the reasons for such acceleration. Our mathematical analysis reveals that it is related to a specific matrix factorization of the Gram kernel of the dictionary, which attempts to nearly diagonalise the kernel with a basis that produces a small perturbation of the $\ell_1$ ball. When this factorization succeeds, we prove that the resulting splitting algorithm enjoys an improved convergence bound with respect to the non-adaptive version. 
Moreover, our analysis also shows that conditions for acceleration occur mostly at the beginning of the iterative process, consistent with numerical experiments. We further validate our analysis by showing that on dictionaries where this factorization does not exist, adaptive acceleration fails.",/pdf/0fff3d38d593bd9bf31ce8fd26c67e99400a6c5e.pdf,ICLR,2017,"We analyse the mechanisms which permit to accelerate sparse coding resolution using the problem structure, as it is the case in LISTA." +S1Y0td9ee,,1478290000000.0,1482940000000.0,461,Shift Aggregate Extract Networks,"[""francesco.orsini@kuleuven.be"", ""daniele.baracchi@unifi.it"", ""paolo.frasconi@unifi.it""]","[""Francesco Orsini"", ""Daniele Baracchi"", ""Paolo Frasconi""]","[""Supervised Learning""]","The Shift Aggregate Extract Network SAEN is an architecture for learning representations on social network data. +SAEN decomposes input graphs into hierarchies made of multiple strata of objects. +Vector representations of each object are learnt by applying 'shift', 'aggregate' and 'extract' operations on the vector representations of its parts. +We propose an algorithm for domain compression which takes advantage of symmetries in hierarchical decompositions to reduce the memory usage and obtain significant speedups. 
+Our method is empirically evaluated on real world social network datasets, outperforming the current state of the art.",/pdf/02368a8b5d40f512adf2e3a15a5f33eab0b561c1.pdf,ICLR,2017,Shift Aggregate Extract Networks for learning on social network data +HJg_tkBtwS,rJeUtIROPB,1569440000000.0,1577170000000.0,1846,Model-Agnostic Feature Selection with Additional Mutual Information,"[""ms7490@nyu.edu"", ""apm470@nyu.edu"", ""lakshmi@cs.nyu.edu"", ""sriram@cs.ucla.edu"", ""rajeshr@cims.nyu.edu""]","[""Mukund Sudarshan"", ""Aahlad Manas Puli"", ""Lakshmi Subramanian"", ""Sriram Sankararaman"", ""Rajesh Ranganath""]","[""feature selection"", ""interpretability"", ""randomization"", ""fdr control"", ""p-values""]","Answering questions about data can require understanding what parts of an input X influence the response Y. Finding such an understanding can be built by testing relationships between variables through a machine learning model. For example, conditional randomization tests help determine whether a variable relates to the response given the rest of the variables. However, randomization tests require users to specify test statistics. We formalize a class of proper test statistics that are guaranteed to select a feature when it provides information about the response even when the rest of the features are known. We show that f-divergences provide a broad class of proper test statistics. In the class of f-divergences, the KL-divergence yields an easy-to-compute proper test statistic that relates to the AMI. Questions of feature importance can be asked at the level of an individual sample. We show that estimators from the same AMI test can also be used to find important features in a particular instance. We provide an example to show that perfect predictive models are insufficient for instance-wise feature selection. 
We evaluate our method on several simulation experiments, on a genomic dataset, a clinical dataset for hospital readmission, and on a subset of classes in ImageNet. Our method outperforms several baselines in various simulated datasets, is able to identify biologically significant genes, can select the most important predictors of a hospital readmission event, and is able to identify distinguishing features in an image-classification task. ",/pdf/bd14218f99d6b7dcd035ddba394b2dc038c7b01c.pdf,ICLR,2020,"We develop a simple regression-based model-agnostic feature selection method to interpret data generating processes with FDR control, and outperform several popular baselines on several simulated, medical, and image datasets." +rygo9iR9F7,S1eKgWj5YQ,1538090000000.0,1545360000000.0,564,Progressive Weight Pruning Of Deep Neural Networks Using ADMM,"[""sye106@syr.edu"", ""tzhan120@syr.edu"", ""kzhang17@syr.edu"", ""jli221@syr.edu"", ""xu.kaid@husky.neu.edu"", ""yunfei.yang717@gmail.com"", ""fyu@gmu.edu"", ""jtang02@syr.edu"", ""makan@syr.edu"", ""sijia.liu@ibm.com"", ""xchen26@gmu.edu"", ""xue.lin@northeastern.edu"", ""yanz.wang@northeastern.edu""]","[""Shaokai Ye"", ""Tianyun Zhang"", ""Kaiqi Zhang"", ""Jiayu Li"", ""Kaidi Xu"", ""Yunfei Yang"", ""Fuxun Yu"", ""Jian Tang"", ""Makan Fardad"", ""Sijia Liu"", ""Xiang Chen"", ""Xue Lin"", ""Yanzhi Wang""]","[""deep learning"", ""model compression"", ""optimization"", ""ADMM"", ""weight pruning""]","Deep neural networks (DNNs) although achieving human-level performance in many domains, have very large model size that hinders their broader applications on edge computing devices. Extensive research work have been conducted on DNN model compression or pruning. However, most of the previous work took heuristic approaches. 
This work proposes a progressive weight pruning approach based on ADMM (Alternating Direction Method of Multipliers), a powerful technique to deal with non-convex optimization problems with potentially combinatorial constraints. Motivated by dynamic programming, the proposed method reaches extremely high pruning rate by using partial prunings with moderate pruning rates. Therefore, it resolves the accuracy degradation and long convergence time problems when pursuing extremely high pruning ratios. It achieves up to 34× pruning rate for ImageNet dataset and 167× pruning rate for MNIST dataset, significantly higher than those reached by the literature work. Under the same number of epochs, the proposed method also achieves faster convergence and higher compression rates. The codes and pruned DNN models are released in the anonymous link bit.ly/2zxdlss.",/pdf/1f8ee048d5b5668b3b59f70d2b981433d112f38d.pdf,ICLR,2019,We implement a DNN weight pruning approach that achieves the highest pruning rates. +0aW6lYOYB7d,WnIdLrM_9hz,1601310000000.0,1613800000000.0,1511,Large-width functional asymptotics for deep Gaussian neural networks,"[""daniele.bracale@edu.unito.it"", ""~Stefano_Favaro1"", ""sandra.fortini@unibocconi.it"", ""~Stefano_Peluchetti1""]","[""Daniele Bracale"", ""Stefano Favaro"", ""Sandra Fortini"", ""Stefano Peluchetti""]","[""deep learning theory"", ""infinitely wide neural network"", ""Gaussian process"", ""stochastic process""]","In this paper, we consider fully connected feed-forward deep neural networks where weights and biases are independent and identically distributed according to Gaussian distributions. Extending previous results (Matthews et al., 2018a;b;Yang, 2019) we adopt a function-space perspective, i.e. we look at neural networks as infinite-dimensional random elements on the input space $\mathbb{R}^I$. 
Under suitable assumptions on the activation function we show that: i) a network defines a continuous Gaussian process on the input space $\mathbb{R}^I$; ii) a network with re-scaled weights converges weakly to a continuous Gaussian process in the large-width limit; iii) the limiting Gaussian process has almost surely locally $\gamma$-Hölder continuous paths, for $0 < \gamma <1$. Our results contribute to recent theoretical studies on the interplay between infinitely wide deep neural networks and Gaussian processes by establishing weak convergence in function-space with respect to a stronger metric.",/pdf/f23196cbe1af7816b29e5ef4110170972165c6ad.pdf,ICLR,2021,We establish the convergence of infinitely wide feed-forward deep neural networks in function space. +SJw03ceRW,S1IAn9lA-,1509100000000.0,1518730000000.0,359,GENERATIVE LOW-SHOT NETWORK EXPANSION,"[""adi.hayat3@gmail.com"", ""mark.kliger@gmail.com"", ""shacharfl@gmail.com"", ""cohenor@gmail.com""]","[""Adi Hayat"", ""Mark Kliger"", ""Shachar Fleishman"", ""Daniel Cohen-Or""]","[""Low-Shot Learning"", ""class incremental learning"", ""Network expansion"", ""Generative model"", ""Distillation""]","Conventional deep learning classifiers are static in the sense that they are trained on +a predefined set of classes and learning to classify a novel class typically requires +re-training. In this work, we address the problem of Low-shot network-expansion +learning. We introduce a learning framework which enables expanding a pre-trained +(base) deep network to classify novel classes when the number of examples for the +novel classes is particularly small. We present a simple yet powerful distillation +method where the base network is augmented with additional weights to classify +the novel classes, while keeping the weights of the base network unchanged. We +term this learning hard distillation, since we preserve the response of the network +on the old classes to be equal in both the base and the expanded network. 
We +show that since only a small number of weights needs to be trained, the hard +distillation excels for low-shot training scenarios. Furthermore, hard distillation +avoids detriment to classification performance on the base classes. Finally, we +show that low-shot network expansion can be done with a very small memory +footprint by using a compact generative model of the base classes training data +with only a negligible degradation relative to learning with the full training set.",/pdf/8011e34779de8251fb5437de8271b45ae05d4ea3.pdf,ICLR,2018," In this paper, we address the problem of Low-shot network-expansion learning" +HJxKhyStPH,r1eg7fyFDB,1569440000000.0,1577170000000.0,1959,Toward Understanding The Effect of Loss Function on The Performance of Knowledge Graph Embedding,"[""nayyeri@cs.uni-bonn.de"", ""xuc@cs.uni-bonn.de"", ""yayaghoo@microsoft.com"", ""shariat@cs.uni-bonn.de"", ""jens.lehmann@cs.uni-bonn.de""]","[""Mojtaba Nayyeri"", ""Chengjin Xu"", ""Yadollah Yaghoobzadeh"", ""Hamed Shariat Yazdi"", ""Jens Lehmann""]","[""Knowledge graph embedding"", ""Translation based embedding"", ""loss function"", ""relation pattern""]","Knowledge graphs (KGs) represent world's facts in structured forms. KG completion exploits the existing facts in a KG to discover new ones. Translation-based embedding model (TransE) is a prominent formulation to do KG completion. +Despite the efficiency of TransE in memory and time, it suffers from several limitations in encoding relation patterns such as symmetric, reflexive etc. To resolve this problem, most of the attempts have circled around the revision of the score function of TransE i.e., proposing a more complicated score function such as Trans(A, D, G, H, R, etc) to mitigate the limitations. In this paper, we tackle this problem from a different perspective. We show that existing theories corresponding to the limitations of TransE are inaccurate because they ignore the effect of loss function. 
Accordingly, we pose theoretical investigations of the main limitations of TransE in the light of loss function. To the best of our knowledge, this has not been investigated so far comprehensively. We show that by a proper selection of the loss function for training the TransE model, the main limitations of the model are mitigated. This is explained by setting upper-bound for the scores of positive samples, showing the region of truth (i.e., the region that a triple is considered positive by the model). +Our theoretical proofs with experimental results fill the gap between the capability of translation-based class of embedding models and the loss function. The theories emphasis the importance of the selection of the loss functions for training the models. Our experimental evaluations on different loss functions used for training the models justify our theoretical proofs and confirm the importance of the loss functions on the performance. + +",/pdf/3aef71ad0ba460e0008c0e5c525a43e03244680a.pdf,ICLR,2020, +ryZ283gAZ,rJl3LnxCZ,1509110000000.0,1518730000000.0,392,Beyond Finite Layer Neural Networks: Bridging Deep Architectures and Numerical Differential Equations,"[""luyiping9712@pku.edu.cn"", ""zhongaoxiao@gmail.com"", ""quanzhengli5@gmail.com"", ""dongbin@math.pku.edu.cn""]","[""Yiping Lu"", ""Aoxiao Zhong"", ""Quanzheng Li"", ""Bin Dong""]","[""deep convolutional network"", ""residual network"", ""dynamic system"", ""stochastic dynamic system"", ""modified equation""]","Deep neural networks have become the state-of-the-art models in numerous machine learning tasks. However, general guidance to network architecture design is still missing. In our work, we bridge deep neural network design with numerical differential equations. We show that many effective networks, such as ResNet, PolyNet, FractalNet and RevNet, can be interpreted as different numerical discretizations of differential equations. 
This finding brings us a brand new perspective on the design of effective deep architectures. We can take advantage of the rich knowledge in numerical analysis to guide us in designing new and potentially more effective deep networks. As an example, we propose a linear multi-step architecture (LM-architecture) which is inspired by the linear multi-step method solving ordinary differential equations. The LM-architecture is an effective structure that can be used on any ResNet-like networks. In particular, we demonstrate that LM-ResNet and LM-ResNeXt (i.e. the networks obtained by applying the LM-architecture on ResNet and ResNeXt respectively) can achieve noticeably higher accuracy than ResNet and ResNeXt on both CIFAR and ImageNet with comparable numbers of trainable parameters. In particular, on both CIFAR and ImageNet, LM-ResNet/LM-ResNeXt can significantly compress (>50%) the original networks while maintaining a similar performance. This can be explained mathematically using the concept of modified equation from numerical analysis. Last but not least, we also establish a connection between stochastic control and noise injection in the training process which helps to improve generalization of the networks. Furthermore, by relating stochastic training strategy with stochastic dynamic system, we can easily apply stochastic training to the networks with the LM-architecture. As an example, we introduced stochastic depth to LM-ResNet and achieve significant improvement over the original LM-ResNet on CIFAR10.",/pdf/e757d4b34fd712fdae365e67c8ae6e841dd1761b.pdf,ICLR,2018,This paper bridges deep network architectures with numerical (stochastic) differential equations. This new perspective enables new designs of more effective deep neural networks. 
+BJll6o09tm,r1lsWwTcKm,1538090000000.0,1545360000000.0,769,Padam: Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks,"[""jc4zg@virginia.edu"", ""qgu@cs.ucla.edu""]","[""Jinghui Chen"", ""Quanquan Gu""]",[],"Adaptive gradient methods, which adopt historical gradient information to automatically adjust the learning rate, despite the nice property of fast convergence, have been observed to generalize worse than stochastic gradient descent (SGD) with momentum in training deep neural networks. This leaves how to close the generalization gap of adaptive gradient methods an open problem. In this work, we show that adaptive gradient methods such as Adam, Amsgrad, are sometimes ""over adapted"". We design a new algorithm, called Partially adaptive momentum estimation method (Padam), which unifies the Adam/Amsgrad with SGD by introducing a partial adaptive parameter p, to achieve the best from both worlds. Experiments on standard benchmarks show that Padam can maintain fast convergence rate as Adam/Amsgrad while generalizing as well as SGD in training deep neural networks. These results would suggest practitioners pick up adaptive gradient methods once again for faster training of deep neural networks.",/pdf/bebe4e072efb718a2cb702809a8f9297eb631f4a.pdf,ICLR,2019, +BJcAWaeCW,ByjSgTeCZ,1509110000000.0,1518730000000.0,407,Graph Topological Features via GAN,"[""weiyiliu@us.ibm.com"", ""hal.cooper@columbia.edu"", ""m.oh@columbia.edu""]","[""Weiyi Liu"", ""Hal Cooper"", ""Min-Hwan Oh""]","[""graph topology"", ""GAN"", ""network science"", ""hierarchical learning""]","Inspired by the success of generative adversarial networks (GANs) in image domains, we introduce a novel hierarchical architecture for learning characteristic topological features from a single arbitrary input graph via GANs. 
The hierarchical architecture consisting of multiple GANs preserves both local and global topological features, and automatically partitions the input graph into representative stages for feature learning. The stages facilitate reconstruction and can be used as indicators of the importance of the associated topological structures. Experiments show that our method produces subgraphs retaining a wide range of topological features, even in early reconstruction stages. This paper contains original research on combining the use of GANs and graph topological analysis.",/pdf/e17e58e9643cc9564132bf52721717e6dbca1c70.pdf,ICLR,2018,A GAN based method to learn important topological features of an arbitrary input graph. +H1Heentlx,,1478240000000.0,1481750000000.0,133,Deep Variational Canonical Correlation Analysis,"[""weiranwang@ttic.edu"", ""xcyan@umich.edu"", ""honglak@umich.edu"", ""klivescu@ttic.edu""]","[""Weiran Wang"", ""Xinchen Yan"", ""Honglak Lee"", ""Karen Livescu""]",[],"We present deep variational canonical correlation analysis (VCCA), a deep multi-view learning model that extends the latent variable model interpretation of linear CCA~\citep{BachJordan05a} to nonlinear observation models parameterized by deep neural networks (DNNs). Computing the marginal data likelihood, as well as inference of the latent variables, are intractable under this model. We derive a variational lower bound of the data likelihood by parameterizing the posterior density of the latent variables with another DNN, and approximate the lower bound via Monte Carlo sampling. Interestingly, the resulting model resembles that of multi-view autoencoders~\citep{Ngiam_11b}, with the key distinction of an additional sampling procedure at the bottleneck layer. We also propose a variant of VCCA called VCCA-private which can, in addition to the ``common variables'' underlying both views, extract the ``private variables'' within each view. 
We demonstrate that VCCA-private is able to disentangle the shared and private information for multi-view data without hard supervision.",/pdf/0cd74c0874a098f3a4c9d86ed3b4490e86bd86ff.pdf,ICLR,2017,A deep generative model for multi-view representation learning +rJeA_aVtPB,Hyxrz92DvH,1569440000000.0,1577170000000.0,652,Decaying momentum helps neural network training,"[""jc114@rice.edu"", ""anastasios@rice.edu""]","[""John Chen"", ""Anastasios Kyrillidis""]","[""sgd"", ""momentum"", ""adam"", ""optimization"", ""deep learning""]","Momentum is a simple and popular technique in deep learning for gradient-based optimizers. We propose a decaying momentum (Demon) rule, motivated by decaying the total contribution of a gradient to all future updates. Applying Demon to Adam leads to significantly improved training, notably competitive to momentum SGD with learning rate decay, even in settings in which adaptive methods are typically non-competitive. Similarly, applying Demon to momentum SGD rivals momentum SGD with learning rate decay, and in many cases leads to improved performance. Demon is trivial to implement and incurs limited extra computational overhead, compared to the vanilla counterparts. ",/pdf/532c6312754de5c876a8e1f248fd3bf2fa7c4d31.pdf,ICLR,2020,We introduce a momentum decay rule which significantly improves the performance of Adam and momentum SGD +ryGpEiAcFQ,HyeSDTwFdQ,1538090000000.0,1545360000000.0,41,A Synaptic Neural Network and Synapse Learning,"[""changli@neatware.com""]","[""Chang Li""]","[""synaptic neural network"", ""surprisal"", ""synapse"", ""probability"", ""excitation"", ""inhibition"", ""synapse learning"", ""bose-einstein distribution"", ""tensor"", ""gradient"", ""loss function"", ""mnist"", ""topologically conjugate""]","A Synaptic Neural Network (SynaNN) consists of synapses and neurons. 
Inspired by the synapse research of neuroscience, we built a synapse model with a nonlinear synapse function of excitatory and inhibitory channel probabilities. Introduced the concept of surprisal space and constructed a commutative diagram, we proved that the inhibitory probability function -log(1-exp(-x)) in surprisal space is the topologically conjugate function of the inhibitory complementary probability 1-x in probability space. Furthermore, we found that the derivative of the synapse over the parameter in the surprisal space is equal to the negative Bose-Einstein distribution. In addition, we constructed a fully connected synapse graph (tensor) as a synapse block of a synaptic neural network. Moreover, we proved the gradient formula of a cross-entropy loss function over parameters, so synapse learning can work with the gradient descent and backpropagation algorithms. In the proof-of-concept experiment, we performed an MNIST training and testing on the MLP model with synapse network as hidden layers.",/pdf/d3fe1f5ae0cbd47f0823840b397fbf1dd8476f0d.pdf,ICLR,2019,A synaptic neural network with synapse graph and learning that has the feature of topological conjugation and Bose-Einstein distribution in surprisal space. +HygjqjR9Km,SkxDo2iqFm,1538090000000.0,1554070000000.0,565,Improving MMD-GAN Training with Repulsive Loss Function,"[""weiw8@student.unimelb.edu.au"", ""yuan.sun@rmit.edu.au"", ""saman@unimelb.edu.au""]","[""Wei Wang"", ""Yuan Sun"", ""Saman Halgamuge""]","[""generative adversarial nets"", ""loss function"", ""maximum mean discrepancy"", ""image generation"", ""unsupervised learning""]","Generative adversarial nets (GANs) are widely used to learn the data sampling process and their performance may heavily depend on the loss functions, given a limited computational budget. This study revisits MMD-GAN that uses the maximum mean discrepancy (MMD) as the loss function for GAN and makes two contributions. 
First, we argue that the existing MMD loss function may discourage the learning of fine details in data as it attempts to contract the discriminator outputs of real data. To address this issue, we propose a repulsive loss function to actively learn the difference among the real data by simply rearranging the terms in MMD. Second, inspired by the hinge loss, we propose a bounded Gaussian kernel to stabilize the training of MMD-GAN with the repulsive loss function. The proposed methods are applied to the unsupervised image generation tasks on CIFAR-10, STL-10, CelebA, and LSUN bedroom datasets. Results show that the repulsive loss function significantly improves over the MMD loss at no additional computational cost and outperforms other representative loss functions. The proposed methods achieve an FID score of 16.21 on the CIFAR-10 dataset using a single DCGAN network and spectral normalization.",/pdf/291201361e0ceb623bf792ba5fc6d875800ec823.pdf,ICLR,2019,Rearranging the terms in maximum mean discrepancy yields a much better loss function for the discriminator of generative adversarial nets +B1ewdt9xe,,1478300000000.0,1488330000000.0,508,Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning,"[""lotter@fas.harvard.edu"", ""gabriel.kreiman@tch.harvard.edu"", ""davidcox@fas.harvard.edu""]","[""William Lotter"", ""Gabriel Kreiman"", ""David Cox""]",[],"While great strides have been made in using deep learning algorithms to solve supervised learning tasks, the problem of unsupervised learning - leveraging unlabeled examples to learn about the structure of a domain - remains a difficult unsolved challenge. Here, we explore prediction of future frames in a video sequence as an unsupervised learning rule for learning about the structure of the visual world. We describe a predictive neural network (""PredNet"") architecture that is inspired by the concept of ""predictive coding"" from the neuroscience literature. 
These networks learn to predict future frames in a video sequence, with each layer in the network making local predictions and only forwarding deviations from those predictions to subsequent network layers. We show that these networks are able to robustly learn to predict the movement of synthetic (rendered) objects, and that in doing so, the networks learn internal representations that are useful for decoding latent object parameters (e.g. pose) that support object recognition with fewer training views. We also show that these networks can scale to complex natural image streams (car-mounted camera videos), capturing key aspects of both egocentric movement and the movement of objects in the visual scene, and the representation learned in this setting is useful for estimating the steering angle. These results suggest that prediction represents a powerful framework for unsupervised learning, allowing for implicit learning of object and scene structure.",/pdf/277fb50ed09a36b54842b8129c0f9027fd7edc53.pdf,ICLR,2017, +H1gpET4YDB,rJeKxOEvwr,1569440000000.0,1577170000000.0,504,Blockwise Self-Attention for Long Document Understanding,"[""qiujz16@mails.tsinghua.edu.cn"", ""gabe.hao.ma@gmail.com"", ""omerlevy@gmail.com"", ""scottyih@gmail.com"", ""sinongwang@fb.com"", ""jietang@tsinghua.edu.cn""]","[""Jiezhong Qiu"", ""Hao Ma"", ""Omer Levy"", ""Scott Wen-tau Yih"", ""Sinong Wang"", ""Jie Tang""]","[""BERT"", ""Transformer""]","We present BlockBERT, a lightweight and efficient BERT model that is designed to better modeling long-distance dependencies. Our model extends BERT by introducing sparse block structures into the attention matrix to reduce both memory consumption and training time, which also enables attention heads to capture either short- or long-range contextual information. We conduct experiments on several benchmark question answering datasets with various paragraph lengths. 
Results show that BlockBERT uses 18.7-36.1% less memory and reduces the training time by 12.0-25.1%, while having comparable and sometimes better prediction accuracy, compared to an advanced BERT-based model, RoBERTa.",/pdf/9ec5ccd2471b84b9db3d91407d6cff5a94cd4282.pdf,ICLR,2020,"We present BlockBERT, a lightweight and efficient BERT model that is designed to better modeling long-distance dependencies." +rJxdQ3jeg,,1478370000000.0,1488550000000.0,573,End-to-end Optimized Image Compression,"[""johannes.balle@nyu.edu"", ""valero.laparra@uv.es"", ""eero.simoncelli@nyu.edu""]","[""Johannes Ball\u00e9"", ""Valero Laparra"", ""Eero P. Simoncelli""]",[],"We describe an image compression method, consisting of a nonlinear analysis transformation, a uniform quantizer, and a nonlinear synthesis transformation. The transforms are constructed in three successive stages of convolutional linear filters and nonlinear activation functions. Unlike most convolutional neural networks, the joint nonlinearity is chosen to implement a form of local gain control, inspired by those used to model biological neurons. Using a variant of stochastic gradient descent, we jointly optimize the entire model for rate-distortion performance over a database of training images, introducing a continuous proxy for the discontinuous loss function arising from the quantizer. Under certain conditions, the relaxed loss function may be interpreted as the log likelihood of a generative model, as implemented by a variational autoencoder. Unlike these models, however, the compression model must operate at any given point along the rate-distortion curve, as specified by a trade-off parameter. Across an independent set of test images, we find that the optimized method generally exhibits better rate-distortion performance than the standard JPEG and JPEG 2000 compression methods. 
More importantly, we observe a dramatic improvement in visual quality for all images at all bit rates, which is supported by objective quality estimates using MS-SSIM.",/pdf/d7df16ec78247c2550d3a0b658b90e6d755089c0.pdf,ICLR,2017, +BJGfCjA5FX,S1esrn69FX,1538090000000.0,1545360000000.0,876,PAIRWISE AUGMENTED GANS WITH ADVERSARIAL RECONSTRUCTION LOSS,"[""alanov.aibek@gmail.com"", ""maxim.v.kochurov@gmail.com"", ""daniil.yashkov@phystech.edu"", ""vetrodim@gmail.com""]","[""Aibek Alanov"", ""Max Kochurov"", ""Daniil Yashkov"", ""Dmitry Vetrov""]","[""Computer vision"", ""Deep learning"", ""Unsupervised Learning"", ""Generative Adversarial Networks""]","We propose a novel autoencoding model called Pairwise Augmented GANs. We train a generator and an encoder jointly and in an adversarial manner. The generator network learns to sample realistic objects. In turn, the encoder network at the same time is trained to map the true data distribution to the prior in latent space. To ensure good reconstructions, we introduce an augmented adversarial reconstruction loss. Here we train a discriminator to distinguish two types of pairs: an object with its augmentation and the one with its reconstruction. We show that such adversarial loss compares objects based on the content rather than on the exact match. We experimentally demonstrate that our model generates samples and reconstructions of quality competitive with state-of-the-art on datasets MNIST, CIFAR10, CelebA and achieves good quantitative results on CIFAR10. ",/pdf/3c6b1ec9342f1cc15195be1c92fffc24cdae9f6b.pdf,ICLR,2019,We propose a novel autoencoding model with augmented adversarial reconstruction loss. We intoduce new metric for content-based assessment of reconstructions. 
+H1ziPjC5Fm,H1l2U9TuYX,1538090000000.0,1550830000000.0,294,Visual Explanation by Interpretation: Improving Visual Feedback Capabilities of Deep Neural Networks,"[""jose.oramas@esat.kuleuven.be"", ""kaili.wang@esat.kuleuven.be"", ""tinne.tuytelaars@esat.kuleuven.be""]","[""Jose Oramas"", ""Kaili Wang"", ""Tinne Tuytelaars""]","[""model explanation"", ""model interpretation"", ""explainable ai"", ""evaluation""]","Visual Interpretation and explanation of deep models is critical towards wide adoption of systems that rely on them. In this paper, we propose a novel scheme for both interpretation as well as explanation in which, given a pretrained model, we automatically identify internal features relevant for the set of classes considered by the model, without relying on additional annotations. We interpret the model through average visualizations of this reduced set of features. Then, at test time, we explain the network prediction by accompanying the predicted class label with supporting visualizations derived from the identified features. In addition, we propose a method to address the artifacts introduced by strided operations in deconvNet-based visualizations. Moreover, we introduce an8Flower , a dataset specifically designed for objective quantitative evaluation of methods for visual explanation. Experiments on the MNIST , ILSVRC 12, Fashion 144k and an8Flower datasets show that our method produces detailed explanations with good coverage of relevant features of the classes of interest.",/pdf/10429c6bce50952c015da49b8685f44b01b5e48a.pdf,ICLR,2019,Interpretation by Identifying model-learned features that serve as indicators for the task of interest. Explain model decisions by highlighting the response of these features in test data. Evaluate explanations objectively with a controlled dataset. +rkpACe1lx,,1477540000000.0,1487260000000.0,8,HyperNetworks,"[""hadavid@google.com"", ""adai@google.com"", ""qvl@google.com""]","[""David Ha"", ""Andrew M. Dai"", ""Quoc V. 
Le""]","[""Natural language processing"", ""Deep learning"", ""Supervised Learning""]","This work explores hypernetworks: an approach of using one network, also known as a hypernetwork, to generate the weights for another network. We apply hypernetworks to generate adaptive weights for recurrent networks. In this case, hypernetworks can be viewed as a relaxed form of weight-sharing across layers. In our implementation, hypernetworks are are trained jointly with the main network in an end-to-end fashion. Our main result is that hypernetworks can generate non-shared weights for LSTM and achieve state-of-the-art results on a variety of sequence modelling tasks including character-level language modelling, handwriting generation and neural machine translation, challenging the weight-sharing paradigm for recurrent networks.",/pdf/bbbae35015016cd02efae22e4caad0aba8e8a2fd.pdf,ICLR,2017,"We train a small RNN to generate weights for a larger RNN, and train the system end-to-end. We obtain state-of-the-art results on a variety of sequence modelling tasks." +SyeZIkrKwS,r1lUZSa_PS,1569440000000.0,1577170000000.0,1718,DyNet: Dynamic Convolution for Accelerating Convolution Neural Networks,"[""zhangyikang5@huawei.com"", ""zhangjian157@huawei.com"", ""wangqiang168@huawei.com"", ""zorro.zhongzhao@huawei.com""]","[""Kane Zhang"", ""Jian Zhang"", ""Qiang Wang"", ""Zhao Zhong""]","[""CNNs"", ""dynamic convolution kernel""]","Convolution operator is the core of convolutional neural networks (CNNs) and occupies the most computation cost. To make CNNs more efficient, many methods have been proposed to either design lightweight networks or compress models. Although some efficient network structures have been proposed, such as MobileNet or ShuffleNet, we find that there still exists redundant information between convolution kernels. 
To address this issue, we propose a novel dynamic convolution method named \textbf{DyNet} in this paper, which can adaptively generate convolution kernels based on image contents. To demonstrate the effectiveness, we apply DyNet on multiple state-of-the-art CNNs. The experiment results show that DyNet can reduce the computation cost remarkably, while maintaining the performance nearly unchanged. Specifically, for ShuffleNetV2 (1.0), MobileNetV2 (1.0), ResNet18 and ResNet50, DyNet reduces 40.0%, 56.7%, 68.2% and 72.4% FLOPs respectively while the Top-1 accuracy on ImageNet only changes by +1.0%, -0.27%, -0.6% and -0.08%. Meanwhile, DyNet further accelerates the inference speed of MobileNetV2 (1.0), ResNet18 and ResNet50 by 1.87x,1.32x and 1.48x on CPU platform respectively. To verify the scalability, we also apply DyNet on segmentation task, the results show that DyNet can reduces 69.3% FLOPs while maintaining the Mean IoU on segmentation task.",/pdf/1805cd4025104e121d1f8ca13b2d67fe1bb8b534.pdf,ICLR,2020,We propose a dynamic convolution method to significantly accelerate inference time of CNNs while maintaining the accuracy. +L2LEB4vd9Qw,eOd2awnTsUH,1601310000000.0,1614990000000.0,2084,Multimodal Attention for Layout Synthesis in Diverse Domains,"[""~Kamal_Gupta1"", ""~Vijay_Mahadevan1"", ""~Alessandro_Achille1"", ""~Justin_Lazarow1"", ""~Larry_S._Davis1"", ""~Abhinav_Shrivastava2""]","[""Kamal Gupta"", ""Vijay Mahadevan"", ""Alessandro Achille"", ""Justin Lazarow"", ""Larry S. Davis"", ""Abhinav Shrivastava""]","[""layout generation"", ""layout synthesis"", ""multimodal attention"", ""transformers"", ""document layouts"", ""generative model"", ""3D""]","We address the problem of scene layout generation for diverse domains such as images, mobile applications, documents and 3D objects. Most complex scenes, natural or human-designed, can be expressed as a meaningful arrangement of simpler compositional graphical primitives. 
Generating a new layout or extending an existing layout requires understanding the relationships between these primitives. To do this, we propose a multimodal attention framework, MMA, that leverages self-attention to learn contextual relationships between layout elements and generate novel layouts in a given domain. Our framework allows us to generate a new layout either from an empty set or from an initial seed set of primitives, and can easily scale to support an arbitrary of primitives per layout. Further, our analyses show that the model is able to automatically capture the semantic properties of the primitives. We propose simple improvements in both representation of layout primitives, as well as training methods to demonstrate competitive performance in very diverse data domains such as object bounding boxes in natural images (COCO bounding boxes), documents (PubLayNet), mobile applications (RICO dataset) as well as 3D shapes (PartNet).",/pdf/7148445c3c9718b886b2066e8d5fed0f5fcb2e6b.pdf,ICLR,2021,"A simple robust generative model for layouts; results on diverse real world datasets (3D objects, image, document layouts, mobile app wireframes)" +o1O5nc48rn,vwrUvdWNWyu,1601310000000.0,1614990000000.0,2320,Optimal Transport Graph Neural Networks,"[""~Gary_B\u00e9cigneul1"", ""~Octavian-Eugen_Ganea1"", ""~Benson_Chen1"", ""~Regina_Barzilay1"", ""~Tommi_S._Jaakkola1""]","[""Gary B\u00e9cigneul"", ""Octavian-Eugen Ganea"", ""Benson Chen"", ""Regina Barzilay"", ""Tommi S. Jaakkola""]","[""graph neural networks"", ""optimal transport"", ""molecular representations"", ""molecular property prediction""]","Current graph neural network (GNN) architectures naively average or sum node embeddings into an aggregated graph representation---potentially losing structural or semantic information. We here introduce OT-GNN, a model that computes graph embeddings using parametric prototypes that highlight key facets of different graph aspects. 
Towards this goal, we are (to our knowledge) the first to successfully combine optimal transport with parametric graph models. Graph representations are obtained from Wasserstein distances between the set of GNN node embeddings and ""prototype"" point clouds as free parameters. We theoretically prove that, unlike traditional sum aggregation, our function class on point clouds satisfies a fundamental universal approximation theorem. Empirically, we address an inherent collapse optimization issue by proposing a noise contrastive regularizer to steer the model towards truly exploiting the optimal transport geometry. Finally, we consistently report better generalization performance on several molecular property prediction tasks, while exhibiting smoother graph representations.",/pdf/c125c6180505c776a458e56b5eeeecc0737c5a0c.pdf,ICLR,2021,We compute graph representations based on abstract prototypes that leverage optimal transport and graph neural networks. +B1lXfA4Ywr,HkeGkhNdPB,1569440000000.0,1577170000000.0,996,Towards Modular Algorithm Induction,"[""danabo@google.com"", ""rising@google.com"", ""manzilzaheer@google.com"", ""charlessutton@google.com""]","[""Daniel A. Abolafia"", ""Rishabh Singh"", ""Manzil Zaheer"", ""Charles Sutton""]","[""algorithm induction"", ""reinforcement learning"", ""program synthesis"", ""modular""]","We present a modular neural network architecture MAIN that learns algorithms given a set of input-output examples. MAIN consists of a neural controller that interacts with a variable-length input tape and learns to compose modules together with their corresponding argument choices. Unlike previous approaches, MAIN uses a general domain-agnostic mechanism for selection of modules and their arguments. It uses a general input tape layout together with a parallel history tape to indicate most recently used locations. 
Finally, it uses a memoryless controller with a length-invariant self-attention based input tape encoding to allow for random access to tape locations. The MAIN architecture is trained end-to-end using reinforcement learning from a set of input-output examples. We evaluate MAIN on five algorithmic tasks and show that it can learn policies that generalizes perfectly to inputs of much longer lengths than the ones used for training.",/pdf/6b97229cdf77f9fbae8046d3c6c1bf1614a4272a.pdf,ICLR,2020,An architecture for learning to compose modules to learn algorithmic tasks. +Bylh2krYPr,Bkg9tMyFDr,1569440000000.0,1577170000000.0,1966,Probing Emergent Semantics in Predictive Agents via Question Answering,"[""abhshkdz@gatech.edu"", ""fedecarnev@google.com"", ""hamzamerzic@google.com"", ""laurarimell@google.com"", ""rgschneider@google.com"", ""aldenhung@google.com"", ""jabramson@google.com"", ""arahuja@google.com"", ""clarkstephen@google.com"", ""gregwayne@google.com"", ""felixhill@google.com""]","[""Abhishek Das"", ""Federico Carnevale"", ""Hamza Merzic"", ""Laura Rimell"", ""Rosalia Schneider"", ""Alden Hung"", ""Josh Abramson"", ""Arun Ahuja"", ""Stephen Clark"", ""Greg Wayne"", ""Felix Hill""]","[""question-answering"", ""predictive models""]","Recent work has demonstrated how predictive modeling can endow agents with rich knowledge of their surroundings, improving their ability to act in complex environments. We propose question-answering as a general paradigm to decode and understand the representations that such agents develop, applying our method to two recent approaches to predictive modeling – action-conditional CPC (Guo et al., 2018) and SimCore (Gregor et al., 2019). 
After training agents with these predictive objectives in a visually-rich, 3D environment with an assortment of objects, colors, shapes, and spatial configurations, we probe their internal state representations with a host of synthetic (English) questions, without backpropagating gradients from the question-answering decoder into the agent. The performance of different agents when probed in this way reveals that they learn to encode detailed, and seemingly compositional, information about objects, properties and spatial relations from their physical environment. Our approach is intuitive, i.e. humans can easily interpret the responses of the model as opposed to inspecting continuous vectors, and model-agnostic, i.e. applicable to any modeling approach. By revealing the implicit knowledge of objects, quantities, properties and relations acquired by agents as they learn, question-conditional agent probing can stimulate the design and development of stronger predictive learning objectives.",/pdf/a0ad008cef39edcc2ed671b248e4a77357563b75.pdf,ICLR,2020,We use question-answering to evaluate how much knowledge about the environment can agents learn by self-supervised prediction. +BJe-unNYPr,r1xOwhx18H,1569440000000.0,1577170000000.0,31,Accelerated Information Gradient flow,"[""zackwang24@pku.edu.cn"", ""wcli@math.ucla.edu""]","[""Yifei Wang"", ""Wuchen Li""]","[""Optimal transport"", ""Information geometry"", ""Nesterov accelerated gradient method""]","We present a systematic framework for the Nesterov's accelerated gradient flows in the spaces of probabilities embedded with information metrics. Here two metrics are considered, including both the Fisher-Rao metric and the Wasserstein-$2$ metric. For the Wasserstein-$2$ metric case, we prove the convergence properties of the accelerated gradient flows, and introduce their formulations in Gaussian families. 
Furthermore, we propose a practical discrete-time algorithm in particle implementations with an adaptive restart technique. We formulate a novel bandwidth selection method, which learns the Wasserstein-$2$ gradient direction from Brownian-motion samples. Experimental results including Bayesian inference show the strength of the current method compared with the state-of-the-art.",/pdf/8898485c7f9f40f0bfe944c6ce3cd9bb7e2e034b.pdf,ICLR,2020,We study the accelerated gradient flows in the probability space. +TSRTzJnuEBS,#NAME?,1601310000000.0,1613010000000.0,1024,Anytime Sampling for Autoregressive Models via Ordered Autoencoding,"[""~Yilun_Xu1"", ""~Yang_Song1"", ""~Sahaj_Garg1"", ""gonglinyuan@hotmail.com"", ""~Rui_Shu1"", ""~Aditya_Grover1"", ""~Stefano_Ermon1""]","[""Yilun Xu"", ""Yang Song"", ""Sahaj Garg"", ""Linyuan Gong"", ""Rui Shu"", ""Aditya Grover"", ""Stefano Ermon""]",[],"Autoregressive models are widely used for tasks such as image and audio generation. The sampling process of these models, however, does not allow interruptions and cannot adapt to real-time computational resources. This challenge impedes the deployment of powerful autoregressive models, which involve a slow sampling process that is sequential in nature and typically scales linearly with respect to the data dimension. To address this difficulty, we propose a new family of autoregressive models that enables anytime sampling. Inspired by Principal Component Analysis, we learn a structured representation space where dimensions are ordered based on their importance with respect to reconstruction. Using an autoregressive model in this latent space, we trade off sample quality for computational efficiency by truncating the generation process before decoding into the original data space. Experimentally, we demonstrate in several image and audio generation tasks that sample quality degrades gracefully as we reduce the computational budget for sampling. 
The approach suffers almost no loss in sample quality (measured by FID) using only 60\% to 80\% of all latent dimensions for image data. Code is available at https://github.com/Newbeeer/Anytime-Auto-Regressive-Model.",/pdf/1367785ee095d8df274476d18c53a21ad0d67173.pdf,ICLR,2021,We propose a new family of autoregressive model that enables anytime sampling +HJerDj05tQ,HJxCSAfctQ,1538090000000.0,1545360000000.0,259,Optimization on Multiple Manifolds,"[""yimingyang17@mails.ucas.edu.cn"", ""huishuai.zhang@microsoft.com"", ""wche@microsoft.com"", ""mazm@amt.ac.cn"", ""tie-yan.liu@mircosoft.com""]","[""Mingyang Yi"", ""Huishuai Zhang"", ""Wei Chen"", ""Zhi-ming Ma"", ""Tie-yan Liu""]","[""Optimization"", ""Multiple constraints"", ""Manifold""]","Optimization on manifold has been widely used in machine learning, to handle optimization problems with constraint. Most previous works focus on the case with a single manifold. However, in practice it is quite common that the optimization problem involves more than one constraints, (each constraint corresponding to one manifold). It is not clear in general how to optimize on multiple manifolds effectively and provably especially when the intersection of multiple manifolds is not a manifold or cannot be easily calculated. We propose a unified algorithm framework to handle the optimization on multiple manifolds. Specifically, we integrate information from multiple manifolds and move along an ensemble direction by viewing the information from each manifold as a drift and adding them together. We prove the convergence properties of the proposed algorithms. We also apply the algorithms into training neural network with batch normalization layers and achieve preferable empirical results.",/pdf/60f5791c370892574630a44e93bfd015619d4010.pdf,ICLR,2019,This paper introduces an algorithm to handle optimization problem with multiple constraints under vision of manifold. 
+S1eqj1SKvr,S1xXlkktwH,1569440000000.0,1577170000000.0,1925,TOWARDS FEATURE SPACE ADVERSARIAL ATTACK,"[""xu1230@purdue.edu"", ""taog@purdue.edu"", ""516030910472@sjtu.edu.cn"", ""lintan@purdue.edu"", ""xyzhang@cs.purdue.edu""]","[""Qiuling Xu"", ""Guanhong Tao"", ""Siyuan Cheng"", ""Lin Tan"", ""Xiangyu Zhang""]",[],"We propose a new type of adversarial attack to Deep Neural Networks (DNNs) for image classification. Different from most existing attacks that directly perturb input pixels. Our attack focuses on perturbing abstract features, more specifically, features that denote styles, including interpretable styles such as vivid colors and sharp outlines, and uninterpretable ones. It induces model misclassfication by injecting style changes insensitive for humans, through an optimization procedure. We show that state-of-the-art pixel space adversarial attack detection and defense techniques are ineffective in guarding against feature space attacks. ",/pdf/ae90756467d49a5e64354909a228e90836147ab4.pdf,ICLR,2020, +K6YbHUIWHOy,SahILU0Fokk,1601310000000.0,1614990000000.0,2966,Memory Augmented Design of Graph Neural Networks,"[""~Tao_Xiong3"", ""tailiang.zl@antgroup.com"", ""~Ruofan_Wu1"", ""yuan.qi@antgroup.com""]","[""Tao Xiong"", ""Liang Zhu"", ""Ruofan Wu"", ""Yuan Qi""]",[],"The expressive power of graph neural networks (GNN) has drawn much interest recently. Most existent work focused on measuring the expressiveness of GNN through the task of distinguishing between graphs. In this paper, we inspect the representation limits of locally unordered messaging passing (LUMP) GNN architecture through the lens of \emph{node classification}. For GNNs based on permutation invariant local aggregators, we characterize graph-theoretic conditions under which such GNNs fail to discriminate simple instances, regardless of underlying architecture or network depth. 
To overcome this limitation, we propose a novel framework to augment GNNs with global graph information called \emph{memory augmentation}. Specifically, we allow every node in the original graph to interact with a group of memory nodes. For each node, information from all the other nodes in the graph can be gleaned through the relay of the memory nodes. For proper backbone architectures like GAT and GCN, memory augmented GNNs are theoretically shown to be more expressive than LUMP GNNs. Empirical evaluations demonstrate the significant improvement of memory augmentation. In particular, memory augmented GAT and GCN are shown to either outperform or closely match state-of-the-art performance across various benchmark datasets. ",/pdf/f80d81f81998f7ba9e82d8da5239a3a3f84a66bc.pdf,ICLR,2021, +zgGmAx9ZcY,nj6eFVf0Dv-,1601310000000.0,1614990000000.0,202,Learning the Connections in Direct Feedback Alignment,"[""~Matthew_Bailey_Webster1"", ""~Jonghyun_Choi1"", ""cwan@gist.ac.kr""]","[""Matthew Bailey Webster"", ""Jonghyun Choi"", ""changwook Ahn""]","[""Deep Learning"", ""Feedback Alignment"", ""Backpropagation""]","Feedback alignment was proposed to address the biological implausibility of the backpropagation algorithm which requires the transportation of the weight transpose during the backwards pass. The idea was later built upon with the proposal of direct feedback alignment (DFA), which propagates the error directly from the output layer to each hidden layer in the backward path using a fixed random weight matrix. This contribution was significant because it allowed for the parallelization of the backwards pass by the use of these feedback connections. However, just as feedback alignment, DFA does not perform well in deep convolutional networks. 
We propose to learn the backward weight matrices in DFA, adopting the methodology of Kolen-Pollack learning, to improve training and inference accuracy in deep convolutional neural networks by updating the direct feedback connections such that they come to estimate the forward path. The proposed method improves the accuracy of learning by direct feedback connections and reduces the gap between parallel training to serial training by means of backpropagation.",/pdf/94cf73a1fccb00a73e96f6e8ec01fd0d1e1948c8.pdf,ICLR,2021,"We improve upon the direct feedback alignment approach, and show that our method can more effectively train convolutional networks on larger datasets such as CIFAR100." +BylPSkHKvB,HylX7z6dDS,1569440000000.0,1577170000000.0,1695,Natural- to formal-language generation using Tensor Product Representations,"[""kezhenchen2021@u.northwestern.edu"", ""qihua@microsoft.com"", ""hpalangi@microsoft.com"", ""paul.smolensky@gmail.com"", ""forbus@northwestern.edu"", ""jfgao@microsoft.com""]","[""Kezhen Chen"", ""Qiuyuan Huang"", ""Hamid Palangi"", ""Paul Smolensky"", ""Kenneth D. Forbus"", ""Jianfeng Gao""]","[""Neural Symbolic Reasoning"", ""Deep Learning"", ""Natural Language Processing"", ""Structural Representation"", ""Interpretation of Learned Representations""]","Generating formal-language represented by relational tuples, such as Lisp programs or mathematical expressions, from a natural-language input is an extremely challenging task because it requires to explicitly capture discrete symbolic structural information from the input to generate the output. Most state-of-the-art neural sequence models do not explicitly capture such structure information, and thus do not perform well on these tasks. In this paper, we propose a new encoder-decoder model based on Tensor Product Representations (TPRs) for Natural- to Formal-language generation, called TP-N2F. 
The encoder of TP-N2F employs TPR 'binding' to encode natural-language symbolic structure in vector space and the decoder uses TPR 'unbinding' to generate a sequence of relational tuples, each consisting of a relation (or operation) and a number of arguments, in symbolic space. TP-N2F considerably outperforms LSTM-based Seq2Seq models, creating a new state of the art results on two benchmarks: the MathQA dataset for math problem solving, and the AlgoList dataset for program synthesis. Ablation studies show that improvements are mainly attributed to the use of TPRs in both the encoder and decoder to explicitly capture relational structure information for symbolic reasoning. ",/pdf/4330a040585aca35513b257a9d848209ddfd703e.pdf,ICLR,2020,"In this paper, we propose a new encoder-decoder model based on Tensor Product Representations for Natural- to Formal-language generation, called TP-N2F." +SJfFTjA5KQ,BJgMD039FX,1538090000000.0,1545360000000.0,825,Unification of Recurrent Neural Network Architectures and Quantum Inspired Stable Design ,"[""yzniu@mit.edu"", ""lhoresh@us.ibm.com"", ""michael.okeeffe@ll.mit.edu"", ""ichuang@mit.edu""]","[""Murphy Yuezhen Niu"", ""Lior Horesh"", ""Michael O'Keeffe"", ""Isaac Chuang""]","[""theory and analysis of RNNs architectures"", ""reversibe evolution"", ""stability of deep neural network"", ""learning representations of outputs or states"", ""quantum inspired embedding""]","Various architectural advancements in the design of recurrent neural networks~(RNN) have been focusing on improving the empirical stability and representability by sacrificing the complexity of the architecture. However, more remains to be done to fully understand the fundamental trade-off between these conflicting requirements. Towards answering this question, we forsake the purely bottom-up approach of data-driven machine learning to understand, instead, the physical origin and dynamical properties of existing RNN architectures. 
This facilitates designing new RNNs with smaller complexity overhead and provable stability guarantee. First, we define a family of deep recurrent neural networks, $n$-$t$-ORNN, according to the order of nonlinearity $n$ and the range of temporal memory scale $t$ in their underlying dynamics embodied in the form of discretized ordinary differential equations. We show that most of the existing proposals of RNN architectures belong to different orders of $n$-$t$-ORNNs. We then propose a new RNN ansatz, namely the Quantum-inspired Universal computing Neural Network~(QUNN), to leverage the reversibility, stability, and universality of quantum computation for stable and universal RNN. QUNN provides a complexity reduction in the number of training parameters from being polynomial in both data and correlation time to only linear in correlation time. Compared to Long-Short-Term Memory (LSTM), QUNN of the same number of hidden layers facilitates higher nonlinearity and longer memory span with provable stability. Our work opens new directions in designing minimal RNNs based on additional knowledge about the dynamical nature of both the data and different training architectures.",/pdf/5a26b67c5ba83aaf758ec469732814d4541556d1.pdf,ICLR,2019,"We provide theoretical proof of various recurrent neural network designs representable dynamics' nonlinearity and memory scale, and propose a new RNN ansatz inspired by quantum physics." +SJx4Ogrtvr,HJeps3etwS,1569440000000.0,1577170000000.0,2393,Random Bias Initialization Improving Binary Neural Network Training,"[""xinlin.li1@huawei.com"", ""vahid.partovinia@huawei.com""]","[""Xinlin Li"", ""Vahid Partovi Nia""]","[""Binarized Neural Network"", ""Activation function"", ""Initialization"", ""Neural Network Acceleration""]","Edge intelligence especially binary neural network (BNN) has attracted considerable attention of the artificial intelligence community recently. 
BNNs significantly reduce the computational cost, model size, and memory footprint. However, there is still a performance gap between the successful full-precision neural network with ReLU activation and BNNs. We argue that the accuracy drop of BNNs is due to their geometry. +We analyze the behaviour of the full-precision neural network with ReLU activation and compare it with its binarized counterpart. This comparison suggests random bias initialization as a remedy to activation saturation in full-precision networks and leads us towards an improved BNN training. Our numerical experiments confirm our geometric intuition.",/pdf/a5406188b7c10c122b77bcd0a49b4de43e033c4a.pdf,ICLR,2020,"Improve saturating activations (sigmoid, tanh, htanh etc.) and Binarized Neural Network with Bias Initialization" +BJe1334YDH,rygmHBr-vS,1569440000000.0,1606250000000.0,173,A Learning-based Iterative Method for Solving Vehicle Routing Problems,"[""haolu@princeton.edu"", ""xingwen.zhang@antfin.com"", ""shuang.yang@antfin.com""]","[""Hao Lu"", ""Xingwen Zhang"", ""Shuang Yang""]","[""vehicle routing"", ""reinforcement learning"", ""optimization"", ""heuristics""]","This paper is concerned with solving combinatorial optimization problems, in particular, the capacitated vehicle routing problems (CVRP). Classical Operations Research (OR) algorithms such as LKH3 \citep{helsgaun2017extension} are inefficient and difficult to scale to larger-size problems. Machine learning based approaches have recently shown to be promising, partly because of their efficiency (once trained, they can perform solving within minutes or even seconds). However, there is still a considerable gap between the quality of a machine learned solution and what OR methods can offer (e.g., on CVRP-100, the best result of learned solutions is between 16.10-16.80, significantly worse than LKH3's 15.65). 
In this paper, we present ``Learn to Improve'' (L2I), the first learning based approach for CVRP that is efficient in solving speed and at the same time outperforms OR methods. Starting with a random initial solution, L2I learns to iteratively refine the solution with an improvement operator, selected by a reinforcement learning based controller. The improvement operator is selected from a pool of powerful operators that are customized for routing problems. By combining the strengths of the two worlds, our approach achieves the new state-of-the-art results on CVRP, e.g., an average cost of 15.57 on CVRP-100.",/pdf/6d969652fc20c3eeffc285d52534d45921176f8a.pdf,ICLR,2020, +SJggZnRcFQ,B1eVpJO9YQ,1538090000000.0,1549730000000.0,1141,Learning Programmatically Structured Representations with Perceptor Gradients,"[""sv.penkov@gmail.com"", ""s.ramamoorthy@ed.ac.uk""]","[""Svetlin Penkov"", ""Subramanian Ramamoorthy""]","[""representation learning"", ""structured representations"", ""symbols"", ""programs""]","We present the perceptor gradients algorithm -- a novel approach to learning symbolic representations based on the idea of decomposing an agent's policy into i) a perceptor network extracting symbols from raw observation data and ii) a task encoding program which maps the input symbols to output actions. We show that the proposed algorithm is able to learn representations that can be directly fed into a Linear-Quadratic Regulator (LQR) or a general purpose A* planner. Our experimental results confirm that the perceptor gradients algorithm is able to efficiently learn transferable symbolic representations as well as generate new observations according to a semantically meaningful specification. 
+",/pdf/41a5c8299e9b7634f7fe199a57e745ea6af1f263.pdf,ICLR,2019, +C70cp4Cn32,m-eNDIPpcoj,1601310000000.0,1615930000000.0,1804,Multi-Level Local SGD: Distributed SGD for Heterogeneous Hierarchical Networks,"[""~Timothy_Castiglia1"", ""dasa2@rpi.edu"", ""~Stacy_Patterson1""]","[""Timothy Castiglia"", ""Anirban Das"", ""Stacy Patterson""]","[""Machine Learning"", ""Stochastic Gradient Descent"", ""Federated Learning"", ""Hierarchical Networks"", ""Distributed"", ""Heterogeneous"", ""Convergence Analysis""]","We propose Multi-Level Local SGD, a distributed stochastic gradient method for learning a smooth, non-convex objective in a multi-level communication network with heterogeneous workers. Our network model consists of a set of disjoint sub-networks, with a single hub and multiple workers; further, workers may have different operating rates. The hubs exchange information with one another via a connected, but not necessarily complete communication network. In our algorithm, sub-networks execute a distributed SGD algorithm, using a hub-and-spoke paradigm, and the hubs periodically average their models with neighboring hubs. We first provide a unified mathematical framework that describes the Multi-Level Local SGD algorithm. We then present a theoretical analysis of the algorithm; our analysis shows the dependence of the convergence error on the worker node heterogeneity, hub network topology, and the number of local, sub-network, and global iterations. We illustrate the effectiveness of our algorithm in a multi-level network with slow workers via simulation-based experiments.",/pdf/8dbde4c960f1598d19e7201058a77a1224b4a939.pdf,ICLR,2021,"We propose Multi-Level Local SGD, a distributed stochastic gradient method for learning a smooth, non-convex objective in a multi-level communication network with heterogeneous workers." 
+H1sUHgb0Z,S15IBlWAZ,1509130000000.0,1519390000000.0,581,Learning From Noisy Singly-labeled Data,"[""khetan2@illinois.edu"", ""zlipton@cmu.edu"", ""anima@amazon.com""]","[""Ashish Khetan"", ""Zachary C. Lipton"", ""Animashree Anandkumar""]","[""crowdsourcing"", ""noisy annotations"", ""deep leaerning""]","Supervised learning depends on annotated examples, which are taken to be the ground truth. But these labels often come from noisy crowdsourcing platforms, like Amazon Mechanical Turk. Practitioners typically collect multiple labels per example and aggregate the results to mitigate noise (the classic crowdsourcing problem). Given a fixed annotation budget and unlimited unlabeled data, redundant annotation comes at the expense of fewer labeled examples. This raises two fundamental questions: (1) How can we best learn from noisy workers? (2) How should we allocate our labeling budget to maximize the performance of a classifier? We propose a new algorithm for jointly modeling labels and worker quality from noisy crowd-sourced data. The alternating minimization proceeds in rounds, estimating worker quality from disagreement with the current model and then updating the model by optimizing a loss function that accounts for the current estimate of worker quality. Unlike previous approaches, even with only one annotation per example, our algorithm can estimate worker quality. We establish a generalization error bound for models learned with our algorithm and establish theoretically that it's better to label many examples once (vs less multiply) when worker quality exceeds a threshold. Experiments conducted on both ImageNet (with simulated noisy workers) and MS-COCO (using the real crowdsourced labels) confirm our algorithm's benefits. ",/pdf/c73dbdc6c906d00b3e5501fc3aba4afc0cead9c0.pdf,ICLR,2018,A new approach for learning a model from noisy crowdsourced annotations. 
+SklcyJBtvB,SyluByi_PS,1569440000000.0,1577170000000.0,1479,Off-policy Bandits with Deficient Support,"[""ernoveen@gmail.com"", ""ys756@cornell.edu"", ""tj@cs.cornell.edu""]","[""Noveen Sachdeva"", ""Yi Su"", ""Thorsten Joachims""]","[""Recommender System"", ""Search Engine"", ""Counterfactual Learning""]","Off-policy training of contextual-bandit policies is attractive in online systems (e.g. search, recommendation, ad placement), since it enables the reuse of large amounts of log data from the production system. State-of-the-art methods for off-policy learning, however, are based on inverse propensity score (IPS) weighting, which requires that the logging policy chooses all actions with non-zero probability for any context (i.e., full support). In real-world systems, this condition is often violated, and we show that existing off-policy learning methods based on IPS weighting can fail catastrophically. We therefore develop new off-policy contextual-bandit methods that can controllably and robustly learn even when the logging policy has deficient support. To this effect, we explore three approaches that provide various guarantees for safe learning despite the inherent limitations of support deficient data: restricting the action space, reward extrapolation, and restricting the policy space. We analyze the statistical and computational properties of these three approaches, and empirically evaluate their effectiveness in a series of experiments. 
We find that controlling the policy space is both computationally efficient and that it robustly leads to accurate policies.",/pdf/54dfd9cffaedc40c077bab008fd4e0edee01730c.pdf,ICLR,2020, +HJeqhA4YDS,ryedLqKdvr,1569440000000.0,1583910000000.0,1366,Denoising and Regularization via Exploiting the Structural Bias of Convolutional Generators,"[""reinhard.heckel@tum.de"", ""msoltoon@gmail.com""]","[""Reinhard Heckel and Mahdi Soltanolkotabi""]","[""theory for deep learning"", ""convolutional network"", ""deep image prior"", ""deep decoder"", ""dynamics of gradient descent"", ""overparameterization""]","Convolutional Neural Networks (CNNs) have emerged as highly successful tools for image generation, recovery, and restoration. A major contributing factor to this success is that convolutional networks impose strong prior assumptions about natural images. A surprising experiment that highlights this architectural bias towards natural images is that one can remove noise and corruptions from a natural image without using any training data, by simply fitting (via gradient descent) a randomly initialized, over-parameterized convolutional generator to the corrupted image. While this over-parameterized network can fit the corrupted image perfectly, surprisingly after a few iterations of gradient descent it generates an almost uncorrupted image. This intriguing phenomenon enables state-of-the-art CNN-based denoising and regularization of other inverse problems. In this paper, we attribute this effect to a particular architectural choice of convolutional networks, namely convolutions with fixed interpolating filters. We then formally characterize the dynamics of fitting a two-layer convolutional generator to a noisy signal and prove that early-stopped gradient descent denoises/regularizes. Our proof relies on showing that convolutional generators fit the structured part of an image significantly faster than the corrupted portion. 
",/pdf/de835d58f86f19412df72738c87d7ecfe7cbb93c.pdf,ICLR,2020, +FoM-RnF6SNe,TOUqv4gzOsg,1601310000000.0,1614990000000.0,1867,Evaluating Agents Without Rewards,"[""~Brendon_Matusch1"", ""~Jimmy_Ba1"", ""~Danijar_Hafner1""]","[""Brendon Matusch"", ""Jimmy Ba"", ""Danijar Hafner""]","[""reinforcement learning"", ""task-agnostic"", ""agent evaluation"", ""exploration"", ""information gain"", ""empowerment"", ""curiosity""]","Reinforcement learning has enabled agents to solve challenging control tasks from raw image inputs. However, manually crafting reward functions can be time consuming, expensive, and prone to human error. Competing objectives have been proposed for agents to learn without external supervision, such as artificial input entropy, information gain, and empowerment. Estimating these objectives can be challenging and it remains unclear how well they reflect task rewards or human behavior. We study these objectives across seven agents and three Atari games. Retrospectively computing the objectives from the agent's lifetime of experience simplifies accurate estimation. We find that all three objectives correlate more strongly with a human behavior similarity metric than with task reward. Moreover, input entropy and information gain both correlate more strongly with human similarity than task reward does.",/pdf/1df0ac8c509d349740d496fdaf11f5b5c55976be.pdf,ICLR,2021, +r1ghgxHtPH,SJxYf0ktDS,1569440000000.0,1577170000000.0,2113,Blurring Structure and Learning to Optimize and Adapt Receptive Fields,"[""shelhamer@cs.berkeley.edu"", ""dqwang@eecs.berkeley.edu"", ""trevor@eecs.berkeley.edu""]","[""Evan Shelhamer"", ""Dequan Wang"", ""Trevor Darrell""]","[""scale"", ""deep learning"", ""dynamic inference"", ""fully convolutional""]","The visual world is vast and varied, but its variations divide into structured and unstructured factors. 
We compose free-form filters and structured Gaussian filters, optimized end-to-end, to factorize deep representations and learn both local features and their degree of locality. In effect this optimizes over receptive field size and shape, tuning locality to the data and task. Our semi-structured composition is strictly more expressive than free-form filtering, and changes in its structured parameters would require changes in architecture for standard networks. Dynamic inference, in which the Gaussian structure varies with the input, adapts receptive field size to compensate for local scale variation. Optimizing receptive field size improves semantic segmentation accuracy on Cityscapes by 1-2 points for strong dilated and skip architectures and by up to 10 points for suboptimal designs. Adapting receptive fields by dynamic Gaussian structure further improves results, equaling the accuracy of free-form deformation while improving efficiency.",/pdf/2de47a7c112312913d8b37ef45c4230787667516.pdf,ICLR,2020,Composing structured Gaussian and free-form filters makes receptive field size and shape differentiable for end-to-end optimization and dynamic adaptation. +SJeOAJStwB,rJe6tDJFDH,1569440000000.0,1577170000000.0,2030,On Federated Learning of Deep Networks from Non-IID Data: Parameter Divergence and the Effects of Hyperparametric Methods,"[""kim881019@kaist.ac.kr"", ""taewoo_kim@kaist.ac.kr"", ""chyoun@kaist.ac.kr""]","[""Heejae Kim"", ""Taewoo Kim"", ""Chan-Hyun Youn""]","[""Federated learning"", ""Iterative parameter averaging"", ""Deep networks"", ""Decentralized non-IID data"", ""Hyperparameter optimization methods""]","Federated learning, where a global model is trained by iterative parameter averaging of locally-computed updates, is a promising approach for distributed training of deep networks; it provides high communication-efficiency and privacy-preservability, which allows to fit well into decentralized data environments, e.g., mobile-cloud ecosystems. 
However, despite the advantages, the federated learning-based methods still have a challenge in dealing with non-IID training data of local devices (i.e., learners). In this regard, we study the effects of a variety of hyperparametric conditions under the non-IID environments, to answer important concerns in practical implementations: (i) We first investigate parameter divergence of local updates to explain performance degradation from non-IID data. The origin of the parameter divergence is also found both empirically and theoretically. (ii) We then revisit the effects of optimizers, network depth/width, and regularization techniques; our observations show that the well-known advantages of the hyperparameter optimization strategies could rather yield diminishing returns with non-IID data. (iii) We finally provide the reasons of the failure cases in a categorized way, mainly based on metrics of the parameter divergence.",/pdf/5e77d730063a01ed6a3cd0b0b51e689011736107.pdf,ICLR,2020,"We investigate the internal reasons of our observations, the diminishing effects of the well-known hyperparameter optimization methods on federated learning from decentralized non-IID data." +kLbhLJ8OT12,ESpJK9gihpt,1601310000000.0,1613760000000.0,443,Modelling Hierarchical Structure between Dialogue Policy and Natural Language Generator with Option Framework for Task-oriented Dialogue System,"[""~Jianhong_Wang1"", ""~Yuan_Zhang7"", ""~Tae-Kyun_Kim2"", ""yunjie.gu@imperial.ac.uk""]","[""Jianhong Wang"", ""Yuan Zhang"", ""Tae-Kyun Kim"", ""Yunjie Gu""]","[""Task-oriented Dialogue System"", ""Natural Language Processing"", ""Hierarchical Reinforcement Learning"", ""Policy Optimization""]","Designing task-oriented dialogue systems is a challenging research topic, since it needs not only to generate utterances fulfilling user requests but also to guarantee the comprehensibility. 
Many previous works trained end-to-end (E2E) models with supervised learning (SL), however, the bias in annotated system utterances remains as a bottleneck. Reinforcement learning (RL) deals with the problem through using non-differentiable evaluation metrics (e.g., the success rate) as rewards. Nonetheless, existing works with RL showed that the comprehensibility of generated system utterances could be corrupted when improving the performance on fulfilling user requests. In our work, we (1) propose modelling the hierarchical structure between dialogue policy and natural language generator (NLG) with the option framework, called HDNO, where the latent dialogue act is applied to avoid designing specific dialogue act representations; (2) train HDNO via hierarchical reinforcement learning (HRL), as well as suggest the asynchronous updates between dialogue policy and NLG during training to theoretically guarantee their convergence to a local maximizer; and (3) propose using a discriminator modelled with language models as an additional reward to further improve the comprehensibility. We test HDNO on MultiWoz 2.0 and MultiWoz 2.1, the datasets on multi-domain dialogues, in comparison with word-level E2E model trained with RL, LaRL and HDSA, showing improvements on the performance evaluated by automatic evaluation metrics and human evaluation. Finally, we demonstrate the semantic meanings of latent dialogue acts to show the explanability for HDNO.",/pdf/a99c17ceac45ecae4d3e3677d8e1179cd7d3900b.pdf,ICLR,2021,We propose a novel algorithm called HDNO for policy optimization for task-oriented dialogue system so that the performance on the comprehensibility of generated responses is improved compared with other RL-based algorithms. 
+rklhb2R9Y7,BklTWdpctX,1538090000000.0,1571430000000.0,1212,Reinforced Imitation Learning from Observations,"[""konrad.zolna@gmail.com"", ""negar.rostamzadeh@gmail.com"", ""yoshua.umontreal@gmail.com"", ""sjn.ahn@gmail.com"", ""pedro@opinheiro.com""]","[""Konrad Zolna"", ""Negar Rostamzadeh"", ""Yoshua Bengio"", ""Sungjin Ahn"", ""Pedro O. Pinheiro""]","[""imitation learning"", ""state-only observations"", ""self-exploration""]","Imitation learning is an effective alternative approach to learn a policy when the reward function is sparse. In this paper, we consider a challenging setting where an agent has access to a sparse reward function and state-only expert observations. We propose a method which gradually balances between the imitation learning cost and the reinforcement learning objective. Built upon an existing imitation learning method, our approach works with state-only observations. We show, through navigation scenarios, that (i) an agent is able to efficiently leverage sparse rewards to outperform standard state-only imitation learning, (ii) it can learn a policy even when learner's actions are different from the expert, and (iii) the performance of the agent is not bounded by that of the expert due to the optimized usage of sparse rewards.",/pdf/331bb6bb7456b94caebd919c1588a01e48400798.pdf,ICLR,2019, +SkUfhuFsvK-,99jSMETG7hCp,1601310000000.0,1614990000000.0,989,FASG: Feature Aggregation Self-training GCN for Semi-supervised Node Classification,"[""~Gongpei_Zhao1"", ""~Tao_Wang1"", ""~Yidong_Li1"", ""~Yi_Jin2""]","[""Gongpei Zhao"", ""Tao Wang"", ""Yidong Li"", ""Yi Jin""]",[],"Recently, Graph Convolutioal Networks (GCNs) have achieved significant success in many graph-based learning tasks, especially for node classification, due to its excellent ability in representation learning. Nevertheless, it remains challenging for GCN models to obtain satisfying prediction on graphs where few nodes are with known labels. 
In this paper, we propose a novel self-training algorithm based on GCN to boost semi-supervised node classification on graphs with little supervised information. Inspired by self-supervision strategy, the proposed method introduces an ingenious checking part to add new nodes as supervision after each training epoch to enhance node prediction. In particular, the embedded checking part is designed based on aggregated features, which is more accurate than previous methods and boosts node classification significantly. The proposed algorithm is validated on three public benchmarks in comparison with several state-of-the-art baseline algorithms, and the results illustrate its excellent performance.",/pdf/c14804b5e5e02c92cb071f950c7f1d748585b65b.pdf,ICLR,2021, +S1lJv0VYDr,ryek0VP_Pr,1569440000000.0,1577170000000.0,1159,Model Imitation for Model-Based Reinforcement Learning,"[""kriswu8021@gmail.com"", ""tinghanf@princeton.edu"", ""ramadge@princeton.edu"", ""haosu@eng.ucsd.edu""]","[""Yueh-Hua Wu"", ""Ting-Han Fan"", ""Peter J. Ramadge"", ""Hao Su""]","[""Model-Based Reinforcement Learning""]","Model-based reinforcement learning (MBRL) aims to learn a dynamic model to reduce the number of interactions with real-world environments. However, due to estimation error, rollouts in the learned model, especially those of long horizon, fail to match the ones in real-world environments. This mismatching has seriously impacted the sample complexity of MBRL. The phenomenon can be attributed to the fact that previous works employ supervised learning to learn the one-step transition models, which has inherent difficulty ensuring the matching of distributions from multi-step rollouts. Based on the claim, we propose to learn the synthesized model by matching the distributions of multi-step rollouts sampled from the synthesized model and the real ones via WGAN. 
We theoretically show that matching the two can minimize the difference of cumulative rewards between the real transition and the learned one. Our experiments also show that the proposed model imitation method outperforms the state-of-the-art in terms of sample complexity and average return.",/pdf/23c4bc44a1299ccf6c22e5b0ae19cd070c6be69e.pdf,ICLR,2020,Our method incorporates WGAN to achieve occupancy measure matching for transition learning. +SylUiREKvB,SylgeMFODH,1569440000000.0,1577170000000.0,1319,Variational Hyper RNN for Sequence Modeling,"[""ruizhid@sfu.ca"", ""yanshuaicao@gmail.com"", ""bchang@stat.ubc.ca"", ""lsigal@cs.ubc.ca"", ""mori@cs.sfu.ca"", ""marcus.brubaker@borealisai.com""]","[""Ruizhi Deng"", ""Yanshuai Cao"", ""Bo Chang"", ""Leonid Sigal"", ""Greg Mori"", ""Marcus Brubaker""]","[""variational autoencoder"", ""hypernetwork"", ""recurrent neural network"", ""time series""]","In this work, we propose a novel probabilistic sequence model that excels at capturing high variability in time series data, both across sequences and within an individual sequence. Our method uses temporal latent variables to capture information about the underlying data pattern and dynamically decodes the latent information into modifications of weights of the base decoder and recurrent model. The efficacy of the proposed method is demonstrated on a range of synthetic and real-world sequential data that exhibit large scale variations, regime shifts, and complex dynamics.",/pdf/93f90e0957e0c16587617a33a124a46db9f2a256.pdf,ICLR,2020,We propose a novel probabilistic sequence model that excels at capturing high variability in time series data using hypernetworks. 
+D_KeYoqCYC,jaq4KHUfujm,1601310000000.0,1612540000000.0,3647,Sparse encoding for more-interpretable feature-selecting representations in probabilistic matrix factorization,"[""~Joshua_C_Chang1"", ""patrick@mederrata.com"", ""jungmin@mederrata.com"", ""ted@mederrata.com"", ""shashaank@mederrata.com"", ""bart.desmet@gmail.com"", ""ayah.zirikly@gmail.com"", ""carsonc@niddk.nih.gov""]","[""Joshua C Chang"", ""Patrick Fletcher"", ""Jungmin Han"", ""Ted L Chang"", ""Shashaank Vattikuti"", ""Bart Desmet"", ""Ayah Zirikly"", ""Carson C Chow""]","[""poisson matrix factorization"", ""generalized additive model"", ""probabilistic matrix factorization"", ""bayesian"", ""sparse coding"", ""interpretability"", ""factor analysis""]","Dimensionality reduction methods for count data are critical to a wide range of applications in medical informatics and other fields where model interpretability is paramount. For such data, hierarchical Poisson matrix factorization (HPF) and other sparse probabilistic non-negative matrix factorization (NMF) methods are considered to be interpretable generative models. They consist of sparse transformations for decoding their learned representations into predictions. However, sparsity in representation decoding does not necessarily imply sparsity in the encoding of representations from the original data features. HPF is often incorrectly interpreted in the literature as if it possesses encoder sparsity. The distinction between decoder sparsity and encoder sparsity is subtle but important. Due to the lack of encoder sparsity, HPF does not possess the column-clustering property of classical NMF -- the factor loading matrix does not sufficiently define how each factor is formed from the original features. We address this deficiency by self-consistently enforcing encoder sparsity, using a generalized additive model (GAM), thereby allowing one to relate each representation coordinate to a subset of the original data features. 
In doing so, the method also gains the ability to perform feature selection. We demonstrate our method on simulated data and give an example of how encoder sparsity is of practical use in a concrete application of representing inpatient comorbidities in Medicare patients.",/pdf/8dec97cd769e293780630c02e65abf11b85674b1.pdf,ICLR,2021,We introduce a simple modification to existing sparse matrix factorization methods to rectify widespread erroneous interpretation of the factors. +qRdED5QjM9e,U3PiB44dZA,1601310000000.0,1614990000000.0,2815,Distributionally Robust Learning for Unsupervised Domain Adaptation,"[""hatchet25@sjtu.edu.cn"", ""~Anqi_Liu2"", ""~Zhiding_Yu1"", ""~Yisong_Yue1"", ""~Anima_Anandkumar1""]","[""Haoxuan Wang"", ""Anqi Liu"", ""Zhiding Yu"", ""Yisong Yue"", ""Anima Anandkumar""]","[""Distributionally Robust Learning"", ""Domain Adaptation"", ""Self-training"", ""Density Ratio Estimation""]","We propose a distributionally robust learning (DRL) method for unsupervised domain adaptation (UDA) that scales to modern computer-vision benchmarks. DRL can be naturally formulated as a competitive two-player game between a predictor and an adversary that is allowed to corrupt the labels, subject to certain constraints, and reduces to incorporating a density ratio between the source and target domains (under the standard log loss). This formulation motivates the use of two neural networks that are jointly trained --- a discriminative network between the source and target domains for density-ratio estimation, in addition to the standard classification network. The use of a density ratio in DRL prevents the model from being overconfident on target inputs far away from the source domain. Thus, DRL provides conservative confidence estimation in the target domain, even when the target labels are not available. This conservatism motivates the use of DRL in self-training for sample selection, and we term the approach distributionally robust self-training (DRST). 
In our experiments, DRST generates more calibrated probabilities and achieves state-of-the-art self-training accuracy on benchmark datasets. We demonstrate that DRST captures shape features more effectively, and reduces the extent of distributional shift during self-training. ",/pdf/68d7a4e7a51ec01a4bd3416c0cd6659926f60684.pdf,ICLR,2021,We propose a distributionally robust method for unsupervised domain adaptation that gives conservative uncertainties and SOTA accuracy. +HklBjCEKvH,rygDa-YOwr,1569440000000.0,1583910000000.0,1318,Generalization through Memorization: Nearest Neighbor Language Models,"[""urvashik@stanford.edu"", ""omerlevy@gmail.com"", ""jurafsky@stanford.edu"", ""lsz@fb.com"", ""mikelewis@fb.com""]","[""Urvashi Khandelwal"", ""Omer Levy"", ""Dan Jurafsky"", ""Luke Zettlemoyer"", ""Mike Lewis""]","[""language models"", ""k-nearest neighbors""]","We introduce $k$NN-LMs, which extend a pre-trained neural language model (LM) by linearly interpolating it with a $k$-nearest neighbors ($k$NN) model. The nearest neighbors are computed according to distance in the pre-trained LM embedding space, and can be drawn from any text collection, including the original LM training data. Applying this transformation to a strong Wikitext-103 LM, with neighbors drawn from the original training set, our $k$NN-LM achieves a new state-of-the-art perplexity of 15.79 -- a 2.9 point improvement with no additional training. We also show that this approach has implications for efficiently scaling up to larger training sets and allows for effective domain adaptation, by simply varying the nearest neighbor datastore, again without further training. Qualitatively, the model is particularly helpful in predicting rare patterns, such as factual knowledge. 
Together, these results strongly suggest that learning similarity between sequences of text is easier than predicting the next word, and that nearest neighbor search is an effective approach for language modeling in the long tail.",/pdf/418d153509d33823203ee741a22904058ec65cae.pdf,ICLR,2020,"We extend a pre-trained neural language model by linearly interpolating it with a k-nearest neighbors model, achieving new state-of-the-art results on Wikitext-103 with no additional training." +3UDSdyIcBDA,#NAME?,1601310000000.0,1622690000000.0,2889,RMSprop converges with proper hyper-parameter,"[""~Naichen_Shi1"", ""~Dawei_Li3"", ""~Mingyi_Hong1"", ""~Ruoyu_Sun1""]","[""Naichen Shi"", ""Dawei Li"", ""Mingyi Hong"", ""Ruoyu Sun""]","[""RMSprop"", ""convergence"", ""hyperparameter""]","Despite the existence of divergence examples, RMSprop remains +one of the most popular algorithms in machine learning. Towards closing the gap between theory and practice, we prove that RMSprop converges with proper choice of hyper-parameters under certain conditions. More specifically, we prove that when the hyper-parameter $\beta_2$ is close enough to $1$, RMSprop and its random shuffling version converge to a bounded region in general, and to critical points in the interpolation regime. It is worth mentioning that our results do not depend on ``bounded gradient"" assumption, which is often the key assumption utilized by existing theoretical work for Adam-type adaptive gradient method. Removing this assumption allows us to establish a phase transition from divergence to non-divergence for RMSprop. + +Finally, based on our theory, we conjecture that in practice there is a critical threshold $\sf{\beta_2^*}$, such that RMSprop generates reasonably good results only if $1>\beta_2\ge \sf{\beta_2^*}$. 
We provide empirical evidence for such a phase transition in our numerical experiments.",/pdf/3df947e54dae8ab86ecfdcbf1ee7cc3542fe9e76.pdf,ICLR,2021, +SySaJ0xCZ,By7py0eR-,1509120000000.0,1518730000000.0,440,Simple and efficient architecture search for Convolutional Neural Networks,"[""thomas.elsken@de.bosch.com"", ""janhendrik.metzen@de.bosch.com"", ""fh@cs.uni-freiburg.de""]","[""Thomas Elsken"", ""Jan Hendrik Metzen"", ""Frank Hutter""]","[""Deep Learning"", ""Hyperparameter Optimization"", ""Architecture Search"", ""Convolutional Neural Networks"", ""Network Morphism"", ""Network Transformation"", ""SGDR"", ""Cosine annealing"", ""hill climbing""]","Neural networks have recently had a lot of success for many tasks. However, neural +network architectures that perform well are still typically designed manually +by experts in a cumbersome trial-and-error process. We propose a new method +to automatically search for well-performing CNN architectures based on a simple +hill climbing procedure whose operators apply network morphisms, followed +by short optimization runs by cosine annealing. Surprisingly, this simple method +yields competitive results, despite only requiring resources in the same order of +magnitude as training a single network. E.g., on CIFAR-10, our method designs +and trains networks with an error rate below 6% in only 12 hours on a single GPU; +training for one day reduces this error further, to almost 5%.",/pdf/6d6cfdaa12dd6971680f5b14cdc7057032601a02.pdf,ICLR,2018,We propose a simple and efficent method for architecture search for convolutional neural networks. +dJbf5SqbFrM,FvmScd3wQllb,1601310000000.0,1614990000000.0,106,Continuous Transfer Learning,"[""~Jun_Wu3"", ""~Jingrui_He1""]","[""Jun Wu"", ""Jingrui He""]",[],"Transfer learning has been successfully applied across many high-impact applications. 
However, most existing work focuses on the static transfer learning setting, and very little is devoted to modeling the time evolving target domain, such as the online reviews for movies. To bridge this gap, in this paper, we focus on the continuous transfer learning setting with a time evolving target domain. One major challenge associated with continuous transfer learning is the time evolving relatedness of the source domain and the current target domain as the target domain evolves over time. To address this challenge, we first derive a generic generalization error bound on the current target domain with flexible domain discrepancy measures. Furthermore, a novel label-informed C-divergence is proposed to measure the shift of joint data distributions (over input features and output labels) across domains. It could be utilized to instantiate a tighter error upper bound in the continuous transfer learning setting, thus motivating us to develop an adversarial Variational Auto-encoder algorithm named CONTE by minimizing the C-divergence based error upper bound. Extensive experiments on various data sets demonstrate the effectiveness of our CONTE algorithm. +",/pdf/309c862b5786a429a1aaf3ef05810a8464d59c86.pdf,ICLR,2021,Theory and algorithm for transfer learning with a static source domain and a time-evolving target domain +SyxIWpVYvr,S1gx7PPLDr,1569440000000.0,1583910000000.0,377,Input Complexity and Out-of-distribution Detection with Likelihood-based Generative Models,"[""joansj@gmail.com"", ""davidalvarezdlt@gmail.com"", ""vicen.gomez@upf.edu"", ""oslizovskaia@gmail.com"", ""jfn237@nyu.edu"", ""jordi.luqueserrano@telefonica.com""]","[""Joan Serr\u00e0"", ""David \u00c1lvarez"", ""Vicen\u00e7 G\u00f3mez"", ""Olga Slizovskaia"", ""Jos\u00e9 F. 
N\u00fa\u00f1ez"", ""Jordi Luque""]","[""OOD"", ""generative models"", ""likelihood""]","Likelihood-based generative models are a promising resource to detect out-of-distribution (OOD) inputs which could compromise the robustness or reliability of a machine learning system. However, likelihoods derived from such models have been shown to be problematic for detecting certain types of inputs that significantly differ from training data. In this paper, we pose that this problem is due to the excessive influence that input complexity has in generative models' likelihoods. We report a set of experiments supporting this hypothesis, and use an estimate of input complexity to derive an efficient and parameter-free OOD score, which can be seen as a likelihood-ratio, akin to Bayesian model comparison. We find such score to perform comparably to, or even better than, existing OOD detection approaches under a wide range of data sets, models, model sizes, and complexity estimates.",/pdf/8471ec4430e98ab580a18918ecee66542e525041.pdf,ICLR,2020,"We pose that generative models' likelihoods are excessively influenced by the input's complexity, and propose a way to compensate it when detecting out-of-distribution inputs" +HkgXteBYPB,rklXpplKDS,1569440000000.0,1577170000000.0,2430,Stochastic Neural Physics Predictor,"[""piotr.tatarczyk@tum.de"", ""mrowca@stanford.edu"", ""feifeili@cs.stanford.edu"", ""yamins@stanford.edu"", ""nils.thuerey@tum.de""]","[""Piotr Tatarczyk"", ""Damian Mrowca"", ""Li Fei-Fei"", ""Daniel L. K. Yamins"", ""Nils Thuerey""]","[""physics prediction"", ""forward dynamics"", ""stochastic environments"", ""dropout""]","Recently, neural-network based forward dynamics models have been proposed that attempt to learn the dynamics of physical systems in a deterministic way. 
While near-term motion can be predicted accurately, long-term predictions suffer from accumulating input and prediction errors which can lead to plausible but different trajectories that diverge from the ground truth. A system that predicts distributions of the future physical states for long time horizons based on its uncertainty is thus a promising solution. In this work, we introduce a novel robust Monte Carlo sampling based graph-convolutional dropout method that allows us to sample multiple plausible trajectories for an initial state given a neural-network based forward dynamics predictor. By introducing a new shape preservation loss and training our dynamics model recurrently, we stabilize long-term predictions. We show that our model’s long-term forward dynamics prediction errors on complicated physical interactions of rigid and deformable objects of various shapes are significantly lower than existing strong baselines. Lastly, we demonstrate how generating multiple trajectories with our Monte Carlo dropout method can be used to train model-free reinforcement learning agents faster and to better solutions on simple manipulation tasks.",/pdf/2da335d1412391428134cc3583557560433a322d.pdf,ICLR,2020,We propose a stochastic differentiable forward dynamics predictor that is able to sample multiple physically plausible trajectories under the same initial input state and show that it can be used to train model-free policies more efficiently. +h9XgC7JzyHZ,k9xedw--xZ_,1601310000000.0,1614990000000.0,3641,Efficient estimates of optimal transport via low-dimensional embeddings,"[""~Patric_Fulop1"", ""~Vincent_Danos1""]","[""Patric Fulop"", ""Vincent Danos""]","[""optimal transport"", ""sinkhorn divergences"", ""robustness"", ""neural networks"", ""lipschitz"", ""spectral norm""]","Optimal transport distances (OT) have been widely used in recent work in Machine Learning as ways to compare probability distributions. 
These are costly to compute when the data lives in high dimension. +Recent work aims specifically at reducing this cost by computing OT using low-rank projections of the data (seen as discrete measures)~\citep{paty2019subspace}. We extend this approach and show that one can approximate OT distances by using more general families of maps provided they are 1-Lipschitz. The best estimate is obtained by maximising OT over the given family. As OT calculations are done after mapping data to a lower dimensional space, our method scales well with the original data dimension. +We demonstrate the idea with neural networks. ",/pdf/8e37301d06494ad0acb7208bfd3577a9e1bf3bff.pdf,ICLR,2021,Approximating optimal transport distances using a low-dimensional space constructed through a general projection map that is 1-Lipschitz. +r1lQQeHYPr,Hkxk-VxKwr,1569440000000.0,1577170000000.0,2204,Embodied Multimodal Multitask Learning,"[""chaplot@cs.cmu.edu"", ""lslee@cs.cmu.edu"", ""rsalakhu@cs.cmu.edu"", ""parikh@gatech.edu"", ""dbatra@gatech.edu""]","[""Devendra Singh Chaplot"", ""Lisa Lee"", ""Ruslan Salakhutdinov"", ""Devi Parikh"", ""Dhruv Batra""]","[""Visual Grounding"", ""Semantic Goal Navigation"", ""Embodied Question Answering""]","Visually-grounded embodied language learning models have recently shown to be effective at learning multiple multimodal tasks such as following navigational instructions and answering questions. In this paper, we address two key limitations of these models, (a) the inability to transfer the grounded knowledge across different tasks and (b) the inability to transfer to new words and concepts not seen during training using only a few examples. We propose a multitask model which facilitates knowledge transfer across tasks by disentangling the knowledge of words and visual attributes in the intermediate representations. 
We create scenarios and datasets to quantify cross-task knowledge transfer and show that the proposed model outperforms a range of baselines in simulated 3D environments. We also show that this disentanglement of representations makes our model modular and interpretable, which allows for transfer to instructions containing new concepts.",/pdf/ae8db272968e6485697872bd0a3334d301b1271c.pdf,ICLR,2020,We propose a multitask model which facilitates knowledge transfer across multimodal tasks by disentangling the knowledge of words and visual concepts in the intermediate representations. +EdXhmWvvQV,Ufi-aTh-fo,1601310000000.0,1614990000000.0,27,Center-wise Local Image Mixture For Contrastive Representation Learning,"[""~Hao_Li11"", ""~XIAOPENG_ZHANG7"", ""~Ruoyu_Sun2"", ""~Hongkai_Xiong1"", ""~Qi_Tian3""]","[""Hao Li"", ""XIAOPENG ZHANG"", ""Ruoyu Sun"", ""Hongkai Xiong"", ""Qi Tian""]","[""Self-supervised Learning"", ""Data Mixing"", ""Contrastive Learning""]","Unsupervised representation learning has made remarkable progress, especially through contrastive learning, which regards each image as well as its augmentations as a separate class but does not consider the semantic similarity among images. This paper proposes a new kind of data augmentation, named Center-wise Local Image Mixture (CLIM), to expand the neighborhood space of an image. CLIM encourages both local similarity and global aggregation while pulling similar images together. This is achieved by searching for local similar samples of an image and selecting only images that are closer to the corresponding cluster center, which we denote as center-wise local selection. As a result, similar representations progressively approach the clusters without breaking the local similarity. Furthermore, image mixture is used as a smoothing regularization to avoid overconfidence in the selected samples. 
Besides, we introduce multi-resolution augmentation, which enables the representation to be scale invariant. Integrating the two augmentations produces better feature representation on several unsupervised benchmarks. Notably, we reach 75.5% top-1 accuracy with linear evaluation over ResNet-50, and 59.3% top-1 accuracy when fine-tuned with only 1% labels, as well as consistently outperforming supervised pretraining on several downstream transfer tasks.",/pdf/bb5ae8df23d7c8cc312b497cd6ebea7a38386549.pdf,ICLR,2021, +S1347ot3b,BkjV7jK3Z,1507600000000.0,1518730000000.0,4,Exploring Sentence Vectors Through Automatic Summarization,"[""at7@williams.edu"", ""jkalita@uccs.edu""]","[""Adly Templeton"", ""Jugal Kalita""]","[""Sentence Vectors"", ""Vector Semantics"", ""Automatic Summarization""]","Vector semantics, especially sentence vectors, have recently been used successfully in many areas of natural language processing. However, relatively little work has explored the internal structure and properties of spaces of sentence vectors. In this paper, we will explore the properties of sentence vectors by studying a particular real-world application: Automatic Summarization. In particular, we show that cosine similarity between sentence vectors and document vectors is strongly correlated with sentence importance and that vector semantics can identify and correct gaps between the sentences chosen so far and the document. In addition, we identify specific dimensions which are linked to effective summaries. To our knowledge, this is the first time specific dimensions of sentence embeddings have been connected to sentence properties. We also compare the features of different methods of sentence embeddings. 
Many of these insights have applications in uses of sentence embeddings far beyond summarization.",/pdf/ebf03b1dc5f72c08fc493441abcae04e05abfc58.pdf,ICLR,2018,A comparison and detailed analysis of various sentence embedding models through the real-world task of automatic summarization. +H1e0-30qKm,BJxzYbO5F7,1538090000000.0,1545360000000.0,1220,Unlabeled Disentangling of GANs with Guided Siamese Networks,"[""gokhan.yildirim@zalando.de"", ""nikolay.jetchev@zalando.de"", ""urs.bergmann@zalando.de""]","[""G\u00f6khan Yildirim"", ""Nikolay Jetchev"", ""Urs Bergmann""]","[""GAN"", ""disentange"", ""siamese networks"", ""semantic""]","Disentangling underlying generative factors of a data distribution is important for interpretability and generalizable representations. In this paper, we introduce two novel disentangling methods. Our first method, Unlabeled Disentangling GAN (UD-GAN, unsupervised), decomposes the latent noise by generating similar/dissimilar image pairs and it learns a distance metric on these pairs with siamese networks and a contrastive loss. This pairwise approach provides consistent representations for similar data points. Our second method (UD-GAN-G, weakly supervised) modifies the UD-GAN with user-defined guidance functions, which restrict the information that goes into the siamese networks. This constraint helps UD-GAN-G to focus on the desired semantic variations in the data. We show that both our methods outperform existing unsupervised approaches in quantitative metrics that measure semantic accuracy of the learned representations. In addition, we illustrate that simple guidance functions we use in UD-GAN-G allow us to directly capture the desired variations in the data.",/pdf/fd1dc1c70334a569f2da6223b8ebbae3a0cb6a97.pdf,ICLR,2019,We use Siamese Networks to guide and disentangle the generation process in GANs without labeled data. 
+BylWYC4KwH,SJlmmGd_wS,1569440000000.0,1577170000000.0,1236,On Concept-Based Explanations in Deep Neural Networks,"[""cjyeh@cs.cmu.edu"", ""beenkim.mit@gmail.com"", ""soarik@google.com"", ""chunliang.tw@gmail.com"", ""pradeep.ravikumar@gmail.com"", ""tpfister@google.com""]","[""Chih-Kuan Yeh"", ""Been Kim"", ""Sercan Arik"", ""Chun-Liang Li"", ""Pradeep Ravikumar"", ""Tomas Pfister""]","[""concept-based explanations"", ""interpretability""]","Deep neural networks (DNNs) build high-level intelligence on low-level raw features. Understanding of this high-level intelligence can be enabled by deciphering the concepts they base their decisions on, as human-level thinking. In this paper, we study concept-based explainability for DNNs in a systematic framework. First, we define the notion of completeness, which quantifies how sufficient a particular set of concepts is in explaining a model's prediction behavior. Based on performance and variability motivations, we propose two definitions to quantify completeness. We show that under degenerate conditions, our method is equivalent to Principal Component Analysis. Next, we propose a concept discovery method that considers two additional constraints to encourage the interpretability of the discovered concepts. We use game-theoretic notions to aggregate over sets to define an importance score for each discovered concept, which we call \emph{ConceptSHAP}. 
On specifically-designed synthetic datasets and real-world text and image datasets, we validate the effectiveness of our framework in finding concepts that are complete in explaining the decision, and interpretable.",/pdf/de36a4db16ab365485149e44fc87eb04c6e6be10.pdf,ICLR,2020,we propose a concept-based explanation for DNNs that is both sufficient for prediction (complete) and interpretable +SJetQpEYvB,HkeJSzlwvB,1569440000000.0,1583910000000.0,458,LEARNING EXECUTION THROUGH NEURAL CODE FUSION,"[""zshi17@cs.utexas.edu"", ""kswersky@google.com"", ""dtarlow@google.com"", ""parthas@google.com"", ""miladh@google.com""]","[""Zhan Shi"", ""Kevin Swersky"", ""Daniel Tarlow"", ""Parthasarathy Ranganathan"", ""Milad Hashemi""]","[""code understanding"", ""graph neural networks"", ""learning program execution"", ""execution traces"", ""program performance""]","As the performance of computer systems stagnates due to the end of Moore’s Law, +there is a need for new models that can understand and optimize the execution +of general purpose code. While there is a growing body of work on using Graph +Neural Networks (GNNs) to learn static representations of source code, these +representations do not understand how code executes at runtime. In this work, we +propose a new approach using GNNs to learn fused representations of general +source code and its execution. Our approach defines a multi-task GNN over +low-level representations of source code and program state (i.e., assembly code +and dynamic memory states), converting complex source code constructs and data +structures into a simpler, more uniform format. We show that this leads to improved +performance over similar methods that do not use execution and it opens the door +to applying GNN models to new tasks that would not be feasible from static code +alone. 
As an illustration of this, we apply the new model to challenging dynamic +tasks (branch prediction and prefetching) from the SPEC CPU benchmark suite, +outperforming the state-of-the-art by 26% and 45% respectively. Moreover, we +use the learned fused graph embeddings to demonstrate transfer learning with high +performance on an indirectly related algorithm classification task.",/pdf/607dc220d4f786c1a3b440746c261155e31143d5.pdf,ICLR,2020, +BkxdqA4tvB,BklfT3udwS,1569440000000.0,1577170000000.0,1289,Collapsed amortized variational inference for switching nonlinear dynamical systems,"[""zhedong@google.com"", ""baseybold@gmail.com"", ""kpmurphy@google.com"", ""bui.h.hung@gmail.com""]","[""Zhe Dong"", ""Bryan A. Seybold"", ""Kevin P. Murphy"", ""Hung H. Bui""]",[],"We propose an efficient inference method for switching nonlinear dynamical systems. The key idea is to learn an inference network which can be used as a proposal distribution for the continuous latent variables, while performing exact marginalization of the discrete latent variables. This allows us to use the reparameterization trick, and apply end-to-end training with SGD. We show that this method can successfully segment time series data (including videos) into meaningful ""regimes"", due to the use of piece-wise nonlinear dynamics.",/pdf/8ff4876f3b6c797a479de48e232ebffed7280748.pdf,ICLR,2020, +UvBPbpvHRj-,vbCMCLud1d,1601310000000.0,1615980000000.0,2992,Activation-level uncertainty in deep neural networks,"[""~Pablo_Morales-Alvarez1"", ""~Daniel_Hern\u00e1ndez-Lobato1"", ""rms@decsai.ugr.es"", ""~Jos\u00e9_Miguel_Hern\u00e1ndez-Lobato1""]","[""Pablo Morales-Alvarez"", ""Daniel Hern\u00e1ndez-Lobato"", ""Rafael Molina"", ""Jos\u00e9 Miguel Hern\u00e1ndez-Lobato""]","[""Gaussian Processes"", ""Uncertainty estimation"", ""Deep Gaussian Processes"", ""Bayesian Neural Networks""]","Current approaches for uncertainty estimation in deep learning often produce too confident results. 
Bayesian Neural Networks (BNNs) model uncertainty in the space of weights, which is usually high-dimensional and limits the quality of variational approximations. The more recent functional BNNs (fBNNs) address this only partially because, although the prior is specified in the space of functions, the posterior approximation is still defined in terms of stochastic weights. In this work we propose to move uncertainty from the weights (which are deterministic) to the activation function. Specifically, the activations are modelled with simple 1D Gaussian Processes (GP), for which a triangular kernel inspired by the ReLu non-linearity is explored. Our experiments show that activation-level stochasticity provides more reliable uncertainty estimates than BNN and fBNN, whereas it performs competitively in standard prediction tasks. We also study the connection with deep GPs, both theoretically and empirically. More precisely, we show that activation-level uncertainty requires fewer inducing points and is better suited for deep architectures.",/pdf/3675d798eb4cc1b53b84850025e0a9edaee1ddcb.pdf,ICLR,2021,"We use 1D Gaussian Processes to introduce activation-level uncertainty in neural networks, which overcomes known limitations of (functional) Bayesian neural nets and obtains better results than the related deep Gaussian Processes." +bd66LuDPPFh,JSZS9OyQ2_F,1601310000000.0,1614990000000.0,1950,Towards Understanding Label Smoothing,"[""~Yi_Xu8"", ""~Yuanhong_Xu1"", ""~Qi_Qian1"", ""~Li_Hao1"", ""~Rong_Jin1""]","[""Yi Xu"", ""Yuanhong Xu"", ""Qi Qian"", ""Li Hao"", ""Rong Jin""]","[""Label Smoothing"", ""Non-convex Optimization"", ""Deep Learning Theory""]","Label smoothing regularization (LSR) has a great success in training deep neural networks by stochastic algorithms such as stochastic gradient descent and its variants. However, the theoretical understanding of its power from the view of optimization is still rare. 
This study opens the door to a deep understanding of LSR by initiating the analysis. In this paper, we analyze the convergence behaviors of stochastic gradient descent with label smoothing regularization for solving non-convex problems and show that an appropriate LSR can help to speed up the convergence by reducing the variance. More interestingly, we proposed a simple yet effective strategy, namely Two-Stage LAbel smoothing algorithm (TSLA), that uses LSR in the early training epochs and drops it off in the later training epochs. We observe from the improved convergence result of TSLA that it benefits from LSR in the first stage and essentially converges faster in the second stage. To the best of our knowledge, this is the first work for understanding the power of LSR via establishing convergence complexity of stochastic methods with LSR in non-convex optimization. We empirically demonstrate the effectiveness of the proposed method in comparison with baselines on training ResNet models over benchmark data sets.",/pdf/6b065fc5efbd0cd00cecca361ae6819ffbe96f3e.pdf,ICLR,2021, This paper studies the theoretical understanding of the power of label smoothing from the view of optimization and proposes a simple yet effective algorithm TSLA with theoretical guarantee of convergence. +BklEFpEYwS,SygBMp3DwB,1569440000000.0,1588020000000.0,665,Meta-Learning without Memorization,"[""mzyin@utexas.edu"", ""gjt@google.com"", ""mingyuan.zhou@mccombs.utexas.edu"", ""svlevine@eecs.berkeley.edu"", ""cbfinn@cs.stanford.edu""]","[""Mingzhang Yin"", ""George Tucker"", ""Mingyuan Zhou"", ""Sergey Levine"", ""Chelsea Finn""]","[""meta-learning"", ""memorization"", ""regularization"", ""overfitting"", ""mutually-exclusive""]","The ability to learn new concepts with small amounts of data is a critical aspect of intelligence that has proven challenging for deep learning methods. 
Meta-learning has emerged as a promising technique for leveraging data from previous tasks to enable efficient learning of new tasks. However, most meta-learning algorithms implicitly require that the meta-training tasks be mutually-exclusive, such that no single model can solve all of the tasks at once. For example, when creating tasks for few-shot image classification, prior work uses a per-task random assignment of image classes to N-way classification labels. If this is not done, the meta-learner can ignore the task training data and learn a single model that performs all of the meta-training tasks zero-shot, but does not adapt effectively to new image classes. This requirement means that the user must take great care in designing the tasks, for example by shuffling labels or removing task identifying information from the inputs. In some domains, this makes meta-learning entirely inapplicable. In this paper, we address this challenge by designing a meta-regularization objective using information theory that places precedence on data-driven adaptation. This causes the meta-learner to decide what must be learned from the task training data and what should be inferred from the task testing input. By doing so, our algorithm can successfully use data from non-mutually-exclusive tasks to efficiently adapt to novel tasks. We demonstrate its applicability to both contextual and gradient-based meta-learning algorithms, and apply it in practical settings where applying standard meta-learning has been difficult. Our approach substantially outperforms standard meta-learning algorithms in these settings. ",/pdf/68499d53a335aa65cc8e6e4806eb34cd0363b820.pdf,ICLR,2020,"We identify and formalize the memorization problem in meta-learning and solve this problem with novel meta-regularization method, which greatly expand the domain that meta-learning can be applicable to and effective on." 
+L3iGqaCTWS9,Y7_RQxwH8Bm,1601310000000.0,1614990000000.0,3674,Hybrid and Non-Uniform DNN quantization methods using Retro Synthesis data for efficient inference,"[""~TEJPRATAP_GVSL1"", ""~Raja_Kumar2"", ""pradeep.ns@samsung.com""]","[""TEJPRATAP GVSL"", ""Raja Kumar"", ""Pradeep NS""]","[""quantization"", ""dnn inference"", ""data free quantization"", ""synthetic data"", ""model compression""]","Existing post-training quantization methods attempt to compensate for the quantization loss by determining the quantized weights and activation ranges with the help of training data. Quantization-aware training methods, on the other hand, achieve accuracy close to that of FP32 models by training the quantized model, which consumes more time. Neither approach is effective for privacy-constrained applications, as both are tightly coupled with training data. In contrast, this paper proposes a data-independent post-training quantization scheme that eliminates the need for training data. This is achieved by generating a faux dataset, hereafter called $\textit{‘Retro-Synthesis Data’}$, from the FP32 model layer statistics and then using it for quantization. This approach outperformed state-of-the-art methods including, but not limited to, ZeroQ and DFQ on models with and without batch-normalization layers for 8-, 6- and 4-bit precisions. We also introduce two variants of post-training quantization methods, namely $\textit{‘Hybrid-Quantization’}$ and $\textit{‘Non-Uniform Quantization’}$. The Hybrid-Quantization scheme determines the sensitivity of each layer to per-tensor and per-channel quantization, and thereby generates hybrid quantized models that are $10 - 20\%$ more efficient in inference time while achieving the same or better accuracy compared to per-channel quantization. This method also outperformed FP32 accuracy when applied to models such as ResNet-18 and ResNet-50 on the ImageNet dataset. 
In the proposed Non-Uniform quantization scheme, the weights are grouped into different clusters and these clusters are assigned with a varied number of quantization steps depending on the number of weights and their ranges in respective cluster. This method resulted in an accuracy improvement of $1\%$ against state-of-the-art quantization methods on ImageNet dataset.",/pdf/6b5daf187ad95d204a97e8e5a0041efd622ee821.pdf,ICLR,2021, +B1CNpYg0-,HJT46Ye0-,1509100000000.0,1518730000000.0,343,Learning to Compute Word Embeddings On the Fly,"[""dimabgv@gmail.com"", ""bosc.tom@gmail.com"", ""staszek.jastrzebski@gmail.com"", ""etg@google.com"", ""pascal.vincent@umontreal.ca"", ""yoshua.umontreal@gmail.com""]","[""Dzmitry Bahdanau"", ""Tom Bosc"", ""Stanis\u0142aw Jastrz\u0119bski"", ""Edward Grefenstette"", ""Pascal Vincent"", ""Yoshua Bengio""]","[""NLU"", ""word embeddings"", ""representation learning""]","Words in natural language follow a Zipfian distribution whereby some words are frequent but most are rare. Learning representations for words in the ``long tail'' of this distribution requires enormous amounts of data. +Representations of rare words trained directly on end tasks are usually poor, requiring us to pre-train embeddings on external data, or treat all rare words as out-of-vocabulary words with a unique representation. We provide a method for predicting embeddings of rare words on the fly from small amounts of auxiliary data with a network trained end-to-end for the downstream task. We show that this improves results against baselines where embeddings are trained on the end task for reading comprehension, recognizing textual entailment and language modeling. +",/pdf/54c7b6555f1528a10aa2b60624ea3d9f9adb3479.pdf,ICLR,2018,We propose a method to deal with rare words by computing their embedding from definitions. 
+B1EGg7ZCb,r1CyeQ-0-,1509140000000.0,1518730000000.0,1103,Autonomous Vehicle Fleet Coordination With Deep Reinforcement Learning,"[""cane.cane@live.com""]","[""Cane Punma""]","[""Deep Reinforcement Learning"", ""multi-agent systems""]","Autonomous vehicles are becoming more common in city transportation. Companies will begin to find a need to teach these vehicles smart city fleet coordination. Currently, simulation-based modeling along with hand-coded rules dictates the decision making of these autonomous vehicles. We believe that complex intelligent behavior can be learned by these agents through Reinforcement Learning. In this paper, we discuss our work for solving this system by adapting the Deep Q-Learning (DQN) model to the multi-agent setting. Our approach applies deep reinforcement learning by combining convolutional neural networks with DQN to teach agents to fulfill customer demand in an environment that is partially observable to them. We also demonstrate how to utilize transfer learning to teach agents to balance multiple objectives, such as navigating to a charging station when their energy level is low. The two evaluations presented show that our solution successfully teaches agents cooperation policies while balancing multiple objectives.",/pdf/c7c8805df55c3c06682680157594fc6adcc1686c.pdf,ICLR,2018,Utilized Deep Reinforcement Learning to teach agents ride-sharing fleet style coordination. +rkgfWh0qKX,r1xoJSWcY7,1538090000000.0,1545360000000.0,1151,Do Language Models Have Common Sense?,"[""thtrieu@google.com"", ""qvl@google.com""]","[""Trieu H. Trinh"", ""Quoc V. Le""]",[],"It has been argued that current machine learning models do not have commonsense, and therefore must be hard-coded with prior knowledge (Marcus, 2018). Here we show surprising evidence that language models can already learn to capture certain common sense knowledge. 
Our key observation is that a language model can compute the probability of any statement, and this probability can be used to evaluate the truthfulness of that statement. On the Winograd Schema Challenge (Levesque et al., 2011), language models are 11% higher in accuracy than previous state-of-the-art supervised methods. Language models can also be fine-tuned for the task of Mining Commonsense Knowledge on ConceptNet to achieve an F1 score of 0.912 and 0.824, outperforming previous best results (Jastrzebski et al., 2018). Further analysis demonstrates that language models can discover unique features of Winograd Schema contexts that decide the correct answers without explicit supervision.",/pdf/66733cd0185bdc64077133178135d382c35f8ae2.pdf,ICLR,2019,We present evidence that LMs do capture common sense with state-of-the-art results on both Winograd Schema Challenge and Commonsense Knowledge Mining. +Mf4ZSXMZP7,T72jz5cgvqs,1601310000000.0,1614990000000.0,1562,Improving Post Training Neural Quantization: Layer-wise Calibration and Integer Programming,"[""~Itay_Hubara1"", ""~Yury_Nahshan1"", ""~Yair_Hanani1"", ""~Ron_Banner1"", ""~Daniel_Soudry1""]","[""Itay Hubara"", ""Yury Nahshan"", ""Yair Hanani"", ""Ron Banner"", ""Daniel Soudry""]","[""Efficient Deep Learning"", ""Quantization"", ""Compression""]","Lately, post-training quantization methods have gained considerable attention, as they are simple to use, and require only a small unlabeled calibration set. This small dataset cannot be used to fine-tune the model without significant over-fitting. Instead, these methods only use the calibration set to set the activations' dynamic ranges. However, such methods always resulted in significant accuracy degradation, when used below 8-bits (except on small datasets). Here we aim to break the 8-bit barrier. To this end, we minimize the quantization errors of each layer separately by optimizing its parameters over the calibration set. 
We empirically demonstrate that this approach is: (1) much less susceptible to over-fitting than the standard fine-tuning approaches, and can be used even on a very small calibration set; and (2) more powerful than previous methods, which only set the activations' dynamic ranges. Furthermore, we demonstrate how to optimally allocate the bit-widths for each layer, while constraining accuracy degradation or model compression by proposing a novel integer programming formulation. Finally, we suggest model global statistics tuning, to correct biases introduced during quantization. Together, these methods yield state-of-the-art results for both vision and text models. For instance, on ResNet50, we obtain less than 1\% accuracy degradation --- with 4-bit weights and activations in all layers, but the smallest two. Our code is available at, https://github.com/papers-submission/CalibTIP",/pdf/ce0e4b560de85ec662b936d3b2e452d8b848ce88.pdf,ICLR,2021,State-of-the-art results using advanced method for post training per channel quantization - squeezing all the information from the calibration set. +rylT0AVtwH,S1x_L59OvB,1569440000000.0,1577170000000.0,1447,Learning from Partially-Observed Multimodal Data with Variational Autoencoders,"[""yu_gong@sfu.ca"", ""hossein.hajimirsadeghi@gmail.com"", ""jha203@sfu.ca"", ""mnawhal@sfu.ca"", ""thibaut.p.durand@borealisai.com"", ""mori@cs.sfu.ca""]","[""Yu Gong"", ""Hossein Hajimirsadeghi"", ""Jiawei He"", ""Megha Nawhal"", ""Thibaut Durand"", ""Greg Mori""]","[""data imputation"", ""variational autoencoders"", ""generative models""]","Learning from only partially-observed data for imputation has been an active research area. Despite promising progress on unimodal data imputation (e.g., image in-painting), models designed for multimodal data imputation are far from satisfactory. In this paper, we propose variational selective autoencoders (VSAE) for this task. 
Different from previous works, our proposed VSAE learns only from partially-observed data. The proposed VSAE is capable of learning the joint distribution of observed and unobserved modalities as well as the imputation mask, resulting in a unified model for various down-stream tasks including data generation and imputation. +Evaluation on both synthetic high-dimensional and challenging low-dimensional multi-modality datasets shows significant improvement over the state-of-the-art data imputation models.",/pdf/23f10f00e7960ac23e554ab0ac562bb90618deac.pdf,ICLR,2020,We propose a novel VAE-based framework learning from partially-observed data for imputation and generation. +Hygv0sC5F7,S1xAXZq5YX,1538090000000.0,1545360000000.0,903,When Will Gradient Methods Converge to Max-margin Classifier under ReLU Models?,"[""xu.3260@osu.edu"", ""zhou.1172@osu.edu"", ""ji.367@osu.edu"", ""liang.889@osu.edu""]","[""Tengyu Xu"", ""Yi Zhou"", ""Kaiyi Ji"", ""Yingbin Liang""]","[""gradient method"", ""max-margin"", ""ReLU model""]","We study the implicit bias of gradient descent methods in solving a binary classification problem over a linearly separable dataset. The classifier is described by a nonlinear ReLU model and the objective function adopts the exponential loss function. We first characterize the landscape of the loss function and show that there can exist spurious asymptotic local minima besides asymptotic global minima. We then show that gradient descent (GD) can converge to either a global or a local max-margin direction, or may diverge from the desired max-margin direction in a general context. For stochastic gradient descent (SGD), we show that it converges in expectation to either the global or the local max-margin direction if SGD converges. 
We further explore the implicit bias of these algorithms in learning a multi-neuron network under certain stationary conditions, and show that the learned classifier maximizes the margins of each sample pattern partition under the ReLU activation.",/pdf/f4b7688b3c9bea0e4a93a377dad1dccc965aab65.pdf,ICLR,2019,We study the implicit bias of gradient methods in solving a binary classification problem with nonlinear ReLU models. +Skx24yHFDr,HJxLlkadwH,1569440000000.0,1577170000000.0,1669,Discovering Topics With Neural Topic Models Built From PLSA Loss,"[""sileye.ba@outlook.com""]","[""sileye ba""]","[""neural network"", ""topic model"", ""neural topic model"", ""bag-of-words"", ""PLSA""]","In this paper we present a model for unsupervised topic discovery in text corpora. The proposed model uses lookup-table embeddings of documents, words, and topics as neural network model parameters to build probabilities of words given topics, and probabilities of topics given documents. These probabilities are marginalized to recover the probabilities of words given documents. For very large corpora where the number of documents can be on the order of billions, using a neural auto-encoder based document embedding is more scalable than using a lookup table embedding as classically done. We thus extend the lookup-based document embedding model to a continuous auto-encoder based model. Our models are trained using probabilistic latent semantic analysis (PLSA) assumptions. We evaluated our models on six datasets with a rich variety of contents. The experiments demonstrate that the proposed neural topic models are very effective in capturing relevant topics. 
Furthermore, considering perplexity metric, conducted evaluation benchmarks show that our topic models outperform latent Dirichlet allocation (LDA) model which is classically used to address topic discovery tasks.",/pdf/44830d12133a5e0c4ea4037b9313113ccf4d7456.pdf,ICLR,2020,"We propose a neural topic model that is built using documents, words, and topics embedding together with PLSA independence assumptions. " +TiXl51SCNw8,Ynq0oT-jgCX,1601310000000.0,1613400000000.0,110,BSQ: Exploring Bit-Level Sparsity for Mixed-Precision Neural Network Quantization,"[""~Huanrui_Yang1"", ""ld213@duke.edu"", ""~Yiran_Chen1"", ""~Hai_Li1""]","[""Huanrui Yang"", ""Lin Duan"", ""Yiran Chen"", ""Hai Li""]","[""Mixed-precision quantization"", ""bit-level sparsity"", ""DNN compression""]","Mixed-precision quantization can potentially achieve the optimal tradeoff between performance and compression rate of deep neural networks, and thus, have been widely investigated. However, it lacks a systematic method to determine the exact quantization scheme. Previous methods either examine only a small manually-designed search space or utilize a cumbersome neural architecture search to explore the vast search space. These approaches cannot lead to an optimal quantization scheme efficiently. This work proposes bit-level sparsity quantization (BSQ) to tackle the mixed-precision quantization from a new angle of inducing bit-level sparsity. We consider each bit of quantized weights as an independent trainable variable and introduce a differentiable bit-sparsity regularizer. BSQ can induce all-zero bits across a group of weight elements and realize the dynamic precision reduction, leading to a mixed-precision quantization scheme of the original model. Our method enables the exploration of the full mixed-precision space with a single gradient-based optimization process, with only one hyperparameter to tradeoff the performance and compression. 
BSQ achieves both higher accuracy and higher bit reduction on various model architectures on the CIFAR-10 and ImageNet datasets compared to previous methods.",/pdf/85f47585c645bbb6e9d4e2306480002404a37c22.pdf,ICLR,2021,We propose a bit-level sparsity-inducing regularizer to induce a mixed-precision quantization scheme in DNNs with gradient-based training. +rJEgeXFex,,1478210000000.0,1486660000000.0,84,Predicting Medications from Diagnostic Codes with Recurrent Neural Networks,"[""jacek.m.bajor@vanderbilt.edu"", ""tom.lasko@vanderbilt.edu""]","[""Jacek M. Bajor"", ""Thomas A. Lasko""]","[""Deep learning"", ""Supervised Learning"", ""Applications""]","It is a surprising fact that electronic medical records are failing at one of their primary purposes, that of tracking the set of medications that the patient is actively taking. Studies estimate that up to 50% of such lists omit active drugs, and that up to 25% of all active medications do not appear on the appropriate patient list. Manual efforts to maintain these lists involve a great deal of tedious human labor, which could be reduced by computational tools to suggest likely missing or incorrect medications on a patient’s list. We report here an application of recurrent neural networks to predict the likely therapeutic classes of medications that a patient is taking, given a sequence of the last 100 billing codes in their record. Our best model was a GRU that achieved high prediction accuracy (micro-averaged AUC 0.93, Label Ranking Loss 0.076), limited by hardware constraints on model size. 
Additionally, examining individual cases revealed that many of the predictions marked incorrect were likely to be examples of either omitted medications or omitted billing codes, supporting our assertion of a substantial number of errors and omissions in the data, and the likelihood of models such as these to help correct them.",/pdf/f612598e357946a8f81772ed1cb0cf561d06bb20.pdf,ICLR,2017,Applying recurrent neural networks to fix errors and omissions in patient medication records. +98fWAc-sFkv,DFUcj-6Fai_9,1601310000000.0,1614990000000.0,2901,A Unified Bayesian Framework for Discriminative and Generative Continual Learning,"[""abhi.kumar.chaudhary@gmail.com"", ""sunabhac@gmail.com"", ""~Piyush_Rai1""]","[""Abhishek Kumar"", ""Sunabha Chatterjee"", ""Piyush Rai""]","[""continual learning"", ""bayesian learning""]","Continual Learning is a learning paradigm where learning systems are trained on a sequence of tasks. The goal here is to perform well on the current task without suffering from a performance drop on the previous tasks. Two notable directions among the recent advances in continual learning with neural networks are (1) variational Bayes based regularization by learning priors from previous tasks, and, (2) learning the structure of deep networks to adapt to new tasks. So far, these two approaches have been orthogonal. We present a novel Bayesian framework for continual learning based on learning the structure of deep neural networks, addressing the shortcomings of both these approaches. The proposed framework learns the deep structure for each task by learning which weights to be used, and supports inter-task transfer through the overlapping of different sparse subsets of weights learned by different tasks. An appealing aspect of our proposed continual learning framework is that it is applicable to both discriminative (supervised) and generative (unsupervised) settings. 
Experimental results on supervised and unsupervised benchmarks show that our model performs comparably or better than recent advances in continual learning.",/pdf/a1eb5b770e158dac566fa05d60ee3cf651694c3a.pdf,ICLR,2021,"A Bayesian approach for continual learning under supervised as well as unsupervised settings, with the flexibility to adapt the model complexity as more and more tasks arrive." +B1xSperKvH,S1xIM-WFPr,1569440000000.0,1583910000000.0,2573,Enabling Deep Spiking Neural Networks with Hybrid Conversion and Spike Timing Dependent Backpropagation,"[""rathi2@purdue.edu"", ""srinivg@purdue.edu"", ""priya.panda@yale.edu"", ""kaushik@purdue.edu""]","[""Nitin Rathi"", ""Gopalakrishnan Srinivasan"", ""Priyadarshini Panda"", ""Kaushik Roy""]","[""spiking neural networks"", ""ann-snn conversion"", ""spike-based backpropagation"", ""imagenet""]","Spiking Neural Networks (SNNs) operate with asynchronous discrete events (or spikes) which can potentially lead to higher energy-efficiency in neuromorphic hardware implementations. Many works have shown that an SNN for inference can be formed by copying the weights from a trained Artificial Neural Network (ANN) and setting the firing threshold for each layer as the maximum input received in that layer. These types of converted SNNs require a large number of time steps to achieve competitive accuracy which diminishes the energy savings. The number of time steps can be reduced by training SNNs with spike-based backpropagation from scratch, but that is computationally expensive and slow. To address these challenges, we present a computationally-efficient training technique for deep SNNs. 
We propose a hybrid training methodology: 1) take a converted SNN and use its weights and thresholds as an initialization step for spike-based backpropagation, and 2) perform incremental spike-timing dependent backpropagation (STDB) on this carefully initialized network to obtain an SNN that converges within few epochs and requires fewer time steps for input processing. STDB is performed with a novel surrogate gradient function defined using neuron's spike time. The weight update is proportional to the difference in spike timing between the current time step and the most recent time step the neuron generated an output spike. The SNNs trained with our hybrid conversion-and-STDB training perform at $10{\times}{-}25{\times}$ fewer number of time steps and achieve similar accuracy compared to purely converted SNNs. The proposed training methodology converges in less than $20$ epochs of spike-based backpropagation for most standard image classification datasets, thereby greatly reducing the training complexity compared to training SNNs from scratch. We perform experiments on CIFAR-10, CIFAR-100 and ImageNet datasets for both VGG and ResNet architectures. We achieve top-1 accuracy of $65.19\%$ for ImageNet dataset on SNN with $250$ time steps, which is $10{\times}$ faster compared to converted SNNs with similar accuracy.",/pdf/6b27e9db132d2fe4c67bfdf18940211c2fc5ee5e.pdf,ICLR,2020,A hybrid training technique that combines ANN-SNN conversion and spike-based backpropagation to optimize training effort and inference latency. +r1xNJ0NYDH,H1lTPvMuwr,1569440000000.0,1577170000000.0,890,The Effect of Neural Net Architecture on Gradient Confusion & Training Performance,"[""karthikabinavs@gmail.com"", ""sohamde@google.com"", ""xuzh@cs.umd.edu"", ""wrhuang@cs.umd.edu"", ""tomg@cs.umd.edu""]","[""Karthik A. Sankararaman"", ""Soham De"", ""Zheng Xu"", ""W. 
Ronny Huang"", ""Tom Goldstein""]","[""neural network architecture"", ""speed of training"", ""layer width"", ""network depth""]","The goal of this paper is to study why typical neural networks train so fast, and how neural network architecture affects the speed of training. We introduce a simple concept called gradient confusion to help formally analyze this. When confusion is high, stochastic gradients produced by different data samples may be negatively correlated, slowing down convergence. But when gradient confusion is low, data samples interact harmoniously, and training proceeds quickly. Through novel theoretical and experimental results, we show how the neural net architecture affects gradient confusion, and thus the efficiency of training. We show that increasing the width of neural networks leads to lower gradient confusion, and thus easier model training. On the other hand, increasing the depth of neural networks has the opposite effect. Finally, we observe empirically that techniques like batch normalization and skip connections reduce gradient confusion, which helps reduce the training burden of very deep networks.",/pdf/752cb4da1ad78306fd4d3900f3c10ba36f711f48.pdf,ICLR,2020,"We formally show that increased layer width helps training, while increased network depth makes training harder." +ryzm6BATZ,B1zmTBAT-,1508950000000.0,1518730000000.0,93,Image Quality Assessment Techniques Improve Training and Evaluation of Energy-Based Generative Adversarial Networks,"[""michaelvertolli@gmail.com"", ""jim@jimdavies.org""]","[""Michael O. Vertolli"", ""Jim Davies""]","[""generative adversarial networks"", ""gans"", ""deep learning"", ""image modeling"", ""image generation"", ""energy based models""]","We propose a new, multi-component energy function for energy-based Generative Adversarial Networks (GANs) based on methods from the image quality assessment literature. 
Our approach expands on the Boundary Equilibrium Generative Adversarial Network (BEGAN) by outlining some of the short-comings of the original energy and loss functions. We address these short-comings by incorporating an l1 score, the Gradient Magnitude Similarity score, and a chrominance score into the new energy function. We then provide a set of systematic experiments that explore its hyper-parameters. We show that each of the energy function's components is able to represent a slightly different set of features, which require their own evaluation criteria to assess whether they have been adequately learned. We show that models using the new energy function are able to produce better image representations than the BEGAN model in predicted ways.",/pdf/7e656f2b65e7c379977b425701a3c58d618758ae.pdf,ICLR,2018,Image Quality Assessment Techniques Improve Training and Evaluation of Energy-Based Generative Adversarial Networks +rkYgAJWCZ,S1uxC1bAb,1509130000000.0,1518730000000.0,532,One-shot and few-shot learning of word embeddings,"[""lampinen@stanford.edu"", ""mcclelland@stanford.edu""]","[""Andrew Kyle Lampinen"", ""James Lloyd McClelland""]","[""One-shot learning"", ""embeddings"", ""word embeddings"", ""natural language processing"", ""NLP""]","Standard deep learning systems require thousands or millions of examples to learn a concept, and cannot integrate new concepts easily. By contrast, humans have an incredible ability to do one-shot or few-shot learning. For instance, from just hearing a word used in a sentence, humans can infer a great deal about it, by leveraging what the syntax and semantics of the surrounding words tells us. Here, we draw inspiration from this to highlight a simple technique by which deep recurrent networks can similarly exploit their prior knowledge to learn a useful representation for a new word from little data. 
This could make natural language processing systems much more flexible, by allowing them to learn continually from the new words they encounter.",/pdf/c9ade40c8f59a1058b5d9f15b7ef14eed2a4bf09.pdf,ICLR,2018,"We highlight a technique by which natural language processing systems can learn a new word from context, allowing them to be much more flexible." +BkMq0oRqFQ,B1gQV_8KKX,1538090000000.0,1545360000000.0,923,Normalization Gradients are Least-squares Residuals,"[""liu.yi.pei@gmail.com""]","[""Yi Liu""]","[""Deep Learning"", ""Normalization"", ""Least squares"", ""Gradient regression""]","Batch Normalization (BN) and its variants have seen widespread adoption in the deep learning community because they improve the training of deep neural networks. Discussions of why this normalization works so well remain unsettled. We make explicit the relationship between ordinary least squares and partial derivatives computed when back-propagating through BN. We recast the back-propagation of BN as a least squares fit, which zero-centers and decorrelates partial derivatives from normalized activations. This view, which we term {\em gradient-least-squares}, is an extensible and arithmetically accurate description of BN. To further explore this perspective, we motivate, interpret, and evaluate two adjustments to BN.",/pdf/b37504a8f6cad67daa4ca5a89ddf53116d7a0d52.pdf,ICLR,2019,"Gaussian normalization performs a least-squares fit during back-propagation, which zero-centers and decorrelates partial derivatives from normalized activations." +SkNksoRctQ,r1lACSrOFQ,1538090000000.0,1545410000000.0,588,Fluctuation-dissipation relations for stochastic gradient descent,"[""shoyaida@fb.com""]","[""Sho Yaida""]","[""stochastic gradient descent"", ""adaptive method"", ""loss surface"", ""Hessian""]","The notion of the stationary equilibrium ensemble has played a central role in statistical mechanics. 
In machine learning as well, training serves as generalized equilibration that drives the probability distribution of model parameters toward stationarity. Here, we derive stationary fluctuation-dissipation relations that link measurable quantities and hyperparameters in the stochastic gradient descent algorithm. These relations hold exactly for any stationary state and can in particular be used to adaptively set training schedule. We can further use the relations to efficiently extract information pertaining to a loss-function landscape such as the magnitudes of its Hessian and anharmonicity. Our claims are empirically verified.",/pdf/2b424796c43cea9eee25004714c27e18a6e9fc18.pdf,ICLR,2019,"We prove fluctuation-dissipation relations for SGD, which can be used to (i) adaptively set learning rates and (ii) probe loss surfaces." +TaYhv-q1Xit,bPa9o5Rt-M4,1601310000000.0,1616070000000.0,3561,Ringing ReLUs: Harmonic Distortion Analysis of Nonlinear Feedforward Networks,"[""~Christian_H.X._Ali_Mehmeti-G\u00f6pel1"", ""~David_Hartmann1"", ""michael.wand@uni-mainz.de""]","[""Christian H.X. Ali Mehmeti-G\u00f6pel"", ""David Hartmann"", ""Michael Wand""]","[""deep learning theory"", ""loss landscape"", ""harmonic distortion analysis"", ""network trainability""]","In this paper, we apply harmonic distortion analysis to understand the effect of nonlinearities in the spectral domain. Each nonlinear layer creates higher-frequency harmonics, which we call ""blueshift"", whose magnitude increases with network depth, thereby increasing the “roughness” of the output landscape. Unlike differential models (such as vanishing gradients, sharpness), this provides a more global view of how network architectures behave across larger areas of their parameter domain. For example, the model predicts that residual connections are able to counter the effect by dampening corresponding higher frequency modes. 
We empirically verify the connection between blueshift and architectural choices, and provide evidence for a connection with trainability.",/pdf/e4c6e52caf84c69fd30b6bfc6f0936b70f6187a9.pdf,ICLR,2021,Nonlinearities create high-frequency distortions that affect network trainability. +hiq1rHO8pNT,JhKF0y9UEa,1601310000000.0,1615720000000.0,1622,HyperGrid Transformers: Towards A Single Model for Multiple Tasks,"[""~Yi_Tay1"", ""~Zhe_Zhao3"", ""~Dara_Bahri1"", ""metzler@google.com"", ""~Da-Cheng_Juan1""]","[""Yi Tay"", ""Zhe Zhao"", ""Dara Bahri"", ""Donald Metzler"", ""Da-Cheng Juan""]","[""Transformers"", ""Multi-Task Learning""]","Achieving state-of-the-art performance on natural language understanding tasks typically relies on fine-tuning a fresh model for every task. Consequently, this approach leads to a higher overall parameter cost, along with higher technical maintenance for serving multiple models. Learning a single multi-task model that is able to do well for all the tasks has been a challenging and yet attractive proposition. In this paper, we propose HyperGrid Transformers, a new Transformer architecture that leverages task-conditioned hyper networks for controlling its feed-forward layers. Specifically, we propose a decomposable hypernetwork that learns grid-wise projections that help to specialize regions in weight matrices for different tasks. In order to construct the proposed hypernetwork, our method learns the interactions and composition between a global (task-agnostic) state and a local task-specific state. We conduct an extensive set of experiments on GLUE/SuperGLUE. On the SuperGLUE test set, we match the performance of the state-of-the-art while being $16$ times more parameter efficient. 
Our method helps bridge the gap between fine-tuning and multi-task learning approaches.",/pdf/ec11dfff9f173589062f4d3948f9212e345c56cf.pdf,ICLR,2021,State-of-the-art multi-task NLU performance with only a single model +AHm3dbp7D1D,4nfUEMfLWYhV,1601310000000.0,1615520000000.0,149,SEED: Self-supervised Distillation For Visual Representation,"[""~Zhiyuan_Fang1"", ""~Jianfeng_Wang4"", ""~Lijuan_Wang1"", ""~Lei_Zhang23"", ""~Yezhou_Yang1"", ""~Zicheng_Liu1""]","[""Zhiyuan Fang"", ""Jianfeng Wang"", ""Lijuan Wang"", ""Lei Zhang"", ""Yezhou Yang"", ""Zicheng Liu""]","[""Self Supervised Learning"", ""Knowledge Distillation"", ""Representation Learning""]","This paper is concerned with self-supervised learning for small models. The problem is motivated by our empirical studies that while the widely used contrastive self-supervised learning method has shown great progress on large model training, it does not work well for small models. To address this problem, we propose a new learning paradigm, named $\textbf{SE}$lf-Sup$\textbf{E}$rvised $\textbf{D}$istillation (${\large S}$EED), where we leverage a larger network (as Teacher) to transfer its representational knowledge into a smaller architecture (as Student) in a self-supervised fashion. Instead of directly learning from unlabeled data, we train a student encoder to mimic the similarity score distribution inferred by a teacher over a set of instances. We show that ${\large S}$EED dramatically boosts the performance of small networks on downstream tasks. Compared with self-supervised baselines, ${\large S}$EED improves the top-1 accuracy from 42.2% to 67.6% on EfficientNet-B0 and from 36.3% to 68.2% on MobileNet-v3-Large on the ImageNet-1k dataset. ",/pdf/1f85b790d971bca718badf148e1c89397c3c4d90.pdf,ICLR,2021,"We propose ${\large S}$EED, a self-supervised distillation technique for visual representation learning." 
+HyunpgbR-,ByPhTxZ0b,1509130000000.0,1518730000000.0,631,Structured Exploration via Hierarchical Variational Policy Networks,"[""stephan@caltech.edu"", ""yyue@caltech.edu""]","[""Stephan Zheng"", ""Yisong Yue""]","[""Deep Reinforcement Learning"", ""Structured Variational Inference"", ""Multi-agent Coordination"", ""Multi-agent Learning""]","Reinforcement learning in environments with large state-action spaces is challenging, as exploration can be highly inefficient. Even if the dynamics are simple, the optimal policy can be combinatorially hard to discover. In this work, we propose a hierarchical approach to structured exploration to improve the sample efficiency of on-policy exploration in large state-action spaces. The key idea is to model a stochastic policy as a hierarchical latent variable model, which can learn low-dimensional structure in the state-action space, and to define exploration by sampling from the low-dimensional latent space. This approach enables lower sample complexity, while preserving policy expressivity. In order to make learning tractable, we derive a joint learning and exploration strategy by combining hierarchical variational inference with actor-critic learning. The benefits of our learning approach are that 1) it is principled, 2) simple to implement, 3) easily scalable to settings with many actions and 4) easily composable with existing deep learning approaches. We demonstrate the effectiveness of our approach on learning a deep centralized multi-agent policy, as multi-agent environments naturally have an exponentially large state-action space. In this setting, the latent hierarchy implements a form of multi-agent coordination during exploration and execution (MACE). We demonstrate empirically that MACE can more efficiently learn optimal policies in challenging multi-agent games with a large number (~20) of agents, compared to conventional baselines. 
Moreover, we show that our hierarchical structure leads to meaningful agent coordination.",/pdf/c919771a95ab0c405ebc28a3734b750277b51dbd.pdf,ICLR,2018,Make deep reinforcement learning in large state-action spaces more efficient using structured exploration with deep hierarchical policies. +SklTQCNtvS,BJlPf1LuDH,1569440000000.0,1583910000000.0,1054,Sign-OPT: A Query-Efficient Hard-label Adversarial Attack,"[""mhcheng@ucla.edu"", ""simranjit@cs.ucla.edu"", ""patrickchen@ucla.edu"", ""pin-yu.chen@ibm.com"", ""sijia.liu@ibm.com"", ""chohsieh@cs.ucla.edu""]","[""Minhao Cheng"", ""Simranjit Singh"", ""Patrick H. Chen"", ""Pin-Yu Chen"", ""Sijia Liu"", ""Cho-Jui Hsieh""]",[],"We study the most practical problem setup for evaluating adversarial robustness of a machine learning system with limited access: the hard-label black-box attack setting for generating adversarial examples, where limited model queries are allowed and only the decision is provided to a queried data input. Several algorithms have been proposed for this problem but they typically require huge amount (>20,000) of queries for attacking one example. Among them, one of the state-of-the-art approaches (Cheng et al., 2019) showed that hard-label attack can be modeled as an optimization problem where the objective function can be evaluated by binary search with additional model queries, thereby a zeroth order optimization algorithm can be applied. In this paper, we adopt the same optimization formulation but propose to directly estimate the sign of gradient at any direction instead of the gradient itself, which enjoys the benefit of single query. +Using this single query oracle for retrieving sign of directional derivative, we develop a novel query-efficient Sign-OPT approach for hard-label black-box attack. We provide a convergence analysis of the new algorithm and conduct experiments on several models on MNIST, CIFAR-10 and ImageNet. 
+We find that Sign-OPT attack consistently requires 5X to 10X fewer queries when compared to the current state-of-the-art approaches, and usually converges to an adversarial example with smaller perturbation. ",/pdf/b0e100de9d582c690cc356f1475cfd56649b84a4.pdf,ICLR,2020, +SyxV9ANFDH,SJxFRt_OPB,1569440000000.0,1583910000000.0,1279,Economy Statistical Recurrent Units For Inferring Nonlinear Granger Causality,"[""elesaur@nus.edu.sg"", ""vtan@nus.edu.sg""]","[""Saurabh Khanna"", ""Vincent Y. F. Tan""]","[""Recurrent neural networks"", ""Granger causality"", ""Causal inference"", ""Statistical Recurrent Unit""]","Granger causality is a widely-used criterion for analyzing interactions in large-scale networks. As most physical interactions are inherently nonlinear, we consider the problem of inferring the existence of pairwise Granger causality between nonlinearly interacting stochastic processes from their time series measurements. Our proposed approach relies on modeling the embedded nonlinearities in the measurements using a component-wise time series prediction model based on Statistical Recurrent Units (SRUs). We make a case that the network topology of Granger causal relations is directly inferrable from a structured sparse estimate of the internal parameters of the SRU networks trained to predict the processes’ time series measurements. We propose a variant of SRU, called economy-SRU, which, by design has considerably fewer trainable parameters, and therefore less prone to overfitting. The economy-SRU computes a low-dimensional sketch of its high-dimensional hidden state in the form of random projections to generate the feedback for its recurrent processing. Additionally, the internal weight parameters of the economy-SRU are strategically regularized in a group-wise manner to facilitate the proposed network in extracting meaningful predictive features that are highly time-localized to mimic real-world causal events. 
Extensive experiments are carried out to demonstrate that the proposed economy-SRU based time series prediction model outperforms the MLP, LSTM and attention-gated CNN-based time series models considered previously for inferring Granger causality. ",/pdf/b77f50150d3f24a52b70cfe3b4549acc0bda1e48.pdf,ICLR,2020,A new recurrent neural network architecture for detecting pairwise Granger causality between nonlinearly interacting time series. +B1em9h4KDS,ByxlK1t0IB,1569440000000.0,1577170000000.0,110,Generative Imputation and Stochastic Prediction,"[""mkachuee@ucla.edu"", ""kimmo@cs.ucla.edu"", ""orpgol@cs.ucla.edu"", ""sajad.darabi@cs.ucla.edu"", ""majid@cs.ucla.edu""]","[""Mohammad Kachuee"", ""Kimmo K\u00e4rkk\u00e4inen"", ""Orpaz Goldstein"", ""Sajad Darabi"", ""Majid Sarrafzadeh""]",[],"In many machine learning applications, we are faced with incomplete datasets. In the literature, missing data imputation techniques have been mostly concerned with filling missing values. However, the existence of missing values is synonymous with uncertainties not only over the distribution of missing values but also over target class assignments that require careful consideration. In this paper, we propose a simple and effective method for imputing missing features and estimating the distribution of target assignments given incomplete data. In order to make imputations, we train a simple and effective generator network to generate imputations that a discriminator network is tasked to distinguish. Following this, a predictor network is trained using the imputed samples from the generator network to capture the classification uncertainties and make predictions accordingly. The proposed method is evaluated on CIFAR-10 image dataset as well as three real-world tabular classification datasets, under different missingness rates and structures. 
Our experimental results show the effectiveness of the proposed method in generating imputations as well as providing estimates for the class uncertainties in a classification task when faced with missing values.",/pdf/6bb90fac94db5eea133812d893f9182973c34e3f.pdf,ICLR,2020,A method to generate imputations and measure uncertainties over target class assignments based on incomplete feature vectors +r1l1myStwr,rylZJHndPH,1569440000000.0,1577170000000.0,1602,Continuous Meta-Learning without Tasks,"[""jharrison@stanford.edu"", ""apoorva@stanford.edu"", ""cbfinn@cs.stanford.edu"", ""pavone@stanford.edu""]","[""James Harrison"", ""Apoorva Sharma"", ""Chelsea Finn"", ""Marco Pavone""]","[""Meta-learning"", ""Continual learning"", ""changepoint detection"", ""Bayesian learning""]","Meta-learning is a promising strategy for learning to efficiently learn within new tasks, using data gathered from a distribution of tasks. However, the meta-learning literature thus far has focused on the task segmented setting, where at train-time, offline data is assumed to be split according to the underlying task, and at test-time, the algorithms are optimized to learn in a single task. In this work, we enable the application of generic meta-learning algorithms to settings where this task segmentation is unavailable, such as continual online learning with a time-varying task. We present meta-learning via online changepoint analysis (MOCA), an approach which augments a meta-learning algorithm with a differentiable Bayesian changepoint detection scheme. The framework allows both training and testing directly on time series data without segmenting it into discrete tasks. We demonstrate the utility of this approach on a nonlinear meta-regression benchmark as well as two meta-image-classification benchmarks.",/pdf/367f77bd4f316ba5aefc0db65bb0d9675976dbc1.pdf,ICLR,2020,Bayesian changepoint detection enables meta-learning directly from time series data. 
+Hke3gyHYwH,SyxaZUo_PS,1569440000000.0,1583910000000.0,1521,Simple and Effective Regularization Methods for Training on Noisily Labeled Data with Generalization Guarantee,"[""huwei@cs.princeton.edu"", ""zhiyuanli@cs.princeton.edu"", ""dingliy@cs.princeton.edu""]","[""Wei Hu"", ""Zhiyuan Li"", ""Dingli Yu""]","[""deep learning theory"", ""regularization"", ""noisy labels""]","Over-parameterized deep neural networks trained by simple first-order methods are known to be able to fit any labeling of data. Such over-fitting ability hinders generalization when mislabeled training examples are present. On the other hand, simple regularization methods like early-stopping can often achieve highly nontrivial performance on clean test data in these scenarios, a phenomenon not theoretically understood. This paper proposes and analyzes two simple and intuitive regularization methods: (i) regularization by the distance between the network parameters to initialization, and (ii) adding a trainable auxiliary variable to the network output for each training example. Theoretically, we prove that gradient descent training with either of these two methods leads to a generalization guarantee on the clean data distribution despite being trained using noisy labels. Our generalization analysis relies on the connection between wide neural network and neural tangent kernel (NTK). The generalization bound is independent of the network size, and is comparable to the bound one can get when there is no label noise. 
Experimental results verify the effectiveness of these methods on noisily labeled datasets.",/pdf/907345259310ec754023a3696a33480febe41d0b.pdf,ICLR,2020, +B1gJ1L2aW,Hyk1kUn6b,1508820000000.0,1521790000000.0,59,Characterizing Adversarial Subspaces Using Local Intrinsic Dimensionality,"[""xingjunm@student.unimelb.edu.au"", ""crystalboli@berkeley.edu"", ""wangys14@mails.tsinghua.edu.cn"", ""sarah.erfani@unimelb.edu.au"", ""sudanthi.wijewickrema@unimelb.edu.au"", ""schoeneb@umich.edu"", ""dawnsong.travel@gmail.com"", ""meh@nii.ac.jp"", ""baileyj@unimelb.edu.au""]","[""Xingjun Ma"", ""Bo Li"", ""Yisen Wang"", ""Sarah M. Erfani"", ""Sudanthi Wijewickrema"", ""Grant Schoenebeck"", ""Dawn Song"", ""Michael E. Houle"", ""James Bailey""]","[""Adversarial Subspace"", ""Local Intrinsic Dimensionality"", ""Deep Neural Networks""]","Deep Neural Networks (DNNs) have recently been shown to be vulnerable against adversarial examples, which are carefully crafted instances that can mislead DNNs to make errors during prediction. To better understand such attacks, a characterization is needed of the properties of regions (the so-called `adversarial subspaces') in which adversarial examples lie. We tackle this challenge by characterizing the dimensional properties of adversarial regions, via the use of Local Intrinsic Dimensionality (LID). LID assesses the space-filling capability of the region surrounding a reference example, based on the distance distribution of the example to its neighbors. We first provide explanations about how adversarial perturbation can affect the LID characteristic of adversarial regions, and then show empirically that LID characteristics can facilitate the distinction of adversarial examples generated using state-of-the-art attacks. 
As a proof-of-concept, we show that a potential application of LID is to distinguish adversarial examples, and the preliminary results show that it can outperform several state-of-the-art detection measures by large margins for five attack strategies considered in this paper across three benchmark datasets. Our analysis of the LID characteristic for adversarial regions not only motivates new directions of effective adversarial defense, but also opens up more challenges for developing new attacks to better understand the vulnerabilities of DNNs.",/pdf/1b0b9eefe47ff28b205d421b5fceefb34449b771.pdf,ICLR,2018,We characterize the dimensional properties of adversarial subspaces in the neighborhood of adversarial examples via the use of Local Intrinsic Dimensionality (LID). +S1uxsye0Z,B1wgiJgAb,1509060000000.0,1519340000000.0,204,Adaptive Dropout with Rademacher Complexity Regularization,"[""zhaikedavy@gmail.com"", ""joyousprince@gmail.com""]","[""Ke Zhai"", ""Huan Wang""]","[""model complexity"", ""regularization"", ""deep learning"", ""model generalization"", ""adaptive dropout""]","We propose a novel framework to adaptively adjust the dropout rates for the deep neural network based on a Rademacher complexity bound. The state-of-the-art deep learning algorithms impose dropout strategy to prevent feature co-adaptation. However, choosing the dropout rates remains an art of heuristics or relies on empirical grid-search over some hyperparameter space. In this work, we show the network Rademacher complexity is bounded by a function related to the dropout rate vectors and the weight coefficient matrices. Subsequently, we impose this bound as a regularizer and provide a theoretical justified way to trade-off between model complexity and representation power. Therefore, the dropout rates and the empirical loss are unified into the same objective function, which is then optimized using the block coordinate descent algorithm. 
We discover that the adaptively adjusted dropout rates converge to some interesting distributions that reveal meaningful patterns. Experiments on the task of image and document classification also show our method achieves better performance compared to the state-of-the-art dropout algorithms.",/pdf/a4b4dcae4dcddfaf2bee5828f6d8452898c52821.pdf,ICLR,2018,We propose a novel framework to adaptively adjust the dropout rates for the deep neural network based on a Rademacher complexity bound. +rke_YiRct7,r1gDn5xvYX,1538090000000.0,1556910000000.0,458,Small nonlinearities in activation functions create bad local minima in neural networks,"[""chulheey@mit.edu"", ""suvrit@mit.edu"", ""jadbabai@mit.edu""]","[""Chulhee Yun"", ""Suvrit Sra"", ""Ali Jadbabaie""]","[""spurious local minima"", ""loss surface"", ""optimization landscape"", ""neural network""]","We investigate the loss surface of neural networks. We prove that even for one-hidden-layer networks with ""slightest"" nonlinearity, the empirical risks have spurious local minima in most cases. Our results thus indicate that in general ""no spurious local minima"" is a property limited to deep linear networks, and insights obtained from linear networks may not be robust. Specifically, for ReLU(-like) networks we constructively prove that for almost all practical datasets there exist infinitely many local minima. We also present a counterexample for more general activations (sigmoid, tanh, arctan, ReLU, etc.), for which there exists a bad local minimum. Our results make the least restrictive assumptions relative to existing results on spurious local optima in neural networks. 
We complete our discussion by presenting a comprehensive characterization of global optimality for deep linear networks, which unifies other results on this topic.",/pdf/9dc34e786d0037c655b4e01acba62d2612460cf2.pdf,ICLR,2019,"We constructively prove that even the slightest nonlinear activation functions introduce spurious local minima, for general datasets and activation functions." +XQQA6-So14,nF47OALggXE,1601310000000.0,1616020000000.0,665,Neural Spatio-Temporal Point Processes,"[""~Ricky_T._Q._Chen1"", ""~Brandon_Amos1"", ""~Maximilian_Nickel1""]","[""Ricky T. Q. Chen"", ""Brandon Amos"", ""Maximilian Nickel""]","[""point processes"", ""normalizing flows"", ""differential equations""]","We propose a new class of parameterizations for spatio-temporal point processes which leverage Neural ODEs as a computational method and enable flexible, high-fidelity models of discrete events that are localized in continuous time and space. Central to our approach is a combination of continuous-time neural networks with two novel neural architectures, \ie, Jump and Attentive Continuous-time Normalizing Flows. This approach allows us to learn complex distributions for both the spatial and temporal domain and to condition non-trivially on the observed event history. We validate our models on data sets from a wide variety of contexts such as seismology, epidemiology, urban mobility, and neuroscience.",/pdf/668da7eb2c6955f36c010d76bb62d8a0cea81a06.pdf,ICLR,2021,"We motivate the use of Continuous-time Normalizing Flows for building spatio-temporal point processes, and discuss modeling conditional dependencies with recurrent- or attention-based Neural ODEs." 
+UJRFjuJDsIO,HZP_1sUBOPA,1601310000000.0,1614990000000.0,2308,Why Convolutional Networks Learn Oriented Bandpass Filters: Theory and Empirical Support,"[""~Isma_Hadji2"", ""~Richard_Wildes1""]","[""Isma Hadji"", ""Richard Wildes""]",[],"It has been repeatedly observed that convolutional architectures when applied to image understanding tasks learn oriented bandpass filters. A standard explanation of this result is that these filters reflect the structure of the images that they have been exposed to during training: Natural images typically are locally composed of oriented contours at various scales and oriented bandpass filters are matched to such structure. We offer an alternative explanation based not on the structure of images, but rather on the structure of convolutional architectures. In particular, complex exponentials are the eigenfunctions of convolution. These eigenfunctions are defined globally; however, convolutional architectures operate locally. To enforce locality, one can apply a windowing function to the eigenfunctions, which leads to oriented bandpass filters as the natural operators to be learned with convolutional architectures. From a representational point of view, these filters allow for a local systematic way to characterize and operate on an image or other signal. We offer empirical support for the hypothesis that convolutional networks learn such filters at all of their convolutional layers. While previous research has shown evidence of filters having oriented bandpass characteristics at early layers, ours appears to be the first study to document the predominance of such filter characteristics at all layers. 
Previous studies have missed this observation because they have concentrated on the cumulative compositional effects of filtering across layers, while we examine the filter characteristics that are present at each layer.",/pdf/cbfc054e6adddf4b333affb894a38ee363e4b501.pdf,ICLR,2021, +H1xSNiRcF7,HkltNFR5Ym,1538090000000.0,1550880000000.0,1,Smoothing the Geometry of Probabilistic Box Embeddings,"[""xiangl@cs.umass.edu"", ""luke@cs.umass.edu"", ""dongxuzhang@cs.umass.edu"", ""mboratko@math.umass.edu"", ""mccallum@cs.umass.edu""]","[""Xiang Li"", ""Luke Vilnis"", ""Dongxu Zhang"", ""Michael Boratko"", ""Andrew McCallum""]","[""embeddings"", ""order embeddings"", ""knowledge graph embedding"", ""relational learning""]","There is growing interest in geometrically-inspired embeddings for learning hierarchies, partial orders, and lattice structures, with natural applications to transitive relational data such as entailment graphs. Recent work has extended these ideas beyond deterministic hierarchies to probabilistically calibrated models, which enable learning from uncertain supervision and inferring soft-inclusions among concepts, while maintaining the geometric inductive bias of hierarchical embedding models. We build on the Box Lattice model of Vilnis et al. (2018), which showed promising results in modeling soft-inclusions through an overlapping hierarchy of sets, parameterized as high-dimensional hyperrectangles (boxes). However, the hard edges of the boxes present difficulties for standard gradient based optimization; that work employed a special surrogate function for the disjoint case, but we find this method to be fragile. In this work, we present a novel hierarchical embedding model, inspired by a relaxation of box embeddings into parameterized density functions using Gaussian convolutions over the boxes. 
Our approach provides an alternative surrogate to the original lattice measure that improves the robustness of optimization in the disjoint case, while also preserving the desirable properties with respect to the original lattice. We demonstrate increased or matching performance on WordNet hypernymy prediction, Flickr caption entailment, and a MovieLens-based market basket dataset. We show especially marked improvements in the case of sparse data, where many conditional probabilities should be low, and thus boxes should be nearly disjoint.",/pdf/30f1421b94869cfbadcc02bf5c259dcd2d51501e.pdf,ICLR,2019,Improve hierarchical embedding models using kernel smoothing +BJzmzn0ctX,r1xj-16ctm,1538090000000.0,1545360000000.0,1251,Scalable Neural Theorem Proving on Knowledge Bases and Natural Language,"[""p.minervini@gmail.com"", ""matko.bosnjak@gmail.com"", ""tim.rocktaeschel@gmail.com"", ""etg@google.com"", ""etg@google.com""]","[""Pasquale Minervini"", ""Matko Bosnjak"", ""Tim Rockt\u00e4schel"", ""Edward Grefenstette"", ""Sebastian Riedel""]","[""Machine Reading"", ""Natural Language Processing"", ""Neural Theorem Proving"", ""Representation Learning"", ""First Order Logic""]","Reasoning over text and Knowledge Bases (KBs) is a major challenge for Artificial Intelligence, with applications in machine reading, dialogue, and question answering. Transducing text to logical forms which can be operated on is a brittle and error-prone process. Operating directly on text by jointly learning representations and transformations thereof by means of neural architectures that lack the ability to learn and exploit general rules can be very data-inefficient and not generalise correctly. 
These issues are addressed by Neural Theorem Provers (NTPs) (Rocktäschel & Riedel, 2017), neuro-symbolic systems based on a continuous relaxation of Prolog’s backward chaining algorithm, where symbolic unification between atoms is replaced by a differentiable operator computing the similarity between their embedding representations. In this paper, we first propose Neighbourhood-approximated Neural Theorem Provers (NaNTPs) consisting of two extensions to NTPs, namely a) a method for drastically reducing the previously prohibitive time and space complexity during inference and learning, and b) an attention mechanism for improving the rule learning process, deeming them usable on real-world datasets. Then, we propose a novel approach for jointly reasoning over KB facts and textual mentions, by jointly embedding them in a shared embedding space. The proposed method is able to extract rules and provide explanations—involving both textual patterns and KB relations—from large KBs and text corpora. We show that NaNTPs perform on par with NTPs at a fraction of a cost, and can achieve competitive link prediction results on challenging large-scale datasets, including WN18, WN18RR, and FB15k-237 (with and without textual mentions) while being able to provide explanations for each prediction and extract interpretable rules.",/pdf/bfa74579164e74950dc3438d3fd16ebb542a6deb.pdf,ICLR,2019,"We scale Neural Theorem Provers to large datasets, improve the rule learning process, and extend it to jointly reason over text and Knowledge Bases." +B1l4SgHKDH,ByxnHuxFwS,1569440000000.0,1587610000000.0,2281,Residual Energy-Based Models for Text Generation,"[""dengyuntian@seas.harvard.edu"", ""yolo@fb.com"", ""aszlam@fb.com"", ""ranzato@fb.com""]","[""Yuntian Deng"", ""Anton Bakhtin"", ""Myle Ott"", ""Arthur Szlam"", ""Marc'Aurelio Ranzato""]","[""energy-based models"", ""text generation""]","Text generation is ubiquitous in many NLP tasks, from summarization, to dialogue and machine translation. 
The dominant parametric approach is based on locally normalized models which predict one word at a time. While these work remarkably well, they are plagued by exposure bias due to the greedy nature of the generation process. In this work, we investigate un-normalized energy-based models (EBMs) which operate not at the token but at the sequence level. In order to make training tractable, we first work in the residual of a pretrained locally normalized language model and second we train using noise contrastive estimation. Furthermore, since the EBM works at the sequence level, we can leverage pretrained bi-directional contextual representations, such as BERT and RoBERTa. Our experiments on two large language modeling datasets show that residual EBMs yield lower perplexity compared to locally normalized baselines. Moreover, generation via importance sampling is very efficient and of higher quality than the baseline models according to human evaluation.",/pdf/a52949443b29f47fd3298eec9c93f3b5d25eb545.pdf,ICLR,2020,We show that Energy-Based models when trained on the residual of an auto-regressive language model can be used effectively and efficiently to generate text. +B1eCk1StPH,ByxXGlo_Dr,1569440000000.0,1577170000000.0,1489,The Generalization-Stability Tradeoff in Neural Network Pruning,"[""bbartoldson@fsu.edu"", ""arimorcos@gmail.com"", ""abarbu@stat.fsu.edu"", ""gerlebacher@fsu.edu""]","[""Brian R. Bartoldson"", ""Ari S. Morcos"", ""Adrian Barbu"", ""Gordon Erlebacher""]","[""pruning"", ""generalization"", ""stability"", ""dynamics"", ""regularization""]","Pruning neural network parameters is often viewed as a means to compress models, but pruning has also been motivated by the desire to prevent overfitting. This motivation is particularly relevant given the perhaps surprising observation that a wide variety of pruning approaches increase test accuracy despite sometimes massive reductions in parameter counts. 
To better understand this phenomenon, we analyze the behavior of pruning over the course of training, finding that pruning's effect on generalization relies more on the instability it generates (defined as the drops in test accuracy immediately following pruning) than on the final size of the pruned model. We demonstrate that even the pruning of unimportant parameters can lead to such instability, and show similarities between pruning and regularizing by injecting noise, suggesting a mechanism for pruning-based generalization improvements that is compatible with the strong generalization recently observed in over-parameterized networks.",/pdf/3b10bc68a8ccf2d2e991723fb6d4688a18e4ae49.pdf,ICLR,2020,"We demonstrate that pruning methods which introduce greater instability into the loss also confer improved generalization, and explore the mechanisms underlying this effect." +S1lk61BtvB,BJeDEQJYvS,1569440000000.0,1577170000000.0,1974,"``""Best-of-Many-Samples"" Distribution Matching","[""abhattac@mpi-inf.mpg.de"", ""fritz@cispa.saarland"", ""schiele@mpi-inf.mpg.de""]","[""Apratim Bhattacharyya"", ""Mario Fritz"", ""Bernt Schiele""]","[""Distribution Matching"", ""Generative Adversarial Networks"", ""Variational Autoencoders""]","Generative Adversarial Networks (GANs) can achieve state-of-the-art sample quality in generative modelling tasks but suffer from the mode collapse problem. Variational Autoencoders (VAE) on the other hand explicitly maximize a reconstruction-based data log-likelihood forcing it to cover all modes, but suffer from poorer sample quality. Recent works have proposed hybrid VAE-GAN frameworks which integrate a GAN-based synthetic likelihood to the VAE objective to address both the mode collapse and sample quality issues, with limited success. This is because the VAE objective forces a trade-off between the data log-likelihood and divergence to the latent prior. The synthetic likelihood ratio term also shows instability during training. 
We propose a novel objective with a ``""Best-of-Many-Samples"" reconstruction cost and a stable direct estimate of the synthetic likelihood. This enables our hybrid VAE-GAN framework to achieve high data log-likelihood and low divergence to the latent prior at the same time and shows significant improvement over both hybrid VAE-GANS and plain GANs in mode coverage and quality.",/pdf/4bc3077bb44c606fda33ddd4f51c5f233ff7a032.pdf,ICLR,2020,We propose a new objective for training hybrid VAE-GANs which lead to significant improvement in mode coverage and quality. +3YQAVD9_Dz3,DkT5u-jEEtV,1601310000000.0,1614990000000.0,1389,NOSE Augment: Fast and Effective Data Augmentation Without Searching,"[""~Qingrui_Li1"", ""xiesong@mail.nwpu.edu.cn"", ""~An\u0131l_Oymagil1"", ""~Mustafa_Furkan_Eseoglu1"", ""zhangziyin1@huawei.com"", ""~CM_Lee1""]","[""Qingrui Li"", ""Song Xie"", ""An\u0131l Oymagil"", ""Mustafa Furkan Eseoglu"", ""Ziyin Zhang"", ""CM Lee""]","[""data augmentation"", ""stochastic policy"", ""multi-stage augmentation""]","Data augmentation has been widely used for enhancing the diversity of training data and model generalization. Different from traditional handcrafted methods, recent research introduced automated search for optimal data augmentation policies and achieved state-of-the-art results on image classification tasks. However, these search-based implementations typically incur high computation cost and long search time because of large search spaces and complex searching algorithms. We revisited automated augmentation from alternate perspectives, such as increasing diversity and manipulating the overall usage of augmented data. In this paper, we present an augmentation method without policy searching called NOSE Augment (NO SEarch Augment). Our method completely skips policy searching; instead, it jointly applies multi-stage augmentation strategy and introduces more augmentation operations on top of a simple stochastic augmentation mechanism. 
With more augmentation operations, we boost the data diversity of stochastic augmentation; and with the phased complexity driven strategy, we ensure the whole training process converged smoothly to a good quality model. We conducted extensive experiments and showed that our method could match or surpass state-of-the-art results provided by search-based methods in terms of accuracies. Without the need for policy search, our method is much more efficient than the existing AutoAugment series of methods. Besides image classification, we also examine the general validity of our proposed method by applying our method to Face Recognition and Text Detection of the Optical Character Recognition (OCR) problems. The results establish our proposed method as a fast and competitive data augmentation strategy that can be used across various CV tasks.",/pdf/c6661535b70d7d1c5973af5fdd2367a2090121b0.pdf,ICLR,2021,Fast and Effective Data Augmentation Without Searching +bjkX6Kzb5H,O5ZNA2md8JP,1601310000000.0,1615940000000.0,2529,"Cut out the annotator, keep the cutout: better segmentation with weak supervision","[""~Sarah_Hooper1"", ""mwornow@stanford.edu"", ""yinghang@stanford.edu"", ""kellmanp@nhlbi.nih.gov"", ""hui.xue@nih.gov"", ""~Frederic_Sala1"", ""~Curtis_Langlotz1"", ""~Christopher_Re1""]","[""Sarah Hooper"", ""Michael Wornow"", ""Ying Hang Seah"", ""Peter Kellman"", ""Hui Xue"", ""Frederic Sala"", ""Curtis Langlotz"", ""Christopher Re""]","[""Weak supervision"", ""segmentation"", ""CNN"", ""latent variable"", ""medical imaging""]","Constructing large, labeled training datasets for segmentation models is an expensive and labor-intensive process. This is a common challenge in machine learning, addressed by methods that require few or no labeled data points such as few-shot learning (FSL) and weakly-supervised learning (WS). 
Such techniques, however, have limitations when applied to image segmentation---FSL methods often produce noisy results and are strongly dependent on which few datapoints are labeled, while WS models struggle to fully exploit rich image information. We propose a framework that fuses FSL and WS for segmentation tasks, enabling users to train high-performing segmentation networks with very few hand-labeled training points. We use FSL models as weak sources in a WS framework, requiring a very small set of reference labeled images, and introduce a new WS model that focuses on key areas---areas with contention among noisy labels---of the image to fuse these weak sources. Empirically, we evaluate our proposed approach over seven well-motivated segmentation tasks. We show that our methods can achieve within 1.4 Dice points compared to fully supervised networks while only requiring five hand-labeled training points. Compared to existing FSL methods, our approach improves performance by a mean 3.6 Dice points over the next-best method. ",/pdf/483be50ec4cee1c25de217a88795d4d99938cb4a.pdf,ICLR,2021,"In this work, we present a weak supervision approach for segmentation tasks, allowing users to train high-performing segmentation CNNs with very few hand-labeled training points." +JNP-CqSjkDb,T5kpx7PYwD9x,1601310000000.0,1614990000000.0,17,Transforming Recurrent Neural Networks with Attention and Fixed-point Equations,"[""~Zhaobin_Xu2"", ""~Baotian_Hu1"", ""~Buzhou_Tang1""]","[""Zhaobin Xu"", ""Baotian Hu"", ""Buzhou Tang""]","[""Fixed-point"", ""Attention"", ""Feed Forward Network"", ""Transformer"", ""Recurrent Neural Network"", ""Deep Learning""]","Transformer has achieved state of the art performance in multiple Natural Language Processing tasks recently. Yet the Feed Forward Network(FFN) in a Transformer block is computationally expensive. 
In this paper, we present a framework to transform Recurrent Neural Networks(RNNs) and their variants into self-attention-style models, with an approximation of Banach Fixed-point Theorem. Within this framework, we propose a new model, StarSaber, by solving a set of equations obtained from RNN with Fixed-point Theorem and further approximate it with a Multi-layer Perceptron. It provides a view of stacking layers. StarSaber achieves better performance than both the vanilla Transformer and an improved version called ReZero on three datasets and is more computationally efficient, due to the reduction of Transformer's FFN layer. It has two major parts. One is a way to encode position information with two different matrices. For every position in a sequence, we have a matrix operating on positions before it and another matrix operating on positions after it. The other is the introduction of direct paths from the input layer to the rest of layers. Ablation studies show the effectiveness of these two parts. We additionally show that other RNN variants such as RNNs with gates can also be transformed in the same way, outperforming the two kinds of Transformers as well.",/pdf/48bcd1778319e6d1ff7e437b50d013a710e0e1c7.pdf,ICLR,2021,From Recurrent Neural Networks to self-attention models with an approximation of Banach Fixed-point Theorem. +BJxI5gHKDr,SJgXTCxYvS,1569440000000.0,1626620000000.0,2473,Pitfalls of In-Domain Uncertainty Estimation and Ensembling in Deep Learning,"[""ars.ashuha@gmail.com"", ""alex.grig.lyzhov@gmail.com"", ""dmolch111@gmail.com"", ""vetrovd@yandex.ru""]","[""Arsenii Ashukha"", ""Alexander Lyzhov"", ""Dmitry Molchanov"", ""Dmitry Vetrov""]","[""uncertainty"", ""in-domain uncertainty"", ""deep ensembles"", ""ensemble learning"", ""deep learning""]","Uncertainty estimation and ensembling methods go hand-in-hand. Uncertainty estimation is one of the main benchmarks for assessment of ensembling performance. 
At the same time, deep learning ensembles have provided state-of-the-art results in uncertainty estimation. In this work, we focus on in-domain uncertainty for image classification. We explore the standards for its quantification and point out pitfalls of existing metrics. Avoiding these pitfalls, we perform a broad study of different ensembling techniques. To provide more insight in this study, we introduce the deep ensemble equivalent score (DEE) and show that many sophisticated ensembling techniques are equivalent to an ensemble of only few independently trained networks in terms of test performance.",/pdf/1958d8a12443afebc2d8a381672b2604b736c837.pdf,ICLR,2020,We highlight the problems with common metrics of in-domain uncertainty and perform a broad study of modern ensembling techniques. +oweBPxtma_i,UiurBqRlzGa,1601310000000.0,1614990000000.0,2719,A self-explanatory method for the black box problem on discrimination part of CNN,"[""~Jinwei_Zhao1"", ""852333436@qq.com"", ""axmaiqiu@foxmail.com"", ""guoxie@xaut.edu.cn"", ""wwang@xaut.edu.cn"", ""heixinhong@xaut.edu.cn"", ""~Deyu_Meng1""]","[""Jinwei Zhao"", ""Qizhou Wang"", ""Wanli Qiu"", ""Guo Xie"", ""Wei Wang"", ""Xinhong Hei"", ""Deyu Meng""]","[""Convolution neural network"", ""Interpretability performance"", ""Markov random field""]","Recently, for finding inherent causality implied in CNN, the black box problem of its discrimination part, which is composed of all fully connected layers of the CNN, has been studied by different scientific communities. Many methods were proposed, which can extract various interpretable models from the optimal discrimination part based on inputs and outputs of the part for finding the inherent causality implied in the part. However, the inherent causality cannot readily be found. We think that the problem could be solved by shrinking an interpretable distance which can evaluate the degree for the discrimination part to be easily explained by an interpretable model. 
This paper proposes a lightweight interpretable model, Deep Cognitive Learning Model(DCLM). And then, a game method between the DCLM and the discrimination part is implemented for shrinking the interpretation distance. Finally, the proposed self-explanatory method was evaluated by some contrastive experiments with certain baseline methods on some standard image processing benchmarks. These experiments indicate that the proposed method can effectively find the inherent causality implied in the discrimination part of the CNN without largely reducing its generalization performance. Moreover, the generalization performance of the DCLM also can be improved.",/pdf/a0d2917e45eb09012a908b5fb840b5a23ca5c221.pdf,ICLR,2021,"For finding the inherent causality implied in the discrimination part of CNN without largely reducing its generalization performance, a self-explanatory method is proposed." +r1ledo0ctX,rylhE3UcF7,1538090000000.0,1545360000000.0,322,Consistency-based anomaly detection with adaptive multiple-hypotheses predictions,"[""nguyen@cs.uni-freiburg.de"", ""zhongyu.lou@de.bosch.com"", ""michael.klar2@de.bosch.com"", ""brox@cs.uni-freiburg.de""]","[""Duc Tam Nguyen"", ""Zhongyu Lou"", ""Michael Klar"", ""Thomas Brox""]","[""Anomaly detection"", ""outlier detection"", ""generative models"", ""VAE"", ""GAN""]","In one-class-learning tasks, only the normal case can be modeled with data, whereas the variation of all possible anomalies is too large to be described sufficiently by samples. Thus, due to the lack of representative data, the wide-spread discriminative approaches cannot cover such learning tasks, and rather generative models, which attempt to learn the input density of the normal cases, are used. However, generative models suffer from a large input dimensionality (as in images) and are typically inefficient learners. We propose to learn the data distribution more efficiently with a multi-hypotheses autoencoder. 
Moreover, the model is criticized by a discriminator, which prevents artificial data modes not supported by data, and which enforces diversity across hypotheses. This consistency-based anomaly detection (ConAD) framework allows the reliable identification of out-of-distribution samples. For anomaly detection on CIFAR-10, it yields up to 3.9% points improvement over previously reported results. On a real anomaly detection task, the approach reduces the error of the baseline models from 6.8% to 1.5%.",/pdf/1272ab87edc1ad89573487944ca87c376ec28bde.pdf,ICLR,2019,We propose an anomaly-detection approach that combines modeling the foreground class via multiple local densities with adversarial training. +NqPW1ZJjXDJ,NRWdT-V2rH,1601310000000.0,1614990000000.0,1155,NASOA: Towards Faster Task-oriented Online Fine-tuning,"[""~Hang_Xu1"", ""kang.ning2@huawei.com"", ""~Gengwei_Zhang1"", ""~Xiaodan_Liang2"", ""~Zhenguo_Li1""]","[""Hang Xu"", ""Ning Kang"", ""Gengwei Zhang"", ""Xiaodan Liang"", ""Zhenguo Li""]","[""Fine-tuning"", ""AutoML"", ""NAS""]","Fine-tuning from pre-trained ImageNet models has been a simple, effective, and popular approach for various computer vision tasks. The common practice of fine-tuning is to adopt a default hyperparameter setting with a fixed pre-trained model, while both of them are not optimized for specific tasks and time constraints. Moreover, in cloud computing or GPU clusters where the tasks arrive sequentially in a stream, faster online fine-tuning is a more desired and realistic strategy for saving money, energy consumption, and CO2 emission. In this paper, we propose a joint Neural Architecture Search and Online Adaption framework named NASOA towards a faster task-oriented fine-tuning upon the request of users. Specifically, NASOA first adopts an offline NAS to identify a group of training-efficient networks to form a pretrained model zoo. We propose a novel joint block and macro-level search space to enable a flexible and efficient search. 
Then, by estimating fine-tuning performance via an adaptive model by accumulating experience from the past tasks, an online schedule generator is proposed to pick up the most suitable model and generate a personalized training regime with respect to each desired task in a one-shot fashion. The resulting model zoo is more training efficient than SOTA NAS models, e.g. 6x faster than RegNetY-16GF, and 1.7x faster than EfficientNetB3. Experiments on multiple datasets also show that NASOA achieves much better fine-tuning results, i.e. improving around 2.1% accuracy than the best performance in RegNet series under various time constraints and tasks; 40x faster compared to the BOHB method.",/pdf/1d0f6964bb8a96e4810d468fbcd63e226a3ceb6f.pdf,ICLR,2021,We propose a Neural Architecture Search and Online Adaption framework named NASOA towards a faster task-oriented fine-tuning upon the request of users. +SkBYYyZRZ,rkHtKyWAZ,1509120000000.0,1518730000000.0,503,Searching for Activation Functions,"[""prajitram@gmail.com"", ""barretzoph@google.com"", ""qvl@google.com""]","[""Prajit Ramachandran"", ""Barret Zoph"", ""Quoc V. Le""]","[""meta learning"", ""activation functions""]","The choice of activation functions in deep networks has a significant effect on the training dynamics and task performance. Currently, the most successful and widely-used activation function is the Rectified Linear Unit (ReLU). Although various hand-designed alternatives to ReLU have been proposed, none have managed to replace it due to inconsistent gains. In this work, we propose to leverage automatic search techniques to discover new activation functions. Using a combination of exhaustive and reinforcement learning-based search, we discover multiple novel activation functions. We verify the effectiveness of the searches by conducting an empirical evaluation with the best discovered activation function. 
Our experiments show that the best discovered activation function, f(x) = x * sigmoid(beta * x), which we name Swish, tends to work better than ReLU on deeper models across a number of challenging datasets. For example, simply replacing ReLUs with Swish units improves top-1 classification accuracy on ImageNet by 0.9% for Mobile NASNet-A and 0.6% for Inception-ResNet-v2. The simplicity of Swish and its similarity to ReLU make it easy for practitioners to replace ReLUs with Swish units in any neural network. ",/pdf/1c6e9cf33ec0b8f38d2638509ea5ee4c2099c24d.pdf,ICLR,2018,"We use search techniques to discover novel activation functions, and our best discovered activation function, f(x) = x * sigmoid(beta * x), outperforms ReLU on a number of challenging tasks like ImageNet." +rylmoxrFDH,BkliuybtPr,1569440000000.0,1583910000000.0,2503,Critical initialisation in continuous approximations of binary neural networks,"[""george.stamatescu@gmail.com"", ""federicagerace91@gmail.com"", ""carlo.lucibello@gmail.com"", ""ian.fuss@adelaide.edu.au"", ""lang.white@adelaide.edu.au""]","[""George Stamatescu"", ""Federica Gerace"", ""Carlo Lucibello"", ""Ian Fuss"", ""Langford White""]",[],"The training of stochastic neural network models with binary ($\pm1$) weights and activations via continuous surrogate networks is investigated. We derive new surrogates using a novel derivation based on writing the stochastic neural network as a Markov chain. This derivation also encompasses existing variants of the surrogates presented in the literature. Following this, we theoretically study the surrogates at initialisation. We derive, using mean field theory, a set of scalar equations describing how input signals propagate through the randomly initialised networks. The equations reveal whether so-called critical initialisations exist for each surrogate network, where the network can be trained to arbitrary depth. 
Moreover, we predict theoretically and confirm numerically, that common weight initialisation schemes used in standard continuous networks, when applied to the mean values of the stochastic binary weights, yield poor training performance. This study shows that, contrary to common intuition, the means of the stochastic binary weights should be initialised close to $\pm 1$, for deeper networks to be trainable.",/pdf/71a41b43a6dc91d6e4322cb83f251fd555711e7e.pdf,ICLR,2020,signal propagation theory applied to continuous surrogates of binary nets; counter intuitive initialisation; reparameterisation trick not helpful +fAbkE6ant2,wEYJxnPg-T6,1601310000000.0,1611670000000.0,2786,Revisiting Locally Supervised Learning: an Alternative to End-to-end Training,"[""~Yulin_Wang1"", ""~Zanlin_Ni1"", ""~Shiji_Song1"", ""~Le_Yang2"", ""~Gao_Huang1""]","[""Yulin Wang"", ""Zanlin Ni"", ""Shiji Song"", ""Le Yang"", ""Gao Huang""]","[""Locally supervised training"", ""Deep learning""]","Due to the need to store the intermediate activations for back-propagation, end-to-end (E2E) training of deep networks usually suffers from high GPUs memory footprint. This paper aims to address this problem by revisiting the locally supervised learning, where a network is split into gradient-isolated modules and trained with local supervision. We experimentally show that simply training local modules with E2E loss tends to collapse task-relevant information at early layers, and hence hurts the performance of the full model. To avoid this issue, we propose an information propagation (InfoPro) loss, which encourages local modules to preserve as much useful information as possible, while progressively discard task-irrelevant information. As InfoPro loss is difficult to compute in its original form, we derive a feasible upper bound as a surrogate optimization objective, yielding a simple but effective algorithm. 
In fact, we show that the proposed method boils down to minimizing the combination of a reconstruction loss and a normal cross-entropy/contrastive term. Extensive empirical results on five datasets (i.e., CIFAR, SVHN, STL-10, ImageNet and Cityscapes) validate that InfoPro is capable of achieving competitive performance with less than 40% memory footprint compared to E2E training, while allowing using training data with higher-resolution or larger batch sizes under the same GPU memory constraint. Our method also enables training local modules asynchronously for potential training acceleration.",/pdf/ae46b2e0daac3e1e7af2c0b30ca3ed05b9675f66.pdf,ICLR,2021,"We provide a deep understanding of locally supervised learning, and make it perform on par with end-to-end training, while with significantly reduced GPUs memory footprint. Code is available at: https://github.com/blackfeather-wang/InfoPro-Pytorch." +HyN-M2Rctm,rkxv5M_cKQ,1538090000000.0,1550780000000.0,1240,Mode Normalization,"[""l.deecke@ed.ac.uk"", ""i.murray@ed.ac.uk"", ""hbilen@ed.ac.uk""]","[""Lucas Deecke"", ""Iain Murray"", ""Hakan Bilen""]","[""Deep Learning"", ""Expert Models"", ""Normalization"", ""Computer Vision""]","Normalization methods are a central building block in the deep learning toolbox. They accelerate and stabilize training, while decreasing the dependence on manually tuned learning rate schedules. When learning from multi-modal distributions, the effectiveness of batch normalization (BN), arguably the most prominent normalization method, is reduced. As a remedy, we propose a more flexible approach: by extending the normalization to more than a single mean and variance, we detect modes of data on-the-fly, jointly normalizing samples that share common features. 
We demonstrate that our method outperforms BN and other widely used normalization techniques in several experiments, including single and multi-task datasets.",/pdf/4ca442899b602012b9aab97d25d27cd99ec26d32.pdf,ICLR,2019,We present a novel normalization method for deep neural networks that is robust to multi-modalities in intermediate feature distributions. +D9I3drBz4UC,DyXlm89SAlC,1601310000000.0,1620710000000.0,148,Long-tailed Recognition by Routing Diverse Distribution-Aware Experts,"[""~Xudong_Wang4"", ""~Long_Lian1"", ""~Zhongqi_Miao1"", ""~Ziwei_Liu1"", ""~Stella_Yu2""]","[""Xudong Wang"", ""Long Lian"", ""Zhongqi Miao"", ""Ziwei Liu"", ""Stella Yu""]","[""Long-tailed Recognition"", ""Bias-variance Decomposition""]","Natural data are often long-tail distributed over semantic classes. Existing recognition methods tackle this imbalanced classification by placing more emphasis on the tail data, through class re-balancing/re-weighting or ensembling over different data groups, resulting in increased tail accuracies but reduced head accuracies. +We take a dynamic view of the training data and provide a principled model bias and variance analysis as the training data fluctuates: Existing long-tail classifiers invariably increase the model variance and the head-tail model bias gap remains large, due to more and larger confusion with hard negatives for the tail. +We propose a new long-tailed classifier called RoutIng Diverse Experts (RIDE). It reduces the model variance with multiple experts, reduces the model bias with a distribution-aware diversity loss, reduces the computational cost with a dynamic expert routing module. RIDE outperforms the state-of-the-art by 5% to 7% on CIFAR100-LT, ImageNet-LT and iNaturalist 2018 benchmarks. It is also a universal framework that is applicable to various backbone networks, long-tailed algorithms and training mechanisms for consistent performance gains. 
Our code is available at: https://github.com/frank-xwang/RIDE-LongTailRecognition.",/pdf/a53160f4df3e5d7b2d13f02d20579f6dd0460010.pdf,ICLR,2021, +DNl5s5BXeBn,eZXClfLMMts,1601310000000.0,1615370000000.0,1240,Fair Mixup: Fairness via Interpolation,"[""~Ching-Yao_Chuang1"", ""~Youssef_Mroueh1""]","[""Ching-Yao Chuang"", ""Youssef Mroueh""]","[""fairness"", ""data augmentation""]","Training classifiers under fairness constraints such as group fairness, regularizes the disparities of predictions between the groups. Nevertheless, even though the constraints are satisfied during training, they might not generalize at evaluation time. To improve the generalizability of fair classifiers, we propose fair mixup, a new data augmentation strategy for imposing the fairness constraint. In particular, we show that fairness can be achieved by regularizing the models on paths of interpolated samples between the groups. We use mixup, a powerful data augmentation strategy to generate these interpolates. We analyze fair mixup and empirically show that it ensures a better generalization for both accuracy and fairness measurement in tabular, vision, and language benchmarks.",/pdf/26fcabb2173a9b48339615540d74ed1bc1c6281e.pdf,ICLR,2021, +Hklc6oAcFX,H1e4vFh5tX,1538090000000.0,1545360000000.0,826,Co-manifold learning with missing data,"[""gal.mishne@yale.edu"", ""eric_chi@ncsu.edu"", ""coifman.ronald@yale.edu""]","[""Gal Mishne"", ""Eric C. Chi"", ""Ronald R. Coifman""]","[""nonlinear dimensionality reduction"", ""missing data"", ""manifold learning"", ""co-clustering"", ""optimization""]"," Representation learning is typically applied to only one mode of a data matrix, either its rows or columns. Yet in many applications, there is an underlying geometry to both the rows and the columns. We propose utilizing this coupled structure to perform co-manifold learning: uncovering the underlying geometry of both the rows and the columns of a given matrix, where we focus on a missing data setting. 
Our unsupervised approach consists of three components. We first solve a family of optimization problems to estimate a complete matrix at multiple scales of smoothness. We then use this collection of smooth matrix estimates to compute pairwise distances on the rows and columns based on a new multi-scale metric that implicitly introduces a coupling between the rows and the columns. Finally, we construct row and column representations from these multi-scale metrics. We demonstrate that our approach outperforms competing methods in both data visualization and clustering. ",/pdf/83422292d839a664716e5ecb8363759ce9144344.pdf,ICLR,2019,Nonlinear representations of observations and features of a data matrix with missing entries and coupled geometries +B1xf9jAqFQ,r1l2o9i9tm,1538090000000.0,1557790000000.0,514,Neural Speed Reading with Structural-Jump-LSTM,"[""chrh@di.ku.dk"", ""c.hansen@di.ku.dk"", ""s.alstrup@di.ku.dk"", ""simonsen@di.ku.dk"", ""c.lioma@di.ku.dk""]","[""Christian Hansen"", ""Casper Hansen"", ""Stephen Alstrup"", ""Jakob Grue Simonsen"", ""Christina Lioma""]","[""natural language processing"", ""speed reading"", ""recurrent neural network"", ""classification""]","Recurrent neural networks (RNNs) can model natural language by sequentially ''reading'' input tokens and outputting a distributed representation of each token. Due to the sequential nature of RNNs, inference time is linearly dependent on the input length, and all inputs are read regardless of their importance. Efforts to speed up this inference, known as ''neural speed reading'', either ignore or skim over part of the input. We present Structural-Jump-LSTM: the first neural speed reading model to both skip and jump text during inference. 
The model consists of a standard LSTM and two agents: one capable of skipping single words when reading, and one capable of exploiting punctuation structure (sub-sentence separators (,:), sentence end symbols (.!?), or end of text markers) to jump ahead after reading a word. +A comprehensive experimental evaluation of our model against all five state-of-the-art neural reading models shows that +Structural-Jump-LSTM achieves the best overall floating point operations (FLOP) reduction (hence is faster), while keeping the same accuracy or even improving it compared to a vanilla LSTM that reads the whole text.",/pdf/5e881b790928082b69453f56c76dd623c197125e.pdf,ICLR,2019,We propose a new model for neural speed reading that utilizes the inherent punctuation structure of a text to define effective jumping and skipping behavior. +rkxraoRcF7,Syg19xKqFm,1538090000000.0,1545360000000.0,797,Learning Disentangled Representations with Reference-Based Variational Autoencoders,"[""adria.ruiz-ovejero@inria.fr"", ""oriol.martinez@upf.edu"", ""xavier.binefa@upf.edu"", ""jakob.verbeek@inria.fr""]","[""Adria Ruiz"", ""Oriol Martinez"", ""Xavier Binefa"", ""Jakob Verbeek""]","[""Disentangled representations"", ""Variational Autoencoders"", ""Adversarial Learning"", ""Weakly-supervised learning""]","Learning disentangled representations from visual data, where different high-level generative factors are independently encoded, is of importance for many computer vision tasks. Supervised approaches, however, require a significant annotation effort in order to label the factors of interest in a training set. To alleviate the annotation cost, we introduce a learning setting which we refer to as ""reference-based disentangling''. Given a pool of unlabelled images, the goal is to learn a representation where a set of target factors are disentangled from others. The only supervision comes from an auxiliary ""reference set'' that contains images where the factors of interest are constant. 
In order to address this problem, we propose reference-based variational autoencoders, a novel deep generative model designed to exploit the weak supervisory signal provided by the reference set. During training, we use the variational inference framework where adversarial learning is used to minimize the objective function. By addressing tasks such as feature learning, conditional image generation or attribute transfer, we validate the ability of the proposed model to learn disentangled representations from minimal supervision. + +",/pdf/21969b81d6646254758df17f820d91858d407c36.pdf,ICLR,2019, +H1uR4GZRZ,BJPC4MbRW,1509140000000.0,1519620000000.0,828,Stochastic Activation Pruning for Robust Adversarial Defense,"[""guneetdhillon@utexas.edu"", ""kazizzad@uci.edu"", ""zlipton@cmu.edu"", ""bernstein@caltech.edu"", ""jean.kossaifi@gmail.com"", ""arankhan@amazon.com"", ""animakumar@gmail.com""]","[""Guneet S. Dhillon"", ""Kamyar Azizzadenesheli"", ""Zachary C. Lipton"", ""Jeremy D. Bernstein"", ""Jean Kossaifi"", ""Aran Khanna"", ""Animashree Anandkumar""]",[],"Neural networks are known to be vulnerable to adversarial examples. Carefully chosen perturbations to real images, while imperceptible to humans, induce misclassification and threaten the reliability of deep learning systems in the wild. To guard against adversarial examples, we take inspiration from game theory and cast the problem as a minimax zero-sum game between the adversary and the model. In general, for such games, the optimal strategy for both players requires a stochastic policy, also known as a mixed strategy. In this light, we propose Stochastic Activation Pruning (SAP), a mixed strategy for adversarial defense. SAP prunes a random subset of activations (preferentially pruning those with smaller magnitude) and scales up the survivors to compensate. We can apply SAP to pretrained networks, including adversarially trained models, without fine-tuning, providing robustness against adversarial examples. 
Experiments demonstrate that SAP confers robustness against attacks, increasing accuracy and preserving calibration.",/pdf/7e7f8343a6a6e3d751e013455df6a1d01bf3ea5b.pdf,ICLR,2018, +qf6Nmm-_6Z,Z3F6s3PsTa,1601310000000.0,1614990000000.0,1969,VECoDeR - Variational Embeddings for Community Detection and Node Representation,"[""~Rayyan_Ahmad_Khan1"", ""~Muhammad_Umer_Anwaar1"", ""~Omran_Kaddah1"", ""~Martin_Kleinsteuber1""]","[""Rayyan Ahmad Khan"", ""Muhammad Umer Anwaar"", ""Omran Kaddah"", ""Martin Kleinsteuber""]","[""Generative Models"", ""Node representation"", ""VECoDeR"", ""Graph Neural Networks"", ""Variational Embeddings""]","In this paper, we study how to simultaneously learn two highly correlated tasks of graph analysis, i.e., community detection and node representation learning. We propose an efficient generative model called VECoDeR for jointly learning Variational Embeddings for COmmunity DEtection and node Representation. VECoDeR assumes that every node can be a member of one or more communities. The node embeddings are learned in such a way that connected nodes are not only ``closer"" to each other but also share similar community assignments. A joint learning framework leverages community-aware node embeddings for better community detection. We demonstrate on several graph datasets that VECoDeR effectively outperforms many competitive baselines on all three tasks i.e. node classification, overlapping community detection and non-overlapping community detection. We also show that VECoDeR is computationally efficient and has quite robust performance with varying hyperparameters.",/pdf/a89a4218c77f2d71201a06bacc08b72acf138c43.pdf,ICLR,2021,A variational approach to jointly learn node and community embeddings for community detection and node representation. 
+rJ8Je4clg,,1478280000000.0,1488850000000.0,202,Learning to Play in a Day: Faster Deep Reinforcement Learning by Optimality Tightening,"[""frankheshibi@gmail.com"", ""liu301@illinois.edu"", ""aschwing@illinois.edu"", ""jianpeng@illinois.edu""]","[""Frank S.He"", ""Yang Liu"", ""Alexander G. Schwing"", ""Jian Peng""]","[""Reinforcement Learning"", ""Optimization"", ""Games""]","We propose a novel training algorithm for reinforcement learning which combines the strength of deep Q-learning with a constrained optimization approach to tighten optimality and encourage faster reward propagation. Our novel technique makes deep reinforcement learning more practical by drastically reducing the training time. We evaluate the performance of our approach on the 49 games of the challenging Arcade Learning Environment, and report significant improvements in both training time and accuracy.",/pdf/95d3eec6ca24b9443ff0c11dcef40cfd2bcebc3b.pdf,ICLR,2017,We propose a novel training algorithm for reinforcement learning which combines the strength of deep Q-learning with a constrained optimization approach to tighten optimality and encourage faster reward propagation. +HJgfDREKDB,rylEaBPdvH,1569440000000.0,1583910000000.0,1166,Higher-Order Function Networks for Learning Composable 3D Object Representations,"[""eric.anthony.mitchell95@gmail.com"", ""engin003@umn.edu"", ""isler@umn.edu"", ""ddlee@seas.upenn.edu""]","[""Eric Mitchell"", ""Selim Engin"", ""Volkan Isler"", ""Daniel D Lee""]","[""computer vision"", ""3d reconstruction"", ""deep learning"", ""representation learning""]","We present a new approach to 3D object representation where a neural network encodes the geometry of an object directly into the weights and biases of a second 'mapping' network. This mapping network can be used to reconstruct an object by applying its encoded transformation to points randomly sampled from a simple geometric space, such as the unit sphere. 
We study the effectiveness of our method through various experiments on subsets of the ShapeNet dataset. We find that the proposed approach can reconstruct encoded objects with accuracy equal to or exceeding state-of-the-art methods with orders of magnitude fewer parameters. Our smallest mapping network has only about 7000 parameters and shows reconstruction quality on par with state-of-the-art object decoder architectures with millions of parameters. Further experiments on feature mixing through the composition of learned functions show that the encoding captures a meaningful subspace of objects.",/pdf/aa1d65118055944102c9de56cbada2966f55de6f.pdf,ICLR,2020,Neural nets can encode complex 3D objects into the parameters of other (surprisingly small) neural nets +r1eU1gHFvH,BJetHc1YDS,1569440000000.0,1577170000000.0,2062,Under what circumstances do local codes emerge in feed-forward neural networks,"[""ella.gale@bristol.ac.uk"", ""nm13850@bristol.ac.uk""]","[""Ella M. Gale"", ""Nicolas Martin""]","[""localist coding"", ""emergence"", ""contructionist science"", ""neural networks"", ""feed-forward"", ""learning representation"", ""distributed coding"", ""generalisation"", ""memorisation"", ""biological plausibility"", ""deep-NNs"", ""training conditions""]","Localist coding schemes are more easily interpretable than the distributed schemes but generally believed to be biologically implausible. Recent results have found highly selective units and object detectors in NNs that are indicative of local codes (LCs). Here we undertake a constructionist study on feed-forward NNs and find LCs emerging in response to invariant features, and this finding is robust until the invariant feature is perturbed by 40%. Decreasing the number of input data, increasing the relative weight of the invariant features and large values of dropout all increase the number of LCs. 
Longer training times increase the number of LCs and the turning point of the LC-epoch curve correlates well with the point at which NNs reach 90-100% on both test and training accuracy. Pseudo-deep networks (2 hidden layers) which have many LCs lose them when common aspects of deep-NN research are applied (large training data, ReLU activations, early stopping on training accuracy and softmax), suggesting that LCs may not be found in deep-NNs. Switching to more biologically feasible constraints (sigmoidal activation functions, longer training times, dropout, activation noise) increases the number of LCs. If LCs are not found in the feed-forward classification layers of modern deep-CNNs these data suggest this could either be caused by a lack of (moderately) invariant features being passed to the fully connected layers or due to the choice of training conditions and architecture. Should the interpretability and resilience to noise of LCs be required, this work suggests how to tune a NN so they emerge. ",/pdf/1a95a87fda2bbd05620c942a21a6b30a5358e3e1.pdf,ICLR,2020,"Localist codes emerge in response to learning a rule and generalising in 3- and 4-layer NNs under some situations, including noise, but are inhibited by softmax, large datasets and early stopping." +HyQWFOVge,,1477900000000.0,1484290000000.0,23,Significance of Softmax-Based Features over Metric Learning-Based Features,"[""horiguchi@hal.t.u-tokyo.ac.jp"", ""ikami@hal.t.u-tokyo.ac.jp"", ""aizawa@hal.t.u-tokyo.ac.jp""]","[""Shota Horiguchi"", ""Daiki Ikami"", ""Kiyoharu Aizawa""]","[""Computer vision"", ""Deep learning""]","The extraction of useful deep features is important for many computer vision tasks. +Deep features extracted from classification networks have proved to perform well in those tasks. +To obtain features of greater usefulness, end-to-end distance metric learning (DML) has been applied to train the feature extractor directly. 
+End-to-end DML approaches such as Magnet Loss and lifted structured feature embedding show state-of-the-art performance in several image recognition tasks. +However, in these DML studies, there were no equitable comparisons between features extracted from a DML-based network and those from a softmax-based network. +In this paper, by presenting objective comparisons between these two approaches under the same network architecture, we show that the softmax-based features are markedly better than the state-of-the-art DML features for tasks such as fine-grained recognition, attribute estimation, clustering, and retrieval.",/pdf/a75e869d17ffc25adb6467b38dfe0189d75f5a75.pdf,ICLR,2017,We show softmax-based features are markedly better than state-of-the-art metric learning-based features by conducting fair comparison between them. +vnlqCDH1b6n,BfWA5mhCLsE,1601310000000.0,1614990000000.0,3340,Learning disentangled representations with the Wasserstein Autoencoder,"[""~Benoit_Gaujac1"", ""~Ilya_Feige1"", ""~David_Barber2""]","[""Benoit Gaujac"", ""Ilya Feige"", ""David Barber""]","[""generative modeling"", ""disentangle learning"", ""wasserstein autoencoder""]","Disentangled representation learning has undoubtedly benefited from objective function surgery. However, a delicate balancing act of tuning is still required in order to trade off reconstruction fidelity versus disentanglement. Building on previous successes of penalizing the total correlation in the latent variables, we propose TCWAE (Total Correlation Wasserstein Autoencoder). Working in the WAE paradigm naturally enables the separation of the total-correlation term, thus providing disentanglement control over the learned representation, while offering more flexibility in the choice of reconstruction cost. 
We propose two variants using different KL estimators and perform extensive quantitative comparisons on data sets with known generative factors, showing competitive results relative to state-of-the-art techniques. We further study the trade off between disentanglement and reconstruction on more-difficult data sets with unknown generative factors, where we expect improved reconstructions due to the flexibility of the WAE paradigm.",/pdf/db187e35db59312ce51ed98f7c3e8e82fa4e0ad7.pdf,ICLR,2021,Improving the reconstruction-disentanglement trade off with the Wasserstein Autoencoder. +rklxF0NtDr,Ske6zMd_vS,1569440000000.0,1577170000000.0,1235,Policy Message Passing: A New Algorithm for Probabilistic Graph Inference,"[""zhiweid@princeton.edu"", ""mori@cs.sfu.ca""]","[""Zhiwei Deng"", ""Greg Mori""]","[""graph inference algorithm"", ""graph reasoning"", ""variational inference""]","A general graph-structured neural network architecture operates on graphs through two core components: (1) complex enough message functions; (2) a fixed information aggregation process. In this paper, we present the Policy Message Passing algorithm, which takes a probabilistic perspective and reformulates the whole information aggregation as stochastic sequential processes. The algorithm works on a much larger search space, utilizes reasoning history to perform inference, and is robust to noisy edges. 
We apply our algorithm to multiple complex graph reasoning and prediction tasks and show that our algorithm consistently outperforms state-of-the-art graph-structured models by a significant margin.",/pdf/56bdfd826120e5151de346a2afbb5b0e3230e801.pdf,ICLR,2020,A probabilistic inference algorithm driven by a neural network for graph-structured models +SkgODpVFDr,r1lDlGoDvr,1569440000000.0,1577170000000.0,602,Incorporating Horizontal Connections in Convolution by Spatial Shuffling,"[""kishida@nlab.ci.i.u-tokyo.ac.jp"", ""nakayama@nlab.ci.i.u-tokyo.ac.jp""]","[""Ikki Kishida"", ""Hideki Nakayama""]","[""shuffle"", ""convolution"", ""receptive field"", ""classification"", ""horizontal connections""]","Convolutional Neural Networks (CNNs) are composed of multiple convolution layers and show elegant performance in vision tasks. +The design of the regular convolution is based on the Receptive Field (RF) where the information within a specific region is processed. +In the view of the regular convolution's RF, the outputs of neurons in lower layers with smaller RF are bundled to create neurons in higher layers with larger RF. +As a result, the neurons in high layers are able to capture the global context even though the neurons in low layers only see the local information. +However, in lower layers of the biological brain, the information outside of the RF changes the properties of neurons. +In this work, we extend the regular convolution and propose spatially shuffled convolution (ss convolution). +In ss convolution, the regular convolution is able to use the information outside of its RF by spatial shuffling which is a simple and lightweight operation. 
+We perform experiments on CIFAR-10 and ImageNet-1k datasets, and show that ss convolution improves the classification performance across various CNNs.",/pdf/666968d34917d4b3de10c2f0db58f5fc268d1628.pdf,ICLR,2020,We propose spatially shuffled convolution in which the regular convolution incorporates information from outside of its receptive field. +ByuI-mW0W,Sk1HWXbC-,1509140000000.0,1518730000000.0,1127,Towards a Testable Notion of Generalization for Generative Adversarial Networks,"[""rcornish@robots.ox.ac.uk"", ""hongseok.yang@cs.ox.ac.uk"", ""fwood@robots.ox.ac.uk""]","[""Robert Cornish"", ""Hongseok Yang"", ""Frank Wood""]","[""generative adversarial networks"", ""Wasserstein"", ""GAN"", ""generalization"", ""theory""]","We consider the question of how to assess generative adversarial networks, in particular with respect to whether or not they generalise beyond memorising the training data. We propose a simple procedure for assessing generative adversarial network performance based on a principled consideration of what the actual goal of generalisation is. Our approach involves using a test set to estimate the Wasserstein distance between the generative distribution produced by our procedure, and the underlying data distribution. We use this procedure to assess the performance of several modern generative adversarial network architectures. We find that this procedure is sensitive to the choice of ground metric on the underlying data space, and suggest a choice of ground metric that substantially improves performance. We finally suggest that attending to the ground metric used in Wasserstein generative adversarial network training may be fruitful, and outline a concrete pathway towards doing so.",/pdf/ea05da52d583577c2fb998bb73fadfa82e01d80b.pdf,ICLR,2018,Assess whether or not your GAN is actually doing something other than memorizing the training data. 
+H1z_Z2A5tX,H1ghPeRqK7,1538090000000.0,1545360000000.0,1185,DON’T JUDGE A BOOK BY ITS COVER - ON THE DYNAMICS OF RECURRENT NEURAL NETWORKS,"[""doron.haviv12@gmail.com"", ""sashkarivkind@gmail.com"", ""omri.barak@gmail.com""]","[""Doron Haviv"", ""Alexander Rivkind"", ""Omri Barak""]",[],"To be effective in sequential data processing, Recurrent Neural Networks (RNNs) are required to keep track of past events by creating memories. Consequently RNNs are harder to train than their feedforward counterparts, prompting the developments of both dedicated units such as LSTM and GRU and of a handful of training tricks. In this paper, we investigate the effect of different training protocols on the representation of memories in RNN. While reaching similar performance for different protocols, RNNs are shown to exhibit substantial differences in their ability to generalize for unforeseen tasks or conditions. We analyze the dynamics of the network’s hidden state, and uncover the reasons for this difference. Each memory is found to be associated with a nearly steady state of the dynamics whose speed predicts performance on unforeseen tasks and which we refer to as a ’slow point’. By tracing the formation of the slow points we are able to understand the origin of differences between training protocols. Our results show that multiple solutions to the same task exist but may rely on different dynamical mechanisms, and that training protocols can bias the choice of such solutions in an interpretable way.",/pdf/af4fb0f05ab91b40dfe13f01b6c565d2e2ca2621.pdf,ICLR,2019, +SkgToo0qFm,Syx_1y89Ym,1538090000000.0,1545360000000.0,661,Transferrable End-to-End Learning for Protein Interface Prediction,"[""raphael@cs.stanford.edu"", ""rbedi@cs.stanford.edu"", ""rondror@cs.stanford.edu""]","[""Raphael J. L. Townshend"", ""Rishi Bedi"", ""Ron O. 
Dror""]","[""transfer learning"", ""protein interface prediction"", ""deep learning"", ""structural biology""]","While there has been an explosion in the number of experimentally determined, atomically detailed structures of proteins, how to represent these structures in a machine learning context remains an open research question. In this work we demonstrate that representations learned from raw atomic coordinates can outperform hand-engineered structural features while displaying a much higher degree of transferrability. To do so, we focus on a central problem in biology: predicting how proteins interact with one another—that is, which surfaces of one protein bind to which surfaces of another protein. We present Siamese Atomic Surfacelet Network (SASNet), the first end-to-end learning method for protein interface prediction. Despite using only spatial coordinates and identities of atoms as inputs, SASNet outperforms state-of-the-art methods that rely on hand-engineered, high-level features. These results are particularly striking because we train the method entirely on a significantly biased data set that does not account for the fact that proteins deform when binding to one another. Demonstrating the first successful application of transfer learning to atomic-level data, our network maintains high performance, without retraining, when tested on real cases in which proteins do deform.",/pdf/5c819e1caf5d4f27080e91b623b5b833c2624756.pdf,ICLR,2019,We demonstrate the first successful application of transfer learning to atomic-level data in order to build a state-of-the-art end-to-end learning model for the protein interface prediction problem. 
+BJg9DoR9t7,rklvqGzYYX,1538090000000.0,1569430000000.0,287,Max-MIG: an Information Theoretic Approach for Joint Learning from Crowds,"[""caopeng2016@pku.edu.cn"", ""xuyilun@pku.edu.cn"", ""yuqing.kong@pku.edu.cn"", ""yizhou.wang@pku.edu.cn""]","[""Peng Cao*"", ""Yilun Xu*"", ""Yuqing Kong"", ""Yizhou Wang""]","[""crowdsourcing"", ""information theory""]","Eliciting labels from crowds is a potential way to obtain large labeled data. Despite a variety of methods developed for learning from crowds, a key challenge remains unsolved: \emph{learning from crowds without knowing the information structure among the crowds a priori, when some people of the crowds make highly correlated mistakes and some of them label effortlessly (e.g. randomly)}. We propose an information theoretic approach, Max-MIG, for joint learning from crowds, with a common assumption: the crowdsourced labels and the data are independent conditioning on the ground truth. Max-MIG simultaneously aggregates the crowdsourced labels and learns an accurate data classifier. Furthermore, we devise an accurate data-crowds forecaster that employs both the data and the crowdsourced labels to forecast the ground truth. To the best of our knowledge, this is the first algorithm that solves the aforementioned challenge of learning from crowds. In addition to the theoretical validation, we also empirically show that our algorithm achieves the new state-of-the-art results in most settings, including the real-world data, and is the first algorithm that is robust to various information structures. Codes are available at https://github.com/Newbeeer/Max-MIG . 
+",/pdf/6d826c78b6d8b90396fcb156ac1799e0ffe56737.pdf,ICLR,2019, +rkxVz1HKwB,rylzUWh_vB,1569440000000.0,1577170000000.0,1576,Certifiably Robust Interpretation in Deep Learning,"[""alevine0@cs.umd.edu"", ""ssingla@cs.umd.edu"", ""sfeizi@cs.umd.edu""]","[""Alexander Levine"", ""Sahil Singla"", ""Soheil Feizi""]","[""deep learning interpretation"", ""robustness certificates"", ""adversarial examples""]","Deep learning interpretation is essential to explain the reasoning behind model predictions. Understanding the robustness of interpretation methods is important especially in sensitive domains such as medical applications since interpretation results are often used in downstream tasks. Although gradient-based saliency maps are popular methods for deep learning interpretation, recent works show that they can be vulnerable to adversarial attacks. In this paper, we address this problem and provide a certifiable defense method for deep learning interpretation. We show that a sparsified version of the popular SmoothGrad method, which computes the average saliency maps over random perturbations of the input, is certifiably robust against adversarial perturbations. We obtain this result by extending recent bounds for certifiably robust smooth classifiers to the interpretation setting. Experiments on ImageNet samples validate our theory.",/pdf/be7f39474ec772cfc66afbd22a4ecb050e893ef8.pdf,ICLR,2020,We develop an interpretation procedure for deep learning models which is certifiably robust to adversarial attack. +S1gTwJSKvr,r1ljCpauwH,1569440000000.0,1577170000000.0,1783,OPTIMAL BINARY QUANTIZATION FOR DEEP NEURAL NETWORKS,"[""mpouransari@apple.com"", ""onceltuzel@gmail.com""]","[""Hadi Pouransari"", ""Oncel Tuzel""]","[""Binary Neural Networks"", ""Quantization""]","Quantizing weights and activations of deep neural networks results in significant improvement in inference efficiency at the cost of lower accuracy. 
A source of the accuracy gap between full precision and quantized models is the quantization error. +In this work, we focus on the binary quantization, in which values are mapped to -1 and 1. We introduce several novel quantization algorithms: optimal 2-bits, optimal ternary, and greedy. Our quantization algorithms can be implemented efficiently on the hardware using bitwise operations. We present proofs to show that our proposed methods are optimal, and also provide empirical error analysis. We conduct experiments on the ImageNet dataset and show a reduced accuracy gap when using the proposed optimal quantization algorithms.",/pdf/2cc6fcf48c3200eaecbee011ccdecbfebef85dc4.pdf,ICLR,2020, +rkxw-hAcFQ,rJeXVyCqFX,1538090000000.0,1550810000000.0,1178,Generating Multi-Agent Trajectories using Programmatic Weak Supervision,"[""ezhan@caltech.edu"", ""stzheng@caltech.edu"", ""yyue@caltech.edu"", ""lsha@stats.com"", ""plucey@stats.com""]","[""Eric Zhan"", ""Stephan Zheng"", ""Yisong Yue"", ""Long Sha"", ""Patrick Lucey""]","[""deep learning"", ""generative models"", ""imitation learning"", ""hierarchical methods"", ""data programming"", ""weak supervision"", ""spatiotemporal""]","We study the problem of training sequential generative models for capturing coordinated multi-agent trajectory behavior, such as offensive basketball gameplay. When modeling such settings, it is often beneficial to design hierarchical models that can capture long-term coordination using intermediate variables. Furthermore, these intermediate variables should capture interesting high-level behavioral semantics in an interpretable and manipulable way. We present a hierarchical framework that can effectively learn such sequential generative models. Our approach is inspired by recent work on leveraging programmatically produced weak labels, which we extend to the spatiotemporal regime. 
In addition to synthetic settings, we show how to instantiate our framework to effectively model complex interactions between basketball players and generate realistic multi-agent trajectories of basketball gameplay over long time periods. We validate our approach using both quantitative and qualitative evaluations, including a user study comparison conducted with professional sports analysts.",/pdf/92bd373f4a7a19220bfef48707ae74610498bda3.pdf,ICLR,2019,We blend deep generative models with programmatic weak supervision to generate coordinated multi-agent trajectories of significantly higher quality than previous baselines. +rylBK34FDS,HJlM6JmhUr,1569440000000.0,1583910000000.0,77,DeepHoyer: Learning Sparser Neural Network with Differentiable Scale-Invariant Sparsity Measures,"[""huanrui.yang@duke.edu"", ""wei.wen@duke.edu"", ""hai.li@duke.edu""]","[""Huanrui Yang"", ""Wei Wen"", ""Hai Li""]","[""Deep neural network"", ""Sparsity inducing regularizer"", ""Model compression""]","In seeking for sparse and efficient neural network models, many previous works investigated on enforcing L1 or L0 regularizers to encourage weight sparsity during training. The L0 regularizer measures the parameter sparsity directly and is invariant to the scaling of parameter values. But it cannot provide useful gradients and therefore requires complex optimization techniques. The L1 regularizer is almost everywhere differentiable and can be easily optimized with gradient descent. Yet it is not scale-invariant and causes the same shrinking rate to all parameters, which is inefficient in increasing sparsity. Inspired by the Hoyer measure (the ratio between L1 and L2 norms) used in traditional compressed sensing problems, we present DeepHoyer, a set of sparsity-inducing regularizers that are both differentiable almost everywhere and scale-invariant. 
Our experiments show that enforcing DeepHoyer regularizers can produce even sparser neural network models than previous works, under the same accuracy level. We also show that DeepHoyer can be applied to both element-wise and structural pruning.",/pdf/1d5e3997e449727a4e2d2c3559ef81ec5d1b8fc0.pdf,ICLR,2020,"We propose almost everywhere differentiable and scale invariant regularizers for DNN pruning, which can lead to supremum sparsity through standard SGD training." +YmqAnY0CMEy,vGaIHn2vXqB,1601310000000.0,1616450000000.0,1327,Mathematical Reasoning via Self-supervised Skip-tree Training,"[""~Markus_Norman_Rabe1"", ""~Dennis_Lee1"", ""~Kshitij_Bansal1"", ""~Christian_Szegedy1""]","[""Markus Norman Rabe"", ""Dennis Lee"", ""Kshitij Bansal"", ""Christian Szegedy""]","[""self-supervised learning"", ""mathematics"", ""reasoning"", ""theorem proving"", ""language modeling""]","We demonstrate that self-supervised language modeling applied to mathematical formulas enables logical reasoning. To measure the logical reasoning abilities of language models, we formulate several evaluation (downstream) tasks, such as inferring types, suggesting missing assumptions and completing equalities. For training language models for formal mathematics, we propose a novel skip-tree task. We find that models trained on the skip-tree task show surprisingly strong mathematical reasoning abilities, and outperform models trained on standard skip-sequence tasks. We also analyze the models' ability to formulate new conjectures by measuring how often the predictions are provable and useful in other proofs.",/pdf/405aeadddeb5c223426f15f57b0e520aeb2ce585.pdf,ICLR,2021,We demonstrate that self-supervised language modeling applied to mathematical formulas enables logical reasoning. 
+ry8_g12nVD,rCLBONPL8Fiz,1601310000000.0,1614990000000.0,1370,SEMANTIC APPROACH TO AGENT ROUTING USING A HYBRID ATTRIBUTE-BASED RECOMMENDER SYSTEM,"[""~Anwitha_Paruchuri1""]","[""Anwitha Paruchuri""]","[""Hybrid Recommendation"", ""Customer Relationship Management"", ""Semantic Embedding"", ""Approximate Nearest Neighbor""]","Traditionally contact centers route an issue to an agent based on ticket load or skill of the agent. When a ticket comes into the system, it is either manually analyzed and pushed to an agent or automatically routed to an agent based on some business rules. A Customer Relationship Management (CRM) system often has predefined categories that an issue could belong to. The agents are generally proficient in handling multiple categories, the categories in the CRM system are often related to each other, and a ticket typically contains content across multiple categories. This makes the traditional approach sub-optimal. We propose a Hybrid Recommendation based approach that recommends top N agents for a ticket by jointly modelling on the interactions between the agents and categories as well as on the semantic features of the categories and the agents.",/pdf/627bbdcf34ada7fb2abb1c19d42ef461079f4623.pdf,ICLR,2021,A Hybrid Recommendation based approach that recommends top N agents for a ticket by jointly modelling on the interactions between the agents and categories as well as on the semantic features of the categories and the agents +S1g7tpEYDS,HJg5_hnvwS,1569440000000.0,1583910000000.0,663,From Variational to Deterministic Autoencoders,"[""partha.ghosh@tuebingen.mpg.de"", ""msajjadi@tue.mpg.de"", ""antonio.vergari@tuebingen.mpg.de"", ""black@tue.mpg.de"", ""bs@tue.mpg.de""]","[""Partha Ghosh"", ""Mehdi S. M. 
Sajjadi"", ""Antonio Vergari"", ""Michael Black"", ""Bernhard Scholkopf""]","[""Unsupervised learning"", ""Generative Models"", ""Variational Autoencoders"", ""Regularization""]"," Variational Autoencoders (VAEs) provide a theoretically-backed and popular framework for deep generative models. However, learning a VAE from data poses still unanswered theoretical questions and considerable practical challenges. In this work, we propose an alternative framework for generative modeling that is simpler, easier to train, and deterministic, yet has many of the advantages of the VAE. We observe that sampling a stochastic encoder in a Gaussian VAE can be interpreted as simply injecting noise into the input of a deterministic decoder. We investigate how substituting this kind of stochasticity, with other explicit and implicit regularization schemes, can lead to an equally smooth and meaningful latent space without having to force it to conform to an arbitrarily chosen prior. To retrieve a generative mechanism to sample new data points, we introduce an ex-post density estimation step that can be readily applied to the proposed framework as well as existing VAEs, improving their sample quality. We show, in a rigorous empirical study, that the proposed regularized deterministic autoencoders are able to generate samples that are comparable to, or better than, those of VAEs and more powerful alternatives when applied to images as well as to structured data such as molecules. ",/pdf/4855d3d40c16853fcf3844059592f0c5783a60fe.pdf,ICLR,2020,"Deterministic regularized autoencoders can learn a smooth, meaningful latent space as VAEs without having to force some arbitrarily chosen prior (i.e., Gaussian)." 
+BJxSWeSYPB,rJln4egYvr,1569440000000.0,1577170000000.0,2135,Self-supervised Training of Proposal-based Segmentation via Background Prediction,"[""isinsu.katircioglu@epfl.ch"", ""rhodin@cs.ubc.ca"", ""victor.constantin@epfl.ch"", ""joerg.spoerri@balgrist.ch"", ""mathieu.salzmann@epfl.ch"", ""pascal.fua@epfl.ch""]","[""Isinsu Katircioglu"", ""Helge Rhodin"", ""Victor Constantin"", ""J\u00f6rg Sp\u00f6rri"", ""Mathieu Salzmann"", ""Pascal Fua""]",[],"While supervised object detection and segmentation methods achieve impressive accuracy, they generalize poorly to images whose appearance significantly differs from the data they have been trained on. To address this in scenarios where annotating data is prohibitively expensive, we introduce a self-supervised approach to detection and segmentation, able to work with monocular images captured with a moving camera. At the heart of our approach lies the observations that object segmentation and background reconstruction are linked tasks, and that, for structured scenes, background regions can be re-synthesized from their surroundings, whereas regions depicting the object cannot. +We encode this intuition as a self-supervised loss function that we exploit to train a proposal-based segmentation network. To account for the discrete nature of the proposals, we develop a Monte Carlo-based training strategy that allows the algorithm to explore the large space of object proposals. 
We apply our method to human detection and segmentation in images that visually depart from those of standard benchmarks, achieving competitive results compared to the few existing self-supervised methods and approaching the accuracy of supervised ones that exploit large annotated datasets.",/pdf/e88079182cb5248f8aa3398d395d00fc9bc75170.pdf,ICLR,2020, +iQxS0S9ir1a,2k3f2NymIbf,1601310000000.0,1614990000000.0,1932,Distributional Generalization: A New Kind of Generalization,"[""~Preetum_Nakkiran1"", ""~Yamini_Bansal1""]","[""Preetum Nakkiran"", ""Yamini Bansal""]","[""understanding deep learning"", ""generalization"", ""interpolating methods"", ""empirical investigation""]","We introduce a new notion of generalization--- Distributional Generalization--- which roughly states that outputs of a classifier at train and test time are close as distributions, as opposed to close in just their average error. For example, if we mislabel 30% of dogs as cats in the train set of CIFAR-10, then a ResNet trained to interpolation will in fact mislabel roughly 30% of dogs as cats on the test set as well, while leaving other classes unaffected. This behavior is not captured by classical generalization, which would only consider the average error and not the distribution of errors over the input domain. Our formal conjectures, which are much more general than this example, characterize the form of distributional generalization that can be expected in terms of problem parameters: model architecture, training procedure, number of samples, and data distribution. We give empirical evidence for these conjectures across a variety of domains in machine learning, including neural networks, kernel machines, and decision trees. 
Our results thus advance our understanding of interpolating classifiers.",/pdf/83b8f83a27d27f15b90866b2dfe44deff0fe8c48.pdf,ICLR,2021,"We introduce a new notion of generalization (""Distributional Generalization""), to characterize empirical observations of interpolating classifiers." +lE1AB4stmX,zDH4XGIft-m,1601310000000.0,1614990000000.0,2233,A Transformer-based Framework for Multivariate Time Series Representation Learning,"[""~George_Zerveas1"", ""j.srideepika@ibm.com"", ""pateldha@us.ibm.com"", ""anubham@us.ibm.com"", ""~Carsten_Eickhoff1""]","[""George Zerveas"", ""Srideepika Jayaraman"", ""Dhaval Patel"", ""Anuradha Bhamidipaty"", ""Carsten Eickhoff""]","[""transformer"", ""multivariate time series"", ""unsupervised representation learning"", ""deep learning""]","In this work we propose for the first time a transformer-based framework for unsupervised representation learning of multivariate time series. Pre-trained models can be potentially used for downstream tasks such as regression and classification, forecasting and missing value imputation. We evaluate our models on several benchmark datasets for multivariate time series regression and classification and show that they exceed current state-of-the-art performance, even when the number of training samples is very limited, while at the same time offering computational efficiency. 
We show that unsupervised pre-training of our transformer models offers a substantial performance benefit over fully supervised learning, even without leveraging additional unlabeled data, i.e., by reusing the same data samples through the unsupervised objective.",/pdf/94d9b3b7124c900589a8c1b5e86d7cea7aa46f86.pdf,ICLR,2021, We propose a transformer-based framework for unsupervised representation learning of multivariate time series +Drynvt7gg4L,4EL-2Hbik5B,1601310000000.0,1614600000000.0,1751,AdaSpeech: Adaptive Text to Speech for Custom Voice,"[""t-miche@microsoft.com"", ""~Xu_Tan1"", ""bohan.li@microsoft.com"", ""yanqliu@microsoft.com"", ""~Tao_Qin1"", ""~sheng_zhao1"", ""~Tie-Yan_Liu1""]","[""Mingjian Chen"", ""Xu Tan"", ""Bohan Li"", ""Yanqing Liu"", ""Tao Qin"", ""sheng zhao"", ""Tie-Yan Liu""]","[""Text to speech"", ""adaptation"", ""fine-tuning"", ""custom voice"", ""acoustic condition modeling"", ""conditional layer normalization""]","Custom voice, a specific text to speech (TTS) service in commercial speech platforms, aims to adapt a source TTS model to synthesize personal voice for a target speaker using few speech from her/him. Custom voice presents two unique challenges for TTS adaptation: 1) to support diverse customers, the adaptation model needs to handle diverse acoustic conditions which could be very different from source speech data, and 2) to support a large number of customers, the adaptation parameters need to be small enough for each target speaker to reduce memory usage while maintaining high voice quality. In this work, we propose AdaSpeech, an adaptive TTS system for high-quality and efficient customization of new voices. We design several techniques in AdaSpeech to address the two challenges in custom voice: 1) To handle different acoustic conditions, we model the acoustic information in both utterance and phoneme level. 
Specifically, we use one acoustic encoder to extract an utterance-level vector and another one to extract a sequence of phoneme-level vectors from the target speech during pre-training and fine-tuning; in inference, we extract the utterance-level vector from a reference speech and use an acoustic predictor to predict the phoneme-level vectors. 2) To better trade off the adaptation parameters and voice quality, we introduce conditional layer normalization in the mel-spectrogram decoder of AdaSpeech, and fine-tune this part in addition to speaker embedding for adaptation. We pre-train the source TTS model on LibriTTS datasets and fine-tune it on VCTK and LJSpeech datasets (with different acoustic conditions from LibriTTS) with few adaptation data, e.g., 20 sentences, about 1 minute speech. Experiment results show that AdaSpeech achieves much better adaptation quality than baseline methods, with only about 5K specific parameters for each speaker, which demonstrates its effectiveness for custom voice. The audio samples are available at https://speechresearch.github.io/adaspeech/.",/pdf/a106470a7fdf6a1a5575c27c3e1b22bbc5ec2fa6.pdf,ICLR,2021,"We propose AdaSpeech, an adaptive TTS system for high-quality and efficient adaptation of new speaker in custom voice." +0N8jUH4JMv6,xIdwjTH-Cir,1601310000000.0,1616080000000.0,1468,Implicit Convex Regularizers of CNN Architectures: Convex Optimization of Two- and Three-Layer Networks in Polynomial Time,"[""~Tolga_Ergen1"", ""~Mert_Pilanci3""]","[""Tolga Ergen"", ""Mert Pilanci""]","[""Convex optimization"", ""non-convex optimization"", ""group sparsity"", ""$\\ell_1$ norm"", ""convex duality"", ""polynomial time"", ""deep learning""]","We study training of Convolutional Neural Networks (CNNs) with ReLU activations and introduce exact convex optimization formulations with a polynomial complexity with respect to the number of data samples, the number of neurons, and data dimension. 
More specifically, we develop a convex analytic framework utilizing semi-infinite duality to obtain equivalent convex optimization problems for several two- and three-layer CNN architectures. We first prove that two-layer CNNs can be globally optimized via an $\ell_2$ norm regularized convex program. We then show that multi-layer circular CNN training problems with a single ReLU layer are equivalent to an $\ell_1$ regularized convex program that encourages sparsity in the spectral domain. We also extend these results to three-layer CNNs with two ReLU layers. Furthermore, we present extensions of our approach to different pooling methods, which elucidates the implicit architectural bias as convex regularizers.",/pdf/dba1d25e1354e478235ccc68af0dd34e0cf91c79.pdf,ICLR,2021,We study the training problem for various CNN architectures with ReLU activations and introduce equivalent finite dimensional convex formulations that can be used to globally optimize these architectures. +dx11_7vm5_r,T2Z_TAlzu6m,1601310000000.0,1615860000000.0,2848,Linear Last-iterate Convergence in Constrained Saddle-point Optimization,"[""~Chen-Yu_Wei1"", ""~Chung-Wei_Lee1"", ""~Mengxiao_Zhang2"", ""~Haipeng_Luo1""]","[""Chen-Yu Wei"", ""Chung-Wei Lee"", ""Mengxiao Zhang"", ""Haipeng Luo""]","[""Saddle-point Optimization"", ""Optimistic Mirror Decent"", ""Optimistic Gradient Descent Ascent"", ""Optimistic Multiplicative Weights Update"", ""Last-iterate Convergence"", ""Game Theory""]","Optimistic Gradient Descent Ascent (OGDA) and Optimistic Multiplicative Weights Update (OMWU) for saddle-point optimization have received growing attention due to their favorable last-iterate convergence. However, their behaviors for simple bilinear games over the probability simplex are still not fully understood --- previous analysis lacks explicit convergence rates, only applies to an exponentially small learning rate, or requires additional assumptions such as the uniqueness of the optimal solution. 
+ +In this work, we significantly expand the understanding of last-iterate convergence for OGDA and OMWU in the constrained setting. Specifically, for OMWU in bilinear games over the simplex, we show that when the equilibrium is unique, linear last-iterate convergence is achievable with a constant learning rate, which improves the result of (Daskalakis & Panageas, 2019) under the same assumption. We then significantly extend the results to more general objectives and feasible sets for the projected OGDA algorithm, by introducing a sufficient condition under which OGDA exhibits concrete last-iterate convergence rates with a constant learning rate. We show that bilinear games over any polytope satisfy this condition and OGDA converges exponentially fast even without the unique equilibrium assumption. Our condition also holds for strongly-convex-strongly-concave functions, recovering the result of (Hsieh et al., 2019). Finally, we provide experimental results to further support our theory. ",/pdf/80ab11841a700c095d09408aebe0552dc6c2c21f.pdf,ICLR,2021,We prove Optimistic Gradient Descent Ascent (OGDA) and Optimistic Multiplicative Weights Update (OMWU) converge exponentially fast to the Nash equilibrium in the sense of last-iterate in various game settings including matrix games. +SJecKyrKPH,HyxcywCODS,1569440000000.0,1577170000000.0,1850,ICNN: INPUT-CONDITIONED FEATURE REPRESENTATION LEARNING FOR TRANSFORMATION-INVARIANT NEURAL NETWORK,"[""surajtripathi93@gmail.com"", ""c.singh@samsung.com"", ""abykumar12011@gmail.com""]","[""Suraj Tripathi"", ""Chirag Singh"", ""Abhay Kumar""]","[""Transformation-invariance"", ""Reconstruction"", ""Run-time Convolution Filter generation""]","We propose a novel framework, ICNN, which combines the input-conditioned filter generation module and a decoder based network to incorporate contextual information present in images into Convolutional Neural Networks (CNNs). 
In contrast to traditional CNNs, we do not employ the same set of learned convolution filters for all input image instances. And our proposed decoder network serves the purpose of reducing the transformation present in the input image by learning to construct a representative image of the input image class. Our proposed joint supervision of input-aware framework when combined with techniques inspired by Multi-instance learning and max-pooling, results in a transformation-invariant neural network. We investigated the performance of our proposed framework on three MNIST variations, which covers both rotation and scaling variance, and achieved 0.98% error on MNIST-rot-12k, 1.12% error on Half-rotated MNIST and 0.68% error on Scaling MNIST, which is significantly better than the state-of-the-art results. Our proposed model also showcased consistent improvement on the CIFAR dataset. We make use of visualization to further prove the effectiveness of our input-aware convolution filters. Our proposed convolution filter generation framework can also serve as a plugin for any CNN based architecture and enhance its modeling capacity.",/pdf/d10a39b6c0a2291ea84dae419ff4b8ca056dd8fa.pdf,ICLR,2020, +Pgq5GE_-ph,eamCygIZ4W8,1601310000000.0,1614990000000.0,1290,Video Prediction with Variational Temporal Hierarchies,"[""~Vaibhav_Saxena1"", ""~Jimmy_Ba1"", ""~Danijar_Hafner1""]","[""Vaibhav Saxena"", ""Jimmy Ba"", ""Danijar Hafner""]","[""latent dynamics"", ""temporal abstraction"", ""video prediction"", ""probabilistic modeling"", ""variational inference"", ""deep learning""]","Deep learning has shown promise for accurately predicting high-dimensional video sequences. Existing video prediction models succeeded in generating sharp but often short video sequences. Toward improving long-term video prediction, we study hierarchical latent variable models with levels that process at different time scales. 
To gain insights into the representations of such models, we study the information stored at each level of the hierarchy via the KL divergence, predictive entropy, datasets of varying speed, and generative distributions. Our analysis confirms that faster changing details are generally captured by lower levels, while slower changing facts are remembered by higher levels. On synthetic datasets where common methods fail after 25 frames, we show that temporally abstract latent variable models can make accurate predictions for up to 200 frames.",/pdf/18f7fec8042f9416ad2da4bee0099f3f48b4420d.pdf,ICLR,2021,"We introduce and investigate the properties of a temporally-abstract latent dynamics model, trained using a variational objective, for long-horizon video prediction." +B1xpI1BFDS,S1lPGYT_DS,1569440000000.0,1577170000000.0,1746,Semi-Supervised Few-Shot Learning with a Controlled Degree of Task-Adaptive Conditioning,"[""shyoon8@kaist.ac.kr"", ""tjwns0630@kaist.ac.kr"", ""jmoon@kaist.edu""]","[""Sung Whan Yoon"", ""Jun Seo"", ""Jaekyun Moon""]","[""few-shot learning"", ""meta-learning"", ""semi-supervised learning"", ""task-adaptive clustering"", ""task-adaptive projection space""]","Few-shot learning aims to handle previously unseen tasks using only a small amount of new training data. In preparing (or meta-training) a few-shot learner, however, massive labeled data are necessary. In the real world, unfortunately, labeled data are expensive and/or scarce. In this work, we propose a few-shot learner that can work well under the semi-supervised setting where a large portion of training data is unlabeled. Our method employs explicit task-conditioning in which unlabeled sample clustering for the current task takes place in a new projection space different from the embedding feature space. 
The conditioned clustering space is linearly constructed so as to quickly close the gap between the class centroids for the current task and the independent per-class reference vectors meta-trained across tasks. In a more general setting, our method introduces a concept of controlling the degree of task-conditioning for meta-learning: the amount of task-conditioning varies with the number of repetitive updates for the clustering space. During each update, the soft labels of the unlabeled samples estimated in the conditioned clustering space are used to update the class averages in the original embedded space, which in turn are used to reconstruct the clustering space. Extensive simulation results based on the miniImageNet and tieredImageNet datasets show state-of-the-art semi-supervised few-shot classification performance of the proposed method. Simulation results also indicate that the proposed task-adaptive clustering shows graceful degradation with a growing number of distractor samples, i.e., unlabeled samples coming from outside the candidate classes.",/pdf/0b81cb3112b75baec01c7d38d67fbff4a147eaac.pdf,ICLR,2020,We propose a semi-supervised few-shot learning algorithm with a controlled degree of task-adaptive conditioning by an iterative update of a task-conditioned projection space where the clustering of unlabeled samples takes place. 
+SkGT6sRcFX,B1x1Xta5KQ,1538090000000.0,1545360000000.0,848,Infinitely Deep Infinite-Width Networks,"[""jovana.mitrovic@spc.ox.ac.uk"", ""pewi@google.com"", ""cblundell@google.com"", ""dino.sejdinovic@stats.ox.ac.uk"", ""ywteh@google.com""]","[""Jovana Mitrovic"", ""Peter Wirnsberger"", ""Charles Blundell"", ""Dino Sejdinovic"", ""Yee Whye Teh""]","[""Infinite-width networks"", ""initialisation"", ""kernel methods"", ""reproducing kernel Hilbert spaces"", ""Gaussian processes""]","Infinite-width neural networks have been extensively used to study the theoretical properties underlying the extraordinary empirical success of standard, finite-width neural networks. Nevertheless, until now, infinite-width networks have been limited to at most two hidden layers. To address this shortcoming, we study the initialisation requirements of these networks and show that the main challenge for constructing them is defining the appropriate sampling distributions for the weights. Based on these observations, we propose a principled approach to weight initialisation that correctly accounts for the functional nature of the hidden layer activations and facilitates the construction of arbitrarily many infinite-width layers, thus enabling the construction of arbitrarily deep infinite-width networks. The main idea of our approach is to iteratively reparametrise the hidden-layer activations into appropriately defined reproducing kernel Hilbert spaces and use the canonical way of constructing probability distributions over these spaces for specifying the required weight distributions in a principled way. Furthermore, we examine the practical implications of this construction for standard, finite-width networks. In particular, we derive a novel weight initialisation scheme for standard, finite-width networks that takes into account the structure of the data and information about the task at hand. 
We demonstrate the effectiveness of this weight initialisation approach on the MNIST, CIFAR-10 and Year Prediction MSD datasets.",/pdf/e99c6267511df6941bfe62d71a1549e15f3d9620.pdf,ICLR,2019,"We propose a method for the construction of arbitrarily deep infinite-width networks, based on which we derive a novel weight initialisation scheme for finite-width networks and demonstrate its competitive performance." +SkVhlh09tX,BkxV9H65tm,1538090000000.0,1550710000000.0,1115,Pay Less Attention with Lightweight and Dynamic Convolutions,"[""fw245@cornell.edu"", ""angelfan@fb.com"", ""alexei.b@gmail.com"", ""yann@dauphin.io"", ""michael.auli@gmail.com""]","[""Felix Wu"", ""Angela Fan"", ""Alexei Baevski"", ""Yann Dauphin"", ""Michael Auli""]","[""Deep learning"", ""sequence to sequence learning"", ""convolutional neural networks"", ""generative models""]","Self-attention is a useful mechanism to build generative models for language and images. It determines the importance of context elements by comparing each element to the current time step. In this paper, we show that a very lightweight convolution can perform competitively to the best reported self-attention results. Next, we introduce dynamic convolutions which are simpler and more efficient than self-attention. We predict separate convolution kernels based solely on the current time-step in order to determine the importance of context elements. The number of operations required by this approach scales linearly in the input length, whereas self-attention is quadratic. Experiments on large-scale machine translation, language modeling and abstractive summarization show that dynamic convolutions improve over strong self-attention models. On the WMT'14 English-German test set dynamic convolutions achieve a new state of the art of 29.7 BLEU.",/pdf/bc01ded1a059dfb3a08003d867a2147f5f3dbceb.pdf,ICLR,2019,Dynamic lightweight convolutions are competitive to self-attention on language tasks. 
+B1x9siCcYQ,HJlxp4GOKQ,1538090000000.0,1545360000000.0,646,SENSE: SEMANTICALLY ENHANCED NODE SEQUENCE EMBEDDING,"[""srallapalli@us.ibm.com"", ""maliang@us.ibm.com"", ""msrivats@us.ibm.com"", ""ananthram.swami.civ@mail.mil"", ""heesung.kwon.civ@mail.mil"", ""gbent@uk.ibm.com"", ""simpkin.chris@gmail.com""]","[""Swati Rallapalli"", ""Liang Ma"", ""Mudhakar Srivatsa"", ""Ananthram Swami"", ""Heesung Kwon"", ""Graham Bent"", ""Christopher Simpkin""]","[""Semantic"", ""Graph"", ""Sequence"", ""Embeddings""]","Effectively capturing graph node sequences in the form of vector embeddings is critical to many applications. We achieve this by (i) first learning vector embeddings of single graph nodes and (ii) then composing them to compactly represent node sequences. Specifically, we propose SENSE-S (Semantically Enhanced Node Sequence Embedding - for Single nodes), a skip-gram based novel embedding mechanism, for single graph nodes that co-learns graph structure as well as their textual descriptions. We demonstrate that SENSE-S vectors increase the accuracy of multi-label classification tasks by up to 50% and link-prediction tasks by up to 78% under a variety of scenarios using real datasets. Based on SENSE-S, we next propose generic SENSE to compute composite vectors that represent a sequence of nodes, where preserving the node order is important. We prove that this approach is efficient in embedding node sequences, and our experiments on real data confirm its high accuracy in node order decoding.",/pdf/1e4101b36f641d2548e6797942ecc9147a8a02c4.pdf,ICLR,2019,Node sequence embedding mechanism that captures both graph and text properties. 
+BJVEEF9lx,,1478300000000.0,1481320000000.0,475,Learning Approximate Distribution-Sensitive Data Structures,"[""zenna@mit.edu"", ""asolar@csail.mit.edu""]","[""Zenna Tavares"", ""Armando Solar-Lezama""]","[""Unsupervised Learning""]","We present a computational model of mental representations as data-structures which are distribution sensitive, i.e., which exploit non-uniformity in their usage patterns to reduce time or space complexity. +Abstract data types equipped with axiomatic specifications specify classes of concrete data structures with equivalent logical behavior. +We extend this formalism to distribution-sensitive data structures with the concept of a probabilistic axiomatic specification, which is implemented by a concrete data structure only with some probability. +We employ a number of approximations to synthesize several distribution-sensitive data structures from probabilistic specification as deep neural networks, such as a stack, queue, natural number, set, and binary tree.",/pdf/26c4b7d083283a4341a08dfde779756c2c9267c6.pdf,ICLR,2017,We model mental representations as abstract distribution-sensitive data types and synthesize concrete implementations using deep networks from specification +l0mSUROpwY,67KofNpoaWq,1601310000000.0,1615290000000.0,1870,Intrinsic-Extrinsic Convolution and Pooling for Learning on 3D Protein Structures,"[""~Pedro_Hermosilla1"", ""marco.schaefer@uni-tuebingen.de"", ""242528@mail.muni.cz"", ""gloria.fackelmann@uni-ulm.de"", ""pere.pau@cs.upc.edu"", ""kozlikova@fi.muni.cz"", ""michael.krone@uni-tuebingen.de"", ""~Tobias_Ritschel1"", ""~Timo_Ropinski2""]","[""Pedro Hermosilla"", ""Marco Sch\u00e4fer"", ""Matej Lang"", ""Gloria Fackelmann"", ""Pere-Pau V\u00e1zquez"", ""Barbora Kozlikova"", ""Michael Krone"", ""Tobias Ritschel"", ""Timo Ropinski""]","[""classification"", ""bioinformatics""]","Proteins perform a large variety of functions in living organisms and thus play a key role in biology. 
However, commonly used algorithms in protein representation learning were not specifically designed for protein data, and are therefore not able to capture all relevant structural levels of a protein during learning. To fill this gap, we propose two new learning operators, specifically designed to process protein structures. First, we introduce a novel convolution operator that considers the primary, secondary, and tertiary structure of a protein by using $n$-D convolutions defined on both the Euclidean distance, as well as multiple geodesic distances between the atoms in a multi-graph. Second, we introduce a set of hierarchical pooling operators that enable multi-scale protein analysis. We further evaluate the accuracy of our algorithms on common downstream tasks, where we outperform state-of-the-art protein learning algorithms.",/pdf/06a67dcf3e870e158211013994abdc6c41706fea.pdf,ICLR,2021,We present a new neural network architecture to process 3D protein structures. +rJlnOhVYPS,HJxu1t6OLS,1569440000000.0,1583910000000.0,55,Mutual Mean-Teaching: Pseudo Label Refinery for Unsupervised Domain Adaptation on Person Re-identification,"[""yxge@link.cuhk.edu.hk"", ""chendapeng@sensetime.com"", ""hsli@ee.cuhk.edu.hk""]","[""Yixiao Ge"", ""Dapeng Chen"", ""Hongsheng Li""]","[""Label Refinery"", ""Unsupervised Domain Adaptation"", ""Person Re-identification""]","Person re-identification (re-ID) aims at identifying the same persons' images across different cameras. However, domain diversities between different datasets pose an evident challenge for adapting the re-ID model trained on one dataset to another one. State-of-the-art unsupervised domain adaptation methods for person re-ID transferred the learned knowledge from the source domain by optimizing with pseudo labels created by clustering algorithms on the target domain. Although they achieved state-of-the-art performances, the inevitable label noise caused by the clustering procedure was ignored. 
Such noisy pseudo labels substantially hinders the model's capability on further improving feature representations on the target domain. In order to mitigate the effects of noisy pseudo labels, we propose to softly refine the pseudo labels in the target domain by proposing an unsupervised framework, Mutual Mean-Teaching (MMT), to learn better features from the target domain via off-line refined hard pseudo labels and on-line refined soft pseudo labels in an alternative training manner. In addition, the common practice is to adopt both the classification loss and the triplet loss jointly for achieving optimal performances in person re-ID models. However, conventional triplet loss cannot work with softly refined labels. To solve this problem, a novel soft softmax-triplet loss is proposed to support learning with soft pseudo triplet labels for achieving the optimal domain adaptation performance. The proposed MMT framework achieves considerable improvements of 14.4%, 18.2%, 13.1% and 16.4% mAP on Market-to-Duke, Duke-to-Market, Market-to-MSMT and Duke-to-MSMT unsupervised domain adaptation tasks.",/pdf/e74af9cf4a4cb3b9e76e7b524c3533ee23441b65.pdf,ICLR,2020,A framework that conducts online refinement of pseudo labels with a novel soft softmax-triplet loss for unsupervised domain adaptation on person re-identification. +eU776ZYxEpz,n1_O-SCkGG7,1601310000000.0,1616000000000.0,3565,Learning to live with Dale's principle: ANNs with separate excitatory and inhibitory units,"[""~Jonathan_Cornford1"", ""damjank7354@gmail.com"", ""marco.leite.11@ucl.ac.uk"", ""al858@cam.ac.uk"", ""d.kullmann@ucl.ac.uk"", ""~Blake_Aaron_Richards1""]","[""Jonathan Cornford"", ""Damjan Kalajdzievski"", ""Marco Leite"", ""Am\u00e9lie Lamarquette"", ""Dimitri Michael Kullmann"", ""Blake Aaron Richards""]",[]," The units in artificial neural networks (ANNs) can be thought of as abstractions of biological neurons, and ANNs are increasingly used in neuroscience research. 
However, there are many important differences between ANN units and real neurons. One of the most notable is the absence of Dale's principle, which ensures that biological neurons are either exclusively excitatory or inhibitory. Dale's principle is typically left out of ANNs because its inclusion impairs learning. This is problematic, because one of the great advantages of ANNs for neuroscience research is their ability to learn complicated, realistic tasks. Here, by taking inspiration from feedforward inhibitory interneurons in the brain we show that we can develop ANNs with separate populations of excitatory and inhibitory units that learn just as well as standard ANNs. We call these networks Dale's ANNs (DANNs). We present two insights that enable DANNs to learn well: (1) DANNs are related to normalization schemes, and can be initialized such that the inhibition centres and standardizes the excitatory activity, (2) updates to inhibitory neuron parameters should be scaled using corrections based on the Fisher Information matrix. These results demonstrate how ANNs that respect Dale's principle can be built without sacrificing learning performance, which is important for future work using ANNs as models of the brain. The results may also have interesting implications for how inhibitory plasticity in the real brain operates.",/pdf/4ec70d1b966600fcb426c0ea0982e93dd870f226.pdf,ICLR,2021, +B1lnbRNtwr,Hkl-2P4dDH,1569440000000.0,1583910000000.0,981,Global Relational Models of Source Code,"[""vjhellendoorn@gmail.com"", ""charlessutton@google.com"", ""rising@google.com"", ""maniatis@google.com"", ""dbieber@google.com""]","[""Vincent J. Hellendoorn"", ""Charles Sutton"", ""Rishabh Singh"", ""Petros Maniatis"", ""David Bieber""]","[""Models of Source Code"", ""Graph Neural Networks"", ""Structured Learning""]","Models of code can learn distributed representations of a program's syntax and semantics to predict many non-trivial properties of a program. 
Recent state-of-the-art models leverage highly structured representations of programs, such as trees, graphs and paths therein (e.g. data-flow relations), which are precise and abundantly available for code. This provides a strong inductive bias towards semantically meaningful relations, yielding more generalizable representations than classical sequence-based models. Unfortunately, these models primarily rely on graph-based message passing to represent relations in code, which makes them de facto local due to the high cost of message-passing steps, quite in contrast to modern, global sequence-based models, such as the Transformer. In this work, we bridge this divide between global and structured models by introducing two new hybrid model families that are both global and incorporate structural bias: Graph Sandwiches, which wrap traditional (gated) graph message-passing layers in sequential message-passing layers; and Graph Relational Embedding Attention Transformers (GREAT for short), which bias traditional Transformers with relational information from graph edge types. By studying a popular, non-trivial program repair task, variable-misuse identification, we explore the relative merits of traditional and hybrid model families for code representation. Starting with a graph-based model that already improves upon the prior state-of-the-art for this task by 20%, we show that our proposed hybrid models improve an additional 10-15%, while training both faster and using fewer parameters.",/pdf/1a70bffc358e61847e6c4b29f824a9204f8ce4c3.pdf,ICLR,2020,Models of source code that combine global and structural features learn more powerful representations of programs. 
+rJlnfaNYvB,Skxgsep8PH,1569440000000.0,1577170000000.0,428,Adaptive Loss Scaling for Mixed Precision Training,"[""ruizhe.zhao15@imperial.ac.uk"", ""vogel@preferred.jp"", ""tanvira@preferred.jp""]","[""Ruizhe Zhao"", ""Brian Vogel"", ""Tanvir Ahmed""]","[""Deep Learning"", ""Mixed Precision Training"", ""Loss Scaling"", ""Backpropagation""]","Mixed precision training (MPT) is becoming a practical technique to improve the speed and energy efficiency of training deep neural networks by leveraging the fast hardware support for IEEE half-precision floating point that is available in existing GPUs. MPT is typically used in combination with a technique called loss scaling, which works by scaling up the loss value before the start of backpropagation in order to minimize the impact of numerical underflow on training. Unfortunately, existing methods make this loss scale value a hyperparameter that needs to be tuned per-model, and a single scale cannot be adapted to different layers at different training stages. We introduce a loss scaling-based training method called adaptive loss scaling that makes MPT easier and more practical to use, by removing the need to tune a model-specific loss scale hyperparameter. We achieve this by introducing layer-wise loss scale values which are automatically computed during training to deal with underflow more effectively than existing methods. We present experimental results on a variety of networks and tasks that show our approach can shorten the time to convergence and improve accuracy, compared with using the existing state-of-the-art MPT and single-precision floating point.",/pdf/09d3a81bc7b28a6018c753a5595e79adceb30483.pdf,ICLR,2020,We devise adaptive loss scaling to improve mixed precision training that surpasses the state-of-the-art results. 
+wVYtfckXU0T,KF88yzrAn0G,1601310000000.0,1614990000000.0,2823,PriorityCut: Occlusion-aware Regularization for Image Animation,"[""~Wai_Ting_Cheung1"", ""~Gyeongsu_Chae1""]","[""Wai Ting Cheung"", ""Gyeongsu Chae""]","[""image animation"", ""occlusion"", ""inpainting"", ""gan"", ""augmentation"", ""regularization""]","Image animation generates a video of a source image following the motion of a driving video. Self-supervised image animation approaches do not require explicit pose references as inputs, thus offering large flexibility in learning. State-of-the-art self-supervised image animation approaches mostly warp the source image according to the motion of the driving video, and recover the warping artifacts by inpainting. When the source and the driving images have large pose differences, heavy inpainting is necessary. Without guidance, heavily inpainted regions usually suffer from loss of details. While previous data augmentation techniques such as CutMix are effective in regularizing non-warp-based image generation, directly applying them to image animation ignores the difficulty of inpainting on the warped image. We propose PriorityCut, a novel augmentation approach that uses the top-$k$ percent occluded pixels of the foreground to regularize image animation. 
By taking into account the difficulty of inpainting, PriorityCut preserves better identity than vanilla CutMix and outperforms state-of-the-art image animation models in terms of the pixel-wise difference, low-level similarity, keypoint distance, and feature embedding distance.",/pdf/a3119e3ce8b62d0cd43a5bb3f77b9f865f9d6c8b.pdf,ICLR,2021, +B1G9doA9F7,rJgOuG-cK7,1538090000000.0,1545420000000.0,380,Augmented Cyclic Adversarial Learning for Low Resource Domain Adaptation,"[""ehosseiniasl@salesforce.com"", ""yingbo.zhou@salesforce.com"", ""cxiong@salesforce.com"", ""rsocher@salesforce.com""]","[""Ehsan Hosseini-Asl"", ""Yingbo Zhou"", ""Caiming Xiong"", ""Richard Socher""]","[""Domain adaptation"", ""generative adversarial network"", ""cyclic adversarial learning"", ""speech""]","Training a model to perform a task typically requires a large amount of data from the domains in which the task will be applied. +However, it is often the case that data are abundant in some domains but scarce in others. Domain adaptation deals with the challenge of adapting a model trained from a data-rich source domain to perform well in a data-poor target domain. In general, this requires learning plausible mappings between domains. CycleGAN is a powerful framework that efficiently learns to map inputs from one domain to another using adversarial training and a cycle-consistency constraint. However, the conventional approach of enforcing cycle-consistency via reconstruction may be overly restrictive in cases where one or more domains have limited training data. In this paper, we propose an augmented cyclic adversarial learning model that enforces the cycle-consistency constraint via an external task specific model, which encourages the preservation of task-relevant content as opposed to exact reconstruction. We explore digit classification in a low-resource setting in supervised, semi and unsupervised situation, as well as high resource unsupervised. 
In low-resource supervised setting, the results show that our approach improves absolute performance by 14% and 4% when adapting SVHN to MNIST and vice versa, respectively, which outperforms unsupervised domain adaptation methods that require high-resource unlabeled target domain. Moreover, using only few unsupervised target data, our approach can still outperforms many high-resource unsupervised models. Our model also outperforms on USPS to MNIST and synthetic digit to SVHN for high resource unsupervised adaptation. In speech domains, we similarly adopt a speech recognition model from each domain as the task specific model. Our approach improves absolute performance of speech recognition by 2% for female speakers in the TIMIT dataset, where the majority of training samples are from male voices.",/pdf/c388f0bf7ba87cc0111e19c9f56ff08876cb1a1f.pdf,ICLR,2019,A new cyclic adversarial learning augmented with auxiliary task model which improves domain adaptation performance in low resource supervised and unsupervised situations +HJ5PIaseg,,1478380000000.0,1484440000000.0,600,Towards an automatic Turing test: Learning to evaluate dialogue responses,"[""rlowe1@cs.mcgill.ca"", ""michael.noseworthy@mail.mcgill.ca"", ""julianserban@gmail.com"", ""nicolas.angelard-gontier@mail.mcgill.ca"", ""yoshua.umontreal@gmail.com"", ""jpineau@cs.mcgill.ca""]","[""Ryan Lowe"", ""Michael Noseworthy"", ""Iulian V. Serban"", ""Nicolas Angelard-Gontier"", ""Yoshua Bengio"", ""Joelle Pineau""]","[""Natural language processing"", ""Applications""]","Automatically evaluating the quality of dialogue responses for unstructured domains is a challenging problem. +Unfortunately, existing automatic evaluation metrics are biased and correlate very poorly with human judgements of response quality (Liu et al., 2016). Yet having an accurate automatic evaluation procedure is crucial for dialogue research, as it allows rapid prototyping and testing of new models with fewer expensive human evaluations. 
In response to this challenge, we formulate automatic dialogue evaluation as a learning problem. We present an evaluation model (ADEM) that learns to predict human-like scores for input responses, using a new dataset of human response scores. We show that the ADEM model's predictions correlate significantly, and at a level much higher than word-overlap metrics such as BLEU, with human judgements at both the utterance and system-level. We also show that ADEM can generalize to evaluating dialogue models unseen during training, an important step for automatic dialogue evaluation.",/pdf/299bbfb2dafff739ce04e4c28ecd407498b28e8b.pdf,ICLR,2017,We propose a model for evaluating dialogue responses that correlates significantly with human judgement at the utterance-level and system-level. +ByBwSPcex,,1478290000000.0,1484960000000.0,382,Song From PI: A Musically Plausible Network for Pop Music Generation,"[""chuhang1122@cs.toronto.edu"", ""urtasun@cs.toronto.edu"", ""fidler@cs.toronto.edu""]","[""Hang Chu"", ""Raquel Urtasun"", ""Sanja Fidler""]","[""Applications""]","We present a novel framework for generating pop music. Our model is a hierarchical Recurrent Neural Network, where the layers and the structure of the hierarchy encode our prior knowledge about how pop music is composed. In particular, the bottom layers generate the melody, while the higher levels produce the drums and chords. We conduct several human studies that show strong preference of our generated music over that produced by the recent method by Google. We additionally show two applications of our framework: neural dancing and karaoke, as well as neural story singing.",/pdf/f3a2607fd70152913005f63894f3e14c8052d7a1.pdf,ICLR,2017,"We present a novel hierarchical RNN for generating pop music, where the layers and the structure of the hierarchy encode our prior knowledge about how pop music is composed." 
+r1e7NgrYvH,rylF4IlFPB,1569440000000.0,1577170000000.0,2243,DO-AutoEncoder: Learning and Intervening Bivariate Causal Mechanisms in Images,"[""cts17@mails.tsinghua.edu.cn"", ""lepangdan@outlook.com"", ""liufurui2@huawei.com"", ""chenzhitang2@huawei.com""]","[""Tianshuo Cong"", ""Dan Peng"", ""Furui Liu"", ""Zhitang Chen""]","[""Causality discovery"", ""AutoEncoder"", ""Deep representation learning"", ""Do-calculus""]","Some fundamental limitations of deep learning have been exposed, such as lacking generalizability and being vulnerable to adversarial attack. Instead, researchers realize that causation is much more stable than association relationship in data. In this paper, we propose a new framework called do-calculus AutoEncoder (DO-AE) for deep representation learning that fully captures bivariate causal relationships in images, which allows us to intervene in the image generation process. DO-AE consists of two key ingredients: causal relationship mining in images and intervention-enabling deep causal structured representation learning. The goal here is to learn deep representations that correspond to the concepts in the physical world as well as their causal structure. To verify the proposed method, we create a dataset named PHY2D, which contains abstract graphic description in accordance with the laws of physics. Our experiments demonstrate our method is able to correctly identify the bivariate causal relationship between concepts in images and the representation learned enables a do-calculus manipulation to images, which generates artificial images that might possibly break the physical law depending on where we intervene in the causal system.",/pdf/c75e7d930ec5a0bf2427b3cc38fcc6a8fe9ef579.pdf,ICLR,2020,We propose a new framework for deep representation learning that fully captures bivariate causal relationships in images. 
+HkgFDgSYPH,H1g3hseYwB,1569440000000.0,1577170000000.0,2369,Adaptive Online Planning for Continual Lifelong Learning,"[""kzl@berkeley.edu"", ""imordatch@google.com"", ""pabbeel@cs.berkeley.edu""]","[""Kevin Lu"", ""Igor Mordatch"", ""Pieter Abbeel""]","[""reinforcement learning"", ""model predictive control"", ""planning"", ""model based"", ""model free"", ""uncertainty"", ""computation""]","We study learning control in an online lifelong learning scenario, where mistakes can compound catastrophically into the future and the underlying dynamics of the environment may change. Traditional model-free policy learning methods have achieved successes in difficult tasks due to their broad flexibility, and capably condense broad experiences into compact networks, but struggle in this setting, as they can activate failure modes early in their lifetimes which are difficult to recover from and face performance degradation as dynamics change. On the other hand, model-based planning methods learn and adapt quickly, but require prohibitive levels of computational resources. Under constrained computation limits, the agent must allocate its resources wisely, which requires the agent to understand both its own performance and the current state of the environment: knowing that its mastery over control in the current dynamics is poor, the agent should dedicate more time to planning. We present a new algorithm, Adaptive Online Planning (AOP), that achieves strong performance in this setting by combining model-based planning with model-free learning. By measuring the performance of the planner and the uncertainty of the model-free components, AOP is able to call upon more extensive planning only when necessary, leading to reduced computation times. 
We show that AOP gracefully deals with novel situations, adapting behaviors and policies effectively in the face of unpredictable changes in the world -- challenges that a continual learning agent naturally faces over an extended lifetime -- even when traditional reinforcement learning methods fail.",/pdf/f91fd7a7df034d99d02ba839e3acb925b1b64971.pdf,ICLR,2020,We propose a method for reducing planning in MPC by measuring the uncertainty of model-free value and policy networks. +1sJWR4y1lG,#NAME?,1601310000000.0,1614990000000.0,3466,Deep Learning Is Composite Kernel Learning,"[""~CHANDRA_SHEKAR_LAKSHMINARAYANAN1"", ""~Amit_Vikram_Singh1""]","[""CHANDRA SHEKAR LAKSHMINARAYANAN"", ""Amit Vikram Singh""]","[""deep learning"", ""kernel methods""]","Recent works have connected deep learning and kernel methods. In this paper, we show that architectural choices such as convolutional layers with pooling, skip connections, make deep learning a composite kernel learning method, where the kernel is a (architecture dependent) composition of base kernels: even before training, standard deep networks have in-built structural properties that ensure their success. In particular, we build on the recently developed `neural path' framework that characterises the role of gates/masks in fully connected deep networks with ReLU activations. 
",/pdf/23652e06be20c57711daf78122ee630fe513f87f.pdf,ICLR,2021, +SkEYojRqtm,ryeKct1NF7,1538090000000.0,1550860000000.0,644,Representation Degeneration Problem in Training Natural Language Generation Models,"[""jungao@cs.toronto.edu"", ""dihe@microsoft.com"", ""xu.tan@microsoft.com"", ""taoqin@microsoft.com"", ""wanglw@cis.pku.edu.cn"", ""tyliu@microsoft.com""]","[""Jun Gao"", ""Di He"", ""Xu Tan"", ""Tao Qin"", ""Liwei Wang"", ""Tieyan Liu""]","[""Natural Language Processing"", ""Representation Learning""]","We study an interesting problem in training neural network-based models for natural language generation tasks, which we call the \emph{representation degeneration problem}. We observe that when training a model for natural language generation tasks through likelihood maximization with the weight tying trick, especially with big training datasets, most of the learnt word embeddings tend to degenerate and be distributed into a narrow cone, which largely limits the representation power of word embeddings. We analyze the conditions and causes of this problem and propose a novel regularization method to address it. Experiments on language modeling and machine translation show that our method can largely mitigate the representation degeneration problem and achieve better performance than baseline algorithms.",/pdf/065d6ce30d1e8d2078936c7a6a727ad8ff0b1032.pdf,ICLR,2019, +BJFG8Yqxl,,1478300000000.0,1479660000000.0,493,Group Sparse CNNs for Question Sentence Classification with Answer Sets,"[""mam@oregonstate.edu"", ""liang.huang@oregonstate.edu"", ""bingxia@us.ibm.com"", ""zhou@us.ibm.com""]","[""Mingbo Ma"", ""Liang Huang"", ""Bing Xiang"", ""Bowen Zhou""]",[],"Classifying question sentences into their corresponding categories is an important task with wide applications, for example in many websites' FAQ sections. 
+However, traditional question classification techniques do not fully utilize the well-prepared answer data which has great potential for improving question sentence representations which could lead to better classification performance. In order to encode answer information into question representation, we first introduce novel group sparse autoencoders which could utilize the group information in the answer set to refine question representation. We then propose a new group sparse convolutional neural network which could naturally learn the question representation with respect to their corresponding answers by implanting the group sparse autoencoders into the traditional convolutional neural network. The proposed model shows significant improvements over strong baselines on four datasets. ",/pdf/b2be6aeb38cfc5a9b7f5edb69bcb453b6ff4e0cc.pdf,ICLR,2017, +B1n8LexRZ,Bys8UegRb,1509060000000.0,1519370000000.0,212,Generalizing Hamiltonian Monte Carlo with Neural Networks,"[""danilevy@cs.stanford.edu"", ""mhoffman@google.com"", ""jaschasd@google.com""]","[""Daniel Levy"", ""Matt D. Hoffman"", ""Jascha Sohl-Dickstein""]","[""markov"", ""chain"", ""monte"", ""carlo"", ""sampling"", ""posterior"", ""deep"", ""learning"", ""hamiltonian"", ""mcmc""]","We present a general-purpose method to train Markov chain Monte Carlo kernels, parameterized by deep neural networks, that converge and mix quickly to their target distribution. Our method generalizes Hamiltonian Monte Carlo and is trained to maximize expected squared jumped distance, a proxy for mixing speed. We demonstrate large empirical gains on a collection of simple but challenging distributions, for instance achieving a 106x improvement in effective sample size in one case, and mixing when standard HMC makes no measurable progress in a second. Finally, we show quantitative and qualitative gains on a real-world task: latent-variable generative modeling. 
Python source code will be open-sourced with the camera-ready paper.",/pdf/4cfdd65ac9cc0a074f83a94d7cb3e3516cf34af6.pdf,ICLR,2018,"General method to train expressive MCMC kernels parameterized with deep neural networks. Given a target distribution p, our method provides a fast-mixing sampler, able to efficiently explore the state space." +eIPsmKwTrIe,UTbuv7mFyxq,1601310000000.0,1614990000000.0,3451,Using Deep Reinforcement Learning to Train and Evaluate Instructional Sequencing Policies for an Intelligent Tutoring System,"[""~Jithendaraa_Subramanian1"", ""~David_Mostow1""]","[""Jithendaraa Subramanian"", ""David Mostow""]","[""Deep Reinforcement Learning"", ""Intelligent Tutoring Systems"", ""Adaptive policy"", ""Instructional Sequencing""]","We present STEP, a novel Deep Reinforcement Learning solution to the problem of learning instructional sequencing. STEP has three components: 1. Simulate the student by fitting a knowledge tracing model to data logged by an intelligent tutoring system. 2. Train instructional sequencing policies by using Proximal Policy Optimization. 3. Evaluate the learned instructional policies by estimating their local and global impact on learning gains. STEP leverages the student model by representing the student’s knowledge state as a probability vector of knowing each skill and using the student’s estimated learning gains as its reward function to evaluate candidate policies. A learned policy represents a mapping from each state to an action that maximizes the reward, i.e. the upward distance to the next state in the multi-dimensional space. We use STEP to discover and evaluate potential improvements to a literacy and numeracy tutor used by hundreds of children in Tanzania.",/pdf/941a81e0fd95f6ca90977e1d6e610bb49dd7ef35.pdf,ICLR,2021,A Deep Reinforcement Learning framework that can be used by Intelligent Tutoring System to learn an instructional policy that maximizes student learning gains. 
+ry1arUgCW,Bk0nSUlRW,1509090000000.0,1519310000000.0,274,DORA The Explorer: Directed Outreaching Reinforcement Action-Selection,"[""lior.fox@mail.huji.ac.il"", ""leshem.choshen@mail.huji.ac.il"", ""yonatan.loewenstein@mail.huji.ac.il""]","[""Lior Fox"", ""Leshem Choshen"", ""Yonatan Loewenstein""]","[""Reinforcement Learning"", ""Exploration"", ""Model-Free""]","Exploration is a fundamental aspect of Reinforcement Learning, typically implemented using stochastic action-selection. Exploration, however, can be more efficient if directed toward gaining new world knowledge. Visit-counters have been proven useful both in practice and in theory for directed exploration. However, a major limitation of counters is their locality. While there are a few model-based solutions to this shortcoming, a model-free approach is still missing. +We propose $E$-values, a generalization of counters that can be used to evaluate the propagating exploratory value over state-action trajectories. We compare our approach to commonly used RL techniques, and show that using $E$-values improves learning and performance over traditional counters. We also show how our method can be implemented with function approximation to efficiently learn continuous MDPs. 
We demonstrate this by showing that our approach surpasses state of the art performance in the Freeway Atari 2600 game.",/pdf/41a9ae50b68e2fdf85490c726be5373bb3ff7e94.pdf,ICLR,2018,"We propose a generalization of visit-counters that evaluate the propagating exploratory value over trajectories, enabling efficient exploration for model-free RL" +BJxGan4FPB,BygcLqNmPS,1569440000000.0,1577170000000.0,219,Transfer Alignment Network for Double Blind Unsupervised Domain Adaptation,"[""xuhuiwen33@gmail.com"", ""ukang@snu.ac.kr""]","[""Huiwen Xu"", ""U Kang""]","[""unsupervised domain adaptation"", ""double blind domain adaptation""]","How can we transfer knowledge from a source domain to a target domain when each side cannot observe the data in the other side? The recent state-of-the-art deep architectures show significant performance in classification tasks which highly depend on a large number of training data. In order to resolve the dearth of abundant target labeled data, transfer learning and unsupervised learning leverage data from different sources and unlabeled data as training data, respectively. However, in some practical settings, transferring source data to target domain is restricted due to a privacy policy. + +In this paper, we define the problem of unsupervised domain adaptation under double blind constraint, where either the source or the target domain cannot observe the data in the other domain, but data from both domains are used for training. We propose TAN (Transfer Alignment Network for Double Blind Domain Adaptation), an effective method for the problem by aligning source and target domain features. TAN maps the target feature into source feature space so that the classifier learned from the labeled data in the source domain is readily used in the target domain. 
Extensive experiments show that TAN 1) provides the state-of-the-art accuracy for double blind domain adaptation, and 2) outperforms baselines regardless of the proportion of target domain data in the training data. +",/pdf/58108c27c285a2ea44150b48ad096ed26423cc96.pdf,ICLR,2020,"We propose an effective method for double blind domain adaptation problem where either source or target domain cannot observe the data in the other domain, but data from both domains are used for training. " +Sk6fD5yCb,BypMD910b,1509040000000.0,1518730000000.0,148,Espresso: Efficient Forward Propagation for Binary Deep Neural Networks,"[""fpeder@uvic.ca"", ""gtzan@uvic.ca"", ""ataiya@uvic.ca""]","[""Fabrizio Pedersoli"", ""George Tzanetakis"", ""Andrea Tagliasacchi""]","[""binary deep neural networks"", ""optimized implementation"", ""bitwise computations""]"," There are many applications scenarios for which the computational + performance and memory footprint of the prediction phase of Deep + Neural Networks (DNNs) need to be optimized. Binary Deep Neural + Networks (BDNNs) have been shown to be an effective way of achieving + this objective. In this paper, we show how Convolutional Neural + Networks (CNNs) can be implemented using binary + representations. Espresso is a compact, yet powerful + library written in C/CUDA that features all the functionalities + required for the forward propagation of CNNs, in a binary file less + than 400KB, without any external dependencies. Although it is mainly + designed to take advantage of massive GPU parallelism, Espresso also + provides an equivalent CPU implementation for CNNs. Espresso + provides special convolutional and dense layers for BCNNs, + leveraging bit-packing and bit-wise computations + for efficient execution. These techniques provide a speed-up of + matrix-multiplication routines, and at the same time, reduce memory + usage when storing parameters and activations. 
We experimentally + show that Espresso is significantly faster than existing + implementations of optimized binary neural networks (~ 2 + orders of magnitude). Espresso is released under the Apache 2.0 + license and is available at http://github.com/organization/project.",/pdf/5f3a59e74a0684045b2f6a725979a044e77e533b.pdf,ICLR,2018,state-of-the-art computational performance implementation of binary neural networks +RGeQOjc58d,LH6hC1JHNcs,1601310000000.0,1614990000000.0,813,Improved Gradient based Adversarial Attacks for Quantized Networks,"[""~Kartik_Gupta2"", ""~Thalaiyasingam_Ajanthan1""]","[""Kartik Gupta"", ""Thalaiyasingam Ajanthan""]","[""binary neural network"", ""gradient masking"", ""fake robustness"", ""temperature scaling"", ""adversarial attack"", ""signal propagation""]","Neural network quantization has become increasingly popular due to efficient memory consumption and faster computation resulting from bitwise operations on the quantized networks. Even though they exhibit excellent generalization capabilities, their robustness properties are not well-understood. In this work, we systematically study the robustness of quantized networks against gradient based adversarial attacks and demonstrate that these quantized models suffer from gradient vanishing issues and show a fake sense of robustness. By attributing gradient vanishing to poor forward-backward signal propagation in the trained network, we introduce a simple temperature scaling approach to mitigate this issue while preserving the decision boundary. Despite being a simple modification to existing gradient based adversarial attacks, experiments on CIFAR-10/100 datasets with multiple network architectures demonstrate that our temperature scaled attacks obtain near-perfect success rate on quantized networks while outperforming original attacks on adversarially trained models as well as floating point networks. 
+",/pdf/2ebb7ce4576bee865d541d83f90e39d632f5f736.pdf,ICLR,2021,"In this work, we systematically study the robustness of quantized networks against gradient based adversarial attacks and propose PGD++ attack variants to overcome existing gradient masking issues in BNNs." +Skl-fyHKPH,BylVmg3uvH,1569440000000.0,1577170000000.0,1569,A Mean-Field Theory for Kernel Alignment with Random Features in Generative Adverserial Networks,"[""mbadieik@stanford.edu"", ""liyues@stanford.edu"", ""shahin@tamu.edu"", ""lei@stanford.edu""]","[""Masoud Badiei Khuzani"", ""Liyue Shen"", ""Shahin Shahrampour"", ""Lei Xing""]","[""Kernel Learning"", ""Generative Adversarial Networks"", ""Mean Field Theory""]","We propose a novel supervised learning method to optimize the kernel in maximum mean discrepancy generative adversarial networks (MMD GANs). Specifically, we characterize a distributionally robust optimization problem to compute a good distribution for the random feature model of Rahimi and Recht to approximate a good kernel function. Due to the fact that the distributional optimization is infinite dimensional, we consider a Monte-Carlo sample average approximation (SAA) to obtain a more tractable finite dimensional optimization problem. We subsequently leverage a particle stochastic gradient descent (SGD) method to solve finite dimensional optimization problems. Based on a mean-field analysis, we then prove that the empirical distribution of the interactive particles system at each iteration of the SGD follows the path of the gradient descent flow on the Wasserstein manifold. We also establish the non-asymptotic consistency of the finite sample estimator. 
Our empirical evaluation on synthetic data-set as well as MNIST and CIFAR-10 benchmark data-sets indicates that our proposed MMD GAN model with kernel learning indeed attains higher inception scores as well as Fr\`{e}chet inception distances and generates better images compared to the generative moment matching network (GMMN) and MMD GAN with untrained kernels.",/pdf/972a465dfb2d38e2bc7df8f38aab6963b96a6490.pdf,ICLR,2020,We propose a novel method to learn the kernel in MMD GANs and prove theoretical results for its performance. +H1leCRNYvS,B1gHArqdDr,1569440000000.0,1577170000000.0,1417,Hierarchical Bayes Autoencoders,"[""szhai@apple.com"", ""guestrin@apple.com"", ""jsusskind@apple.com""]","[""Shuangfei Zhai"", ""Carlos Guestrin"", ""Joshua M. Susskind""]",[],"Autoencoders are powerful generative models for complex data, such as images. However, standard models like the variational autoencoder (VAE) typically have unimodal Gaussian decoders, which cannot effectively represent the possible semantic variations in the space of images. To address this problem, we present a new probabilistic generative model called the \emph{Hierarchical Bayes Autoencoder (HBAE)}. The HBAE contains a multimodal decoder in the form of an energy-based model (EBM), instead of the commonly adopted unimodal Gaussian distribution. The HBAE can be trained using variational inference, similar to a VAE, to recover latent codes conditioned on inputs. For the decoder, we use an adversarial approximation where a conditional generator is trained to match the EBM distribution. During inference time, the HBAE consists of two sampling steps: first a latent code for the input is sampled, and then this code is passed to the conditional generator to output a stochastic reconstruction. The HBAE is also capable of modeling sets, by inferring a latent code for a set of examples, and sampling set members through the multimodal decoder. 
In both single image and set cases, the decoder generates plausible variations consistent with the input data, and generates realistic unconditional samples. To the best of our knowledge, Set-HBAE is the first model that is able to generate complex image sets.",/pdf/490b0c1657a43b35a2b3d8f4bdb44b0a409bb133.pdf,ICLR,2020, +r1l0VCNKwB,HJxypPLdPH,1569440000000.0,1577170000000.0,1094,LOSSLESS SINGLE IMAGE SUPER RESOLUTION FROM LOW-QUALITY JPG IMAGES,"[""yshi@unomaha.edu"", ""libiao17@mails.ucas.ac.cn"", ""wangbo@uibe.edu.cn"", ""qizhiquan@foxmail.com"", ""liujiabin008@126.com"", ""mengfan@cufe.edu.cn""]","[""Yong Shi"", ""Biao Li"", ""Bo Wang"", ""Zhiquan Qi"", ""Jiabin Liu"", ""Fan Meng""]","[""Super Resolution"", ""Low-quality JPG"", ""Recovering details""]","Super Resolution (SR) is a fundamental and important low-level computer vision (CV) task. Different from traditional SR models, this study concentrates on a specific but realistic SR issue: How can we obtain satisfied SR results from compressed JPG (C-JPG) image, which widely exists on the Internet. In general, C-JPG can release storage space while keeping considerable quality in visual. However, further image processing operations, e.g., SR, will suffer from enlarging inner artificial details and result in unacceptable outputs. To address this problem, we propose a novel SR structure with two specifically designed components, as well as a cycle loss. In short, there are mainly three contributions to this paper. First, our research can generate high-qualified SR images for prevalent C-JPG images. Second, we propose a functional sub-model to recover information for C-JPG images, instead of the perspective of noise elimination in traditional SR approaches. Third, we further integrate cycle loss into SR solver to build a hybrid loss function for better SR generation. 
Experiments show that our approach achieves outstanding performance among state-of-the-art methods.",/pdf/4a2d63aeba7e3ac141524624bce9bcd20a9682ca.pdf,ICLR,2020,We solve the specific SR issue of low-quality JPG images by functional sub-models. +SJgdpxHFvH,BkesQWWFDr,1569440000000.0,1577170000000.0,2578,Meta-Learning Initializations for Image Segmentation,"[""seanmhendryx@gmail.com"", ""imaleach@gmail.com"", ""pauldhein@email.arizona.edu"", ""claytonm@email.arizona.edu""]","[""Sean M. Hendryx"", ""Andrew B. Leach"", ""Paul D. Hein"", ""Clayton T. Morrison""]","[""meta-learning"", ""image segmentation""]","While meta-learning approaches that utilize neural network representations have made progress in few-shot image classification, reinforcement learning, and, more recently, image semantic segmentation, the training algorithms and model architectures have become increasingly specialized to the few-shot domain. A natural question that arises is how to develop learning systems that scale from few-shot to many-shot settings while yielding human level performance in both. One scalable potential approach that does not require ensembling many models nor the computational costs of relation networks, is to meta-learn an initialization. In this work, we study first-order meta-learning of initializations for deep neural networks that must produce dense, structured predictions given an arbitrary amount of training data for a new task. 
Our primary contributions include (1), an extension and experimental analysis of first-order model agnostic meta-learning algorithms (including FOMAML and Reptile) to image segmentation, (2) a formalization of the generalization error of episodic meta-learning algorithms, which we leverage to decrease error on unseen tasks, (3) a novel neural network architecture built for parameter efficiency which we call EfficientLab, and (4) an empirical study of how meta-learned initializations compare to ImageNet initializations as the training set size increases. We show that meta-learned initializations for image segmentation smoothly transition from canonical few-shot learning problems to larger datasets, outperforming random and ImageNet-trained initializations. Finally, we show both theoretically and empirically that a key limitation of MAML-type algorithms is that when adapting to new tasks, a single update procedure is used that is not conditioned on the data. We find that our network, with an empirically estimated optimal update procedure yields state of the art results on the FSS-1000 dataset, while only requiring one forward pass through a single model at evaluation time.",/pdf/152432d2753602c281541c89cb837fee2f5cc32c.pdf,ICLR,2020,We show that model agnostic meta-learning extends to the high dimensionality and dense prediction of image segmentation. +jHefDGsorp5,5e-1g_X7jmc,1601310000000.0,1616010000000.0,2246,Molecule Optimization by Explainable Evolution,"[""~Binghong_Chen1"", ""~Tianzhe_Wang1"", ""~Chengtao_Li1"", ""~Hanjun_Dai1"", ""~Le_Song1""]","[""Binghong Chen"", ""Tianzhe Wang"", ""Chengtao Li"", ""Hanjun Dai"", ""Le Song""]","[""Molecule Design"", ""Explainable Model"", ""Evolutionary Algorithm"", ""Reinforcement Learning"", ""Graph Generative Model""]","Optimizing molecules for desired properties is a fundamental yet challenging task in chemistry, material science, and drug discovery. 
This paper develops a novel algorithm for optimizing molecular properties via an Expectation-Maximization (EM) like explainable evolutionary process. The algorithm is designed to mimic human experts in the process of searching for desirable molecules and alternate between two stages: the first stage on explainable local search which identifies rationales, i.e., critical subgraph patterns accounting for desired molecular properties, and the second stage on molecule completion which explores the larger space of molecules containing good rationales. We test our approach against various baselines on a real-world multi-property optimization task where each method is given the same number of queries to the property oracle. We show that our evolution-by-explanation algorithm is 79% better than the best baseline in terms of a generic metric combining aspects such as success rate, novelty, and diversity. Human expert evaluation on optimized molecules shows that 60% of top molecules obtained from our methods are deemed successful. ",/pdf/885e03a6e7ca9e559b96bce0daf001f769f98de4.pdf,ICLR,2021,We propose a novel EM-like evolution-by-explanation algorithm alternating between an explainable graph model and a conditional generative model for molecule optimization. +BygZARVFDH,HklWXU5_PH,1569440000000.0,1577170000000.0,1419,Compositional Visual Generation with Energy Based Models,"[""yilundu@mit.edu"", ""lishuang@mit.edu"", ""mordatch@google.com""]","[""Yilun Du"", ""Shuang Li"", ""Igor Mordatch""]","[""Compositional Generation"", ""Energy Based Model"", ""Compositionality"", ""Generative Models""]","Humans are able to both learn quickly and rapidly adapt their knowledge. One major component is the ability to incrementally combine many simple concepts to accelerate the learning process. We show that energy based models are a promising class of models towards exhibiting these properties by directly combining probability distributions. 
This allows us to combine an arbitrary number of different distributions in a globally coherent manner. We show this compositionality property allows us to define three basic operators, logical conjunction, disjunction, and negation, on different concepts to generate plausible naturalistic images. Furthermore, by applying these abilities, we show that we are able to extrapolate concept combinations, continually combine previously learned concepts, and infer concept properties in a compositional manner.",/pdf/6fe872e38d51e298703c3cc656758345976384d5.pdf,ICLR,2020,"""We present flexible compositional image generation and its applications to continual learning and generalization""?" +RcjRb9pEQ-Q,lHdce11O4v7,1601310000000.0,1614990000000.0,1299,Fine-grained Synthesis of Unrestricted Adversarial Examples,"[""~Omid_Poursaeed2"", ""tj258@cornell.edu"", ""~Yordanos_Abraham_Goshu1"", ""harryyang@fb.com"", ""~Serge_Belongie1"", ""~Ser-Nam_Lim3""]","[""Omid Poursaeed"", ""Tianxing Jiang"", ""Yordanos Abraham Goshu"", ""Harry Yang"", ""Serge Belongie"", ""Ser-Nam Lim""]","[""adversarial examples"", ""unrestricted attacks"", ""generative models"", ""adversarial training"", ""generative adversarial networks""]","We propose a novel approach for generating unrestricted adversarial examples by manipulating fine-grained aspects of image generation. Unlike existing unrestricted attacks that typically hand-craft geometric transformations, we learn stylistic and stochastic modifications leveraging state-of-the-art generative models. This allows us to manipulate an image in a controlled, fine-grained manner without being bounded by a norm threshold. Our approach can be used for targeted and non-targeted unrestricted attacks on classification, semantic segmentation and object detection models. Our attacks can bypass certified defenses, yet our adversarial images look indistinguishable from natural images as verified by human evaluation. 
Moreover, we demonstrate that adversarial training with our examples improves performance of the model on clean images without requiring any modifications to the architecture. We perform experiments on LSUN, CelebA-HQ and COCO-Stuff as high resolution datasets to validate efficacy of our proposed approach. ",/pdf/b983a5b596a74e898b82fee8ed3a7873bb5d9230.pdf,ICLR,2021,"A novel approach for fine-grained generation of unrestricted adversarial examples in classification, segmentation and detection which improves the model's performance on clean images." +S1x8WnA5Ym,HkeIOXlKYX,1538090000000.0,1545360000000.0,1173,Learning Diverse Generations using Determinantal Point Processes,"[""m.elfeki11@gmail.com"", ""coupriec@fb.com"", ""elhoseiny@fb.com""]","[""Mohamed Elfeki"", ""Camille Couprie"", ""Mohamed Elhoseiny""]","[""Generative Adversarial Networks""]","Generative models have proven to be an outstanding tool for representing high-dimensional probability distributions and generating realistic looking images. A fundamental characteristic of generative models is their ability to produce multi-modal outputs. However, while training, they are often susceptible to mode collapse, which means that the model is limited in mapping the input noise to only a few modes of the true data distribution. In this paper, we draw inspiration from Determinantal Point Process (DPP) to devise a generative model that alleviates mode collapse while producing higher quality samples. DPP is an elegant probabilistic measure used to model negative correlations within a subset and hence quantify its diversity. We use DPP kernel to model the diversity in real data as well as in synthetic data. Then, we devise a generation penalty term that encourages the generator to synthesize data with a similar diversity to real data. 
In contrast to previous state-of-the-art generative models that tend to use additional trainable parameters or complex training paradigms, our method does not change the original training scheme. Embedded in an adversarial training and variational autoencoder, our Generative DPP approach shows a consistent resistance to mode-collapse on a wide-variety of synthetic data and natural image datasets including MNIST, CIFAR10, and CelebA, while outperforming state-of-the-art methods for data-efficiency, convergence-time, and generation quality. Our code will be made publicly available.",/pdf/29259a263b7d1a024afaa7a665a73a48932bb7d4.pdf,ICLR,2019,The addition of a diversity criterion inspired from DPP in the GAN objective avoids mode collapse and leads to better generations. +9EsrXMzlFQY,znAhoSS3v9j,1601310000000.0,1613010000000.0,874,Async-RED: A Provably Convergent Asynchronous Block Parallel Stochastic Method using Deep Denoising Priors,"[""~Yu_Sun11"", ""jiaming.liu@wustl.edu"", ""yiran.s@wustl.edu"", ""~Brendt_Wohlberg2"", ""~Ulugbek_Kamilov1""]","[""Yu Sun"", ""Jiaming Liu"", ""Yiran Sun"", ""Brendt Wohlberg"", ""Ulugbek Kamilov""]","[""Regularization by denoising"", ""Computational imaging"", ""asynchronous parallel algorithm"", ""Deep denoising priors""]","Regularization by denoising (RED) is a recently developed framework for solving inverse problems by integrating advanced denoisers as image priors. Recent work has shown its state-of-the-art performance when combined with pre-trained deep denoisers. However, current RED algorithms are inadequate for parallel processing on multicore systems. We address this issue by proposing a new asynchronous RED (Async-RED) algorithm that enables asynchronous parallel processing of data, making it significantly faster than its serial counterparts for large-scale inverse problems. The computational complexity of Async-RED is further reduced by using a random subset of measurements at every iteration. 
We present a complete theoretical analysis of the algorithm by establishing its convergence under explicit assumptions on the data-fidelity and the denoiser. We validate Async-RED on image recovery using pre-trained deep denoisers as priors.",/pdf/42abafb63caa1b6ddc6bda1b8e8337b1c2a9db91.pdf,ICLR,2021,Our work develops a novel deep-regularized asynchronous parallel method with provable convergence guarantees for solving large-scale inverse problems. +BkS3fnl0W,SJN2MngRW,1509110000000.0,1518730000000.0,384,Semi-supervised Outlier Detection using Generative And Adversary Framework,"[""jindong.gu@siemens.com"", ""schubert@dbs.ifi.lmu.de"", ""volker.tresp@siemens.com""]","[""Jindong Gu"", ""Matthias Schubert"", ""Volker Tresp""]","[""Semi-supervised Learning"", ""Generative And Adversary Framework"", ""One-class classification"", ""Outlier detection""]","In a conventional binary/multi-class classification task, the decision boundary is supported by data from two or more classes. However, in one-class classification task, only data from one class are available. To build an robust outlier detector using only data from a positive class, we propose a corrupted GAN(CorGAN), a deep convolutional Generative Adversary Network requiring no convergence during training. In the adversarial process of training CorGAN, the Generator is supposed to generate outlier samples for negative class, and the Discriminator as an one-class classifier is trained to distinguish data from training datasets (i.e. positive class) and generated data from the Generator (i.e. negative class). To improve the performance of the Discriminator (one-class classifier), we also propose a lot of techniques to improve the performance of the model. 
The proposed model outperforms the traditional method PCA + PSVM and the solution based on Autoencoder.",/pdf/389e7afc32f0edf45b8723eb9a78d5f405607c59.pdf,ICLR,2018, +rygFWAEFwS,HJgBUUNuwr,1569440000000.0,1583910000000.0,974,Stochastic Weight Averaging in Parallel: Large-Batch Training That Generalizes Well,"[""vipul_gupta@berkeley.edu"", ""sakle@apple.com"", ""ddecoste@apple.com""]","[""Vipul Gupta"", ""Santiago Akle Serrano"", ""Dennis DeCoste""]","[""Large batch training"", ""Distributed neural network training"", ""Stochastic Weight Averaging""]","We propose Stochastic Weight Averaging in Parallel (SWAP), an algorithm to accelerate DNN training. Our algorithm uses large mini-batches to compute an approximate solution quickly and then refines it by averaging the weights of multiple models computed independently and in parallel. The resulting models generalize equally well as those trained with small mini-batches but are produced in a substantially shorter time. We demonstrate the reduction in training time and the good generalization performance of the resulting models on the computer vision datasets CIFAR10, CIFAR100, and ImageNet.",/pdf/32c8465e34f3ae0a7720b761305d0037ada9cd77.pdf,ICLR,2020,"We propose SWAP, a distributed algorithm for large-batch training of neural networks." +S1MQ6jCcK7,r1xdP5-5FQ,1538090000000.0,1545360000000.0,789,ChoiceNet: Robust Learning by Revealing Output Correlations,"[""sungjoon.s.choi@gmail.com"", ""sanghoon.hong@kakaobrain.com"", ""kyungjae.lee@cpslab.snu.ac.kr"", ""sungbin.lim@kakaobrain.com""]","[""Sungjoon Choi"", ""Sanghoon Hong"", ""Kyungjae Lee"", ""Sungbin Lim""]","[""Robust Deep Learning"", ""weakly supervised learning""]","In this paper, we focus on the supervised learning problem with corrupt training data. We assume that the training dataset is generated from a mixture of a target distribution and other unknown distributions. 
We estimate the quality of each data by revealing the correlation between the generated distribution and the target distribution. To this end, we present a novel framework referred to here as ChoiceNet that can robustly infer the target distribution in the presence of inconsistent data. We demonstrate that the proposed framework is applicable to both classification and regression tasks. Particularly, ChoiceNet is evaluated in comprehensive experiments, where we show that it constantly outperforms existing baseline methods in the handling of noisy data in synthetic regression tasks as well as behavior cloning problems. In the classification tasks, we apply the proposed method to the MNIST and CIFAR-10 datasets and it shows superior performances in terms of robustness to different types of noisy labels.",/pdf/4cd489b01f1010e8423e898e52434e02241a86ee.pdf,ICLR,2019, +#NAME?,ydkeTBUCRLS,1601310000000.0,1614990000000.0,407,Learn Robust Features via Orthogonal Multi-Path,"[""~Kun_Fang1"", ""~Xiaolin_Huang1"", ""~Yingwen_Wu1"", ""~Tao_Li12"", ""jieyang@sjtu.edu.cn""]","[""Kun Fang"", ""Xiaolin Huang"", ""Yingwen Wu"", ""Tao Li"", ""Jie Yang""]","[""adversarial robustness"", ""orthogonal multi-path""]"," It is now widely known that by adversarial attacks, clean images with invisible perturbations can fool deep neural networks. + To defend adversarial attacks, we design a block containing multiple paths to learn robust features and the parameters of these paths are required to be orthogonal with each other. + The so-called Orthogonal Multi-Path (OMP) block could be posed in any layer of a neural network. + Via forward learning and backward correction, one OMP block makes the neural networks learn features that are appropriate for all the paths and hence are expected to be robust. 
With careful design and thorough experiments on e.g., the positions of imposing orthogonality constraint, and the trade-off between the variety and accuracy, + the robustness of the neural networks is significantly improved. + For example, under white-box PGD attack with $l_\infty$ bound ${8}/{255}$ (this is a fierce attack that can make the accuracy of many vanilla neural networks drop to nearly $10\%$ on CIFAR10), VGG16 with the proposed OMP block could keep over $50\%$ accuracy. For black-box attacks, neural networks equipped with an OMP block have accuracy over $80\%$. The performance under both white-box and black-box attacks is much better than the existing state-of-the-art adversarial defenders. ",/pdf/cdb8e5e44b8cddf2929eba48bd44c0c41c3e9775.pdf,ICLR,2021,We propose a novel defence method via embedding orthogonal multi-path into a neural network to enhance the robustness. +IUYthV32lbK,6Fg3mxR8WCG2,1601310000000.0,1614990000000.0,2165,On the Certified Robustness for Ensemble Models and Beyond,"[""~Zhuolin_Yang1"", ""~Linyi_Li1"", ""~Xiaojun_Xu1"", ""~Bhavya_Kailkhura1"", ""~Bo_Li19""]","[""Zhuolin Yang"", ""Linyi Li"", ""Xiaojun Xu"", ""Bhavya Kailkhura"", ""Bo Li""]","[""Adversarial Machine Learning"", ""Model Ensemble"", ""Certified Robustness""]","Recent studies show that deep neural networks (DNN) are vulnerable to adversarial examples, which aim to mislead DNNs to make arbitrarily incorrect predictions. To defend against such attacks, both empirical and theoretical defense approaches have been proposed for a single ML model. In this work, we aim to explore and characterize the robustness conditions for ensemble ML models. We prove that the diversified gradient and large confidence margin are sufficient and necessary conditions for certifiably robust ensemble models under the model-smoothness assumption. We also show that an ensemble model can achieve higher certified robustness than a single base model based on these conditions. 
To our best knowledge, this is the first work providing tight conditions for the ensemble robustness. Inspired by our analysis, we propose the lightweight Diversity Regularized Training (DRT) for ensemble models. We derive the certified robustness of DRT based ensembles such as standard Weighted Ensemble and Max-Margin Ensemble following the sufficient and necessary conditions. Besides, to efficiently calculate the model-smoothness, we leverage adapted randomized model smoothing to obtain the certified robustness for different ensembles in practice. We show that the certified robustness of ensembles, on the other hand, verifies the necessity of DRT. To compare different ensembles, we prove that when the adversarial transferability among base models is high, Max-Margin Ensemble can achieve higher certified robustness than Weighted Ensemble; vice versa. Extensive experiments show that ensemble models trained with DRT can achieve the state-of-the-art certified robustness under various settings. Our work will shed light on future analysis for robust ensemble models. ",/pdf/12b1a88c375bcae0ad3a4e06564be248a5b65651.pdf,ICLR,2021,We analyze the sufficient and necessary conditions on certified ensemble robustness and propose Diversity-Regularized Training (DRT) to boost the certified robustness of ensemble models. +BJlzm64tDH,Syxm4q0UvH,1569440000000.0,1583910000000.0,440,Pretrained Encyclopedia: Weakly Supervised Knowledge-Pretrained Language Model,"[""xwhan@cs.ucsb.edu"", ""jingfeidu@fb.com"", ""william@cs.ucsb.edu"", ""ves@fb.com""]","[""Wenhan Xiong"", ""Jingfei Du"", ""William Yang Wang"", ""Veselin Stoyanov""]",[]," Recent breakthroughs of pretrained language models have shown the effectiveness of self-supervised learning for a wide range of natural language processing (NLP) tasks. 
In addition to standard syntactic and semantic NLP tasks, pretrained models achieve strong improvements on tasks that involve real-world knowledge, suggesting that large-scale language modeling could be an implicit method to capture knowledge. In this work, we further investigate the extent to which pretrained models such as BERT capture knowledge using a zero-shot fact completion task. Moreover, we propose a simple yet effective weakly supervised pretraining objective, which explicitly forces the model to incorporate knowledge about real-world entities. Models trained with our new objective yield significant improvements on the fact completion task. When applied to downstream tasks, our model consistently outperforms BERT on four entity-related question answering datasets (i.e., WebQuestions, TriviaQA, SearchQA and Quasar-T) with an average 2.7 F1 improvements and a standard fine-grained entity typing dataset (i.e., FIGER) with 5.7 accuracy gains.",/pdf/0ffdcc09ff094f333d60c31cf0f41479ee47ecc9.pdf,ICLR,2020, +FcfH5Pskt2G,EvtpyCpt5DN,1601310000000.0,1614990000000.0,919,Clearing the Path for Truly Semantic Representation Learning,"[""~Dominik_Zietlow1"", ""~Michal_Rolinek2"", ""~Georg_Martius1""]","[""Dominik Zietlow"", ""Michal Rolinek"", ""Georg Martius""]","[""Representation Learning"", ""Disentanglement"", ""Unsupervised Learning"", ""Semantic Representations"", ""VAE"", ""Causal Representations"", ""PCA""]","The performance of $\beta$-Variational-Autoencoders ($\beta$-VAEs) and their variants on learning semantically meaningful, disentangled representations is unparalleled. On the other hand, there are theoretical arguments suggesting impossibility of unsupervised disentanglement. In this work, we show that small perturbations of existing datasets hide the convenient correlation structure that is easily exploited by VAE-based architectures. 
To demonstrate this, we construct modified versions of the standard datasets on which (i) the generative factors are perfectly preserved; (ii) each image undergoes a transformation barely visible to the human eye; (iii) the leading disentanglement architectures fail to produce disentangled representations. We intend for these datasets to play a role in separating correlation-based models from those that discover the true causal structure. + +The construction of the modifications is non-trivial and relies on recent progress on mechanistic understanding of $\beta$-VAEs and their connection to PCA, while also providing additional insights that might be of stand-alone interest.",/pdf/2dc700922eb9258d4469fa1eebe771aa3fa53ca5.pdf,ICLR,2021, +HylKJhCcKm,ByeeIy5cYX,1538090000000.0,1545360000000.0,1004,Generalized Capsule Networks with Trainable Routing Procedure,"[""chen478@iu.edu"", ""cw234@iu.edu"", ""tz11@iu.edu"", ""djcran@iu.edu""]","[""Zhenhua Chen"", ""Chuhua Wang"", ""Tiancong Zhao"", ""David Crandall""]","[""Capsule networks"", ""generalization"", ""scalability"", ""adversarial robustness""]","CapsNet (Capsule Network) was first proposed by Sabour et al. (2017) and later another version of CapsNet was proposed by Hinton et al. (2018). CapsNet has been proved effective in modeling spatial features with much fewer parameters. However, the routing procedures (dynamic routing and EM routing) in both papers are not well incorporated into the whole training process, and the optimal number for the routing procedure has to be found manually. We propose Generalized CapsNet (G-CapsNet) to overcome these disadvantages by incorporating the routing procedure into the optimization. We implement two versions of G-CapsNet (fully-connected and convolutional) on CAFFE (Jia et al. (2014)) and evaluate them by testing the accuracy on MNIST & CIFAR10, the robustness to white-box & black-box attack, and the generalization ability on GAN-generated synthetic images. 
We also explore the scalability of G-CapsNet by constructing a relatively deep G-CapsNet. The experiment shows that G-CapsNet has good generalization ability and scalability. ",/pdf/879862ec204ec733901e0ac88e56a36e876b6c8a.pdf,ICLR,2019,A scalable capsule network +BkxadR4KvS,BJxI_gOOwB,1569440000000.0,1577170000000.0,1227,Insights on Visual Representations for Embodied Navigation Tasks,"[""etw@gatech.edu"", ""julian.straub@oculus.com"", ""irfan@gatech.edu"", ""dbatra@gatech.edu"", ""judy@gatech.edu"", ""arimorcos@gmail.com""]","[""Erik Wijmans"", ""Julian Straub"", ""Irfan Essa"", ""Dhruv Batra"", ""Judy Hoffman"", ""Ari Morcos""]",[],"Recent advances in deep reinforcement learning require a large amount of training data and generally result in representations that are often over specialized to the target task. In this work, we study the underlying potential causes for this specialization by measuring the similarity between representations trained on related, but distinct tasks. We use the recently proposed projection weighted Canonical Correlation Analysis (PWCCA) to examine the task dependence of visual representations learned across different embodied navigation tasks. Surprisingly, we find that slight differences in task have no measurable effect on the visual representation for both SqueezeNet and ResNet architectures. We then empirically demonstrate that visual representations learned on one task can be effectively transferred to a different task. Interestingly, we show that if the tasks constrain the agent to spatially disjoint parts of the environment, differences in representation emerge for SqueezeNet models but less-so for ResNets, suggesting that ResNets feature inductive biases which encourage more task-agnostic representations, even in the context of spatially separated tasks. We generalize our analysis to examine permutations of an environment and find, surprisingly, permutations of an environment also do not influence the visual representation. 
Our analysis provides insight on the overfitting of representations in RL and provides suggestions of how to design tasks that induce task-agnostic representations.",/pdf/4686d08d4a84856b339b89eaf05fa2ae78b60a4f.pdf,ICLR,2020, +SyxfEn09Y7,BklLeyEROQ,1538090000000.0,1551440000000.0,1431,G-SGD: Optimizing ReLU Neural Networks in its Positively Scale-Invariant Space,"[""meq@microsoft.com"", ""zhengsx@mail.ustc.edu.cn"", ""huzhang@microsoft.com"", ""wche@microsoft.com"", ""qiwye@microsoft.com"", ""mazm@amt.ac.cn"", ""ynh@ustc.edu.cn"", ""tyliu@microsoft.com""]","[""Qi Meng"", ""Shuxin Zheng"", ""Huishuai Zhang"", ""Wei Chen"", ""Qiwei Ye"", ""Zhi-Ming Ma"", ""Nenghai Yu"", ""Tie-Yan Liu""]","[""optimization"", ""neural network"", ""irreducible positively scale-invariant space"", ""deep learning""]","It is well known that neural networks with rectified linear units (ReLU) activation functions are positively scale-invariant. Conventional algorithms like stochastic gradient descent optimize the neural networks in the vector space of weights, which is, however, not positively scale-invariant. This mismatch may lead to problems during the optimization process. Then, a natural question is: \emph{can we construct a new vector space that is positively scale-invariant and sufficient to represent ReLU neural networks so as to better facilitate the optimization process }? In this paper, we provide our positive answer to this question. First, we conduct a formal study on the positive scaling operators which forms a transformation group, denoted as $\mathcal{G}$. We prove that the value of a path (i.e. the product of the weights along the path) in the neural network is invariant to positive scaling and the value vector of all the paths is sufficient to represent the neural networks under mild conditions. 
Second, we show that one can identify some basis paths out of all the paths and prove that the linear span of their value vectors (denoted as $\mathcal{G}$-space) is an invariant space with lower dimension under the positive scaling group. Finally, we design stochastic gradient descent algorithm in $\mathcal{G}$-space (abbreviated as $\mathcal{G}$-SGD) to optimize the value vector of the basis paths of neural networks with little extra cost by leveraging back-propagation. Our experiments show that $\mathcal{G}$-SGD significantly outperforms the conventional SGD algorithm in optimizing ReLU networks on benchmark datasets. ",/pdf/50f435d28dd581dc7852f89d47857843841ad243.pdf,ICLR,2019, +MBdafA3G9k,cX8VIW2CdTx,1601310000000.0,1614990000000.0,1390,Visual Imitation with Reinforcement Learning using Recurrent Siamese Networks,"[""~Glen_Berseth1"", ""~Florian_Golemo1"", ""~Christopher_Pal1""]","[""Glen Berseth"", ""Florian Golemo"", ""Christopher Pal""]","[""Reinforcement Learning"", ""Imitation learning""]","It would be desirable for a reinforcement learning (RL) based agent to learn behaviour by merely watching a demonstration. However, defining rewards that facilitate this goal within the RL paradigm remains a challenge. Here we address this problem with Siamese networks, trained to compute distances between observed behaviours and an agent's behaviours. We use an RNN-based comparator model to learn such distances in space and time between motion clips while training an RL policy to minimize this distance. Through experimentation, we have also found that the inclusion of multi-task data and an additional image encoding loss helps enforce temporal consistency and improve policy learning. These two components appear to balance reward for matching a specific instance of a behaviour versus that behaviour in general. 
Furthermore, we focus here on a particularly challenging form of this problem where only a single demonstration is provided for a given task -- the one-shot learning setting. We demonstrate our approach on humanoid, dog and raptor agents in 2D and a 3D quadruped and humanoid. In these environments, we show that our method outperforms the state-of-the-art, GAIfO (i.e. GAIL without access to actions) and TCNs.",/pdf/35d531f4acea159158663ecd22f20f614c61fddc.pdf,ICLR,2021,Learning recurrent distance functions between videos to enable imitation learning from a single motion clip. +SJeWHlSYDB,rJlZ-ugKPS,1569440000000.0,1577170000000.0,2274,SPREAD DIVERGENCE,"[""mingtian.zhang.17@ucl.ac.uk"", ""david.barber@ucl.ac.uk"", ""thomas.bird.17@ucl.ac.uk"", ""peter.hayes.15@ucl.ac.uk"", ""r.habib@cs.ucl.ac.uk""]","[""Mingtian Zhang"", ""David Barber"", ""Thomas Bird"", ""Peter Hayes"", ""Raza Habib""]","[""divergence minimization"", ""generative model"", ""variational inference""]","For distributions $p$ and $q$ with different supports, the divergence $\div{p}{q}$ may not exist. We define a spread divergence $\sdiv{p}{q}$ on modified $p$ and $q$ and describe sufficient conditions for the existence of such a divergence. We demonstrate how to maximize the discriminatory power of a given divergence by parameterizing and learning the spread. We also give examples of using a spread divergence to train and improve implicit generative models, including linear models (Independent Components Analysis) and non-linear models (Deep Generative Networks).",/pdf/5dc6531d109da0247305d2e70cce0c493988ed7e.pdf,ICLR,2020,A new divergence family dealing with distributions with different supports for training implicit generative models. 
+2G9u-wu2tXP,1rPNCRgUMG5,1601310000000.0,1614990000000.0,431,Continual learning using hash-routed convolutional neural networks,"[""~Ahmad_Berjaoui1""]","[""Ahmad Berjaoui""]","[""Lifelong learning"", ""continual learning"", ""feature hashing""]","Continual learning could shift the machine learning paradigm from data centric to model centric. A continual learning model needs to scale efficiently to handle semantically different datasets, while avoiding unnecessary growth. We introduce hash-routed convolutional neural networks: a group of convolutional units where data flows dynamically. Feature maps are compared using feature hashing and similar data is routed to the same units. A hash-routed network provides excellent plasticity thanks to its routed nature, while generating stable features through the use of orthogonal feature hashing. Each unit evolves separately and new units can be added (to be used only when necessary). Hash-routed networks achieve excellent performance across a variety of typical continual learning benchmarks without storing raw data and train using only gradient descent. Besides providing a continual learning framework for supervised tasks with encouraging results, our model can be used for unsupervised or reinforcement learning.",/pdf/4ff632eb3b7d179585d79daa9352f523bf7d6516.pdf,ICLR,2021,"We present a scalable continual learning framework composed of individual units, selected using feature hashing." +PQ2Cel-1rJh,FUiBm2Zxs0,1601310000000.0,1614990000000.0,3113,Pea-KD: Parameter-efficient and accurate Knowledge Distillation,"[""~IKHYUN_CHO1"", ""~U_Kang1""]","[""IKHYUN CHO"", ""U Kang""]","[""BERT"", ""Deep Learning"", ""Natural Language Processing"", ""Transformer"", ""Knowledge Distillation"", ""Parameter Sharing""]","How can we efficiently compress a model while maintaining its performance? Knowledge Distillation (KD) is one of the widely known methods for model compression. 
In essence, KD trains a smaller student model based on a larger teacher model and tries to retain the teacher model's level of performance as much as possible. However, the existing KD methods suffer from the following limitations. First, since the student model is small in absolute size, it inherently lacks model complexity. Second, the absence of an initial guide for the student model makes it difficult for the student to imitate the teacher model to its fullest. Conventional KD methods yield low performance due to these limitations. + +In this paper, we propose Pea-KD (Parameter-efficient and accurate Knowledge Distillation), a novel approach to KD. Pea-KD consists of two main parts: Shuffled Parameter Sharing (SPS) and Pretraining with Teacher's Predictions (PTP). Using this combination, we are capable of alleviating the KD's limitations. SPS is a new parameter sharing method that allows greater model complexity for the student model. PTP is a KD-specialized initialization method, which can act as a good initial guide for the student. When combined, this method yields a significant increase in student model's performance. Experiments conducted on different datasets and tasks show that the proposed approach improves the student model's performance by 4.4% on average in four GLUE tasks, outperforming existing KD baselines by significant margins. +",/pdf/73e1fa211b7a950aef7587202346d058d92c3558.pdf,ICLR,2021,It introduces a new Knowledge Distillation method. It improves the performance of the student model with 2 main modules: novel parameter sharing and new pretraining which uses the teacher model's predictions. 
+zdrls6LIX4W,Q9_XVM4xxiR,1601310000000.0,1614990000000.0,2762,A Policy Gradient Algorithm for Learning to Learn in Multiagent Reinforcement Learning,"[""~Dong-Ki_Kim1"", ""~Miao_Liu1"", ""~Matthew_D_Riemer1"", ""~Chuangchuang_Sun1"", ""abdulhai@mit.edu"", ""~Golnaz_Habibi1"", ""slcot@mit.edu"", ""~Gerald_Tesauro1"", ""~JONATHAN_P_HOW1""]","[""Dong-Ki Kim"", ""Miao Liu"", ""Matthew D Riemer"", ""Chuangchuang Sun"", ""Marwa Abdulhai"", ""Golnaz Habibi"", ""Sebastian Lopez-Cot"", ""Gerald Tesauro"", ""JONATHAN P HOW""]","[""Multiagent reinforcement learning"", ""Meta-learning"", ""Non-stationarity""]","A fundamental challenge in multiagent reinforcement learning is to learn beneficial behaviors in a shared environment with other agents that are also simultaneously learning. In particular, each agent perceives the environment as effectively non-stationary due to the changing policies of other agents. Moreover, each agent is itself constantly learning, leading to natural nonstationarity in the distribution of experiences encountered. In this paper, we propose a novel meta-multiagent policy gradient theorem that directly accommodates for the non-stationary policy dynamics inherent to these multiagent settings. This is achieved by modeling our gradient updates to directly consider both an agent’s own non-stationary policy dynamics and the non-stationary policy dynamics of other agents interacting with it in the environment. We find that our theoretically grounded approach provides a general solution to the multiagent learning problem, which inherently combines key aspects of previous state of the art approaches on this topic. 
We test our method on several multiagent benchmarks and demonstrate a more efficient ability to adapt to new agents as they learn than previous related approaches across the spectrum of mixed incentive, competitive, and cooperative environments.",/pdf/58bfb0dbf51fcbc019dc498d8e24889dca9961f1.pdf,ICLR,2021,We introduce a novel meta-multiagent policy gradient theorem based on meta-learning that can adapt quickly to non-stationarity in the policies of other agents in the environment. +r1eowANFvr,BJgHUtPuvS,1569440000000.0,1587730000000.0,1188,Towards Fast Adaptation of Neural Architectures with Meta Learning,"[""liandz@shanghaitech.edu.cn"", ""yzheng3xg@gmail.com"", ""xuyt@shanghaitech.edu.cn"", ""alanlu@tencent.com"", ""goshawklin@tencent.com"", ""masonzhao@tencent.com"", ""jzhuang@uta.edu"", ""gaoshh@shanghaitech.edu.cn""]","[""Dongze Lian"", ""Yin Zheng"", ""Yintao Xu"", ""Yanxiong Lu"", ""Leyu Lin"", ""Peilin Zhao"", ""Junzhou Huang"", ""Shenghua Gao""]","[""Fast adaptation"", ""Meta learning"", ""NAS""]","Recently, Neural Architecture Search (NAS) has been successfully applied to multiple artificial intelligence areas and shows better performance compared with hand-designed networks. However, the existing NAS methods only target a specific task. Most of them usually do well in searching an architecture for single task but are troublesome for multiple datasets or multiple tasks. Generally, the architecture for a new task is either searched from scratch, which is neither efficient nor flexible enough for practical application scenarios, or borrowed from the ones searched on other tasks, which might be not optimal. In order to tackle the transferability of NAS and conduct fast adaptation of neural architectures, we propose a novel Transferable Neural Architecture Search method based on meta-learning in this paper, which is termed as T-NAS. 
T-NAS learns a meta-architecture that is able to adapt to a new task quickly through a few gradient steps, which makes the transferred architecture suitable for the specific task. Extensive experiments show that T-NAS achieves state-of-the-art performance in few-shot learning and comparable performance in supervised learning but with 50x less searching cost, which demonstrates the effectiveness of our method.",/pdf/fd0e3852a23144279c50256f6898b7496a6d93e8.pdf,ICLR,2020,A meta-learning method for fast adaptation of neural architectures. +HyNxRZ9xg,,1478270000000.0,1478270000000.0,171,Cat2Vec: Learning Distributed Representation of Multi-field Categorical Data,"[""ying.wen@cs.ucl.ac.uk"", ""jun.wang@cs.ucl.ac.uk"", ""tychen@apex.sjtu.edu.cn"", ""wnzhang@apex.sjtu.edu.cn""]","[""Ying Wen"", ""Jun Wang"", ""Tianyao Chen"", ""Weinan Zhang""]","[""Unsupervised Learning"", ""Deep learning"", ""Applications""]","This paper presents a method of learning distributed representation for multi-field categorical data, which is a common data format with various applications such as recommender systems, social link prediction, and computational advertising. The success of non-linear models, e.g., factorisation machines, boosted trees, has proved the potential of exploring the interactions among inter-field categories. Inspired by Word2Vec, the distributed representation for natural language, we propose Cat2Vec (categories to vectors) model. In Cat2Vec, a low-dimensional continuous vector is automatically learned for each category in each field. The interactions among inter-field categories are further explored by different neural gates and the most informative ones are selected by pooling layers. In our experiments, with the exploration of the interactions between pairwise categories over layers, the model attains great improvement over state-of-the-art models in a supervised learning task, e.g., click prediction, while capturing the most significant interactions from the data. 
",/pdf/cd760342b596795f95d54247704ea58682d5cde6.pdf,ICLR,2017,an unsupervised pairwise interaction model to learning the distributed representation of multi-field categorical data +65sCF5wmhpv,KExKzVl9Aw_,1601310000000.0,1614990000000.0,1624,Learning to Observe with Reinforcement Learning,"[""~Mehmet_Koseoglu1"", ""ecekundura@gmail.com"", ""~Ayca_Ozcelikkale1""]","[""Mehmet Koseoglu"", ""Ece Kunduracioglu"", ""Ayca Ozcelikkale""]","[""Reinforcement learning"", ""observation strategies"", ""active data collection""]","We consider a decision making problem where an autonomous agent decides on which actions to take based on the observations it collects from the environment. We are interested in revealing the information structure of the observation space illustrating which type of observations are the most important (such as position versus velocity) and the dependence of this on the state of agent (such as at the bottom versus top of a hill). We approach this problem by associating a cost with collecting observations which increases with the accuracy. We adopt a reinforcement learning (RL) framework where the RL agent learns to adjust the accuracy of the observations alongside learning to perform the original task. We consider both the scenario where the accuracy can be adjusted continuously and also the scenario where the agent has to choose between given preset levels, such as taking a sample perfectly or not taking a sample at all. In contrast to the existing work that mostly focuses on sample efficiency during training, our focus is on the behaviour during the actual task. Our results illustrate that the RL agent can learn to use the observation space efficiently and obtain satisfactory performance in the original task while collecting effectively smaller amount of data. By uncovering the relative usefulness of different types of observations and trade-offs within, these results also provide insights for further design of active data acquisition schemes. 
",/pdf/23d11b817cb5f9e40aced34856e7dc9f50fc03f3.pdf,ICLR,2021,We propose a reinforcement learning based active data acqusition framework which reveals the information structure of the observation space demonstrating the type of observations that are the most important during the actual operation of the agent. +pHsHaXAv8m-,cKEOvsMy3Qz,1601310000000.0,1614990000000.0,731,Towards Principled Representation Learning for Entity Alignment,"[""~Lingbing_Guo1"", ""~Zequn_Sun1"", ""mingyangchen@zju.edu.cn"", ""whu@nju.edu.cn"", ""~Huajun_Chen1""]","[""Lingbing Guo"", ""Zequn Sun"", ""Mingyang Chen"", ""Wei Hu"", ""Huajun Chen""]","[""Representation Learning"", ""Knowledge Graph"", ""Entity Alignment"", ""Knowledge Graph Embedding""]","Knowledge graph (KG) representation learning for entity alignment has recently received great attention. Compared with conventional methods, these embedding-based ones are considered to be robuster for highly-heterogeneous and cross-lingual entity alignment scenarios as they do not rely on the quality of machine translation or feature extraction. Despite the significant improvement that has been made, there is little understanding of how the embedding-based entity alignment methods actually work. Most existing methods rest on the foundation that a small number of pre-aligned entities can serve as anchors to connect the embedding spaces of two KGs. But no one investigates the rationality of such foundation. In this paper, we define a typical paradigm abstracted from the existing methods, and analyze how the representation discrepancy between two potentially-aligned entities is implicitly bounded by a predefined margin in the scoring function for embedding learning. However, such a margin cannot guarantee to be tight enough for alignment learning. We mitigate this problem by proposing a new approach that explicitly learns KG-invariant and principled entity representations, meanwhile preserves the original infrastructure of existing methods. 
In this sense, the model not only pursues the closeness of aligned entities on geometric distance, but also aligns the neural ontologies of two KGs to eliminate the discrepancy in feature distribution and underlying ontology knowledge. Our experiments demonstrate consistent and significant improvement in performance against the existing embedding-based entity alignment methods, including several state-of-the-art ones.",/pdf/3300b127d7c97226f47fe6588d36135df3a9e283.pdf,ICLR,2021,A principled approach to learn principled representations for entity alignment. +BJgbzhC5Ym,ryxoHTT5Fm,1538090000000.0,1545360000000.0,1236,NECST: Neural Joint Source-Channel Coding,"[""kechoi@cs.stanford.edu"", ""kedart@stanford.edu"", ""tsachy@stanford.edu"", ""ermon@cs.stanford.edu""]","[""Kristy Choi"", ""Kedar Tatwawadi"", ""Tsachy Weissman"", ""Stefano Ermon""]","[""joint source-channel coding"", ""deep generative models"", ""unsupervised learning""]","For reliable transmission across a noisy communication channel, classical results from information theory show that it is asymptotically optimal to separate out the source and channel coding processes. However, this decomposition can fall short in the finite bit-length regime, as it requires non-trivial tuning of hand-crafted codes and assumes infinite computational power for decoding. In this work, we propose Neural Error Correcting and Source Trimming (NECST) codes to jointly learn the encoding and decoding processes in an end-to-end fashion. By adding noise into the latent codes to simulate the channel during training, we learn to both compress and error-correct given a fixed bit-length and computational budget. We obtain codes that are not only competitive against several capacity-approaching channel codes, but also learn useful robust representations of the data for downstream tasks such as classification. 
Finally, we learn an extremely fast neural decoder, yielding almost an order of magnitude in speedup compared to standard decoding methods based on iterative belief propagation. ",/pdf/7271c99447d7f81eb0d82dd06db84d0dfd07bdb7.pdf,ICLR,2019,jointly learn compression + error correcting codes with deep learning +Utc4Yd1RD_s,EpOnA652w4-,1601310000000.0,1614990000000.0,10,Towards Defending Multiple Adversarial Perturbations via Gated Batch Normalization,"[""~Aishan_Liu1"", ""~Shiyu_Tang1"", ""~Xianglong_Liu2"", ""~Xinyun_Chen1"", ""~Lei_Huang1"", ""~Zhuozhuo_Tu1"", ""~Dawn_Song1"", ""~Dacheng_Tao1""]","[""Aishan Liu"", ""Shiyu Tang"", ""Xianglong Liu"", ""Xinyun Chen"", ""Lei Huang"", ""Zhuozhuo Tu"", ""Dawn Song"", ""Dacheng Tao""]","[""adversarial examples"", ""multiple adversarial peturbation types"", ""adversarial robustness""]","There is now extensive evidence demonstrating that deep neural networks are vulnerable to adversarial examples, motivating the development of defenses against adversarial attacks. However, existing adversarial defenses typically improve model robustness against individual specific perturbation types. Some recent methods improve model robustness against adversarial attacks in multiple $\ell_p$ balls, but their performance against each perturbation type is still far from satisfactory. To better understand this phenomenon, we propose the \emph{multi-domain} hypothesis, stating that different types of adversarial perturbations are drawn from different domains. Guided by the multi-domain hypothesis, we propose~\emph{Gated Batch Normalization (GBN)}, a novel building block for deep neural networks that improves robustness against multiple perturbation types. GBN consists of a gated sub-network and a multi-branch batch normalization (BN) layer, where the gated sub-network separates different perturbation types, and each BN branch is in charge of a single perturbation type and learns domain-specific statistics for input transformation. 
Then, features from different branches are aligned as domain-invariant representations for the subsequent layers. We perform extensive evaluations of our approach on MNIST, CIFAR-10, and Tiny-ImageNet, and in doing so demonstrate that GBN outperforms previous defense proposals against multiple perturbation types, \ie, $\ell_1$, $\ell_2$, and $\ell_{\infty}$ perturbations, by large margins of 10-20\%.",/pdf/0e952681c8686c14af9a92c49e7d8bab4dde2b5a.pdf,ICLR,2021,"To defend multiple adversarial perturbations, we propose the multi-domain hypothesis; guided by that, we propose Gated Batch Normalization (GBN), a novel building block for DNNs that improves robustness against multiple perturbation types." +HJedXaEtvS,HJlOsegwvH,1569440000000.0,1583910000000.0,455,Editable Neural Networks,"[""ant.sinitsin@gmail.com"", ""vsevolod-pl@yandex.ru"", ""alagaster@yandex.ru"", ""sapopov@yandex-team.ru"", ""artem.babenko@phystech.edu""]","[""Anton Sinitsin"", ""Vsevolod Plokhotnyuk"", ""Dmitry Pyrkin"", ""Sergei Popov"", ""Artem Babenko""]","[""editing"", ""editable"", ""meta-learning"", ""maml""]","These days deep neural networks are ubiquitously used in a wide range of tasks, from image classification and machine translation to face identification and self-driving cars. In many applications, a single model error can lead to devastating financial, reputational and even life-threatening consequences. Therefore, it is crucially important to correct model mistakes quickly as they appear. In this work, we investigate the problem of neural network editing - how one can efficiently patch a mistake of the model on a particular sample, without influencing the model behavior on other samples. Namely, we propose Editable Training, a model-agnostic training technique that encourages fast editing of the trained model. 
We empirically demonstrate the effectiveness of this method on large-scale image classification and machine translation tasks.",/pdf/99693462793c1e5fb613755206e3113d362988a8.pdf,ICLR,2020,Training neural networks so you can efficiently patch them later. +84gjULz1t5,ivTcUuR9Js0,1601310000000.0,1616020000000.0,1280,Linear Convergent Decentralized Optimization with Compression,"[""~Xiaorui_Liu1"", ""liyao6@msu.edu"", ""wangron6@msu.edu"", ""~Jiliang_Tang1"", ""myan@msu.edu""]","[""Xiaorui Liu"", ""Yao Li"", ""Rongrong Wang"", ""Jiliang Tang"", ""Ming Yan""]","[""Decentralized Optimization"", ""Communication Compression"", ""Linear Convergence"", ""Heterogeneous data""]","Communication compression has become a key strategy to speed up distributed optimization. However, existing decentralized algorithms with compression mainly focus on compressing DGD-type algorithms. They are unsatisfactory in terms of convergence rate, stability, and the capability to handle heterogeneous data. Motivated by primal-dual algorithms, this paper proposes the first \underline{L}in\underline{EA}r convergent \underline{D}ecentralized algorithm with compression, LEAD. Our theory describes the coupled dynamics of the inexact primal and dual update as well as compression error, and we provide the first consensus error bound in such settings without assuming bounded gradients. 
Experiments on convex problems validate our theoretical analysis, and empirical study on deep neural nets shows that LEAD is applicable to non-convex problems.",/pdf/2c18c598153405d899eb172db4db2e112b25c66f.pdf,ICLR,2021,A Linear Convergent Decentralized Optimization with Communication Compression +p8agn6bmTbr,VTiXRC8fRz,1601310000000.0,1615490000000.0,2269,Usable Information and Evolution of Optimal Representations During Training,"[""~Michael_Kleinman2"", ""~Alessandro_Achille1"", ""~Daksh_Idnani1"", ""~Jonathan_Kao1""]","[""Michael Kleinman"", ""Alessandro Achille"", ""Daksh Idnani"", ""Jonathan Kao""]","[""Usable Information"", ""Representation Learning"", ""Learning Dynamics"", ""Initialization"", ""SGD""]","We introduce a notion of usable information contained in the representation learned by a deep network, and use it to study how optimal representations for the task emerge during training. We show that the implicit regularization coming from training with Stochastic Gradient Descent with a high learning-rate and small batch size plays an important role in learning minimal sufficient representations for the task. In the process of arriving at a minimal sufficient representation, we find that the content of the representation changes dynamically during training. In particular, we find that semantically meaningful but ultimately irrelevant information is encoded in the early transient dynamics of training, before being later discarded. In addition, we evaluate how perturbing the initial part of training impacts the learning dynamics and the resulting representations. 
We show these effects on both perceptual decision-making tasks inspired by neuroscience literature, as well as on standard image classification tasks.",/pdf/ecfb28e9a1edfd9c52876b78d81632b816d662b2.pdf,ICLR,2021, +BkxA5lBFvH,rkgVHyWKwB,1569440000000.0,1577170000000.0,2491,Hope For The Best But Prepare For The Worst: Cautious Adaptation In RL Agents,"[""jessezhang@berkeley.edu"", ""bcheung@berkeley.edu"", ""cbfinn@cs.stanford.edu"", ""dineshjayaraman@berkeley.edu"", ""svlevine@eecs.berkeley.edu""]","[""Jesse Zhang"", ""Brian Cheung"", ""Chelsea Finn"", ""Dinesh Jayaraman"", ""Sergey Levine""]","[""safety"", ""risk"", ""uncertainty"", ""adaptation""]","We study the problem of safe adaptation: given a model trained on a variety of past experiences for some task, can this model learn to perform that task in a new situation while avoiding catastrophic failure? This problem setting occurs frequently in real-world reinforcement learning scenarios such as a vehicle adapting to drive in a new city, or a robotic drone adapting a policy trained only in simulation. While learning without catastrophic failures is exceptionally difficult, prior experience can allow us to learn models that make this much easier. These models might not directly transfer to new settings, but can enable cautious adaptation that is substantially safer than na\""{i}ve adaptation as well as learning from scratch. Building on this intuition, we propose risk-averse domain adaptation (RADA). RADA works in two steps: it first trains probabilistic model-based RL agents in a population of source domains to gain experience and capture epistemic uncertainty about the environment dynamics. Then, when dropped into a new environment, it employs a pessimistic exploration policy, selecting actions that have the best worst-case performance as forecasted by the probabilistic model. 
We show that this simple maximin policy accelerates domain adaptation in a safety-critical driving environment with varying vehicle sizes. We compare our approach against other approaches for adapting to new environments, including meta-reinforcement learning.",/pdf/8e9d18159b301518f6a24ed018e6392e3a2b3d66.pdf,ICLR,2020,Adaptation of an RL agent in a target environment with unknown dynamics is fast and safe when we transfer prior experience in a variety of environments and then select risk-averse actions during adaptation. +ByedzkrKvH,S1lZ2f3dPH,1569440000000.0,1583910000000.0,1585,Double Neural Counterfactual Regret Minimization,"[""ken.lh@antfin.com"", ""hkl163251@antfin.com"", ""yaohua.zsh@antfin.com"", ""yuan.qi@antfin.com"", ""lsong@cc.gatech.edu""]","[""Hui Li"", ""Kailiang Hu"", ""Shaohua Zhang"", ""Yuan Qi"", ""Le Song""]","[""Counterfactual Regret Minimization"", ""Imperfect Information game"", ""Neural Strategy"", ""Deep Learning"", ""Robust Sampling""]","Counterfactual regret minimization (CFR) is a fundamental and effective technique for solving Imperfect Information Games (IIG). However, the original CFR algorithm only works for discrete states and action spaces, and the resulting strategy is maintained as a tabular representation. Such tabular representation limits the method from being directly applied to large games. In this paper, we propose a double neural representation for the IIGs, where one neural network represents the cumulative regret, and the other represents the average strategy. Such neural representations allow us to avoid manual game abstraction and carry out end-to-end optimization. To make the learning efficient, we also developed several novel techniques including a robust sampling method and a mini-batch Monte Carlo Counterfactual Regret Minimization (MCCFR) method, which may be of independent interests. 
Empirically, on games tractable to tabular approaches, neural strategies trained with our algorithm converge comparably to their tabular counterparts, and significantly outperform those based on deep reinforcement learning. On extremely large games with billions of decision nodes, our approach achieved strong performance while using hundreds of times less memory than the tabular CFR. On head-to-head matches of heads-up no-limit Texas hold'em, our neural agent beat the strong agent ABS-CFR by $9.8\pm4.1$ chips per game. It's a successful application of neural CFR in large games. +",/pdf/b8cc78ebf1ae64586fb7ed77793297f0a9b4a816.pdf,ICLR,2020,We proposed a double neural framework to solve large-scale imperfect information game. +S1ANxQW0b,SJWNxQ-RZ,1509140000000.0,1519430000000.0,1110,Maximum a Posteriori Policy Optimisation,"[""abbas.abdolmaleky@gmail.com"", ""springenberg@google.com"", ""heess@google.com"", ""tassa@google.com"", ""munos@google.com""]","[""Abbas Abdolmaleki"", ""Jost Tobias Springenberg"", ""Yuval Tassa"", ""Remi Munos"", ""Nicolas Heess"", ""Martin Riedmiller""]","[""Reinforcement Learning"", ""Variational Inference"", ""Control""]","We introduce a new algorithm for reinforcement learning called Maximum a-posteriori Policy Optimisation (MPO) based on coordinate ascent on a relative-entropy objective. We show that several existing methods can directly be related to our derivation. We develop two off-policy algorithms and demonstrate that they are competitive with the state-of-the-art in deep reinforcement learning. 
In particular, for continuous control, our method outperforms existing methods with respect to sample efficiency, premature convergence and robustness to hyperparameter settings.",/pdf/70aeec67aeaaee17963d1ec28cbaf95bba94fbbb.pdf,ICLR,2018, +S9MPX7ejmv,zzNbxlAODSq,1601310000000.0,1614990000000.0,1858,Approximating Pareto Frontier through Bayesian-optimization-directed Robust Multi-objective Reinforcement Learning,"[""~Xiangkun_He1"", ""~Jianye_HAO1"", ""~Dong_Li10"", ""~Bin_Wang12"", ""~Wulong_Liu1""]","[""Xiangkun He"", ""Jianye HAO"", ""Dong Li"", ""Bin Wang"", ""Wulong Liu""]","[""Reinforcement Learning"", ""Multi\u2013objective Optimization"", ""Adversarial Machine Learning"", ""Bayesian Optimization""]","Many real-word decision or control problems involve multiple conflicting objectives and uncertainties, which requires learned policies are not only Pareto optimal but also robust. In this paper, we proposed a novel algorithm to approximate a representation for robust Pareto frontier through Bayesian-optimization-directed robust multi-objective reinforcement learning (BRMORL). Firstly, environmental uncertainty is modeled as an adversarial agent over the entire space of preferences by incorporating zero-sum game into multi-objective reinforcement learning (MORL). Secondly, a comprehensive metric based on hypervolume and information entropy is presented to evaluate convergence, diversity and evenness of the distribution for Pareto solutions. Thirdly, the agent’s learning process is regarded as a black-box, and the comprehensive metric we proposed is computed after each episode of training, then a Bayesian optimization (BO) algorithm is adopted to guide the agent to evolve towards improving the quality of the approximated Pareto frontier. 
Finally, we demonstrate the effectiveness of proposed approach on challenging multi-objective tasks across four environments, and show our scheme can produce robust policies under environmental uncertainty.",/pdf/e80e60cd3f29bf39759621c57a868095376e59cb.pdf,ICLR,2021,We proposed a novel approach to approximate a representation for robust Pareto frontier through Bayesian-optimization-directed robust multi-objective reinforcement learning. +HCSgyPUfeDj,Kw9V0yjlmW,1601310000000.0,1616000000000.0,1955,Learning and Evaluating Representations for Deep One-Class Classification,"[""~Kihyuk_Sohn1"", ""~Chun-Liang_Li1"", ""~Jinsung_Yoon1"", ""minhojin@google.com"", ""~Tomas_Pfister1""]","[""Kihyuk Sohn"", ""Chun-Liang Li"", ""Jinsung Yoon"", ""Minho Jin"", ""Tomas Pfister""]","[""deep one-class classification"", ""self-supervised learning""]","We present a two-stage framework for deep one-class classification. We first learn self-supervised representations from one-class data, and then build one-class classifiers on learned representations. The framework not only allows to learn better representations, but also permits building one-class classifiers that are faithful to the target task. We argue that classifiers inspired by the statistical perspective in generative or discriminative models are more effective than existing approaches, such as a normality score from a surrogate classifier. We thoroughly evaluate different self-supervised representation learning algorithms under the proposed framework for one-class classification. Moreover, we present a novel distribution-augmented contrastive learning that extends training distributions via data augmentation to obstruct the uniformity of contrastive representations. In experiments, we demonstrate state-of-the-art performance on visual domain one-class classification benchmarks, including novelty and anomaly detection. 
Finally, we present visual explanations, confirming that the decision-making process of deep one-class classifiers is intuitive to humans. The code is available at https://github.com/google-research/deep_representation_one_class. +",/pdf/8d26ea264a20bf96973da687dd1e99b4054b6388.pdf,ICLR,2021,"We present a two-stage framework for deep one-class classification, composed of state-of-the-art self-supervised representation learning followed by generative or discriminative one-class classifiers." +B1xMEerYvB,Syx1mUgYwB,1569440000000.0,1583910000000.0,2240,Smooth markets: A basic mechanism for organizing gradient-based learners,"[""dbalduzzi@google.com"", ""lejlot@google.com"", ""edwardhughes@google.com"", ""jzl@google.com"", ""imgemp@google.com"", ""twa@google.com"", ""georgios.piliouras@gmail.com"", ""thore@google.com""]","[""David Balduzzi"", ""Wojciech M. Czarnecki"", ""Tom Anthony"", ""Ian Gemp"", ""Edward Hughes"", ""Joel Leibo"", ""Georgios Piliouras"", ""Thore Graepel""]","[""game theory"", ""optimization"", ""gradient descent"", ""adversarial learning""]","With the success of modern machine learning, it is becoming increasingly important to understand and control how learning algorithms interact. Unfortunately, negative results from game theory show there is little hope of understanding or controlling general n-player games. We therefore introduce smooth markets (SM-games), a class of n-player games with pairwise zero sum interactions. SM-games codify a common design pattern in machine learning that includes some GANs, adversarial training, and other recent algorithms. We show that SM-games are amenable to analysis and optimization using first-order methods.",/pdf/ab64e52dccd7ce6a4df0e4bebb7b3d342e651c6f.pdf,ICLR,2020,We introduce a class of n-player games suited to gradient-based methods. 
+BygIV2CcKm,Byx2L5n9tQ,1538090000000.0,1545360000000.0,1452,Learning to Augment Influential Data,"[""iamdh@kaist.ac.kr"", ""cd_yoo@kaist.ac.kr""]","[""Donghoon Lee"", ""Chang D. Yoo""]","[""data augmentation"", ""influence function"", ""generative adversarial network""]","Data augmentation is a technique to reduce overfitting and to improve generalization by increasing the number of labeled data samples by performing label preserving transformations; however, it is currently conducted in a trial and error manner. A composition of predefined transformations, such as rotation, scaling and cropping, is performed on training samples, and its effect on performance over test samples can only be empirically evaluated and cannot be predicted. This paper considers an influence function which predicts how generalization is affected by a particular augmented training sample in terms of validation loss. The influence function provides an approximation of the change in validation loss without comparing the performance which includes and excludes the sample in the training process. A differentiable augmentation model that generalizes the conventional composition of predefined transformations is also proposed. The differentiable augmentation model and reformulation of the influence function allow the parameters of the augmented model to be directly updated by backpropagation to minimize the validation loss. 
The experimental results show that the proposed method provides better generalization over conventional data augmentation methods.",/pdf/ab1fbda481fb0c0e34d9a4d54a3802796ff7c9db.pdf,ICLR,2019, +rJgqjREtvS,HylGX7K_DS,1569440000000.0,1577170000000.0,1330,CRNet: Image Super-Resolution Using A Convolutional Sparse Coding Inspired Network,"[""zmlhome@whu.edu.cn"", ""liuzhou@whu.edu.cn"", ""jingwei_he@whu.edu.cn"", ""ly.wd@whu.edu.cn""]","[""Menglei Zhang"", ""Zhou Liu"", ""Jingwei He"", ""Lei Yu""]","[""Convolutional sparse coding"", ""LISTA"", ""image super-resolution""]","Convolutional Sparse Coding (CSC) has been attracting more and more attention in recent years, for making full use of image global correlation to improve performance on various computer vision applications. However, very few studies focus on solving CSC based image Super-Resolution (SR) problem. As a consequence, there is no significant progress in this area over a period of time. In this paper, we exploit the natural connection between CSC and Convolutional Neural Networks (CNN) to address CSC based image SR. Specifically, Convolutional Iterative Soft Thresholding Algorithm (CISTA) is introduced to solve CSC problem and it can be implemented using CNN architectures. Then we develop a novel CSC based SR framework analogy to the traditional SC based SR methods. Two models inspired by this framework are proposed for pre-/post-upsampling SR, respectively. 
Compared with recent state-of-the-art SR methods, both of our proposed models show superior performance in terms of both quantitative and qualitative measurements.",/pdf/db40f2cc1022aebe4c17b98bfc1a02f5e50ccdef.pdf,ICLR,2020, +Hkl_sAVtwr,S1euifK_DS,1569440000000.0,1577170000000.0,1324,Compressed Sensing with Deep Image Prior and Learned Regularization,"[""davemvanveen@gmail.com"", ""ajiljalal@utexas.edu"", ""soltanol@usc.edu"", ""ecprice@cs.utexas.edu"", ""sriram@austin.utexas.edu"", ""dimakis@austin.utexas.edu""]","[""Dave Van Veen"", ""Ajil Jalal"", ""Mahdi Soltanolkotabi"", ""Eric Price"", ""Sriram Vishwanath"", ""Alexandros G. Dimakis""]","[""compressed sensing"", ""sparsity"", ""inverse problems""]","We propose a novel method for compressed sensing recovery using +untrained deep generative models. Our method is based on the recently +proposed Deep Image Prior (DIP), wherein the convolutional weights of +the network are optimized to match the observed measurements. We show +that this approach can be applied to solve any differentiable linear inverse +problem, outperforming previous unlearned methods. Unlike various learned approaches based on generative models, our method does not require pre-training over large datasets. We further introduce a novel learned regularization technique, which incorporates prior information on the network weights. This reduces reconstruction error, especially for noisy measurements. Finally we prove that, using the DIP optimization approach, moderately overparameterized single-layer networks trained can perfectly fit any signal despite the nonconvex nature of the fitting problem. This theoretical result provides justification for early stopping.",/pdf/960068efdade58de64b1b641bcccfdba53ac168b.pdf,ICLR,2020,Compressed sensing methods with untrained networks and theoretical guarantees